By Yogi Schulz
Yogi Schulz has more than 40 years of information technology experience in various industries. His specialties include IT strategy, web strategy, and systems project management. His forthcoming book, co-authored by Jocelyn Schulz Lapointe, is “A Project Sponsor’s Guide for Projects: Managing Risk and Improving Performance.”
If your organization builds a data lakehouse, will business end-users come?
Unfortunately, some chief information officers (CIOs) ultimately responsible for data lakehouses forget they’re not working with Kevin Costner on a sequel to the Field of Dreams movie. Instead, they are sucked into sponsoring an enterprise data lakehouse project by their IT staff.
First, some technical terms:
A data warehouse is a core component of a business intelligence system. Also known as an enterprise data warehouse, it is a reporting and data analysis system.
A data lakehouse combines the low operating cost of a data lake — a repository of data stored in its original structure and format — with a data warehouse’s data management and structural features on a single platform.
CIOs are genuinely shocked when almost no one cares or wants to come and use the shiny new data lakehouse for business intelligence (BI) applications. They are even more astounded when the organization complains about wasted money. These CIOs expected the organization to sing their praises for an initiative meant to improve data integration, accessibility, and analytics.
What could possibly have gone wrong?
IT sponsorship vs. business sponsorship
When a well-intentioned CIO sponsors a data lakehouse project, the project typically will operate without the following:
A data lakehouse project dominated by IT leadership will lose momentum as development costs climb, with no valuable end-user deliverables such as reports and charts ever being produced. Eventually, the project is cancelled, and the reputation of the IT leadership takes a hit.
A superior approach is to build BI applications with business sponsorship supported by IT leadership. Now the priority is to address specific business problems or priorities, not IT’s assumptions about business data and requirements.
The stakeholders understand that the underlying data lakehouse is critical supporting infrastructure. However, that infrastructure does not dominate the project.
Technology focus or business benefit focus?
A data lakehouse project dominated by IT staff will tend to use the latest technology for developing and operating a data lakehouse, data lake, or data warehouse.
This focus occurs because the staff:
A dramatically cheaper approach to building BI applications is to leave as much of the data as possible in the operational data store (ODS) where it resides. You only need to copy and transform data to a data lakehouse if the ODS structure is seriously unworkable in a BI context. This approach leaves more project budget to develop BI reports and charts, which are the desired business benefits.
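As a minimal sketch of what leaving data in place can look like, the example below defines a reporting view directly over an operational table instead of copying rows into a separate store. The in-memory SQLite database, the orders table, and the sales_by_region view are all hypothetical stand-ins chosen for illustration; a real ODS will have its own engine, schema, and performance constraints.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the ODS.
ods = sqlite3.connect(":memory:")
ods.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
ods.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("West", 120.0), ("West", 80.0), ("East", 50.0)],
)

# The view reshapes operational data for reporting without copying any rows.
ods.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total_sales "
    "FROM orders GROUP BY region"
)

# A BI report reads from the view, so it always reflects current ODS data.
report = ods.execute(
    "SELECT region, total_sales FROM sales_by_region ORDER BY region"
).fetchall()
print(report)
```

If a view like this is fast enough and the structure is workable, nothing needs to be copied or transformed for that report. Copying becomes worth its cost only when such queries prove seriously unworkable against the operational structure.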
Simple data sources vs. valuable data sources
A data lakehouse project dominated by IT staff will tend to import simple internal data sources into the data lakehouse, because the development effort is low. Also, the IT staff is typically unaware of the full range of useful external data sources.
A superior approach to building BI applications is collaborating with business analysts to rank data sources in decreasing order of business value. Then add the internal or external data sources to the BI environment one at a time, as new releases.
You should only add another data source once most of the previous release’s BI reports and charts have been completed. This approach minimizes time to value, ensures the most business value is achieved, and maintains stakeholder support for the BI project.
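The ranking and release gating described above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the source names, the business value scores (which in practice would come from the business analysts), and the 80 percent threshold used to interpret "most" of the previous release's reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str            # hypothetical source name
    business_value: int  # score agreed with business analysts; higher is better
    external: bool       # True if the data comes from outside the organization


def plan_releases(sources):
    """Return (release_number, source) pairs, highest business value first."""
    ranked = sorted(sources, key=lambda s: s.business_value, reverse=True)
    return list(enumerate(ranked, start=1))


def may_add_next_source(reports_done, reports_planned, threshold=0.8):
    """Allow the next release only once most planned reports are complete.

    The 0.8 threshold is an assumed reading of "most", not a fixed rule.
    """
    if reports_planned == 0:
        return True
    return reports_done / reports_planned >= threshold


candidates = [
    DataSource("general_ledger", business_value=5, external=False),
    DataSource("market_prices", business_value=9, external=True),
    DataSource("crm_accounts", business_value=7, external=False),
]

for release, source in plan_releases(candidates):
    print(release, source.name)
```

The point of the gate is discipline rather than arithmetic: a new source enters the BI environment only after the previous one has produced visible reports and charts.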
Advantages of minimal architecture
Domineering IT architects will be prone to design a data lakehouse using an idealized framework. The resulting architecture is often too elaborate to understand easily, challenging to load, and expensive to maintain.
A superior approach to building a data lakehouse environment is carefully balancing trade-offs among important design goals. These goals include querying performance, development complexity, and operating and maintenance costs. Another goal is to minimize the amount of data copied and transformed from operational data stores.
Every design idea that improves query performance, even if it adds complexity to the data lakehouse load, is worth implementing. Allowing idealized frameworks, though widely admired, to dominate the design is always a bad idea.
Data quality = business value
A data lakehouse project sponsored by the CIO will gravitate toward data quantity for the data lakehouse, because the team simply does not know which data sources are most helpful.
Data quantity, however, can obscure data quality. Poor quality will slow or inhibit the acceptance of the lakehouse as a functional BI environment, no matter how much data is there. This problem will also impede the development of enterprise and departmental BI applications.
Poor data quality first manifests itself through several IT technical issues. It hinders data integration from multiple sources, creates summation errors, and causes software crashes and system performance problems.
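Two of these technical issues are easy to illustrate. The sketch below uses hypothetical invoice records to show how a row loaded twice inflates a total (a summation error) and how a key with no match in another source blocks integration. The field names and values are invented for the example.

```python
from collections import Counter


def find_duplicate_keys(rows, key):
    """Return key values that occur more than once; these inflate sums."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def find_unmatched_keys(rows, key, reference_keys):
    """Return key values missing from the other source; these break joins."""
    return sorted({row[key] for row in rows} - set(reference_keys))


# Hypothetical invoice data with two deliberate quality problems.
invoices = [
    {"invoice_id": 1, "customer_id": "C1", "amount": 100},
    {"invoice_id": 2, "customer_id": "C2", "amount": 250},
    {"invoice_id": 2, "customer_id": "C2", "amount": 250},  # loaded twice
    {"invoice_id": 3, "customer_id": "C9", "amount": 75},   # unknown customer
]
known_customers = {"C1", "C2"}

# Summing raw rows counts the duplicated invoice twice.
naive_total = sum(row["amount"] for row in invoices)

# Keeping one row per invoice_id removes the double count.
deduped = {row["invoice_id"]: row for row in invoices}
clean_total = sum(row["amount"] for row in deduped.values())

print(find_duplicate_keys(invoices, "invoice_id"))
print(find_unmatched_keys(invoices, "customer_id", known_customers))
print(naive_total, clean_total)
```

Checks like these are cheap to run before a source is loaded, and they give business stakeholders concrete evidence of where data quality work is needed.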
Poor data quality also leads to these business issues:
A data lakehouse project sponsored by the CIO may be unable to address these data quality shortcomings. The project will fail, because the end-user-visible deliverables are sparse and not helpful.
A superior approach to building BI applications is prioritizing data sources for inclusion in the BI project based on business value. Also, expect data quality challenges, and allocate business resources to meet them.
Finally, rank data sources based on quality, which will reduce time to value. This approach ensures that the BI reports and charts are accurate and will build confidence in the BI applications.
CIOs should quit listening to their ambitious techies and instead champion building BI applications with business sponsorship, supported by IT leadership. That is how to ensure that business end-users will come to a data lakehouse.