Organizations need a powerful infrastructure to realize the full value of their data. The purpose of this infrastructure is to organize data, ensure its quality, manage metadata and create a central catalog where the organization’s data can be queried. This infrastructure, called the data foundation, enables organizations to have clean, organized, and easily accessible data for better decision-making and business insights.
Data is the new oil.
—Clive Robert Humby OBE, Mathematician
Humby brought awareness to big data by declaring it the “new oil.” This metaphor set the stage for data-driven innovation, AI/ML, and generative AI. Many organizations began storing structured and unstructured data at scale—sometimes obsessively. “We might need this someday” was (and still is) an oft-repeated mantra. Organizations created indiscriminate collections of data stored in file systems, databases, data warehouses, and data lakes.
Data is the new milk: you need to use it quickly; otherwise it goes bad.
—Emily Gorcenski, Data Scientist
Unfortunately, data stores often mimic flea markets: you can find many treasures there if you know what you are looking for, but you can also spend a lot of money on worthless things. Data collected without a purpose or specific use case is quickly viewed with skepticism by consumers, who perceive it as a second-rate product. The origin is unclear, the quality is uncertain, and the documentation is missing. This problem often arises when the data is managed not by its original producer but by a separate team that lacks sufficient knowledge of the data’s origin, quality, and meaning.
In these cases, the data foundation is not as strong as it should be from technical and organizational perspectives. That is a problem.
It generates a lot of extra work. In my experience (at least at the companies I have worked at), up to 60% of data scientists’ time is spent organizing, cleaning, and reformatting data instead of solving business problems.
Additionally, your stored data may or may not comply with the data protection regulations of your country. Organizations must know these regulations and be able to prove their compliance. As an IT manager, I once received a seven-figure penalty notice from the data protection authorities. It was triggered by an employee’s report that we were in breach of data protection, which, thank goodness, was not the case. The authority nevertheless imposed the fine because it found that we hadn’t clearly documented why we were storing certain data and for how long. Fortunately, we were able to refute the allegation, but having to deal with it in the first place was a lot of unnecessary and avoidable work.
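To make the documentation point concrete, here is a minimal Python sketch of an inventory check that flags datasets with no documented purpose or retention period. The `DatasetRecord` fields, the dataset names, and the `compliance_gaps` helper are all hypothetical and only illustrate the idea; what a regulator actually requires depends on your jurisdiction.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DatasetRecord:
    # Hypothetical inventory entry; the fields are illustrative assumptions.
    name: str
    owner: str
    purpose: Optional[str]         # why the data is stored
    retention_days: Optional[int]  # how long it may be kept
    collected_on: date


def compliance_gaps(records: List[DatasetRecord], today: date) -> List[Tuple[str, str]]:
    """Return (dataset name, issue) pairs for missing or exceeded documentation."""
    gaps = []
    for record in records:
        if not record.purpose:
            gaps.append((record.name, "no documented purpose"))
        if record.retention_days is None:
            gaps.append((record.name, "no documented retention period"))
        elif record.collected_on + timedelta(days=record.retention_days) < today:
            gaps.append((record.name, "retention period exceeded"))
    return gaps


inventory = [
    DatasetRecord("crm_contacts", "sales", "customer support", 365, date(2024, 1, 1)),
    DatasetRecord("clickstream_raw", "web", None, None, date(2023, 1, 1)),
    DatasetRecord("job_applications", "hr", "recruiting", 180, date(2023, 1, 1)),
]

for name, issue in compliance_gaps(inventory, today=date(2024, 6, 1)):
    print(f"{name}: {issue}")
```

A report like this, run regularly, is one way to have the answer ready before anyone asks why a dataset exists and how long it will be kept.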
Data quality is particularly important with generative AI. Off-the-shelf foundation models produce generic output and do not create a competitive advantage on their own, because your competitors are likely using the same models and getting the same results. You have to train or customize the models with your own data, but doing so with low-quality data can produce poor results or reinforce existing biases in the model.
These data foundation issues are often underestimated and overlooked by managers for several reasons:
First, most managers and employees lack data literacy. Gartner defines data literacy “as the ability to read, write and communicate data in context, including an understanding of data sources and constructs, analytical methods and techniques applied, and the ability to describe the use case, application, and resulting value.” Poor data literacy is ranked as the second-biggest internal roadblock to the success of the CDO’s office, according to the Gartner Annual Chief Data Officer Survey.
Second, organizations rarely have processes in place to regularly assess and monitor the probability and impact of the risks that come with storing and using data.
Third, managers rarely have a data inventory overview they can understand. If a data inventory exists, it is usually written for data scientists and filled with highly specific technical information.
Do you know the state, risk, and value of the data in your company? And if not, who could provide you with an evaluation at the push of a button?
A strong data foundation consists of four dimensions:
- Strategy: Define a clear data strategy that follows your business strategy and supports strategic initiatives. Avoid being too technical; the strategy is intended to provide direction, not detailed instructions. Effective data strategies consist of clear and concise principles that describe how data is handled technically and organizationally. Some organizations, like the German property website Scout24, call it a data manifesto.
- Culture: A hefty number (69%) of CDOs spend most of their time on data-driven culture initiatives, and 55% view the lack of a data-driven culture as a top challenge to meeting business objectives. My colleague Ishit Vachhrajani has written a highly recommended e-book about this topic.
- Organization: Define clear business domain-oriented responsibilities for your analytical data. In central data teams, this responsibility is often weakly defined. These teams did not generate the data; they extracted it from transactional applications and now do their best to manage it for other units in the company.
  I advise transferring control of analytical data from the central data team to the organizational units that generate this data with their applications. This practice is called an organizational data mesh. These teams store data based on specific use cases and business issues that align with the needs of internal and external customers. The responsibility for the data is thus transferred organizationally to the producers in a distributed approach. Technically, they can store the data centrally in a data lake or distribute it in a data mesh. AWS offers services to build both of these modern data architectures.
  Because competence and control go hand in hand, you need to invest in your staff’s data literacy. Data producers are often competent in handling transactional data but lack analytical skills. AWS can help you with data analytics training.
  Furthermore, have appropriate access policies in place. Not everyone needs access to all data by default, but everyone should be able to discover available data in a data catalog and get access via APIs if needed. AWS Lake Formation easily creates secure data lakes, making data available for wide-ranging analytics. Use Amazon DataZone to discover and share data at scale across organizational boundaries with governance and access controls.
- Technology: A one-size-fits-all solution might not be the best choice for a strong data foundation that has to support different analytical use cases, especially when they are the responsibilities of different organizational units. I recommend applying a best-of-breed approach to use the best tool for each context and use case.
  These tools must be well-integrated and aligned with your overall tech strategy from an architectural point of view. AWS provides a comprehensive set of services to store and query, integrate, catalog, govern, and act on data. With these services, organizations can build centralized or distributed data architectures at scale. I generally recommend accelerating your cloud transformation and realizing the full potential of the AWS Cloud.
  It is important to apply modern and proven software development practices such as versioning, CI/CD, and automated testing to develop and operate analytical data systems. This increases productivity and quality while reducing development times and improving the traceability of changes.
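As one illustration of automated testing for analytical data, the following Python sketch expresses a few data quality rules as a plain function that a CI/CD pipeline could run on every change. The `Order` record, its fields, and the specific rules are assumptions made for this example, not a prescribed rule set.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Order:
    # Hypothetical analytical record; field names are illustrative only.
    order_id: str
    customer_id: str
    amount: float
    order_date: date


def check_quality(orders: List[Order], today: date) -> List[str]:
    """Return a human-readable message for every rule violation found."""
    problems = []
    seen_ids = set()
    for order in orders:
        if order.order_id in seen_ids:
            problems.append(f"duplicate order_id: {order.order_id}")
        seen_ids.add(order.order_id)
        if not order.customer_id:
            problems.append(f"missing customer_id in order {order.order_id}")
        if order.amount < 0:
            problems.append(f"negative amount in order {order.order_id}")
        if order.order_date > today:
            problems.append(f"order {order.order_id} is dated in the future")
    return problems


batch = [
    Order("o1", "c1", 10.0, date(2024, 1, 1)),
    Order("o1", "", -5.0, date(2024, 2, 1)),  # deliberately breaks every rule
]

for message in check_quality(batch, today=date(2024, 1, 31)):
    print(message)
```

Keeping rules like these in version control next to the pipeline code means every change to them is reviewed and traceable, and a failing check can stop a faulty batch before consumers see it.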
Generative AI can make a valuable contribution to a future-proof data foundation. Large Language Models (LLMs) like the Amazon Titan models can assist in profiling your data, extracting and enriching metadata, maintaining your data catalog, and enhancing search with natural language. However, as with all generative AI applications, you still need to critically review the AI’s results and suggestions (e.g., is the generated metadata correct?).
Data and data infrastructures may seem complicated and confusing, but they can be used clearly and securely. Your organization’s data creates many opportunities; you just need to use them.
Data Is the New Wine
If you process, store, and refine data properly, you can achieve amazing results that get even better over time. If you don’t handle it carefully, it quickly loses quality and becomes useless.
What are your experiences with data foundations? I would be interested in hearing about some of them.
How to Build Data Capabilities, Ishit Vachhrajani
How to Create a Data-Driven Culture, Ishit Vachhrajani
Unmasking Your Organization’s Data Problem, Joe Chung