Insights Blog

Data Lake Challenges: Is Data Mesh the Answer?

Can data mesh solve the self-serve analytics challenge?

It’s good enough for Netflix and Exasol customer Zalando, but is data mesh architecture the right approach for your organization and its data democratization journey? Exasol CTO Mathias Golombek investigates:

– What data mesh architecture is
– Why you might implement a data mesh
– The pros and cons of data mesh
– The pioneers of data mesh architecture

In my recent blog series, I delved into one of 2021’s hottest data topics – data democratization – exploring how it can fit into a business’ overarching data strategy along with some practical advice on how to implement data democratization in your own organization. 

For today’s follow-up, I’m introducing another contemporary data concept – the data mesh. I’ll explore the link between data democratization and data mesh as a means to connect siloed data and create a self-service data infrastructure that makes data highly available and easily discoverable for the people who need it. 

To be clear, I’m not advocating data mesh as a silver bullet to all the issues people experience with data lakes. It’s a concept that works for some, but not everyone. Ultimately, you’ll need to make up your own mind.

So, let’s get started.

What is a Data Lake?

A data lake is a centralized repository that stores raw data until it is selected and organized as needed. Every data set in the data lake receives a unique identifier and metadata tags. The primary purpose of this is to enable quicker data access. Unlike typical databases and data warehouses, data lakes can store all data types (including images, video, and audio).
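To make the identifier-plus-tags idea concrete, here's a minimal sketch in Python (the class and method names are illustrative, not any real product's API): a toy catalog that assigns each ingested data set a unique ID and a set of metadata tags, so consumers can find data by tag instead of scanning the whole store.

```python
import uuid


class DataLakeCatalog:
    """Toy data lake catalog: every data set gets a unique ID and metadata tags."""

    def __init__(self):
        # Maps unique identifier -> raw data plus its metadata tags.
        self._entries = {}

    def ingest(self, raw_data, tags):
        """Store raw data of any type and return its unique identifier."""
        dataset_id = str(uuid.uuid4())
        self._entries[dataset_id] = {"data": raw_data, "tags": set(tags)}
        return dataset_id

    def find_by_tag(self, tag):
        """Quicker access: look up data set IDs by metadata tag."""
        return [ds_id for ds_id, entry in self._entries.items()
                if tag in entry["tags"]]


catalog = DataLakeCatalog()
img_id = catalog.ingest(b"\x89PNG...", tags=["image", "marketing"])
log_id = catalog.ingest('{"event": "click"}', tags=["clickstream", "marketing"])

print(catalog.find_by_tag("marketing"))  # IDs of both marketing data sets
```

Note that the tags are the only thing making the raw bytes discoverable here, which foreshadows the metadata management challenge discussed below.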

Data Lake Challenges

Data lakes have gained popularity due to their ability to handle large volumes of data. This is essential for big data analytics and machine learning, but implementing and managing a data lake comes with four unique challenges.

Scaling

After the initial deployment, a data lake will go through several stages of development. As the data lake grows, it becomes more difficult to find specific data and to manage permissions for users to access that data. A data lake must be paired with scalable data architecture.

Training

Many data engineers lack experience with data lakes and may not have the knowledge to maintain them long term. Training engineers is time-consuming and expensive, and experienced data lake engineers are in high demand and short supply.

Unstructured and semi-structured data

Unstructured and semi-structured data (such as images, videos, text, and audio files) could present a challenge because they are not easily interpreted or stored logically. Therefore, it’s important to know an organization’s requirements and goals, so that the data ingestion and storage pipelines can be appropriately designed and implemented.

Metadata management

Metadata gives data sets context and plays an important role in making data comprehensible and usable in applications. However, data lakes don’t categorize or apply definitions to data. Because data lakes are typically loaded with raw data, organizations may neglect to include validation steps or to apply organizational data standards, making the stored data more difficult to access and organize.
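One way to avoid that neglect is to enforce metadata standards at ingestion time. Here's a small, illustrative sketch (the required field names are invented for this example, not an organizational standard): a validation step that rejects any data set whose metadata is missing required fields before it ever lands in the lake.

```python
# Hypothetical organizational standard: every data set must declare these fields.
REQUIRED_METADATA = {"owner", "description", "source_system"}


def validate_metadata(metadata):
    """Return the sorted list of required fields missing from the metadata."""
    return sorted(REQUIRED_METADATA - metadata.keys())


def ingest(metadata, lake):
    """Only load data sets whose metadata meets the organizational standard."""
    missing = validate_metadata(metadata)
    if missing:
        print(f"rejected: missing metadata fields {missing}")
        return False
    lake.append(metadata)
    return True


lake = []
ingest({"owner": "sales", "description": "Q3 orders", "source_system": "crm"}, lake)
ingest({"owner": "sales"}, lake)  # rejected: incomplete metadata
```

The point isn't the specific fields; it's that the check happens at the point of ingestion, while the producer's context is still available, rather than months later when a consumer goes digging.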

These data lake challenges have frustrated organizations for too long. It’s time for a solution, and data mesh architecture may be the answer.

What is Data Mesh Architecture?

A data mesh architecture is a decentralized approach to data management that organizes data by specific business domains such as marketing, sales, and customer service. This approach grants more ownership to the producers of a given dataset who better understand the domain data, allowing them to set governance policies that focus on documentation, quality, and access. This leads to self-service use across the organization, eliminating operational bottlenecks associated with centralized systems.

Ananth Packkildurai’s article “Data Mesh Simplified” contains a great analogy for the sad state of data infrastructure in many organizations. He likens the modern data generation process to writing a dictionary without any definitions, shuffling the words up randomly, and then hiring expensive analysts to try and make sense of it all. While this analogy certainly doesn’t apply to every organization, it definitely resonates – and is at the core of why the data mesh principle has gained such a following over the last few years.

To write about data mesh and not acknowledge the ground-breaking work of its creator, ThoughtWorks consultant Zhamak Dehghani, would be unforgivable. Her papers: How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh and Data Mesh Principles and Logical Architecture have become required reading on the topic and I urge you to check them out, if you haven’t already.

Why Implement a Data Mesh?

Prevailing data mesh theory argues that data platforms based on traditional data warehouse or data lake models share common failure modes and don’t scale well. Instead of centralized lakes or warehouses, data mesh advocates a shift to a more decentralized and distributed architecture that fuels a self-serve data infrastructure and treats data as a self-contained product.

As your data lakes grow, so too does the complexity of the data management involved. In a traditional lake architecture, you’ve typically got producers of data who generate it and send it into the data lake. However, this presents another data lake challenge: the data consumers down the line don’t necessarily have the same domain knowledge as the data producer and therefore struggle to understand it. The consumers then have to go back to the data producer to try and understand the data. Depending on whether the producer is a person or a machine, the required level of human domain expertise may or may not be available.

By treating data as a product, data mesh pushes data ownership responsibility to the team with the domain understanding to create, catalog, and store the data. The theory is that doing this during the data creation phase brings more visibility to the data and makes it easier to consume. As well as stopping human knowledge silos from forming, it helps to truly democratize the data, because data consumers don’t have to worry about data discovery and can focus on experimentation, innovation, and producing more value from the data.

Approach the Data Mesh with Caution 

Despite data mesh architecture gaining a lot of traction there are concerns in the industry about its application. And of course, there are plenty of strong advocates for the benefits of data warehouses and lakes. Going a stage further, my colleague Helena Schwenk recently blogged on the new concept of the data ‘lakehouse’ as a means to increase the flexibility of modern data infrastructures. 

As I said at the start, data mesh isn’t a panacea. But if you do go down this route, getting your tech stack right – or as right as possible – will be crucial to data mesh efforts. You need a very powerful central system that can handle all this diverse access, which is the beauty of the simplicity and performance of the Exasol database.

Learning from the Pioneers 

If you’re looking to implement the data mesh architecture, let me share a few examples of companies you can learn from, who’ve been very open and transparent about their journeys. 

Netflix processes trillions of events and petabytes of data a day. As it has scaled up original productions, data integration across the streaming service and the studio has become a priority. Integrating data across hundreds of different data stores in a way that holistically optimizes cost, performance, and operational concerns presented a significant challenge, so Netflix turned to data mesh. This great YouTube video explains more.

Europe’s biggest online fashion retailer – and Exasol customer – Zalando has also been on a journey from a centralized data lake towards embracing a distributed data mesh architecture. Here’s another great YouTube video from NDC Oslo where Max Schultze outlines Zalando’s ongoing efforts to make creation of data products simple.

I’d love to hear your thoughts on data mesh architecture as well, so get in touch on social media and let us know what you think!

Mathias Golombek is Exasol’s CTO. He joined in 2004 as a software developer, led the database optimization team, and became a member of the executive board in 2013. Although Mathias is primarily responsible for Exasol’s technology, his most important role is to create a great environment in which smart people enjoy building powerful products.

Solve Your Data Lake Challenges with Exasol

If you’re struggling to manage and scale your data lake, let Exasol be your guide. Our analytics database and data mesh architecture can improve scalability and accessibility so your data is more valuable to everyone within your organization.

Additional Reading on Data Mesh

As mentioned above, data mesh creator Zhamak Dehghani’s papers How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh and Data Mesh Principles and Logical Architecture are required reading on the topic.