Data Noob - Data Warehouse Vs Data Lake

Mel O'Sullivan - October 22, 2020

When I  stepped in to the Data Analytics world I encountered a mind boggling array of confusing, apparently ambiguous and difficult to decipher terminology. All sorts of cute acronyms, dazzling visualisations, pretty dashboards and "snake oil" solutions to business pain points.

Be gentle with me - I'm a Noob in this space but I'll welcome conversations in the interest of learning more.

What on earth are Data Lakes, Data Warehouses, Data Clouds, Data Deserts, Data Silos, Data Swamps? Data is the new oil? Yes, Data Ecosystems are a thing. I haven't found evidence of a Data Porcupine yet - but if one exists it will probably be in the Cyber Security world.

Let's start with a Data Warehouse. A country woman, I've found a surprising real world equivalent to a Data Warehouse. I've included in this article a picture of a fairly complex Bulk Grain Silo Installation which works fairly well as a rough visualisation of how a Data Warehouse operates.

Grain handled in bulk is stored in silos which generally only contain one type of grain - for example, wheat, corn, or barley. Types of grain for storage are designated to silos before loading. Grain is loaded and unloaded in a highly organised, linear process. One truck lines up behind another to load to, or out of the designated silo. Grain is moved around the installation using either augurs or conveyor belts. Grains can only move in pre-determined directions - up, down, in, or out. Grain has to move in pre determined paths before it can be mixed or processed. Grains may be mixed or processed in some way to produce a designated value added product. At some point a human must be involved to ensure that the correct raw materials are mixed together to produce a saleable, quality assured, digestible product. When bulk grain moves it has viscosity - it flows and it gets blocked. Movement, age and use degrade quality. Grain has to be cleaned and preserved against invaders.

If you call a grain a piece of data you can get an idea of how a data warehouse stores and processes data. Designate the wheat silo "Sales Data". Designate the corn silo as "Costs". Call a conveyor a "Search" and call a feed mixer a "Report, or an Insight". Call the retail packaging a "Visualisation".

Can you see the similarity?

By contrast a Data Lake is a much more fluid thing. First of all you can store a lot more in it and the potential for scaleability is a lot higher. They can exist anywhere the "landscape" allows for them. You can draw a cup of water, or if you have the right tool you can empty the whole lake and wait for it to refill. You can draw water from any location in the lake. The water will have trickled or flowed into the lake from any number of sources - eg clouds, rain, runoff, soakage from underground, illegal industrial dumping, rivers, or some kid emptied their lunch water bottle in to it. The water may have been in the lake for a hundred years, or it may have filled yesterday from flooding. Yes - people do talk about data swamps - where the water is muddy, sludgy and slow moving.

Call a water molecule a piece of data. Clearly in a Data Lake you need special tools to refine, package and move the water. The data does not become actionable information/insight until it is packaged in a digestible fashion. On the upside you get access to datasets only limited in size by the amount of storage you have available.

Do you see the connection?

Data warehouses are structured and organized but have limitations in terms of size and scaleability. In a data warehouse the interrelationships between data types must be pre determined - not so in a Data Lake. Data lakes are huge, flexible, and scaleable. You get access to a huge dataset but it takes more effort to structure the data in order to produce insight and understanding. In some ways a Data Warehouse is often needed within a Data Lake to structure information so it can be interpreted by a human being. Humans beings generally struggle to reduce unstructured data to a digestible form by ourselves - there are simply too many things to look at. Data Lake analytic tools basically take unstructured data and turn it into structured data so human beings can interpret it.

Of course this is a very over simplified version of the technicalities of how Data Warehouses and Data Lakes operate. We all need start somewhere right?

At least that's how this Data Noob sees things.

Join our community

We have a Discord server where you’ll be able to chat with your instructors and cohort. Stay active in your learning!
Join discord