Datamarts, Data Vault and the Data Lake


Before there were Data Lakes and big data tools, there were Datamarts and the Data Vault.

This is a beginners overview of some of the terminology used by Business Intelligence teams in their data architecture.

 


Datamarts

This is the data that’s aggregated, modelled and structured so that you can build a report over the top of it. Your friendly BI Consultant has gathered the requirements from the business, agreed on the business rules and logic, then modelled the data to be user-friendly.

The Datamarts tables are organised in one of two forms. A ‘Star’ schema and a ‘Snowflake’ schema and are made of two types of tables.


Dimension Tables for Attributes

The Dimension table contains attributes or thing that describe. In the case of a User, this could be their first name, last name, age, etc. In the case of a Product, this could be it’s weight, colour or date of manufacture. These attributes should not be repeated in other tables around the database.


Fact Tables for Measures

The Fact table is used for counting how many times something has happened. How many orders, pay runs etc. They are used to capture an action that can then be joined back to a Dimension table.

 

Snowflake-schema-example
Example of a Snowflake schema by SqlPac at English Wikipedia, CC BY-SA 3.0

Star-schema-example
Example of a Star schema by SqlPac at English Wikipedia, CC BY-SA 3.0

Data Vault

The Data Vault is used to store source data at a more granular level. Generally, the data is not changed in any way, other than to add load date keys to track changes. These are the tables that build the Datamart tables.


Data Vault tables come in three types:

Sats (Satellites)

These are the complete source tables that contain descriptive information and time attributes so we can track changes and perform point-in-time analysis.


Hubs

These contain the business keys and any metadata. Nothing descriptive is written to a Hub.


Links

Links connect one or more Hubs together.


Data_Vault_Example
Example of a simple data vault model by Carbidfischer at English Wikipedia, CC BY-SA, CC BY-SA 4.0

Data Lake

Unlike the Data Vault, a Data Lake is designed to contain raw data at a low cost. Data can be:

Data stored here is retrieved and read when required and organised based on need, making it popular with data scientists who would rather keep the quirks and ‘noise’ in, rather than having it cleaned and aggregated.


Beware the Data Swamp

Even though the Data Lake concept gives more freedom to load data of all shapes and sizes, without a catalogue of entities, things get messy.

This isn’t just a matter of not knowing what’s in your Data Lake. If there are no controls over what kind of personal data is landed there could also be privacy issues.


Photo by Pixabay on Pexels

 

# # #

October 24, 2018

Bitnami