
Data Modelling Uncovered: Exploring Inmon, Kimball, and Data Vault with Real-World Examples

Mar 28

9 min read



In the previous post, we briefly introduced the three main data modelling approaches—Inmon, Kimball, and Data Vault—using a fun and relatable analogy. If you missed it, you can catch up [here].

In this post, we’ll take a deeper dive into each approach, exploring their unique methods with real-world data and demonstrating how to transform that data using each technique.


 
  1. Inmon Data Modelling


  • Inmon defines a data warehouse as a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions.

  • The data warehouse contains granular corporate data.

  • Data in the data warehouse can be used for many different purposes, including sitting and waiting for future requirements that are unknown today.

The subject-oriented and granularity aspects of the definition mean that an Inmon model organizes data into the major subject areas of the business and includes all the details related to those subjects.

In an e-commerce example:

  1. Data from orders, inventory, and marketing is ingested from separate systems.

  2. The data is stored in a highly normalized third normal form in the data warehouse.

  3. To meet departmental needs, the data is modeled into star schemas or other formats.

  4. The modeled data is placed in separate data marts for sales, marketing, and purchasing.

  5. Each department can query its own optimized data structure for specific use cases.


The strict normalization in the data warehouse minimizes data duplication, resulting in fewer analytical errors, improved data integrity, and greater consistency.


But how exactly do we normalize data?



To better understand Codd's normalization principles, let's examine the example sales order data below. We'll begin with a denormalized version and walk through the process of transforming it into third normal form (3NF) step by step.


Denormalized sales order data

This denormalized table shows order details with six columns and the order ID as the primary key. It includes redundant and nested data, such as the order items column, which contains objects with SKU, number, price, quantity, and name.

To convert this denormalized table to the first normal form (1NF), we'll unnest the order items column and replace it with four columns: item SKU, price, quantity, and name.


First normal form

Now, each row represents a single item in an order. Since an order can have multiple items, the order ID alone is no longer a unique primary key. We'll add an 'ItemNumber' column to ensure a unique primary key, creating a composite key of order ID and item number to uniquely identify each row.
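
To make this concrete, here's a minimal Python sketch of the unnesting step. The column and field names are illustrative assumptions based on the example table, not the actual source data:

denormalized_order = {
    "OrderID": 1001,
    "CustomerID": "C01",
    "CustomerName": "Ada Lovelace",
    "CustomerAddress": "123 Analytical Way",
    "OrderDate": "2024-03-01",
    "OrderItems": [  # nested objects violate 1NF
        {"SKU": "SKU-1", "Name": "Keyboard", "Price": 49.99, "Quantity": 1},
        {"SKU": "SKU-2", "Name": "Mouse", "Price": 19.99, "Quantity": 2},
    ],
}

def to_1nf(order: dict) -> list[dict]:
    """Unnest OrderItems so every value is atomic; (OrderID, ItemNumber)
    becomes the composite primary key."""
    rows = []
    for item_number, item in enumerate(order["OrderItems"], start=1):
        rows.append({
            "OrderID": order["OrderID"],
            "ItemNumber": item_number,  # added so the primary key stays unique
            "CustomerID": order["CustomerID"],
            "CustomerName": order["CustomerName"],
            "CustomerAddress": order["CustomerAddress"],
            "OrderDate": order["OrderDate"],
            "ItemSKU": item["SKU"],
            "ItemPrice": item["Price"],
            "ItemQuantity": item["Quantity"],
            "ItemName": item["Name"],
        })
    return rows

rows_1nf = to_1nf(denormalized_order)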

But this form still contains redundant data and can be further normalized by converting it into the second normal form (2NF).


First normal form (1NF) to Second normal form (2NF)

To achieve 2NF, the table must first meet the requirements of 1NF, and all partial dependencies must be removed.

A partial dependency occurs when non-key columns depend on only part of the composite key rather than on the whole key.

For example, columns like customer ID, customer name, customer address, and order date depend solely on the order ID. Since knowing the order ID uniquely identifies these columns, we can split the sales order table into two: an order items table and an orders table. In this setup, the composite key of order ID and item number is the unique primary key for the order items table, while the order ID serves as the primary key for the orders table.
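
Continuing the sketch, the move to 2NF splits the 1NF rows into an orders table and an order items table (same assumed column names as before):

def to_2nf(rows_1nf: list[dict]) -> tuple[list[dict], list[dict]]:
    """Remove partial dependencies: columns determined by OrderID alone
    move to an Orders table; item-level columns stay in OrderItems."""
    orders = {}       # one row per OrderID
    order_items = []  # one row per (OrderID, ItemNumber)
    for row in rows_1nf:
        orders[row["OrderID"]] = {
            "OrderID": row["OrderID"],
            "CustomerID": row["CustomerID"],
            "CustomerName": row["CustomerName"],
            "CustomerAddress": row["CustomerAddress"],
            "OrderDate": row["OrderDate"],
        }
        order_items.append({
            "OrderID": row["OrderID"],
            "ItemNumber": row["ItemNumber"],
            "ItemSKU": row["ItemSKU"],
            "ItemName": row["ItemName"],
            "ItemPrice": row["ItemPrice"],
            "ItemQuantity": row["ItemQuantity"],
        })
    return list(orders.values()), order_items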


So now there are no more partial dependencies in these tables, but they have another form of dependency called a transitive dependency.

A transitive dependency occurs when a non-key column depends on another non-key column. While this type of dependency can exist in a table that is in a 2NF, a table in 3NF needs to meet all the requirements of a 2NF and have no transitive dependencies.

For example, in the order items table, the price and name of an item depend on its SKU, while in the orders table, the customer name and address depend on the customer ID.


Second normal form (2NF) to Third normal form (3NF)

To move from 2NF to 3NF, we eliminate transitive dependencies.

For the order items table, we create a separate 'Items' table that contains the name, price, and SKU for each item, with SKU as the unique primary key.

Similarly, for the orders table, we remove the transitive dependencies by creating a 'Customers' table with the customer name and address, using customer ID as the unique primary key.
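
Finishing the sketch, the 3NF step extracts the Items and Customers tables and leaves only foreign keys behind:

def to_3nf(orders: list[dict], order_items: list[dict]):
    """Remove transitive dependencies: item attributes move to an Items
    table keyed by SKU; customer attributes move to a Customers table
    keyed by CustomerID."""
    items = {
        r["ItemSKU"]: {"SKU": r["ItemSKU"], "Name": r["ItemName"],
                       "Price": r["ItemPrice"]}
        for r in order_items
    }
    customers = {
        o["CustomerID"]: {"CustomerID": o["CustomerID"],
                          "CustomerName": o["CustomerName"],
                          "CustomerAddress": o["CustomerAddress"]}
        for o in orders
    }
    # The original tables keep only their own attributes plus foreign keys.
    orders_3nf = [
        {"OrderID": o["OrderID"],
         "CustomerID": o["CustomerID"],  # FK to Customers
         "OrderDate": o["OrderDate"]}
        for o in orders
    ]
    order_items_3nf = [
        {"OrderID": r["OrderID"], "ItemNumber": r["ItemNumber"],
         "ItemSKU": r["ItemSKU"],  # FK to Items
         "ItemQuantity": r["ItemQuantity"]}
        for r in order_items
    ]
    return orders_3nf, order_items_3nf, list(items.values()), list(customers.values())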


  • A database is typically considered normalized if it's in third normal form (3NF).

  • The degree of normalization depends on your use case. While denormalization can offer performance benefits by avoiding joins, normalization ensures efficient write operations and better data integrity. There's no one-size-fits-all solution.

 
  2. Kimball Data Modelling


  • Data is modelled in star schemas:

    • Fact table: business measures

    • Dimension tables: context information

  • Enables faster iteration and modelling

  • Requires a good understanding of business requirements, which might not be well-defined



Here's how a Kimball data warehouse can be applied to e-commerce:


After ingesting data from the orders, inventory, and marketing systems, you model the data into multiple star schemas to represent various business metrics, then store these schemas directly in the data warehouse.


But what exactly is a star schema?


Star schema

A star schema organizes business measures in a central fact table, surrounded by dimension tables that provide contextual information. This arrangement forms a star-like structure, which is how the schema gets its name.

While normalized models emphasize connecting data entities and reducing redundancy, the star schema (or dimensional model) focuses on organizing data for faster analytical queries and making it more accessible and understandable to business users.

Let's focus on the fact and dimension tables and how they support analytical queries.


  • The fact table stores quantitative business measures resulting from events, such as trip duration, price, and tip in a rideshare example. Each row represents the facts of a specific event.

  • When designing a star schema, you must define the "grain" — the level of detail in each fact table row.

    • In a rideshare scenario, the grain could be:

      • All rides in a day

      • A single customer’s rides

      • A single completed ride

    • The most detailed level, called the atomic grain, corresponds to each ride by a customer.

  • Fact tables are append-only and immutable, reflecting events that can't be changed after they happen.

  • Fact tables are typically narrow and long, with fewer columns but many rows representing events.


  • A fact table is accompanied by dimension tables, which provide reference data and context for the events.

    • Dimension tables describe the what, who, where, and when of each event.

    • They typically have many columns.

  • While fact tables are narrow and long, dimension tables are wide and short.

  • In the rideshare example, dimension tables could include information about:

    • Customers

    • Drivers

    • Trip locations


  • Fact tables are connected to dimension tables through foreign keys (FK).

  • Each dimension has a primary key (PK).

  • The fact table also has its own PK, typically a surrogate key rather than a natural key.

    • Surrogate keys allow data from source systems with varying key formats to be combined.

    • They decouple star schema keys from source database keys, which may change over time (see the sketch below).
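
Here is a minimal sketch of such a star schema as SQLite DDL, using the rideshare example; every table and column name is an illustrative assumption:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Wide-and-short dimension tables with surrogate primary keys.
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key
    customer_natural_id TEXT,          -- key carried over from the source system
    customer_name TEXT,
    customer_city TEXT
);
CREATE TABLE dim_driver (
    driver_key INTEGER PRIMARY KEY,
    driver_natural_id TEXT,
    driver_name TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,      -- e.g. 20240301
    full_date TEXT,
    day_of_week TEXT
);

-- Narrow-and-long fact table: one row per completed ride (atomic grain).
CREATE TABLE fact_ride (
    ride_key INTEGER PRIMARY KEY,      -- surrogate key
    customer_key INTEGER REFERENCES dim_customer (customer_key),
    driver_key INTEGER REFERENCES dim_driver (driver_key),
    date_key INTEGER REFERENCES dim_date (date_key),
    trip_duration_min REAL,
    price REAL,
    tip REAL
);
""")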


As a data engineer, you'll often need to transform data from a normalized schema into a star schema. For instance, you may need to extract normalized data from a relational database, model it into a star schema for easier querying, and then load it into department-specific data marts.

Let's walk through an example of converting normalized data in third normal form (3NF) into a star schema.


Here’s the diagram of the normalized data from Inmon's data modelling, consisting of four tables: Customers, Orders, Order Items, and Items. We’ll add a table for the stores where each order was placed.


Now, let's say you're tasked with modelling this data into a star schema for data analysts. We’ll follow Kimball’s four key steps for designing a star schema.



  1. Understand business needs: Identify the business events or processes to model in the fact tables

    • Example scenario: After gathering requirements from a data analyst, you learn they want to analyze sales data to understand product sales, store performance, and brand popularity.


  2. Decide the grain: whether it’s total sales, individual transactions, or product items within a transaction. For flexibility, choose the atomic grain to capture the lowest level of detail (individual product items in a sales transaction).



  3. Select the dimensions based on the data analyst's needs. Since they’re focused on analyzing sales by stores, dates, and brands, create separate dimension tables for stores, item features, and dates.



  4. Finally, create the fact table where each row represents a product in a sales transaction. Facts include quantity sold (item_quantity from the order items table) and price (from the items table).


    The fact table must also include foreign keys to link to the dimension tables (store_key, item_key, date_key). It should have a primary key, ideally a surrogate key created from the combination of order_id and item_line_number. The fact_order_items table would look like this:
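
As a rough sketch of this final step, the surrogate key and one fact row could be assembled in Python as below. The helper names and dictionary-based rows are assumptions, and in practice the dimension keys would come from lookups against the dimension tables:

import hashlib

def fact_surrogate_key(order_id: int, item_line_number: int) -> str:
    """Surrogate key hashed from order_id and item_line_number, as
    described above. MD5 is an illustrative choice, not prescribed."""
    return hashlib.md5(f"{order_id}|{item_line_number}".encode()).hexdigest()

def build_fact_row(order: dict, order_item: dict, item: dict) -> dict:
    """Assemble one fact_order_items row at the atomic grain: a product
    in a sales transaction."""
    return {
        "order_item_key": fact_surrogate_key(order["OrderID"],
                                             order_item["ItemNumber"]),
        "store_key": order["StoreID"],                         # FK to store dimension
        "item_key": order_item["ItemSKU"],                     # FK to item dimension
        "date_key": int(order["OrderDate"].replace("-", "")),  # FK to date dimension, e.g. 20240301
        "item_quantity": order_item["ItemQuantity"],           # fact
        "price": item["Price"],                                # fact
    }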



Each row in the order items fact table references exactly one row in the date dimension table, while a single date can be associated with zero or many orders. The relationship from the date dimension to the fact table is therefore one-to-many.

Similarly, each row in the fact table references one row in both the store and item dimension tables, and each store and item can be associated with multiple orders, again creating one-to-many relationships from the dimension tables to the fact table.



 
  3. Data Vault Modelling


  • Offers a more flexible design to use in an agile environment

  • Data Vault Model:

    • Hubs: core business concepts

    • Links: relationships

    • Satellites: context

  • Requires downstream modelling of the data



After ingesting data from the orders, inventory, and marketing systems, you model the data into hubs, links, and satellites in the enterprise data warehouse layer. The data is then modelled further downstream (for example, into star schemas) before it's consumed.


Let's take a closer look at the model in the enterprise data warehouse layer.



In the Data Vault model:

  • Hubs and links don’t store descriptive data; satellites do.

  • Users query a hub, which links to a satellite table containing relevant attributes.


Let's revisit the normalized data from Inmon's model, which includes four tables: Customers, Orders, Order Items, and Items, along with an additional table for the stores where each order was placed. We will transform this data into a Data Vault model using three key steps:


Step 1: Define the hubs

  • Hubs contain business keys that uniquely identify core business concepts.

  • To define business keys, ask: What is the identifiable business element? How do users search for data?

  • Business keys should not be tied to a specific source system for easy integration across systems.

  • In this case, the business keys would be:

    • Customer Hub: Customer ID

    • Order Hub: Order ID

    • Store Hub: Store ID

    • Item Hub: SKU

  • These keys represent unique business elements independent of source systems.


  • In addition to the business keys, a hub must also contain three additional standard fields:

    • Hash Key: A calculated hash of the business key, used as the primary key for the hub.

    • Load Date: The date the business key was first loaded into the hub.

    • Record Source: The source from which the unique row was obtained.
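
A minimal sketch of one Customer Hub row with these standard fields, assuming a SHA-256 hash and illustrative column names:

import hashlib
from datetime import datetime, timezone

def hub_customer_row(customer_id: str, record_source: str) -> dict:
    """One row in the Customer Hub: the business key plus the three
    standard fields. The hash of the business key is the primary key."""
    return {
        "customer_hash_key": hashlib.sha256(customer_id.encode()).hexdigest(),  # PK
        "customer_id": customer_id,                           # business key
        "load_date": datetime.now(timezone.utc).isoformat(),  # first loaded
        "record_source": record_source,                       # originating system
    }

print(hub_customer_row("C01", "orders_system"))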


Step 2: Define the links

  • Link tables connect two or more hubs to represent relationships between business keys.

  • Example Scenario:

    • Customer and Order Link: Connect the Order Hub to the Customer Hub to represent the customer who placed the order.

    • Item and Order Link: Connect the Item Hub to the Order Hub to represent the items ordered.

    • Order and Store Link: Connect the Order Hub to the Store Hub to represent the store where the order was placed.

  • Link Table Requirements:

    • Contain the primary and business keys from its parent hubs, e.g., order_hash_key and customer_hash_key

    • Include the load date of the row and the record source.

    • The primary key is a hash calculated from the business keys of the parent hubs, e.g., customer_order_hash_key

With link tables, you can easily add new relationships or update a current relationship without having to re-engineer the Data Vault.
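
Here's a sketch of one row in the Customer and Order Link following the requirements above; the hash construction and separator are assumptions:

import hashlib
from datetime import datetime, timezone

def _hash(*business_keys: str) -> str:
    """Hash one or more business keys; '|' as separator is an assumption."""
    return hashlib.sha256("|".join(business_keys).encode()).hexdigest()

def link_customer_order_row(customer_id: str, order_id: str,
                            record_source: str) -> dict:
    """One row in the Customer and Order Link."""
    return {
        "customer_order_hash_key": _hash(customer_id, order_id),  # PK over both business keys
        "customer_hash_key": _hash(customer_id),  # PK of the Customer Hub
        "order_hash_key": _hash(order_id),        # PK of the Order Hub
        "customer_id": customer_id,               # business key from parent hub
        "order_id": order_id,                     # business key from parent hub
        "load_date": datetime.now(timezone.utc).isoformat(),
        "record_source": record_source,
    }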

  • Adapting to Business Changes:

    • If the company shifts from selling items to selling services, create a new Service Hub and a corresponding Link table to connect it to the Order Hub.

    • Discontinue the use of the old Item Hub and Item Order Link.


  • Modeling Many-to-Many Relationships:

    • If the business allows multiple customers to contribute to the same order, use a Link Table to model this many-to-many relationship between the Customer Hub and the Order Hub.

    • No changes are needed in the design—just add the new link table.



Step 3: Define the satellites

  • Satellites provide meaning to the hubs and links.

  • Example Scenario:

    • Create a Customer Satellite Table with attributes like customer name and customer zip code.

    • A Satellite Table for the Item Order Link can also be added to capture additional context, such as the quantity of each item in the order.

    • Similarly, create satellite tables for Store, Order, and Item hubs.

  • Each satellite table should include the record source and have a primary key consisting of the hash key from the parent hub and the load date.
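
A sketch of one Customer Satellite row following these rules; the attribute names are assumptions:

import hashlib
from datetime import datetime, timezone

def sat_customer_row(customer_id: str, customer_name: str,
                     customer_zip: str, record_source: str) -> dict:
    """One row in the Customer Satellite. The primary key is the composite
    (customer_hash_key, load_date), so attribute changes over time become
    new rows rather than updates."""
    return {
        "customer_hash_key": hashlib.sha256(customer_id.encode()).hexdigest(),
        "load_date": datetime.now(timezone.utc).isoformat(),
        "record_source": record_source,
        "customer_name": customer_name,       # descriptive attribute
        "customer_zip_code": customer_zip,    # descriptive attribute
    }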


 






