MDW - TimeXtender Training

COURSE OVERVIEW

Training Module:

Modern Data Warehouse

Learn the concepts of data warehousing such as Dimensional Modeling, Dimension & Fact tables, and Denormalization etc.

Table of ContentS

Key Concepts: MDW

What is a Data Warehouse
What is Deployment and Execution?

Dimensional Modeling

What is Dimensional Modeling?
Denormalization
Dimension and Fact Tables
The Star Schema

Suggested Reading

Kimball’s Data Warehouse Toolkit (3rd Ed.)

Key Concepts: MDW

go to Table of Contents

What is a data warehouse?

It's a clean and organized collection of company data (most commonly stored in a relational database) which enables integrated reporting across systems to make informed business decisions. For this data to be easily accessible and accurate, three core functions must occur before it's presented to users:

Data Consolidation

Organizations generate data in Enterprise Resource Planning (ERP) software, Customer Relationship Management (CRM) software, Financial planning systems, text and Excel files. Combining these sources into a single, unified view reduces the complexity of working with multiple data sets and provides a comprehensive overview of the data.

Data cleansing

People aren't perfect, and they typically pay closer attention to performing their core job duties than they do on data entry. This leads to typos, duplicates, and inaccuracies. Data cleansing is the process of correcting these inaccuracies to ensure that the data is reliable for analysis and decision-making.

Data Modeling

Software systems store data for computers, not for people. So, to make it more easily consumable, data must be reorganized to more closely model how people categorize and understand the data. How the data is modeled, also has strong implications for how efficiently database systems can read and process large amounts of data.

That all sounds awfully complicated, why can't I just connect

(Power BI | Tableau | Qlik | Excel)

directly to my data source?

Even with automation, building a data warehouse requires significant investment and buy-in from multiple departments. If your source data is simple enough to extract value directly, then by all means, skip the data warehouse. However, as companies grow, data sources are added and source data becomes exceedingly more complex. It's at this point that an additional layer of infrastructure for consolidating, cleaning, and modeling the data become essential to make timely decisions.

what is deployment and execution?

deployment

TimeXtender creates or modifies existing tables, views, and stored procedures in the target storage database. Deployment is required when tables, views, fields, or lookups are added or changed.

execution

TimeXtender transfers and cleans data, resulting in a populated table. After the execution is complete the table can be previewed or queried. Execution can be initiated manually or run on a schedule. TimeXtender can execute multiple threads of data simultaneously (configured by editing the default execution package).

Differential Deployment: When enabled, TimeXtender calculates what steps require deployment (e.g. altering the raw table, or updating the data cleansing procedure.) and selects only those steps for deployment. When disabled, all steps are deployed.
Managed Deployment : When enabled, TimeXtender calculates dependencies and deploys the objects in the optimal order. When disabled, you need to ensure objects are deployed in the correct order.

During the deployment and execution TimeXtender generates multiple objects in SQL and can be viewed in SSMS. The table name without the postfix is the valid table.

The following postfix tables are generated by TimeXtender as seen in the gif.

_R Postfix: The table contains all the raw data.
_T Postfix: The table contains all the transformations.
_L Postfix: The table contains the logs and error.
_M Postfix: The table contains all the messages.

Dimensional Modeling

go to Table of Contents

What is dimensional modeling?

It's a well-established method for modeling large amounts of data to be efficient for analysis, more easily understood, and resistant to changing data source models. In order to understand dimensional modeling, you must first understand how most software stores data.

the normalized model

For software to effectively record data input by its users, it must reduce the number of locations where data is recorded. To do this application developers use a database design known as 3rd Normal form, or a normalized model. This approach reduces data redundancies by dividing data into many, separate tables. This way an update performed by the application only touches the database in a single location.

In the image here, each box represents a separate but related table in the normalized AdventureWorks database.

This model may be efficient for software systems, but much more difficult for humans to understand and analyze.

denormalization

Denormalization is the process of combining related data from multiple tables by repeating lower cardinality values in a single table. If we take the Product and Product Category tables as an example, the normalized model lists Products, Product categories, and Product subcategories only once in their associated tables.

Cardinality is the relationship between the number of rows in one table and the number of rows in another. In the example above, Product has more rows than ProductCategory, therefore Product has a higher cardinality.

During denormalization, tables are "joined" together, inserting the lower cardinality values directly into the final, higher cardinality table. Lower cardinality values are repeated several times and all the contextual information is combined into a single table known as a Dimension. This way we have all the information regarding each individual product in a single organized table with a clearly defined grain.

The Grain or Granularity of a table defines what each row in that table represents. In the example above, The Product table's grain is each individual product sold or consumed. When designing a dimensional model, it's important to first define a table's granularity.

dimension and Fact Tables

A key outcome of the data warehouse is to aggregate measurements of business processes and break the aggregates down into relevant categories. For example, which products create the most revenue? The "products" are the relevant category and "most revenue" is the aggregate measurement. To make analysis of this data as efficient as possible it's best to list all the similar data in the same place and format. So we organize data into two distinct table types:

fact table

The Fact Table is intended to record all the measurements for a specific business process, for example, items sold with the measurements being quantity, price, sale amount. Business often have a lot of transactions, these tables can get very tall, so it's best to only keep them narrow, keeping only the data that we absolutely need to aggregate. To do this, a Foreign Key is used to identify the additional contextual information stored in a separate dimension table.

DImension table

Dimension tables usually contain a vast number of attributes that offer context and relevant categories to segment data, making them wide. On the other hand, since products are sold more often than they are produced, dimension tables usually have fewer records than fact tables, making them short. To enable effortless identification of each entry in a dimension table and establish a link back to the fact table, every entry is assigned a unique Primary Key.

Foreign Key

Primary Key

One-to-many relationships

Primary and Foreign Keys are table values necessary for any relational database to function properly. A Primary Key is used to uniquely identify each row in a table. With those in place, Foreign keys can be used in other "related" tables to identify related information and join two tables together. Often times the fields in each table will have the same name, but not always.

While a valid Primary Key should always be unique, a fact table may refer to the same foreign key several times. For example, if three customers each buy the same product, the same foreign key would be listed three times. This is what's known as a One-to-Many Relationship. With the Product Dimension being the "One" (because there is only one of each product) and the Sales Fact being the "Many" (because the same product might be listed many times).

The star schema

The Star Schema is the successful result of dimensional modelling, in which a central, fact table is connected to several dimension tables through

primary and foreign key relationships.

By structuring tables in this way, the data is:

Easier to locate, since there is roughly one table for each "process" or "entity"
Faster to query, since the dimensional data is typically only one join away.