Training Module:
Modern Data Warehouse
Learn the concepts of data warehousing such as Dimensional Modeling, Dimension & Fact tables, and Denormalization etc.
Table of ContentS
Key Concepts: MDW
go to Table of Contents
What is a data warehouse?
It's a clean and organized collection of company data (most commonly stored in a relational database) which enables integrated reporting across systems to make informed business decisions. For this data to be easily accessible and accurate, three core functions must occur before it's presented to users:
Data Consolidation
Organizations generate data in Enterprise Resource Planning (ERP) software, Customer Relationship Management (CRM) software, Financial planning systems, text and Excel files. Combining these sources into a single, unified view reduces the complexity of working with multiple data sets and provides a comprehensive overview of the data.
Data cleansing
People aren't perfect, and they typically pay closer attention to performing their core job duties than they do on data entry. This leads to typos, duplicates, and inaccuracies. Data cleansing is the process of correcting these inaccuracies to ensure that the data is reliable for analysis and decision-making.
Data Modeling
Software systems store data for computers, not for people. So, to make it more easily consumable, data must be reorganized to more closely model how people categorize and understand the data. How the data is modeled, also has strong implications for how efficiently database systems can read and process large amounts of data.
That all sounds awfully complicated, why can't I just connect
(Power BI | Tableau | Qlik | Excel)
directly to my data source?
Even with automation, building a data warehouse requires significant investment and buy-in from multiple departments. If your source data is simple enough to extract value directly, then by all means, skip the data warehouse. However, as companies grow, data sources are added and source data becomes exceedingly more complex. It's at this point that an additional layer of infrastructure for consolidating, cleaning, and modeling the data become essential to make timely decisions.
what is deployment and execution?
deployment
TimeXtender creates or modifies existing tables, views, and stored procedures in the target storage database. Deployment is required when tables, views, fields, or lookups are added or changed.
execution
TimeXtender transfers and cleans data, resulting in a populated table. After the execution is complete the table can be previewed or queried. Execution can be initiated manually or run on a schedule. TimeXtender can execute multiple threads of data simultaneously (configured by editing the default execution package).
During the deployment and execution TimeXtender generates multiple objects in SQL and can be viewed in SSMS. The table name without the postfix is the valid table.
The following postfix tables are generated by TimeXtender as seen in the gif.
Dimensional Modeling
go to Table of Contents
What is dimensional modeling?
It's a well-established method for modeling large amounts of data to be efficient for analysis, more easily understood, and resistant to changing data source models. In order to understand dimensional modeling, you must first understand how most software stores data.
the normalized model
For software to effectively record data input by its users, it must reduce the number of locations where data is recorded. To do this application developers use a database design known as 3rd Normal form, or a normalized model. This approach reduces data redundancies by dividing data into many, separate tables. This way an update performed by the application only touches the database in a single location.
In the image here, each box represents a separate but related table in the normalized AdventureWorks database.
This model may be efficient for software systems, but much more difficult for humans to understand and analyze.
denormalization
Denormalization is the process of combining related data from multiple tables by repeating lower cardinality values in a single table. If we take the Product and Product Category tables as an example, the normalized model lists Products, Product categories, and Product subcategories only once in their associated tables.
Cardinality is the relationship between the number of rows in one table and the number of rows in another. In the example above, Product has more rows than ProductCategory, therefore Product has a higher cardinality.
During denormalization, tables are "joined" together, inserting the lower cardinality values directly into the final, higher cardinality table. Lower cardinality values are repeated several times and all the contextual information is combined into a single table known as a Dimension. This way we have all the information regarding each individual product in a single organized table with a clearly defined grain.
The Grain or Granularity of a table defines what each row in that table represents. In the example above, The Product table's grain is each individual product sold or consumed. When designing a dimensional model, it's important to first define a table's granularity.
dimension and Fact Tables
A key outcome of the data warehouse is to aggregate measurements of business processes and break the aggregates down into relevant categories. For example, which products create the most revenue? The "products" are the relevant category and "most revenue" is the aggregate measurement. To make analysis of this data as efficient as possible it's best to list all the similar data in the same place and format. So we organize data into two distinct table types:
fact table
The Fact Table is intended to record all the measurements for a specific business process, for example, items sold with the measurements being quantity, price, sale amount. Business often have a lot of transactions, these tables can get very tall, so it's best to only keep them narrow, keeping only the data that we absolutely need to aggregate. To do this, a Foreign Key is used to identify the additional contextual information stored in a separate dimension table.
DImension table
Dimension tables usually contain a vast number of attributes that offer context and relevant categories to segment data, making them wide. On the other hand, since products are sold more often than they are produced, dimension tables usually have fewer records than fact tables, making them short. To enable effortless identification of each entry in a dimension table and establish a link back to the fact table, every entry is assigned a unique Primary Key.
Foreign Key
Primary Key
One-to-many relationships
Primary and Foreign Keys are table values necessary for any relational database to function properly. A Primary Key is used to uniquely identify each row in a table. With those in place, Foreign keys can be used in other "related" tables to identify related information and join two tables together. Often times the fields in each table will have the same name, but not always.
While a valid Primary Key should always be unique, a fact table may refer to the same foreign key several times. For example, if three customers each buy the same product, the same foreign key would be listed three times. This is what's known as a One-to-Many Relationship. With the Product Dimension being the "One" (because there is only one of each product) and the Sales Fact being the "Many" (because the same product might be listed many times).
The star schema
The Star Schema is the successful result of dimensional modelling, in which a central, fact table is connected to several dimension tables through
primary and foreign key relationships.
By structuring tables in this way, the data is:
Suggested Reading
go to Table of Contents
These are just the basics
The above information is only a primer of the most basic data warehousing concepts. However, if you are planning to build an enterprise data warehouse, you should consider supplementing your education on this topic with additional resources.
Kimball's Data Warehouse Toolkit is considered to be one of the most definitive guides on the topic and goes into much greater detail with real world examples.
Section Quiz...
Planning to take the Solution Architect exam? Then you want to be sure these questions are easy to answer.
What are 3 core functions of a Data Warehouse?
What is Denormalization?
What's the difference between Fact and Dimension tables?
When you're ready, see Answers Below
Section Quiz Answers
Planning to take the Solution Architect exam? Then you want to be sure these questions are easy to answer.
What are 3 core functions of a Data Warehouse?
What is Denormalization?
Combining related data from multiple tables by repeating lower cardinality values in a single table.
What's the difference between Fact and Dimension tables?
Fact tables a skinny, long & include measures for aggregation, while dimensions are wide, short, & include attributes for categorization.
want to Learn even more?
Learn even more about dimensional modeling from TimeXtender Tuesdays.
Congratulations! You've completed the training module
Modern data warehouse