Dimension Discovery
In data warehousing, dimensions are parts of the data model that play a huge role in making data understandable and intuitively composable. If
you want to learn more about your users, it’s convenient to have a users
dimension table with all of the attributes that belong
to each user. If your company is expanding into global markets, maintaining a country
dimension table that keeps track of business-relevant
data for each individual country will come in handy. You can think of many more examples but the key role of dimensions in data warehousing
is to categorize, describe, and segment your event data.
A construction company may have a repair_orders
table that contains all repair order requests. It’s possible this table only has unique IDs
for the repair order, the municipality where the work is to be performed, and the dispatcher that will coordinate the repair. To provide meaningful
insights and reporting, you’ll need to pull in more information from other dimension tables such as the requester’s location, repair’s location,
dispatcher’s contact information, estimate price, and actual final price.
[Diagram of a simple repair_orders fact table with a few arrows pointing to the dimension ID fields]
The Power of Dimension Nodes in DJ
At first glance, a dimension node in DJ looks very similar to transform nodes. It’s a named node that includes a query which can select from one or more other nodes. However, a dimension node is different in that it can be referenced by columns on any source node, transform node, and even other dimension nodes. To understand that, let’s look at a simple example of how a fact table is enriched with data from a dimension table.
[Diagram of a simple left join from the repair_orders
table to the dispatchers
table]
The repair_orders
table contains an ID of the dispatcher who’s responsible for coordinating the work. In order to
pull in more information about the dispatcher, a join must be performed to the dispatchers
dimension table which has a dispatcher_id
column that serves as its primary key. As you can see, this requires knowledge of the table’s layout and the proper dimension table to join to.
It’s not uncommon for a request for particular datasets to be “enriched” in this way to go through multiple channels until a data professional
most familiar with the data model performs this join and stores the result in a new table.
In DJ, you can pre-empt these kind of requests by tagging the column with information about which dimension node it is tied to. Instead of actually
joining to the dispatchers
dimension node, you can label the dispatcher_id
column on the repair_orders
node with a
reference to the ID column on the dispatchers
table. Now when someone makes a request asking DJ “What dimension attributes are available for
the metric num_repair_orders
?”, DJ will list all of the attribute columns from the dispatchers
dimension node. If the user requests
the number of repair orders grouped by each dispatcher, including the dispatcher’s contact information, DJ uses the column label to perform the
correct join and retrieve the extra attributes from the dispatchers
table.
[Diagram showing DJ use the column dimension label to join to the dispatchers dimension node]
Normalization & Denormalization Techniques
In data modeling, there are two analogous techniques to organizing tables. Normalization separates data into smaller tables and leans into linking
them through join keys. The benefits of this technique is that it reduces redundant data and provides a more logical layout. For example, instead of
including employee attributes such as name, phone number, and salary in many duplicated columns across multiple tables throughout your data model,
a normalized data model would keep all of that information in a single employees
dimension table with only the employee IDs existing in other
tables. The downside of highly normalized data is that you pretty much always have to perform joins to enrich your event data beyond just IDs. In other
words, when you do need specific employee attributes beyond their unique ID, you always have to join to pull that information in.
The opposite technique is called Denormalization where data is stored in “wider” tables that already include many of the attributes of various dimensions. This technique leans into redundantly including attributes of the dimension on the same table. The benefit of denormalization is that it decreases the frequency with which joins need to be performed to pull in information about various dimensions. This means less work for the query engine to perform and the cost of redundancy in turn leads to better performing queries.
DJ doesn’t have an opinion on either data modeling technique. Both types of data models will work well and you can define metric and dimension nodes on either data model type. That being said, DJ does a great job at easing the burden of knowing which join to do in a highly normalized data model. Using dimension labels on columns, DJ abstracts away a complicated relationship model behind an intuitive API where data or SQL can be retrieved for a set of metrics and dimensions.