Back to top

Podium Pointer #17: Dimension Management in the Data Marketplace

Simplifying Business Transformation: Dimension Management in the Data Marketplace

Welcome to another installment of “Podium Pointers” – a series dedicated to tackling complex topics in as few words as possible, and leading readers to additional resources for information and assistance.

As companies transition from using RDBMSs and data warehouses, to using big data platforms and data lakes to manage data for analytic and reporting projects, some capabilities that were well understood and addressed in the past require a new approach.  One example is the question of how to track a slowly changing dimension (SCD) in the schema of analytic or reporting data environments in a data lake. This is particularly important given that, by default, data lakes don’t include a schema, much less capabilities to ensure schema integrity over time. Given the frequency of data updates and net new data loads into lakes, details on particular field level changes within a dimension can quickly be lost, creating hard to find, hard to diagnose data quality problems.

Here’s how the Podium Data Marketplace can help data teams flag slowly changing dimension events and surface those events to enable correct interpretation of that data before, during and after a change. One of several methods of dimension management is retaining a full history of values by adding a new row when a chosen field or attribute changes – marking the effective and expiration times identifying the period which the record was active (shown in the table below with “null” representing the current version in this case).

Supplier_Code

Supplier_Name

Supplier_State

Start_Date

End_Date

ABC

Acme Supply Co

CA

01-Jan-2000

21-Dec-2004

ABC

Acme Supply Co

IL

22-Dec-2004

NULL

Within Podium, the Type 2 SDC can be executed using one dataflow. Let’s take an example of a table called “policy_profile.”  As depicted below, it doesn’t contain history when on- boarded into the lake, just the current state of the policy associations. However, with point-in-time analysis, a critical aspect of operationalizing data lakes across the enterprise, we need to track the full history of changes. Thus, the creation of the “policy_profile_history” entity is necessary to the business to perform this analysis in addition to:

  • Detect changes
  • Manage change history
  • Add record effective dates
  • Answer the question, “What did the policy record contain on a given date?” 

Picture1.png

The Prepare Module within the Podium Data Marketplace provides a code-free, intuitive designer canvas and built in transformation library to create dataflows with simple drag and drop operations. With the combination of Change Data Capture (CDC), Router, Join, Transform and Union operators, a Podium Prepare dataflow shown below is created to accomplish the functional steps necessary to create the SCD policy profile history table updating only when policy changes are detected since the last build.

Picture2.png

Leveraging automatic validation and profiling ensuring good data quality and accuracy upon loading the policy_profile entity, each operator from the initial CDC to the final Union involved in creating the new target dataset is fully documented in Podium Prepare. For example, the Podium CDC operator generates a comparison with an additional field, called podium_iud, classifying the comparison with a specific value that is translated at the output port of the router operation shown below.

Picture3.png

The remaining dataflow, based on the result of the CDC operation, appropriately comes together populating the policy_profile_history table.  Once executed, a new target dataset is immediately available to be referenced on-demand for point-in-time analysis.  With the speed of data lakes and the dependability of data warehouses, the data marketplace ushers in a new era of capabilities. Driven by automation and operational self-service functionality, more users have data at their fingertips to generate insights faster and with a clear purpose.

Contact us for details of implementing SCDs on your data lake and stayed tuned for more Podium Pointers here on this blog.