Viewpoint: Forrester’s Now Tech: Machine Learning Data Cataloging
Recently, Forrester Research defined a new product category (with, of course, a corresponding new acronym): Machine Learning Data Catalogs, a.k.a. MLDCs. We are pleased with how Forrester positioned us in this category, and you can enjoy a copy of this research here, courtesy of Podium Data.
Machine Learning Data Catalogs (MLDCs) are solutions that combine data cataloging with ML/AI: first they build a persistent repository of metadata, and then they apply ML/AI to ferret out and expose potentially useful insights about the underlying data assets. Forrester Principal Analyst Michelle Goetz, the author of the report, Now Tech: Machine Learning Data Catalogs, Q1 2018, describes a "new class of data governance and metadata management solutions... [that] maintain data's context fidelity and keep it embedded in data systems for fast, adaptive data activation."
In our view, three components are bundled up within this definition. Like the three legs of a stool, these components are essential individually and are key enablers for one another in an overall MLDC solution.
The first, and most obvious, component of an MLDC solution is the data catalog itself. The catalog needs to maintain a robust metadata repository describing a potentially large data collection. It also needs to allow data consumers across the enterprise to use and collaborate around that information to meet specific business needs. It follows that the value of the catalog depends entirely on the accuracy and completeness of the metadata it collects about the underlying data assets.
The second leg of the MLDC stool is the layer of ML/AI capabilities. These capabilities, deployed alongside the catalog and leveraging the metadata in it, convert first-order data metrics (such as a baseline statistical profile) into higher-level insights. In Forrester's words, "catalog and search capabilities are only as good as their ability to properly introspect the data."
"Introspect the data" here means much more than just copying metadata from other systems into the catalog and assuming it's right. Effective ML/AI requires that the MLDC solution include detailed technical metadata describing the exact content, structure, and quality of each data asset (down to the individual field level). It also needs to collect and curate business metadata from SMEs, whose proximity to the business and hands-on experience with specific data sources uniquely position them to add insight to the catalog.
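To make "field-level technical metadata" concrete, here is a minimal sketch in Python of the kind of baseline profile a catalog might compute by reading the data itself rather than trusting a source system's description. It is an illustration only, not Podium's implementation; the function names (profile_field, profile_table) and the choice of metrics are our own assumptions.

```python
from collections import Counter


def _infer_type(value: str) -> str:
    """Classify one raw text value as integer, decimal, or string."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "decimal"
    except ValueError:
        return "string"


def profile_field(name: str, values: list) -> dict:
    """Summarize one field from its raw values (hypothetical metric set)."""
    # Treat None and empty strings as nulls; everything else is profiled.
    non_null = [v for v in values if v not in (None, "")]
    type_counts = Counter(_infer_type(v) for v in non_null)
    return {
        "field": name,
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        # Most frequent observed type; "unknown" when every value is null.
        "inferred_type": type_counts.most_common(1)[0][0] if type_counts else "unknown",
        "top_values": Counter(non_null).most_common(3),
    }


def profile_table(rows: list) -> list:
    """Profile every field that appears in a list of row dictionaries."""
    field_names = []
    for row in rows:
        for key in row:
            if key not in field_names:
                field_names.append(key)
    return [profile_field(n, [row.get(n) for row in rows]) for n in field_names]


if __name__ == "__main__":
    sample = [
        {"id": "1", "amount": "10.5"},
        {"id": "2", "amount": ""},
        {"id": "3", "amount": "7.25"},
    ]
    for entry in profile_table(sample):
        print(entry)
```

A profile like this is only the first-order metric. The point of the second leg of the stool is that ML/AI can then work from these observed facts, instead of from declared metadata that may or may not match the data.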
Effective Data Management
An MLDC built on bad metadata (metadata that does not accurately and completely document the underlying data assets) is a bad catalog. Users who erroneously assume the catalog accurately reflects the underlying data assets risk reaching false conclusions from inaccurate reports or analytic models. ML/AI algorithms are similarly vulnerable. That's why the third leg of the stool for MLDC solutions has to be the implementation of rigorous data management practices, which guarantee that all data is thoroughly examined, organized, cleaned and documented before it is on-boarded into the collection. The results of that examination must be captured in the catalog and made available to both users and ML/AI algorithms to make the data transparent, trustworthy and ultimately more useful.
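As a rough illustration of examining data before it is on-boarded, the Python sketch below checks each incoming record against a prescribed schema and keeps the findings so they could be recorded in the catalog. Everything in it (FieldSpec, OnboardingReport, validate_rows, and the sample schema) is hypothetical and chosen for illustration; it is not a description of any vendor's product.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional


@dataclass
class FieldSpec:
    """One field of the prescribed schema (hypothetical structure)."""
    name: str
    parser: Callable[[str], object]  # must raise ValueError on malformed input
    required: bool = True


@dataclass
class OnboardingReport:
    """Findings that would be captured in the catalog alongside the data."""
    good: list
    rejected: list  # (row number, reason) pairs


def parse_iso_date(text: str) -> datetime:
    """Accept only YYYY-MM-DD dates; strptime raises ValueError otherwise."""
    return datetime.strptime(text, "%Y-%m-%d")


def _first_problem(row: dict, schema: list) -> Optional[str]:
    """Return the first reason a row fails the schema, or None if it passes."""
    extra = set(row) - {spec.name for spec in schema}
    if extra:
        return f"unexpected fields: {sorted(extra)}"
    for spec in schema:
        raw = row.get(spec.name) or ""
        if raw == "":
            if spec.required:
                return f"missing required field {spec.name!r}"
            continue
        try:
            spec.parser(raw)
        except ValueError:
            return f"field {spec.name!r} has malformed value {raw!r}"
    return None


def validate_rows(rows: list, schema: list) -> OnboardingReport:
    """Split rows into good and rejected, keeping a reason for each rejection."""
    good, rejected = [], []
    for number, row in enumerate(rows, start=1):
        reason = _first_problem(row, schema)
        if reason is None:
            good.append(row)
        else:
            rejected.append((number, reason))
    return OnboardingReport(good=good, rejected=rejected)


if __name__ == "__main__":
    schema = [
        FieldSpec("id", int),
        FieldSpec("opened", parse_iso_date),
        FieldSpec("note", str, required=False),
    ]
    rows = [
        {"id": "1", "opened": "2018-03-01", "note": ""},
        {"id": "x", "opened": "2018-03-01"},
        {"id": "3", "opened": "03/01/2018"},
    ]
    report = validate_rows(rows, schema)
    print(len(report.good), "good rows")
    for number, reason in report.rejected:
        print("row", number, "rejected:", reason)
```

The design choice worth noting is that rejected records are not silently dropped: each one carries a reason, so the catalog can show users and ML/AI algorithms what was found during the examination.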
Creating order from chaos in a new product segment is a thankless task.
In its report, Forrester buckets vendors who play in the MLDC space along two dimensions: once by company size, and again by the functionality segment of the solution. (Forrester put Podium in the midsize category ($25M - $100M annual MLDC revenue) for company size and in the “Embedded” functionality segment in the solution schema.)
Another approach would be to classify vendors based on how well they deliver the three essential components of an MLDC solution.
For example, stand-alone data catalog solutions, whose sole focus is on providing data catalogs for big data and traditional data collections, score high on cataloging capability. Some are also moving quickly to integrate more ML/AI capability into their solutions to help expose relationships between data or other higher-level insights. But none of these vendors directly touch or examine data as it comes into the catalog. None of them comprehensively validate, quality check, clean, normalize, profile and document the data as it is on-boarded (and then add those findings to the metadata catalog). Rather, they take the word of someone else. They assume the metadata they collect from source systems accurately describes that data, which is a dangerous leap of faith. They assume the data is clean and matches the prescribed schema. Or they rely on a partner to do that work and again assume the data they get is right. As a result, the quality of their catalog can be suspect. They have two legs of the stool, not three.
Every Mile in Data’s Journey Matters
At Podium, we firmly believe that data integrity begins at ingest, the “first mile” of the data journey, and continues beyond it. Podium’s attention to data on ingest, through automated data validation and profiling, covers a spectrum of critical checkpoints that Wranglers take for granted, yet benefit from, with their last-mile toolset. A personal trainer for the big data journey, Podium checks up front for data errors, incorrect formatting, and other idiosyncrasies common in mainframe and legacy big data sources.
Further data preparation and exploration capabilities continue under Podium where Wranglers do not: under a consolidated catalog of all data. This is new territory for Wranglers because their tools were built for data manipulation, not for cataloging and governing data. The Wranglers are now realizing this, which is why you see them attempting to extend their reach along the data path into ingestion, orchestration, preparation, governance, and exploration of data in a variety of modern and traditional repositories.