The Data Management category of products began with a focus on Data Integration, Master Data Management, Data Quality and management of Data Dictionaries. Today, the category has grown in importance and strategic value, with products that enhance discoverability and usability of an organization’s data by its employees. Essentially, Data Management has shifted from a tactical focus on documentation and regulatory compliance to a proactive focus on driving adoption of Analytics and accelerating data-driven thinking. At the center of this change is the modern Data Catalog.
The Importance of the Catalog
Data Catalogs began life as little more than repositories for database schema, sometimes accompanied by business documentation around the database tables and columns. In the present technology environment, Data Catalogs are business-oriented directories that help users find the data they need, quickly. Instead of looking up a table name and reading its description, users can search for business entities, then find data sets related to them, so they can quickly perform analysis and derive insights. That’s a 180-degree turn toward the business and digital transformation.
While this newer, more-business positive role for Data Catalogs is positive and progressive, it is not something that comes without effort. A Data Catalog is powerful only if its content is comprehensive and authoritative. Conversely, Data Catalogs that are missing key business or technical information will see poor adoption and can hinder an organization’s goals around building a data-driven culture. But how can enterprises, with their vast array of databases, applications and – increasingly – Data Lakes, build a catalog that is accurate and complete?
Begin to Build
One way to build a Data Catalog is by teaming business domain experts with technologists and go through the systems to which their expertise applies. Step-by-step, table-by-table and column-by-column, these experts can build out the knowledge base that is the Data Catalog. The problem with this approach is that it’s slow – slower, in fact, than the rate at which most organizations are adding new databases and data sets to their data landscape. As such, this approach is unsustainable.
Adding to the complexity, it’s increasingly the case that subject matter experts’ knowledge won’t cover databases in their entirety, and “tribal knowledge” is what’s really required to make a Data Catalog comprehensive and trustworthy. This then leads to an approach of “crowdsourcing” catalog information across business units and, indeed, the entire enterprise, to build out the catalog.
While the inclusivity of such an approach can be helpful, relying on crowdsourcing to augment business domain experts and build an authoritative catalog won’t get the job done. Crowdsourcing alone is a wing-and-a-prayer approach to Data Management.
Enter AI and ML
In the modern data arena, Artificial Intelligence and Machine Leaning must be used alongside subject matter expertise and crowdsourcing, in order to fully leverage their value, and keep up with today’s explosive growth of data. Business domain expertise and crowdsourcing anchor the catalog. Machine Learning scales that knowledge across an enterprise’s data estate to make the catalog comprehensive.
Artificial Intelligence and Machine learning can be used to discover relationships in databases, or Data Lakes, as well as between multiples of these. While some of these relationships may be contained in metadata, many will not be. Machine Learning, by analyzing the data itself, can find these hidden relationships, allowing experts to confirm the discoveries and make them even more accurate going forward.
Leveraging this relationship discovery helps extrapolate expert and crowd-sourced information in the catalog. When business entities are defined and associated with certain data elements, that same knowledge can be applied to related elements without having to be entered again. When business entities are tagged, the tags from related entities can be applied as well, so that discovered relationships can yield discovered tags.