How To Find And Resolve Blind Spots In Your Data

There’s a growing number of tools that you can use to analyze data for a business. But you can’t be fully confident in the results if you don’t take the data’s blind spots into account. There’s no single way to do that, but we’ll look at some possibilities here.

The first thing to keep in mind is that a blind spot generally represents an “unknown unknown.” In other words, it’s a factor you didn’t take into account because you didn’t think, or know, to consider it.

  1. Start Locating Your Dark Data

When many business analysts talk about blind spots in data, dark data comes into the conversation. Dark data is also called “unclassified data,” and it’s information your business has but does not use for analytical purposes or any other reason related to running the business.

If you don’t have any idea how much dark data your company has, what kind of information it entails, and where your company stores it, that unawareness could cause blind spots.

More specifically, having an excessive amount of dark data could mean you spend more time searching for data than analyzing it. Or, dark data could open your company to regulatory risks if you cannot retrieve requested information during an audit.

Similarly, some dark data contains sensitive information that hackers might try to get. If they’re successful, you may not know a data breach took place until months later — if at all.

Fortunately, there are specialized software options that can discover the data your company has — dark or otherwise — and clean it so that you can eventually use the data to meet your business analysis goals.
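As a rough illustration of what a first discovery pass might involve, here is a minimal sketch. It assumes that files nobody has modified for a long time are a reasonable proxy for dark data. The function name `find_stale_files` and the 365-day threshold are hypothetical choices for this example, and real discovery tools also classify content, which this does not attempt.

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_age_days=365, now=None):
    """Return (path, size_bytes, age_days) for files not modified in max_age_days.

    Assumption: "not modified for a long time" is only a rough proxy for dark data.
    """
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # skip files that vanish or cannot be read
            age_days = (now - info.st_mtime) / 86400
            if age_days >= max_age_days:
                stale.append((str(path), info.st_size, int(age_days)))
    # Largest files first, since they cost the most to store and search.
    return sorted(stale, key=lambda row: row[1], reverse=True)
```

Calling `find_stale_files("/some/shared/drive")` (the path is a placeholder) would give you a list to review by hand before deciding what to classify, archive or delete.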

Instead of being overly concerned about the business investment required for that software, think of the risks to your company if you continue to ignore your unclassified data and the blind spots it causes.

  2. Pay Attention to Data Stored on Mobiles and in the Public Cloud

It’s increasingly common for people to use smartphones and tablets during their workdays. Some of them do it especially frequently if they take part in fieldwork or visit clients at their homes. Vanson Bourne conducted a study for Veritas to find out more about dark data at the company level and ended up looking at mobile data, among other things.

The study revealed several fascinating conclusions. First, it showed that, on average, 52% of data within organizations is unclassified and untagged. Veritas asserted that this issue constitutes a security risk because it leaves potentially business-critical information up for grabs by hackers.


How to Make a Success Story of your Data Science Team

Data science resounds throughout every industry and has reached the mainstream media. I no longer have to explain what I do for a living as long as I call it AI  —  we are at the peak of data science hype!

As a consequence, more and more companies are looking towards data science with big expectations, ready to invest in a team of their own. Unfortunately, the realities of data science in the enterprise are far from a success story.

NewVantage published a survey in January 2019 which found that 77% of businesses report challenges with business adoption. This translates into ¾ of all data projects collecting dust rather than providing a return on the investment. Gartner has always been very critical of data science success, and they haven’t gotten more cheerful of late: according to Gartner in January 2019, even analytics insights will not deliver business outcomes through 2022. What hope is there, then, for data science? It’s apparent that, for a number of reasons, making data science a success is really hard!

Me scaring Execs about their data science investments at the Data Leadership Summit, London 2019.

Regardless of whether you manage an existing data science team or are about to start a new greenfield project in big data or AI, it’s important to acknowledge the inevitable: the Hype Cycle.

Luc Galoppin on Flickr

The increasing visibility of data science and AI comes hand in hand with a peak of inflated expectations. In combination with the current success rate of such projects and teams we are headed straight for the cliff edge towards the trough of disillusionment.

Christopher Conroy summarised it perfectly in a recent interview for Information Age: the renewed hype around AI simply gives a false impression of progress from where businesses were years ago with big data and data science. Did we just find an even higher cliff edge?

Thankfully, it’s not all bad news. Some teams, projects and businesses are indeed successful (around 30% according to the surveys). We simply need a new focus on the requirements for success.


The Data Fabric for Machine Learning – Part 2: Building a Knowledge-Graph

I’ve been talking about the data fabric in general and introducing some concepts of Machine Learning and Deep Learning in the data fabric. I also gave my definition of the data fabric:

The Data Fabric is the platform that supports all the data in the company: how it’s managed, described, combined and universally accessed. This platform is formed from an Enterprise Knowledge Graph to create a uniform and unified data environment.

If you take a look at the definition, it says that the data fabric is formed from an Enterprise Knowledge Graph, so we had better know how to create and manage it.



Set up the basis of knowledge-graphs theory and construction.


Explain the concepts of knowledge-graphs related to enterprises.
Give some recommendations about building a successful enterprise knowledge-graph.
Show examples of knowledge-graphs. 

Main theory

The fabric in the data fabric is built from a knowledge-graph. To create a knowledge-graph you need semantics and ontologies to find a useful way of linking your data that uniquely identifies and connects data with common business terms.

Section 1. What is a Knowledge-Graph?

The knowledge graph consists of integrated collections of data and information that also contain huge numbers of links between different data.

The key here is that instead of looking for possible answers, under this new model we’re seeking an answer. We want the facts — where those facts come from is less important. The data here can represent concepts, objects, things, people and actually whatever you have in mind. The graph fills in the relationships, the connections between the concepts.

In this context we can ask this question to our data lake:

What exists here?

We are in a different “here” now: one where it’s possible to set up a framework to study data and its relation to other data. In a knowledge-graph, information represented in a particular formal ontology can be more easily accessible to automated information processing, and how best to do this is an active area of research in computer science and data science.

All data modeling statements (along with everything else) in ontological languages and the world of knowledge-graphs for data are incremental, by their very nature. Enhancing or modifying a data model after the fact can be easily accomplished by modifying the concept.

With a knowledge-graph what we are building is a human-readable representation of data that uniquely identifies and connects data with common business terms. This “layer” helps end users access data autonomously, securely and confidently.
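To make the idea concrete, here is a minimal sketch of a knowledge-graph as a set of (subject, predicate, object) triples. It is a toy, in-memory illustration, not how an enterprise product stores its graph. The `KnowledgeGraph` class and the sample facts are made up for this example.

```python
from collections import defaultdict


class KnowledgeGraph:
    """A toy in-memory graph of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        # Adding a fact never requires changing existing ones:
        # the model grows incrementally.
        triple = (subject, predicate, obj)
        self.triples.add(triple)
        self.by_subject[subject].add(triple)

    def entities(self):
        """Answer "What exists here?": every node that appears in any triple."""
        subjects = {s for s, _, _ in self.triples}
        objects = {o for _, _, o in self.triples}
        return sorted(subjects | objects)

    def facts_about(self, subject):
        """Return every triple whose subject is the given entity."""
        return sorted(self.by_subject.get(subject, ()))


# Hypothetical business facts, expressed in common business terms.
kg = KnowledgeGraph()
kg.add("Acme Corp", "is_a", "Customer")
kg.add("Acme Corp", "placed", "Order 1001")
kg.add("Order 1001", "contains", "Widget")

print(kg.entities())
print(kg.facts_about("Acme Corp"))
```

The point of the sketch is that the concepts are the nodes and the graph fills in the relationships between them. Extending the model later is just another call to `add`.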


The Data Fabric for Machine Learning. Part 1-b: Deep Learning on Graphs.


We are in the process of defining a new way of doing machine learning, focusing on a new paradigm, the data fabric.

In the previous article I gave my new definition of machine learning:

Machine learning is the automatic process of discovering hidden insights in the data fabric by using algorithms that are able to find those insights without being specifically programmed for that, to create models that solve a particular (or multiple) problem(s).

The premise for understanding this is that we have created a data fabric. For me, the best tool out there for doing that is Anzo, as I mentioned in other articles.

You can build something called “The Enterprise Knowledge Graph” with Anzo, and of course create your data fabric.

But now I want to focus on a topic inside machine learning, deep learning. In another article I gave a definition of deep learning:

Deep learning is a specific subfield of machine learning, a new take on learning representations from data which puts an emphasis on learning successive “layers” [neural nets] of increasingly meaningful representations.

Here we’ll talk about a combination of deep learning and graph theory, and see how it can help move our research forward.

Set the basis of doing deep learning on the data fabric.


Describe the basics of deep learning on graphs.
Explore the library Spektral.
Validate the possibility of doing deep learning on the data fabric.

Main Hypothesis

If we can construct a data fabric that supports all the data in the company, the automatic process of discovering insights through learning increasingly meaningful representations from data using neural nets (deep learning) can run inside the data fabric.
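To give a feel for what learning representations on a graph involves, here is a minimal sketch of a single graph-convolution step in plain Python. It assumes the widely used rule ReLU(D^-1/2 (A + I) D^-1/2 X W), where A is the adjacency matrix, X the node features and W a weight matrix. This is not Spektral's API and not necessarily the approach this series ends up using. The names `gcn_layer` and `matmul` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adjacency, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    adjacency: n x n list of 0/1 values, features: n x f, weights: f x h.
    """
    n = len(adjacency)
    # Add self-loops so each node keeps its own features.
    a_hat = [
        [adjacency[i][j] + (1 if i == j else 0) for j in range(n)]
        for i in range(n)
    ]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    # Symmetric normalisation keeps feature scales comparable across nodes.
    norm = [
        [inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
        for i in range(n)
    ]
    aggregated = matmul(norm, features)        # mix each node with its neighbours
    transformed = matmul(aggregated, weights)  # learned linear map (fixed here)
    return [[max(0.0, value) for value in row] for row in transformed]
```

Stacking several such steps is what produces the successive "layers" of representations: after each one, a node's vector reflects a slightly larger neighbourhood of the graph.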


The Data Fabric for Machine Learning. Part 1.


If you search for machine learning online you’ll find around 2,050,000,000 results. Yeah for real. It’s not easy to find that description or definition that fits every use or case, but there are amazing ones. Here I’ll propose a different definition of machine learning, focusing on a new paradigm, the data fabric.

Explain the data fabric connection with machine learning.


Give a description of the data fabric and ecosystems to create it.
Explain in a few words what machine learning is.
Propose a way of visualizing machine learning insights inside of the data fabric.

Main theory

If we can construct a data fabric that supports all the data in the company, then a business insight inside of it can be thought of as a dent in it. We call the automatic process of discovering that insight machine learning.

Section 1. What is the Data Fabric?

I’ve talked before about the data fabric, and I gave a definition of it (I’ll put it here again below).

There are several words we should mention when we talk about the data fabric: graphs, knowledge-graph, ontology, semantics, linked-data. Read the article from above if you want those definitions; and then we can say that:

The Data Fabric is the platform that supports all the data in the company: how it’s managed, described, combined and universally accessed. This platform is formed from an Enterprise Knowledge Graph to create a uniform and unified data environment.

Let’s break that definition into parts. The first thing we need is a knowledge graph.

The knowledge graph consists of integrated collections of data and information that also contain huge numbers of links between different data. The key here is that instead of looking for possible answers, under this new model we’re seeking an answer. We want the facts; where those facts come from is less important. The data here can represent concepts, objects, things, people and actually whatever you have in mind. The graph fills in the relationships, the connections between the concepts.

Knowledge graphs also allow you to create structures for the relationships in the graph. With it, it’s possible to set up a framework to study data and its relation to other data (remember ontology?).
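One way to picture those structures is as a small set of rules about which kinds of things a relationship may connect. The sketch below assumes a hypothetical mini-ontology and made-up entities. Real ontology languages are far richer, so treat this only as an illustration of the idea.

```python
# Hypothetical mini-ontology: each predicate declares the types it may connect.
ONTOLOGY = {
    "placed": ("Customer", "Order"),
    "contains": ("Order", "Product"),
}

# Hypothetical type assignments for the entities in our graph.
TYPES = {"Acme Corp": "Customer", "Order 1001": "Order", "Widget": "Product"}


def violations(triples, ontology=ONTOLOGY, types=TYPES):
    """Return the triples whose predicate is unknown or links the wrong types."""
    bad = []
    for subject, predicate, obj in triples:
        expected = ontology.get(predicate)
        actual = (types.get(subject), types.get(obj))
        if expected is None or actual != expected:
            bad.append((subject, predicate, obj))
    return bad


facts = [
    ("Acme Corp", "placed", "Order 1001"),  # Customer -> Order: allowed
    ("Widget", "placed", "Acme Corp"),      # Product -> Customer: not allowed
]
print(violations(facts))
```

With rules like these in place, the graph can reject or flag facts that don't fit the model instead of silently storing them.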

In this context we can ask this question to our data lake:

What exists here?

The concept of the data lake is important too, because we need a place to store our data, govern it and run our jobs. But we need a smart data lake, a place that understands what we have and how to use it. That’s one of the benefits of having a data fabric.

The data fabric should be uniform and unified, meaning that we should make an effort in being able to organize all the data in the organization in one place and really manage and govern it.


Developing a Functional Data Governance Framework

Data Governance practices need to mature. Data Governance Trends in 2019 reports that dissatisfaction with the quality of business data continues in 2019, despite a growing understanding of Data Governance’s value.

Harvard Business Review reports 92 percent of executives say their Big Data and AI investments are accelerating, and 88 percent talk about a greater urgency to invest in Big Data and AI. In order for AI and machine learning to be successful, Data Governance must also be a success. Data Governance remains elusive to the 87 percent of businesses which, according to Gartner, have lower levels of Business Intelligence.

Recent news has also suggested a need to improve Data Governance processes. Data breaches continue to affect customers and the impacts are quite broad, as an organization’s customers (including banks, universities, and pharmaceutical companies) must continually take stock and change their user names and passwords. Effective Data Governance is a fundamental component of data security processes.

Data Governance has to drive improvements in business outcomes. “Implementing Data Governance poorly, with little connection or impact on business operations will just waste resources,” says Anthony Algmin, Principal at Algmin Data Leadership.

To mature, Data Governance needs to be business-led and a continuous process, as Donna Burbank and Nigel Turner emphasize. They recommend, as a first step, creating a Data Strategy that brings together organization and people, processes and workflows, Data Management and measures, and culture and communication. The next step is creating and choosing a Data Governance Framework. Most importantly, that Data Governance Framework should be tested periodically.

To truly be confident in Data Governance structures, organizations need to do the critical testing before a breach or some other unexpected event occurs. It is this notion—implementing some testing—that is missing in much current Data Governance literature. Thinking like a software tester provides an alternative way of learning good Data Governance fundamentals.
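Thinking like a tester could be as simple as writing governance rules down as checks that run against a data inventory. The following is a minimal sketch under stated assumptions: the `Dataset` fields, the three rules and the allowed classification labels are hypothetical examples, not requirements taken from any particular framework.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """One entry in a hypothetical data inventory."""
    name: str
    owner: str = ""
    classification: str = ""  # e.g. "public", "internal", "restricted"
    retention_days: int = 0


# Assumed set of labels; an organization would define its own.
ALLOWED_CLASSIFICATIONS = {"public", "internal", "restricted"}


def governance_failures(datasets):
    """Return a list of (dataset name, failed rule) pairs."""
    failures = []
    for ds in datasets:
        if not ds.owner:
            failures.append((ds.name, "missing owner"))
        if ds.classification not in ALLOWED_CLASSIFICATIONS:
            failures.append((ds.name, "unknown classification"))
        if ds.retention_days <= 0:
            failures.append((ds.name, "no retention period"))
    return failures


inventory = [
    Dataset("crm_contacts", owner="sales-ops", classification="restricted", retention_days=730),
    Dataset("legacy_export", classification="secret"),
]
print(governance_failures(inventory))
```

Run on a schedule, a check like this surfaces gaps before a breach or audit does, which is the spirit of testing the framework rather than assuming it works.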

Before Testing, Define Data Governance Requirements

Prior to offering feedback on any software developed, a great tester will ask for the product’s requirements to know what is expected and to clarify important ambiguities. Likewise, how does an organization know it has good Data Governance without understanding the agreed-upon specifications and its ultimate end? First, it helps to define what Data Governance is supposed to do. DATAVERSITY® defines Data Governance as:

“A collection of practices and processes which help to ensure the formal management of data assets within an organization. Data Governance often includes other concepts such as Data Stewardship, Data Quality, and others to help an enterprise gain better control over its data assets.”

How Data Governance is implemented depends on business demands specifically leading to a Data Governance solution in the first place. This means breaking down the data vision and strategy into sub-goals and their components, such as a series of use cases. Nigel Turner and Donna Burbank give the following use case examples:


The Data Catalog Drives Digital Transformation – Artificial Intelligence Drives the Catalog

The Data Management category of products began with a focus on Data Integration, Master Data Management, Data Quality and management of Data Dictionaries. Today, the category has grown in importance and strategic value, with products that enhance discoverability and usability of an organization’s data by its employees. Essentially, Data Management has shifted from a tactical focus on documentation and regulatory compliance to a proactive focus on driving adoption of Analytics and accelerating data-driven thinking. At the center of this change is the modern Data Catalog.

The Importance of the Catalog

Data Catalogs began life as little more than repositories for database schema, sometimes accompanied by business documentation around the database tables and columns. In the present technology environment, Data Catalogs are business-oriented directories that help users find the data they need, quickly. Instead of looking up a table name and reading its description, users can search for business entities, then find data sets related to them, so they can quickly perform analysis and derive insights. That’s a 180-degree turn toward the business and digital transformation.

While this newer, more business-oriented role for Data Catalogs is positive and progressive, it is not something that comes without effort. A Data Catalog is powerful only if its content is comprehensive and authoritative. Conversely, Data Catalogs that are missing key business or technical information will see poor adoption and can hinder an organization’s goals around building a data-driven culture. But how can enterprises, with their vast array of databases, applications and – increasingly – Data Lakes, build a catalog that is accurate and complete?

Begin to Build

One way to build a Data Catalog is by teaming business domain experts with technologists and going through the systems to which their expertise applies. Step-by-step, table-by-table and column-by-column, these experts can build out the knowledge base that is the Data Catalog. The problem with this approach is that it’s slow – slower, in fact, than the rate at which most organizations are adding new databases and data sets to their data landscape. As such, this approach is unsustainable.

Adding to the complexity, it’s increasingly the case that subject matter experts’ knowledge won’t cover databases in their entirety, and “tribal knowledge” is what’s really required to make a Data Catalog comprehensive and trustworthy. This then leads to an approach of “crowdsourcing” catalog information across business units and, indeed, the entire enterprise, to build out the catalog.

While the inclusivity of such an approach can be helpful, relying on crowdsourcing to augment business domain experts and build an authoritative catalog won’t get the job done. Crowdsourcing alone is a wing-and-a-prayer approach to Data Management.

Enter AI and ML

In the modern data arena, Artificial Intelligence and Machine Learning must be used alongside subject matter expertise and crowdsourcing, in order to fully leverage their value, and keep up with today’s explosive growth of data. Business domain expertise and crowdsourcing anchor the catalog. Machine Learning scales that knowledge across an enterprise’s data estate to make the catalog comprehensive.

Artificial Intelligence and Machine learning can be used to discover relationships in databases, or Data Lakes, as well as between multiples of these. While some of these relationships may be contained in metadata, many will not be. Machine Learning, by analyzing the data itself, can find these hidden relationships, allowing experts to confirm the discoveries and make them even more accurate going forward.
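As a simple stand-in for the kind of signal such a system might pick up from the data itself, the sketch below scores how completely the values in one column are contained in another. That is a common hint of a reference between them. It is a plain heuristic, not a trained model, and the function names, the 0.9 threshold and the sample columns are all assumptions made for illustration.

```python
from itertools import permutations


def inclusion_ratio(child_values, parent_values):
    """Share of distinct child values that also appear in the parent column."""
    child = set(child_values)
    if not child:
        return 0.0
    return len(child & set(parent_values)) / len(child)


def candidate_relationships(columns, threshold=0.9):
    """columns maps "table.column" to a list of values; returns likely references."""
    found = []
    for child, parent in permutations(columns, 2):
        ratio = inclusion_ratio(columns[child], columns[parent])
        if ratio >= threshold:
            found.append((child, parent, ratio))
    return found


# Hypothetical sample data from two tables with no declared foreign key.
sample = {
    "orders.customer_id": [1, 2, 2, 3],
    "customers.id": [1, 2, 3, 4],
}
print(candidate_relationships(sample))
```

Candidates found this way are only suggestions. As the text says, experts still confirm them, and those confirmations are what make later suggestions more accurate.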

Leveraging this relationship discovery helps extrapolate expert and crowd-sourced information in the catalog. When business entities are defined and associated with certain data elements, that same knowledge can be applied to related elements without having to be entered again. When business entities are tagged, the tags from related entities can be applied as well, so that discovered relationships can yield discovered tags.
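The extrapolation described here can be sketched as tags flowing along discovered relationships. This is a minimal illustration, assuming relationships are given as simple pairs of related elements. The `propagate_tags` function and the element names are hypothetical.

```python
def propagate_tags(tags, relationships):
    """Copy tags across discovered relationships until nothing changes.

    tags: dict mapping element -> set of tags entered by experts or the crowd.
    relationships: list of (element_a, element_b) pairs treated as related.
    """
    result = {element: set(values) for element, values in tags.items()}
    changed = True
    while changed:
        changed = False
        for a, b in relationships:
            merged = result.get(a, set()) | result.get(b, set())
            for element in (a, b):
                if result.get(element, set()) != merged:
                    result[element] = set(merged)
                    changed = True
    return result


expert_tags = {"customers.id": {"customer"}}
discovered = [
    ("orders.customer_id", "customers.id"),
    ("invoices.cust", "orders.customer_id"),
]
print(propagate_tags(expert_tags, discovered))
```

One tag entered once ends up on every related element. In practice you would probably mark propagated tags as suggestions so that an expert can confirm or reject them.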


Pillars of Data Governance Readiness: Enterprise Data Management Methodology

Facebook’s data woes continue to dominate the headlines and further highlight the importance of having an enterprise-wide view of data assets. The high-profile case is somewhat different than other prominent data scandals as it wasn’t a “breach,” per se. But questions of negligence persist, and in all cases, data governance is an issue.

This week, the Wall Street Journal ran a story titled “Companies Should Beware Public’s Rising Anxiety Over Data.” It discusses an IBM poll of 10,000 consumers in which 78% of U.S. respondents say a company’s ability to keep their data private is extremely important, yet only 20% completely trust organizations they interact with to maintain data privacy. In fact, 60% indicate they’re more concerned about cybersecurity than a potential war.

The piece concludes with a clear lesson for CIOs: “they must make data governance and compliance with regulations such as the EU’s General Data Protection Regulation [GDPR] an even greater priority, keeping track of data and making sure that the corporation has the ability to monitor its use, and should the need arise, delete it.”

With a more thorough data governance initiative and a better understanding of data assets, their lineage and useful shelf-life, and the privileges behind their access, Facebook likely could have gotten ahead of the problem and quelled it before it became an issue. Sometimes erasure is the best approach if the reward from keeping data onboard is outweighed by the risk.

But perhaps Facebook is lucky the issue arose when it did. Once the GDPR goes into effect, this type of data snare would make the company non-compliant, as the regulation requires direct consent from the data owner (as well as notification within 72 hours if there is an actual breach).

Considering GDPR, as well as the gargantuan PR fallout and governmental inquiries Facebook faced, companies can’t afford such data governance mistakes.

During the past few weeks, we’ve been exploring each of the five pillars of data governance readiness in detail and how they come together to provide a full view of an organization’s data assets. In this blog, we’ll look at enterprise data management methodology as the fourth key pillar.

Enterprise Data Management in Four Steps

Enterprise data management methodology addresses the need for data governance within the wider data management suite, with all components and solutions working together for maximum benefits.

A successful data governance initiative should both improve a business’ understanding of data lineage/history and install a working system of permissions to prevent access by the wrong people. On the flip side, successful data governance makes data more discoverable, with better context so the right people can make better use of it.

This is the nature of Data Governance 2.0 – helping organizations better understand their data assets and making them easier to manage and capitalize on – and it succeeds where Data Governance 1.0 stumbled.

Enterprise Data Management: So where do you start?

Metadata management provides the organization with the contextual information concerning its data assets. Without it, data governance essentially runs blind.

The value of metadata management is the ability to govern common and reference data used across the organization with cross-departmental standards and definitions, allowing data sharing and reuse, reducing data redundancy and storage, avoiding data errors due to incorrect choices or duplications, and supporting data quality and analytics capabilities.