The Data Fabric for Machine Learning. Part 1.


If you search for machine learning online you’ll find around 2,050,000,000 results. Yes, really. It’s not easy to find a description or definition that fits every use case, though there are excellent ones. Here I’ll propose a different definition of machine learning, focusing on a new paradigm: the data fabric.

In this article I will:

  • Give a description of the data fabric and the ecosystems for creating it.
  • Explain in a few words what machine learning is.
  • Explain the connection between the data fabric and machine learning.
  • Propose a way of visualizing machine learning insights inside the data fabric.

Main theory

If we can construct a data fabric that supports all the data in the company, then a business insight inside it can be thought of as a dent in that fabric. The automatic process of discovering that insight is called machine learning.
Section 1. What is the Data Fabric?

I’ve talked before about the data fabric, and I gave a definition of it (I’ll repeat it below).

There are several terms we should mention when we talk about the data fabric: graphs, knowledge graph, ontology, semantics, linked data. Read that earlier article if you want those definitions; with them in place, we can say that:

The Data Fabric is the platform that supports all the data in the company: how it’s managed, described, combined, and universally accessed. This platform is built from an Enterprise Knowledge Graph to create a uniform and unified data environment.

Let’s break that definition into parts. The first thing we need is a knowledge graph.

The knowledge graph consists of integrated collections of data and information that also contain huge numbers of links between different data points. The key here is that instead of looking for possible answers, under this new model we’re seeking an answer. We want the facts; where those facts come from is less important. The data here can represent concepts, objects, things, people, and really whatever you have in mind. The graph fills in the relationships, the connections between the concepts.

Knowledge graphs also allow you to create structures for the relationships in the graph. With it, it’s possible to set up a framework to study data and its relation to other data (remember ontology?).
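To make this concrete, here is a minimal sketch of a knowledge graph as subject-predicate-object triples, with a tiny ontology layer that types each entity. The entity and relation names are hypothetical, chosen only for illustration; real systems use RDF stores or graph databases.

```python
# A tiny knowledge graph: facts as (subject, predicate, object) triples,
# plus an ontology that assigns a type to each entity.
ontology = {
    "Acme Corp": "Company",
    "Alice": "Person",
    "Q3 Sales Report": "Dataset",
}

triples = [
    ("Alice", "works_for", "Acme Corp"),
    ("Alice", "owns", "Q3 Sales Report"),
    ("Q3 Sales Report", "describes", "Acme Corp"),
]

def neighbors(entity):
    """Return every (predicate, other_entity) connected to `entity`."""
    out = []
    for s, p, o in triples:
        if s == entity:
            out.append((p, o))
        if o == entity:
            out.append((p, s))
    return out

# "What exists here?" -- type an entity, then explore its connections.
print(ontology["Alice"])       # Person
print(neighbors("Acme Corp"))
```

The point of the sketch is that the graph stores relationships as first-class data, so asking "what connects to this concept?" is a simple traversal rather than a join you have to design in advance.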

In this context we can ask this question to our data lake:

What exists here?

The concept of the data lake is important too, because we need a place to store our data, govern it, and run our jobs. But we need a smart data lake, a place that understands what we have and how to use it; that’s one of the benefits of having a data fabric.

The data fabric should be uniform and unified, meaning we should make the effort to organize all of the organization’s data in one place and truly manage and govern it.


Developing a Functional Data Governance Framework

Data Governance practices need to mature. Data Governance Trends in 2019 reports that dissatisfaction with the quality of business data continues in 2019, despite a growing understanding of Data Governance’s value.

Harvard Business Review reports 92 percent of executives say their Big Data and AI investments are accelerating, and 88 percent talk about a greater urgency to invest in Big Data and AI. In order for AI and machine learning to be successful, Data Governance must also be a success. Data Governance remains elusive to the 87 percent of businesses which, according to Gartner, have lower levels of Business Intelligence.

Recent news has also suggested a need to improve Data Governance processes. Data breaches continue to affect customers, and the impacts are quite broad, as customers of affected organizations (banks, universities, and pharmaceutical companies among them) must continually take stock and change their usernames and passwords. Effective Data Governance is a fundamental component of data security processes.

Data Governance has to drive improvements in business outcomes. “Implementing Data Governance poorly, with little connection or impact on business operations will just waste resources,” says Anthony Algmin, Principal at Algmin Data Leadership.

To mature, Data Governance needs to be business-led and treated as a continuous process, as Donna Burbank and Nigel Turner emphasize. They recommend, as a first step, creating a Data Strategy that brings together organization and people, processes and workflows, Data Management and measures, and culture and communication. The next step is choosing and implementing a Data Governance Framework. Most importantly, that framework should be tested periodically.

To truly be confident in Data Governance structures, organizations need to do the critical testing before a breach or some other unexpected event occurs. It is this notion—implementing some testing—that is missing in much current Data Governance literature. Thinking like a software tester provides an alternative way of learning good Data Governance fundamentals.

Before Testing, Define Data Governance Requirements

Prior to offering feedback on any software developed, a great tester will ask for the product’s requirements to know what is expected and to clarify important ambiguities. Likewise, how does an organization know it has good Data Governance without understanding the agreed-upon specifications and its ultimate end? First, it helps to define what Data Governance is supposed to do. DATAVERSITY® defines Data Governance as:

“A collection of practices and processes which help to ensure the formal management of data assets within an organization. Data Governance often includes other concepts such as Data Stewardship, Data Quality, and others to help an enterprise gain better control over its data assets.”

How Data Governance is implemented depends on the business demands specifically leading to a Data Governance solution in the first place. This means breaking down the data vision and strategy into sub-goals and their components, such as a series of use cases; Nigel Turner and Donna Burbank offer several examples of such use cases.


The Data Catalog Drives Digital Transformation – Artificial Intelligence Drives the Catalog

The Data Management category of products began with a focus on Data Integration, Master Data Management, Data Quality and management of Data Dictionaries. Today, the category has grown in importance and strategic value, with products that enhance discoverability and usability of an organization’s data by its employees. Essentially, Data Management has shifted from a tactical focus on documentation and regulatory compliance to a proactive focus on driving adoption of Analytics and accelerating data-driven thinking. At the center of this change is the modern Data Catalog.

The Importance of the Catalog

Data Catalogs began life as little more than repositories for database schema, sometimes accompanied by business documentation around the database tables and columns. In the present technology environment, Data Catalogs are business-oriented directories that help users find the data they need, quickly. Instead of looking up a table name and reading its description, users can search for business entities, then find data sets related to them, so they can quickly perform analysis and derive insights. That’s a 180-degree turn toward the business and digital transformation.
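That entity-first lookup can be sketched in a few lines. The datasets and business entities below are hypothetical, chosen only to show the shape of the search; a real catalog would back this with a proper index and metadata store.

```python
# A toy data catalog: datasets annotated with the business entities they cover.
catalog = [
    {"name": "crm.accounts",     "entities": {"customer", "account"}},
    {"name": "billing.invoices", "entities": {"customer", "invoice"}},
    {"name": "hr.employees",     "entities": {"employee"}},
]

def find_datasets(business_entity):
    """Entity-first search: return the datasets covering a business entity."""
    return [d["name"] for d in catalog if business_entity in d["entities"]]

# A user searches for "customer" and gets data sets, not table names.
print(find_datasets("customer"))  # ['crm.accounts', 'billing.invoices']
```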

While this newer, more business-oriented role for Data Catalogs is positive and progressive, it does not come without effort. A Data Catalog is powerful only if its content is comprehensive and authoritative. Conversely, Data Catalogs that are missing key business or technical information will see poor adoption and can hinder an organization’s goals around building a data-driven culture. But how can enterprises, with their vast array of databases, applications and, increasingly, Data Lakes, build a catalog that is accurate and complete?

Begin to Build

One way to build a Data Catalog is to team business domain experts with technologists and have them go through the systems to which their expertise applies. Step by step, table by table, and column by column, these experts can build out the knowledge base that is the Data Catalog. The problem with this approach is that it’s slow: slower, in fact, than the rate at which most organizations add new databases and data sets to their data landscape. As such, this approach is unsustainable.

Adding to the complexity, it’s increasingly the case that subject matter experts’ knowledge won’t cover databases in their entirety, and “tribal knowledge” is what’s really required to make a Data Catalog comprehensive and trustworthy. This then leads to an approach of “crowdsourcing” catalog information across business units and, indeed, the entire enterprise, to build out the catalog.

While the inclusivity of such an approach can be helpful, relying on crowdsourcing to augment business domain experts and build an authoritative catalog won’t get the job done. Crowdsourcing alone is a wing-and-a-prayer approach to Data Management.

Enter AI and ML

In the modern data arena, Artificial Intelligence and Machine Learning must be used alongside subject matter expertise and crowdsourcing in order to fully leverage their value and keep up with today’s explosive growth of data. Business domain expertise and crowdsourcing anchor the catalog; Machine Learning scales that knowledge across an enterprise’s data estate to make the catalog comprehensive.

Artificial Intelligence and Machine Learning can be used to discover relationships within databases or Data Lakes, as well as between several of them. While some of these relationships may be captured in metadata, many will not be. Machine Learning, by analyzing the data itself, can find these hidden relationships, allowing experts to confirm the discoveries and make them even more accurate going forward.

Leveraging this relationship discovery helps extrapolate expert and crowd-sourced information in the catalog. When business entities are defined and associated with certain data elements, that same knowledge can be applied to related elements without having to be entered again. When business entities are tagged, the tags from related entities can be applied as well, so that discovered relationships can yield discovered tags.
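As a rough sketch of how such discovery and tag propagation might work, the snippet below compares column value overlap (Jaccard similarity) to suggest hidden relationships between tables, then propagates an expert-entered tag across the discovered links. The tables, columns, and similarity threshold are invented for illustration; production catalogs use far more sophisticated data profiling.

```python
# Toy relationship discovery: columns with high value overlap likely join.
columns = {
    "crm.accounts.account_id":  {"A1", "A2", "A3", "A4"},
    "billing.invoices.account": {"A2", "A3", "A4", "A9"},
    "hr.employees.badge":       {"B7", "B8"},
}

tags = {"crm.accounts.account_id": {"customer"}}  # expert-entered tag

def jaccard(a, b):
    """Overlap between two sets of column values, 0.0 to 1.0."""
    return len(a & b) / len(a | b)

# Discover candidate relationships above an (assumed) similarity threshold.
links = []
names = list(columns)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        if jaccard(columns[x], columns[y]) >= 0.5:
            links.append((x, y))

# Propagate tags across discovered links so related columns inherit them.
for x, y in links:
    shared = tags.get(x, set()) | tags.get(y, set())
    tags[x], tags[y] = set(shared), set(shared)

print(links)
print(tags["billing.invoices.account"])  # {'customer'}
```

Here the overlap between the two account columns is 3 shared values out of 5 total (0.6), so a relationship is suggested, and the "customer" tag entered once by an expert flows to the related column without being entered again.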


Pillars of Data Governance Readiness: Enterprise Data Management Methodology


Facebook’s data woes continue to dominate the headlines and further highlight the importance of having an enterprise-wide view of data assets. The high-profile case is somewhat different from other prominent data scandals, as it wasn’t a “breach,” per se. But questions of negligence persist, and in all cases, data governance is an issue.

This week, the Wall Street Journal ran a story titled “Companies Should Beware Public’s Rising Anxiety Over Data.” It discusses an IBM poll of 10,000 consumers in which 78% of U.S. respondents say a company’s ability to keep their data private is extremely important, yet only 20% completely trust organizations they interact with to maintain data privacy. In fact, 60% indicate they’re more concerned about cybersecurity than a potential war.

The piece concludes with a clear lesson for CIOs: “they must make data governance and compliance with regulations such as the EU’s General Data Protection Regulation [GDPR] an even greater priority, keeping track of data and making sure that the corporation has the ability to monitor its use, and should the need arise, delete it.”

With a more thorough data governance initiative and a better understanding of data assets, their lineage and useful shelf-life, and the privileges behind their access, Facebook likely could have gotten ahead of the problem and quelled it before it became an issue. Sometimes erasure is the best approach if the reward from keeping data onboard is outweighed by the risk.

But perhaps Facebook is lucky the issue arose when it did. Once the GDPR goes into effect, this type of data snare would make the company non-compliant, as the regulation requires direct consent from the data owner (as well as notification within 72 hours if there is an actual breach).

Considering GDPR, as well as the gargantuan PR fallout and governmental inquiries Facebook faced, companies can’t afford such data governance mistakes.

During the past few weeks, we’ve been exploring each of the five pillars of data governance readiness in detail and how they come together to provide a full view of an organization’s data assets. In this blog, we’ll look at enterprise data management methodology as the fourth key pillar.
Enterprise Data Management in Four Steps

Enterprise data management methodology addresses the need for data governance within the wider data management suite, with all components and solutions working together for maximum benefits.

A successful data governance initiative should both improve a business’ understanding of data lineage/history and install a working system of permissions to prevent access by the wrong people. On the flip side, successful data governance makes data more discoverable, with better context so the right people can make better use of it.
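To illustrate the permissions side, here is a minimal sketch of a role-based access check over data assets. The roles, assets, and policy are invented for illustration; real governance tooling layers this over catalogs and identity providers.

```python
# Toy role-based access control for data assets.
policy = {
    "finance.revenue": {"finance_analyst", "cfo"},
    "hr.salaries":     {"hr_admin"},
}

user_roles = {
    "dana": {"finance_analyst"},
    "eli":  {"hr_admin"},
}

def can_read(user, asset):
    """Allow access only when the user holds a role granted on the asset."""
    return bool(user_roles.get(user, set()) & policy.get(asset, set()))

print(can_read("dana", "finance.revenue"))  # True
print(can_read("dana", "hr.salaries"))      # False
```

The same policy table that blocks the wrong people doubles as context for the right ones: it documents who is expected to use each asset.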

This is the nature of Data Governance 2.0 – helping organizations better understand their data assets and making them easier to manage and capitalize on – and it succeeds where Data Governance 1.0 stumbled.

Enterprise Data Management: So where do you start?

Metadata management provides the organization with the contextual information concerning its data assets. Without it, data governance essentially runs blind.

The value of metadata management is the ability to govern common and reference data used across the organization with cross-departmental standards and definitions. This allows:

  • data sharing and reuse;
  • reduced data redundancy and storage;
  • fewer data errors due to incorrect choices or duplications;
  • stronger data quality and analytics capabilities.


Data Privacy and Blockchain in the Age of IoT

Thanks to the internet of things (IoT), the world is connected like never before. While that fact has opened up a vast array of opportunities for communication and data sharing across platforms, it also comes with concerns. Specifically, how can we ensure that our personal information is protected? And what roles do data scientists and big data play in protecting that sensitive information?

Blockchain technology is becoming an essential tool that can protect HIPAA-protected medical data and other forms of personal information that are worth quite a bit on the dark web. Society as a whole has come to favor wireless connectivity and 24/7 accessibility, but we need to ensure that our data privacy remains intact in the age of IoT-driven technology.
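The core integrity property blockchain offers can be sketched in a few lines: each block stores a hash of its predecessor, so tampering anywhere in the history breaks the chain. This is a bare-bones illustration, not a real ledger (no consensus, no signatures, no distribution).

```python
import hashlib
import json

def block_hash(block):
    """Deterministic SHA-256 over the block's contents."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def add_block(chain, data):
    """Append a block that commits to the previous block's hash."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"data": data, "prev": prev})

def verify(chain):
    """Check every block's stored hash against its actual predecessor."""
    for i in range(1, len(chain)):
        if chain[i]["prev"] != block_hash(chain[i - 1]):
            return False
    return True

chain = []
add_block(chain, "medical record A")
add_block(chain, "medical record B")
print(verify(chain))         # True
chain[0]["data"] = "forged"  # tamper with history...
print(verify(chain))         # False: the tampering is detected
```

This tamper-evidence is why the technique appeals for sensitive records: altering old data silently is computationally impractical once later blocks commit to it.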

What Is Your Personal Data Worth?

As your personal data can take myriad forms, the amount it’s worth can vary considerably. For example, advertisers are interested in your consumer profile and shopping behavior data to develop personalized ads and run targeted campaigns. To cybercriminals, who aim to leverage personal data in order to commit identity theft, that data can be worth far more.

But that type of data sharing isn’t all bad: Consumer data can also be used by companies that have issued a product recall or by legal professionals putting together a claim against a negligent manufacturer. By identifying consumers who have purchased a defective product, companies can properly retrieve products at the individual level. In some instances, this can protect you and your family from injury and even death, as exemplified by the Fisher-Price Rock ‘n Play Sleeper, which was recalled in April 2019.

But where are such organizations finding your data? Much of your consumer profile data can be found on your social media sites — especially Facebook, which has no problem sharing your data with advertisers. That’s because the corporation brings in an enormous amount of revenue from those advertisers.


An Introduction to Deep Learning and Neural Networks

It seems as if not a week goes by in which the artificial intelligence concepts of deep learning and neural networks make it into media headlines, either due to an exciting new use case or in an opinion piece speculating whether such rapid advances in AI will eventually replace the majority of human labor. Deep learning has improved speech recognition, genomic sequencing, and visual object recognition, among many other areas.

The availability of exceptionally powerful computer systems at a reasonable cost, combined with the influx of large swathes of data that define the so-called Age of Big Data and the talents of data scientists, have together provided the foundation for the accelerated growth and use of deep learning and neural networks.

Companies are now beginning to adopt AI frameworks and libraries, such as MxNet, which is a deep learning framework that gives users the ability to train deep learning models using a variety of languages. There are also dedicated AI platforms aimed at supporting data scientists in deep learning modeling and training which professionals can integrate into their workflows.

It’s important, though, to specify that deep learning, neural networks, and machine learning are not interchangeable terms. This article helps to clarify the definitions for you with an introduction to deep learning and neural networks.

Deep Learning and Neural Networks Defined

Neural Network

An artificial neural network, shortened to neural network for simplicity, is a computer system that has the ability to learn how to perform tasks without any task-specific programming. For example, a simple neural network might learn how to recognize images that contain elephants using data alone.

The term neural network comes from the inspiration behind the architectural design of these systems, which was to mimic the basic structure of a biological brain’s own neural network so that computers could perform specific tasks.

The neural network has a layered design, with an input layer, an output layer, and one or more hidden layers between them. Mathematical functions, termed neurons, operate at every layer. A neuron receives inputs and produces an output. Initially, random weights are associated with the inputs, making the output of each neuron random. By using an algorithm that feeds errors back through the network, the system adapts the weights at each neuron and becomes better at producing accurate outputs.
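The error-feedback loop described above can be sketched in a few lines of Python with NumPy. This toy network has one hidden layer and takes a single backpropagation step; the layer sizes, learning rate, and data are arbitrary, chosen only to show that feeding the error back and adjusting the weights reduces the loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 2 inputs -> 3 hidden neurons -> 1 output, random initial weights.
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

X = np.array([[0.0, 1.0], [1.0, 0.0]])  # two training examples
y = np.array([[1.0], [0.0]])            # their target outputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = sigmoid(X @ W1 + b1)    # hidden layer activations
    out = sigmoid(h @ W2 + b2)  # network output
    return h, out

def mse(out):
    return float(np.mean((out - y) ** 2))

# One backpropagation step: push the error back and nudge the weights.
h, out = forward(X)
loss_before = mse(out)

d_out = (out - y) * out * (1 - out)  # error gradient at the output layer
d_h = (d_out @ W2.T) * h * (1 - h)   # error gradient at the hidden layer

lr = 0.5  # learning rate (assumed; any small value works here)
W2 -= lr * h.T @ d_out
b2 -= lr * d_out.sum(axis=0)
W1 -= lr * X.T @ d_h
b1 -= lr * d_h.sum(axis=0)

loss_after = mse(forward(X)[1])
print(loss_after < loss_before)  # True: the step reduced the error
```

Repeating this step many times is, in essence, how the network "learns": each pass through the error feedback makes the weights slightly better at producing the target outputs.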


Two-thirds of the world’s population are now connected by mobile devices

This story was delivered to BI Intelligence Apps and Platforms Briefing subscribers.

Two-thirds of the world’s population are connected by mobile devices, according to data from GSMA.

This milestone of 5 billion unique mobile subscribers globally was achieved in Q2 2017. By 2020, almost 75% of the global population will be connected by mobile.

Here are the key takeaways from the report:

  • Smartphones will continue to drive new mobile subscriptions. By 2020, new smartphone users will account for 66% of new global connections, up from 53% in Q2 2017.
  • Developing markets will account for the largest share of new mobile subscription growth over the forecast period. Forty percent of new subscribers will stem from five markets: India, China, Nigeria, Indonesia, and Pakistan.
  • But mobile growth is slowing. It took around four years to reach 5 billion mobile users, compared with the three-and-a-half years it took to reach 4 billion. This suggests it’s going to take longer to reach 6 billion users, as the pool of new mobile users continues to shrink.

Affordability, content relevance, and digital literacy are likely bigger inhibitors to mobile internet adoption than a lack of network infrastructure is. Two-thirds of the 3.7 billion consumers who aren’t connected to the internet are within range of 3G or 4G networks. This suggests that device cost, a lack of relevant apps and content, and not knowing how to use the device are the primary barriers to mobile adoption.


The Inextricable Link Between Cloud Technology And The New, Untethered Workforce

Cloud computing has changed more than just how applications are bought, where they run and how data is stored.

It has changed the interaction between customers, code and business outcomes. More importantly for business information technology executives, it creates opportunities to lead initiatives well beyond the traditional IT stack — into areas ranging from e-learning to customer service. And instead of simply being a more efficient way of doing work, it’s giving IT leaders the ability to reshape work for the better.

No wonder a Harvard Business Review story hails the cloud as “the most impactful information technology of our time.”

Business leaders are turning to enterprise cloud technology because the nature of work itself is changing. It can no longer be defined as a single place in a fixed office, and a job description is more of a fuzzy guideline than an out-and-out rule. As a result, work is stifled when it’s bounded by a predictable, cookie-cutter stack of devices and software.

According to the 2018 Deloitte Global Human Capital Trends report, employees at 91 percent of organizations work outside their designated functional areas. Thirty-five percent do so regularly. The old model of static software installations on fixed computers doesn’t flex or scale to these demands.

To keep track of who’s doing what and how, companies are dramatically increasing reliance on cloud collaboration and social media interaction for work communication. In the Deloitte report, 70 percent of organizations say they will expand their use of online collaboration platforms, and 67 percent will make more use of work-based social media. To free up time, they’ll curtail phone calls (which 30 percent of businesses expect to decrease) and face-to-face meetings (which 44 percent of respondents project will decline).

The earlier waves of enterprise cloud tech focused on transforming a functional area or customer-facing process into a browser tab. That was valuable, but a screen full of browser tabs is little different from a desktop full of application icons. And the growth in these solutions created provisioning headaches, security challenges and regulatory risks. It was difficult to ensure consistent, centrally controlled permissions that gave employees the right amount of access at the right times, and to track the use of sensitive information in a consistent manner.


What Will We Do When The World’s Data Hits 163 Zettabytes In 2025?

The shock value of the recent prediction by research group IDC that the world will be creating 163 zettabytes of data a year by 2025 depends on three things.

Firstly, who knows what a zettabyte is (one trillion gigabytes)? Secondly, what is the current annual data creation rate (16.3ZB)? And thirdly, do these figures mean anything in a world where we take for granted that data will expand exponentially forever, and mostly accept the future reality of autonomous cars and intelligent personal assistants, yet have little real idea of how they will change our lives?

IDC’s paper, Data Age 2025, perhaps answers only the first two questions. Forecasting a ten-fold increase in worldwide data by 2025, it envisions an era focused on creating, utilizing, and managing “life critical” data necessary for the smooth running of daily life.
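For scale, a ten-fold increase implies a compound annual growth rate of roughly 29 percent, assuming the 16.3 ZB baseline refers to 2016, nine years before 2025 (an assumption; the article does not state the base year):

```python
# Ten-fold growth from 16.3 ZB to 163 ZB over an assumed nine-year span.
base_zb, target_zb, years = 16.3, 163.0, 9
cagr = (target_zb / base_zb) ** (1 / years) - 1
print(f"{cagr:.1%}")  # 29.2% per year
```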

Consumers will create, share and access data on a completely device-neutral basis. The cloud will grow well beyond previous expectations and corporate leaders will have unparalleled opportunities to leverage new business opportunities, it predicts. But firms will also need to make strategic choices on data collection, utilization and location.

I recently interviewed Jeff Fochtman, vice-president of marketing at Seagate Technology, the $14BN market capitalization data storage group that sponsored the IDC report.

Critical Problems

“For individuals and businesses, a data deluge can cause problems in just being able to manage, store and access that data,” he says.

“But the thing that jumps out at me is the critical problems that data is going to solve. On an increasingly populated planet, data is going to solve the things impacting on the human experience: traffic, how we move around, how we grow food and how we feed the population globally.


Forbes Insights: The Rise Of The Customer Data Platform And What It Means To Businesses

Treasure Data and Forbes Insights recently partnered to present a broad-ranging survey, Data Versus Goliath: Customer Data Strategies to Disrupt the Disruptors, that uncovers the attitudes and perceptions of today’s marketing leaders. This article, written by the Forbes Insights team, highlights some of the key takeaways from the survey and originally appeared on the Forbes website on June 20, 2018.

For years, marketing executives have sought an elusive 360-degree view of their customers. Meanwhile, the nature of customer data analytics and customer experience (CX) design is evolving dramatically within today’s organizations.


The good news is that much of the data needed to build this view is already being collected and stored by enterprises or their partner organizations. The bad news is that this data is typically maintained in separate systems, across organizational silos, and often cannot be surfaced at the time it’s needed to contribute to, or enhance, a specific customer experience, let alone inform a larger customer experience strategy.

For the most part, we are still in the early stages of customer data analytics, as indicated by a new survey of 400 marketing leaders, conducted by Forbes Insights and Treasure Data. According to the survey, it still takes marketers too much time to analyze and draw conclusions about the success of a marketing campaign or a change to the customer experience—47% say it takes more than a week, while another 47% say it takes three to five days.

And the tools and solutions to accelerate CX development still need to be put into place. A majority of executives, 52%, report that while they are leveraging a variety of tools and technologies in functions or lines of business, there is little coordination and there’s a lack of the right tools. Only 19% report having a robust set of analytics tools and technology services supporting customer-data-driven decisions and campaigns.

Yet there is an emerging approach to bringing customer data into one place: the customer data platform, or CDP. This new generation of systems is designed to bring all this disparate data about customers into a single intelligent environment and provide a synchronized, well-integrated view of the customer. These platforms are seeing widespread adoption across enterprises, as supported by the Forbes Insights/Treasure Data survey. Some 78% of organizations either have, or are developing, a customer data platform.

Understanding This New Type Of Platform

Customer data platforms are more broad-based than the traditional CRM systems that have been in place in many organizations for years. While CRM systems are designed to enable management and analysis of a particular customer channel, CDPs bring data from across corporate channels into a single platform. Although CRM and business intelligence solutions have provided some intelligence about customer trends, CDPs tie customer data directly to marketing and sales initiatives.