Understanding Cloud Data Services

Demystifying Service Offerings from Microsoft Azure, Amazon AWS, and Google Cloud Platform

In the past five years, a shift in Cloud Vendor offerings has fundamentally changed how companies buy, deploy and run big data systems. Cloud Vendors have absorbed more back-end data storage and transformation technologies into their core offerings and are now highlighting their data pipeline, analysis, and modeling tools. This is great news for companies deploying, migrating, or upgrading big data systems. Companies can now focus on generating value from data and Machine Learning (ML), rather than building teams to support hardware, infrastructure, and application deployment/monitoring.

The following chart shows how more and more of the cloud platform stack is becoming the responsibility of the Cloud Vendors (shown in blue). The new value for companies working with big data is the maturation of Cloud Vendor Function as a Service (FaaS), also known as serverless, and Software as a Service (SaaS) offerings. For FaaS (serverless) the Cloud Vendor manages the applications and users focus on data and functions/features. With SaaS, features and data management become the Cloud Vendor’s responsibility. Google Analytics, Workday, and Marketo are examples of SaaS offerings.
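To make the FaaS model concrete, here is a minimal sketch of a serverless data-transformation function written against the AWS Lambda Python handler convention; the event shape and field names are illustrative assumptions, not part of any specific vendor pipeline.

import json

def lambda_handler(event, context):
    # The Cloud Vendor provisions, scales, and patches the runtime; the
    # team only ships this function. The event fields below are hypothetical.
    record = json.loads(event["body"])
    record["amount_usd"] = record["amount_cents"] / 100
    return {"statusCode": 200, "body": json.dumps(record)}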

As the technology gets easier to deploy, and the Cloud Vendor data services mature, it becomes much easier to build data-centric applications and provide data and tools to the enterprise. This is good news: companies looking to migrate from on-premise systems to the cloud are no longer required to directly purchase or manage hardware, storage, networking, virtualization, applications, and databases. In addition, this changes the operational focus for big data systems from infrastructure and application management (DevOps) to pipeline optimization and data governance (DataOps). The following table shows the different roles required to build and run Cloud Vendor-based big data systems.

[READ MORE]

How to choose a visualization

Imagine that I give you the 8 numbers at left, and ask you to graph them in a display where you can flexibly uncover patterns. I use this example frequently in data visualization workshops, and the typical result is a deer-in-the-headlights look. And these are smart audiences — college undergraduates, Ph.D. students, MBAs, or business analysts. Most are overwhelmed with options: Bars, Lines, Pies, Oh My. If I instead show the data already in a visualization and ask them to replot it, the audience pivots from being overwhelmed with options, to being unable to imagine the data plotted in any other way.

Visualization quick reference guides (also known as ‘chart choosers’) are a great solution to these problems, abstracting over wonky theories to provide direct suggestions for how to represent data. These guides are typically organized by viewer tasks — does the designer want the viewer to see a ranking, examine a distribution, inspect a relationship, or make a comparison? These guides then use these tasks to categorize (or flowchart) viable alternative designs. Students and practitioners (heck, and researchers) appreciate the way that these tools help them break out of being overwhelmed with options, or fixated on a single possibility.

There are several great task-based chart choosers out there (here’s an example from the Financial Times), so why make a new one? Choosing a visualization based on task can be a helpful constraint when it’s time to communicate a known pattern to an audience. But it can be less useful at the analysis stage before that, where you have only vague notions of potentially important tasks. Early commitment to a visualization suited to a specific task might even cause you to fixate on one pattern, and miss another. And some tasks are vaguely defined. I find ‘See Relationship’ and ‘Make Comparison’ particularly fuzzy. Didn’t Tufte proclaim that everything is a comparison? For analysts, the best visualization format is typically the one that is flexibly useful across tasks, allowing general foraging through possible patterns.

But if not task, what’s another way to organize a chooser? When I decided to set up a new one, I liked the simple objectivity of picking the visualization according to the structure of the data being plotted (though I was recently delighted to be pointed to another chooser with a similar setup).

The small dataset below illustrates the typical types of quantitative data in any Excel sheet: categories, ordered categories, and continuous metrics. Once you decide which columns of the dataset to throw together, the chooser (in theory) tells you the best options. I'll walk through how it works below.

You have a pile of metrics (numbers); perhaps you'd like to bin them by discrete categories (typically, a bar graph), or maybe by two categories at the same time, as in a 2-dimensional table (I like Bar Tables for this). Or perhaps you want those metrics organized along a continuous axis (another metric), as when plotting values that change over time (typically, a line graph), and then maybe you'd like to show that binned by discrete categories (typically, a line graph with multiple lines on it). If, instead of absolute values, the metrics should be interpreted as percentages, that typically entails spatially smooshing the graph into pies or stacked bars.
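As a rough sketch of that logic, the hypothetical helper below maps the structure of the columns you plan to combine to a default chart type; the function name and rules are my own simplification of the chooser, not its actual implementation.

def suggest_chart(has_metric, n_categories=0, has_continuous_axis=False, as_percentage=False):
    # Pick a default chart from the structure of the data being plotted.
    if not has_metric:
        return "bar graph of counts (count the categories first)"
    if as_percentage:
        return "stacked bars or a pie"
    if has_continuous_axis:
        return "line graph with multiple lines" if n_categories else "line graph"
    if n_categories >= 2:
        return "bar table (bars arranged in a 2-dimensional table)"
    if n_categories == 1:
        return "bar graph"
    return "histogram or dot plot of the raw metric"

# Example: a metric binned by one category, shown as percentages
print(suggest_chart(has_metric=True, n_categories=1, as_percentage=True))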

[READ MORE]

How Data Masking Is Driving Power to Organizations

In this digital age, the threats against an organization's data are massive and the consequences of a breach can be devastating for a business. It has therefore become important to consider a range of factors when securing databases.

Our world is at huge risk of data theft, as the Facebook data breach news a few months ago proved. One interesting fact is that almost 80% of the confidential information in a business organization resides in non-production environments used by testers, developers, and other professionals. Even a single, small breach can damage a company's reputation in the market.

Giant tech companies like Facebook have become the poster children of data misuse. Small organizations can usually tackle such problems, but organizations that hold the keys to huge amounts of data carry a correspondingly huge risk.

It is no surprise that safeguarding confidential business and customer information has become more important than ever. Companies tend to focus on product development while neglecting privacy issues. The well-known DevSecOps movement is helping to raise issues like sensitive data security, but beyond that, it is very important to ensure that data security and privacy always stay top of mind.

How Is Data Masking Related to the Security of a Business Organization?

A business organization must implement security controls through its normal software testing tools so that only authorized individuals can access particular data. Below are some effective data masking strategies for business organizations. By following these techniques, organizations can make their “data securing strategy” more practical and effective:

  • Try to maintain integrity: Data must be masked consistently, even when it is derived from multiple uniform sources, so that relationships between values are preserved after the data is transformed (a minimal sketch of one way to do this follows this list).
  • Masked data must be delivered quickly: Although the data is constantly changing and refreshing non-production environments takes time, companies need to continuously mask and deliver the masked data.
  • An end-to-end approach is necessary: Simply applying masking techniques is not enough. A company should also take an end-to-end approach, identifying its sensitive data and who can access it.
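As one way to satisfy the integrity point above, the sketch below uses deterministic masking (an HMAC of the raw value under a secret key) so the same input always produces the same token across tables; the key handling and token format are illustrative assumptions, not a specific product's approach.

import hashlib
import hmac

SECRET_KEY = b"store-me-in-a-vault"  # hypothetical key, never hard-code in practice

def mask_value(value: str) -> str:
    # The same input always maps to the same token, so joins between
    # tables (e.g. on customer e-mail) still line up after masking.
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return "cust_" + digest.hexdigest()[:12]

# The same e-mail in two source systems masks to the same token,
# preserving referential integrity in the non-production copy.
assert mask_value("alice@example.com") == mask_value("alice@example.com")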

Data Masking and Security

A few reasons why enterprise businesses should use data masking:

  1. To protect sensitive data from third-party vendors: Sharing information with third parties is often mandatory, but certain data must be kept confidential.
  2. Operator error: Big organizations trust their insiders to make good decisions, but data theft is often the result of operator error, and businesses can safeguard themselves against it with data masking.

IBM's India lab has brought out a new solution to protect against theft of sensitive data from call centers. It has developed a technology named AudioZapper for the global marketplace that addresses the security concerns of a call center. The IBM solution offers the required protection to keep confidential data such as credit card numbers, PINs, and social security numbers from getting into the wrong hands.

[READ MORE]

5 Useful Statistics Data Scientists Need to Know

Data Science can be practically defined as the process by which we get extra information from data. When doing Data Science, what we're really trying to do is explain what all of the data actually means in the real world, beyond the numbers.

To extract the information embedded in complex datasets, Data Scientists employ a number of tools and techniques including data exploration, visualisation, and modelling. One very important class of mathematical technique often used in data exploration is statistics.

In a practical sense, statistics allows us to define concrete mathematical summaries of our data. Rather than trying to describe every single data point, we can use statistics to describe some of its properties. And that’s often enough for us to extract some kind of information about the structure and make-up of the data.

Sometimes, when people hear the word “statistics” they think of something overly complicated. Yes, it can get a bit abstract, but we don’t always need to resort to the complex theories to get some kind of value out of statistical techniques.

The most basic parts of statistics can often be of the most practical use in Data Science.

Today, we’re going to look at 5 useful Statistics for Data Science. These won’t be crazy abstract concepts but rather simple, applicable techniques that go a long way.

Let’s get started!

(1) Central Tendency

The central tendency of a dataset or feature variable is the center or typical value of the set. The idea is that there may be one single value that can best describe (to an extent) our dataset.

For example, imagine if you had a normal distribution centered at the x-y position of (100, 100). Then the point (100, 100) is the central tendency since, out of all the points to choose from, it is the one that provides the best summary of the data.

For Data Science, we can use central tendency measures to get a quick and simple idea of how our dataset looks as a whole. The “center” of our data can be a very valuable piece of information, telling us how exactly the dataset is biased, since whichever value the data revolves around is essentially a bias.

There are 2 common ways of mathematically selecting a central tendency.

Mean

The Mean value of a dataset is the average value, i.e. a number around which the whole dataset is spread out. All values used in calculating the average are weighted equally when defining the Mean.

For example, let’s calculate the Mean of the following 5 numbers:

(3 + 64 + 187 + 12 + 52) / 5 = 63.6

The mean is great for computing the actual mathematical average. It’s also very fast to compute with Python libraries like NumPy.

Median

The Median is the middle value of the dataset, i.e. if we sort the data from smallest to biggest (or biggest to smallest) and then take the value in the middle of the set, that’s the Median.

Let’s again compute the Median for that same set of 5 numbers:

[3, 12, 52, 64, 187] → 52

The Median value is quite different from the Mean value of 63.6. Neither of them are right or wrong, but we can pick one based on our situation and goals.

Computing the Median requires sorting the data, which can become slow if your dataset is very large.

On the other hand, the Median is more robust to outliers than the Mean, since the Mean gets pulled one way or the other when there are some very high magnitude outlier values.

The mean and median can be calculated with simple numpy one-liners:

import numpy as np
np.mean(array)    # arithmetic average of the values
np.median(array)  # middle value of the sorted data

(2) Spread

Under the umbrella of Statistics, the spread of the data is the extent to which it is squeezed towards a single value or more spread out across a wider range.

Take a look at the plots of the Gaussian probability distributions below — imagine that these are probability distributions describing a real-world dataset.

The blue curve has the smallest spread value since most of its data points all fall within a fairly narrow range. The red curve has the largest spread value since most of the data points take up a much wider range.

The legend shows the standard deviation values of these curves, explained in the next section.
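As a quick illustration (my own example, not taken from the original plots), the snippet below draws two Gaussian samples with different spreads and compares their standard deviations with NumPy.

import numpy as np

rng = np.random.default_rng(0)
narrow = rng.normal(loc=0, scale=1, size=10_000)  # squeezed towards the center
wide = rng.normal(loc=0, scale=5, size=10_000)    # spread over a wider range

print(np.std(narrow))  # close to 1
print(np.std(wide))    # close to 5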

[READ MORE]

How To Find And Resolve Blind Spots In Your Data

There’s a growing number of tools that you can use to analyze data for a business. But you may not be overly confident in the results if you don’t take the data’s blind spots into account. There’s no single way to do that, but we’ll look at some possibilities here.

The first thing to keep in mind is that a blind spot generally represents an “unknown unknown.” In other words, it’s a factor you didn’t take into account because you didn’t think, or know, to consider it.

  1. Start Locating Your Dark Data

When many business analysts talk about blind spots in data, dark data comes into the conversation. Dark data is also called “unclassified data,” and it’s information your business has but does not use for analytical purposes or any other reason related to running the business.

If you don’t have any idea how much dark data your company has, what kind of information it entails, and where your company stores it, that unawareness could cause blind spots.

More specifically, having an excessive amount of dark data could mean you spend more time searching for data than analyzing it. Or, dark data could open your company to regulatory risks if you cannot retrieve requested information during an audit.

Similarly, some dark data contains sensitive information that hackers might try to get. If they’re successful, you may not know a data breach took place until months later — if at all.

Fortunately, there are specialized software options that can discover the data your company has — dark or otherwise — and clean it so that you can eventually use the data to meet your business analysis goals.

Instead of being overly concerned about the business investment required for that software, think of the risks to your company if you continue to ignore your unclassified data and the blind spots it causes.

  2. Pay Attention to Data Stored on Mobiles and in the Public Cloud

It’s increasingly common for people to use smartphones and tablets during their workdays. Some of them do it especially frequently if they take part in fieldwork or visit clients at their homes. Vanson Bourne conducted a study for Veritas to find out more about dark data at the company level and ended up looking at mobile data, among other things.

The study results revealed several fascinating conclusions. First, it showed that, on average, 52% of data within organizations is unclassified and untagged. Veritas asserted that this issue constitutes a security risk because it leaves potentially business-critical information up for grabs by hackers.

[READ MORE]

How to Make a Success Story of your Data Science Team

Data science resounds throughout every industry and has reached the mainstream media. I no longer have to explain what I do for a living as long as I call it AI — we are at the peak of data science hype!

As a consequence, more and more companies are looking towards data science with big expectations, ready to invest into a team of their own. Unfortunately, the realities of data science in the enterprise are far from a success story.

NewVantage published a survey in January 2019 which found that 77% of businesses report challenges with business adoption. This translates into roughly three quarters of all data projects collecting dust rather than providing a return on the investment. Gartner has always been very critical of data science success, and it hasn’t gotten more cheerful as of late: according to Gartner (January 2019), most analytics insights will not deliver business outcomes through 2022, so what hope is there for data science? It’s apparent that, for some reason, making data science a success is really hard!

Me scaring Execs about their data science investments at the Data Leadership Summit, London 2019.

Regardless of whether you manage an existing data science team or are about to start a new greenfield project in big data or AI, it’s important to acknowledge the inevitable: the Hype Cycle.

Luc Galoppin on Flickr

The increasing visibility of data science and AI comes hand in hand with a peak of inflated expectations. In combination with the current success rate of such projects and teams we are headed straight for the cliff edge towards the trough of disillusionment.

Christopher Conroy summarised it perfectly in a recent interview for Information Age: the renewed hype around AI simply gives a false impression of progress from where businesses were years ago with big data and data science. Did we just find an even higher cliff edge?

Thankfully, it’s not all bad news. Some teams, projects and businesses are indeed successful (around 30% according to the surveys). We simply need a new focus on the requirements for success.

[READ MORE]

The Data Fabric for Machine Learning – Part 2: Building a Knowledge-Graph

I’ve been talking about the data fabric in general, and giving some concepts of Machine Learning and Deep Learning in the data fabric. I also gave my definition of the data fabric:

The Data Fabric is the platform that supports all the data in the company. How it’s managed, described, combined and universally accessed. This platform is formed from an Enterprise Knowledge Graph to create a uniform and unified data environment.

If you take a look at the definition, it says that the data fabric is formed from an Enterprise Knowledge Graph. So we better know how to create and manage it.

Objectives

General

Set up the basis of knowledge-graph theory and construction.

Specifics

Explain the concepts of knowledge-graphs related to enterprises.
Give some recommendations about building a successful enterprise knowledge-graph.
Show examples of knowledge-graphs. 

Main theory

The fabric in the data fabric is built from a knowledge-graph. To create a knowledge-graph you need semantics and ontologies to find a useful way of linking your data that uniquely identifies and connects data with common business terms.

Section 1. What is a Knowledge-Graph?

The knowledge graph consists of integrated collections of data and information that also contain huge numbers of links between different data items.

The key here is that instead of looking for possible answers, under this new model we’re seeking an answer. We want the facts — where those facts come from is less important. The data here can represent concepts, objects, things, people and actually whatever you have in mind. The graph fills in the relationships, the connections between the concepts.

In this context we can ask this question to our data lake:

What exists here?

We are in a different landscape here, one where it’s possible to set up a framework to study data and its relation to other data. In a knowledge-graph, information represented in a particular formal ontology can be more easily accessible to automated information processing, and how best to do this is an active area of research in computer science fields such as data science.

All data modeling statements (along with everything else) in ontological languages and the world of knowledge-graphs for data are incremental, by their very nature. Enhancing or modifying a data model after the fact can be easily accomplished by modifying the concept.

With a knowledge-graph what we are building is a human-readable representation of data that uniquely identifies and connects data with common business terms. This “layer” helps end users access data autonomously, securely and confidently.
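To make this concrete, here is a minimal sketch of a tiny knowledge-graph built with the rdflib Python library (my choice of tool, not one named in the article); the namespace, entities, and relationships are made-up business terms.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/company#")  # hypothetical business vocabulary
g = Graph()
g.bind("ex", EX)

# Facts ("triples") that link data to common business terms
g.add((EX.Customer, RDF.type, RDFS.Class))
g.add((EX.alice, RDF.type, EX.Customer))
g.add((EX.alice, EX.purchased, EX.product42))
g.add((EX.product42, RDFS.label, Literal("Laptop")))

# Ask the graph "What exists here?" by walking every relationship
for subject, predicate, obj in g:
    print(subject, predicate, obj)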

[READ MORE]

The Data Fabric for Machine Learning. Part 1-b: Deep Learning on Graphs.

Introduction

We are in the process of defining a new way of doing machine learning, focusing on a new paradigm, the data fabric.

In the past article I gave my new definition of machine learning:

Machine learning is the automatic process of discovering hidden insights in the data fabric by using algorithms that are able to find those insights without being specifically programmed for that, to create models that solve a particular (or multiple) problem(s).

The premise for understanding this is that we have created a data fabric. For me, the best tool out there for doing that is Anzo, as I mentioned in other articles.
https://www.cambridgesemantics.com/

You can build something called “The Enterprise Knowledge Graph” with Anzo, and of course create your data fabric.

But now I want to focus on a topic inside machine learning, deep learning. In another article I gave a definition of deep learning:

Deep learning is a specific subfield of machine learning, a new take on learning representations from data which puts an emphasis on learning successive “layers” [neural nets] of increasingly meaningful representations.

Here we’ll talk about a combination of deep learning and graph theory, and see how it can help move our research forward.

Objectives

General

Set the basis of doing deep learning on the data fabric.

Specifics

Describe the basics of deep learning on graphs.
Explore the library Spektral.
Validate the possibility of doing deep learning on the data fabric.

Main Hypothesis

If we can construct a data fabric that supports all the data in the company, the automatic process of discovering insights through learning increasingly meaningful representations from data using neural nets (deep learning) can run inside the data fabric.
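As a toy illustration of the idea (plain NumPy rather than Spektral, and not the article's own code), the sketch below runs one graph-convolution step in the style of Kipf and Welling: aggregate each node's neighbourhood over a normalized adjacency matrix, then apply a learned transformation.

import numpy as np

# Toy graph: 4 nodes with adjacency matrix A and 3 features per node in X
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.random.rand(4, 3)

# Add self-loops and symmetrically normalize the adjacency matrix
A_hat = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

W = np.random.rand(3, 2)           # learnable weights (3 -> 2 dimensions)
H = np.maximum(A_norm @ X @ W, 0)  # ReLU over the aggregated, transformed features
print(H.shape)                     # (4, 2): new representation for every node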

[READ MORE]

The Data Fabric for Machine Learning. Part 1.

Introduction

If you search for machine learning online you’ll find around 2,050,000,000 results. Yeah, for real. It’s not easy to find a description or definition that fits every use case, but there are some amazing ones out there. Here I’ll propose a different definition of machine learning, focusing on a new paradigm: the data fabric.

Objectives

General

Explain the data fabric connection with machine learning.

Specifics

Give a description of the data fabric and ecosystems to create it.
Explain in a few words what machine learning is.
Propose a way of visualizing machine learning insights inside of the data fabric.

Main theory

If we can construct a data fabric that supports all the data in the company, then a business insight inside of it can be thought of as a dent in that fabric. The automatic process of discovering that insight is what we call machine learning.

Section 1. What is the Data Fabric?

I’ve talked before about the data fabric, and I gave a definition of it (I’ll put it here again below).

There are several words we should mention when we talk about the data fabric: graphs, knowledge-graph, ontology, semantics, linked-data. Read the article above if you want those definitions; with those in place, we can say that:

The Data Fabric is the platform that supports all the data in the company. How it’s managed, described, combined and universally accessed. This platform is formed from an Enterprise Knowledge Graph to create a uniform and unified data environment.

Let’s break that definition into parts. The first thing we need is a knowledge graph.

The knowledge graph consists of integrated collections of data and information that also contain huge numbers of links between different data items. The key here is that instead of looking for possible answers, under this new model we’re seeking an answer. We want the facts — where those facts come from is less important. The data here can represent concepts, objects, things, people and actually whatever you have in mind. The graph fills in the relationships, the connections between the concepts.

Knowledge graphs also allow you to create structures for the relationships in the graph. With it, it’s possible to set up a framework to study data and its relation to other data (remember ontology?).

In this context we can ask this question to our data lake:

What exists here?

The concept of the data lake is important too, because we need a place to store our data, govern it, and run our jobs. But we need a smart data lake, a place that understands what we have and how to use it; that’s one of the benefits of having a data fabric.

The data fabric should be uniform and unified, meaning that we should make an effort to organize all the data in the organization in one place and really manage and govern it.

[READ MORE]

Developing a Functional Data Governance Framework


Data Governance practices need to mature. Data Governance Trends in 2019 reports that dissatisfaction with the quality of business data continues in 2019, despite a growing understanding of Data Governance’s value.

Harvard Business Review reports 92 percent of executives say their Big Data and AI investments are accelerating, and 88 percent talk about a greater urgency to invest in Big Data and AI. In order for AI and machine learning to be successful, Data Governance must also be a success. Data Governance remains elusive to the 87 percent of businesses which, according to Gartner, have lower levels of Business Intelligence.

Recent news has also suggested a need to improve Data Governance processes. Data breaches continue to affect customers and the impacts are quite broad, as an organization’s customers (including banks, universities, and pharmaceutical companies) must continually take stock and change their user names and passwords. Effective Data Governance is a fundamental component of data security processes.

Data Governance has to drive improvements in business outcomes. “Implementing Data Governance poorly, with little connection or impact on business operations will just waste resources,” says Anthony Algmin, Principal at Algmin Data Leadership.

To mature, Data Governance needs to be business-led and a continuous process, as Donna Burbank and Nigel Turner emphasize. They recommend, as a first step, creating a Data Strategy, bringing together organization and people, processes and workflows, Data Management and measures, and culture and communication. Then creating and choosing a Data Governance Framework. Most importantly, periodically testing that Data Governance Framework.

To truly be confident in Data Governance structures, organizations need to do the critical testing before a breach or some other unexpected event occurs. It is this notion—implementing some testing—that is missing in much current Data Governance literature. Thinking like a software tester provides an alternative way of learning good Data Governance fundamentals.
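As a toy example of “thinking like a software tester” (the file path, column names, and rules below are hypothetical, not from the article), a governance requirement can be written down as an automated check that runs before any breach forces the question.

import pandas as pd

# Hypothetical requirement: every record in the analytics copy must have an
# assigned data steward, and no raw SSN column may appear there.
customers = pd.read_csv("analytics_copy/customers.csv")

def test_every_record_has_a_steward(df):
    assert df["data_steward"].notna().all(), "records without an assigned steward"

def test_no_raw_ssn_column(df):
    assert "ssn" not in df.columns, "raw SSN column leaked into the analytics copy"

test_every_record_has_a_steward(customers)
test_no_raw_ssn_column(customers)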

Before Testing, Define Data Governance Requirements

Prior to offering feedback on any software developed, a great tester will ask for the product’s requirements to know what is expected and to clarify important ambiguities. Likewise, how does an organization know it has good Data Governance without understanding the agreed-upon specifications and its ultimate end? First, it helps to define what Data Governance is supposed to do. DATAVERSITY® defines Data Governance as:

“A collection of practices and processes which help to ensure the formal management of data assets within an organization. Data Governance often includes other concepts such as Data Stewardship, Data Quality, and others to help an enterprise gain better control over its data assets.”

How Data Governance is implemented depends on business demands specifically leading to a Data Governance solution in the first place. This means breaking down the data vision and strategy into sub-goals and their components, such as a series of use cases. Nigel Turner and Donna Burbank give the following use case examples:

[READ MORE]