Understanding Cloud Data Services

Demystifying Service Offerings from Microsoft Azure, Amazon AWS, and Google Cloud Platform

In the past five years, a shift in Cloud Vendor offerings has fundamentally changed how companies buy, deploy and run big data systems. Cloud Vendors have absorbed more back-end data storage and transformation technologies into their core offerings and are now highlighting their data pipeline, analysis, and modeling tools. This is great news for companies deploying, migrating, or upgrading big data systems. Companies can now focus on generating value from data and Machine Learning (ML), rather than building teams to support hardware, infrastructure, and application deployment/monitoring.

The following chart shows how more and more of the cloud platform stack is becoming the responsibility of the Cloud Vendors (shown in blue). The new value for companies working with big data is the maturation of Cloud Vendor Function as a Service (FaaS), also known as serverless, and Software as a Service (SaaS) offerings. With FaaS (serverless), the Cloud Vendor manages the applications, and users focus on data and functions/features. With SaaS, features and data management become the Cloud Vendor’s responsibility. Google Analytics, Workday, and Marketo are examples of SaaS offerings.

As the technology gets easier to deploy, and the Cloud Vendor data services mature, it becomes much easier to build data-centric applications and provide data and tools to the enterprise. This is good news: companies looking to migrate from on-premises systems to the cloud are no longer required to directly purchase or manage hardware, storage, networking, virtualization, applications, and databases. In addition, this shifts the operational focus for big data systems from infrastructure and application management (DevOps) to pipeline optimization and data governance (DataOps). The following table shows the different roles required to build and run Cloud Vendor-based big data systems.

[READ MORE]

How to choose a visualization

Imagine that I give you the 8 numbers at left, and ask you to graph them in a display where you can flexibly uncover patterns. I use this example frequently in data visualization workshops, and the typical result is a deer-in-the-headlights look. And these are smart audiences — college undergraduates, Ph.D. students, MBAs, or business analysts. Most are overwhelmed with options: Bars, Lines, Pies, Oh My. If I instead show the data already in a visualization and ask them to replot it, the audience pivots from being overwhelmed with options, to being unable to imagine the data plotted in any other way.

Visualization quick reference guides (also known as ‘chart choosers’) are a great solution to these problems, abstracting over wonky theories to provide direct suggestions for how to represent data. These guides are typically organized by viewer tasks — does the designer want the viewer to see a ranking, examine a distribution, inspect a relationship, or make a comparison? These guides then use these tasks to categorize (or flowchart) viable alternative designs. Students and practitioners (heck, and researchers) appreciate the way that these tools help them break out of being overwhelmed with options, or fixated on a single possibility.

There are several great task-based chart choosers out there (here’s an example from the Financial Times), so why make a new one? Choosing a visualization based on task can be a helpful constraint when it’s time to communicate a known pattern to an audience. But it can be less useful at the analysis stage before that, where you have only vague notions of potentially important tasks. Early commitment to a visualization suited to a specific task might even cause you to fixate on one pattern, and miss another. And some tasks are vaguely defined. I find ‘See Relationship’ and ‘Make Comparison’ particularly fuzzy. Didn’t Tufte proclaim that everything is a comparison? For analysts, the best visualization format is typically the one that is flexibly useful across tasks, allowing general foraging through possible patterns.

But if not task, what’s another way to organize a chooser? When I decided to set up a new one, I liked the simple objectivity of picking the visualization according to the structure of the data being plotted (though I was recently delighted to be pointed to another chooser with a similar setup).

The small dataset below illustrates the typical types of quantitative data in any Excel sheet: categories, ordered categories, and continuous metrics. Once you decide which columns of the dataset to throw together, the chooser (in theory) tells you the best options. I’ll walk through how it works below.

You have a pile of metrics (numbers), perhaps you’d like to bin them by discrete categories (typically, a bar graph), or maybe two categories at the same time, as in a 2-dimensional table (I like Bar Tables for this). Or perhaps you want those metrics organized along a continuous axis (another metric) as when plotting values that change over time (typically, a line graph), and then maybe you’d like to show that binned by discrete categories (typically, a line graph with multiple lines on it). If, instead of absolute values, the metrics should be interpreted as percentages, that typically entails spatially smooshing the graph into pies or stacked bars.
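
To make this concrete, here is a minimal sketch (assuming pandas and matplotlib are available) of how those data-structure choices map onto plotting calls; the column names and values are made up for illustration.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "region": ["North", "South", "East", "West"],  # discrete category
    "month": [1, 2, 3, 4],                         # continuous (ordered) axis
    "sales": [120, 95, 143, 110],                  # metric
})

# Metric binned by a discrete category -> bar graph
df.plot.bar(x="region", y="sales")

# Metric organized along a continuous axis -> line graph
df.plot.line(x="month", y="sales")

plt.show()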

[READ MORE]

How Data Masking Is Empowering Organizations

In this digital age, the threats to an organization’s data are massive, and the consequences of a breach can be devastating for a business. Securing databases has therefore become a critical consideration.

Our world is at huge risk of data theft, as the Facebook data breach a few months ago demonstrated. One interesting fact is that almost 80% of a business organization’s confidential information resides in non-production environments used by testers, developers, and other professionals. A single small breach can damage a company’s reputation in the market.

Giant tech companies like Facebook have become the poster children of data misuse. Small organizations can usually contain such problems, but organizations that hold the keys to huge amounts of data face far greater risk.

It is no surprise that safeguarding confidential business and customer information has become more important than ever. Companies often focus on product development and neglect privacy issues. The well-known DevSecOps movement is helping to raise awareness of issues like sensitive data security, but beyond that, it is very important to ensure that data security and privacy always stay top of mind.

How Is Data Masking Related to the Security of a Business Organization?

A business organization must implement security controls through its normal software testing tools so that only authorized individuals can access particular data. Below are some effective data masking strategies for business organizations. By following these techniques, organizations can make their “data securing strategy” more practical and effective:

  • Maintain integrity: Data must be masked consistently, even when it is derived from multiple sources, so that relationships between values are preserved after the data is transformed (see the sketch after this list).
  • Deliver masked data quickly: Because the data is constantly changing and building non-production environments takes time, companies need to continuously mask and deliver the masked data.
  • Take an end-to-end approach: Simply applying masking techniques is not enough. Companies should also take an end-to-end approach to identifying sensitive data and who can access it.
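
As a concrete illustration of the integrity point above, here is a minimal sketch (not a production technique; the function, salt, and field names are invented for illustration) of deterministic masking, where the same input always produces the same masked value so relationships across tables survive the transformation.

import hashlib

def mask_value(value: str, salt: str = "fixed-project-salt") -> str:
    # Deterministic: the same input and salt always yield the same token,
    # so joins on the masked column still line up across tables.
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "user_" + digest[:10]

customers = [{"email": "alice@example.com"}, {"email": "bob@example.com"}]
orders = [{"email": "alice@example.com", "total": 42}]

for row in customers:
    row["email"] = mask_value(row["email"])
for row in orders:
    row["email"] = mask_value(row["email"])

# alice@example.com masks to the same token in both lists,
# so the customer-order relationship is preserved.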

Data Masking and Security

A few reasons that enterprise businesses should use data masking:

  1. To protect sensitive data from third-party vendors: Sharing information with third parties is often mandatory, but certain data must be kept confidential.
  2. Operator error: Big organizations trust their insiders to make good decisions, but data theft is often the result of operator error, and businesses can safeguard themselves with data masking.

IBM’s India lab has brought out a new solution to protect against theft of sensitive data from call centers. The technology, named AudioZapper, addresses the security concerns of a call center in the global marketplace. It protects confidential data such as credit card numbers, PINs, and social security numbers from getting into the wrong hands.

[READ MORE]

5 Useful Statistics Data Scientists Need to Know

Data Science can be practically defined as the process by which we extract information from data. When doing Data Science, what we’re really trying to do is explain what all of the data actually means in the real world, beyond the numbers.

To extract the information embedded in complex datasets, Data Scientists employ a number of tools and techniques including data exploration, visualisation, and modelling. One very important class of mathematical technique often used in data exploration is statistics.

In a practical sense, statistics allows us to define concrete mathematical summaries of our data. Rather than trying to describe every single data point, we can use statistics to describe some of its properties. And that’s often enough for us to extract some kind of information about the structure and make-up of the data.

Sometimes, when people hear the word “statistics” they think of something overly complicated. Yes, it can get a bit abstract, but we don’t always need to resort to the complex theories to get some kind of value out of statistical techniques.

The most basic parts of statistics can often be of the most practical use in Data Science.

Today, we’re going to look at 5 useful Statistics for Data Science. These won’t be crazy abstract concepts but rather simple, applicable techniques that go a long way.

Let’s get started!
(1) Central Tendency

The central tendency of a dataset or feature variable is the center or typical value of the set. The idea is that there may be one single value that can best describe (to an extent) our dataset.

For example, imagine if you had a normal distribution centered at the x-y position of (100, 100). Then the point (100, 100) is the central tendency since, out of all the points to choose from, it is the one that provides the best summary of the data.
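
As a quick sketch of that example (assuming NumPy; the sample size and scale are arbitrary), we can draw points from a 2-D normal distribution centered at (100, 100) and check that the sample mean recovers that center:

import numpy as np

rng = np.random.default_rng(0)
# 10,000 points from a normal distribution centered at (100, 100)
points = rng.normal(loc=(100, 100), scale=15, size=(10_000, 2))
print(points.mean(axis=0))  # approximately [100., 100.]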

For Data Science, we can use central tendency measures to get a quick and simple idea of how our dataset looks as a whole. The “center” of our data can be a very valuable piece of information, telling us exactly how the dataset is biased, since whichever value the data revolves around is essentially a bias.

There are 2 common ways of mathematically selecting a central tendency.

Mean

The Mean value of a dataset is the average value, i.e. a number around which the whole dataset is spread out. All values are weighted equally when calculating the Mean.

For example, let’s calculate the Mean of the following 5 numbers:

(3 + 64 + 187 + 12 + 52) / 5 = 63.6

The Mean is great for computing the actual mathematical average. It’s also very fast to compute with Python libraries like NumPy.

Median

The Median is the middle value of the dataset, i.e. if we sort the data from smallest to biggest (or biggest to smallest) and take the value in the middle of the set, that’s the Median.

Let’s again compute the Median for that same set of 5 numbers:

[3, 12, 52, 64, 187] → 52

The Median value is quite different from the Mean value of 63.6. Neither of them is right or wrong; we pick one based on our situation and goals.

Computing the Median requires sorting the data, which can be slow for very large datasets.

On the other hand, the Median is more robust to outliers than the Mean, since the Mean gets pulled one way or the other by very high-magnitude outlier values.

The Mean and Median can be calculated with simple NumPy one-liners:

import numpy as np  # assuming `array` is a Python list or NumPy array

np.mean(array)    # arithmetic mean
np.median(array)  # middle value of the sorted data
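
To make the robustness point concrete, here is a small illustration (assuming NumPy) using the five numbers from earlier: adding one extreme outlier drags the Mean far more than the Median.

import numpy as np

data = np.array([3, 12, 52, 64, 187])
with_outlier = np.append(data, 10_000)

print(np.mean(data), np.median(data))                  # 63.6 52.0
print(np.mean(with_outlier), np.median(with_outlier))  # ~1719.7 58.0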

(2) Spread

In statistics, the spread of the data is the extent to which it is squeezed towards a single value or stretched across a wider range.

Take a look at the plots of the Gaussian probability distributions below — imagine that these are probability distributions describing a real-world dataset.

The blue curve has the smallest spread value since most of its data points fall within a fairly narrow range. The red curve has the largest spread value since its data points take up a much wider range.

The legend shows the standard deviation values of these curves, explained in the next section.
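
As a minimal sketch (assuming NumPy; the scales are made up to mimic the narrow and wide curves), standard deviation captures exactly this difference in spread:

import numpy as np

rng = np.random.default_rng(0)
narrow = rng.normal(loc=0, scale=1, size=10_000)  # like the tightly packed blue curve
wide = rng.normal(loc=0, scale=5, size=10_000)    # like the widely spread red curve

print(np.std(narrow))  # roughly 1
print(np.std(wide))    # roughly 5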

[READ MORE]