Data Science can be practically defined as the process by which we extract information from data. When doing Data Science, what we’re really trying to do is explain what all of the data actually means in the real world, beyond the numbers.
To extract the information embedded in complex datasets, Data Scientists employ a number of tools and techniques including data exploration, visualisation, and modelling. One very important class of mathematical technique often used in data exploration is statistics.
In a practical sense, statistics allows us to define concrete mathematical summaries of our data. Rather than trying to describe every single data point, we can use statistics to describe some of its properties. And that’s often enough for us to extract some kind of information about the structure and make-up of the data.
Sometimes, when people hear the word “statistics” they think of something overly complicated. Yes, it can get a bit abstract, but we don’t always need to resort to complex theory to get value out of statistical techniques.
The most basic parts of statistics can often be of the most practical use in Data Science.
Today, we’re going to look at 5 useful Statistics for Data Science. These won’t be crazy abstract concepts but rather simple, applicable techniques that go a long way.
Let’s get started!
(1) Central Tendency
The central tendency of a dataset or feature variable is the center or typical value of the set. The idea is that there may be one single value that can best describe (to an extent) our dataset.
For example, imagine if you had a normal distribution centered at the x-y position of (100, 100). Then the point (100, 100) is the central tendency since, out of all the points to choose from, it is the one that provides the best summary of the data.
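As a rough sketch of that example, we can draw points from a normal distribution centered at (100, 100) and check that their mean lands near the center. The sample size, the standard deviation of 15, and the seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# 10,000 hypothetical (x, y) points, normally distributed around (100, 100).
# The standard deviation of 15 is an arbitrary choice for illustration.
points = rng.normal(loc=[100.0, 100.0], scale=15.0, size=(10_000, 2))

center = points.mean(axis=0)  # per-column mean: one value for x, one for y
print(center)  # both coordinates land close to 100
```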
For Data Science, we can use central tendency measures to get a quick and simple idea of how our dataset looks as a whole. The “center” of our data can be a very valuable piece of information: it tells us where the dataset is concentrated, and whichever value the data revolves around is essentially a bias.
There are 2 common ways of mathematically selecting a central tendency.
The Mean value of a dataset is the average value, i.e., a number around which the whole dataset is spread out. All values used in calculating the average are weighted equally when defining the Mean.
For example, let’s calculate the Mean of the following 5 numbers:
(3 + 64 + 187 + 12 + 52) / 5 = 63.6
The Mean is the right choice when you want the actual arithmetic average. It’s also very fast to compute with Python libraries like NumPy.
The Median is the middle value of the dataset, i.e., if we sort the data from smallest to biggest (or biggest to smallest) and then take the value in the middle of the set, that’s the Median. With an even number of values, the usual convention is to average the two middle ones.
Let’s again compute the Median for that same set of 5 numbers:
[3, 12, 52, 64, 187] → 52
The Median value is quite different from the Mean value of 63.6. Neither of them is right or wrong, but we can pick one based on our situation and goals.
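Computing the Median in the straightforward way requires sorting the data, which costs more than the single pass needed for the Mean. On very large datasets that extra cost can become noticeable, although selection algorithms can find the middle value without a full sort.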
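A full sort is not strictly required to find the middle value. The sketch below compares a full sort with NumPy's `np.partition`, which only guarantees that the requested position holds the value it would have in sorted order. It assumes an odd-length array, as in the five-number example, and reuses those numbers.

```python
import numpy as np

data = np.array([3, 64, 187, 12, 52])
mid = len(data) // 2  # index of the middle element for an odd-length array

# Full sort, then take the middle element.
median_by_sort = np.sort(data)[mid]

# Partial ordering: np.partition only guarantees that position `mid`
# holds the value it would hold in fully sorted order.
median_by_partition = np.partition(data, mid)[mid]

print(median_by_sort, median_by_partition)  # 52 52
```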
On the other hand, the Median is more robust to outliers than the Mean, since the Mean gets pulled one way or the other by outlier values of very high magnitude.
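To see that robustness in action, here is a small sketch that appends one extreme value to the five numbers used above. The outlier of 10,000 is a made-up value for illustration and does not come from any real dataset.

```python
import numpy as np

data = np.array([3, 64, 187, 12, 52])
with_outlier = np.append(data, 10_000)  # hypothetical outlier, for illustration only

mean_before, median_before = np.mean(data), np.median(data)
mean_after, median_after = np.mean(with_outlier), np.median(with_outlier)

# The Mean jumps from 63.6 to roughly 1719.67.
print(mean_before, mean_after)
# The Median only moves from 52.0 to 58.0 (the average of the two middle
# values, 52 and 64, now that there is an even number of points).
print(median_before, median_after)
```

A single extreme point shifts the Mean by more than a factor of 25, while the Median barely moves.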
The Mean and Median can each be calculated with a simple NumPy one-liner: numpy.mean for the Mean and numpy.median for the Median.
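The article's original listing is not reproduced here, so the following is a minimal sketch of what such one-liners might look like, applied to the same five numbers used in the worked examples above.

```python
import numpy as np

data = np.array([3, 64, 187, 12, 52])

mean_value = np.mean(data)      # (3 + 64 + 187 + 12 + 52) / 5
median_value = np.median(data)  # middle value of the sorted data

print(mean_value)    # 63.6
print(median_value)  # 52.0
```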
(2) Spread
In statistics, the spread of the data is the extent to which it is squeezed towards a single value or spread out across a wider range.
Take a look at the plots of the Gaussian probability distributions below, and imagine that these are probability distributions describing a real-world dataset.
The blue curve has the smallest spread since most of its data points fall within a fairly narrow range. The red curve has the largest spread since its data points cover a much wider range.
The legend shows the standard deviation values of these curves, explained in the next section.
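To make the idea of spread concrete in numbers, the sketch below samples three Gaussian distributions and reports how wide a range the central 95% of each one covers. The three standard deviations (1, 2, and 5) are hypothetical stand-ins and are not the values from the plot legend.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical standard deviations standing in for three curves;
# these are NOT the values from the article's legend.
sigmas = {"narrow": 1.0, "medium": 2.0, "wide": 5.0}

results = {}
for name, sigma in sigmas.items():
    samples = rng.normal(loc=0.0, scale=sigma, size=100_000)
    low, high = np.percentile(samples, [2.5, 97.5])  # bounds of the central 95%
    results[name] = (samples.std(), high - low)
    print(f"{name}: std={samples.std():.2f}, central 95% width={high - low:.2f}")
```

The larger the standard deviation, the wider the range needed to capture the same share of the data.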