In data science, the real task is to extract what the data has to say beyond the numbers in a table. To do this, a data scientist uses numerous tools to explore, visualize and derive insights from the given dataset. Statistics allows us to establish the structure and make-up of the data. Descriptive statistics help to gain insights into the data, such as:
- How biased is the dataset?
- Which is the center around which the data is clustered?
- What are the unusual values for certain features in the dataset?
Let’s see the different statistical techniques used to gain the above insights!
What is Statistics?
The best thing about being a statistician is that you get to play in everyone’s backyard. — John Tukey
Statistics is the study of collecting, analyzing and interpreting a set of observations, known as data, and of summarizing its various features. These observations pertain to a population of interest. When the whole population cannot be studied, a sample drawn from it is analyzed instead, and the resulting summary is used to draw conclusions about the population.
The sample is treated as representative of the population. Based on this sample, confidence intervals give an estimate of the range in which the population parameters are likely to lie. A representative sample is one that accurately reflects, or “is like”, the population: it should be an unbiased reflection of what the population looks like.
A Population is the group containing all the observations of interest. Its elements can be anything that is to be studied, such as objects, events, organizations, countries, species, organisms, etc.
A Sample is a subset extracted from a large population. Since it’s not always possible to collect data of an entire population, a random sample, representing the whole population, can be extracted for study. Probability sampling methods such as Random Sampling or Stratified Sampling are used so as to reduce sampling bias and extend the generalization of the sample results to the concerned population as a whole.
Consider an example. You want to test the hypothesis that students from the science and technology faculty consider themselves smarter than those from the humanities faculty. Your target population is the 15,000 undergraduate students at the Mumbai University. Your sample consists of 500 students whose majors are science, computer science, economics and history. Most of the participants are male. You conduct a test on basic science and economics, followed by a self-assessment quiz on their performance. You find that 73% of the students pursuing majors in science and technology consider themselves smarter than students studying humanities.
Can you generalize the result from this sample to the target population?
The answer is NO. The sample does not represent students from other branches of humanities such as Geography, Geology and Psychology, and it is strongly biased towards male students. The result can therefore be applied only to the subset of the population sharing these characteristics. To draw valid conclusions that apply to the whole population, probability sampling techniques are used. In probability sampling, every observation in the population has a known, non-zero chance of being selected; in simple random sampling, that chance is equal for all observations.
Simple Random Sampling
Tools such as random number generators can be used to generate random numbers based purely on chance. sample() is a built-in function of Python’s random module that returns a list of a given length containing unique items chosen from a sequence such as a list, tuple or string. Since Python 3.11, a set must first be converted to a list or tuple before it is passed to sample().
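As a minimal sketch of simple random sampling with random.sample(), the snippet below draws 500 IDs from a made-up population of 15,000 student IDs. The ID range and the seed are assumptions chosen for illustration only.

```python
import random

# Hypothetical population: student IDs 1..15000 (illustrative only).
population = list(range(1, 15001))

random.seed(42)  # fixed seed so the draw is reproducible

# Draw 500 distinct IDs; every ID has the same chance of being selected.
sample_ids = random.sample(population, k=500)

print(len(sample_ids))       # 500
print(len(set(sample_ids)))  # 500, because sampling is without replacement
```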
Stratified sampling involves dividing the population into subpopulations (strata) based on the relevant features of the observations. A sample is then selected from each subgroup, in proportion to its share of the overall population, using the simple random sampling method.
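One way to sketch proportional stratified sampling is shown below. The population, the 60/40 faculty split and the helper name stratified_sample are all hypothetical, introduced only to illustrate the idea.

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical population of (student_id, faculty) pairs: 60% science, 40% humanities.
population = [(i, "science" if i < 600 else "humanities") for i in range(1000)]

def stratified_sample(records, key, sample_size):
    """Draw a proportional simple random sample from each stratum."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for members in strata.values():
        # Allocate slots in proportion to the stratum's share of the population.
        k = round(sample_size * len(members) / len(records))
        sample.extend(random.sample(members, k))
    return sample

sample = stratified_sample(population, key=lambda r: r[1], sample_size=100)
counts = {f: sum(1 for _, fac in sample if fac == f) for f in ("science", "humanities")}
print(counts)  # {'science': 60, 'humanities': 40}
```

Because each stratum's allocation is rounded, the total can differ slightly from the requested sample size when the proportions do not divide evenly.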
Descriptive statistics summarize the results derived from the collection and analysis of a dataset. There are two main areas of study:
- Central Tendency: it concerns the averages of the values.
- Variability: it concerns how spread out the values are.
There are three ways to identify the center or average of a dataset, namely the Mean, the Median and the Mode.
Outliers are observations or data values that lie far from the other data values. They are uncommon values in a dataset. Outliers are challenging for many statistical analyses because extreme values can distort the results and make them unreliable.
The mean is not a robust measure: it is pulled towards extreme values and can be misleading when they are present.
The median is much less sensitive to skewed and extreme values. In datasets containing extreme values, the mean gets drawn away from the center of the values, which can produce misleading results. The median is a good option in such cases since extreme or skewed values barely affect it. For example, the median is a good statistic for describing annual income, where a few very high earners would inflate the mean.
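The effect of a single extreme value on the mean and the median can be seen in a small sketch. The income figures are invented for illustration.

```python
import statistics

# Hypothetical annual incomes (in thousands); the last value is an extreme outlier.
incomes = [32, 35, 38, 40, 41, 43, 45, 1000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(mean_income)    # 159.25
print(median_income)  # 40.5
```

The mean lands far above every typical value, while the median stays in the middle of the bulk of the data.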
Importance of Central Tendency in Data Exploratory Phase
Missing values in a dataset can significantly impact the analysis of the data and thereby the performance of the algorithm used for prediction. To minimize this effect, missing values need to be tracked and fixed by imputing them. There are several methods of imputation, one of which is single imputation. Single imputation methods fill missing values in numerical and nominal features based on existing values, statistical measures, or predicted values.
Depending on the values used for each one of these strategies, there are methods that work on numerical values only and methods that work on both numerical and nominal columns. They are summarized below:
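As a small illustration of single imputation with measures of central tendency, the sketch below fills a numerical column with its median and a nominal column with its mode using pandas. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical dataset with missing values in a numerical and a nominal column.
df = pd.DataFrame({
    "age": [25, 30, None, 35, 120],
    "city": ["Mumbai", None, "Pune", "Mumbai", "Delhi"],
})

# Numerical column: the median is less sensitive to the extreme value 120 than the mean.
df["age"] = df["age"].fillna(df["age"].median())

# Nominal column: fill with the mode, i.e. the most frequent category.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```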
Data Distributions and Measures Of Central Tendency
Skewed Distribution using Histogram
Skewness is a measure of the asymmetry of the probability distribution of a random variable about its mean. In other words, in a skewed distribution the data is not distributed equally on either side of the center. The data can be right skewed or left skewed. In a skewed distribution, most of the values cluster at one end and the remaining data points spread out into a longer tail at the other end. The direction of the tail tells us the side of the skew.
NOTE: If the data is skewed, then the results of descriptive statistics such as mean can be misleading.
In a positively skewed (right-skewed) distribution, most of the data points accumulate on the left side of the distribution, leaving a long tail towards the right, and the central tendency of the distribution falls towards the lower values of the feature.
In a negatively skewed (left-skewed) distribution, most of the data points accumulate on the right side of the distribution, leaving a long tail towards the left, and the central tendency of the distribution falls towards the higher values of the feature.
How to deal with skewness?
The predictions from a regression model tend to be more reliable and accurate if the target variable follows a Gaussian, or normal, distribution. Let’s say we want to predict the price of a house based on a given dataset. The target variable is Price. The histogram with the KDE plot clearly shows that the feature is highly right skewed. The skew() method of pandas gives us the skewness of the variable, which tells us how skewed the data is. The general rule of thumb for interpreting skewness values is as follows:
- If skewness is less than -1 or greater than 1, the distribution is highly skewed.
- If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
- If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
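The rule of thumb above can be sketched with pandas as follows. The price values are made up and are not the article's housing dataset, and describe_skew is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical right-skewed house prices (in thousands); not the article's dataset.
price = pd.Series([100, 110, 120, 125, 130, 140, 150, 160, 400, 900])

skewness = price.skew()

def describe_skew(value):
    """Classify a skewness value using the rule of thumb listed above."""
    if abs(value) > 1:
        return "highly skewed"
    if abs(value) >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"

print(round(skewness, 2), describe_skew(skewness))
```

A positive result indicates a tail to the right, a negative one a tail to the left.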
It is therefore common to transform a highly skewed target variable so that it is closer to a normal distribution before feeding it into a model.
Let’s explore a few methods for transformation of skewed data.
Log transformation is a data transformation method that replaces each value x of a variable with log(x). It reduces or removes the skewness of the original data, and it requires the values to be strictly positive. It can be done easily with NumPy by calling the log() function on the desired feature. Below is the feature, Price, after applying log transformation. It’s not normally distributed on the whole, but the skewness has been reduced to a large extent, giving the curve a bell shape.
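A minimal sketch of the log transformation is shown below. It uses synthetic, lognormally distributed prices as an assumption so that the effect is easy to see; real data will rarely become this symmetric.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (assumption for illustration, not real data).
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=12, sigma=0.8, size=1000))

# np.log requires strictly positive values, which holds for prices.
log_price = np.log(price)

skew_before = price.skew()
skew_after = log_price.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```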
Square Root Transform
As the name suggests, the square root transformation applies a square root to the target column. It sometimes works well, but it’s often not the best form of transformation. It can be applied by calling the sqrt() function from the NumPy library. Below is the graph after applying the square root transformation.
NOTE: The transformation applied must be reversed after making predictions.
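Reversing a transformation means applying its inverse to the model's predictions: squaring undoes a square root, and the exponential undoes a natural log. The sketch below shows the round trip on a few invented prices rather than on real predictions.

```python
import numpy as np

# Hypothetical prices, used only to illustrate the round trip.
price = np.array([100.0, 400.0, 900.0, 2500.0])

sqrt_price = np.sqrt(price)            # forward square root transform
restored_from_sqrt = np.square(sqrt_price)  # inverse: square the values

log_price = np.log(price)              # forward log transform
restored_from_log = np.exp(log_price)  # inverse: exponentiate the values

print(sqrt_price)  # [10. 20. 30. 50.]
print(np.allclose(restored_from_sqrt, price), np.allclose(restored_from_log, price))  # True True
```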
Now that we have seen skewness, another important related term is Kurtosis.
Kurtosis is often described in terms of the height and sharpness of the central peak relative to that of a standard bell curve. More precisely, it is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. Datasets with high kurtosis tend to have heavy tails, or outliers. A normal distribution has an excess kurtosis of zero (a kurtosis of 3), and excess kurtosis is the convention used here. Significant skewness and kurtosis clearly indicate that the data are not normal.
The acceptable values of Kurtosis are 0–3. The baseline value is 0.
Values for Kurtosis
- For a normal distribution, the excess kurtosis is exactly 0.
- A distribution with a positive kurtosis value indicates that the distribution has heavier tails and a sharper peak than the normal distribution.
- A distribution with a negative kurtosis value indicates that the distribution has lighter tails and a flatter peak than the normal distribution.
- High kurtosis in a data set is an indicator that data has heavy outliers.
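The sign of kurtosis can be illustrated with synthetic samples. The sketch assumes pandas' Series.kurt(), which reports excess kurtosis (0 for a normal distribution); the three distributions are chosen here only as examples of neutral, heavy and light tails.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

samples = {
    "normal": rng.normal(size=n),    # reference: excess kurtosis near 0
    "laplace": rng.laplace(size=n),  # heavier tails: positive excess kurtosis
    "uniform": rng.uniform(size=n),  # lighter tails: negative excess kurtosis
}

excess_kurtosis = {name: pd.Series(values).kurt() for name, values in samples.items()}
for name, value in excess_kurtosis.items():
    print(f"{name}: {value:.2f}")
```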
The Normal Distribution is a probability distribution that describes how the values of a feature are spread. It is a symmetric distribution without any skew. Most of the observations cluster around the central peak, and the probabilities for values further away from the mean taper off equally in both directions. Height, birth weight, reading interests, job satisfaction and IQ scores are just a few examples of features that approximately follow the normal distribution.
The histogram shows the distribution of values for the feature “number of webinars attended this year”. The mean, median and mode are equal, showing a normal distribution around the central tendency of 8.
Central Limit Theorem (CLT)
The central limit theorem in statistics states that, given a sufficiently large sample size, the sampling distribution of the mean for a feature will approximately have a normal distribution regardless of that variable’s distribution in the population.
The clause “regardless of that variable’s distribution in the population” means that the values of a feature in the population can follow any of several probability distributions, such as left-skewed, right-skewed, normal and others. The CLT can be applied to almost all types of probability distributions, provided the following criteria are met:
- The feature values should be independent.
- The values should be identically distributed.
That is, one data value should not depend on another, and every value should be drawn from the same probability distribution.
What is “the sampling distribution of the mean for a feature”?
A sample can be drawn from the population and studied, and a mean can be calculated from it. Similarly, many samples can be drawn from the population and the mean of each one calculated. If these sample means are plotted as a histogram, we see the distribution of the sample means. This is known as the sampling distribution of the mean. The shape of the sampling distribution depends strongly on the sample size, so different sample sizes give different sampling distributions.
How large should the sample size be to satisfy this clause, “given a sufficiently large sample size”?
From the previous lines, we can infer that the shape of the sampling distribution changes with the sample size. The CLT definition states that, given a sufficiently large sample size, the sampling distribution of the mean will be approximately normal.
Whatever the distribution of the feature in the population, the sampling distribution of the mean moves closer to a normal distribution as the sample size grows. A common rule of thumb treats a sample size of about 30 or more as sufficiently large, although strongly skewed populations may need larger samples.
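The central limit theorem can be checked with a small simulation. The sketch below assumes a right-skewed exponential population and 2,000 repeated samples per sample size; both choices, and the helper names, are illustrative. It measures how the skewness of the sample means shrinks as the sample size grows.

```python
import random
import statistics

random.seed(1)

def sample_means(sample_size, n_samples=2000):
    """Means of repeated samples drawn from a right-skewed exponential population."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

def skewness(values):
    """Moment-based skewness: the mean of the cubed standardized values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

skew_by_size = {}
mean_by_size = {}
for size in (2, 30, 200):
    means = sample_means(size)
    skew_by_size[size] = skewness(means)
    mean_by_size[size] = statistics.fmean(means)
    print(size, round(mean_by_size[size], 3), round(skew_by_size[size], 3))
```

The sample means stay centered near the population mean of 1, while their skewness falls towards 0 as the sample size increases, which is the behaviour the theorem describes.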
How to know the distribution of data graphically?
A boxplot is a graph that gives you a good indication of how the values in the data are spread out.
- It is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).
- It can point out outliers and their values.
- It can indicate whether the feature is roughly symmetric, as it would be under a normal distribution.
- It can also show how tightly the data is grouped, and can indicate the kind of skewness, if any.
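The numbers behind a boxplot can be computed directly. The sketch below derives the five-number summary from invented values and flags outliers using the common 1.5 × IQR whisker convention, which is an assumption here since the article does not state which rule its plots use.

```python
import statistics

# Hypothetical values; 40 is deliberately far from the rest.
values = [2, 4, 5, 6, 7, 8, 9, 10, 12, 40]

# Quartiles with linear interpolation between data points.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Common boxplot convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(min(values), q1, median, q3, max(values))  # 2 5.25 7.5 9.75 40
print(outliers)                                  # [40]
```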
Measures of Variability
A measure of variability summarizes how spread out the values in the dataset are. Variability is studied in the context of the distribution of observations. If the dispersion is low, the data points are closely clustered around the center: the lower the variability, the higher the consistency. The most common measures of variability are Range, Interquartile Range, Variance and Standard Deviation.
Why is Measure of Variability Important?
Just as the mean is an important measure of central tendency, variability is an important measure of the distribution. Two datasets can have the same mean and still have very different statistical summaries because their distributions differ.
Let’s look at two graphs side by side. The probability distribution differs with less and more variability despite the mean being the same. In the first graph, the dispersion of the data points is low, so the data clusters around the center, or mean. In the second graph, the data points are spread out from the center, showing more dissimilar data points and extreme values. Although the mean is the same in both figures, the variability in the dataset shapes the probability distribution. Distributions with greater variability produce unusually large and small values more frequently than distributions with less variability.
From the graphs we can deduce that central tendency alone is not enough to read a statistical summary; the variability in the distribution is equally important for getting the complete picture.
How does variability affect the overall summary?
Let us understand the importance of variability using an instance from our day-to-day lives. Suppose two online food delivery aggregators advertise their services. Both have advertised their services very well and promise a quick delivery schedule. You want to try their services and then decide where to place your order when you are HUNGRY!
You monitor their delivery schedules and the variability in them. You find that there is a lot of variability in their delivery timings. You plot the distribution of delivery time for each provider. Let us draw our attention to the shaded portion in the graph.
For the service provider with high variability, around 16% of deliveries (the shaded portion) arrive after the promised delivery time. This means 16% of deliveries fall considerably outside the scheduled time limit of 20–30 min. In the second graph the mean time remains the same, but only 2% of deliveries are delayed. This clearly shows why variability must be considered to avoid misleading conclusions.
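Figures close to 16% and 2% arise if delivery times are assumed to be normally distributed with the same mean but different standard deviations. The mean of 20 minutes, the promised limit of 30 minutes and the standard deviations of 10 and 5 minutes below are assumptions chosen to illustrate the point, not values given in the example.

```python
from statistics import NormalDist

# Assumed for illustration: both providers average 20 minutes and promise 30.
promised_limit = 30
high_variability = NormalDist(mu=20, sigma=10)
low_variability = NormalDist(mu=20, sigma=5)

# Share of deliveries that arrive after the promised limit.
late_high = 1 - high_variability.cdf(promised_limit)
late_low = 1 - low_variability.cdf(promised_limit)

print(f"{late_high:.1%}")  # 15.9%
print(f"{late_low:.1%}")   # 2.3%
```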
Another approach to deciding on the service provider would be to calculate the Coefficient of Variation for each one and choose the provider whose value is lower.
Coefficient Of Variation
Another measure of variability is the Coefficient of Variation. The coefficient of variation (CV) is a measure of relative variability, defined as the ratio of the standard deviation to the mean. The CV is principally useful when you want to compare results from two different surveys or tests that have different measures or values, for example two tests that use different scoring mechanisms. If sample A has a CV of 12% and sample B has a CV of 25%, you would say that sample B has more variation relative to its mean.
Comparing the standard deviations from the two surveys directly would not be a correct way of comparing variability, because the means are also different. The Coefficient of Variation should only be used to compare positive data on a ratio scale. The CV has little or no meaning for measurements on an interval scale, such as temperatures in Celsius or Fahrenheit.
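The coefficient of variation is straightforward to compute. The sketch below compares two invented sets of test scores on different scales; the helper name and the scores are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (for positive, ratio-scale data)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical scores from two tests with different scoring scales.
test_a = [8, 9, 10, 11, 12]      # scored out of 20
test_b = [40, 55, 70, 85, 100]   # scored out of 100

cv_a = coefficient_of_variation(test_a)
cv_b = coefficient_of_variation(test_b)

print(f"{cv_a:.1%}")  # 15.8%
print(f"{cv_b:.1%}")  # 33.9%
```

Although the raw standard deviations are not comparable across the two scales, the CVs show that test B varies more relative to its mean.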
Range is a straightforward measure of variability that can be easily calculated and understood. It is the difference between the largest value and the smallest value in the dataset.
Range is susceptible to outliers: a single extreme value can distort it. Range should be used to compare variability only when the sample sizes of both datasets are similar. As the size of a random sample increases, the range tends to increase as well, because larger samples are more likely to include extreme values. Hence, range alone cannot be considered a sufficient measure of variability.
Quartiles divide a sorted dataset into four equal parts using three cut points. The values below the first quartile (Q1) make up the lowest 25% of the dataset, and the values above the third quartile (Q3) make up the highest 25%. The middle half (50%) of the dataset, which falls between Q1 and Q3, spans the Interquartile Range (IQR), calculated as the difference between Q3 and Q1. The IQR is not strongly influenced by outliers or extreme values, so for a distribution with high variability it is a robust measure of spread.
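The contrast between the range and the IQR can be sketched with two small invented datasets that differ only in one extreme value. The helper names are hypothetical.

```python
import statistics

def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

def iqr(values):
    """Interquartile range: Q3 minus Q1."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical data; the second list replaces the largest value with an outlier.
typical = [10, 12, 13, 15, 16, 18, 19, 21, 22]
with_outlier = typical[:-1] + [95]

print(value_range(typical), value_range(with_outlier))  # 12 85
print(iqr(typical), iqr(with_outlier))                  # 6.0 6.0
```

One extreme value multiplies the range several times over, while the IQR does not move.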
The relative widths of the two halves of the box indicate the skew of the dataset. If the Q1 to median part of the box is broader than the median to Q3 part, the distribution is negatively skewed. If the Q1 to median part is narrower, the distribution is positively skewed.
Variance is the average squared difference between each data point and the mean. Unlike the previous measures, variance uses the mean in its calculation. It states the extent of spread in the dataset: the larger the spread of the data, the larger the variance with respect to the mean. Variance is expressed in squared units of the original values, which makes it harder to interpret directly. Statistical procedures such as variance tests or the analysis of variance (ANOVA) use sample variance to assess group differences of populations. They use the variances of the samples to evaluate whether the populations they come from differ significantly from each other.
Variance is never negative. It can be calculated for both ungrouped and grouped data.
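The calculation of variance for ungrouped data can be written out step by step. The values below are invented; the sketch shows both the population form (dividing by n) and the sample form (dividing by n - 1) and compares them with Python's statistics module.

```python
import statistics

# Hypothetical ungrouped data.
values = [4, 8, 6, 5, 3, 7]

mean = statistics.mean(values)  # 5.5
squared_deviations = [(x - mean) ** 2 for x in values]

population_variance = sum(squared_deviations) / len(values)    # divide by n
sample_variance = sum(squared_deviations) / (len(values) - 1)  # divide by n - 1

print(round(population_variance, 4))  # 2.9167
print(sample_variance)                # 3.5
print(statistics.variance(values))    # sample variance from the standard library
```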
Standard Deviation can be thought of as the typical amount of variability in the dataset. It gives us an idea of how far, on average, each data point lies from the mean. A low SD means that the data points are close to the mean, while a high SD means that the data points are dispersed or spread out. Standard Deviation is the most widely used measure of variability since it reflects the dispersion of the distribution.
SD = 5 shows that the data points are closely clustered around the center.
SD = 10 shows that the data points are spread out and moving away from the center.
SD = 20 shows that the data points are very far from the center, giving a wide distribution.
Standard Deviation Formula for Sample
$$ s = \sqrt{\frac{\sum (X - \bar{x})^2}{n - 1}} $$
s = sample standard deviation, ∑ = sum of…, X = each value, x̅ = sample mean, n = number of values in the sample
The sample standard deviation is the square root of the sample variance. In a normally distributed dataset, the standard deviation becomes very valuable: it can be used to determine the proportion of values that fall within a specified number of standard deviations from the mean.
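The sample standard deviation, the square root of the sample variance with n - 1 in the denominator, can be computed step by step as sketched below. The values are invented, and the result is compared with statistics.stdev().

```python
import math
import statistics

# Hypothetical sample.
values = [4, 8, 6, 5, 3, 7]

n = len(values)
x_bar = sum(values) / n  # sample mean

# Square root of the sum of squared deviations divided by n - 1.
s = math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

print(round(s, 3))                         # 1.871
print(round(statistics.stdev(values), 3))  # 1.871
```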
The empirical rule states that standard deviation and the mean together can tell you where most of the values in your distribution lie if they follow a normal distribution.
- 1SD : Around 68% of scores are within 1 standard deviation of the mean,
- 2SD : Around 95% of scores are within 2 standard deviations of the mean,
- 3SD : Around 99.7% of scores are within 3 standard deviations of the mean.
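The percentages in the empirical rule can be recovered from the normal distribution itself. The sketch below uses the standard normal distribution from Python's statistics module; the same shares hold for any mean and standard deviation.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations on either side of the mean.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in coverage.items():
    print(f"within {k} SD: {share:.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```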
The standard deviation is expressed in the same units as the original values. Since variance is expressed in squared units, standard deviation is usually preferred over variance when describing spread.
Having basic statistical knowledge can be greatly beneficial in a role as a Data Scientist. The task of a data scientist starts with the collection of data and then its exploration. Understanding descriptive statistics will prove immensely helpful in exploratory analysis.
Hope you enjoyed reading!
Coming up next is a case study to understand descriptive statistics….