How is the Central Limit Theorem applied to Data Science?
The Central Limit Theorem (CLT) in statistics states that, regardless of the distribution of a variable in the population, the sampling distribution of the mean will approximate a normal distribution given a sufficiently large sample size. The CLT plays a crucial role in drawing inferences about large populations because it establishes a solid foundation for the normality assumption. To understand the application of the CLT in data science, in this article we discuss the normal distribution of data and the reasoning behind the theorem. The following points will be covered in this post.
Table of contents
- What is a normal distribution?
- About Central Limit Theorem
- Central Limit Theorem in Python
Let’s start by understanding the normal distribution.
What is a normal distribution?
Normal distributions are continuous probability distributions that are symmetric around their mean. Most observations cluster around the central peak, and probabilities taper off equally as values move away from the mean in either direction, so extreme values in either tail are similarly unlikely. A normal distribution is always symmetrical, but not all symmetrical distributions are normal.
- A symmetrical distribution is one whose two halves are mirror images on either side of a dividing line, but the actual data might form two bumps or a series of hills rather than the single bell curve that indicates a normal distribution.
In statistical reports, a normal distribution typically appears as a bell curve. As with any probability distribution, the shape of the curve depends on two parameters: the mean and the standard deviation.
- The mean is the central tendency of the normal distribution and defines the location of the peak of the bell curve. If the mean is shifted, the entire curve moves left or right along the X-axis without changing its shape. In the special case of the standard normal distribution, the mean is zero (0).
- The standard deviation is a measure of dispersion: it shows how far observations tend to spread from the mean. Changing the standard deviation either tightens or spreads out the width of the distribution along the X-axis; larger standard deviations produce wider distributions. In the standard normal distribution, the standard deviation is one (1). A pictorial representation of these parameters is shown below.
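To make these two parameters concrete, we can draw values from normal distributions with different means and standard deviations using NumPy. This is a minimal sketch; the sizes and parameter values are arbitrary choices for illustration, not part of the article's dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# standard normal: mean 0, standard deviation 1
standard = rng.normal(loc=0.0, scale=1.0, size=100_000)

# shifted and widened: mean 5, standard deviation 3
shifted = rng.normal(loc=5.0, scale=3.0, size=100_000)

# the sample statistics recover the chosen parameters, up to sampling noise
print(standard.mean(), standard.std())
print(shifted.mean(), shifted.std())
```

Changing `loc` slides the bell curve along the X-axis; changing `scale` widens or tightens it, exactly as described in the bullets above.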
About Central Limit Theorem
The Central Limit Theorem (CLT) states that when plotting the sampling distribution of means, the mean of the sample means will be equal to the population mean, and the sampling distribution will approach a normal distribution whose standard deviation equals the standard error.
- The standard error (SE) in statistics is the standard deviation of the sampling distribution. Mathematically, it is the standard deviation of the sample divided by the square root of the sample size. It is an estimate of how much the sample mean varies from the population mean.
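The relationship SE = s / √n can be sketched in a few lines of plain Python (the observations below are hypothetical numbers, chosen only for illustration):

```python
import math

sample = [12, 15, 9, 14, 11, 13, 10, 16]  # hypothetical observations
n = len(sample)
mean = sum(sample) / n

# sample standard deviation (dividing by n - 1, i.e. ddof = 1)
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

# standard error of the mean: s divided by the square root of n
se = s / math.sqrt(n)
print(round(se, 3))  # → 0.866
```

Note that `s` is divided by `math.sqrt(n)`, not by `n`: quadrupling the sample size only halves the standard error.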
In other words, even if the population distribution is not normal, the sampling distribution of the mean will be approximately normal. There are a few assumptions behind the CLT.
- The sample data must be selected randomly from the population.
- The sampled observations must be independent of one another; one sample should not influence the others.
- When sampling without replacement, the sample size should be no more than 10% of the population. Generally, a sample size greater than 30 (n > 30) is considered sufficient.
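The first two assumptions, random selection with independent observations, are what `random.sample` from the standard library provides: it draws items uniformly and without replacement. A small sketch with made-up indices:

```python
import random

random.seed(0)
population = list(range(1000))          # stand-in for the row indices of a dataset
sample = random.sample(population, 30)  # 30 distinct indices, drawn at random

# without replacement: no index appears twice
print(len(sample), len(set(sample)))    # → 30 30
```

This is the same mechanism the sampling function later in this article uses to pick rows from the dataset.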
What statistical inference does CLT provide?
Without taking a new sample to compare with, this theorem can be applied to quantify the probability that a sample mean will diverge from the population mean. We do not need the characteristics of the whole population to assess the likelihood of the sample, because the sample mean is approximately equal to the population mean.
However, a single sample by itself cannot tell us how precise or reliable the estimate is with respect to the larger population. This uncertainty is expressed by introducing a confidence interval.
- A confidence interval is a range of values, computed from the sample, that is likely to contain the population parameter with a stated probability. For example, you survey a supermarket to see how many cans of beverages it sells per hour, and your statistic yields a confidence interval of (200, 300). That means you estimate the store sells between 200 and 300 cans an hour.
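A 95% confidence interval for the mean can be sketched from the sample mean and the standard error; the 1.96 multiplier comes from the standard normal distribution, and the hourly beverage counts below are hypothetical numbers invented for the example:

```python
import math

cans_per_hour = [240, 255, 230, 260, 245, 270, 235, 250, 265, 248]  # hypothetical survey
n = len(cans_per_hour)
mean = sum(cans_per_hour) / n
s = math.sqrt(sum((x - mean) ** 2 for x in cans_per_hour) / (n - 1))
se = s / math.sqrt(n)

# 95% confidence interval: mean ± 1.96 standard errors
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(round(lower, 1), round(upper, 1))  # → 241.8 257.8
```

The interval narrows as the sample grows, since the standard error shrinks with √n.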
So far we have discussed the theoretical aspects of the CLT. Let’s now implement it on data in Python.
Central Limit Theorem in Python
The CLT involves two important parameters: the mean and the standard deviation of the sample and of the population. Let’s start by calculating the mean and standard deviation.
Import libraries:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import random
Reading the data:
df=pd.read_csv("/content/drive/MyDrive/Datasets/Dummy Data HSS.csv")
df.head()
The data is dummy sales data for a store, covering TV, radio and social media influencer channels along with the corresponding number of sales.

Calculating the mean and standard deviation of the population:
print("shape of the data:",df.shape)
mean_pop = df["Sales"].mean()
std_pop = df["Sales"].std()
print("population mean (μ) = {} and standard deviation (σ) of population = {}.".format(round(mean_pop,2),round(std_pop,2)))

Plot the distribution of sales:
fig, ax = plt.subplots(figsize=(15, 8))
sns.histplot(df['Sales'], kde=True, ax=ax)
plt.show()

As we can see, the population data is not normally distributed. Now let’s take some samples and plot their distributions.
Taking samples and plotting the distribution:
def mean_distribution(data, samples_count, data_points_count):
    # draw `samples_count` random samples of `data_points_count` observations each,
    # without replacement, and return the mean of every sample
    li_samp = list()
    data = np.array(data.values)
    for i in range(0, samples_count):
        samples = random.sample(range(0, data.shape[0]), data_points_count)
        li_samp.append(data[samples].mean())
    return np.array(li_samp)

count = 0
mean_list = list()
fg, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
lst = [(20, 5), (100, 50), (100, 100), (200, 50)]
for i in (0, 1):
    for j in (0, 1):
        ax[i, j].set_title(str(lst[count][0]) + " samples of size " + str(lst[count][1]))
        sample_means = mean_distribution(df["Sales"], lst[count][0], lst[count][1])
        sns.histplot(sample_means, kde=True, ax=ax[i, j])
        mean_list.append(sample_means)
        count += 1
plt.show()
The function draws random samples of the given size from the population and computes each sample’s mean.

In the first subplot, with 20 samples of size 5, the distribution is not normal; a second hump appears on the right. The subplots with 100 samples of sizes 50 and 100 are closer but still imperfect: the peak takes shape while the tails are not spread widely. With 200 samples of size 50, the distribution is approximately normal.
As observed from the subplots above, as the number of samples and the sample size increase, the distribution of the sample means tends towards normal.
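We can also check the theorem numerically rather than visually: for a clearly non-normal population, the mean of the sample means should match the population mean, and their standard deviation should approach σ/√n. The sketch below is self-contained and uses a synthetic exponential population instead of the article's dataset, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # skewed, clearly non-normal
n = 50                                                 # sample size

# draw 2,000 samples of size n and record each sample mean
sample_means = np.array([rng.choice(population, size=n).mean() for _ in range(2_000)])

theoretical_se = population.std() / np.sqrt(n)
print(sample_means.mean(), population.mean())  # close to each other
print(sample_means.std(), theoretical_se)      # close to each other
```

Even though the population is heavily skewed, the spread of the sample means agrees with the standard-error formula from earlier in the article.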
Verdict
The central limit theorem is a powerful and important statistical theorem that underpins the normality assumption and helps quantify the precision of estimates. With the hands-on implementation in this article, we have seen how the CLT can be used in data science.




