Mohammed Sharaqi

Posted on Feb 17

Graphing the Central Limit Theorem (CLT) with Pandas and Matplotlib

#statistics #datascience #pandas

Statistics cares about central tendency. To answer the question "How is our data distributed?" you need an idea of the center, of how far each value varies from the center, and of the outliers. Three measures describe the center: the mean, the median, and the mode. The mean is a single number that represents the center of an entire distribution of data. The median is the middle value of the sorted data. The mode is the value that occurs most often.

However, knowing the center is not enough to settle what shape our distribution has, so we need two additional ideas. Kurtosis is the "tailedness" of the distribution, often read as how peaked it is: it tells you how much of your data sits in the tails versus the center. Skewness describes the asymmetry of the distribution; in other words, it tells you which way the "tail" of the graph is leaning.

If you got lost, here are the questions you can answer once you have these three values:

Where is it? (Mean)
Is it lopsided? (Skewness)
Is it pointy or flat? (Kurtosis)
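As a rough illustration, these quantities can be computed with pandas. The scores below are made up for this sketch (they are not the article's data), and pandas' `skew()` and `kurt()` report bias-corrected sample estimates, with kurtosis given as excess kurtosis (0 for a normal distribution).

```python
import pandas as pd

# Made-up scores with one large value, so the distribution has a right tail.
scores = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 12])

mean = scores.mean()            # arithmetic average
median = scores.median()        # middle value of the sorted data
mode = scores.mode().iloc[0]    # most frequent value (first one if there are ties)
skewness = scores.skew()        # > 0 means the longer tail is on the right
kurtosis = scores.kurt()        # excess kurtosis: 0 for a normal distribution

print(f"mean={mean:.2f}, median={median}, mode={mode}")
print(f"skewness={skewness:.2f}, excess kurtosis={kurtosis:.2f}")
```

The single large score pulls the mean above the median and the mode, which is the usual sign of a right-skewed distribution.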

But what about the second moment, the variance?

Well, variance tells you something crucial about your graph: how much is the crowd huddling together or spreading out? This is why we calculate the variance and the standard deviation: they tell us how far the individual values are spread out from the center.

Low Variance: The data points are very close to the mean. The graph looks like a tall, narrow spike.
High Variance: The data points are spread far from the mean. The graph looks wide and stretched out.
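A small sketch of the difference, using two made-up data sets that share the same mean. It uses Python's standard `statistics` module and the population formulas (dividing by N), which is an assumption of this example.

```python
import statistics

# Two made-up data sets with the same mean (5) but different spread.
tight = [4, 5, 5, 5, 6]
wide = [1, 3, 5, 7, 9]

for name, data in [("tight", tight), ("wide", wide)]:
    mean = statistics.mean(data)
    variance = statistics.pvariance(data)  # population variance (divide by N)
    std = statistics.pstdev(data)          # population standard deviation
    print(f"{name}: mean={mean}, variance={variance}, std={std:.2f}")
```

Both lists are centered on 5, but the second one has a much larger variance, so its histogram would look wider and flatter.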

Congrats, dear reader, I’ve deceived you: all of this information describes one special distribution, the normal distribution, which you mostly find in textbooks. To be honest with you, my faculty has said that “it’s hard to get normal distribution [easily] in real life.” This is why I have to introduce you to the Central Limit Theorem (CLT) as a way to handle real-life situations. It does not make your data itself normal, but it lets you work with sample means, which are approximately normally distributed, so you still get the advantages of the normal distribution.

💡
Not all populations are normally distributed. But with the CLT, if we draw many samples of size n, their means (the “sample means”) are approximately normally distributed and centered on the population mean.

So, what is the idea of the CLT?

Imagine you have a population; it doesn’t matter whether it’s normally distributed or not, and you know its mean and its standard deviation. Then you draw many samples of the same size n from your population. After that, you calculate the mean of each sample, and you’ll notice something weird: the mean of the sample means is the same as the mean of the population. Here are the properties in more detail:

The mean of your sample means is equal to your population mean.

But please note that we’re speaking about the distribution of the sample mean, not of the population itself. We’ll use this idea when we calculate probabilities: the formula changes depending on whether you’re dealing with a single value or a sample mean.

The standard deviation of the sample means is smaller than the standard deviation of the population: it equals the population standard deviation divided by the square root of the sample size. This quantity is called the “standard error”.

$$\text{Standard Error (SE)} = \frac{\sigma}{\sqrt{n}}$$
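To get a feel for the formula, here is a minimal sketch. The function name `standard_error` and the value `sigma = 10.0` are made up for illustration.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / math.sqrt(n)

sigma = 10.0  # hypothetical population standard deviation
for n in (1, 4, 25, 100):
    print(f"n={n:>3}  SE={standard_error(sigma, n):.2f}")
```

Because of the square root, halving the standard error requires four times as many observations per sample.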

For the central limit theorem to work the way we expect, there are a few conditions to consider:

Sufficiently large sample size: The sample size has to be large enough. A sample size of 30 or more is usually considered good. But if the population we’re sampling from is skewed or has many outliers, we may need a bigger sample to see that nice, bell-shaped curve show up.

Independent and Identically Distributed (i.i.d.) samples: The samples we take need to be independent and identically distributed (i.i.d.). This means each observation is chosen randomly, independently of the others, and comes from the same population. If that’s not true, then the results may not be reliable.

Finite population variance: The population we’re sampling must have a finite variance. If the data comes from a distribution with infinite variance, like the Cauchy distribution, the CLT won’t apply to those.
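The effect of the sample size on a skewed population can be sketched with a small simulation. Everything here is an assumption made for illustration: the exponential population (mean 1, standard deviation 1), the 2000 repetitions, the fixed seed, and the helper names `sample_means` and `skewness`.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(n: int, repeats: int = 2000) -> list[float]:
    """Means of `repeats` i.i.d. samples of size n from Exponential(rate=1)."""
    return [
        statistics.mean(random.expovariate(1.0) for _ in range(n))
        for _ in range(repeats)
    ]

def skewness(values: list[float]) -> float:
    """Population skewness: third central moment divided by std cubed."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum((v - mu) ** 3 for v in values) / (len(values) * sd ** 3)

for n in (2, 30):
    means = sample_means(n)
    print(f"n={n:>2}  mean of means={statistics.mean(means):.3f}  "
          f"std={statistics.pstdev(means):.3f}  skewness={skewness(means):.2f}")
```

With samples of size 2 the sample means should still be clearly right-skewed, while at size 30 the skewness is much closer to 0 and the spread is close to 1/√30.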

the source

Example: the following example illustrates these two properties. Suppose a professor gave an 8-point quiz to a small class of four students. The results of the quiz were 2, 6, 4, and 8. For the sake of discussion, assume that the four students constitute the population.

Before solving this question, here’s the algorithm that I’m going to follow to show that the mean of the sample means is equal to the population mean.

Find the mean and the standard deviation (STD) of the population.
Take samples with replacement from the population and calculate the mean of each one to:
1. show that the mean of the sample means is equal to the population’s mean
2. show that the sample means move toward a normal distribution although the population itself is not normally distributed

The mean of the population is

$$\mu = \frac{\sum X}{N} = \frac{20}{4} = 5$$

The STD takes a few more steps. Here is the step-by-step arithmetic:

Calculate Deviations:

* $$2 - 5 = -3$$

* $$4 - 5 = -1$$

* $$6 - 5 = 1$$

* $$8 - 5 = 3$$

Square the Deviations:

* $$(-3)^2 = 9$$

* $$(-1)^2 = 1$$

* $$(1)^2 = 1$$

* $$(3)^2 = 9$$

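Sum the Results:

$$SS = 9 + 1 + 1 + 9 = 20$$

Divide by the population size and take the square root:

$$\sigma^2 = \frac{SS}{N} = \frac{20}{4} = 5, \qquad \sigma = \sqrt{5} \approx 2.24$$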
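The arithmetic can be checked in a few lines of Python. This sketch divides the sum of squares by N because the example treats the four students as the whole population; the variable names are my own.

```python
import math
import statistics

population = [2, 6, 4, 8]  # quiz scores from the example

mu = statistics.mean(population)
ss = sum((x - mu) ** 2 for x in population)   # sum of squared deviations
variance = ss / len(population)               # population variance: divide by N
sigma = math.sqrt(variance)                   # population standard deviation

print(mu, ss, variance, round(sigma, 3))
```

The standard library's `statistics.pstdev(population)` gives the same standard deviation in one call.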

This is the graph that shows our data; look at how our data is not normally distributed. Each score occurs exactly once, so all the bars have the same height.

Let us take samples from the population, and remember: the distribution of the sample means gets closer to normal as the sample size n gets bigger. Each of our samples is a set of two scores drawn with replacement, and we take the mean of each set:

Sample	Mean (x̄)	Sample	Mean (x̄)
2, 2	2	6, 2	4
2, 4	3	6, 4	5
2, 6	4	6, 6	6
2, 8	5	6, 8	7
4, 2	3	8, 2	5
4, 4	4	8, 4	6
4, 6	5	8, 6	7
4, 8	6	8, 8	8
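The same 16 samples can be listed programmatically. This is only a sketch: `itertools.product` with `repeat=2` is one way to enumerate ordered pairs drawn with replacement.

```python
from itertools import product

population = [2, 4, 6, 8]

# product(..., repeat=2) yields every ordered pair, i.e. sampling with replacement.
samples = list(product(population, repeat=2))
means = [sum(s) / len(s) for s in samples]

for sample, mean in zip(samples, means):
    print(sample, mean)
```

Ordered pairs are used on purpose: (2, 4) and (4, 2) count as two different samples, which matches the table.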

A frequency distribution of sample means is as follows:

X	f	f⋅X
2	1	2×1=2
3	2	3×2=6
4	3	4×3=12
5	4	5×4=20
6	3	6×3=18
7	2	7×2=14
8	1	8×1=8
Total	∑f=16	∑(f⋅X)=80

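So, the mean of the sample means is x̄ = 80/16 = 5, the same as the population mean.

And this is the distribution of the sample means. As you can see, it is symmetric and peaked in the middle, so it is already approaching the shape of a normal distribution.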
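The frequency table can be reproduced with `collections.Counter`. This sketch rebuilds the 16 sample means itself so it can run on its own.

```python
import statistics
from collections import Counter
from itertools import product

population = [2, 4, 6, 8]
n = 2  # sample size

# Mean of every ordered pair drawn with replacement.
means = [sum(s) / n for s in product(population, repeat=n)]
freq = Counter(means)

# Columns: X, f, f*X
for value in sorted(freq):
    print(f"{value:.0f}\t{freq[value]}\t{value * freq[value]:.0f}")

mean_of_means = statistics.mean(means)
print("mean of sample means:", mean_of_means)
```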

Let’s do some Python stuff

Before we start, you can find the project here on Kaggle (it includes the CSV file and the code).

Let’s study a new case to get a better understanding. What if we increase the sample size? Instead of the scores being {2,4,6,8}, let’s make them {2,4,6,8,10,12} and put three numbers in each sample. I asked ChatGPT to list the combinations, because the number of samples is now 6 × 6 × 6 = 216.

Here is the output:

Starting with 2	Starting with 4	Starting with 6
(2, 2, 2) (2, 2, 4) (2, 2, 6) (2, 2, 8) (2, 2, 10) (2, 2, 12)	(4, 2, 2) (4, 2, 4) (4, 2, 6) (4, 2, 8) (4, 2, 10) (4, 2, 12)	(6, 2, 2) (6, 2, 4) (6, 2, 6) (6, 2, 8) (6, 2, 10) (6, 2, 12)
(2, 4, 2) (2, 4, 4) (2, 4, 6) (2, 4, 8) (2, 4, 10) (2, 4, 12)	(4, 4, 2) (4, 4, 4) (4, 4, 6) (4, 4, 8) (4, 4, 10) (4, 4, 12)	(6, 4, 2) (6, 4, 4) (6, 4, 6) (6, 4, 8) (6, 4, 10) (6, 4, 12)
(2, 6, 2) (2, 6, 4) (2, 6, 6) (2, 6, 8) (2, 6, 10) (2, 6, 12)	(4, 6, 2) (4, 6, 4) (4, 6, 6) (4, 6, 8) (4, 6, 10) (4, 6, 12)	(6, 6, 2) (6, 6, 4) (6, 6, 6) (6, 6, 8) (6, 6, 10) (6, 6, 12)
(2, 8, 2) (2, 8, 4) (2, 8, 6) (2, 8, 8) (2, 8, 10) (2, 8, 12)	(4, 8, 2) (4, 8, 4) (4, 8, 6) (4, 8, 8) (4, 8, 10) (4, 8, 12)	(6, 8, 2) (6, 8, 4) (6, 8, 6) (6, 8, 8) (6, 8, 10) (6, 8, 12)
(2, 10, 2) (2, 10, 4) (2, 10, 6) (2, 10, 8) (2, 10, 10) (2, 10, 12)	(4, 10, 2) (4, 10, 4) (4, 10, 6) (4, 10, 8) (4, 10, 10) (4, 10, 12)	(6, 10, 2) (6, 10, 4) (6, 10, 6) (6, 10, 8) (6, 10, 10) (6, 10, 12)
(2, 12, 2) (2, 12, 4) (2, 12, 6) (2, 12, 8) (2, 12, 10) (2, 12, 12)	(4, 12, 2) (4, 12, 4) (4, 12, 6) (4, 12, 8) (4, 12, 10) (4, 12, 12)	(6, 12, 2) (6, 12, 4) (6, 12, 6) (6, 12, 8) (6, 12, 10) (6, 12, 12)

Starting with 8	Starting with 10	Starting with 12
(8, 2, 2) (8, 2, 4) (8, 2, 6) (8, 2, 8) (8, 2, 10) (8, 2, 12)	(10, 2, 2) (10, 2, 4) (10, 2, 6) (10, 2, 8) (10, 2, 10) (10, 2, 12)	(12, 2, 2) (12, 2, 4) (12, 2, 6) (12, 2, 8) (12, 2, 10) (12, 2, 12)
(8, 4, 2) (8, 4, 4) (8, 4, 6) (8, 4, 8) (8, 4, 10) (8, 4, 12)	(10, 4, 2) (10, 4, 4) (10, 4, 6) (10, 4, 8) (10, 4, 10) (10, 4, 12)	(12, 4, 2) (12, 4, 4) (12, 4, 6) (12, 4, 8) (12, 4, 10) (12, 4, 12)
(8, 6, 2) (8, 6, 4) (8, 6, 6) (8, 6, 8) (8, 6, 10) (8, 6, 12)	(10, 6, 2) (10, 6, 4) (10, 6, 6) (10, 6, 8) (10, 6, 10) (10, 6, 12)	(12, 6, 2) (12, 6, 4) (12, 6, 6) (12, 6, 8) (12, 6, 10) (12, 6, 12)
(8, 8, 2) (8, 8, 4) (8, 8, 6) (8, 8, 8) (8, 8, 10) (8, 8, 12)	(10, 8, 2) (10, 8, 4) (10, 8, 6) (10, 8, 8) (10, 8, 10) (10, 8, 12)	(12, 8, 2) (12, 8, 4) (12, 8, 6) (12, 8, 8) (12, 8, 10) (12, 8, 12)
(8, 10, 2) (8, 10, 4) (8, 10, 6) (8, 10, 8) (8, 10, 10) (8, 10, 12)	(10, 10, 2) (10, 10, 4) (10, 10, 6) (10, 10, 8) (10, 10, 10) (10, 10, 12)	(12, 10, 2) (12, 10, 4) (12, 10, 6) (12, 10, 8) (12, 10, 10) (12, 10, 12)
(8, 12, 2) (8, 12, 4) (8, 12, 6) (8, 12, 8) (8, 12, 10) (8, 12, 12)	(10, 12, 2) (10, 12, 4) (10, 12, 6) (10, 12, 8) (10, 12, 10) (10, 12, 12)	(12, 12, 2) (12, 12, 4) (12, 12, 6) (12, 12, 8) (12, 12, 10) (12, 12, 12)
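The 216 triples could also be generated in code rather than by hand. This is a sketch, not the file from the Kaggle project: only the `Mean` column name comes from the code in this post, and the `Sample` column name is a placeholder of mine.

```python
from itertools import product

import pandas as pd

scores = [2, 4, 6, 8, 10, 12]

# Every ordered triple drawn with replacement: 6 ** 3 = 216 samples.
samples = list(product(scores, repeat=3))

df = pd.DataFrame({
    "Sample": samples,                           # hypothetical column name
    "Mean": [sum(s) / len(s) for s in samples],  # column name used in this post
})

print(len(df))             # number of samples
print(df["Mean"].mean())   # compare with the population mean of the six scores
```

A frame like this could be written out with `df.to_csv(...)` to produce a file similar to the one used in the graphing step.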

Let’s import Pandas and Matplotlib

import pandas as pd #importing pandas
import matplotlib.pyplot as plt #importing matplotlib

Then we can access the samples_and_means.csv from Kaggle and use it to graph the values.

# Load the samples and their means; read_csv already returns a DataFrame.
df = pd.read_csv("/kaggle/input/sample-mean-of-the-our-samples/samples_and_means.csv")
# Count how often each sample mean occurs, then order the counts by the mean value.
counts = df["Mean"].value_counts().sort_index()
counts
# Plot the frequency of each sample mean as a bar chart.
counts.plot(kind="bar", color="salmon")
plt.xlabel("Sample mean")
plt.ylabel("Frequency")
plt.show()

And this is the output.

Final thoughts

We can see that the graph is approaching a normal distribution, and once we’ve reached this conclusion, solving problems becomes much easier. Here is why this is important in the field of statistics:

Predictability from Chaos: No matter how "weird" or skewed your original population distribution is, the distribution of the sample means pulls toward a Normal Distribution (a bell curve) as your sample size (n) increases, provided the population variance is finite.
The Center Holds: The mean of all your sample means will be exactly equal to the original population mean.
Reduced Risk: As you take larger samples, the spread (Standard Error) of those sample means gets smaller, making your estimates much more precise.
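The last two points can be checked exactly on the quiz population by enumerating every possible sample for a few sample sizes. The helper name `mean_and_spread` is made up for this sketch.

```python
import math
import statistics
from itertools import product

population = [2, 4, 6, 8]
sigma = statistics.pstdev(population)  # population standard deviation

def mean_and_spread(n: int) -> tuple[float, float]:
    """Mean and standard deviation of all sample means for sample size n."""
    means = [sum(s) / n for s in product(population, repeat=n)]
    return statistics.mean(means), statistics.pstdev(means)

for n in range(1, 5):
    mean, spread = mean_and_spread(n)
    print(f"n={n}  mean of means={mean:.2f}  std of means={spread:.3f}  "
          f"sigma/sqrt(n)={sigma / math.sqrt(n):.3f}")
```

For every n the mean of the sample means stays at the population mean, and the last two columns agree, which is the standard error formula at work.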

Before you go: I’ve built this series to share my journey in learning data science, so you may subscribe if you’re interested in following it. Thanks.

https://mosharaqi.hashnode.dev/series/data-science

If you think I’m mistaken, please don’t hesitate to reach out to me on LinkedIn:

https://www.linkedin.com/in/mosharaqi