The Central Limit Theorem (Part 2)

In the activity The Central Limit Theorem (Part 1), we concluded with the following observations on the Central Limit Theorem.

  1. If you draw samples from a normal distribution, then the distribution of sample means is also normal.
  2. The mean of the distribution of sample means is identical to the mean of the "parent population," the population from which the samples are drawn.
  3. The higher the sample size that is drawn, the "narrower" will be the spread of the distribution of sample means.

We stated that we would refine this statement of the Central Limit Theorem in further activities. Let's proceed to do that now.

What if the Parent Distribution is not Normal?

In the activity The Central Limit Theorem (Part 1), we drew our random samples from a "parent" population whose distribution was "normal." In this activity, we'll choose parent populations that are not normal, then see if the conclusions of the Central Limit Theorem still hold.

In the activity Continuous Distributions, we introduced the Exponential Distribution defined by the following probability density function.

The Exponential Probability Density Function

f(x) = λe^(-λx) for x ≥ 0, and f(x) = 0 for x < 0.

Figure 1. The exponential probability density function has both mean and standard deviation equal to 1/λ.

We can easily plot the distribution for λ = 1.

> curve(dexp(x,rate=1),0,4,lwd=2,col="red",ylab="p")

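Some comments are in order for the command curve.

  1. The first argument, dexp(x,rate=1), is the expression to be plotted as a function of x. Here it is the exponential density with rate λ = 1.
  2. The next two arguments, 0 and 4, set the left and right endpoints of the plotting interval.
  3. The arguments lwd=2 and col="red" set the line width and the line color.
  4. The argument ylab="p" sets the label on the vertical axis.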

The curve command above produces the probability density curve for the exponential distribution shown in Figure 2.

Figure 2. Sketching the probability density function for the exponential distribution (λ = 1).

Note that the exponential distribution shown in Figure 2 is not normal.
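The statement in Figure 1 that the mean and standard deviation both equal 1/λ can be checked numerically. The following is a minimal sketch, assuming λ = 1 as in Figure 2; it uses R's integrate command, and the variable names mu and sigma2 are chosen only for illustration.

```r
# Numerical check that the exponential distribution with rate lambda
# has mean and standard deviation 1/lambda (here lambda = 1).
lambda = 1

# Mean: integrate x * f(x) from 0 to infinity
mu = integrate(function(x) x * dexp(x, rate = lambda), 0, Inf)$value

# Variance: integrate (x - mu)^2 * f(x) from 0 to infinity
sigma2 = integrate(function(x) (x - mu)^2 * dexp(x, rate = lambda), 0, Inf)$value

mu             # approximately 1
sqrt(sigma2)   # approximately 1
```

Changing lambda to another positive value should give a mean and standard deviation close to 1/lambda.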

The Letter r — Drawing Random Numbers

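In previous activities (e.g., The Normal Distribution and Continuous Distributions), we introduced the use of the letters d, p, and q in relation to the various distributions (e.g., normal, uniform, and exponential). A reminder of their use follows:

  1. The letter d (e.g., dexp) returns the value of the probability density function at a given value.
  2. The letter p (e.g., pexp) returns the cumulative probability, that is, the probability that the random variable is less than or equal to a given value.
  3. The letter q (e.g., qexp) returns the quantile, that is, the value below which a given proportion of the distribution lies.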
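The three letters can be tried out on the exponential distribution with λ = 1. This is only a small sketch; the variable names d1, p1, q1, and back are chosen for illustration.

```r
# d: height of the density curve at x = 1, which equals exp(-1)
d1 = dexp(1, rate = 1)

# p: cumulative probability P(X <= 1), which equals 1 - exp(-1)
p1 = pexp(1, rate = 1)

# q: the value below which half of the distribution lies (the median),
#    which equals log(2)
q1 = qexp(0.5, rate = 1)

# p and q undo one another: this recovers the value 1 (up to rounding)
back = qexp(pexp(1, rate = 1), rate = 1)

c(d1, p1, q1, back)
```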

There is a fourth letter, namely "r", that is used to draw random numbers from a distribution. Let's use the rexp command to draw 500 numbers at random from the exponential distribution having mean 1 and standard deviation 1.

> x=rexp(500,rate=1)

We can view the results, some of which are shown below.

> x
  [1] 0.279198724 2.974863732 1.784980038 1.089242851 1.654580539
  [6] 6.817406399 0.521311578 0.620673515 0.012819642 2.024541829
 [11] 1.489907151 0.316853154 0.756492146 1.915256215 0.087262080

...
...

[496] 0.609009484 0.504856527 3.498884801 0.871710847 1.009950972

When you examine the numbers stored in the variable x, it is difficult to get a sense of the distribution of numbers. However, a histogram of this selection provides a better understanding of the data stored in x.

> hist(x,prob=TRUE)

The above command produces the histogram shown in Figure 3.

Figure 3. A histogram of 500 random numbers drawn from the exponential distribution with λ = 1

Several comments are in order regarding the histogram in Figure 3:

  1. The histogram is not normal. Indeed, the distribution is decidedly skewed to the right.
  2. Visually, it is not unreasonable to estimate that the "balance point" or "center" of the distribution is near 1. Indeed, a quick calculation shows that the mean of the sample is very close to 1.
    > mean(x)
    [1] 1.006549
    

The Distribution of Sample Means

In our previous example, we drew 500 random numbers from an exponential distribution with mean and standard deviation equal to 1. This is called "drawing a sample of size 500" from that distribution. One immediate question we can ask is "what is the mean of our sample?"

> mean(x)
[1] 1.006549

Thus, the mean of this sample is 1.006549.

Of course, if we take another sample of 500 random numbers from the exponential distribution with mean and standard deviation equal to 1, we get a new sample that has a different mean.

> x=rexp(500,rate=1)
> mean(x)
[1] 0.9780556

In this case, we have a new sample of 500 randomly selected numbers, and the mean of this sample provides a different result, namely 0.9780556. The next question to ask is "what happens if we do this repeatedly?"
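Because each call to rexp draws fresh random numbers, the means you obtain will not match the values printed above. If a repeatable result is wanted, one option is R's set.seed command. The sketch below assumes an arbitrary seed of 123; the names x1 and x2 are for illustration only.

```r
set.seed(123)              # 123 is an arbitrary seed chosen for illustration
x1 = rexp(500, rate = 1)

set.seed(123)              # resetting the same seed ...
x2 = rexp(500, rate = 1)   # ... reproduces exactly the same sample

identical(x1, x2)          # TRUE
mean(x1) == mean(x2)       # TRUE
```

Without the second call to set.seed, x2 would be a new sample with a different mean.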

Producing a Vector of Sample Means

In the next experiment, we will repeatedly sample from the exponential distribution. Each sample will select five random numbers from the exponential distribution having mean and standard deviation equal to 1. We will then find the mean of the five numbers in our sample. We will repeat this experiment 500 times, collecting the sample means in a vector xbar as we go.

We begin by declaring the rate of the exponential distribution from which we will draw random numbers. Then we declare the sample size (the number of random numbers drawn).

> lambda=1
> n=5

Each time we draw a sample of size n = 5 from the exponential distribution having mean μ = 1 and standard deviation σ = 1, we need someplace to store the mean of the sample. Because we intend to collect the means of 500 samples, we initialize a vector xbar to contain 500 zeros.

> xbar=rep(0,500)

The rep command "repeats" the entry zero 500 times. As a result, the vector xbar now contains 500 entries, each of which is zero.

It is easy to draw a sample of size n = 5 from the exponential distribution having mean μ = 1 and standard deviation σ = 1. We simply issue the command rexp(n,rate=lambda). To find the mean of this result, we wrap the command in a call to mean, as in mean(rexp(n,rate=lambda)). The final step is to store this result in the vector xbar. Then we must repeat this same process an additional 499 times for a total of 500 sample means. This requires the use of a for loop.

> for (i in 1:500) { xbar[i]=mean(rexp(n,rate=lambda)) }

The for construct used by R is similar to the "for loops" used in many programming languages.
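As an aside, R also provides the replicate command, which repeats an expression a given number of times and collects the results. The following sketch is an alternative to the for loop above, not the method used in the rest of this activity; the seed is arbitrary and is set only so that the result can be reproduced.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
lambda = 1
n = 5

# replicate() evaluates the expression 500 times and returns the
# 500 sample means as a vector, so no preallocation with rep() is needed
xbar = replicate(500, mean(rexp(n, rate = lambda)))

length(xbar)   # 500
```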

It is a simple task to sketch the histogram of the sample means contained in the vector xbar.

> hist(xbar,prob=TRUE,breaks=12)

The above command produces the histogram shown in Figure 4.

Figure 4. The histogram of the sample means.

There are a number of important observations to be made about the histogram of sample means in Figure 4, particularly when it is compared with the density curve of Figure 2 and the histogram of Figure 3.

  1. It is essential to note the labels on the horizontal axis. In Figures 2 and 3, the label is x. This is because Figures 2 and 3 describe the distribution of individual numbers: Figure 2 shows the exponential density itself, and Figure 3 shows 500 random numbers selected from the exponential distribution with mean μ = 1 and standard deviation σ = 1. On the other hand, the histogram of Figure 4 describes the distribution of 500 sample means, each of which was found by selecting n = 5 numbers from the exponential distribution with mean μ = 1 and standard deviation σ = 1, then computing their mean (average). The horizontal axis in Figure 4 emphasizes this fact with the label xbar.
  2. It is important to note that the distribution of xbar in Figure 4 is not normal in shape. Indeed, the distribution is decidedly skewed to the right.

Increasing the Sample Size

Let's repeat the last experiment, but this time let's draw samples of size n = 10 from the same "parent population," the exponential distribution having mean μ = 1 and standard deviation σ = 1.

> lambda=1
> n=10
> xbar=rep(0,500)
> for (i in 1:500) { xbar[i]=mean(rexp(n,rate=1))}
> hist(xbar,prob=TRUE,breaks=12)

The above commands produce the histogram in Figure 5.

Figure 5. Increasing the sample size to n = 10.

The histogram in Figure 5 is still not normal in shape. Again, it is definitely skewed to the right, though perhaps not as much as the histogram in Figure 4 that was produced with a smaller sample size.

Let's increase the sample size to n = 20 and repeat the experiment.

> lambda=1
> n=20
> xbar=rep(0,500)
> for (i in 1:500) { xbar[i]=mean(rexp(n,rate=1))}
> hist(xbar,prob=TRUE,breaks=12)

The code above will produce the image in Figure 6.

Figure 6. Increasing the sample size to n = 20.

Aha! The histogram in Figure 6 is beginning to take on the appearance of a normal distribution. The "right-skewness" is starting to disappear when compared with the histograms in Figures 4 and 5. Let's increase the sample size one more time, to n = 30.

> lambda=1
> n=30
> xbar=rep(0,500)
> for (i in 1:500) { xbar[i]=mean(rexp(n,rate=1))}
> hist(xbar,prob=TRUE,breaks=12)

The code above will produce the image in Figure 7.

Figure 7. Increasing the sample size to n = 30.

The histogram in Figure 7 has, to a good approximation, the symmetric bell-shape of the normal distribution.

Key Observation: It would appear that the distribution of the sample means will be approximately normal in shape, regardless of the shape of the "parent" population, provided the sample size is large enough. In Figure 7, a sample size of n = 30 seemed to be enough for the distribution of sample means to look normal in shape, even though the samples were drawn from the exponential distribution, a distribution that is highly skewed to the right.
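The visual impression that the skewness fades as n grows can also be put into numbers. The sketch below is an illustration under stated assumptions: skew is a hypothetical helper (not a built-in R command) that computes the sample skewness, the seed is arbitrary, and 5000 sample means are drawn for each n, rather than the 500 used above, so that the estimates are steadier. For the mean of n independent exponential values the theoretical skewness is 2/√n, which is included for comparison.

```r
set.seed(2)   # arbitrary seed, for reproducibility only

# Hypothetical helper: sample skewness (third standardized moment).
# A value near 0 indicates a symmetric distribution.
skew = function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5

sizes = c(5, 10, 20, 30)

# For each sample size, simulate 5000 sample means and measure their skewness
skews = sapply(sizes, function(n) {
  xbar = replicate(5000, mean(rexp(n, rate = 1)))
  skew(xbar)
})

# Simulated skewness alongside the theoretical value 2/sqrt(n)
round(rbind(n = sizes, simulated = skews, theory = 2 / sqrt(sizes)), 3)
```

The skewness shrinks steadily as n increases but does not vanish at n = 30, which is worth keeping in mind when judging how "normal" a histogram looks.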

There are two more important observations to be made about the distribution shown in Figure 7:

  1. Key Observation: The histogram in Figure 7 appears to be "balanced" or "centered" about xbar = 1. Indeed:
    > mean(xbar)
    [1] 0.9978525
    
    This mean is essentially the same as the mean of the "parent" population (the exponential distribution with mean μ = 1) from which our samples were drawn.
  2. Key Observation: The standard deviation of the distribution in Figure 7 appears to be much smaller than the standard deviation of the parent population (the exponential distribution had σ = 1). Indeed, if we recall that nearly all data in a normal distribution falls within about three standard deviations of the mean, the histogram in Figure 7 would indicate a standard deviation of about 0.2. We will have more to say about the standard deviation of the sample means in a later activity.
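The visual estimate of the standard deviation can be compared with a direct calculation. The following sketch regenerates 500 sample means with n = 30 (using an arbitrary seed, so the values will differ slightly from those behind Figure 7) and then checks what proportion of the sample means lies within three standard deviations of their center. The names s and within3 are chosen for illustration.

```r
set.seed(3)   # arbitrary seed, for reproducibility only

# 500 sample means, each from a sample of size n = 30
xbar = replicate(500, mean(rexp(30, rate = 1)))

# Standard deviation of the sample means; compare with the
# visual estimate of about 0.2
s = sd(xbar)
s

# Proportion of sample means within three standard deviations of the center
within3 = mean(abs(xbar - mean(xbar)) < 3 * s)
within3
```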

Sampling from a Discrete Population

Let's repeat the experiment again (increasing the sample size), only this time let's use a "parent" population that is discrete and skewed to the left. Specifically:

A Discrete Distribution
x p
1 0.1
2 0.1
3 0.1
4 0.1
5 0.2
6 0.4

We can "load" the values of the random variable and their probabilities in R as follows:

> x=c(1,2,3,4,5,6)
> p=c(0.1,0.1,0.1,0.1,0.2,0.4)

We can provide a "stick" plot of this discrete distribution with the following code:

x=c(1,2,3,4,5,6)
p=c(0.1,0.1,0.1,0.1,0.2,0.4)
plot(x,p,type="h",lwd=2,col="red",ylim=c(0,0.5))
points(x,p,pch=16,cex=2,col="black")

The code above will produce the plot of the discrete distribution shown in Figure 7.

Figure 7. A discrete distribution that is badly skewed to the left.

Determine the Mean of the Discrete Distribution

Here is a simple formula for computing the mean of a discrete distribution.

A Formula for the Mean of a Discrete Distribution

μ = Σ x · p(x), where the sum runs over all values x of the random variable.

Figure 1. The mean is found by summing the products of the values of the random variable and their associated probabilities.

Thus, we can find the mean of our discrete distribution with the following calculation:

μ = 1(0.1) + 2(0.1) + 3(0.1) + 4(0.1) + 5(0.2) + 6(0.4) = 0.1 + 0.2 + 0.3 + 0.4 + 1.0 + 2.4 = 4.4

Making this calculation in R greatly simplifies the task. First find the product of the vectors x and p. Note: When you take the product of two vectors, R will produce a third vector, each entry of which is the product of the corresponding entries in the vectors being multiplied.

> x*p
[1] 0.1 0.2 0.3 0.4 1.0 2.4

Sum these numbers to find the mean.

> sum(x*p)
[1] 4.4

Thus, the mean of the discrete distribution is μ = 4.4.
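Two quick checks are possible here. The sketch below first confirms that the probabilities sum to 1, as they must for any probability distribution, and then recomputes the mean with R's weighted.mean command, which returns sum(x*p)/sum(p). The name mu is chosen for illustration.

```r
x = c(1, 2, 3, 4, 5, 6)
p = c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4)

# The probabilities must sum to 1; all.equal() allows for
# floating-point rounding in the sum
isTRUE(all.equal(sum(p), 1))

# weighted.mean() computes sum(x * p) / sum(p), which agrees with
# sum(x * p) because the probabilities sum to 1
mu = weighted.mean(x, p)
mu
```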

Key Observation: Look again at the "stick" plot in Figure 7. Imagine a "knife-edge" at 4.4 and set the distribution of Figure 7 atop the "knife-edge." Will the distribution balance? Keep in mind the "principle of the lever" or the "teeter-totter effect." The outliers, such as x = 1, positioned a farther distance from the mean but with lower probability, can balance values with more "massive" probabilities that are clumped closer to the mean. With these thoughts in mind, the mean value μ = 4.4 seems reasonable.

Sampling from the Discrete Distribution

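In the activity Sampling a Discrete Population, the sample command was used to draw a sample. The syntax sample(x, size, replace=, prob=) indicates that we must provide the following arguments:

  1. x is the vector of values from which to draw.
  2. size is the number of values to draw.
  3. replace=TRUE draws with replacement, so that the same value may be selected more than once.
  4. prob is the vector of probabilities assigned to the values in x.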

Let's draw a sample of size 1000 and create a histogram of the resulting sample.

n=1000
xs=sample(x,size=n,replace=TRUE,prob=p)

As in the activity Sampling a Discrete Distribution, the table command provides a nice summary of the sample.

> table(xs)
xs
  1   2   3   4   5   6 
 97 116  91 102 189 405 

Even better is a visualization provided by a barplot.

> barplot(table(xs)/length(xs),xlab="x",ylab="frequency")

This last command produces the barplot shown in Figure 8.

Figure 8. Plotting a sample of size 1000 selected from the discrete distribution.
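The bar heights in such a plot can also be compared numerically with the probabilities of the parent distribution. The sketch below draws its own sample of size 1000 with an arbitrary seed, so the counts will differ from those shown above; the name rel_freq is chosen for illustration.

```r
set.seed(4)   # arbitrary seed, for reproducibility only

x = c(1, 2, 3, 4, 5, 6)
p = c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4)
n = 1000
xs = sample(x, size = n, replace = TRUE, prob = p)

# factor() with explicit levels keeps an entry for every value of x,
# even if some value happens not to appear in the sample
rel_freq = as.numeric(table(factor(xs, levels = x))) / n

# Theoretical probabilities next to the observed relative frequencies
round(rbind(x = x, theoretical = p, observed = rel_freq), 3)
```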

The barplot in Figure 8 summarizes a random sample from the discrete distribution shown in Figure 7, whose theoretical mean is 4.4. Let's see what the mean of our sample is.

> mean(xs)
[1] 4.385

Of course, if we draw another random sample, we will get a different sample mean.

> xs=sample(x,size=n,replace=TRUE,prob=p)
> mean(xs)
[1] 4.479

The Distribution of Sample Means

As in the last example, let's start with a sample size n = 5. We will draw 500 samples, then plot the distribution of the sample means using a histogram.

n=5
xbar=rep(0,500)
for (i in 1:500) {
	xbar[i]=mean(sample(x,size=n,replace=TRUE,prob=p))
}
hist(xbar,prob=TRUE,breaks=12)

The code above produces the histogram shown in Figure 9.

Figure 9. Plotting samples of size n = 5 selected from the discrete distribution.

Note that the distribution of sample means in Figure 9 is not normal. Indeed, the distribution is skewed to the left. This is to be expected as the "parent" population is highly skewed to the left and the sample size we are using is quite small. Let's increase the sample size and see what happens.

n=10
xbar=rep(0,500)
for (i in 1:500) {
	xbar[i]=mean(sample(x,size=n,replace=TRUE,prob=p))
}
hist(xbar,prob=TRUE,breaks=12)

The code above produces the histogram shown in Figure 10.

Figure 10. Plotting samples of size n = 10 selected from the discrete distribution.

There's a bit of an improvement (starting to look somewhat normal) but the distribution of sample means in Figure 10 is still skewed to the left. Let's increase the sample size and see what happens.

n=20
xbar=rep(0,500)
for (i in 1:500) {
	xbar[i]=mean(sample(x,size=n,replace=TRUE,prob=p))
}
hist(xbar,prob=TRUE,breaks=12)

The code above produces the histogram shown in Figure 11.

Figure 11. Plotting samples of size n = 20 selected from the discrete distribution.

The distribution is now taking on the symmetry and "bell-shape" of the normal distribution. Let's increase the sample size one more time and see what happens.

n=30
xbar=rep(0,500)
for (i in 1:500) {
	xbar[i]=mean(sample(x,size=n,replace=TRUE,prob=p))
}
hist(xbar,prob=TRUE,breaks=12)

The code above produces the histogram shown in Figure 12.

Figure 12. Plotting samples of size n = 30 selected from the discrete distribution.

The histogram in Figure 12 has, to a good approximation, the symmetric bell-shape of the normal distribution.

Key Observation: It would appear that the distribution of the sample means will be approximately normal in shape, regardless of the shape of the "parent" population, provided the sample size is large enough. In Figure 12, a sample size of n = 30 seemed to be enough for the distribution of sample means to look normal in shape, even though the samples were drawn from a discrete distribution that is highly skewed to the left.

Key Observation: The histogram in Figure 12 appears to be "balanced" or "centered" about xbar = 4.4. Indeed:

> mean(xbar)
[1] 4.396933

This mean is essentially the same as the mean of the "parent" population (the discrete distribution of Figure 7 with mean μ = 4.4) from which our samples were drawn.

The Central Limit Theorem

We are now in a position to refine our statement of the Central Limit Theorem.

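  1. If you draw samples from a distribution, then the distribution of sample means is approximately normal, provided a large enough sample size is used. For the two parent distributions in this activity, a sample size of n = 30 appeared to be sufficient, and n = 30 is a commonly quoted rule of thumb; more heavily skewed parent populations may require larger samples. Note: The Central Limit Theorem should not be confused with the Law of Large Numbers, a related but distinct result stating that the sample mean tends to get closer to the mean of the parent population as the sample size grows.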
  2. The mean of the distribution of sample means is identical to the mean of the "parent population," the population from which the samples are drawn.
  3. The higher the sample size that is drawn, the "narrower" will be the spread of the distribution of sample means.

This statement of the Central Limit Theorem is still not complete. We still need to discuss just how the standard deviation of the sample means varies with the sample size. We will attack this question in a later activity.
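The third observation can be seen directly by measuring the spread of the sample means for each sample size. The sketch below uses the discrete parent distribution from this activity with an arbitrary seed; it only reports the simulated standard deviations and makes no claim about the exact relationship, which is left for the later activity. The name spread is chosen for illustration.

```r
set.seed(5)   # arbitrary seed, for reproducibility only

x = c(1, 2, 3, 4, 5, 6)
p = c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4)
sizes = c(5, 10, 20, 30)

# For each sample size, draw 500 samples and record the
# standard deviation of the 500 sample means
spread = sapply(sizes, function(n) {
  xbar = replicate(500, mean(sample(x, size = n, replace = TRUE, prob = p)))
  sd(xbar)
})

round(rbind(n = sizes, sd_of_xbar = spread), 3)
```

The standard deviations should decrease from left to right as the sample size increases.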

Enjoy!

We hope you enjoyed this second activity on the Central Limit Theorem. We encourage you to explore further. You might try repeating the experiments in this activity with different "parent" distributions.
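As one possible starting point, the experiment can be wrapped in a small function so that the parent distribution is easy to swap. This is a sketch only: sample_means is a hypothetical helper, not a built-in R command, the seed is arbitrary, and the uniform distribution on [0, 1] (whose mean is 0.5) is chosen merely as an example of a different parent.

```r
set.seed(6)   # arbitrary seed, for reproducibility only

# Hypothetical helper: 'draw' is a function that returns a random sample
# of a given size; the result is a vector of 'reps' sample means
sample_means = function(draw, n, reps = 500) {
  replicate(reps, mean(draw(n)))
}

# Parent population: uniform on [0, 1], which has mean 0.5
xbar = sample_means(function(n) runif(n, min = 0, max = 1), n = 30)

mean(xbar)   # should be close to 0.5
```

A histogram of the result can then be drawn with the same hist(xbar,prob=TRUE,breaks=12) command used throughout this activity.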