The Central Limit Theorem (Part 1)
One of the most important theorems in all of statistics is the Central Limit Theorem. It is closely related to, but distinct from, the Law of Large Numbers. The introduction of the Central Limit Theorem requires examining a number of new concepts as well as introducing a number of new commands in the R programming language. Consequently, we will break our introduction of the Central Limit Theorem into several parts.
In this first part of the introduction to the Central Limit Theorem, we will show how to draw and visualize a sample of random numbers from a distribution. From there we will examine the mean and standard deviation of the sample, then examine the distribution of the sample means.
We begin by learning how to draw random numbers from a distribution.
The Letter r — Drawing Random Numbers
In previous activities (e.g., The Normal Distribution and Continuous Distributions), we introduced the use of the letters d, p, and q in relation to the various distributions (e.g., normal, uniform, and exponential). A reminder of their use follows:
- "d" is for "density." It is used to find values of the probability density function.
- "p" is for "probability." It is used to find the probability that the random variable lies to the left of a given number.
- "q" is for "quantile." It is used to find the quantiles of a given distribution.
There is a fourth letter, namely "r", that is used to draw random numbers from a distribution. So, for example, runif and rexp would be used to draw random numbers from the uniform and exponential distributions, respectively.
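As a quick refresher, the four letters can be tried side by side on the normal distribution with mean 100 and standard deviation 10. This is only a sketch: the variable names (d_val, p_val, q_val, r_val) and the seed are chosen for illustration and are not part of the original activity.

```r
# d: height of the density curve of N(100, 10) at x = 100 (about 0.0399)
d_val = dnorm(100, mean=100, sd=10)

# p: probability that the random variable lies to the left of 110 (about 0.8413)
p_val = pnorm(110, mean=100, sd=10)

# q: the 0.975 quantile, i.e., the number with 97.5% of the area to its left (about 119.6)
q_val = qnorm(0.975, mean=100, sd=10)

# r: draw three random numbers from the same distribution
set.seed(1)  # assumption: a fixed seed, used only so the draw can be repeated
r_val = rnorm(3, mean=100, sd=10)

# The same letter works for other distributions:
u_val = runif(3, min=0, max=1)  # three draws from the uniform distribution on (0,1)
e_val = rexp(3, rate=1)         # three draws from the exponential distribution with rate 1
```

The values returned by the "r" commands change from run to run unless a seed is set first.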
Let's use the rnorm command to draw 500 numbers at random from a normal distribution having mean 100 and standard deviation 10.
> x=rnorm(500,mean=100,sd=10)
We can view the results, some of which are shown below.
> x
  [1] 110.67263 102.07696 114.41904  98.52447  95.31791 103.32522
  [7]  96.85134 105.37060  98.60348 103.72672 101.82439 107.65795
 [13]  96.91467  89.42021 106.06962 111.24015  90.51377 108.22921
...
[493] 110.61727  87.24973 113.95993  89.80688 106.44881 109.89394
[499]  87.57305  90.88494
When you examine the numbers stored in the variable x, there is a sense that you are pulling random numbers that are clumped about a mean of 100. However, a histogram of this selection provides a better understanding of the data stored in x.
> hist(x,prob=TRUE)
The above command produces the histogram shown in Figure 1.

Figure 1. A histogram of 500 random numbers drawn from a normal distribution with mean 100 and standard deviation 10.
Several comments are in order regarding the histogram in Figure 1:
- The histogram is approximately normal in shape.
- The "balance point" of the histogram appears to be located near 100, suggesting that the random numbers were drawn from a distribution having mean 100.
- Almost all of the values lie within 3 increments of 10 of the mean (that is, between 70 and 130), suggesting that the random numbers were drawn from a distribution having standard deviation 10.
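The visual impressions listed above can also be checked numerically. The following is a minimal sketch, assuming a fixed seed (so your numbers will differ from the ones plotted in Figure 1) and using illustrative variable names that do not appear elsewhere in this activity.

```r
set.seed(42)  # assumption: fixed seed so the run can be repeated
x = rnorm(500, mean=100, sd=10)

# The "balance point": the sample mean should land near 100
sample_mean = mean(x)

# The spread: the sample standard deviation should land near 10
sample_sd = sd(x)

# Fraction of the values within 3 increments of 10 of the mean (70 to 130)
frac_within_3sd = mean(abs(x - 100) <= 30)

sample_mean
sample_sd
frac_within_3sd
```

The expression abs(x - 100) <= 30 produces a vector of TRUE and FALSE values, and taking its mean gives the proportion of TRUE entries.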
Let's try the experiment again, drawing a new set of 500 random numbers from the normal distribution having mean 100 and standard deviation 10.
> x=rnorm(500,mean=100,sd=10)
> hist(x,prob=TRUE,ylim=c(0,0.04))
These commands produce the plot shown in Figure 2.

Figure 2. A second drawing of 500 random numbers from a normal distribution having mean 100 and standard deviation 10.
The histogram in Figure 2 is different from the histogram shown in Figure 1, owing to the "random" selection of numbers. However, it does share some common traits with the histogram shown in Figure 1: (1) it appears "normal" in shape, (2) it appears to be "balanced" or "centered" about 100, and (3) almost all of the data appear to occur within 3 increments of 10 of the mean. This is good evidence that the random numbers have been drawn from a normal distribution having mean 100 and standard deviation 10. We can provide further evidence of this claim by superimposing a normal probability density function with mean 100 and standard deviation 10.
> curve(dnorm(x,mean=100,sd=10),70,130,add=TRUE,lwd=2,col="red")
The above command superimposes the normal curve shown in Figure 3.

Figure 3. Superimposing a normal curve having mean 100 and standard deviation 10.
The curve command is new. Some comments on its use are in order:
- In its simplest form, the syntax curve(f(x),from=,to=) draws the "function" defined by f(x) on the interval (from,to). Our function is dnorm(x,mean=100,sd=10). The curve command sketches this function of x on the interval (from,to).
- The notation "from=" and "to=" may be omitted if the arguments are submitted in the proper order to the curve command, function first, value of "from" second, then value of "to" third. That is what we've done, substituting 70 for "from" and 130 for "to".
- If the argument "add" is set to TRUE, as we have done, then the curve is "added" to the existing figure. If this argument is omitted, or if it is set to FALSE, then a new plot is drawn, erasing any previous graphics drawn.
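The comments above can be tried in one short script. This sketch spells out the "from=" and "to=" names explicitly; the seed and the variable name pts are assumptions made for illustration.

```r
set.seed(7)  # assumption: fixed seed so the run can be repeated
x = rnorm(500, mean=100, sd=10)

# Draw the histogram first; curve(..., add=TRUE) needs an existing plot
hist(x, prob=TRUE, ylim=c(0,0.04))

# Same command as before, but with "from" and "to" named explicitly.
# Inside curve, the x in dnorm(x, ...) is the curve's own plotting
# variable, not the data vector x defined above.
pts = curve(dnorm(x, mean=100, sd=10), from=70, to=130,
            add=TRUE, lwd=2, col="red")

# curve invisibly returns the points it plotted (101 of them by default)
length(pts$x)
range(pts$x)
```

Naming the arguments makes the command a little longer, but the order in which "from" and "to" are supplied then no longer matters.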
The Distribution of Sample Means
In our previous examples, we drew 500 random numbers from a normal distribution with mean 100 and standard deviation 10. This is called "drawing a sample of size 500" from the normal distribution with mean 100 and standard deviation 10. One immediate question we can ask is "what is the mean of our sample?"
> mean(x)
[1] 99.75439
Thus, the mean of this sample is 99.75439.
Of course, if we take another sample of 500 random numbers from the normal distribution with mean 100 and standard deviation 10, we get a new sample that has a different mean.
> x=rnorm(500,mean=100,sd=10)
> mean(x)
[1] 99.91978
In this case, we have a new sample of 500 randomly selected numbers, and the mean of this sample provides a different result, namely 99.91978. The next question to ask is "what happens if we do this repeatedly?"
Producing a Vector of Sample Means
In the next activity, we will repeatedly sample from the normal distribution. Each sample will select five random numbers from the normal distribution having mean 100 and standard deviation 10. We will then find the mean of the five numbers in our sample. We will repeat this experiment 500 times, collecting the sample means in a vector xbar as we go.
We begin by declaring the mean and standard deviation of the distribution from which we will draw random numbers. Then we declare the sample size (the number of random numbers drawn).
> mu=100; sigma=10
> n=5
Each time we draw a sample of size n = 5 from the normal distribution having mean μ = 100 and standard deviation σ = 10, we need someplace to store the mean of the sample. Because we intend to collect the means of 500 samples, we initialize a vector xbar to contain 500 zeros.
> xbar=rep(0,500)
The rep command "repeats" the entry zero 500 times. As a result, the vector xbar now contains 500 entries, each of which is zero.
It is easy to draw a sample of size n = 5 from the normal distribution having mean μ = 100 and standard deviation σ = 10. We simply issue the command rnorm(n,mean=mu,sd=sigma). To find the mean of this result, we wrap that command in the mean command: mean(rnorm(n,mean=mu,sd=sigma)). The final step is to store this result in the vector xbar. Then we must repeat this same process an additional 499 times for a total of 500 sample means. This requires the use of a for loop.
> for (i in 1:500) { xbar[i]=mean(rnorm(n,mean=mu,sd=sigma)) }
The for construct used by R is similar to the "for loops" used in many programming languages.
- The i in for (i in 1:500) is called the index of the "for loop."
- The index i is first set equal to 1, then the "body" of the "for loop" (the part between the curly braces) is executed. On the next iteration, i is set equal to 2 and the body of the loop is executed again. The loop continues in this manner, incrementing the index i by 1, finally setting the index i to 500, upon which the body of the loop executes one last time. Then the "for loop" is terminated.
- In the body of the "for loop" we have xbar[i]=mean(rnorm(n,mean=mu,sd=sigma)). This draws a sample of size n = 5 from the normal distribution, calculates the mean of the sample, and stores the result in xbar[i], the ith entry of xbar.
- When the "for loop" completes 500 iterations, the vector xbar contains the means of 500 samples of size n = 5 drawn from the normal distribution having mean μ = 100 and standard deviation σ = 10.
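For comparison, the same vector of sample means can be built without an explicit loop by using the base R command replicate. This is a swapped-in alternative, not the method used in this activity, and the seed below is an assumption made only so the run can be repeated.

```r
set.seed(123)  # assumption: fixed seed so the run can be repeated
mu = 100; sigma = 10
n = 5

# replicate evaluates the expression 500 times and collects the results,
# so xbar ends up holding 500 sample means, just as with the "for loop"
xbar = replicate(500, mean(rnorm(n, mean=mu, sd=sigma)))

length(xbar)  # 500
mean(xbar)    # should land near mu = 100
```

With replicate there is no need to initialize xbar with rep beforehand, because the vector is created in a single step.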
It is a simple task to sketch the histogram of the sample means contained in the vector xbar.
> hist(xbar,prob=TRUE,breaks=12,xlim=c(70,130),ylim=c(0,0.1))
The above command produces the histogram shown in Figure 4.

Figure 4. The histogram of the sample means.
There are a number of important observations to be made about the histogram of sample means in Figure 4, particularly when it is compared with the histograms of Figures 2 and 3.
- It is essential to note the labels on the horizontal axis. In Figures 2 and 3, the label is x. This is because the histograms in Figures 2 and 3 are simply describing the shape of 500 random numbers selected from the normal distribution with mean μ = 100 and standard deviation σ = 10. On the other hand, the histogram of Figure 4 is describing the distribution of 500 sample means, each of which was found by selecting n = 5 numbers from the normal distribution with mean μ = 100 and standard deviation σ = 10, then computing their mean (average). The horizontal axis in Figure 4 emphasizes this fact with the label xbar.
- It is important to note that the distribution of xbar in Figure 4 appears "normal" in shape. This is so even though the sample size is relatively small (n = 5).
- It appears that the "balance point" or "center" of the distribution in Figure 4 occurs near 100. This can be checked with the following command:
> mean(xbar)
[1] 100.3104
This calculation provides the "mean of the sample means," so to speak. It is important to note that the mean of the sample means appears to be nearly identical to the mean of the distribution from which the samples were drawn.
- The distribution of sample means in Figure 4 appears to be "less spread out" or "narrower" than the distributions in Figures 2 and 3.
Increasing the Sample Size
Let's repeat the last experiment, but this time let's draw samples of size n = 10 from the same "parent population," the normal distribution having mean μ = 100 and standard deviation σ = 10.
> mu=100; sigma=10
> n=10
> xbar=rep(0,500)
> for (i in 1:500) {xbar[i]=mean(rnorm(n,mean=mu,sd=sigma))}
> hist(xbar,prob=TRUE,breaks=12,xlim=c(70,130),ylim=c(0,0.1))
The above commands produce the histogram in Figure 5.

Figure 5. Increasing the sample size decreases the spread.
The image in Figure 5 sheds light on three key ideas:
Key Idea: When we select samples from a normal distribution, then the distribution of sample means is also "normal" in shape.
Key Idea: The mean of the distribution of sample means appears to be the same as the mean of the "parent population" from which we selected our samples (see the "balance point" or "center" in Figure 5). This is easily checked.
> mean(xbar)
[1] 100.0866
Key Idea: By increasing the size of our samples (n = 10), the histogram of the sample means becomes "less spread out" or "narrower," as is clearly seen when contrasting the spread of the histograms in Figures 4 and 5. This behavior (increasing the sample size decreases the spread) seems quite reasonable. We would expect a more accurate estimate of the mean of the parent population if we take the mean of a larger sample. For example, you have a better chance of estimating the average height of the student population at a school if you ask ten students their heights and average them than if you ask only five students and average their heights.
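The narrowing can also be measured rather than judged by eye. The sketch below wraps the experiment in a helper function and reports the standard deviation of the 500 sample means for several sample sizes. The function name sample_means, its arguments, the list of sizes, and the seed are all hypothetical choices made for illustration.

```r
# Hypothetical helper (not part of the original activity): returns a vector
# of "reps" sample means, each computed from a sample of size n
sample_means = function(n, reps=500, mu=100, sigma=10) {
  xbar = rep(0, reps)
  for (i in 1:reps) {
    xbar[i] = mean(rnorm(n, mean=mu, sd=sigma))
  }
  xbar
}

set.seed(2024)  # assumption: fixed seed so the run can be repeated
sizes = c(5, 10, 15, 20, 25)

# Standard deviation of the sample means for each sample size
spread = sapply(sizes, function(n) sd(sample_means(n)))
names(spread) = sizes
spread
```

The reported spread should tend to shrink as n grows, although two neighboring sample sizes can occasionally come out close to one another because each entry is itself computed from random draws.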
The Central Limit Theorem
We finish with a statement of the Central Limit Theorem.
- If you draw samples from a normal distribution, then the distribution of sample means is also normal.
- The mean of the distribution of sample means is identical to the mean of the "parent population," the population from which the samples are drawn.
- The larger the sample size, the "narrower" the spread of the distribution of sample means.
This statement of the Central Limit Theorem is not complete. We will add refinements to this statement in later activities.
Enjoy!
We hope you enjoyed this introduction to the Central Limit Theorem. We encourage you to explore further. You might try repeating the experiments provided in this activity with sample sizes n = 15, 20, and 25.
