The Central Limit Theorem (Part 3)

In the activity The Central Limit Theorem (Part 1), we concluded with the following observations on the Central Limit Theorem.

  1. If you draw samples from a normal distribution, then the distribution of sample means is also normal.
  2. The mean of the distribution of sample means is identical to the mean of the "parent population," the population from which the samples are drawn.
  3. The higher the sample size that is drawn, the "narrower" will be the spread of the distribution of sample means.

To provide visual evidence of the narrowing spread, the following code will produce the image shown in Figure 1. The four examples pictured use sample sizes of 5, 10, 20, and 30. (Note: It is very difficult to type the following commands without error. In a moment, we'll provide a more efficient way to enter the commands. In the meantime, give the commands a try, if only to appreciate how difficult it is to get them correctly entered in sequence.)

> par(mfrow=c(2,2))
> 
> n=5
> xbar_5=rep(0,500)
> for (i in 1:500){ xbar_5[i]=mean(rnorm(n,mean=100,sd=10))}
> hist(xbar_5,prob=TRUE,breaks=12,xlim=c(70,130))
> 
> n=10
> xbar_10=rep(0,500)
> for (i in 1:500) { xbar_10[i]=mean(rnorm(n,mean=100,sd=10))}
> hist(xbar_10,prob=TRUE,breaks=12,xlim=c(70,130))
> 
> n=20
> xbar_20=rep(0,500)
> for (i in 1:500) { xbar_20[i]=mean(rnorm(n,mean=100,sd=10))}
> hist(xbar_20,prob=TRUE,breaks=12,xlim=c(70,130))
> 
> n=30
> xbar_30=rep(0,500)
> for (i in 1:500) { xbar_30[i]=mean(rnorm(n,mean=100,sd=10))}
> hist(xbar_30,prob=TRUE,breaks=12,xlim=c(70,130))

Some comments are in order:

  1. There is a new command, par(mfrow=c(2,2)), which divides the graphics window into two rows and two columns (see Figure 1). The par stands for "parameter," and there are a whole host of that can alter the figure window (type ?par for more information).
  2. Think of mfrow as meaning "multiple figures entered by rows." Because there are two rows and two columns, the first histogram command creates a plot in row 1, column 1, the second in row 1, column 2, the third in row 2, column 1, and the fourth in row 2 column 2. There is also a command mfcol that enters the figures by columns.
  3. The for loops are explained in the activity The Central Limit Theorem (Part 1).
  4. Note that xlim=c(70,130) creates a common axis so that a proper comparison of the spreads can be made.

In Figure 1, note that the spread narrows as the sample size increases from n = 5 to n = 10, n = 20, and n = 30.

A visual comparison of the distribution of sample means as the sample size increases.

Figure 1. A visual comparison of the distribution of sample means as the sample size increases.

Note that the mean of each distribution of sample means is 100 (each of the histograms in Figure 1 balances about 100), the same as the mean of the parent population from which the samples are drawn.

Sourcing Files

If you were successful in typing the lines that produced Figure 1, then you are far better typist than me. When commands must be correctly entered in sequence, such as the lines of code that produced Figure 1, it can be exasperating to make a mistake, knowing that you must retype the entire sequence anew.

A far more efficient way to proceed is to use R's editor to enter the lines of code, save the file, then execute the code in the file. This way, if you make a simple typing mistake, you make a change to the file, save, then execute the file anew. R calls these files of code "source files," and executing the file is called "sourcing the file."

On the Mac operating system (Windows and Linux should be similar), you open the editor by selecting New Document from the File menu (there is also an icon for new documents on the console toolbar). When the editor opens, type the following lines.

par(mfrow=c(2,2))

n=5
xbar_5=rep(0,500)
for (i in 1:500){ xbar_5[i]=mean(rnorm(n,mean=100,sd=10))}
hist(xbar_5,prob=TRUE,breaks=12,xlim=c(70,130))

n=10
xbar_10=rep(0,500)
for (i in 1:500) { xbar_10[i]=mean(rnorm(n,mean=100,sd=10))}
hist(xbar_10,prob=TRUE,breaks=12,xlim=c(70,130))

n=20
xbar_20=rep(0,500)
for (i in 1:500) { xbar_20[i]=mean(rnorm(n,mean=100,sd=10))}
hist(xbar_20,prob=TRUE,breaks=12,xlim=c(70,130))

n=30
xbar_30=rep(0,500)
for (i in 1:500) { xbar_30[i]=mean(rnorm(n,mean=100,sd=10))}
hist(xbar_30,prob=TRUE,breaks=12,xlim=c(70,130))

Save the file as central3_1.R. The .R extension reminds us that the file contains R source code.

To execute the code in the file central3_1.R, select Source File from the File menu (there is also an icon for "sourcing" files on the console toolbar). This opens a dialog where you can browse your system in the usual manner and select the saved source file (central3_1.R in our case). Clicking the Open button will execute the source code in the file and produce the image in Figure 1. If you have errors, or wish to make adaptions, correct your source code, save, then "source" the file again.

On our system, "sourcing" the file central3_1.R sends the following command to the console:

> source("/Users/darnold/Documents/MathDept/trunk/MathDept/html/R/code/central3_1.R")

This line shows both the path and filename of the file containing our source code. We've found that should errors occur, or should we wish to add further code or make changes, we can save our changes, then hit the "up-arrow" on our keyboard to replay this line in our console. Hitting enter executes the line and "sources" our saved changes, a most efficient way to work.

Examining the Workspace

Once you've "sourced" your code, the workspace will reflect variables used in your code. (Note: Your workspace may differ.)

> ls()
[1] "i"       "n"       "xbar_10" "xbar_20" "xbar_30" "xbar_5" 

The variable i should contain the last value of iteration.

> i
[1] 500

The variable n should contain the last sample size used in your code.

> n
[1] 30

The variables xbar_5, xbar_10, xbar_20, and xbar_30 should contain 500 sample means with the indicated sample size. For example, xbar_5 contains the means of 500 samples of size n = 5 drawn from a normal population having mean μ = 100 and standard deviation σ = 10.

> xbar_5
  [1]  94.47642  98.71575 106.02464 104.54210 102.46225  95.55821
  [7] 106.01677 103.29282 101.07612  93.36717  97.08930  99.53715
 [13]  98.72121  91.59586 100.12131 100.20003  99.66180  97.42841
 [19] 102.15585  93.75008  98.89875 103.13431 101.45754  97.65430
 [25] 103.02266  97.22582  95.37828  96.91194  98.07255  96.93296

...
...

[493] 102.79047  92.16152 103.85065  95.74779  92.83653  97.43172
[499]  99.70777 106.23438

The histograms in Figure 1 indicate that the spread of the distributions of sample means narrows with increasing sample size. Let's examine the standard deviations of xbar_5, xbar_10, xbar_20, and xbar_30. The sd command is used to find the standard deviation of any collection of data.

> sd(xbar_5)
[1] 4.52229
> sd(xbar_10)
[1] 3.179132
> sd(xbar_20)
[1] 2.165316
> sd(xbar_30)
[1] 1.820735

Sure enough, as the sample size increases, the standard deviation, which is a measure of the spread of the distribution, clearly decreases.

Adding Data Points

Let's add a couple more data points. Save the following lines in the file central3_2.R.

n=5
xbar_5=rep(0,500)
for (i in 1:500){ xbar_5[i]=mean(rnorm(n,mean=100,sd=10))}

n=10
xbar_10=rep(0,500)
for (i in 1:500) { xbar_10[i]=mean(rnorm(n,mean=100,sd=10))}

n=20
xbar_20=rep(0,500)
for (i in 1:500){ xbar_20[i]=mean(rnorm(n,mean=100,sd=10))}

n=30
xbar_30=rep(0,500)
for (i in 1:500) { xbar_30[i]=mean(rnorm(n,mean=100,sd=10))}

n=40
xbar_40=rep(0,500)
for (i in 1:500) { xbar_40[i]=mean(rnorm(n,mean=100,sd=10))}

n=50
xbar_50=rep(0,500)
for (i in 1:500) { xbar_50[i]=mean(rnorm(n,mean=100,sd=10))}

Now, collect the sample sizes in a vector ssizes and standard deviations of the distributions of sample means in a vector sdevs, then plot the standard deviations versus the sample sizes. Add these lines to the code in the file central3_2.R.

ssizes=c(5,10,20,30,40,50)
sdevs=c(sd(xbar_5),sd(xbar_10),sd(xbar_20),sd(xbar_30),sd(xbar_40),sd(xbar_50))

plot(ssizes,sdevs)

Save the file central3_2.R then "source" the file to produce the scatterplot shown in Figure 2.

Plotting the standard deviations of the distributions of sample means versus the sample size.

Figure 2. Plotting the standard deviations of the distributions of sample means versus the sample size.

Modeling Standard Deviation versus Sample Size

Let's examine the current workspace in R. Note that this next command is executed at the command line.

> ls()
 [1] "i"       "n"       "sdevs"   "ssizes"  "xbar_10" "xbar_20"
 [7] "xbar_30" "xbar_40" "xbar_5"  "xbar_50"
 

Good! All the data we need exists in the current workspace.

We will now try to find the specific relationship between the standard deviations of the distributions of sample means and the corresponding sample size. In the activity Transforming Data in R, we examined two models: power functions and exponential functions. Let's plot the logarithm of sdevs versus ssizes.

> plot(log(ssizes),log(sdevs))

The above command produces the scatterplot shown in Figure 3.

Plotting the logarithm of <strong>sdevs</strong> versus the logarithm of <strong>ssizes</strong>.

Figure 3. Plotting the logarithm of sdevs versus the logarithm of ssizes.

The fact that the plot of the logarithm of sdevs versus ssizes in Figure 3 is linear indicates that a power function would be a good model for the relationship between sdevs and ssizes.

As we learned in the activity Transforming Data in R, we first fit a line to the logarithm of sdevs versus the logarithm of ssizes in Figure 1.

> res=lm(log(sdevs)~log(ssizes))
> res

Call:
lm(formula = log(sdevs) ~ log(ssizes))

Coefficients:
(Intercept)  log(ssizes)  
     2.2962      -0.4949 

Let's add the line of best fit to the data in Figure 3.

> abline(res)

This produces the line of best fit in Figure 4.

Adding the line of best fit.

Figure 4. Adding the line of best fit..

The fact that the line fits the data in Figure 4 so well is a strong indication that a power function is the relationship we seek between the standard deviations of the distribution of sample means and the sample sizes.

The output in the variable res tells us that the line of best fit in Figure 4 has the following equation.

log sdevs = 2.2962 - 0.4949 log ssizes

We exponentiate both sides of the last equation.

elog sdevs = e2.2962 - 0.4949 log ssizes

On the left, the exponential and the logarithm are inverses. On the right, the exponential of a sum is the product of the exponentials.

sdevs = e2.2962 e - 0.4949 log ssizes

We compute e2.2962.

> exp(2.2962)
[1] 9.936352

Thus, our last equation becomes:

sdevs = 9.936352 e - 0.4949 log ssizes

We use a property of logarithms to move - 0.4949 into the exponent.

sdevs = 9.936352 elog ssizes - 0.4949

Again, the logarithm and exponential are inverses, which leads to the following result.

sdevs = 9.936352 ssizes - 0.4949

Conclusions

Suppose that we round the numbers in our last result to obtain:

sdevs = 10 ssizes - 0.5

Note the number 10. It is not a coincidence that this number is the standard deviation of the parent population. Remember, we drew our samples from a normal distribution with mean μ = 100 and standard deviation σ = 10.

Indeed, here is the final conclusion of the Central Limit Theorem.

Central Limit Theorem. Let X be a random variable having normal distribution with mean μ and standard deviation σ. Let X be the random variable representing the mean of a sample of size n drawn from the distribution of X. Then:

  1. The distribution of X will be normal.
  2. The mean of the distribution of X is identical to the mean of the distribution of X. In symbols, we write:

    The mean of xbar equals the mean of x.

  3. The standard deviation of the distribution of X is related to the standard deviation of the distribution of X by the following equation:

    The standard deviation of xbar equals the standard deviation of x divided by the square root of the sample size.

Note that the result in item #3 is identical to our experimental result, repeated here for convenience.

sdevs = 10 ssizes - 0.5

The comparison is better seen if we use a little algebra to change the form of the result in item #3.

The standard deviation of the distribution of the mean equals the product of the standard deviation of the parent population and the sample size raised to the minus one-half.

Thus, to find the standard deviation of the distribution of sample means, take the standard deviation of the parent population and divide it by the square root of the sample size.

Sampling from Distributions that are not Normal

Here is the key fact: The hypotheses and conclusions of the Central Limit Theorem as stated above hold regardless of the distribution of the parent population, provided the sample size is large enough. We now will examine this statement through examples.

Drawing from a Uniform Distribution

Consider the uniform distribution on the interval [0,10].

> x=seq(0,10,0.1)
> plot(x,y,type="l")
> plot(x,y,type="l",lwd=2,col="red")

These commands produce the uniform density function on [0,10] in Figure 5.

The uninform density function on [0,10].

Figure 5. The uniform density function on [0,10].

The mean of the uniform distribution on the interval [a, b] is given by the formula:

The mean of the uniform distribution on [a,b] is given by (a+b)/2.

Thus, the mean of the uniform distribution on [0,10] is:

The mean of the uniform distribution on [0,10] is 5.

You need to use a little calculus to determine the standard deviation of the uniform distribution on the interval [a, b], but the result is:

The standard deviation of the uniform distribution on the interval [a,b] is found by taking the difference of b and a, squaring, dividing by 12, then taking the square root of the result.

Thus, the standard deviation of the uniform distribution on the interval [0,10] is:

The standard deviation of the uniform distribution on [0,10] is iapproximately 2.886751.

Now, let's draw 500 samples of size n=100 from the uniform distribution on [0,10].

> n=100
> xbar=rep(0,500)
> for (i in 1:500) { xbar[i]=mean(runif(n,min=0,max=10))}
> hist(xbar,prob=TRUE)

These commands produce the histogram in Figure 6.

Histogram of means of samples from the uniform distribution o [0,10] having sample size n = 100.

Figure 6. Histogram of means of samples of size n = 10 drawn from the uniform distribution [0,10].

Now, let's check the conclusions of the Central Limit Theorem.

  1. The first conclusion of the Central Limit Theorem is the fact that the distribution of sample means is normal. This is evident in Figure 6.
  2. The second conclusion of the Central Limit Theorem states that the mean of the sample means is the same as the mean of the distribution from which the samples are drawn. In Figure 6, the mean of the distribution of sample means appears to be about 5 (estimate the balance point in Figure 6), which is the same as the mean of the uniform distribution on [0,10]. However, let's do one more check:
    > mean(xbar)
    [1] 5.002197
    

    Looks good!

  3. The third conclusion of the Central Limit Theorem states that the standard deviation of the distribution of sample means is calculated with:

    Calculating the standard deviation of the distribution of sample means.

    Let's check this prediction.

    > sd(xbar)
    [1] 0.2936821
    

    That's pretty good agreement!

Drawing from an Exponential Distribution

In the activity Continuous Distributions in R, we examined the exponential function having the following probability density function:

The Exponential Probability Density Function

The probability density function for the exponential distribution.

The mean and standard deviation of this exponential distribution are μ = 1/λ and σ = 1/λ, respectively.

Let's consider the exponential distribution with λ = 1.

> lambda=1
> y=dexp(x,rate=lambda)
> plot(x,y,type="l",lwd=2,col="red")

These commands produce the histogram in Figure 7.

The exponential density function for λ = 1.

Figure 7. The exponential density function for λ = 1.

The mean and standard deviation of this distribution are:

μX = 1/λ = 1/1 = 1 and σX = 1/λ = 1/1 = 1

Now, let's draw 500 samples of size n=100 from the exponential distribution with λ = 1.

> n=100
> xbar=rep(0,500)
> for (i in 1:500) { xbar[i]=mean(rexp(n,rate=lambda))}
> hist(xbar,prob=TRUE)

These commands produce the histogram of the sample means shown in Figure 8.

Distribution of sample means of size n = 100 from the expontial distribution with lambda = 1.

Figure 8. Distribution of sample means of size n = 100 from the exponential distribution with λ = 1.

Now, let's check the conclusions of the Central Limit Theorem.

  1. The first conclusion of the Central Limit Theorem is the fact that the distribution of sample means is normal. This is evident in Figure 8.
  2. The second conclusion of the Central Limit Theorem states that the mean of the sample means is the same as the mean of the distribution from which the samples are drawn. In Figure 8, the mean of the distribution of sample means appears to be about 1 (estimate the balance point in Figure 8), which is the same as the mean of the exponential distribution with λ = 1. However, let's do one more check:
    > mean(xbar)
    [1] 0.9997962
    

    Looks good!

  3. The third conclusion of the Central Limit Theorem states that the standard deviation of the distribution of sample means is calculated with:

    The standard deviation of the sample means is approximately 1/10.

    Let's check this prediction.

    > sd(xbar)
    [1] 0.09948667
    

    That's pretty good agreement!

Enjoy!

We hope you enjoyed this third activity on the Central Limit Theorem system. We encourage you to explore further. You might try repeating the experiments in this activity with different "parent" distributions.