The Normal Distribution in R

One of the most fundamental distributions in all of statistics is the normal distribution, also known as the Gaussian distribution. According to Wikipedia, "Carl Friedrich Gauss became associated with this set of distributions when he analyzed astronomical data using them, and defined the equation of its probability density function. It is often called the bell curve because the graph of its probability density resembles a bell."

The Probability Density Function

The probability density function for the normal distribution having mean μ and standard deviation σ is given by the function in Figure 1.

Figure 1. The probability density function for the normal distribution.
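
For readers viewing this text without the image, the formula that Figure 1 depicts is the usual textbook form of the normal density. It is written here in LaTeX with the same symbols μ and σ used above.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}},
\qquad -\infty < x < \infty
```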

If we let the mean μ = 0 and the standard deviation σ = 1 in the probability density function in Figure 1, we get the probability density function for the standard normal distribution in Figure 2.

Figure 2. The probability density function for the standard normal distribution has mean μ = 0 and standard deviation σ = 1.
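
Written out step by step, substituting μ = 0 and σ = 1 into the general normal density reduces it as follows. The last line is the expression that the R command y=1/sqrt(2*pi)*exp(-x^2/2) evaluates later in this activity.

```latex
\begin{align*}
f(x) &= \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \\
     &= \frac{1}{1\cdot\sqrt{2\pi}} \, e^{-\frac{(x-0)^2}{2\cdot 1^2}} \\
     &= \frac{1}{\sqrt{2\pi}} \, e^{-\frac{x^2}{2}}
\end{align*}
```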

In the activity The Standard Normal Distribution, we examined the normal distribution having mean and standard deviation 0 and 1, respectively. You might want to work your way through that activity before continuing.

It is a simple matter to produce a plot of the probability density function for the standard normal distribution.

> x=seq(-4,4,length=200)
> y=1/sqrt(2*pi)*exp(-x^2/2)
> plot(x,y,type="l",lwd=2,col="red")

If you'd like a more detailed introduction to plotting in R, we refer you to the activity Simple Plotting in R. However, these commands are easily explained.

  1. The command x=seq(-4,4,length=200) produces 200 equally spaced values between -4 and 4 and stores the result in a vector assigned to the variable x.
  2. The command y=1/sqrt(2*pi)*exp(-x^2/2) evaluates the probability density function of Figure 2 at each entry of the vector x and stores the result in a vector assigned to the variable y.
  3. The command plot(x,y,type="l",lwd=2,col="red") plots y versus x, using:
    • a line plot (type="l"), in which consecutive points are joined by line segments --- that's an "el", not an I (eye) or a 1 (one),
    • a line width of 2, that is, twice the default width (lwd=2), and
    • the color red (col="red").
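
As a quick check on step 1, the following sketch inspects the vector that seq produces. It spells the argument out as length.out, which is the full name that length=200 abbreviates. The variable names n, endpoints, and step are ours, introduced only for illustration.

```r
# 200 equally spaced values from -4 to 4 (length.out is the full argument name)
x <- seq(-4, 4, length.out = 200)

n <- length(x)          # number of values in the vector
endpoints <- range(x)   # smallest and largest value
step <- diff(x)[1]      # spacing between neighbours: (4 - (-4)) / 199

print(n)
print(endpoints)
print(step)
```

Note that 200 points give 199 gaps, so the spacing is 8/199 rather than 8/200.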

The result is the "bell-shaped" curve shown in Figure 3.

Figure 3. The bell-shaped curve of the standard normal distribution.

An Alternate Approach

The command dnorm can be used to produce the same result as the probability density function of Figure 2. Indeed, the "d" in dnorm stands for "density." Thus, the command dnorm is designed to provide values of the probability density function for the normal distribution.

> x=seq(-4,4,length=200)
> y=dnorm(x,mean=0,sd=1)
> plot(x,y,type="l",lwd=2,col="red")

These commands produce the plot shown in Figure 4. Note that the result is identical to the plot in Figure 3.

Figure 4. The bell-shaped curve of the standard normal distribution.
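
The claim that the two plots are identical can also be checked numerically rather than by eye. The sketch below compares the hand-coded formula with dnorm at every point of x. The names y_formula, y_dnorm, agree, and largest_gap are ours.

```r
x <- seq(-4, 4, length.out = 200)

# Density computed directly from the formula in Figure 2
y_formula <- 1 / sqrt(2 * pi) * exp(-x^2 / 2)

# Density computed with R's built-in dnorm
y_dnorm <- dnorm(x, mean = 0, sd = 1)

# all.equal compares the two vectors up to a small numerical tolerance
agree <- isTRUE(all.equal(y_formula, y_dnorm))
largest_gap <- max(abs(y_formula - y_dnorm))

print(agree)
print(largest_gap)
```

Any difference reported by largest_gap should be on the order of floating-point rounding error.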

The Standard Deviation

The standard deviation represents the "spread" in the distribution. With "spread" as the interpretation, we would expect a normal distribution with a standard deviation of 2 to be "more spread out" than a normal distribution with a standard deviation of 1. Let's illustrate this idea in R.

> x=seq(-8,8,length=500)
> y1=dnorm(x,mean=0,sd=1)
> plot(x,y1,type="l",lwd=2,col="red")
> y2=dnorm(x,mean=0,sd=2)
> lines(x,y2,type="l",lwd=2,col="blue")

The above sequence of commands produces the image shown in Figure 5.

Figure 5. A normal distribution with standard deviation 2 is "wider" than the standard normal distribution having standard deviation 1.

Key Idea: The normal curve having standard deviation 2 is "twice as spread out" as the "standard" normal curve having standard deviation 1.

In similar fashion, a normal curve with standard deviation 1/2 would be "half as wide" as the standard normal curve.

> x=seq(-8,8,length=500)
> y3=dnorm(x,mean=0,sd=1/2)
> plot(x,y3,type="l",lwd=2,col="green")
> y2=dnorm(x,mean=0,sd=2)
> lines(x,y2,type="l",lwd=2,col="blue")
> y1=dnorm(x,mean=0,sd=1)
> lines(x,y1,type="l",lwd=2,col="red")
> legend("topright",c("sigma=1/2","sigma=2","sigma=1"),
+ lty=c(1,1,1),col=c("green","blue","red"))

The above sequence of commands produces the image shown in Figure 6. Note: Remember that the "plus" symbol is R's continuation prompt, not something you type. R displays it when you hit the Enter key before a command is complete, as at the end of the first line of the legend command.

Figure 6. The standard deviation determines the spread of the distribution.

Key Idea: The normal curve having standard deviation 1/2 is "half as spread out" as the "standard" normal curve having standard deviation 1.
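
The curves in Figure 6 differ in height as well as in width. Because the total area under every normal curve is 1, a curve that is twice as wide must be half as tall. The following sketch checks this numerically for the three standard deviations plotted above, using R's integrate function for the areas. The names sds, peaks, and areas are ours.

```r
sds <- c(0.5, 1, 2)

# Height of each curve at its center, x = 0
peaks <- dnorm(0, mean = 0, sd = sds)

# Total area under each curve, by numerical integration over the whole real line
areas <- sapply(sds, function(s) {
  integrate(dnorm, lower = -Inf, upper = Inf, mean = 0, sd = s)$value
})

print(data.frame(sd = sds, peak = peaks, area = areas))
```

Each area should come out as 1 up to the tolerance of the numerical integration, while each doubling of the standard deviation halves the peak.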

The Mean

In Figures 3 through 6, note that each of the normal curves is "centered" about a mean equal to zero. If the mean were different, we would expect the normal curve to be "centered" about the new mean. In the following example, we draw a normal curve with mean equal to 10 and standard deviation equal to 2.

> x=seq(4,16,length=200)
> y=dnorm(x,mean=10,sd=2)
> plot(x,y,type="l",lwd=2,col="red")

The above sequence of commands produces the image shown in Figure 7.

Figure 7. A normal distribution with mean equal to 10 and standard deviation equal to 2.

The choice of domain bears some explanation. In the activity The Standard Normal Distribution, we learned that essentially all of the data occurs within three standard deviations of the mean. Because the standard deviation is 2, three standard deviations to the left of the mean takes us to 4, while three standard deviations to the right of the mean takes us to 16.

Key Idea: This particular normal curve is "centered" or "balanced" about its mean 10.

As a second example, the following code draws the normal curve with mean equal to 100 and standard deviation equal to 10.

> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")

The above sequence of commands produces the image shown in Figure 8.

Figure 8. A normal distribution with mean equal to 100 and standard deviation equal to 10.

Again, the mean is 100 and the standard deviation is 10, so going three standard deviations to each side of the mean has us sketching this normal curve on the interval (70,130).
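
The domain arithmetic used in both examples can be wrapped in a small function. This is only a sketch: plot_domain is a hypothetical helper of our own, not a built-in R command, and its argument k (the number of standard deviations to each side) is likewise our choice.

```r
# Hypothetical helper: the interval reaching k standard deviations
# to each side of the mean (k = 3 by default)
plot_domain <- function(mean, sd, k = 3) {
  c(mean - k * sd, mean + k * sd)
}

d1 <- plot_domain(10, 2)     # interval used for Figure 7
d2 <- plot_domain(100, 10)   # interval used for Figure 8
print(d1)
print(d2)

# The endpoints can be passed straight to seq to build the plotting grid
x <- seq(d2[1], d2[2], length.out = 200)
y <- dnorm(x, mean = 100, sd = 10)
plot(x, y, type = "l", lwd = 2, col = "red")
```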

Key Idea: As in the previous example, the most important thing to note is that the distribution shown in Figure 8 is "centered" or "balanced" about its mean 100.

The Area Under the Probability Density Function

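As in the activity The Standard Normal Distribution, the command pnorm will compute the area to the left of a given value of x. To see which area is meant, the following code first draws the normal curve with mean 100 and standard deviation 10, then shades the region to the left of x = 90.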

> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> x=seq(70,90,length=100)
> y=dnorm(x,mean=100,sd=10)
> polygon(c(70,x,90),c(0,y,0),col="gray")

The above commands produce the image shown in Figure 9.

Figure 9. The area to the left of 90 represents the probability of selecting a number less than 90 from a normal distribution with mean 100 and standard deviation 10.

The following command will calculate the shaded area in Figure 9.

> pnorm(90,mean=100,sd=10)
[1] 0.1586553

Hence, the probability of selecting a random number less than 90 from a normal distribution having mean 100 and standard deviation 10 is 0.1586553.
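
Two independent checks can make this number more concrete. The sketch below, which is ours and not part of the original activity, computes the same area by numerically integrating the density with integrate, and then estimates the probability by drawing random numbers with rnorm. The seed and the sample size of 100,000 are arbitrary choices.

```r
# Area to the left of 90, as computed in the text
p_exact <- pnorm(90, mean = 100, sd = 10)

# The same area by numerical integration of the density from -Inf to 90
p_integral <- integrate(dnorm, lower = -Inf, upper = 90,
                        mean = 100, sd = 10)$value

# Simulation: the fraction of random draws that fall below 90
set.seed(1)   # arbitrary seed so that the run is repeatable
draws <- rnorm(100000, mean = 100, sd = 10)
p_sim <- mean(draws < 90)

print(c(exact = p_exact, integral = p_integral, simulated = p_sim))
```

The integral should agree with pnorm to several decimal places. The simulated value will only be close, since it depends on the random draws.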

As a second example, the following code shades the area between 90 and 100 shown in Figure 10.

> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> x=seq(90,100,length=200)
> y=dnorm(x,mean=100,sd=10)
> polygon(c(90,x,100),c(0,y,0),col="gray")

Figure 10. The area to the right of 90 and to the left of 100 represents the probability of selecting a number between 90 and 100 from a normal distribution with mean 100 and standard deviation 10.

The following command will calculate the shaded area in Figure 10.

> pnorm(100,mean=100,sd=10)-pnorm(90,mean=100,sd=10)
[1] 0.3413447
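
The plot-then-polygon pattern recurs throughout this activity, so it may be convenient to bundle it into one function. The following is a minimal sketch under our own assumptions: shade_normal is a hypothetical helper, not a built-in R command, and it always draws the curve over three standard deviations to each side of the mean.

```r
# Hypothetical helper: draw the normal curve, shade the region between
# a and b, and return the shaded area
shade_normal <- function(a, b, mean = 0, sd = 1) {
  # The full curve, three standard deviations to each side of the mean
  x <- seq(mean - 3 * sd, mean + 3 * sd, length.out = 200)
  plot(x, dnorm(x, mean, sd), type = "l", lwd = 2, col = "red",
       ylab = "density")

  # The shaded region: down to the axis at a, along the curve, down at b
  xs <- seq(a, b, length.out = 200)
  polygon(c(a, xs, b), c(0, dnorm(xs, mean, sd), 0), col = "gray")

  # Area between a and b: area left of b minus area left of a
  pnorm(b, mean, sd) - pnorm(a, mean, sd)
}

area <- shade_normal(90, 100, mean = 100, sd = 10)
print(area)
```

Called with 90 and 100, the function should reproduce both the shaded picture and the area computed above.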

68%-95%-99.7% Rule

In the activity The Standard Normal Distribution, we introduced the 68% - 95% - 99.7% rule in conjunction with the standard normal distribution. The 68% - 95% - 99.7% rule works just as well as a rule of thumb even when the mean and standard deviation change. For example, the following code shades the region within one standard deviation of the mean in Figure 11.

> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> x=seq(90,110,length=200)
> y=dnorm(x,mean=100,sd=10)
> polygon(c(90,x,110),c(0,y,0),col="gray")

Figure 11. The shaded area represents the probability of drawing a number from the normal distribution (mean = 100, standard deviation = 10) that falls within one standard deviation of the mean.

Remember that R's pnorm command finds the area to the left of a given value of x. Thus, to find the area between x = 90 and x = 110, we must subtract the area to the left of x = 90 from the area to the left of x = 110.

> pnorm(110, mean=100, sd=10)-pnorm(90, mean=100, sd=10)
[1] 0.6826895

There's that promised 68% again!

In similar fashion, we can get the area within two and three standard deviations.

> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> x=seq(80,120,length=200)
> y=dnorm(x,mean=100,sd=10)
> polygon(c(80,x,120),c(0,y,0),col="gray")

Figure 12. The shaded area represents the probability of drawing a number from the normal distribution (mean = 100, standard deviation = 10) that falls within two standard deviations of the mean.

To find the area between x = 80 and x = 120, we must subtract the area to the left of x = 80 from the area to the left of x = 120.

> pnorm(120, mean=100, sd=10)-pnorm(80, mean=100, sd=10)
[1] 0.9544997

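Note again that there is approximately a 95% chance that the number drawn falls within two standard deviations of the mean. Finally, the following code shades the region within three standard deviations of the mean.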

> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> polygon(c(70,x,130),c(0,y,0),col="gray")

Figure 13. The shaded area represents the probability of drawing a number from the normal distribution (mean = 100, standard deviation = 10) that falls within three standard deviations of the mean.

To find the area between x = 70 and x = 130, we must subtract the area to the left of x = 70 from the area to the left of x = 130.

> pnorm(130, mean=100, sd=10)-pnorm(70, mean=100, sd=10)
[1] 0.9973002

Therefore, the chance that a number drawn randomly from the normal distribution falls within three standard deviations of the mean is 99.7%!

Important Result: We conclude that virtually all numbers from any normal distribution occur within three standard deviations of the mean.
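
Because pnorm accepts vectors, the three calculations above can be done in one pass, and the same numbers can be compared against the standard normal distribution. This is a sketch with names of our own choosing (mu, sigma, k, within_k, within_std).

```r
mu <- 100
sigma <- 10
k <- 1:3   # one, two, and three standard deviations

# Area within k standard deviations of the mean, for mean 100 and sd 10
within_k <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
            pnorm(mu - k * sigma, mean = mu, sd = sigma)

# The same areas for the standard normal distribution (mean 0, sd 1)
within_std <- pnorm(k) - pnorm(-k)

print(round(within_k, 7))
print(isTRUE(all.equal(within_k, within_std)))
```

The first line of output should repeat the three areas found above, and the comparison shows that they do not depend on the particular mean and standard deviation.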

Quantiles

In the activity The Standard Normal Distribution, the command qnorm was introduced to calculate quantiles. This command is fed the area under the curve to the left of some unknown number, then calculates the unknown number.

For example, suppose that the area under the curve to the left of some unknown x-value is 0.95, as shown in Figure 14.

Figure 14. The area under the curve to the left of some unknown x-value is 0.95.

To find the unknown value of x we use R's qnorm command (the "q" is for "quantile").

> qnorm(0.95,mean=100,sd=10)
[1] 116.4485

Hence, there is a 95% probability that a number chosen at random from the normal distribution with mean 100 and standard deviation 10 is less than or equal to 116.4485.

In a sense, R's pnorm and qnorm commands play the roles of inverse functions. On one hand, the command pnorm is fed a number and asked to find the probability that a random selection from the normal distribution falls to the left of this number. On the other hand, the command qnorm is given the probability and asked to find a limiting number so that the area under the curve to the left of that number equals the given probability.
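
The inverse relationship can be seen by feeding the output of one command into the other. The sketch below uses the mean 100 and standard deviation 10 from the example above. The variable names x95, p_back, and x_back are ours.

```r
# Start from a probability, find the quantile, then recover the probability
x95 <- qnorm(0.95, mean = 100, sd = 10)
p_back <- pnorm(x95, mean = 100, sd = 10)

# Start from a number, find the probability, then recover the number
x_back <- qnorm(pnorm(90, mean = 100, sd = 10), mean = 100, sd = 10)

print(x95)
print(p_back)
print(x_back)
```

Up to rounding error, p_back returns the 0.95 we started with and x_back returns 90.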

We must emphasize that the area under the curve to the left is used when applying the commands pnorm and qnorm. If you are given an area to the right, then you must make a simple adjustment before applying the qnorm command. Suppose, as shown in Figure 15, that the area to the right of an unknown number is 0.80.

Figure 15. The area under the curve to the right of some unknown x-value is 0.80.

In this case, we must subtract the area to the right from the number 1 to obtain 1 - 0.80 = 0.20, which is the area to the left of the unknown value of x shown in Figure 16.

Figure 16. The area under the curve to the left of some unknown x-value is 1 - 0.80 = 0.20.

We can now use the qnorm command to find the unknown value of x.

> qnorm(0.20,mean=100,sd=10)
[1] 91.58379

We now know that the probability of selecting a number greater than or equal to 91.58379 from the normal distribution with mean 100 and standard deviation 10 is 0.80.
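
As an alternative to subtracting from 1 by hand, both qnorm and pnorm accept a lower.tail argument. Setting lower.tail=FALSE tells R to work with the area to the right instead. The original activity does not use this argument; the sketch below simply confirms that it gives the same answer. The variable names are ours.

```r
# Subtracting by hand, as in the text: area to the left is 1 - 0.80
x_left <- qnorm(1 - 0.80, mean = 100, sd = 10)

# Letting R work with the right-hand area directly
x_right <- qnorm(0.80, mean = 100, sd = 10, lower.tail = FALSE)

# Check: the area to the right of that value
p_right <- pnorm(x_right, mean = 100, sd = 10, lower.tail = FALSE)

print(x_left)
print(x_right)
print(p_right)
```

Both approaches should give the same x-value, and p_right should return the 0.80 we started with.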

Enjoy!

We hope you enjoyed this introduction to the normal distribution in the R system. We encourage you to explore further.