## The Normal Distribution in R

One of the most fundamental distributions in all of statistics is the *Normal Distribution* or the *Gaussian Distribution*. According to Wikipedia, "Carl Friedrich Gauss became associated with this set of distributions when he analyzed astronomical data using them, and defined the equation of its probability density function. It is often called the *bell curve* because the graph of its probability density resembles a bell."

### The Probability Density Function

The *probability density function* for the normal distribution having mean μ and standard deviation σ is given by the function in Figure 1.

**The Normal Probability Density Function**

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$$

**Figure 1.** The probability density function for the normal distribution.

If we let the mean μ = 0 and the standard deviation σ = 1 in the probability density function in Figure 1, we get the probability density function for the *standard normal distribution* in Figure 2.

**The Standard Normal Probability Density Function**

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$

**Figure 2.** The probability density function for the standard normal distribution has mean μ = 0 and standard deviation σ = 1.

In the activity The Standard Normal Distribution, we examined the normal distribution having mean and standard deviation 0 and 1, respectively. You might want to work your way through that activity before continuing.

It is a simple matter to produce a plot of the probability density function for the standard normal distribution.

    > x=seq(-4,4,length=200)
    > y=1/sqrt(2*pi)*exp(-x^2/2)
    > plot(x,y,type="l",lwd=2,col="red")

If you'd like a more detailed introduction to plotting in R, we refer you to the activity Simple Plotting in R. However, these commands are easily explained.

- The command **x=seq(-4,4,length=200)** produces 200 equally spaced values between -4 and 4 and stores the result in a vector assigned to the variable **x**.
- The command **y=1/sqrt(2*pi)*exp(-x^2/2)** evaluates the probability density function of Figure 2 at each entry of the vector **x** and stores the result in a vector assigned to the variable **y**.
- The command **plot(x,y,type="l",lwd=2,col="red")** plots **y** versus **x**, using:
  - a line plot that joins the points with line segments (**type="l"**; that's an "el", not an I (eye) or a 1 (one)),
  - a line width of 2, which is twice the default width (**lwd=2**), and
  - the color red (**col="red"**).

The result is the "bell-shaped" curve shown in Figure 3.

**Figure 3.** The bell-shaped curve of the standard normal distribution.

#### An Alternate Approach

The command **dnorm** can be used to produce the same result as the probability density function of Figure 2. Indeed, the "d" in **dnorm** stands for "density." Thus, the command **dnorm** is designed to provide values of the probability density function for the normal distribution.

    > x=seq(-4,4,length=200)
    > y=dnorm(x,mean=0,sd=1)
    > plot(x,y,type="l",lwd=2,col="red")

These commands produce the plot shown in Figure 4. Note that the result is identical to the plot in Figure 3.
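The claim that the two approaches agree can also be checked numerically rather than by eye. The following is a minimal sketch; the variable names `y_formula`, `y_dnorm`, and `same_curve` are ours and do not appear in the commands above.

```r
# Evaluate the standard normal density two ways on the same grid.
x <- seq(-4, 4, length.out = 200)
y_formula <- 1 / sqrt(2 * pi) * exp(-x^2 / 2)  # formula typed by hand
y_dnorm   <- dnorm(x, mean = 0, sd = 1)        # R's built-in density

# all.equal() compares numeric vectors up to a small tolerance.
same_curve <- isTRUE(all.equal(y_formula, y_dnorm))
print(same_curve)

# Largest absolute difference between the two vectors.
print(max(abs(y_formula - y_dnorm)))
```

If `same_curve` is `TRUE`, the two vectors differ by no more than floating-point rounding error, which is why the plots in Figures 3 and 4 look the same.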

**Figure 4.** The bell-shaped curve of the standard normal distribution.

### The Standard Deviation

The standard deviation represents the "spread" in the distribution. With "spread" as the interpretation, we would expect a normal distribution with a standard deviation of 2 to be "more spread out" than a normal distribution with a standard deviation of 1. Let's illustrate this idea in R.

    > x=seq(-8,8,length=500)
    > y1=dnorm(x,mean=0,sd=1)
    > plot(x,y1,type="l",lwd=2,col="red")
    > y2=dnorm(x,mean=0,sd=2)
    > lines(x,y2,type="l",lwd=2,col="blue")

The above sequence of commands produces the image shown in Figure 5.

**Figure 5.** A normal distribution with standard deviation 2 is "wider" than the standard normal distribution having standard deviation 1.

Key Idea: The normal curve with standard deviation 2 is "twice as spread out" as the standard normal curve, which has standard deviation 1.

In similar fashion, a normal curve with standard deviation 1/2 would be "half as wide" as the standard normal curve.

    > x=seq(-8,8,length=500)
    > y3=dnorm(x,mean=0,sd=1/2)
    > plot(x,y3,type="l",lwd=2,col="green")
    > y2=dnorm(x,mean=0,sd=2)
    > lines(x,y2,type="l",lwd=2,col="blue")
    > y1=dnorm(x,mean=0,sd=1)
    > lines(x,y1,type="l",lwd=2,col="red")
    > legend("topright",c("sigma=1/2","sigma=2","sigma=1"),
    + lty=c(1,1,1),col=c("green","blue","red"))

The above sequence of commands produces the image shown in Figure 6. *Note: Remember that the "plus" symbol is R's line continuation character. R will provide this character when you hit the Enter key to end the previous line.*

**Figure 6.** The standard deviation determines the spread of the distribution.

Key Idea: The normal curve with standard deviation 1/2 is "half as spread out" as the standard normal curve, which has standard deviation 1.
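Because the total area under every normal curve is 1, a curve that is more spread out must also be lower. The sketch below, which is our own illustration and uses the hypothetical name `peak_table`, compares the height of each curve at its mean for the three standard deviations used above.

```r
# Height of each normal curve at its mean (here, x = 0).
sds   <- c(0.5, 1, 2)
peaks <- dnorm(0, mean = 0, sd = sds)  # dnorm accepts a vector of sd values

# The peak height is 1 / (sd * sqrt(2 * pi)), so peak * sd is the same
# constant for every curve.
peak_table <- data.frame(sd = sds, peak = peaks, peak_times_sd = peaks * sds)
print(peak_table)
```

The last column is the same for all three rows, so doubling the standard deviation halves the peak height, and halving the standard deviation doubles it.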

### The Mean

In Figures 3 through 6, note that each of the normal curves is "centered" about a mean equal to zero. If the mean were different, we would expect the normal curve to be "centered" about the new mean. In the following example, we draw a normal curve with mean equal to 10 and standard deviation equal to 2.

    > x=seq(4,16,length=200)
    > y=dnorm(x,mean=10,sd=2)
    > plot(x,y,type="l",lwd=2,col="red")

The above sequence of commands produces the image shown in Figure 7.

**Figure 7.** A normal distribution with mean equal to 10 and standard deviation equal to 2.

The choice of domain bears some explanation. In the activity The Standard Normal Distribution, we learned that essentially all of the data occur within three standard deviations of the mean. Because the mean is 10 and the standard deviation is 2, three standard deviations to the left of the mean takes us to 10 - 3(2) = 4, while three standard deviations to the right of the mean takes us to 10 + 3(2) = 16.

Key Idea: This particular normal curve is "centered" or "balanced" about its mean 10.

As a second example, the following code draws the normal curve with mean equal to 100 with standard deviation equal to 10.

    > x=seq(70,130,length=200)
    > y=dnorm(x,mean=100,sd=10)
    > plot(x,y,type="l",lwd=2,col="red")

The above sequence of commands produces the image shown in Figure 8.

**Figure 8.** A normal distribution with mean equal to 100 and standard deviation equal to 10.

Again, the mean is 100 and the standard deviation is 10, so going three standard deviations to each side of the mean means we sketch this normal curve on the interval (70,130).

Key Idea: As in the previous example, the most important thing to note is that the distribution shown in Figure 8 is "centered" or "balanced" about its mean 100.
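In both examples the plotting domain was chosen as the mean plus or minus three standard deviations. That choice can be wrapped in a small helper function. The function `normal_grid` and its arguments `k` and `n` are hypothetical names introduced only for this sketch.

```r
# Hypothetical helper: a grid of n x-values covering
# mean - k*sd through mean + k*sd (k = 3 by default).
normal_grid <- function(mean, sd, k = 3, n = 200) {
  seq(mean - k * sd, mean + k * sd, length.out = n)
}

# Domain for the curve with mean 10 and standard deviation 2.
x1 <- normal_grid(mean = 10, sd = 2)
print(range(x1))

# Domain for the curve with mean 100 and standard deviation 10.
x2 <- normal_grid(mean = 100, sd = 10)
print(range(x2))
```

The resulting vectors can be passed to **dnorm** and **plot** exactly as in the examples above.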

### The Area Under the Probability Density Function

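As in the activity The Standard Normal Distribution, the command **pnorm** will compute the area under the curve to the left of a given value of *x*. For example, the following commands shade the area to the left of *x = 90* under the normal curve with mean 100 and standard deviation 10.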

    > x=seq(70,130,length=200)
    > y=dnorm(x,mean=100,sd=10)
    > plot(x,y,type="l",lwd=2,col="red")
    > x=seq(70,90,length=100)
    > y=dnorm(x,mean=100,sd=10)
    > polygon(c(70,x,90),c(0,y,0),col="gray")

The above commands produce the image shown in Figure 9.

**Figure 9.** The area to the left of 90 represents the probability of selecting a number less than 90 from a normal distribution with mean 100 and standard deviation 10.

The following command will calculate the shaded area in Figure 9.

    > pnorm(90,mean=100,sd=10)
    [1] 0.1586553

Hence, the probability of selecting a random number less than 90 from a normal distribution having mean 100 and standard deviation 10 is 0.1586553.
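Since **pnorm** reports an area under the density curve, it should agree with a direct numerical integration of **dnorm**. The sketch below uses R's **integrate** function for that check; the variable name `area` is ours.

```r
# Numerically integrate the density from -Inf up to 90.
# Extra arguments (mean, sd) are passed through to dnorm.
area <- integrate(dnorm, lower = -Inf, upper = 90, mean = 100, sd = 10)
print(area$value)

# Compare with the cumulative probability reported by pnorm.
print(pnorm(90, mean = 100, sd = 10))
```

The two printed values should agree to within the error tolerance of the numerical integration.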

As a second example, the following code shades the area between 90 and 100 shown in Figure 10.

    > x=seq(70,130,length=200)
    > y=dnorm(x,mean=100,sd=10)
    > plot(x,y,type="l",lwd=2,col="red")
    > x=seq(90,100,length=200)
    > y=dnorm(x,mean=100,sd=10)
    > polygon(c(90,x,100),c(0,y,0),col="gray")

**Figure 10.** The area to the right of 90 and to the left of 100 represents the probability of selecting a number between 90 and 100 from a normal distribution with mean 100 and standard deviation 10.

The following command will calculate the shaded area in Figure 10.

    > pnorm(100,mean=100,sd=10)-pnorm(90,mean=100,sd=10)
    [1] 0.3413447
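The subtraction used for Figure 10 comes up often, so it may be convenient to wrap it in a function. In the sketch below, `area_between` is a hypothetical helper of our own, not a built-in R command. The second call shows that the same area is obtained on the standard normal curve after converting the endpoints to z-scores, z = (x - mean)/sd.

```r
# Hypothetical helper: area under the normal curve between a and b.
area_between <- function(a, b, mean = 0, sd = 1) {
  pnorm(b, mean = mean, sd = sd) - pnorm(a, mean = mean, sd = sd)
}

# Area between 90 and 100 for mean 100 and standard deviation 10.
p_original <- area_between(90, 100, mean = 100, sd = 10)
print(p_original)   # [1] 0.3413447

# Same area on the standard normal curve, using z-scores -1 and 0.
p_standard <- area_between((90 - 100) / 10, (100 - 100) / 10)
print(p_standard)   # [1] 0.3413447
```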

### 68%-95%-99.7% Rule

In the activity The Standard Normal Distribution, we introduced the 68% - 95% - 99.7% rule in conjunction with the standard normal distribution. The 68% - 95% - 99.7% rule works just as well as a rule of thumb when the mean and standard deviation change. For example, the following code shades the region within one standard deviation of the mean in Figure 11.

    > x=seq(70,130,length=200)
    > y=dnorm(x,mean=100,sd=10)
    > plot(x,y,type="l",lwd=2,col="red")
    > x=seq(90,110,length=200)
    > y=dnorm(x,mean=100,sd=10)
    > polygon(c(90,x,110),c(0,y,0),col="gray")

**Figure 11.** The shaded area represents the probability of drawing a number from the normal distribution (mean = 100, standard deviation = 10) that falls within one standard deviation of the mean.

Remember that R's **pnorm** command finds the area *to the left* of a given value of *x*. Thus, to find the area between *x = 90* and *x = 110*, we must subtract the area to the left of *x = 90* from the area to the left of *x = 110*.

    > pnorm(110, mean=100, sd=10)-pnorm(90, mean=100, sd=10)
    [1] 0.6826895

There's that promised 68% again!

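In similar fashion, we can get the area within two and three standard deviations. The following code shades the region within two standard deviations of the mean, shown in Figure 12.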

    > x=seq(70,130,length=200)
    > y=dnorm(x,mean=100,sd=10)
    > plot(x,y,type="l",lwd=2,col="red")
    > x=seq(80,120,length=200)
    > y=dnorm(x,mean=100,sd=10)
    > polygon(c(80,x,120),c(0,y,0),col="gray")

**Figure 12.** The shaded area represents the probability of drawing a number from the normal distribution (mean = 100, standard deviation = 10) that falls within two standard deviations of the mean.

To find the area between *x = 80* and *x = 120*, we must subtract the area to the left of *x = 80* from the area to the left of *x = 120*.

    > pnorm(120, mean=100, sd=10)-pnorm(80, mean=100, sd=10)
    [1] 0.9544997

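Note again that there is approximately a 95% chance that the number drawn falls within two standard deviations of the mean. Finally, the following code shades the region within three standard deviations of the mean, shown in Figure 13.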

    > x=seq(70,130,length=200)
    > y=dnorm(x,mean=100,sd=10)
    > plot(x,y,type="l",lwd=2,col="red")
    > polygon(c(70,x,130),c(0,y,0),col="gray")

**Figure 13.** The shaded area represents the probability of drawing a number from the normal distribution (mean = 100, standard deviation = 10) that falls within three standard deviations of the mean.

To find the area between *x = 70* and *x = 130*, we must subtract the area to the left of *x = 70* from the area to the left of *x = 130*.

    > pnorm(130, mean=100, sd=10)-pnorm(70, mean=100, sd=10)
    [1] 0.9973002

Therefore, the chance that a number drawn randomly from the normal distribution falls within three standard deviations of the mean is 99.7%!

Important Result: We conclude that virtually all numbers from *any* normal distribution occur within three standard deviations of the mean.
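To see that the word *any* is justified, the three probabilities can be computed for several choices of mean and standard deviation at once. The sketch below is our own illustration; the function `within_k_sd` and the data frame `rule` are hypothetical names.

```r
# Hypothetical helper: probability of falling within k standard
# deviations of the mean for a normal distribution.
within_k_sd <- function(k, mean, sd) {
  pnorm(mean + k * sd, mean = mean, sd = sd) -
    pnorm(mean - k * sd, mean = mean, sd = sd)
}

k <- 1:3
rule <- data.frame(
  k            = k,
  standard     = within_k_sd(k, mean = 0,   sd = 1),
  mean100_sd10 = within_k_sd(k, mean = 100, sd = 10),
  mean10_sd2   = within_k_sd(k, mean = 10,  sd = 2)
)
print(rule)
```

Each row shows the same probability in every column (about 0.68, 0.95, and 0.997), whatever the mean and standard deviation.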

### Quantiles

In the activity The Standard Normal Distribution, the command **qnorm** was introduced to calculate quantiles. This command is fed the area under the curve to the left of some unknown number, then calculates the unknown number.

For example, suppose that the area under the curve to the left of some unknown *x*-value is 0.95, as shown in Figure 14.

**Figure 14.** The area under the curve to the left of some unknown *x*-value is 0.95.

To find the unknown value of *x* we use R's **qnorm** command (the "q" is for "quantile").

    > qnorm(0.95,mean=100,sd=10)
    [1] 116.4485

Hence, there is a 95% probability that a number chosen at random from the normal distribution with mean 100 and standard deviation 10 is less than or equal to 116.4485.

In a sense, R's **pnorm** and **qnorm** commands play the roles of inverse functions. On one hand, the command **pnorm** is fed a number and asked to find the probability that a random selection from the normal distribution falls to the left of this number. On the other hand, the command **qnorm** is given the probability and asked to find a limiting number so that the area under the curve to the left of that number equals the given probability.
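The inverse relationship can be checked with a round trip in each direction. This is a minimal sketch using the same mean and standard deviation as above; the variable names are ours.

```r
# Start with a probability, find the quantile, then recover the probability.
x_q    <- qnorm(0.95, mean = 100, sd = 10)
p_back <- pnorm(x_q, mean = 100, sd = 10)
print(x_q)
print(p_back)

# Start with a number, find the probability, then recover the number.
p_90   <- pnorm(90, mean = 100, sd = 10)
x_back <- qnorm(p_90, mean = 100, sd = 10)
print(x_back)
```

Up to rounding error, `p_back` returns the 0.95 we started with and `x_back` returns 90.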

We must emphasize that the area under the curve to the left is used when applying the commands **pnorm** and **qnorm**. If you are given an area to the right, then you must make a simple adjustment before applying the **qnorm** command. Suppose, as shown in Figure 15, that the area to the right of an unknown number is 0.80.

**Figure 15.** The area under the curve to the right of some unknown *x*-value is 0.80.

In this case, we must subtract the area to the right from the number 1 to obtain 1 - 0.80 = 0.20, which is the area to the left of the unknown value of *x* shown in Figure 16.

**Figure 16.** The area under the curve to the left of some unknown *x*-value is 1 - 0.80 = 0.20.

We can now use the **qnorm** command to find the unknown value of *x*.

    > qnorm(0.20,mean=100,sd=10)
    [1] 91.58379

We now know that the probability of selecting a number from the normal distribution with mean 100 and standard deviation 10 that is greater than or equal to 91.58379 is 0.80.
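As an alternative to subtracting from 1 by hand, both **pnorm** and **qnorm** accept a `lower.tail` argument; setting `lower.tail=FALSE` tells R to work with the area to the right. The sketch below compares the two routes. The variable names are ours.

```r
# Route 1: convert the right-hand area 0.80 to a left-hand area 0.20.
x_left <- qnorm(1 - 0.80, mean = 100, sd = 10)

# Route 2: give qnorm the right-hand area directly.
x_right <- qnorm(0.80, mean = 100, sd = 10, lower.tail = FALSE)

print(x_left)    # [1] 91.58379
print(x_right)   # [1] 91.58379

# Check: the area to the right of that value is 0.80.
p_right <- pnorm(x_right, mean = 100, sd = 10, lower.tail = FALSE)
print(p_right)
```

Either route gives the same answer, so the choice is a matter of which reads more clearly in context.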

### Enjoy!

We hope you enjoyed this introduction to the normal distribution in the R system. We encourage you to explore further.