## Histograms in R

Several types of plots are important in descriptive statistics. One that appears frequently in newspapers and magazines is the histogram: a sequence of vertical bars, where the height of each bar represents a count of the data values falling in a "bin." The "count", "bins", and "bars" will be explained in the images that follow.

### Creating a Histogram of the Standard Normal Distribution

The standard normal distribution is the famous "bell-shaped" curve of statistics, known to have a mean value of zero and a standard deviation of one. If you are not familiar with the standard normal distribution, read on. This distribution will be made clear in the images that follow.

We first ask for help on **rnorm**.

> ?rnorm

The help file response describes the use of several commands relating to the normal distribution. The one of interest for this activity is R's **rnorm** command, with syntax **rnorm(n,mean,sd)**, which takes the following arguments:

- **n** - the number of random numbers requested
- **mean** - the mean of the normal distribution of requested numbers
- **sd** - the standard deviation of the normal distribution of requested numbers
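As a small illustration of these arguments, the sketch below draws a handful of values from two different normal distributions. The mean of 100 and standard deviation of 15 are arbitrary values chosen only for this example, and the variable names are hypothetical.

```r
# `<-` is R's usual assignment operator; `=` works the same way here.

# Five values from a normal distribution with mean 100 and sd 15
# (arbitrary illustration values, not taken from the tutorial).
y <- rnorm(5, mean = 100, sd = 15)
length(y)

# mean and sd default to 0 and 1, so both calls below request
# numbers from the standard normal distribution.
z1 <- rnorm(5)
z2 <- rnorm(5, mean = 0, sd = 1)
```

Because the draws are random, the values in **y**, **z1**, and **z2** change every time the code is run.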

Let's begin by drawing 1000 numbers from this "standard normal" distribution.

> x=rnorm(1000,mean=0,sd=1)

Well, that was easy! And fast! The variable **x** now contains 1000 random numbers drawn from the standard normal distribution. You can view these numbers by typing **x** at the R prompt. The first few numbers are shown below.

> x
 [1] -1.123987512  0.865229526 -1.325374408  0.679182289 -1.184965803
 [6]  1.755767521  0.064290993 -1.733885165  0.470695523  0.303721954
[11]  0.496295681 -0.431201657  1.378353239  1.729874427  0.445363031
...

You can test the length of this vector with the **length** command.

> length(x)
[1] 1000
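Because the numbers are random, the values you see will differ from those printed above. If you want the same numbers each time, one option is to set the seed of the random number generator first. The sketch below assumes an arbitrary seed of 123; any integer would do, and the name **x_again** is hypothetical.

```r
set.seed(123)                        # arbitrary seed, chosen for illustration
x <- rnorm(1000, mean = 0, sd = 1)

set.seed(123)                        # resetting the seed repeats the same draws
x_again <- rnorm(1000, mean = 0, sd = 1)

identical(x, x_again)                # TRUE: both vectors hold the same numbers
head(x)                              # show the first six values
```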

We have our 1000 random numbers, so just how do we go about creating a histogram of these numbers? If we were doing the problem by hand, we would first define some categories called "bins", and then sort the random numbers into these "bins". The final count of occurrences in each bin might look like the following.

Counting Occurrences in Each Bin

| Bin | Count |
| --- | --- |
| (-3.5,-3] | 1 |
| (-3,-2.5] | 6 |
| (-2.5,-2] | 21 |
| (-2,-1.5] | 46 |
| (-1.5,-1] | 82 |
| (-1,-0.5] | 147 |
| (-0.5,0] | 163 |
| (0,0.5] | 207 |
| (0.5,1] | 171 |
| (1,1.5] | 103 |
| (1.5,2] | 33 |
| (2,2.5] | 13 |
| (2.5,3] | 5 |
| (3,3.5] | 2 |

The table shows that only one number fell between -3.5 and -3, 6 numbers fell between -3 and -2.5, and so on. Proceeding by hand, we would next set up axes on graph paper, partition the horizontal axis into "bins", then scale the vertical axis to accommodate the counts in our table. Over each bin we would draw a rectangle whose height corresponds to the number of occurrences in that particular bin.

Performing this task by hand is a painstaking procedure. R can automate the process for us, which will allow us to spend more time **interpreting the results** and less time dealing with the tedium of sorting 1000 numbers into bins and counting them by hand.
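The sorting-and-counting step can itself be sketched in R. The example below is one possible approach, not necessarily how the table above was produced: it uses **cut** to assign each number to a bin and **table** to count the numbers in each bin. The seed and the name **edges** are assumptions made for this illustration, so the counts will not match the table above.

```r
set.seed(123)                          # arbitrary seed, for repeatable output
x <- rnorm(1000)

# Bin edges from -3.5 to 3.5 in steps of 0.5 (14 bins)
edges <- seq(-3.5, 3.5, by = 0.5)

# cut() labels each value with its bin; by default the bins are
# closed on the right, e.g. (-3.5,-3]
binned <- cut(x, breaks = edges)

# table() counts how many values landed in each bin
counts <- table(binned)
counts

# Values outside the edges are labelled NA and are not counted
sum(is.na(binned))
```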

What follows is the simplest way to get a quick histogram of the data in the variable **x**.

> hist(x)

The command **hist(x)** responds by opening a graphics window and drawing the nice looking histogram shown in Figure 1.

Figure 1. A histogram of the standard normal distribution data in the variable **x**.
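Besides drawing the plot, **hist** returns an object describing the histogram, which R does not print unless asked. The sketch below, using an arbitrary seed and the hypothetical name **h**, stores that object and inspects the bin edges and counts that R chose. The argument **plot = FALSE** skips the drawing step.

```r
set.seed(123)                  # arbitrary seed, for repeatable output
x <- rnorm(1000)

# Compute the histogram without drawing it
h <- hist(x, plot = FALSE)

h$breaks                       # the bin edges R selected
h$counts                       # the number of values in each bin
h$mids                         # the midpoint of each bin
sum(h$counts)                  # every value is counted exactly once
```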

R offers fine-grained control over the appearance and form of its histograms. We can learn more about this command through the interactive shell of the R environment.

> ?hist

The help system responds with a wealth of information on R's **hist** command, a snippet of which follows.

    Usage

    hist(x, ...)

    ## Default S3 method:
    hist(x, breaks = "Sturges", freq = NULL, probability = !freq,
         include.lowest = TRUE, right = TRUE,
         density = NULL, angle = 45, col = NULL, border = NULL,
         main = paste("Histogram of" , xname),
         xlim = range(breaks), ylim = NULL,
         xlab = xname, ylab,
         axes = TRUE, plot = TRUE, labels = FALSE,
         nclass = NULL, ...)

Let's first focus on the "breaks" argument to the **hist** command. This argument can be used in a number of ways.

- You can "suggest" the number of breaks ("bins") you want. For example, if we wanted fewer bins, we'd pass this request to the **hist** command as follows.

  > hist(x,breaks=5)

  Note that there are not exactly 5 bins in Figure 2, but there are fewer than in Figure 1.

  Figure 2. The command **hist(x,breaks=5)** only "suggests" 5 bins.

- There is a second way to proceed, and that entails directing the **hist** command to use a specific set of bins. Let's say that we want the histogram to use the bins described in our tabular work. Use the **seq** command to produce this set of bins.

  > bins=seq(-3.5,3.5,by=0.5)

  You can view these bins by typing **bins** at the R prompt.

  > bins
   [1] -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5
  [14]  3.0  3.5

  Now pass this set of bins to the **hist** command.

  > hist(x,breaks=bins)

  Figure 3. The command **hist(x,breaks=bins)** uses the bins stored in the variable **bins**.
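Two practical points about the **breaks** argument can be checked directly. First, a single number is only a suggestion, so it is worth counting the bins R actually used. Second, a specific set of bins must cover every data value; if a random draw happens to fall below -3.5 or above 3.5, **hist** stops with an error instead of drawing the plot. The sketch below shows one way to build half-unit bins from the data's own range. The seed and the names **h5**, **lo**, **hi**, and **safe_bins** are assumptions made for this illustration.

```r
set.seed(123)                          # arbitrary seed, for repeatable output
x <- rnorm(1000)

# breaks = 5 is a suggestion: count the bins that were actually used
h5 <- hist(x, breaks = 5, plot = FALSE)
length(h5$counts)

# Round the smallest value down and the largest value up to the
# nearest multiple of 0.5, so the bins are certain to span the data
lo <- floor(min(x) * 2) / 2
hi <- ceiling(max(x) * 2) / 2
safe_bins <- seq(lo, hi, by = 0.5)

h <- hist(x, breaks = safe_bins, plot = FALSE)
sum(h$counts)                          # all 1000 values are counted
```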

### Annotations and Color

Finally, we'll add a bit of color and our own personal annotations to the axes and title. In the code that follows, the plus (+) sign is the line continuation character. After entering **hist(x,**, hit the Enter key. The plus sign is added by R's shell automatically. Continue entering the code as shown, hitting Enter at the end of each line. When you close the parentheses on the last line and hit Enter, the entire command is executed.

> hist(x,
+ breaks=bins,
+ col="lightblue",
+ xlab="x-values",
+ ylab="count",
+ main="Random Numbers from the Standard Normal Distribution")

The result of this command is shown in Figure 4.

Figure 4. Adding custom axes annotations, a custom title, and color.
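The plus signs appear only at the interactive prompt. If the same command is saved in a script file, it is written without them. The sketch below shows that form and also saves the figure to a PDF file. The seed, the bin construction, and the output location **out_file** (a temporary file) are assumptions made for this illustration.

```r
set.seed(123)                                   # arbitrary seed
x <- rnorm(1000)

# Half-unit bins wide enough to cover every value in x
bins <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)

out_file <- tempfile(fileext = ".pdf")          # hypothetical output location
pdf(out_file)                                   # send the plot to a PDF file
hist(x,
     breaks = bins,
     col = "lightblue",
     xlab = "x-values",
     ylab = "count",
     main = "Random Numbers from the Standard Normal Distribution")
dev.off()                                       # close the file

file.exists(out_file)
```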

Things to Note in Figure 4:

- Note how the histogram is approximately "balanced" about its mean, which appears to be located approximately at zero. This is precisely what we would anticipate, given that the numbers in the variable **x** were randomly drawn from the standard normal distribution, which has mean zero.
- Because the numbers in the variable **x** were drawn from the standard normal distribution, which has a standard deviation of one, almost all of the data occur within 3 standard deviations of the mean, either way. That is, almost all of the data in the variable **x** fall between -3 and 3, which represents 3 standard deviations of 1 on either side of the mean 0.
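Both observations can be checked numerically. The sketch below, which assumes an arbitrary seed and the hypothetical names **frac_within_3** and **theory**, computes the sample mean and standard deviation, then compares the fraction of the data lying within 3 standard deviations of zero against the value predicted by the standard normal distribution.

```r
set.seed(123)                          # arbitrary seed, for repeatable output
x <- rnorm(1000)

mean(x)                                # should be close to 0
sd(x)                                  # should be close to 1

# Fraction of the sample between -3 and 3
frac_within_3 <- mean(abs(x) <= 3)
frac_within_3

# Probability that a standard normal value lies between -3 and 3
theory <- pnorm(3) - pnorm(-3)
theory                                 # about 0.9973
```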

### Enjoy!

We hope you enjoyed this introduction to the R system. This interactive system provides a strong interface for exploration in statistics.

We encourage you to explore further. Use the command **?hist** to learn more about what you can do with the **hist** command.