Histograms in R

Descriptive statistics relies on several important types of plots. One that appears frequently in newspapers and magazines is the histogram: a sequence of vertical bars, where the height of each bar represents a count of the data values falling in a "bin." The "count," "bins," and "bars" are explained in the figures that follow.

Creating a Histogram of the Standard Normal Distribution

The standard normal distribution is the famous "bell-shaped" curve of statistics, known to have a mean value of zero and a standard deviation of one. If you are not familiar with the standard normal distribution, read on. This distribution will be made clear in the images that follow.

We first ask for help on rnorm.

> ?rnorm

The help file describes several commands relating to the normal distribution. The one of interest for this activity is R's rnorm command, with syntax rnorm(n,mean,sd), whose arguments are used as follows:

  • n - the number of random numbers requested
  • mean - the mean of the normal distribution of requested numbers
  • sd - the standard deviation of the normal distribution of requested numbers

Let's begin by drawing 1000 numbers from this "standard normal" distribution.

> x=rnorm(1000,mean=0,sd=1)

Well, that was easy! And fast! The variable x now contains 1000 random numbers drawn from the standard normal distribution. You can view these numbers by typing x at the R prompt. The first few numbers are shown below.

> x
   [1] -1.123987512  0.865229526 -1.325374408  0.679182289 -1.184965803
   [6]  1.755767521  0.064290993 -1.733885165  0.470695523  0.303721954
  [11]  0.496295681 -0.431201657  1.378353239  1.729874427  0.445363031
        ...

You can test the length of this vector with the length command.

> length(x)
[1] 1000
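
As a quick sanity check, the sample mean and standard deviation of x should land close to 0 and 1. The sketch below is an addition to the walkthrough: the call to set.seed and the seed value 42 are our own assumptions, used only so that the same numbers come back on every run.

```r
# Assumption: a fixed seed (42 is arbitrary) makes the random draw repeatable.
# Note: <- is R's usual assignment operator; = also works here.
set.seed(42)
x <- rnorm(1000, mean = 0, sd = 1)

length(x)  # number of values drawn
mean(x)    # sample mean, expected to be near 0
sd(x)      # sample standard deviation, expected to be near 1
```

The sample values will not equal 0 and 1 exactly, because x is a random sample rather than the distribution itself.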

We have our 1000 random numbers, so just how do we go about creating a histogram of these numbers? If we were doing the problem by hand, we would first define some categories called "bins", and then sort the random numbers into these "bins". The final count of occurrences in each bin might look like the following.

Counting Occurrences In Each Bin
(-3.5,-3] 1
(-3,-2.5] 6
(-2.5,-2] 21
(-2,-1.5] 46
(-1.5,-1] 82
(-1,-0.5] 147
(-0.5,0] 163
(0,0.5] 207
(0.5,1] 171
(1,1.5] 103
(1.5,2] 33
(2,2.5] 13
(2.5,3] 5
(3,3.5] 2

The table shows that only one number fell between -3.5 and -3, 6 numbers fell between -3 and -2.5, and so on. Proceeding by hand, we would next set up axes on graph paper, partition the horizontal axis into "bins", then scale the vertical axis to accommodate the counts in our table. Over each bin we would draw a rectangle whose height corresponds to the number of occurrences in that particular bin.

Performing this task by hand is a painstaking procedure. R can automate the process for us, leaving more time for interpreting the results and less for the tedium of sorting 1000 numbers into bins and counting them by hand.
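
The sort-and-count step itself can also be done in R. The following is a minimal sketch, assuming the base R functions cut and table; the seed value and the variable names edges and counts are our own choices, so the counts will differ from the table above.

```r
# Assumption: fixed seed (arbitrary) for a repeatable draw.
set.seed(42)
x <- rnorm(1000, mean = 0, sd = 1)

# Bin edges from -3.5 to 3.5 in steps of 0.5, matching the hand-built table.
edges <- seq(-3.5, 3.5, by = 0.5)

# cut() labels each value with its bin, e.g. "(-0.5,0]"; table() counts the labels.
# Values outside the edges become NA and are left out of the counts.
counts <- table(cut(x, breaks = edges))
counts
```

Each label such as (-0.5,0] is an interval that excludes its left endpoint and includes its right endpoint, the same convention used in the table.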

What follows is the simplest way to get a quick histogram of the data in the variable x.

> hist(x)

The command hist(x) creates the histogram shown in Figure 1. It also returns an object describing the bins and counts, which R does not print unless you ask for it.

Figure 1. A histogram of the standard normal distribution data in the variable x.
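
The numbers behind the picture are available too. As a small sketch (the seed and the variable name h are our own assumptions), passing plot=FALSE asks hist to skip the drawing and hand back the histogram object, whose breaks and counts components hold the bin edges and the count in each bin.

```r
# Assumption: fixed seed (arbitrary) for a repeatable draw.
set.seed(42)
x <- rnorm(1000, mean = 0, sd = 1)

# plot = FALSE skips the drawing and just returns the histogram object.
h <- hist(x, plot = FALSE)

class(h)    # "histogram"
h$breaks    # the bin edges R chose
h$counts    # how many values fell in each bin
```

There is always one more edge than there are bins, since each bin sits between two neighboring edges.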

R offers fine-grained control over the appearance and form of its histograms. We can learn more about the hist command through the help system of the R environment.

> ?hist

The help system responds with a wealth of information on R's hist command, a snippet of which follows.

Usage
hist(x, ...)

## Default S3 method:
hist(x, breaks = "Sturges",
     freq = NULL, probability = !freq,
     include.lowest = TRUE, right = TRUE,
     density = NULL, angle = 45, col = NULL, border = NULL,
     main = paste("Histogram of" , xname),
     xlim = range(breaks), ylim = NULL,
     xlab = xname, ylab,
     axes = TRUE, plot = TRUE, labels = FALSE,
     nclass = NULL, ...)

Let's first focus on the "breaks" argument to the hist command. This argument can be used in a number of ways.

  1. You can "suggest" the number of breaks ("bins") you want. For example, if we wanted fewer bins, we'd pass this request to the hist command as follows.

    > hist(x,breaks=5)
    

    Note that there are not exactly 5 bins in Figure 2, but there are fewer than before.

    Figure 2. The command hist(x,breaks=5) only "suggests" 5 bins.

  2. The second way is to direct the hist command to use a specific set of bins. Suppose we want the histogram to use the bins from our table. The seq command produces the edges of these bins.
    > bins=seq(-3.5,3.5,by=0.5)
    
    You can see the result of this command by typing bins at the R prompt.
    > bins
     [1] -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5
    [14]  3.0  3.5
    
    These are precisely the edges of the bins in our table. We can now request that R use these bins with the following command.
    > hist(x,breaks=bins)
    
    The result is shown in Figure 3.

    Figure 3. The command hist(x,breaks=bins) uses the bins stored in the variable bins.
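
Two details about the breaks argument are worth seeing in numbers. A single number is only a suggestion, which R adjusts so the edges fall on "pretty" values. A vector of edges is used as given, but it must cover every value in x; otherwise hist stops with an error saying that some values were not counted. The sketch below is our own illustration: the seed, the names h5 and wide_bins, and the idea of widening the edges to the nearest multiple of 0.5 are assumptions, not part of the original walkthrough.

```r
# Assumption: fixed seed (arbitrary) for a repeatable draw.
set.seed(42)
x <- rnorm(1000, mean = 0, sd = 1)

# A single number is a suggestion: inspect how many bins R actually used.
h5 <- hist(x, breaks = 5, plot = FALSE)
length(h5$counts)
h5$breaks

# A vector of edges is used exactly as given, but it must span range(x).
# Assumption: rounding the smallest value down and the largest value up to a
# multiple of 0.5 is one simple way to guarantee that.
wide_bins <- seq(floor(min(x) * 2) / 2, ceiling(max(x) * 2) / 2, by = 0.5)
h <- hist(x, breaks = wide_bins, plot = FALSE)
h$counts
```

With 1000 draws, a value beyond -3.5 or 3.5 turns up fairly often, so fixed edges such as seq(-3.5,3.5,by=0.5) will occasionally fail for a fresh sample.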

Annotations and Color

Finally, we'll add a bit of color and our own annotations to the axes and title. In the code that follows, the plus (+) sign is R's continuation prompt; R's shell adds it automatically, so do not type it yourself. After entering hist(x, at the prompt, hit the Enter key. Continue entering the code as shown, hitting Enter at the end of each line. When you close the parentheses on the last line and hit Enter, the entire command is executed.

> hist(x,
+ breaks=bins,
+ col="lightblue",
+ xlab="x-values",
+ ylab="count",
+ main="Random Numbers from the Standard Normal Distribution")

The result of this command is shown in Figure 4.

Figure 4. Adding custom axis annotations, a custom title, and color.
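
If the command is saved in a script file rather than typed at the prompt, neither the > prompt nor the + continuation prompt appears. The following is a sketch of the script form, with two assumptions of our own: a fixed seed, and bin edges computed from the data so that they always cover x.

```r
# Assumption: fixed seed (arbitrary) so the draw is repeatable.
set.seed(42)
x <- rnorm(1000, mean = 0, sd = 1)

# Assumption: edges widened to the nearest multiple of 0.5 so they span all of x.
bins <- seq(floor(min(x) * 2) / 2, ceiling(max(x) * 2) / 2, by = 0.5)

# The annotated histogram call, written as it would appear in a script file.
hist(x,
     breaks = bins,
     col = "lightblue",
     xlab = "x-values",
     ylab = "count",
     main = "Random Numbers from the Standard Normal Distribution")
```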

Things to Note in Figure 4:

  1. The histogram is approximately "balanced" about its mean, which appears to be located near zero. This is precisely what we would anticipate, because the numbers in the variable x were randomly drawn from the standard normal distribution, which has mean zero.
  2. The standard normal distribution has a standard deviation of one, so almost all of the data lies within 3 standard deviations of the mean, either way. That is, almost all of the data in the variable x falls between -3 and 3, which represents 3 standard deviations of 1 on either side of the mean 0.
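
The "almost all" in the second note can be checked directly. As a rough sketch (the seed and the names observed and expected are our own), the code below compares the share of the sample within 1, 2, and 3 standard deviations of zero with the theoretical shares given by the normal distribution function pnorm.

```r
# Assumption: fixed seed (arbitrary) for a repeatable draw.
set.seed(42)
x <- rnorm(1000, mean = 0, sd = 1)

# Share of the sample lying within 1, 2, and 3 standard deviations of zero.
observed <- sapply(1:3, function(k) mean(abs(x) <= k))

# Theoretical shares from the normal cumulative distribution function.
expected <- pnorm(1:3) - pnorm(-(1:3))

round(rbind(observed, expected), 4)
```

The theoretical shares are roughly 68%, 95%, and 99.7%, so in a sample of 1000 we would expect only a handful of values outside the interval from -3 to 3.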

Enjoy!

We hope you enjoyed this introduction to the R system, which provides a strong interactive interface for exploration in statistics.

We encourage you to explore further. Use the command ?hist to learn more about what you can do with the hist command.