Boxplots in R
In this activity we show our readers how to create a boxplot in R. In preparation for this activity, we must first explore what statisticians call "measures of central tendency," specifically the mean and median of a data set.
Measures of Central Tendency
We first create a set of data that we will use throughout this activity. Although our data set is somewhat artificial, all of what we explain in this activity (as it relates to our data set) can also be applied to any set of data chosen by our readers. With this thought in mind, we enter our data set at the R prompt.
> x=c(0,4,15, 1, 6, 3, 20, 5, 8, 1, 3)
We can examine the contents of the variable x with the following command:
> x
 [1]  0  4 15  1  6  3 20  5  8  1  3
One of the most important measures of central tendency in statistics is the mean, which is found by summing the elements of the data set, then dividing by the number of elements in the set. We can sum the elements in the data set contained in the variable x with the R-command sum.
> sum(x)
[1] 66
Readers should convince themselves (get out the pencil and paper) that the elements in the variable x do indeed sum to 66.
To find the number of elements in the list stored in the variable x, we can get out our abacus and count them, or we can use R's length command.
> length(x)
[1] 11
Readers should check that the list stored in x does indeed have 11 elements. Count them!
To find the average of the list stored in x, we divide the sum by the number of elements in the list.
> sum(x)/length(x)
[1] 6
Recall that the sum of the elements in x was 66, the length was 11, so the average (or mean) is 66/11=6, as verified by our R-command sum(x)/length(x).
However, because finding the mean is such a common requirement in most statistical analysis, it should come as no surprise that R has a command for finding the mean of a data set.
> mean(x)
[1] 6
The mean of a data set can be strongly influenced by "outliers" in the data. Consider anew the data stored in the variable x.
> x
 [1]  0  4 15  1  6  3 20  5  8  1  3
Let's sketch a quick histogram of the data stored in x. The following command produces the histogram shown in Figure 1.
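A minimal call that draws such a histogram is sketched here. It assumes base R's hist function with its default choice of bins; the exact arguments behind Figure 1 are an assumption.

```r
# Data set used throughout this activity
x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

# Draw a histogram with R's default bins
# (assumed; Figure 1 may have been drawn with other settings)
h <- hist(x)

# The returned object records the bin edges and the count in each bin
h$breaks
h$counts
```

The counts should add up to the number of data points, which is a quick way to confirm that no value fell outside the bins.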
Figure 1. A histogram of the data stored in the variable x. Note that the data is badly skewed to the right.
Note the long tail to the right in Figure 1. Statisticians say that the data is "skewed to the right."
Imagine that the bars of the histogram represent masses of equal density. If we were to place a fulcrum or "knife-edge" located at the mean (at x = 6), the masses would balance. The outliers can greatly affect the placement of the mean. It's like an old-fashioned "teeter-totter." A child seated at a greater distance from the fulcrum is able to balance a much heavier child seated closer to the fulcrum.
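The balance-point picture can be checked numerically on the data points themselves: the distances of the data below the mean and above the mean cancel. A small sketch follows; the names deviations, below_mean, and above_mean are ours, chosen only for illustration.

```r
x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)
m <- mean(x)

# Signed distance of each data point from the fulcrum at the mean
deviations <- x - m

# Total "pull" on each side of the fulcrum
below_mean <- sum(deviations[deviations < 0])   # left side
above_mean <- sum(deviations[deviations > 0])   # right side

below_mean + above_mean   # the two sides cancel
```

Only three points lie above the mean (8, 15, and 20), yet they balance the seven points below it, because 15 and 20 sit so far from the fulcrum.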
To pursue this line of reasoning a bit further, imagine that the numbers contained in the variable x represent speakers' fees in thousands of dollars. Let's sort the data in ascending order.
> sort(x)
 [1]  0  1  1  3  3  4  5  6  8 15 20
It is probably unfair to say that the "average speaking fee is $6,000." Although statistically correct, the average (mean) speaking fee ($6,000) does not reflect a common speaking charge for this collection of speakers. Indeed, the two outliers (the speakers charging $15,000 and $20,000) unduly influence the mean.
A second measure of central tendency, a statistic called the median, will be seen to more closely resemble what a group might be charged should they hire one of the speakers represented in the data set stored in x. The median is defined to be the data item that is precisely in the middle of the sorted data set; that is, half (50%) of the data occurs to the left of the median, and half (50%) occurs to the right of the median.
In the case that the data set has an odd number of elements, it is a simple matter to spot the data item that lies precisely in the middle. The data set stored in the variable x has 11 elements. Hence, the sixth element lies exactly in the middle of this data set. Thus, the median is 4. Note that this number represents a speaking fee of $4,000, which is probably more representative of a "middling fee" that a group might expect should they use one of the speakers represented by the data stored in x.
Of course, R finds the median with ease.
> median(x)
[1] 4
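To see how differently the two statistics react to the outliers, one can recompute both after setting aside the two largest fees. This is only an illustrative sketch; the variable trimmed is a hypothetical name, and dropping data is done here purely for comparison.

```r
x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

# Hypothetical comparison: set aside the two largest fees (15 and 20)
trimmed <- x[x < 15]

mean(x)           # mean with the outliers
mean(trimmed)     # mean without them
median(x)         # median with the outliers
median(trimmed)   # median without them
```

The mean falls from 6 to 31/9 (about 3.44), while the median only moves from 4 to 3, which is why the median is often described as the more robust of the two.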
If a data set has an even number of elements, the median is found by averaging the two "middle elements." For example, the following data set has six elements.
> y=1:6
> y
[1] 1 2 3 4 5 6
The median is found by averaging the third and fourth elements; that is, the median is (3 + 4)/2 = 3.5. R is completely aware of the even case.
> median(y)
[1] 3.5
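The odd and even cases described above can be written out by hand. The function manual_median below is a hypothetical helper, not part of R, intended only to mirror the definition in the text.

```r
# Hypothetical helper (not part of R) that follows the definition in the text
manual_median <- function(v) {
  s <- sort(v)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd length: the single middle element
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # even length: average the two middle elements
  }
}

x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)
y <- 1:6

manual_median(x)   # compare with median(x)
manual_median(y)   # compare with median(y)
```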
The median of a data set is located so that 50% of the data occurs to the left of the median (and 50% of the data occurs to the right of the median). There is no reason to restrict our attention to the 50% level. For example, we can find a point where 25% of the data occurs on its left (and 75% to its right). This point is known as the first "quartile" and is found with the following R command:
> sort(x)
 [1]  0  1  1  3  3  4  5  6  8 15 20
> quantile(x,0.25)
25%
  2
To help explain, we've listed the data set in ascending order. R provides nine different algorithms for computing quantiles, which can be viewed by typing the command ?quantile. The default technique is to use linear interpolation to find the entry in the position given by the formula 1 + p(n - 1), where p is the required proportion and n is the length of the data set. In this particular case, p = 0.25 and n = 11, so 1 + p(n - 1) = 3.5. Thus, R will interpolate (linearly) a number that is exactly halfway between the third and fourth entries of the sorted data (1 and 3), arriving at 1 + 0.5(3 - 1) = 2.
In similar fashion, R will approximate the 75% quantile with the following command:
> sort(x)
 [1]  0  1  1  3  3  4  5  6  8 15 20
> quantile(x,0.75)
75%
  7
Note that 1 + p(n - 1) = 1 + 0.75(11 - 1) = 8.5, so R reports the number that is exactly halfway between the eighth and ninth entries, namely 6 + 0.5(8 - 6) = 7.
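The interpolation rule can be traced step by step. The function interp_quantile below is a hypothetical helper written for illustration, not R's own implementation; it simply applies the position formula 1 + p(n - 1) to the sorted data.

```r
# Hypothetical helper (not part of R): linear interpolation at position 1 + p(n - 1)
interp_quantile <- function(v, p) {
  s <- sort(v)
  n <- length(s)
  pos <- 1 + p * (n - 1)   # fractional position in the sorted data
  lo <- floor(pos)         # index of the entry at or just below that position
  hi <- ceiling(pos)       # index of the entry at or just above that position
  s[lo] + (pos - lo) * (s[hi] - s[lo])
}

x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

interp_quantile(x, 0.25)   # position 3.5, between the entries 1 and 3
interp_quantile(x, 0.75)   # position 8.5, between the entries 6 and 8

# R's default algorithm (type = 7) for comparison
quantile(x, c(0.25, 0.75), type = 7)
```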
In general, the p% quantile is a number that has p% of the data to its left. For the remainder of this activity, the most important statistics are the minimum, first quartile, median, third quartile, and the maximum. We can use the quantile command to compute all of these at once.
> quantile(x,c(0,0.25,0.5,0.75,1))
  0%  25%  50%  75% 100%
   0    2    4    7   20
However, R's summary command will report each of these quantiles with descriptive headers, and throw in the mean for good measure.
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       2       4       6       7      20
The Spread of the Data Set
Two pairs of numbers in the summary for our data set give the user a sense of the "spread" of the data involved. The first is the range of the data set.
> range(x)
[1]  0 20
Note that R's range command reports the minimum and maximum entries in the data set. In addition, R's IQR command gives the interquartile range.
> IQR(x)
[1] 5
The interquartile range reports the difference between the 75% quantile and the 25% quantile. In this case, IQR = 7 - 2 = 5.
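Both measures of spread can be reproduced by hand from the numbers already computed. A brief sketch, in which the variable q is ours:

```r
x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

# Width of the full range: maximum minus minimum
diff(range(x))

# IQR by hand from the two quartiles; unname() drops the "75%" label
q <- quantile(x, c(0.25, 0.75))
unname(q[2] - q[1])

# R's built-in command for comparison
IQR(x)
```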
The Standard Boxplot
It is easier to explain the boxplot if we first have a picture to which we can refer in the discussion. So, without any further ado, here is how R produces a boxplot for the data in the variable x.
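One call that produces whiskers reaching all the way to the minimum and maximum is sketched below. It assumes base R's boxplot function with range = 0, which tells R to extend the whiskers to the data extremes; whether Figure 2 was drawn with exactly these arguments is an assumption.

```r
x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

# range = 0 extends the whiskers to the smallest and largest data values
b <- boxplot(x, range = 0)

# The five values that were drawn:
# lower whisker end, lower box edge, median, upper box edge, upper whisker end
b$stats
```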
The above command was used to produce the boxplot shown in Figure 2.
Figure 2. The minimum, quartiles, median, and maximum are used to construct a "box and whisker plot."
So, how is this boxplot constructed? First, recall the summary data for the data in the variable x.
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       2       4       6       7      20
Here are the steps for creating the standard box and whiskers plot.
- Draw a thick, dark, horizontal segment at the median, that is, at 4. See Figure 2.
- Draw horizontal lines at the first and third quartiles, that is, at 2 and at 7. Use these to draw the "box." See Figure 2.
- From the bottom edge of the box, draw a "whisker" that extends to the minimum data value, namely 0. See Figure 2.
- From the top edge of the box, draw a "whisker" that extends to the maximum data value, namely 20. See Figure 2.
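The five values used in the steps above can be gathered in one call. The sketch below uses R's fivenum command, which returns the minimum, lower hinge, median, upper hinge, and maximum. For this data set the hinges coincide with the quartiles reported by summary, though in general the two can differ slightly. The labels attached to the result are ours.

```r
x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(x)

# Descriptive labels (ours) matching the parts of the plot
names(five) <- c("whisker_low", "box_low", "median", "box_high", "whisker_high")
five
```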
There are several important points that need making in regard to our box and whisker plot.
- 50% of the data occurs between the lower and upper edges of the box, namely, between the first and third quartiles located at 2 and 7, respectively.
- The lower 50% of the data occurs below the median, the dark horizontal line in the box in Figure 2. Likewise, the upper 50% of the data occurs above the median line in the box.
- The lower 25% of the data occurs between the bottom edge of the box and the bottom end of the lower whisker. Likewise, the upper 25% of the data occurs between the top edge of the box and the top end of the upper whisker.
The Modified Box Plot
The Standard Box Plot does not pay special attention to outliers that might be present. The Modified Box Plot is constructed so as to highlight outliers. As in the Standard Boxplot described above, let's begin with a picture. Note that the Modified Boxplot is the default in R, and requires no special parameters.
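Since the modified boxplot is R's default, a bare call to boxplot should suffice. A minimal sketch follows; the exact call behind Figure 3 is an assumption.

```r
x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

# Default settings: whiskers stop at the most extreme points
# that lie within 1.5 x IQR of the box
b <- boxplot(x)

b$stats   # whisker ends, box edges, and median
b$out     # points drawn individually as outliers
```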
The above command was used to produce the modified "box and whiskers" plot shown in Figure 3.
Figure 3. A modified boxplot marks outliers for "special attention."
Before explaining the construction, let's repeat the sorted data and the summary information.
> sort(x)
 [1]  0  1  1  3  3  4  5  6  8 15 20
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       2       4       6       7      20
Here are the steps required to construct a modified box and whiskers plot:
- The median and the quartiles are used to construct the box in exactly the same manner used to construct the standard boxplot.
- Multiply the IQR by 1.5. So, in this case, 1.5 x IQR = 1.5 (5) = 7.5. Let's call this result the STEP. That is, STEP = 7.5.
- Add the STEP to the third quartile, obtaining 3rd Quartile + STEP = 7 + 7.5 = 14.5. Use this to perform two tasks:
  - Any data beyond 14.5 is plotted using an empty circle. This explains the two circles at 15 and 20 in Figure 3.
  - Locate the largest data point below 14.5. This is the number 8. This is where the end of the upper whisker is drawn.
- Subtract the STEP from the first quartile, obtaining 1st Quartile - STEP = 2 - 7.5 = -5.5. Use this to perform two tasks:
  - Any data below -5.5 is plotted using an empty circle. There are no such data points in Figure 3.
  - Locate the smallest data point that occurs above -5.5. This is the number 0. This is where the end of the lower whisker is drawn.
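The steps above can be followed by hand in R. This sketch uses the quantile command for the quartiles, as the text does; R's boxplot command works from hinges instead, which happen to agree with the quartiles for this data set. All variable names below are ours.

```r
x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
step <- 1.5 * (q3 - q1)   # the STEP from the text

upper_fence <- q3 + step
lower_fence <- q1 - step

# Points outside the fences are the ones drawn as empty circles
outliers <- x[x > upper_fence | x < lower_fence]

# Whiskers end at the most extreme points still inside the fences
upper_whisker <- max(x[x <= upper_fence])
lower_whisker <- min(x[x >= lower_fence])

c(lower_fence, upper_fence)
outliers
c(lower_whisker, upper_whisker)
```

The results can be compared against boxplot.stats(x), whose stats and out components hold the same whisker ends and outliers.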
We hope you enjoyed this introduction to the R system, which provides a strong interactive interface for exploration in statistics.
We encourage you to explore further. Use the command ?boxplot to learn more about what you can do with the boxplot command.