Linear Regression in R

In this activity we will explore the relationship between a pair of variables. We will first learn to create a scatter plot for the given data, then we will learn how to craft a "Line of Best Fit" for our plot.


The data set in the table that follows is taken from The Data and Story Library. Researchers measured the heights of 161 children in Kalama, a village in Egypt. The heights were averaged and recorded each month, with the study lasting several years.

Mean Height versus Age

Age in Months    Average Height in Centimeters
18               76.1
19               77.0
20               78.1
21               78.2
22               78.8
23               79.7
24               79.9
25               81.1
26               81.2
27               81.8
28               82.8
29               83.5

To build our scatter plot, we must first enter the data in R. This is a fairly straightforward task. Because the ages increase in steps of one month, we can use the command age=18:29, which uses R's start:finish syntax to begin a vector at the number 18, then increment by 1 until the number 29 is reached.

> age=18:29

We can view the result by typing age at R's prompt.

> age
 [1] 18 19 20 21 22 23 24 25 26 27 28 29

Entering the average heights is a bit more tedious, but straightforward.

> height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)

Again, we can view the result by entering height at R's prompt.

> height
 [1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8 83.5

We can check that age and height have the same number of elements with R's length command.

> length(age)
[1] 12
> length(height)
[1] 12
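
As an optional alternative, the two vectors can be bundled into a single data frame so that each age stays paired with its height. The following is only a sketch: the name kalama is our own choice for illustration and is not used elsewhere in this activity.

```r
# Re-enter the data so that this example runs on its own.
age = 18:29
height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)

# Stop with an error if the two vectors differ in length.
stopifnot(length(age) == length(height))

# "kalama" is a hypothetical name chosen for this illustration.
kalama = data.frame(age = age, height = height)

# Show the first few rows and the dimensions (rows, columns).
head(kalama)
dim(kalama)
```

Keeping the data in one object makes it harder to accidentally edit one vector without the other, although the separate vectors work just as well for the rest of this activity.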

It is now a simple matter to produce a scatterplot of height versus age.

> plot(age,height)

The result of this command is shown in Figure 1.

Figure 1. A scatterplot of height versus age data.
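
By default, plot labels the axes with the variable names. If more descriptive labels are wanted, one possible approach is to pass the xlab, ylab, and main arguments, as in the sketch below. The label text is our own wording, not something prescribed by the data set.

```r
# Re-enter the data so that this example runs on its own.
age = 18:29
height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)

# xlab and ylab set the axis labels; main sets the title.
# The label wording here is an illustrative choice.
plot(age, height,
     xlab = "Age (months)",
     ylab = "Average height (cm)",
     main = "Mean Height versus Age")
```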

The Line of Best Fit

Note that the data in Figure 1 is approximately linear. As the age increases, the average height increases at an approximately constant rate. One could not fit a single line through each and every data point, but one could imagine a line that is fairly close to each data point, with some of the data points appearing above the line, others below for balance. In this next activity, we will calculate and plot the Line of Best Fit, or the Least Squares Regression Line.
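
For readers curious about the name, the "least squares" line is the one that makes the sum of the squared vertical distances between the data points and the line as small as possible. For a single predictor, the standard textbook formulas are slope = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2) and intercept = ybar - slope * xbar, where xbar and ybar are the means. The sketch below applies these formulas directly; the names xbar, ybar, m, and b are our own choices.

```r
# Re-enter the data so that this example runs on its own.
age = 18:29
height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)

# Means of the two variables.
xbar = mean(age)
ybar = mean(height)

# Slope: sum of products of deviations over sum of squared age deviations.
m = sum((age - xbar) * (height - ybar)) / sum((age - xbar)^2)

# Intercept: the least squares line passes through the point (xbar, ybar).
b = ybar - m * xbar

m
b
```

These values should agree, up to rounding, with the slope and intercept that R's lm command reports for the same data.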

We will use R's lm command to compute a "linear model" that fits the data in Figure 1. The command lm is a very sophisticated command with a host of options (type ?lm to view a full description), but in its simplest form, it is quite easy to use. The syntax height~age is called a model formula and is a very sophisticated R construct; we are using its simplest form here, which can be read as "height is modeled by age." The symbol separating "height" and "age" in the syntax height~age is a "tilde." On a standard US keyboard, it is located on the key to the immediate left of the #1 key, and you must hold the Shift key to type it.

> res=lm(height~age)

Let's examine what is returned in the variable res.

> res

Call:
lm(formula = height ~ age)

Coefficients:
(Intercept)          age  
     64.928        0.635  

Note the "Coefficients" part of the contents of res. These coefficients are the intercept and slope of the line of best fit. Essentially, we are being told that the equation of the line of best fit is:

height = 0.635 age + 64.928.

Note that this result has the form y = m x + b, where m is the slope and b is the intercept of the line.
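
The printed coefficients are rounded for display. If the unrounded values are needed, for example to reuse them in a later calculation, one way to retrieve them is with R's coef function, as sketched below. The names cf, intercept, and slope are our own choices.

```r
# Re-enter the data and refit the model so that this example runs on its own.
age = 18:29
height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
res = lm(height ~ age)

# coef returns a named numeric vector of the fitted coefficients.
cf = coef(res)

# Pick out each coefficient by name.
intercept = cf["(Intercept)"]
slope = cf["age"]

intercept
slope
```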

It is a simple matter to superimpose the "Line of Best Fit" on our scatterplot. The command abline reads the intercept and slope stored in res and draws the corresponding line on the existing plot.

> abline(res)

The result of this command is the "Line of Best Fit" shown in Figure 2.

Figure 2. The abline command superimposes the "Line of Best Fit" on our previous scatterplot.

The command abline is a versatile tool. You can learn more about this command by entering ?abline and reading the resulting help file.
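
As a small illustration of that versatility, the sketch below draws the fitted line with a color and width of our choosing, then draws a second line from an explicit intercept and slope using abline's a and b arguments. The colors and line styles are arbitrary choices made for this example.

```r
# Re-enter the data and refit the model so that this example runs on its own.
age = 18:29
height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
res = lm(height ~ age)

plot(age, height)

# Draw the fitted line in red, a little thicker than the default.
abline(res, col = "red", lwd = 2)

# abline also accepts an intercept (a) and a slope (b) directly.
# Here we pass the rounded coefficients from the lm output, drawn dashed in blue.
abline(a = 64.928, b = 0.635, col = "blue", lty = 2)
```

Because the second line uses rounded coefficients, it should lie almost exactly on top of the first.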


Now that we have the equation of the line of best fit, we can use the equation to make predictions. Suppose, for example, that we wished to estimate the average height of a child at age 27.5 months. One technique would be to enter this value into the equation of the line of best fit.

height = 0.635 age + 64.928 = 0.635(27.5) + 64.928 = 82.3905.

We can use R as a simple calculator to perform this calculation.

> 0.635*27.5+64.928
[1] 82.3905

Thus, the line of best fit predicts an average height of approximately 82.4 centimeters at age 27.5 months.
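
Instead of retyping the rounded coefficients, the same estimate can be obtained from the fitted model with R's predict function. The sketch below is one way to do this; the names new_age and pred are our own. Note that predict expects the new value in a data frame whose column name matches the variable used in the model formula (here, age).

```r
# Re-enter the data and refit the model so that this example runs on its own.
age = 18:29
height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
res = lm(height ~ age)

# The column must be called "age" to match the formula height ~ age.
new_age = data.frame(age = 27.5)

# Predicted average height, in centimeters, at 27.5 months.
pred = predict(res, newdata = new_age)
pred
```

Because predict uses the unrounded coefficients, its result (about 82.39) differs very slightly from the hand calculation with the rounded values.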


We hope you enjoyed this introduction to the principles of Linear Regression in the R system. We encourage you to explore further. Use the commands ?plot, ?lm, and ?abline to learn more about producing scatterplots and performing linear regression.