## Linear Regression in R

In this activity we will explore the relationship between a pair of variables. We will first learn to create a scatterplot for the given data, and then learn how to fit a "Line of Best Fit" to our plot.

### Scatterplots

The data set in the table that follows is taken from The Data and Story Library. Researchers measured the heights of 161 children in Kalama, a village in Egypt. The heights were averaged and recorded each month, with the study lasting several years.

Mean Height versus Age

Age in Months | Average Height in Centimeters |
---|---|
18 | 76.1 |
19 | 77 |
20 | 78.1 |
21 | 78.2 |
22 | 78.8 |
23 | 79.7 |
24 | 79.9 |
25 | 81.1 |
26 | 81.2 |
27 | 81.8 |
28 | 82.8 |
29 | 83.5 |

To build our scatterplot, we must first enter the data in R. This is a fairly straightforward task. Because the ages increase in steps of one month, we can use the command **age=18:29**, which uses R's **start:finish** syntax to create a vector that begins at 18 and increases by 1 until 29 is reached.

> age=18:29

We can view the result by typing **age** at R's prompt.

> age
[1] 18 19 20 21 22 23 24 25 26 27 28 29
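As a small aside, the colon operator is only one of several ways to build such a vector. The sketch below, which is not part of the original activity, compares it with R's **seq** function. The names **age_seq** and **age_len** are made up for illustration.

```r
# "<-" is the usual R assignment operator; "=" as used in the text also works.
age <- 18:29                                   # colon operator: start:finish

# Hypothetical names, used only for this comparison.
age_seq <- seq(18, 29, by = 1)                 # seq() with an explicit step
age_len <- seq(18, by = 1, length.out = 12)    # seq() with a fixed length

# Each comparison is TRUE when the two vectors hold the same values.
all(age == age_seq)
all(age == age_len)
```

The **seq** form is the one to reach for when the step is not 1, for example ages recorded every three months.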

Entering the average heights is a bit more tedious, but straightforward.

> height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)

Again, we can view the result by entering **height** at R's prompt.

> height
[1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8 83.5

We can check that **age** and **height** have the same number of elements with R's **length** command.

> length(age)
[1] 12
> length(height)
[1] 12
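If you prefer to keep the paired measurements together, one option is to store both vectors in a data frame. This is only a sketch. The name **kalama** is our own choice and is not used elsewhere in this activity.

```r
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)

# Stop with an error if the two vectors do not pair up one-to-one.
stopifnot(length(age) == length(height))

# "kalama" is a hypothetical name for the combined data set.
kalama <- data.frame(age = age, height = height)

nrow(kalama)      # number of (age, height) pairs
head(kalama, 3)   # first three rows of the table
```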

It is now a simple matter to produce a *scatterplot* of **height** versus **age**.

> plot(age,height)

The result of this command is shown in Figure 1.

Figure 1. A scatterplot of height versus age data.
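By default, **plot** labels the axes with the variable names. A minimal sketch of a more descriptive version follows, assuming you want units on the axes. The label text and the plotting symbol are our own choices, not part of the original figure.

```r
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)

# xlab, ylab, and main set the axis labels and the title;
# pch = 19 draws solid circles instead of the default open ones.
plot(age, height,
     xlab = "Age (months)",
     ylab = "Average height (cm)",
     main = "Mean Height versus Age",
     pch = 19)
```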

### The Line of Best Fit

Note that the data in Figure 1 is approximately linear. As the age increases, the average height increases at an approximately constant rate. One could not fit a single line through each and every data point, but one could imagine a line that is fairly close to each data point, with some of the data points appearing above the line, others below for balance. In this next activity, we will calculate and plot the Line of Best Fit, or the *Least Squares Regression Line*.

We will use R's **lm** command to compute a "linear model" that fits the data in Figure 1. The **lm** command has a host of options (type **?lm** to view a full description), but in its simplest form it is quite easy to use. The syntax **height~age** is called a *model formula* and can be read as "height is modeled as a function of age." It is a very flexible R construct, and we are using its simplest form here. The symbol separating "height" and "age" in the syntax **height~age** is a "tilde." On a standard US keyboard it is located on the key to the immediate left of the #1 key, and you must hold the Shift key to type it.

> res=lm(height~age)

Let's examine what is returned in the variable **res**.

> res
Call:
lm(formula = height ~ age)
Coefficients:
(Intercept)          age
     64.928        0.635

Note the "Coefficients" part of the contents of **res**. These coefficients are the *intercept* and *slope* of the line of best fit. Essentially, we are being told that the equation of the line of best fit is:

height = 0.635 age + 64.928.

Note that this result has the form *y = m x + b*, where *m* is the slope and *b* is the intercept of the line.
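If you want the slope and intercept as numbers rather than as printed text, R's **coef** function extracts them from the fitted model. The sketch below also recomputes them from the standard least squares formulas as a check. The names **m_manual** and **b_manual** are hypothetical and appear only in this example.

```r
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)

res <- lm(height ~ age)

# coef() returns a named vector: "(Intercept)" and "age".
b <- coef(res)[["(Intercept)"]]   # intercept of the line
m <- coef(res)[["age"]]           # slope of the line

# Least squares formulas for a single predictor:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   intercept = mean(y) - slope * mean(x)
m_manual <- sum((age - mean(age)) * (height - mean(height))) /
  sum((age - mean(age))^2)
b_manual <- mean(height) - m_manual * mean(age)

round(c(intercept = b, slope = m), 3)
all.equal(c(b, m), c(b_manual, m_manual))   # TRUE when both routes agree
```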

It is a simple matter to superimpose the "Line of Best Fit" on the existing scatterplot. The command **abline** will use the coefficients stored in **res** to draw the line.

> abline(res)

The result of this command is the "Line of Best Fit" shown in Figure 2.

Figure 2. The **abline** command superimposes the "Line of Best Fit" on our previous scatterplot.
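The idea that some points sit above the line and others below "for balance" can be checked numerically with the *residuals*, the observed heights minus the heights given by the line. The following is a minimal sketch using R's **residuals** function. The variable name **r** is our own.

```r
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)

res <- lm(height ~ age)

# One residual per data point: positive means the point lies above the line,
# negative means it lies below.
r <- residuals(res)

sum(r > 0)   # how many points lie above the line
sum(r < 0)   # how many points lie below the line
sum(r)       # zero up to floating-point rounding for a line with an intercept
```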

The command **abline** is a versatile tool. You can learn more about this command by entering **?abline** and reading the resulting help file.
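As an illustration of that versatility, the sketch below draws the fitted line in a few different ways. The colors, line widths, and line types are arbitrary choices made for this example.

```r
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)

res <- lm(height ~ age)

plot(age, height)

# Draw the fitted line directly from the model object.
abline(res, col = "blue", lwd = 2)

# Draw the same line from an explicit intercept (a) and slope (b).
abline(a = coef(res)[1], b = coef(res)[2], lty = 2)

# Horizontal (h) and vertical (v) reference lines at the two means.
# A least squares line with an intercept passes through their crossing point.
abline(h = mean(height), lty = 3)
abline(v = mean(age), lty = 3)
```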

### Prediction

Now that we have the equation of the line of best fit, we can use the equation to make predictions. Suppose, for example, that we wished to estimate the average height of a child at age 27.5 months. One technique would be to enter this value into the equation of the line of best fit.

height = 0.635 age + 64.928 = 0.635(27.5) + 64.928 = 82.3905.

We can use R as a simple calculator to perform this calculation.

> 0.635*27.5+64.928
[1] 82.3905

Thus, the predicted average height at age 27.5 months is approximately 82.4 centimeters.
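Rather than retyping rounded coefficients, you can also ask R to evaluate the fitted model at a new age with the **predict** function. This is a sketch of one way to do it. The names **new_ages** and **pred** are hypothetical.

```r
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)

res <- lm(height ~ age)

# newdata must be a data frame whose column name matches the predictor ("age").
new_ages <- data.frame(age = 27.5)

# predict() uses the full-precision coefficients stored in res.
pred <- predict(res, newdata = new_ages)
pred
```

Because **predict** works with the unrounded coefficients, its answer agrees with the hand calculation to about two decimal places rather than digit for digit. Several ages can be predicted at once by putting them all in the **age** column of the data frame.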

### Enjoy!

We hope you enjoyed this introduction to the principles of *Linear Regression* in the R system. We encourage you to explore further. Use the commands **?plot**, **?lm**, and **?abline** to learn more about producing scatterplots and performing linear regression.