Data Frames in R

In this activity we introduce R's dataframes. Technically, a dataframe in R is a type of object. Less formally, a dataframe is a table in which, typically, the rows are observations and the columns are variables. Take, for example, the Mean Height versus Age data from the activity Linear Regression in R.

Mean Height versus Age
Age in Months    Average Height in Centimeters
18               76.1
19               77
20               78.1
21               78.2
22               78.8
23               79.7
24               79.9
25               81.1
26               81.2
27               81.8
28               82.8
29               83.5

In the activity Linear Regression in R, we entered the age in a vector called age.

> age=18:29
> age
[1] 18 19 20 21 22 23 24 25 26 27 28 29

In similar fashion, we entered the average heights in a vector called height.

> height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
> height
[1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8 83.5

We will now use R's data.frame command to create our first dataframe and store the results in the variable village.

> village=data.frame(age=age,height=height)

Let's examine the result of this command.

> village
   age height
1   18   76.1
2   19   77.0
3   20   78.1
4   21   78.2
5   22   78.8
6   23   79.7
7   24   79.9
8   25   81.1
9   26   81.2
10  27   81.8
11  28   82.8
12  29   83.5

The result is fairly clear. The contents of the vector age are stored in the first column of the dataframe under the heading "age" and the contents of the vector height are stored in the second column of the dataframe under the heading "height."
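As a quick check on the "rows as observations, columns as variables" idea, the following sketch inspects the shape of the dataframe. It rebuilds village so that it runs on its own; the names dims, n_obs, and n_var are illustrative only, and the sketch uses <-, which behaves the same as = for these assignments.

```r
# Rebuild the dataframe from the text so this block is self-contained.
village <- data.frame(
  age    = 18:29,
  height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
)

dims  <- dim(village)   # number of rows, then number of columns
n_obs <- nrow(village)  # one row per observation
n_var <- ncol(village)  # one column per variable

str(village)            # compact summary: column names, types, first values
head(village, 3)        # print only the first three rows
```

For a dataframe this small, printing village directly is just as easy, but str and head become more useful as the number of rows grows.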

Examining R's Workspace

Let's examine the contents of our workspace with R's ls() command (which stands for "list the contents of the workspace"). The contents of your workspace might vary, depending on whether you have created variables other than those introduced in this activity.

> ls()
[1] "age"     "height"  "village"

Note that there are three objects in our workspace. We can examine the "class" of each object in our workspace, which tells us what kind of object is contained in each variable.

> class(age)
[1] "integer"
> class(height)
[1] "numeric"
> class(village)
[1] "data.frame"

Thus we see that the vector age has class "integer." This should be clear, as it was created with the command age=18:29, and each of these numbers is an integer. On the other hand, the average heights were numbers such as 76.1. These are not integers but floating point numbers, so it is not surprising that the class of the object height is "numeric."

It is interesting, though not surprising (after all, village was created with the data.frame command), that the class of the object village is "data.frame." This is a new type of object that we will find quite useful.
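The class checks above can also be made on every column at once. The sketch below is one possible way to do so, using sapply to apply class to each column; the variable names are illustrative and the dataframe is rebuilt so the block stands alone.

```r
# Rebuild the dataframe from the text so this block is self-contained.
village <- data.frame(
  age    = 18:29,
  height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
)

col_classes <- sapply(village, class)  # class of each column, named by column
print(col_classes)

is_df <- is.data.frame(village)        # TRUE for a dataframe

# The colon operator yields integers, while c() with plain numbers
# yields floating point values, which R reports as "numeric".
class_colon <- class(18:29)
class_c     <- class(c(18, 19))
```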

Before we begin to explore dataframes in more depth, let's clean up our workspace a bit. Let's "remove" the age and height objects from our workspace.

> remove(age,height)

If you now try to print the contents of age or height, you will see that they are gone.

> age
Error: object "age" not found
> height
Error: object "height" not found

Indeed, if you "list" the contents of your workspace, you will see that the age and height objects are gone. However, the dataframe village remains.

> ls()
[1] "village"
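Rather than triggering an error to confirm that an object is gone, one can ask R directly with exists. The following is a minimal sketch of that idea; the names age_exists, height_exists, and village_exists are illustrative, and inherits = FALSE restricts the check to the current environment.

```r
# Recreate the three objects from the text.
age     <- 18:29
height  <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
village <- data.frame(age = age, height = height)

# Remove the two vectors; the dataframe keeps its own copy of the data.
remove(age, height)

age_exists     <- exists("age", inherits = FALSE)
height_exists  <- exists("height", inherits = FALSE)
village_exists <- exists("village", inherits = FALSE)

print(c(age = age_exists, height = height_exists, village = village_exists))
print(village$age)  # the column is still stored inside the dataframe
```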

Accessing the Variables in the Dataframe

We will now explain how to access the variables contained in our data frame. In this case, we know that each column contains a variable.

> village
   age height
1   18   76.1
2   19   77.0
3   20   78.1
4   21   78.2
5   22   78.8
6   23   79.7
7   24   79.9
8   25   81.1
9   26   81.2
10  27   81.8
11  28   82.8
12  29   83.5

The first column contains age and the second height. But we can quickly ascertain the names of the columns with R's names command. This would be useful if we were using someone else's dataframe and we were not sure of the column headings.

> names(village)
[1] "age"    "height"

OK, as expected. But how do we access the data in each column? One way is to state the variable containing the dataframe, followed by a dollar sign, then the name of the column we wish to access. For example, if we wanted to access the data in the "age" column, we would do the following:

> village$age
 [1] 18 19 20 21 22 23 24 25 26 27 28 29
 

In a similar fashion, we can access the values in the "height" column.

> village$height
 [1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8
[12] 83.5
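The dollar sign is only one of several ways to reach into a dataframe. As a sketch of the alternatives, the block below uses double brackets to select a column by name or position, and [row, column] indexing to select individual observations. The variable names are illustrative.

```r
# Rebuild the dataframe from the text so this block is self-contained.
village <- data.frame(
  age    = 18:29,
  height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
)

ages_dollar <- village$age       # dollar notation, as in the text
ages_name   <- village[["age"]]  # same column, selected by name
heights_pos <- village[[2]]      # second column, selected by position

row3 <- village[3, ]             # third observation, as a one-row dataframe
h3   <- village[3, "height"]     # height entry of the third observation

print(row3)
print(h3)
```

The double-bracket form is convenient when the column name is stored in another variable, a situation where the dollar notation does not apply.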

However, the additional typing required by the "dollar sign" notation can quickly become tiresome, so R provides the ability to "attach" the variables in the dataframe to our workspace.

> attach(village)

Let's re-examine our workspace.

> ls()
[1] "village"

No evidence of the variables in the workspace. However, R has placed the dataframe on its search path, so that the columns can be found by name, and most importantly, we can access them without the "dollar notation."

> age
 [1] 18 19 20 21 22 23 24 25 26 27 28 29
> height
 [1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8
[12] 83.5

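It is important to understand that these behave as "copies" of the columns of the data frame: assigning to one of them creates a new object in the workspace and leaves the dataframe itself untouched. Suppose that we make a change to one of the entries of age.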

> age[1]=12
> age
 [1] 12 19 20 21 22 23 24 25 26 27 28 29

Note, however, that the data in the dataframe village remains unchanged.

> village
   age height
1   18   76.1
2   19   77.0
3   20   78.1
4   21   78.2
5   22   78.8
6   23   79.7
7   24   79.9
8   25   81.1
9   26   81.2
10  27   81.8
11  28   82.8
12  29   83.5
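The following sketch traces what happens behind the scenes in the steps above. It assumes a fresh session in which no object named age exists in the workspace; the names in_workspace, modified_first, original_first, and restored_first are illustrative.

```r
# Rebuild the dataframe from the text so this block is self-contained.
village <- data.frame(
  age    = 18:29,
  height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
)

attach(village)

# Assigning to age creates a NEW object named age in the workspace.
# It hides (masks) the attached column but does not alter village.
age[1] <- 12
in_workspace   <- exists("age", inherits = FALSE)  # the new object is in the workspace
modified_first <- age[1]                           # value from the workspace copy
original_first <- village$age[1]                   # the dataframe still holds 18

# Removing the workspace copy makes age refer to the attached column again.
remove(age)
restored_first <- age[1]

detach(village)

print(c(modified = modified_first, original = original_first,
        restored = restored_first))
```

Because the modified object hides the attached column, later commands that mention age would use the altered values until that object is removed.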

Revisiting the Line of Best Fit

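If you changed age[1] in the previous section, first run remove(age) to delete the modified object from your workspace, so that age again refers to the column of village. Because we've "attached" the dataframe village, we can then execute the plot command as we did in the Linear Regression activity.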

> plot(age,height)

This command produces the plot shown in Figure 1.

Figure 1. A scatterplot of height versus age data.

And we can once again use the linear model command to produce the line of best fit.

> res=lm(height~age)
> res

Call:
lm(formula = height ~ age)

Coefficients:
(Intercept)          age  
     64.928        0.635  

Again, the coefficients provide the intercept and slope of the line of best fit.

height = 0.635 age + 64.928.
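Instead of reading the intercept and slope off the printed output, they can be extracted from the fitted model with coef. The sketch below shows one way to do this. It passes the dataframe through the data argument so that it runs without attaching anything, and the names b, intercept, and slope are illustrative.

```r
# Rebuild the dataframe from the text so this block is self-contained.
village <- data.frame(
  age    = 18:29,
  height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
)

res <- lm(height ~ age, data = village)

b         <- coef(res)            # named vector with "(Intercept)" and "age"
intercept <- b[["(Intercept)"]]
slope     <- b[["age"]]

# Rounded to three decimal places, these match the printed coefficients.
print(round(c(intercept = intercept, slope = slope), 3))
```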

The command abline uses the data in res to draw the "Line of Best Fit."

> abline(res)

The result of this command is the "Line of Best Fit" shown in Figure 2.

Figure 2. The abline command superimposes the "Line of Best Fit" on our previous scatterplot.

Prediction

We can use the equation of the line of best fit to make predictions. Suppose, for example, that we wished to estimate the average height of a child at age 27.5 months. One technique would be to enter this value into the equation of the line of best fit.

height = 0.635 age + 64.928 = 0.635(27.5) + 64.928 = 82.3905.

We can use R as a simple calculator to perform this calculation.

> 0.635*27.5+64.928
[1] 82.3905

However, it is more efficient to use dataframes and R's predict command to predict the average height when the age is 27.5 months.

> predict(res,data.frame(age=27.5))
[1] 82.38986

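Thus, the predicted average height at age 27.5 months is approximately 82.39 centimeters. The small difference between 82.3905 and 82.38986 arises because the hand calculation used coefficients rounded to three decimal places, while predict uses their full precision.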
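Since predict accepts a dataframe, several ages can be predicted in one call. The following sketch does so for two ages within the range of the data and compares the result with the line equation evaluated using the unrounded coefficients. The names new_ages, pred, and by_hand are illustrative.

```r
# Rebuild the dataframe and the model so this block is self-contained.
village <- data.frame(
  age    = 18:29,
  height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
)
res <- lm(height ~ age, data = village)

# The column name must match the predictor used in the model formula.
new_ages <- data.frame(age = c(20.5, 27.5))
pred     <- predict(res, newdata = new_ages)

# Same calculation by hand, using the full-precision coefficients.
b       <- coef(res)
by_hand <- b[["(Intercept)"]] + b[["age"]] * new_ages$age

print(pred)
print(by_hand)
```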

Detach

When you no longer need the data in the dataframe village, it is good practice to "detach" the dataframe.

> detach(village)

Note that "age" and "height" are no longer available for use (provided that no objects with those names exist in your workspace).

> age
Error: object "age" not found
> height
Error: object "height" not found
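One way to confirm whether a dataframe is currently attached is to look at R's search path, which search returns as a character vector. The sketch below checks the path before and after detaching; the names attached_before and attached_after are illustrative.

```r
# Rebuild the dataframe from the text so this block is self-contained.
village <- data.frame(
  age    = 18:29,
  height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
)

attach(village)
attached_before <- "village" %in% search()  # village appears on the search path

detach(village)
attached_after <- "village" %in% search()   # and is gone after detaching

print(c(before = attached_before, after = attached_after))
```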

Advanced Use of Dataframes

Many R commands are able to handle dataframes without attaching them. For example, if you give the plot command a dataframe with two columns, by default it will plot the second column versus the first. The following command will produce the image shown in Figure 1.

> plot(village)

We can get the linear model in this case without attaching the dataframe village, simply by passing the dataframe village as an argument to the lm command.

> res=lm(height~age,data=village)
> res

Call:
lm(formula = height ~ age, data = village)

Coefficients:
(Intercept)          age  
     64.928        0.635  

The abline command will produce the line of best fit shown in Figure 2.

> abline(res)

The single command abline(lm(height~age,data=village)) will also produce the line of best fit shown in Figure 2. Thus, if you are in a hurry, two commands are enough to produce the scatterplot and line of best fit.

> plot(village)
> abline(lm(height~age,data=village))

Enjoy!

We hope you enjoyed this introduction to the use of dataframes in the R system. We encourage you to explore their use further. We will certainly do so in upcoming activities.