Importing Data in R

In this activity we will learn how to import external data files into R. This activity builds upon the lessons learned in Dataframes in R. If you are not familiar with dataframes in R, we encourage you to first work through the activity Dataframes in R.

In the table that follows, we list average heights of children versus their ages. This data set was collected over a period of time in an Egyptian village and first explored in Linear Regression in R, then again in Dataframes in R.

Mean Height versus Age
Age (months)   Average height (cm)
18 76.1
19 77
20 78.1
21 78.2
22 78.8
23 79.7
24 79.9
25 81.1
26 81.2
27 81.8
28 82.8
29 83.5

In the activity Dataframes in R, we created vectors age and height.

> age=18:29
> height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)

We then used R's data.frame command to create a dataframe and stored the result in the variable village.

> village=data.frame(age=age,height=height)
> village
   age height
1   18   76.1
2   19   77.0
3   20   78.1
4   21   78.2
5   22   78.8
6   23   79.7
7   24   79.9
8   25   81.1
9   26   81.2
10  27   81.8
11  28   82.8
12  29   83.5

Managing the Workspace

We can examine the contents of our workspace with the command ls().

> ls()
[1] "age"     "height"  "village"

And we can delete the age and height vectors with the remove command.

> remove(age,height)
> ls()
[1] "village"

We can "attach" the dataframe stored in village, which allows easy access to its columns.

> attach(village)
> ls()
[1] "village"
> age
 [1] 18 19 20 21 22 23 24 25 26 27 28 29
> height
 [1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8 83.5
 

When we are through working with the dataframe, it is a good practice to "detach" the dataframe.

> detach(village)
> age
Error: object "age" not found
> height
Error: object "height" not found
> ls()
[1] "village"
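
As an alternative to attaching and detaching, the columns of a dataframe can be reached directly with the $ operator or from inside a with() call. The following is a minimal sketch: it rebuilds the village dataframe from the table above so that it runs on its own, and the names mean_height and tallest_age are ours, chosen only for illustration.

```r
# Rebuild the dataframe used in this activity.
village = data.frame(
  age = 18:29,
  height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
)

# The $ operator extracts one column without attaching the dataframe.
mean_height = mean(village$height)

# with() evaluates an expression using the columns of the dataframe;
# nothing is added to the search path, so there is nothing to detach.
tallest_age = with(village, age[which.max(height)])

print(mean_height)
print(tallest_age)
```

Because neither form changes the search path, there is no risk of forgetting to detach the dataframe afterwards.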

Finally, before learning how to import data in electronic format, delete the dataframe stored in village.

> remove(village)
> ls()
character(0)

Creating an External File

Open your favorite text editor. For example, you might use Notepad on Windows or TextEdit on the Mac. Enter the data exactly as follows.

age height
18   76.1
19   77.0
20   78.1
21   78.2
22   78.8
23   79.7
24   79.9
25   81.1
26   81.2
27   81.8
28   82.8
29   83.5

Save the file as "village.txt". Be sure to make note of the folder or directory where you save your file. In our case, we saved the file in:

/Users/darnold/Documents/MathDept/trunk/MathDept/html/R/village.txt

Caution: If you use Microsoft Word, you must save the file in text format. Do not use Word's proprietary doc format.
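
If you would rather not use a text editor, the same file can be written from within R. The sketch below is one possible way to do this with writeLines(). Note the assumption: it saves the file in R's temporary directory (tempdir()) rather than in the folder shown above, and the variable names lines and path are ours.

```r
# The same lines that were typed into the text editor above.
lines = c(
  "age height",
  "18   76.1",
  "19   77.0",
  "20   78.1",
  "21   78.2",
  "22   78.8",
  "23   79.7",
  "24   79.9",
  "25   81.1",
  "26   81.2",
  "27   81.8",
  "28   82.8",
  "29   83.5"
)

# Assumption: write to R's temporary directory, not the folder used in the text.
path = file.path(tempdir(), "village.txt")
writeLines(lines, path)

# Confirm that the file exists and count its lines (one header plus the data rows).
print(file.exists(path))
print(length(readLines(path)))
```

A file written this way is always plain text, so the caution about word-processor formats does not apply.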

The Current Working Directory

An important question arises. How will R find the file "village.txt"? Before we provide specific instructions, we must first address the concept of R's working directory. The current working directory is found with the getwd() command. The response will differ on your system.

> getwd()
[1] "/Users/darnold"

The response tells us that the current working directory is /Users/darnold.

We can change the current working directory with the command setwd(). Simply enter the desired destination as a string (delimited with quotes).

> setwd("/Users/darnold/Documents")

You can check the result of this command with the getwd() command.

> getwd()
[1] "/Users/darnold/Documents"
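
It is often convenient to return to the original working directory when you are done. The setwd() command returns the previous working directory invisibly, so it can be saved and restored later. The following is a small sketch of that idea; the temporary directory stands in for whatever folder you actually want to visit, and the name old_dir is ours.

```r
# setwd() returns the previous working directory (invisibly), so we can save it.
old_dir = setwd(tempdir())

# We are now in the temporary directory (an assumption made for this sketch).
print(getwd())

# Restore the original working directory when finished.
setwd(old_dir)
print(getwd())
```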

Alternative -- Using the GUI

Alternatively, we can use the menus provided by R's graphical interface on Windows and the Mac. For example, on the Mac the Misc menu contains entries for getting and changing the current working directory.

Importing Data from an External File

It's time to import the data in the file "village.txt". You will recall that we stored the file in:

/Users/darnold/Documents/MathDept/trunk/MathDept/html/R/village.txt

And the file had this format.

age height
18   76.1
19   77.0
20   78.1
21   78.2
22   78.8
23   79.7
24   79.9
25   81.1
26   81.2
27   81.8
28   82.8
29   83.5

Note that the first line of the file contains "headers", that is, "names" for each of the columns.

The R command read.table is used to read data from external sources and place the result in a data frame. By default, read.table reads data from a file where the data is separated by white space (one or more spaces, tabs, newlines, or carriage returns).

> village=read.table(file="/Users/darnold/Documents/MathDept/trunk/MathDept/html/R/village.txt",header=TRUE)

Displaying the dataframe confirms that the import succeeded.

> village
   age height
1   18   76.1
2   19   77.0
3   20   78.1
4   21   78.2
5   22   78.8
6   23   79.7
7   24   79.9
8   25   81.1
9   26   81.2
10  27   81.8
11  28   82.8
12  29   83.5
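
To see what the header=TRUE argument contributes, it can help to read the same text with and without it. The sketch below passes a shortened copy of the file contents through read.table's text argument, so no file is needed. The names raw, with_header, and no_header are ours, chosen for illustration.

```r
# A shortened copy of the file contents, supplied through the text argument.
raw = "age height
18   76.1
19   77.0
20   78.1"

with_header = read.table(text = raw, header = TRUE)
no_header = read.table(text = raw, header = FALSE)

# With header=TRUE the first line supplies the column names.
print(names(with_header))

# With header=FALSE R makes up column names and treats "age height" as a data row.
print(names(no_header))
print(nrow(no_header))

# str() shows the type of each column after the import.
str(with_header)
```

Running str() on a freshly imported dataframe is a quick way to check that numeric columns were really read as numbers.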

Trying to remember the path to your data file ("village.txt") can be frustrating, so there is an alternative approach, one that allows the user to browse their directory structure in the usual manner.

> village=read.table(file=file.choose(),header=TRUE)

The function file.choose() opens a dialog that lets the user browse their directory structure in the usual way. Locate the file, then click the Open button. As before, header=TRUE tells read.table that the first line of the file holds the column names. The result is identical to the first use of read.table above.

> village
   age height
1   18   76.1
2   19   77.0
3   20   78.1
4   21   78.2
5   22   78.8
6   23   79.7
7   24   79.9
8   25   81.1
9   26   81.2
10  27   81.8
11  28   82.8
12  29   83.5

Another approach involves changing the current working directory. Using the setwd() command or the Change Working Directory submenu of the Misc menu, change the working directory to where you stored the file "village.txt".

> setwd("/Users/darnold/Documents/MathDept/trunk/MathDept/html/R")
> getwd()
[1] "/Users/darnold/Documents/MathDept/trunk/MathDept/html/R"

The dir() command will reveal the files stored in this directory.

> dir()
 [1] "ImportingData.php"  "SampleDiscrete.php" "barplot.php"       
 [4] "boxplot.php"        "code"               "dataframe.php"     
 [7] "graphics"           "hist.php"           "index.php"         
[10] "pie.php"            "regression.php"     "simple.php"        
[13] "village.txt"    

Note that "village.txt" is among the files stored in this directory. Because we've changed the current directory to the folder containing the file "village.txt", we can greatly simplify the use of read.table to read the file and store the result in a dataframe.

> village=read.table(file="village.txt",header=TRUE)

The file argument of read.table no longer requires the long pathname we used above, and the result is identical to the previous results.

> village
   age height
1   18   76.1
2   19   77.0
3   20   78.1
4   21   78.2
5   22   78.8
6   23   79.7
7   24   79.9
8   25   81.1
9   26   81.2
10  27   81.8
11  28   82.8
12  29   83.5
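
A bare file name only works if the file really is in the current working directory, so it can be worth checking with file.exists() before reading. The following sketch illustrates the check. As an assumption made to keep it self-contained, it creates a two-row "village.txt" in R's temporary directory instead of using the folder shown above.

```r
# Assumption: for a self-contained sketch we create a short village.txt
# in R's temporary directory instead of the folder used in the text.
old_dir = setwd(tempdir())
writeLines(c("age height", "18 76.1", "19 77.0"), "village.txt")

# A bare file name is looked up in the current working directory.
if (file.exists("village.txt")) {
  village = read.table(file = "village.txt", header = TRUE)
  print(village)
} else {
  message("village.txt is not in ", getwd())
}

# Restore the original working directory.
setwd(old_dir)
```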

Importing Data from the Internet

Data can be stored in files in many different formats. This short tutorial does not cover all of the possibilities, but if the file is stored on the internet in the same form as we used to store data in the file "village.txt", then it is a simple matter to import that data into R.

For practice, we've stored the file "village.txt" at the following URL:

http://online.redwoods.edu/instruct/darnold/Math15/RData/village.txt

We read this data file into a dataframe as follows.

> village=read.table(file="http://online.redwoods.edu/instruct/darnold/Math15/RData/village.txt",header=TRUE)

The result is identical to previous readings of the file "village.txt".

> village
   age height
1   18   76.1
2   19   77.0
3   20   78.1
4   21   78.2
5   22   78.8
6   23   79.7
7   24   79.9
8   25   81.1
9   26   81.2
10  27   81.8
11  28   82.8
12  29   83.5

It is comforting that the syntax for importing the data from the internet is identical to the syntax for importing a file from the local computer system.
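
Reading from the internet can fail for reasons that have nothing to do with R, for instance when the network is down or the file has moved. One way to guard against this is to wrap read.table in tryCatch(). The sketch below is only an illustration: read_or_null is a hypothetical helper name, not part of R, and it is exercised here with a local path that does not exist rather than with a real URL.

```r
# read_or_null is a hypothetical helper, not part of R.
# It returns a dataframe on success and NULL if the location cannot be read.
read_or_null = function(location) {
  tryCatch(
    read.table(file = location, header = TRUE),
    error = function(e) {
      message("Could not read ", location, ": ", conditionMessage(e))
      NULL
    }
  )
}

# A location that does not exist yields NULL instead of stopping the script.
missing_file = file.path(tempdir(), "no-such-file.txt")
result = read_or_null(missing_file)
print(is.null(result))
```

The same helper could be given the URL above in place of missing_file; if the download succeeds, it returns the dataframe.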

One can now use the dataframe to perform a variety of analyses. For example, one might want a scatterplot of height versus age and a superimposed line of best fit.

> plot(village)
> abline(lm(height~age,data=village))

These commands were introduced in the activities Linear Regression in R and Dataframes in R. The resulting scatterplot and line of best fit are shown in Figure 1.

The abline command superimposes the "Line of Best Fit" on our previous scatterplot.

Figure 1. Superimposing the "Line of Best Fit" on a scatterplot.
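
The line drawn by abline comes from the model fitted by lm, and its intercept and slope can be inspected directly. The sketch below rebuilds the village dataframe so that it runs on its own. The name fit and the age of 30 months used for the prediction are ours, chosen only for illustration.

```r
# Rebuild the dataframe used in this activity.
village = data.frame(
  age = 18:29,
  height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
)

# Fit the same linear model that was passed to abline.
fit = lm(height ~ age, data = village)

# coef() returns the intercept and slope of the line of best fit.
print(coef(fit))

# predict() evaluates the fitted line at new ages.
# 30 months is a hypothetical value, just outside the observed range.
print(predict(fit, newdata = data.frame(age = 30)))
```

The slope reported by coef() is the average increase in height, in centimeters, for each additional month of age.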

Enjoy!

We hope you enjoyed this introduction to importing data into the R system. We encourage you to explore further. Use the command ?read.table to learn more about using this important command.