Transforming Data in R
In the activity Linear Regression in R, we showed how to calculate and plot the "line of best fit" for a set of data. As a quick reminder, consider the normal average January minimum temperatures in 56 American cities, presented at the following URL:
http://lib.stat.cmu.edu/DASL/Datafiles/USTemperatures.html
This file is one of many data sets stored at the Data and Story Library. We've adapted the file to make it easier to import into R. You can download the adapted file at http://msenux.redwoods.edu/mathdept/R/USTemperatures.txt. Take note of the folder name where you save this file as USTemperatures.txt.
In the activity Importing Data in R, we showed one of the simplest methods to import a datafile into a dataframe in R.
> USTemps=read.table(file=file.choose(),header=TRUE)
The option file.choose() will pop open a dialog that allows the user to browse through their directory structure in an accustomed manner. Locate the file USTemperatures.txt in your directory structure, then click the Open button.
> USTemps=read.table(file=file.choose(),header=TRUE) > USTemps Temp Lat Long 1 44 31.2 88.5 2 38 32.9 86.8 3 35 33.6 112.5 4 31 35.4 92.8 5 47 34.3 118.7 ... ... ... 56 14 41.2 104.9
The first variable Temp contains the average January low temperature for a particular US city having latitude and longtitude stored in the second and third variables, Lat and Long, respectively.
We now calculate the equation of the line of best fit.
> res=lm(Temp~Lat,data=USTemps)
> res
Call:
lm(formula = Temp ~ Lat, data = USTemps)
Coefficients:
(Intercept) Lat
108.728 -2.110
Hence, the equation of the line of best fit is given by:
Temp = -2.110 Lat + 108.728.
It's a simple matter to produce a scatterplot and the line of best fit.
> plot(Temp~Lat,data=USTemps) > abline(res)
The above commands produce the scatterplot and the line of best fit in Figure 1.

Figure 1. Plot of temperature versus latitude and the line of best fit.
In Figure 1, we see a general downward linear trend (which explains the negative slope). As the latitude increases, we move northward from the equator (which has latitude zero) and the temperature gets colder.
The Power Function
A power function has the following form.
The Power Function
![]()
Figure 2. Note the location of the variable x in the power function.
It's important to note the location of the variable x in the power function in Figure 2. Later in this activity, we will contrast the power function with the exponential function, where the variable x is an exponent, rather than the base as in the equation in Figure 2.
Let's look at an example, choosing the power function y = 3x2. First, let's produce a plot.
> x=seq(1,5,0.5) > y=3*x^2 > plot(x,y)
The above sequence produces the plot shown in Figure 3.

Figure 3. An example of a power function.
It could be argued that the data is somewhat linear, showing a general upward trend. However, it is more likely (based on Figure 3) that the function is nonlinear, due to the general bend in curve shown in Figure 3.
Tranforming the Data
Let's take the logarithm of both sides of the power function y = 3x2. The base of the logarithm is irrelevant; however, log x is understood to be the natural logarithm (base e) in R. That is, log x = loge x in R.
log y = log 3x2
The log of a product is the sum of the logs, so we can write the following.
log y = log 3 + log x2
Another property of logs allows us to move the exponent down.
log y = log 3 + 2 log x
This last form shows that if we plot the log of y versus the log of x, the graph will be linear with slope 2 and intercept log 3.
> plot(log(x),log(y))
The above command produces the plot shown in Figure 4.

Figure 4. Plotting the log of y versus the log of x produces a line.
Note that the plot of log y versus log x is linear!
It's interesting to find the line of best fit for the transformed data.
> res=lm(log(y)~log(x))
> res
Call:
lm(formula = log(y) ~ log(x))
Coefficients:
(Intercept) log(x)
1.099 2.000
The slope is 2, which is the slope indicated in log y = log 3 + 2 log x. Supposedly, the intercept should be log 3.
> log(3) [1] 1.098612
This result agrees with the intercept found using R's lm command as reported in the variable res.
An important lesson has been learned.
Resting Metabolic Rate
Let's apply what we've learned about power functions to a concrete example. In the table that follows, the resting metabolic rates of several classs of primates are presented along with the mass of the primate class. The resting metabolic rate (RMR) is defined to be the amount of energy required by the body when the body is doing nothing. This data is taken from Leonard and Robinson (1997, AJPA) and repeated in and article by Marcus Hamilton, Model-Fitting with Linear Regression: Power Functions.
| Resting Metabolic Rate in Primates | ||
| Species | Weight | RMR |
|---|---|---|
| A. palliata | 8.5 | 363 |
| A. palliata | 6.4 | 293 |
| A. trivirgatus | 0.85 | 46 |
| A. geoffroyi | 8.41 | 346 |
| C. molloch | 0.7 | 54 |
| C. apella | 2.6 | 143 |
| C. albifrons | 2.4 | 135 |
| S. imperator | 0.4 | 35 |
| S. fusicollis | 0.3 | 28 |
| S. sciureus | 0.8 | 66 |
| C. albigena | 7.9 | 327 |
| C. guereza | 7 | 265 |
| M. fascicularis | 5.5 | 331 |
| P. anubis | 29.3 | 956 |
| P. anubis | 13 | 520 |
| H. lar | 6 | 292 |
| P. troglodytes | 39.5 | 1036 |
| P. troglodytes | 29.8 | 839 |
| P. pygmaeus | 83.6 | 1948 |
| P. pygmaeus | 37.8 | 1074 |
| S. syndactylus | 10.5 | 408 |
| !Kung | 46 | 1383 |
| !Kung | 41 | 1099 |
| Ache | 59.6 | 1591 |
| Ache | 51.8 | 1394 |
We've massaged the data and stored it in the file RMR.txt. Download the file and save it as RMR.txt. Take note of the folder in which you save the file. Read the data into R as follows:
> primates=read.table(file=file.choose(),header=TRUE) > primates Weight RMR 1 8.50 363 2 6.40 293 3 0.85 46 4 8.41 346 5 0.70 54 6 2.60 143 7 2.40 135 8 0.40 35 9 0.30 28 10 0.80 66 11 7.90 327 12 7.00 265 13 5.50 331 14 29.30 956 15 13.00 520 16 6.00 292 17 39.50 1036 18 29.80 839 19 83.60 1948 20 37.80 1074 21 10.50 408 22 46.00 1383 23 41.00 1099 24 59.60 1591 25 51.80 1394
Plot the data.
> plot(RMR~Weight,data=primates)
The above command produces the plot shown in Figure 5.

Figure 5. Plotting RMR versus Weight.
One could argue that the data has a general linear appearance. However, there is evidence in Figure 5 of a slightly concave down bend to the data, suggesting that we might use a power function (perhaps the square root function) to fit the data. This encourages us to try plot the logarithm of the RMR versus the logarithm of the Weight as follows.
> plot(log(RMR)~log(Weight),data=primates)
The above command produces the plot shown in Figure 6.

Figure 6. Plotting the logarithm of the RMR versus the logarithm of the Weight.
Aha! The plot in Figure 6 shows a definite linear trend. Therefore, our suspicion that a power function might fit the original data set is probably valid (there are actual statistical tests for this assumption which we will explore in later activities). Let's find the line of best fit for the transformed data in Figure 6.
> res=lm(log(RMR)~log(Weight),data=primates)
> res
Call:
lm(formula = log(RMR) ~ log(Weight), data = primates)
Coefficients:
(Intercept) log(Weight)
4.2409 0.7599
This result tells us that we can fit a linear model to the transformed data as follows:
log RMR = 4.2409 + 0.7599 log Weight
Exponentiate both sides of the above equation.
elog RMR = e4.2409 + 0.7599 log Weight
On the left, the exponential and the logarithm are inverses. On the right, we use a property of exponents.
RMR = e4.2409 e0.7599 log Weight
We use a property of logarithms to move the 0.7599 into the exponent.
RMR = e4.2409 elog Weight0.7599
We evaluate e4.2409.
> exp(4.2409) [1] 69.47035
This simplifies further our result.
RMR = 69.47035 elog Weight0.7599
Finally, we again use the fact that the logarithm and exponential are inverses.
RMR = 69.47035 Weight0.7599
Note that this final function has the form y = a xb, a power function that should "fit" the original data set.
Let's provide some visual verification of our model. First, find the range (min and max) of the Weight data.
> range(primates$Weight) [1] 0.3 83.6
Plot the original data, then superimpose a plot of the power function that we think will "fit" the data.
> plot(RMR~Weight,data=primates) > x=seq(0.3,83.6,0.1) > y=69.47035*x^0.7599 > lines(x,y,type="l",lwd=2,col="red")
These commands produce the scatterplot and the power function that "fits" the data shown in Figure 7.

Figure 7. The power function "fits" the original data.
The Exponential Function
An exponential function has the following form.
The Exponential Function
![]()
Figure 8. In the exponential function, the independent variable is the exponent.
In the power function, the independent variable was the base, as in y = a xb. In the exponential function, the independent variable is now an exponent, as in y = a bx. This is a subtle but important difference.
Let's look at an example of an exponential function, choosing the example y = 3(2x). First, let's produce a plot.
> x=seq(1,5,0.5) > y=3*2^x > plot(x,y)
The above sequence produces the plot shown in Figure 9.

Figure 9. An example of a exponential function.
The function in Figure 9 is clearly nonlinear.
Tranforming the Data
Let's take the logarithm of both sides of the exponential function y = 3(2x). The base of the logarithm is irrelevant; however, log x is understood to be the natural logarithm (base e) in R. That is, log x = loge x in R.
log y = log 3(2x)
The log of a product is the sum of the logs, so we can write the following.
log y = log 3 + log 2x
Another property of logs allows us to move the exponent down.
log y = log 3 + x log 2
This last form shows that if we plot the log of y versus x, the graph will be linear with slope log 2 and intercept log 3.
> plot(log(x),log(y))
The above command produces the plot shown in Figure 4.

Figure 4. Plotting the log of y versus x produces a line.
Note that the plot of log y versus x is linear!
It's interesting to find the line of best fit for the transformed data.
> res=lm(log(y)~x)
> res
Call:
lm(formula = log(y) ~ x)
Coefficients:
(Intercept) x
1.0986 0.6931
The slope is 0.6931, which agrees with the slope indicated in log y = log 3 + x log 2; that is, the slope is log 2.
> log(2) [1] 0.6931472
The intercept is 1.098, which agrees with the intercept indicated in log y = log 3 + x log 2; that is, the intercept is log 3.
> log(3) [1] 1.098612
This result agrees with the intercept found using R's lm command as reported in the variable res.
An important lesson has been learned.
Cell Phone Subscribers
Let's apply what we've learned to a concrete example. In the table that follows, the number of cell phone subscriptions (in millions) is presented as a function of the number of years that have passed since 1987.
| Cell Phone Subscribers | |
| t (years since 1987) | subscribers (in millions) |
|---|---|
| 1 | 1.6 |
| 2 | 2.7 |
| 3 | 4.4 |
| 4 | 6.4 |
| 5 | 8.9 |
| 6 | 13.1 |
| 7 | 19.3 |
| 8 | 28.2 |
| 9 | 38.2 |
| 10 | 48.7 |
Enter and plot the data.
> t=seq(1,10) > subscribers=c(1.6,2.7,4.4,6.4,8.9,13.1,19.3,28.2,38.2,48.7) > plot(t,subscribers)
These commands produce the scatterplot shown in Figure 11.

Figure 11. Cell phone subscribers versus years since 1987.
We suspect exponential growth. This leads us to plot the logarithm of the subscriber data versus the year since 1987.
> plot(t,log(subscribers)) > res=lm(log(subscribers)~t) > abline(res)
These commands produce the scatterplot and the line of best fit shown in Figure 12.

Figure 12. Plotting log of subscribers versus the years since 1987.
The data of Figure 12 certainly appear to be linear. Let's examine the model of the line of best fit.
> res
Call:
lm(formula = log(subscribers) ~ t)
Coefficients:
(Intercept) t
0.2630 0.3774
This leads us to the following model.
log subscribers = 0.2630 + 0.3774t
Exponentiate both sides of the last equation.
elog subscribers = e0.2630 + 0.3774t
On the left, the exponential and logarithm are inverses. On the right, the exponential of a sum is the product of the exponentials.
subscribers = e0.2630 e0.3774t
Finally, we can evaluate e0.2630.
> exp(0.2630) [1] 1.300827
This leads to the final form of the exponential model.
subscribers = 1.300827 e0.3774t
Let's provide some visual verification of our model.
> plot(t,subscribers) > t=seq(1,10,0.1) > y=1.300827*exp(0.3774*t) > lines(t,y,type="l",lwd=2,col="red")
The above commands will plot the original data, then superimpose a plot of the exponential function that we found to "fit" the data.

Figure 13. The exponential function 'fits' the original data.
Not bad!
Enjoy!
We hope you enjoyed this introduction to the principles of transforming data in the R system. In upcoming activities, we will discuss further what is meant by a "good fit."
