Saturday, February 9, 2013

A quick GNU R tutorial to statistical models and graphics

http://how-to.linuxcareer.com/a-quick-gnu-r-tutorial-to-statistical-models-and-graphics


1. Introduction

In this quick GNU R tutorial to statistical models and graphics we will provide a simple linear regression example and learn how to perform such basic statistical analysis of data. This analysis will be accompanied by graphical examples, which will take us closer to producing plots and charts with GNU R. If you are not familiar with using R at all please have a look at the prerequisite tutorial: A quick GNU R tutorial to basic operations, functions and data structures.

2. Models and Formulas in R

We understand a model in statistics as a concise description of data. Such presentation of data is usually exhibited with a mathematical formula. R has its own way to represent relationships between variables. For instance, the following relationship y=c0+c1x1+c2x2+...+cnxn+r is in R written as
y~x1+x2+...+xn,
which is a formula object.

3. Linear regression example

Let us now provide a linear regression example for GNU R, which consists of two parts. In the first part of this example we will study a relationship between the financial index returns denominated in the US dollar and such returns denominated in the Canadian dollar. Additionally in the second part of the example we add one more variable to our analysis, which are returns of the index denominated in Euro.

3.1. Simple linear regression

Download the example data file to your working directory: regression-example-gnu-r.csv
Let us now run R in Linux from the location of the working directory simply by
$ R
and read the data from our example data file:
> returns<-read .csv="" header="TRUE)</pre" regression-example-gnu-r.csv="">
You can see the names of the variables typing
>names(returns)
[1] "USA"     "CANADA"  "GERMANY"
It is time to define our statistical model and run linear regression. This can be done in the following few lines of code:
> y<-returns br="">> x1<-returns br="">> returns.lm<-lm formula="y~x1)</pre">
To display the summary of the regression analysis we execute the summary() function on the returned object returns.lm. That is,
> summary(returns.lm)

Call:
lm(formula = y ~ x1)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.038044 -0.001622  0.000001  0.001631  0.050251 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.174e-05  3.862e-05   0.822    0.411    
x1          9.275e-01  4.880e-03 190.062   <2e-16 br="">---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.003921 on 10332 degrees of freedom
Multiple R-squared: 0.7776,    Adjusted R-squared: 0.7776 
F-statistic: 3.612e+04 on 1 and 10332 DF,  p-value: < 2.2e-16 
This function outputs the above corresponding result. The estimated coefficients are here c0~3.174e-05 and c1 ~9.275e-01. The above p-values suggest that the estimated intercept c0 is not significantly different from zero, therefore it can be neglected. The second coefficient is significantly different from zero since the p-value<2e-16 .="" by:="" estimated="" is="" model="" our="" represented="" sub="" therefore="" x="" y="0.93">1
. Moreover, R-squared is 0.78, meaning about 78% of the variance in the variable y is explained by the model.

3.2. Multiple linear regression

Let us now add one more variable into our model and perform a multiple regression analysis. The question now is whether adding one more variable to our model produces a more reliable model.
> x2<-returns br="">> returns.lm<-lm formula="y~x1+x2)<br">> summary(returns.lm)

Call:
lm(formula = y ~ x1 + x2)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0244426 -0.0016599  0.0000053  0.0016889  0.0259443 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.385e-05  3.035e-05   0.786    0.432    
x1          6.736e-01  4.978e-03 135.307   <2e-16 br="">x2          3.026e-01  3.783e-03  80.001   <2e-16 br="">---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.003081 on 10331 degrees of freedom
Multiple R-squared: 0.8627,    Adjusted R-squared: 0.8626 
F-statistic: 3.245e+04 on 2 and 10331 DF,  p-value: < 2.2e-16 
Above, we can see the result of the multiple regression analysis after adding the variable x2. This variable represents the returns of the financial index in Euro. We now obtain a more reliable model, since the adjusted R-squared is 0.86, which is greater then the value obtained before equal to 0.76. Note, that we compared the adjusted R-squared because it takes the number of values and the sample size into account. Again the intercept coefficient is not significant, therefore, the estimated model can be represented as: y=0.67x1+0.30x2.
Linux Career Jobs and AdsNote also that we could have referred to our data vectors by their names, for instance
> lm(returns$USA~returns$CANADA)

Call:
lm(formula = returns$USA ~ returns$CANADA)

Coefficients:
   (Intercept)  returns$CANADA  
     3.174e-05       9.275e-01  

4. Graphics

In this section we will demonstrate how to use R for visualization of some properties in the data. We will illustrate figures obtained by such functions as plot(), boxplot(), hist(), qqnorm().

4.1. Scatter plot

Probably the simplest of all graphs you can obtain with R is the scatter plot. To illustrate the relationship between the US dollar denomination of the financial index returns and the Canadian dollar denomination we use the function plot() as follows:
> plot(returns$USA, returns$CANADA)
As a result of the execution of this function we obtain a scatter diagram as exhibited below
example scatter plot GNU R
One of the most important arguments you can pass to the function plot() is ‘type’. It determines what type of plot should be drawn.  Possible types are:
                • ‘"p"’ for *p*oints
                • ‘"l"’ for *l*ines
                • ‘"b"’ for *b*oth
                • ‘"c"’ for the lines part alone of ‘"b"’
                • ‘"o"’ for both ‘*o*verplotted’
                • ‘"h"’ for ‘*h*istogram’ like (or ‘high-density’) vertical lines
                • ‘"s"’ for stair *s*teps
                • ‘"S"’ for other type of *s*teps
                • ‘"n"’ for no plotting
To overlay a regression line over the scatter diagram above we use the curve() function with the argument 'add' and 'col', which determines that the line should be added to the existing plot and the color of the line plotted, respectively.
> curve(0.93*x,-0.1,0.1,add=TRUE,col=2)
Consequently, we obtain the following changes in our graph:
scatter graph with regression line gnu R

For more information on the function plot() or lines() use function help(), for instance
>help(plot)

4.2. Box plot

Let us now see how to use the boxplot() function to illustrate the data descriptive statistics. First, produce a summary of descriptive statistics for our data by the summary() function and then execute the boxplot() function for our returns:
> summary(returns)
      USA                 CANADA              GERMANY          
 Min.   :-0.0928805   Min.   :-0.0792810   Min.   :-0.0901134  
 1st Qu.:-0.0036463   1st Qu.:-0.0038282   1st Qu.:-0.0046976  
 Median : 0.0005977   Median : 0.0005318   Median : 0.0005021  
 Mean   : 0.0003897   Mean   : 0.0003859   Mean   : 0.0003499  
 3rd Qu.: 0.0046566   3rd Qu.: 0.0047591   3rd Qu.: 0.0056872  
 Max.   : 0.0852364   Max.   : 0.0752731   Max.   : 0.0927688 
Note that the descriptive statistics are similar for all three vectors, therefore we can expect similar boxplots for all of the sets of financial returns. Now, execute the boxplot() function as follows
> boxplot(returns)
As a result we obtain the following three boxplots.
boxplots example gnu r

4.3. Histogram

In this section we will take a look at histograms. The frequency histogram was already introduced in Introduction to GNU R on Linux Operating System. We will now produce the density histogram for normalized returns and compare it with the normal density curve.
Let us, first, normalize the returns of the index denominated in US dollars to obtain zero mean and variance equal to one in order to be able to compare the real data with the theoretical standard normal density function.
> retUS.norm<- br="" mean="" returns="" sqrt="" var="">> mean(retUS.norm)
[1] -1.053152e-17
> var(retUS.norm)
[1] 1
Now, we produce the density histogram for such normalized returns and plot a standard normal density curve over such histogram. This can be achieved by the following R expression
> hist(retUS.norm,breaks=50,freq=FALSE)
> curve(dnorm(x),-10,10,add=TRUE,col=2)
density histogram gnu r
Visually, the normal curve does not fit the data well. A different distribution may be more suitable for financial returns. We will learn how to fit a distribution to the data in later articles. At the moment we can conclude that the more suitable distribution will be more picked in the middle and will have heavier tails.

4.4. QQ-plot

Another useful graph in statistical analysis is the QQ-plot. The QQ-plot is a quantile quantile plot, which compares the quantiles of the empirical density to the quantiles of the theoretical density. If these match well we should see a straight line.  Let us now compare the distribution of the residuals obtained by our regression analysis above. First, we will obtain a QQ-plot for the simple linear regression and then for the multiple linear regression. The type of the QQ-plot we will use is the normal QQ-plot, which means that the theoretical quantiles in the graph correspond to quantiles of the normal distribution.
The first plot corresponding to the simple linear regression residuals is obtained by the function qqnorm() in the following way:
> returns.lm<-lm br="" returns="">> qqnorm(returns.lm$residuals)
The corresponding graph is displayed below:
qq plot in gnu r 1
The second plot corresponds to the multiple linear regression residuals and is obtained as:
> returns.lm<-lm br="" returns="">> qqnorm(returns.lm$residuals)
This plot is displayed below:
qq plot in gnu r 2
Note that the second plot is closer to the straight line. This suggests that the residuals produced by the multiple regression analysis are closer to normally distributed. This supports further the second model as more useful over the first regression model.

5. Conclusion

In this article we have introduced the statistical modeling with GNU R on the example of linear regression. We have also discussed some frequently used in statistics graphs. I hope this has opened a door to statistical analysis with GNU R for you. We will, in later articles, discuss more complex applications of R for statistical modeling as well as programming so keep reading.

No comments:

Post a Comment