http://how-to.linuxcareer.com/a-quick-gnu-r-tutorial-to-statistical-models-and-graphics. Moreover, R-squared is 0.78, meaning about 78% of the variance in the variable y is explained by the model.
Let us now run R in Linux from the location of the working directory simply by
1. IntroductionIn this quick GNU R tutorial to statistical models and graphics we will provide a simple linear regression example and learn how to perform such basic statistical analysis of data. This analysis will be accompanied by graphical examples, which will take us closer to producing plots and charts with GNU R. If you are not familiar with using R at all please have a look at the prerequisite tutorial: A quick GNU R tutorial to basic operations, functions and data structures.
2. Models and Formulas in RWe understand a model in statistics as a concise description of data. Such presentation of data is usually exhibited with a mathematical formula. R has its own way to represent relationships between variables. For instance, the following relationship y=c0+c1x1+c2x2+...+cnxn+r is in R written as
y~x1+x2+...+xn,which is a formula object.
3. Linear regression exampleLet us now provide a linear regression example for GNU R, which consists of two parts. In the first part of this example we will study a relationship between the financial index returns denominated in the US dollar and such returns denominated in the Canadian dollar. Additionally in the second part of the example we add one more variable to our analysis, which are returns of the index denominated in Euro.
3.1. Simple linear regressionDownload the example data file to your working directory: regression-example-gnu-r.csv
Let us now run R in Linux from the location of the working directory simply by
$ Rand read the data from our example data file:
> returns<-read .csv="" header="TRUE)</pre" regression-example-gnu-r.csv=""> You can see the names of the variables typing
>names(returns)  "USA" "CANADA" "GERMANY"It is time to define our statistical model and run linear regression. This can be done in the following few lines of code:
> y<-returns br="">> x1<-returns br="">> returns.lm<-lm formula="y~x1)</pre"> To display the summary of the regression analysis we execute the summary() function on the returned object returns.lm. That is,-read>
> summary(returns.lm) Call: lm(formula = y ~ x1) Residuals: Min 1Q Median 3Q Max -0.038044 -0.001622 0.000001 0.001631 0.050251 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.174e-05 3.862e-05 0.822 0.411 x1 9.275e-01 4.880e-03 190.062 <2e-16 br="">--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.003921 on 10332 degrees of freedom Multiple R-squared: 0.7776, Adjusted R-squared: 0.7776 F-statistic: 3.612e+04 on 1 and 10332 DF, p-value: < 2.2e-16 2e-16>This function outputs the above corresponding result. The estimated coefficients are here c0~3.174e-05 and c1 ~9.275e-01. The above p-values suggest that the estimated intercept c0 is not significantly different from zero, therefore it can be neglected. The second coefficient is significantly different from zero since the p-value<2e-16 .="" by:="" estimated="" is="" model="" our="" represented="" sub="" therefore="" x="" y="0.93">12e-16>-lm>-returns>-returns>
3.2. Multiple linear regressionLet us now add one more variable into our model and perform a multiple regression analysis. The question now is whether adding one more variable to our model produces a more reliable model.
> x2<-returns br="">> returns.lm<-lm formula="y~x1+x2)<br">> summary(returns.lm) Call: lm(formula = y ~ x1 + x2) Residuals: Min 1Q Median 3Q Max -0.0244426 -0.0016599 0.0000053 0.0016889 0.0259443 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.385e-05 3.035e-05 0.786 0.432 x1 6.736e-01 4.978e-03 135.307 <2e-16 br="">x2 3.026e-01 3.783e-03 80.001 <2e-16 br="">--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.003081 on 10331 degrees of freedom Multiple R-squared: 0.8627, Adjusted R-squared: 0.8626 F-statistic: 3.245e+04 on 2 and 10331 DF, p-value: < 2.2e-16 2e-16>2e-16>-lm>-returns>Above, we can see the result of the multiple regression analysis after adding the variable x2. This variable represents the returns of the financial index in Euro. We now obtain a more reliable model, since the adjusted R-squared is 0.86, which is greater then the value obtained before equal to 0.76. Note, that we compared the adjusted R-squared because it takes the number of values and the sample size into account. Again the intercept coefficient is not significant, therefore, the estimated model can be represented as: y=0.67x1+0.30x2.
Note also that we could have referred to our data vectors by their names, for instance
> lm(returns$USA~returns$CANADA) Call: lm(formula = returns$USA ~ returns$CANADA) Coefficients: (Intercept) returns$CANADA 3.174e-05 9.275e-01
4. GraphicsIn this section we will demonstrate how to use R for visualization of some properties in the data. We will illustrate figures obtained by such functions as plot(), boxplot(), hist(), qqnorm().
4.1. Scatter plotProbably the simplest of all graphs you can obtain with R is the scatter plot. To illustrate the relationship between the US dollar denomination of the financial index returns and the Canadian dollar denomination we use the function plot() as follows:
> plot(returns$USA, returns$CANADA)As a result of the execution of this function we obtain a scatter diagram as exhibited below
• ‘"p"’ for *p*oints
• ‘"l"’ for *l*ines
• ‘"b"’ for *b*oth
• ‘"c"’ for the lines part alone of ‘"b"’
• ‘"o"’ for both ‘*o*verplotted’
• ‘"h"’ for ‘*h*istogram’ like (or ‘high-density’) vertical lines
• ‘"s"’ for stair *s*teps
• ‘"S"’ for other type of *s*teps
• ‘"n"’ for no plotting
To overlay a regression line over the scatter diagram above we use the curve() function with the argument 'add' and 'col', which determines that the line should be added to the existing plot and the color of the line plotted, respectively.
> curve(0.93*x,-0.1,0.1,add=TRUE,col=2)Consequently, we obtain the following changes in our graph:
For more information on the function plot() or lines() use function help(), for instance
4.2. Box plotLet us now see how to use the boxplot() function to illustrate the data descriptive statistics. First, produce a summary of descriptive statistics for our data by the summary() function and then execute the boxplot() function for our returns:
> summary(returns) USA CANADA GERMANY Min. :-0.0928805 Min. :-0.0792810 Min. :-0.0901134 1st Qu.:-0.0036463 1st Qu.:-0.0038282 1st Qu.:-0.0046976 Median : 0.0005977 Median : 0.0005318 Median : 0.0005021 Mean : 0.0003897 Mean : 0.0003859 Mean : 0.0003499 3rd Qu.: 0.0046566 3rd Qu.: 0.0047591 3rd Qu.: 0.0056872 Max. : 0.0852364 Max. : 0.0752731 Max. : 0.0927688Note that the descriptive statistics are similar for all three vectors, therefore we can expect similar boxplots for all of the sets of financial returns. Now, execute the boxplot() function as follows
> boxplot(returns)As a result we obtain the following three boxplots.
4.3. HistogramIn this section we will take a look at histograms. The frequency histogram was already introduced in Introduction to GNU R on Linux Operating System. We will now produce the density histogram for normalized returns and compare it with the normal density curve.
Let us, first, normalize the returns of the index denominated in US dollars to obtain zero mean and variance equal to one in order to be able to compare the real data with the theoretical standard normal density function.
> retUS.norm<- br="" mean="" returns="" sqrt="" var="">> mean(retUS.norm)  -1.053152e-17 > var(retUS.norm)  1->Now, we produce the density histogram for such normalized returns and plot a standard normal density curve over such histogram. This can be achieved by the following R expression
> hist(retUS.norm,breaks=50,freq=FALSE) > curve(dnorm(x),-10,10,add=TRUE,col=2)
4.4. QQ-plotAnother useful graph in statistical analysis is the QQ-plot. The QQ-plot is a quantile quantile plot, which compares the quantiles of the empirical density to the quantiles of the theoretical density. If these match well we should see a straight line. Let us now compare the distribution of the residuals obtained by our regression analysis above. First, we will obtain a QQ-plot for the simple linear regression and then for the multiple linear regression. The type of the QQ-plot we will use is the normal QQ-plot, which means that the theoretical quantiles in the graph correspond to quantiles of the normal distribution.
The first plot corresponding to the simple linear regression residuals is obtained by the function qqnorm() in the following way:
> returns.lm<-lm br="" returns="">> qqnorm(returns.lm$residuals)-lm>The corresponding graph is displayed below:
> returns.lm<-lm br="" returns="">> qqnorm(returns.lm$residuals)-lm>This plot is displayed below: