LFS:Seminars/R

From UBC Wiki

The R Project for Statistical Computing

R Project Web Site

http://www.r-project.org

The Comprehensive R Archive Network.

http://cran.stat.sfu.ca

R Studio Web Site

http://www.rstudio.com

Video Resources

Online resource available on YouTube by Mike Marin.

https://www.youtube.com/user/marinstatlectures?feature=g-high-rec

Video series by the Google engineers.

http://www.youtube.com/playlist?list=PLOU2XLYxmsIK9qQfztXeybpHvru-TrqAP

What is R

 R is a programming language and software environment for statistical computing and graphics.
 Windows, MAC, and Linux platform.

Load packages

  Menu -> Packages -> Install Packages
  Menu -> Packages -> Load Packages

Examples

 >example(lm)
 >demo(persp)
 >demo(graphics)
 >demo(Hershery)
 >demo(plotmath)

Data in

 >x <- 1:12
 >dim(x) <- c(3,4)
 >x
 >Matrix(1:12, nrow=3, byrow=T)
 >t(x)
 >dd <- data.frame()
 >fix(dd)

Edit a table

 >aq <- edit(airquality)
 >fix(airquality)

Read from a text file

 >x <- read.table(“c:\\folder\\filename.txt”, header=T)
 Note:  header=T, the first row in the data file is the variable name

Read from Excel files

 Install xlsx package,
 >require(xlsx)
 >mydata <- read.xlsx(“h:\\aa.xls”)
 >mydata
 >fix(mydata)

Get help

 From menu
    Menu -> help
 >?read.table
 >help.search(“data input”)
 >find(“lowess”)
 >apropos(“lm”)

Statistical Operations

Simple statistic

 >x <- rnorm(50)
 >x
 >mean(x)
 >sd(x)
 >var(x)
 >median(x)
 >quantile(x)
 >hist(x)

Normal distribution

 >x <- seq(-4,4,0.1)
 >plot(x,dnorm(x),type=‘l’)
 >curve(dnorm(x),from=-4, to=4)

Binomial distribution

 >x <- 0:50
 >plot(x,dbinom(x,size=50,prob=0.33),type=‘h’)

Example

 >weight <- c(60,72,57,90,95,72)
 >height <- c(1.75,1.80,1.65,1.90,1.74,1.91)
 >bmi <- weight/height^2
 >sum(weight)
 >sum(weight)/length(weight)
 >mean(weight) 
 
 >xbar <- sum(weight)/length(weight)
 >sqrt(sum((weight-xbar)^2)/(length(weight)-1))
 >sd(weight)
Formular of SD
 SD = √(Σ(Xi –X)^2/(n-1))

T test

 >t.test(bmi,mu=22.5)

Calculate t-value

 >qt(0.95, df=5)

Plotting

 >plot(height, weight)
 >plot(height, weight, pch=2)
 >Lines(height, 22.5*height^2)

Regression

 Please Install and Load package ISwR

 >attach(thuesen)
 >y <- short.velocity
 >x <- blood.glucose
 >lm(y~x)
 >summary(lm(y~x))
 Short.velocity=1.098 + 0.022 * blood.glucose

Regression

 >beta<-sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))^2)
 >alfa <- mean(y)-beta*mean(x)
 >SSX<-sum(x^2)-(sum(x))^2/23
 >SSY<-sum(y^2)-(sum(y))^2/23
 >SSXY<-sum(x*y)-(sum(y)*sum(x))/23
 >SSR<-SSXY^2/SSX
 >SSE<- SSY-SSR
 >s2<-SSE/21                        #s2   error variance
 >seb <- sqrt(s2/SSX)               #seb  Standard Error of slope
 >sea <- sqrt(s2*sum(x^2)/23/SSX)   #sea  Standard Error of the intercept
 >summary(aov(lm(y~x)))


Regression(more example) and ANOVA

# Grouth ~ tannin
> reg <- read.table("s:\\temp\\tannin.txt",header=T)
> attach(reg)
> names(reg)
> plot(tannin,grouth,pch=16)
> lm(grouth~tannin)
> plot(tannin,grouth,pch=16)
> abline(lm(grouth~tannin))
> SSX<-sum(tannin^2)-sum(tannin)^2/length(tannin)
> SSX
> SSY<-sum(grouth^2)-sum(grouth)^2/length(grouth)
> SSY
> SSXY <- sum(tannin*grouth)-sum(tannin)*sum(grouth)/length(tannin)
> SSXY
> b <- SSXY/SSX
> b
> a<-sum(grouth)/length(grouth)-b*sum(tannin)/length(tannin)
> a
> SSR <- b*SSXY
> SSR
> SSE <- SSY-SSR
> SSE
> model<-lm(grouth~tannin)
> summary.aov(model)    # Anova
> summary(model)    #T test
> plot(model)

Correlation(example)

>cor(1:10,2:11)
>cor(1:10,11:2)

Correlation Matrix of Multivariable

>longley
>Ci <- cor(longley)

Graphical display

>Symnum(Ci)

Correlation

>attach(thuesen)
>thuesen
>cor(blood.glucose,short.velocity)
>cor(blood.glucose,short.velocity,use="complete.obs")
>cor.test(blood.glucose,short.velocity,use="complete.obs")

Google Visualization

 googleVis is an R package providing an interface between R and the Google Chart Tools to visualise data.
 The output of googleVis functions is html code that contains the data and references to JavaScript. To view the output a browser with Flash and Internet connection is required.

Examples:

 http://code.google.com/p/google-motion-charts-with-r/
 http://wmc.landfood.ubc.ca/webapp/VWM/course/global-food-challenges/introduction/

Principle Component Analysis

  *reduce dimantions 
  *create a new coordinate system for the data, the principal components being the unit vectors along the axes. The first principal component gives the direction of the maximum spread of the data. The second gives the direction of maximum spread perpendicular to the first direction.

Example 1

  Matrix: 100 X 2     
  data file :  marks.txt   
  >dat = read.table("marks.txt".head=T)   
  >dim(dat)   
  >names(dat)   
  >pc1 = princomp(~Stat + Phys, dat)   
  >pc1$loadings   
  >pc1   
  >names(pc1)   
  >pc1$scores

Example 2

 Matrix 46420 X 23     
 data file: quasar.txt   
 >quas = read.table("quasar.txt", head=T)   
 >dim(quas)   
 >names(quas)   
 >pc2 = princomp(quas[,-1], scores=T)   
 >pc2   
 >plot(pc2)   
 >screeplot(pc2)   
 >screeplot(pc2, type="lines")    
 >pc2$loadings[,1:2]