Sunday, 1 June 2014

Planned contrasts and ANCOVA in Excel / LibreOffice Calc

I just finished TA'ing Advanced Quantitative Methods in Psychology at Murdoch University, and I'd like to share an Excel/Calc file (note its an .ods) to help students visualize ANCOVA/regression. Users can tweak regression coefficients to minimize the Sum of Square Residuals and/or find the lines of best fit. For the math enthusiasts, it also demos the underlying matrix multiplication needed to solve the problem analytically. The Experiment: we are interested in whether treatments of sucrose (8%, 32%, 64%) effect rats' speed at completing a maze (DV) while controlling for the rats age (aka, the 'covariate' in ANCOVA).

One should never 'do stats' in excel (so I also provide R code below), but this is a great, quick way to visualize several statistics concepts that many non-staticians may not understand, including:
- how a mix of continuous and categorical variables can be visualized and different parallel lines, - how categorical variables are represented as indicator variables in the model matrix. - how Planned Constrasts determine the type of indicator columns (the default in the .ods file is coded using the 'first' treatment as the reference treatment, like in R, but users can do their own. Try the Helmert!)
- how to solve least squares regression coefficients via matrix multiplication

If I were a teacher in an intro stats course, I'd ask students to reproduce the above. One operation I couldn't figure out was how to calculate the S.E. of the regression coefficients: this requires more matrix operations than Excel offers.

# data: DV is speed to complete maze
# categorical variables: 3 sucrose treatments
# covariate: rate AGE
d<-data.frame(speed=c(1.4,2,3.2,1.4,2.3,4,5,4.7,6.2,3.1,3.2,4,4.5,6.4,4.4,4.1,5.8,6.6,6.5,5.9,5.9,3,5.9,5.6), trt=factor(rep(c(0,1,2),each=8),labels=c("sucrose8","sucrose32","sucrose64")), age=c(12,5,8,7,11,2,3,4,7,5,12,12,8,4,3,2,5,8,12,2,7,11,4,3))
# calculate regression coefficients and statistics manually
# then compare to worksheet and R
X <- model.matrix(~age+trt,data=d) # model matrix
betas<-solve(t(X)%*%X)%*%(t(X)%*%d$speed) # regression coefficients
SSm<-sum(((X%*%betas) - mean(d$speed))**2) # explained variation
SSr<- sum(((X%*%betas) - d$speed)**2) # residual sum of squares
df_m<-ncol(X)-1; df_r<-nrow(X)-ncol(X) # degrees of freedom
MSm<-SSm/df_m # Mean Square of Model
MSr <-SSr/df_r # means square error / estimate of variance
F.stat <- MSm/MSr

# standard error of regression coefficients<- sqrt(diag(solve(t(X)%*%X)*MSr)) # MSr is an estimate of sigma^2
betas.t <- betas/ # chance of seeing this large of effect ~ t distr
betas.p <- 2*(1-pt(q=abs(betas.t),df=df_r)) # p-values of regression coefs

# redo above, but with R's linear model function
m1 <- lm(speed~age+trt,d)
F.stat1<-summary(m1)$fstatistic # compare F.stat and F.stat1
coef(m1) # compare with betas
summary(m1)$coefficients # compare with,t,and p

# try with different Planned Constrasts:
# use Helmert contrast to test the hypothesis that 8% is different from
# 32% and 64% is different from both 8 and 32%
m.helm <- lm(speed~ age+trt, data=d, contrasts = list(trt = "contr.Helmert"))