One should never 'do stats' in excel (so I also provide R code below), but this is a great, quick way to visualize several statistics concepts that many non-staticians may not understand, including:
- how a mix of continuous and categorical variables can be visualized and different parallel lines, - how categorical variables are represented as indicator variables in the model matrix. - how Planned Constrasts determine the type of indicator columns (the default in the .ods file is coded using the 'first' treatment as the reference treatment, like in R, but users can do their own. Try the Helmert!)
- how to solve least squares regression coefficients via matrix multiplication
If I were a teacher in an intro stats course, I'd ask students to reproduce the above. One operation I couldn't figure out was how to calculate the S.E. of the regression coefficients: this requires more matrix operations than Excel offers.
# ANCOVA in R
# data: DV is speed to complete maze
# categorical variables: 3 sucrose treatments
# covariate: rate AGE
library(car)
d<-data.frame(speed=c(1.4,2,3.2,1.4,2.3,4,5,4.7,6.2,3.1,3.2,4,4.5,6.4,4.4,4.1,5.8,6.6,6.5,5.9,5.9,3,5.9,5.6),
trt=factor(rep(c(0,1,2),each=8),labels=c("sucrose8","sucrose32","sucrose64")),
age=c(12,5,8,7,11,2,3,4,7,5,12,12,8,4,3,2,5,8,12,2,7,11,4,3))
# calculate regression coefficients and statistics manually
# then compare to worksheet and R
X <- model.matrix(~age+trt,data=d) # model matrix
betas<-solve(t(X)%*%X)%*%(t(X)%*%d$speed) # regression coefficients
SSm<-sum(((X%*%betas) - mean(d$speed))**2) # explained variation
SSr<- sum(((X%*%betas) - d$speed)**2) # residual sum of squares
df_m<-ncol(X)-1; df_r<-nrow(X)-ncol(X) # degrees of freedom
MSm<-SSm/df_m # Mean Square of Model
MSr <-SSr/df_r # means square error / estimate of variance
F.stat <- MSm/MSr
# standard error of regression coefficients
betas.se<- sqrt(diag(solve(t(X)%*%X)*MSr)) # MSr is an estimate of sigma^2
betas.t <- betas/betas.se # chance of seeing this large of effect ~ t distr
betas.p <- 2*(1-pt(q=abs(betas.t),df=df_r)) # p-values of regression coefs
# redo above, but with R's linear model function
m1 <- lm(speed~age+trt,d)
F.stat1<-summary(m1)$fstatistic # compare F.stat and F.stat1
coef(m1) # compare with betas
summary(m1)$coefficients # compare with betas.se,t,and p
# try with different Planned Constrasts:
# use Helmert contrast to test the hypothesis that 8% is different from
# 32% and 64% is different from both 8 and 32%
m.helm <- lm(speed~ age+trt, data=d, contrasts = list(trt = "contr.Helmert"))