**Highlights**:

- as sample size decreases (heading down the graphic), both the KS-test and the Shapiro-Wilk loose their power to 'detect normality' (i.e., p values > 0.05)

- Shapiro-Wilk is more powerful than the more general KS-test.

- for QQ plots, one should focus on systematic trends along the inner-quantiles and ignore large deviations in the tails. The tails for the samples drawn from the Normal distribution (Right) can be wonky, but nonetheless the Shapiro-Wilk tests are not significant (as they shouldn't be).

- Subplotting in R is terrible (i.e., the nice histograms in the upper corners of the QQ plots): if you look at my code, most of the nastiest that detracts from the simple qqnorm and qqline functions is to make the inset histogram (on a positive note, I learned out to finely control where plots are positioned, rather than relying on the par(mfrow) command.

- Histograms are very difficult to interpret at low sample sizes. Herein lies the beauty of QQ-plots to show subtle patterns.

- I haven't shown a follow up simulation, but the KS-test is very sensitive to bimodal distributions of 2 mixtures of Normal distributions. The Shapiro-Wilk is not!

*Left, different samples from a Gamma distribution (shape=5,scale=5), and Right, draws from a Normal distribution with the same mean and variance as the draws from the Gamma distribution.*

Here is some R code to run the above simulation and get a feel for the QQ-plots and tests.

```
```

# SETUP SIMULATION PARAMETERS

N <- 100# sample size

shp_ <- 5; scale_ <- 5 # shape and scale parameters for Gamma distr

dat.skew <- as.numeric(scale(rgamma(n=N,shp_,scale_))) # random skewed data

dat.norm <- rnorm(N,mean(dat.skew),sqrt(var(dat.skew)))# random normal data

d <- list(skew=dat.skew, norm=dat.norm)

pltreg <- matrix(c(0,0.5,2/3,1, # x1,x2,y1,y2

0.5,1,2/3,1,

0,0.5,1/3,2/3,

0.5,1,1/3,2/3,

0,0.5,0,1/3,

0.5,1,0,1/3

),nrow=6,ncol=4,byrow=T)# arduous way to handle subplotting

# LOOP THROUGH DIFFERENT SAMPLE SIZES, PLOT 3x2

j <- 0 # counter to track which subplot we're on

invisible(par()) # make blank plot canvas

for(subn in c(1,0.3,0.15)){ # loop through different samples sizes

for(i in c("skew","norm")){ # loop through Gamma and Normal data

j <- j+1 # tracking plot region

sub.d <- sort(sample(d[[i]],round(N*subn))) # take subsample

par(fig=pltreg[j,],mar=c(4,4,1,1),new=TRUE*(j>1),mgp=c(1.2,0.5,0),bty="l")# pain in the ass way to position plots along the 3x2 canvas layout.

qqnorm(sub.d, datax=TRUE, col="forestgreen", pch=19,main=c(skew=paste("QQ (N=",round(N*subn)," from Gamma)",sep=""),norm=paste("QQ (N=",round(N*subn)," from Normal)",sep=""))[i]) # MAIN FUNCTION

qqline(sub.d,datax=TRUE) # 1:1 line

sw_p <- round(shapiro.test(sub.d)$p.value,3) # test of normality

ks_p <- round(ks.test(x=sub.d,y="pnorm")$p.value,3) # ''

text(0.8*diff(range(sub.d))+min(sub.d),y=0,labels=paste("N:",round(N*subn),"\nShapiro-Wilk:",sw_p,"\n","KS-test:",ks_p),pos=1)

par(fig=c(pltreg[j,c(1,1,4,4)]-c(0,-0.25,0.33/1.5,0)),new=TRUE)# setup inset histogram

hist(sub.d,col="grey",breaks=15,main="",xlab="",ylab="",xaxt="n",yaxt="n") # histogram

} # inner loop through two distributions

} # loop through sub-setting sample

*Made using R v3, ESS and EMACS24 on Bodhilinux 2.4*

## No comments:

## Post a Comment