Friday, 26 February 2016

Hierarchical Bayesian: Automatic Occam's Razor

There is a crisis of "reproducibility" right now in science (Prinz et al 2011; Simmons et al 2011; Begley et al 2012): conservative estimates are that more than 50% of scientific articles are false and cannot be reproduced. Bad estimates are rampant, and false-positives are the norm! This new 'digitial dark age' has many culprits. One contributor is, I think, a misuse and misunderstanding of the dominant statistical paradigm called "Frequentism". Frequentism is a great intellectual achievement, and has such lofty ideals as 'objectivity', a.k.a., let the data speak for itself. Let's not inject our own hopes and desires into our scientific research. Sounds great doesn't it? Why would we want otherwise?

Well, for one thing, researchers (especially dolphin researchers) rarely have enough data to use Frequentist tools. Thus, we violate the core principle of Frequentism, for which its name is derived:

In frequentism, it is the observed data that is the random variable. Had we repeated the experiment in a parallel universe, we would get different data. We don't care about the actual data values (like individual dolphin's measurements); instead, we care about the true population-level process which gives rise to this data. We try to make inferences about this true process by sampling from the data distribution, and then try to generalize over the entire data distribution.

The core principle is that the Frequentist is trying to use a small sample to characterize the entire data distribution. This requires a lot of data. And when we lack such data (because, heck, dolphins are difficult to find!), Frequentist estimates start to take on rediculous values. For example, I can't tell you how many times a dolphin researcher using a Mark-Recapture model has estimated a survival of 100%. I am serious! There is a big problem with our supposedly 'objective' statistics when they lead us to believe that dolphins are immortal. But here is the catch: my knowledge that dolphins age and die is a type of "prior information", and for me to use this prior biological knowledge leads us to another rival paradigm called Bayesianism.

I won't get into the philosophy of Bayesianism (partially because I don't quite understand it). The best I can do is imagine a Bayesian as a poker player: they don't consider the data in their hand (e.g., their current cards) to be random: they consider it a given, and they want to make the best predictions possible about the next cards to be drawn, conditional on the evidence in their hand. Seems reasonable enough, but it turns out this is in direct opposition to Frequentism.

Parsimony and Hierarchical Bayes


But, one nice feature about Bayesianism is that we can encode into our statistical models an old ideal from Occam: "parsimony". Basically, we say "Don't tell me a dolphin is immortal unless you have a lot of data!" Technically, we can use "hyper-prior" distributions on parameters like survival to force simplicity in lieu of strong evidence, but allow for more complex models if there is a lot of data and larger sample size. This is demonstrated nicely in the following animation.


Look at the red Maximum Likelihood Estimates on the Right (a Frequentist tool). They are estimating a probability p  from a series of 0-1 outcomes (like, the number of dolphins captured from N available), and this is done in parallel during 10 different "groups" (like 10 independent experiments). Notice that the ML Estimates  stay the same over a wide range of data/evidence (even though the "true" mean is around 0.5), and they vary from p=0.2 to p=1. However, the Hierarchical Bayesian posteriors (in purple) behave very differently. When the sample size is small (5 observations per group), each group is shrunk to the overall mean at 0.5. It is only as the sample size increases that each group is allowed to take on their own individual values; and eventually, with a lot of data, each groups' posterior expectation becomes nearly the same as the MLE.

This is a type of "model parsimony": with only a little bit of data, our model is simple (all groups have the same estimate, and the estimate is the group mean at 0.5). With more data, our model becomes more complex (all groups share a mean, but they can also have individual variation, each with their own parameter p).

The abuse of MLE is nicely illustrated with the following experiment: Imagine you want to estimate the fairness of a coin, but flip it only 3 times. Let's say you observed TTT (a rare event, but not impossible). The Maximum Likelihood Estimate would be that the probability of a tail is 100%, and the coin is definitely not fair. Of course, we know this is a terrible estimate and a stupid example, because we have solid prior information about the behaviour of coins. However, the sad fact is that many Dolphin Mark-Recapture researchers are making similarly stupid experiments, under small sample sizes and overally complex Frequentist models which estimate of 100% detection probability or 100% survival or 0% movement. These terrible estimates are polluting our scientific literature.

The MLE's are indeed the most 'objective', but they make bad predictions for new data and give us rediculous results under low-sample sizes. In contrast, a little bit of skepticism encoded in a Bayesian prior (like Beta(1,1)) can temper our estimates. Returning to the coin flipping example, the posterior would say that the probability of a tail is just 80%, given the observed sequence TTT. That's still not the true 0.5, but it is a better prediction than the frequentist's 100%

Unified Under Hierarchical Bayes


The poor performance of MLE was noticed by non-statisticians a long time ago in the Machine Learning and predictive analytics communities. They obsessed over "regularization": finding ways to constrain the complexity of models, and deliberately bias the estimates towards simplicity. Techniques like the LASSO, Ridge Regression, Elastic Net, Boosting, AIC optimization, Bayesian Model averaging, etc, were devised to make better predictions on new data, rather than philosophical notions of probability and objectivity.

The eerie thing is that most of these techniques have a Bayesian interpretation. The famous Lasso is just a Bayesian model with a Double Exponential prior on the regression coefficients. The idea is explored in an accessible article by Hooten and Hoobs 2015.

Subjective Pseudo Bayesian and the Degradation of Science


If ecologists are willing to inject a little bit of bias into their analyses in order to get better predictions and sensible results, then what of objectivity? What is to stop one from 'cooking the books'? Talking about cavalier 'pseudo'-Bayesian analysis, the Bayesian heavy-weight James Berger wrote in 2004:
Statistics owes its central presence in science and life to the facts that (i) it is enormously useful for prediction; (ii) it is viewed as providing an objective validation for science. Discarding the latter would lead to a considerable downgrading of the value placed on statistics.

At the time of his warning, few non-statisticians were using subjective Bayesian models, and even then, they seemed to want "objective" reference priors which conveyed no prior information, and yielded nearly identical results as Frequentist models. It is a bit ironic that the opposite seems to have happened: the crisis of reproducibility have been unveiled, and it was under the watch of the so-called objective school of statistics, and not from subjectively biased Bayesian models. Now that the tables of turned, and more and more researchers are turning to Bayesian models, Berger's 2004 warning sits, waiting to be realized. The solution is probably everyone adopting a prediction mentality. Dolphin ecologists have already sort of adopted this practise by using the AIC for model selection, but have yet to grapple with its philosophical underpinnings, nor rival ideas.

Hifime UH1 DAC under Linux and Jack

The Hifime UH1 USB DAC and Headphone Amp combo has some great features like:
- reference-quality DAC chip: Texas Instruments PCM5102 (a flagship chip used in other >$1000 DAC's)
- Samples rates from 44.1, 48, 88.2, 96, 176.4, 192, 384kHz, and bit rates 16, 24 and 32.
- optional headphone amplifier (TPA6120, which I couldn't get much information about), as well as a direct line-out just from the DAC itself. This a great feature for someone who may have to drive high-impedance headphones, but stills wants the flexibility of using other neutral amplifiers.
- Asynchronous USB transfer with two TCXO as master clock
- under $200 USB, ships from Hong Kong (not from the USA-based HifimeDIY)

What's really nice is that Hifime explicitly mentions support for Linux (without drivers!), a major selling point for me as I don't want to play around with kernel patches, etc.

Using with GNU/Linux JACK, Pulseaudio, and ALSA.
The DAC shows up as HiFimeDIY SA9227 USB Audio. For most linux distributions, one will have to contend with Pulseaudio to get any sound out of the DAC. The easiet way is to use 'pavucontrol' (lightweight Pulseaudio GUI control) to explicitly change the 'Audio Interfaces'.

Unfortunately, Pulseaudio automatically down-samples audio. So, for bit perfect audio, one needs to use a music player with ALSA support. I use gmusicbrowser, which is highly customizable (reminiscent of my only Win7 program Foobar2000). Click on Main (menu) -> Settings->Audio (Tab)-> Output device: set to "Alsa"; Click "advanced options" and enter "hw:Audio" as the Alsa option. Viola! Enjoy audio at 192000Hz and 24bit!

- See more at: http://hifimediy.com/hifime-uh1#sthash.twSvxSOI.dpuf

Tuesday, 23 February 2016

Bayesian Mark-Recapture: Parsimony and Pollock's Closed Robust Design

At the Bienniel Marine Mammal Conference 2016, I presented a Hierachical Bayesian version of Pollock's Closed Robust Design (see presentation here). This was motivated by some bad behaviour of Program Mark in estimating detection probabilities and migration parameters for "sparse data" -- I say sparse, but for my sample sizes with dolphins, failure is all too common. In both real data, and through an extensive simulation exercise, I found that Mark was highly prone to "boundary-level" MLEs for survival and migration and sometimes detection probabilities, leading to erratic abundance estimates.

Interestingly, mildy weak Bayesian priors (Beta(1,1)) help smooth over this bad behaviour ("Objective" priors do not).

The image to the right shows simulations of the PCRD, comparing Maximum Likelihood estimation versus a Bayesian model... notice the infinite CI's and boundary level MLEs for Program MARK! (See the full article link below)

Hierarchical Bayesian PCRD
The world is going Bayesian, but few people understand why. Here, the true power of Bayesian Inference is revealed in a hierarchical specification for the survival, detection and migration parameters: in particular, we shrink between "time-invariant theta(.)" models and "time-varying theta(t)" models. I found that applying a particular type of shrinkage-inducing hyperprior, called the "Scaled Half Student-t distribution", lead to somewhat similar estimates as "model-averaging" over many MLE-based fixed effect models (run in Mark).

Hierarchical Bayes provides a type of automatic "Occam's Razor" (Berger 2004), in that when there is sparse/weak data, the hyprprior shrinks the model to a simple model, but when there is a lot of evidence in the data or a large sample size, the hyperprior is easily overwhelmed and the model takes on a lot of complexity. Reassuringly, the final estimates seem insensitive to exact hyper-parameter values. As shown in the following animation...



I decided to upload an R & JAGS tutorial on github: https://github.com/faraway1nspace/PCRD_JAGS_demo Users can run simple fixed effects Bayesian PCRD models, or they can run a Hierarchical Bayesian version that includes: i) Individual Heterogeneity in detection probabilities; and ii) recruitment processes; and iii) uses a hyperprior to apply a type of "multimodel" inference. All things seem to unify in HB.

JAGS code: Fixed effects Bayesian PCRD
Here JAGS code to run the simple fixed model of the PCRD: phi(.)gamma'(.)gamma''(t)p(t,s). This is just for interested readers. Serious users should consult the github demonstration
. This implements the PCRD as a Bayesian HMM. Anyone who uses this JAGS code should cite: "Rankin RW, Nicholson K, Allen SJ, Krützen M, Bejder L, Pollock KH. 2016. A full-capture Hierarchical Bayesian model of Pollock's Closed Robust Design and application to dolphins. Frontiers in Marine Science, in Press. DOI: 10.3389/fmars.2016.00025". The article is available here

model{
# priors
phi ~ dbeta(1,1) # apparent survival probability (adjust for ecological
gamma1 ~ dbeta(1.2,1,2) # temp migration: probabilty stays out conditional on being out
# loop through primary periods: parameters for detection probability pd_t, and gamma2_t
for(t_ in 1:T){ # T primary periods...
pd_a[t_] ~ dgamma(3,2) # hyperprior on detect probs
pd_b[t_] ~ dgamma(3,2) # hyperprior on detect probs
for(tt_ in 1:T2[t_]){ # loop through secondary periods...
pd[t_,tt_] ~ dbeta(pd_a[t_],pd_b[t_]) # secondard-period-level detection probs
}
gamma2[t_] ~ dbeta(1,1) # temp migration: prob migrates outside conditional on being inside
# recruitment process from 'eigenvector decomposition'
lambda[1,t_] <- (1-gamma1)/(gamma2[t_]-gamma1+1) # recruitment ratio, or long-term prob of being inside
lambda[2,t_] <- 1-lambda[1,t_] #
psi[t_] ~ dunif(0,1) # inclusion probability
# trmat: transition matrix for Markovian latent-states
# 1 =not yet in population;2=dead;3=offsite;4=onsite (only observable state)
# transition are from the column --> rows
# trmat[row,column,time] = [state at time=t_; state at time t_-1; time=t_]
trmat[1,1,t_] <- 1-psi[t_] # excluded from pop
trmat[2,1,t_] <- 0 # dead
trmat[3,1,t_] <- psi[t_]*lambda[2,t_] # inclusion into pop, outside study are
trmat[4,1,t_] <- psi[t_]*lambda[1,t_] # inclusion into pop, inside study area
trmat[1,2,t_]<- 0
trmat[2,2,t_]<- 1 # stay dead
trmat[3,2,t_]<- 0
trmat[4,2,t_]<- 0
trmat[1,3,t_]<- 0
trmat[2,3,t_]<- 1-phi # dies outside
trmat[3,3,t_]<- gamma1*phi # stays outside | outside
trmat[4,3,t_]<- (1-gamma1)*phi # reenters study area | outside
trmat[1,4,t_]<- 0 #
trmat[2,4,t_]<- 1-phi # dies inside
trmat[3,4,t_]<- gamma2[t_]*phi # leaves study area | inside
trmat[4,4,t_]<- (1-gamma2[t_])*phi # stays inside | inside
} # t_
# likelihood: loop through M individuals, both real and pseudo-individuals
for (i in 1:M){
#draw latent state at primary period 1:
# ... by definition, everyone starts in z=1 (not-in-population) at time=0
z[i,1]~ dcat(trmat[1:4,1,1]) # first z strictly excluded from pop
# likelihood for first primary period
for(tt_ in 1:T2[1]){ # loop through secondary periods
# Bernouilli process, conditional on z=4, otherwise no observation
y[i,tt_,1] ~ dbern(pd[1,tt_]*equals(z[i,1],4))
}
alive_i[i,1] <- step(z[i,1]-3) # count if i is alive or not
Nin_i[i,1] <- equals(z[i,1],4) # count if i is within study area
# loop through primary periods after 1st primary periods
for(t_ in 2:T){
# state process: draw z(t_) conditional on z(t_-1)
z[i,t_] ~ dcat(trmat[1:4, z[i,t_-1] , t_])
# likelihood: loop through secondary period observations
for(tt_ in 1:T2[t_]){
# Bernouilli observation error, conditional on z=4
y[i,tt_,t_] ~ dbern(pd[t_,tt_]*equals(z[i,t_],4))
}
# tally population size
alive_i[i,t_] <- step(z[i,t_]-3) # check alive or not
Nin_i[i,t_] <- equals(z[i,t_],4) # count if i is within study area
} # t_
} # i
# estimate population size per primary periods
for(t_ in 1:T){
alive[t_] <- sum(alive_i[,t_]) # number alive
Nin[t_] <- sum(Nin_i[,t_]) # number in study area
} # t_
}


JAGS code: Hierarchical Bayesian PCRD
Here is JAGS code for a Hierarchical Bayesian Model, phi(t)gamma'(t)gamma''(t)p(t,s). The Hierarchical Part uses 'hyperpriors' to deliberately shrink time-varying specifications, like phi(t), to a simpler time-invariant specifications, phi(.). The result is a type of "model parsimony", with a similar type of smoothing as "model averaging". The following is just a visual demo for the casual reader; serious users should refer to the github page and download the full R code. Anyone who uses this JAGS code should cite: "Rankin RW, Nicholson K, Allen SJ, Krützen M, Bejder L, Pollock KH. 2016. A full-capture Hierarchical Bayesian model of Pollock's Closed Robust Design and application to dolphins. Frontiers in Marine Science, in Press. DOI:10.3389/fmars.2016.00025". The article is available here
model{
#hyperpriors: logit(theta_mu)~Normal and theta_sd ~ half-t
phi.mu ~ dunif(pr.phiunif[1],pr.phiunif[2]) #mean survival with a Uniform prior
sigma.phi ~ dt(0,pr.tauphi[1],pr.tauphi[2]) T(0,) #mean survival dispersion, half-t hyperprior
g1.mu ~ dnorm(0,pr.g1mu) #mean gamma1, temp migration out | out
sigma.g1~dt(0,pr.taug1[1],pr.taug1[2]) T(0,) #mean gamma1 dispersion, half-t hyperprior
g2.mu ~ dnorm(0,pr.g2mu) #mean gamma2, temp migration out | in
sigma.g2~dt(0,pr.taug2[1],pr.taug2[2]) T(0,) #mean gamma2 dispersion, half-t hyperprior
pd.mu ~ dnorm(0,pr.pdmu) #mean detection prob, overall
sigma.pdmu~dt(0,pr.taupdmu[1],pr.taupdmu[2]) T(0,) #primary period detection prob dispersion
sigma.pd2nd~dt(0,pr.taupd2nd[1],pr.taupd2nd[2]) T(0,) #secondary periods detection prob dispersion
sigma.eps ~ dt(0,pr.taueps[1],pr.taueps[2]) T(0,) #individual detection prob dispersion
#time-variant parameters
for(t_ in 1:T){ #loop through primary periods
pd_mu[t_]~dnorm(pd.mu,pow(sigma.pdmu,-2)) #primary period mean detaction prob (logit)
lgamma1[t_]~dnorm(g1.mu,pow(sigma.g1,-2)) #prob migrate out|out (logit)
gamma1[t_] <- ilogit(lgamma1[t_]) #prob migrate out|out (probability)
lgamma2[t_]~dnorm(g2.mu,pow(sigma.g2,-2)) #prob migrate out|in (logit)
gamma2[t_] <- ilogit(lgamma2[t_]) #prob migrate out|in (probability)
#RECRUITMENT: psi is the 'inclusion probability' and lambda is the 'recruitment ratio'
psi[t_]~dunif(0,1) #inclusion probability
lambda[t_] <- (1-gamma1[t_])/(gamma2[t_]-gamma1[t_]+1) #long-term probability inside study area
#NOTE, lambda could also be a random variable with a beta prior
#secondary-period detection probabilities
for(tt_ in 1:T2[t_]){ #loop through secondary periods
pd[t_,tt_] ~ dnorm(pd_mu[t_],pow(sigma.pd2nd,-2))
} #tt_
} #t_
#first state transition (pure nusance; strictly from outside-pop to part of marked-pop)
trmat0[1] <- (1-psi[1]) #remains not-yet-in-pop
trmat0[2] <- 0
trmat0[3] <- psi[1]*(1-lambda[1]) #inclusion into pop, goes outside study are
trmat0[4] <- psi[1]*lambda[1] #inclusion into pop, goes inside study
#state transitions (2:T)
for(t_ in 1:(T-1)){
lphi[t_]~dnorm(log(phi.mu/(1-phi.mu)), pow(sigma.phi,-2)) #survival prob (logit)
phi[t_]<-ilogit(lphi[t_])
#state transitions
#trmat: transition matrix for Markovian latent-states
#1 =not yet in population; 2=dead;3=offsite;4=onsite (only observable state)
#transition are from the column --> rows
#trmat[row,column,time] = [state at time=t_; state at time t_-1; time=t_]
#notice that the primary periods are offset by 1 (because we already dealt with T=1)
trmat[1,1,t_]<- 1-psi[t_+1] #excluded from pop
trmat[2,1,t_] <- 0 #dead
trmat[3,1,t_] <- psi[t_+1]*(1-lambda[t_+1]) #inclusion into pop,outside study
trmat[4,1,t_] <- psi[t_+1]*lambda[t_+1] #inclusion into pop,inside study
trmat[1,2,t_]<- 0
trmat[2,2,t_]<- 1 #stay dead
trmat[3,2,t_]<- 0
trmat[4,2,t_]<- 0
trmat[1,3,t_]<- 0
trmat[2,3,t_]<- 1-phi[t_] #dies outside
trmat[3,3,t_]<- gamma1[t_+1]*phi[t_] #stays outside | outside
trmat[4,3,t_]<- (1-gamma1[t_+1])*phi[t_] #reenters study area | outside
trmat[1,4,t_]<- 0 #
trmat[2,4,t_]<- 1-phi[t_] #dies inside
trmat[3,4,t_]<- gamma2[t_+1]*phi[t_] #leaves study area | inside
trmat[4,4,t_]<- (1-gamma2[t_+1])*phi[t_] #stays inside | inside
} #t_
#loop through M individuals
for (i in 1:M){
#state transitions and likelihood for the first primary period
z[i,1]~ dcat(trmat0) #z at time 0 is strictly 'not-yet-in-pop'
alive_i[i,1] <- step(z[i,1]-3) #count if i is alive or not
Nin_i[i,1] <- equals(z[i,1],4) #count if i is within study area
eps_i[i] ~ dnorm(0,pow(sigma.eps,-2)) #random effects at individual levels
#Observation error y[i,tt_,t_] ~ Bernoulli conditional on being inside z=4
for(tt_ in 1:T2[1]){ #loop through secondary periods
y[i,tt_,1] ~ dbern(equals(z[i,1],4)/(1+exp(-pd[1,tt_]-eps_i[i]))) #inverse-logit transform
}
#state transition and likelihood for primary periods 2:T
for(t_ in 2:T){
#State process: draw z(t_) conditional on z(t_-1)
z[i,t_] ~ dcat(trmat[1:4, z[i,t_-1] , t_-1])
#Observation error y[i,tt_,t_] ~ Bernoulli condition on being inside z=4
for(tt_ in 1:T2[t_]){ #loop through secondary periods
y[i,tt_,t_] ~ dbern(equals(z[i,t_],4)/(1+exp(-pd[t_,tt_]-eps_[i]))) #inverse-logit transform
}
#check whether i individual is alive and inside
alive_i[i,t_] <- step(z[i,t_]-3) #check alive
Nin_i[i,t_] <- equals(z[i,t_],4) #count if i is within study area
} #t_
} #i
#tally population size
for(t_ in 1:T){
alive[t_] <- sum(alive_i[,t_]) #number alive
Nin[t_] <- sum(Nin_i[,t_]) #number in study area
} #t_
} #end model