Friday, 26 February 2016

Hierarchical Bayesian: Automatic Occam's Razor

There is a crisis of "reproducibility" right now in science (Prinz et al 2011; Simmons et al 2011; Begley et al 2012): conservative estimates are that more than 50% of scientific articles report findings that are false and cannot be reproduced. Bad estimates are rampant, and false-positives are the norm! This new 'digital dark age' has many culprits. One contributor is, I think, a misuse and misunderstanding of the dominant statistical paradigm called "Frequentism". Frequentism is a great intellectual achievement, and it holds such lofty ideals as 'objectivity', a.k.a. letting the data speak for itself. Let's not inject our own hopes and desires into our scientific research. Sounds great, doesn't it? Why would we want otherwise?

Well, for one thing, researchers (especially dolphin researchers) rarely have enough data to use Frequentist tools. Thus, we violate the core principle of Frequentism, from which its name is derived:

In frequentism, it is the observed data that is the random variable. Had we repeated the experiment in a parallel universe, we would get different data. We don't care about the actual data values (like individual dolphin's measurements); instead, we care about the true population-level process which gives rise to this data. We try to make inferences about this true process by sampling from the data distribution, and then try to generalize over the entire data distribution.

The core principle is that the Frequentist is trying to use a small sample to characterize the entire data distribution. This requires a lot of data. And when we lack such data (because, heck, dolphins are difficult to find!), Frequentist estimates start to take on ridiculous values. For example, I can't tell you how many times a dolphin researcher using a Mark-Recapture model has estimated a survival probability of 100%. I am serious! There is a big problem with our supposedly 'objective' statistics when they lead us to believe that dolphins are immortal. But here is the catch: my knowledge that dolphins age and die is a type of "prior information", and using this prior biological knowledge leads us to a rival paradigm called Bayesianism.

I won't get into the philosophy of Bayesianism (partially because I don't quite understand it). The best I can do is imagine a Bayesian as a poker player: they don't consider the data in their hand (e.g., their current cards) to be random: they consider it a given, and they want to make the best predictions possible about the next cards to be drawn, conditional on the evidence in their hand. Seems reasonable enough, but it turns out this is in direct opposition to Frequentism.

Parsimony and Hierarchical Bayes


But, one nice feature about Bayesianism is that we can encode into our statistical models an old ideal from Occam: "parsimony". Basically, we say "Don't tell me a dolphin is immortal unless you have a lot of data!" Technically, we can use "hyper-prior" distributions on parameters like survival to force simplicity in the absence of strong evidence, while still allowing for more complex models when there is a lot of data and a larger sample size. This is demonstrated nicely in the following animation.


Look at the red Maximum Likelihood Estimates on the right (a Frequentist tool). They estimate a probability p from a series of 0-1 outcomes (like the number of dolphins captured from N available), done in parallel for 10 different "groups" (like 10 independent experiments). Notice that the ML Estimates stay the same over a wide range of data/evidence, varying from p=0.2 to p=1 (even though the "true" mean is around 0.5). However, the Hierarchical Bayesian posteriors (in purple) behave very differently. When the sample size is small (5 observations per group), each group is shrunk to the overall mean at 0.5. Only as the sample size increases is each group allowed to take on its own individual value; and eventually, with a lot of data, each group's posterior expectation becomes nearly the same as the MLE.

This is a type of "model parsimony": with only a little bit of data, our model is simple (all groups have the same estimate, and the estimate is the group mean at 0.5). With more data, our model becomes more complex (all groups share a mean, but they can also have individual variation, each with their own parameter p).
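This shrinkage behaviour can be sketched with a toy conjugate model (my own illustration, not the full hierarchical model behind the animation): a Beta(5,5) prior centred on the grand mean of 0.5 stands in for the hyperprior, and the posterior mean is a compromise between the prior mean and the group's own MLE.

```python
# A minimal sketch of shrinkage via a conjugate Beta-Binomial model.
# (Assumption: a fixed Beta(5,5) prior plays the role of the hyperprior;
# the blog's actual model estimates the hyperprior from the data.)
def posterior_mean(successes, n, a=5.0, b=5.0):
    """Posterior mean of p under a Beta(a,b) prior and Binomial(n) data."""
    return (a + successes) / (a + b + n)

# A group whose observed proportion is 0.2:
small = posterior_mean(successes=1, n=5)      # little data -> shrunk toward 0.5
large = posterior_mean(successes=100, n=500)  # lots of data -> near the MLE 0.2
print(round(small, 3), round(large, 3))       # 0.4 0.206
```

With 5 observations the estimate sits near the overall mean; with 500 observations the data overwhelm the prior, just as in the animation.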

The abuse of MLE is nicely illustrated with the following experiment: Imagine you want to estimate the fairness of a coin, but flip it only 3 times. Let's say you observed TTT (a rare event, but not impossible). The Maximum Likelihood Estimate would be that the probability of a tail is 100%, and the coin is definitely not fair. Of course, we know this is a terrible estimate and a silly example, because we have solid prior information about the behaviour of coins. However, the sad fact is that many dolphin Mark-Recapture researchers are running similarly flawed analyses, under small sample sizes and overly complex Frequentist models which estimate 100% detection probability, 100% survival, or 0% movement. These terrible estimates are polluting our scientific literature.

The MLEs are indeed the most 'objective', but they make bad predictions for new data and give us ridiculous results at low sample sizes. In contrast, a little bit of skepticism encoded in a Bayesian prior (like Beta(1,1)) can temper our estimates. Returning to the coin-flipping example, the posterior would say that the probability of a tail is just 80%, given the observed sequence TTT. That's still not the true 0.5, but it is a better prediction than the frequentist's 100%.
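The arithmetic behind the TTT example is just the conjugate Beta-Binomial update: a Beta(1,1) prior plus 3 tails in 3 flips gives a Beta(4,1) posterior, whose mean is 0.8.

```python
# MLE versus the Beta(1,1)-prior posterior mean for the TTT coin flips.
# Posterior is Beta(1 + tails, 1 + heads); its mean is (1+tails)/(2+flips).
tails, flips = 3, 3

mle = tails / flips                    # 1.0 -- "tails is certain"
post_mean = (1 + tails) / (2 + flips)  # (1+3)/(2+3) = 0.8
print(mle, post_mean)                  # 1.0 0.8
```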

Unified Under Hierarchical Bayes


The poor performance of MLE was noticed by non-statisticians long ago in the Machine Learning and predictive-analytics communities. They obsessed over "regularization": finding ways to constrain the complexity of models and deliberately bias the estimates towards simplicity. Techniques like the LASSO, Ridge Regression, the Elastic Net, Boosting, AIC optimization, Bayesian Model Averaging, etc., were devised to make better predictions on new data, motivated by predictive performance rather than philosophical notions of probability and objectivity.

The eerie thing is that most of these techniques have a Bayesian interpretation. The famous LASSO is just a Bayesian model with a Double-Exponential (Laplace) prior on the regression coefficients. The idea is explored in an accessible article by Hooten and Hobbs 2015.

Subjective Pseudo Bayesian and the Degradation of Science


If ecologists are willing to inject a little bit of bias into their analyses in order to get better predictions and sensible results, then what of objectivity? What is to stop one from 'cooking the books'? Writing about cavalier 'pseudo'-Bayesian analyses, the Bayesian heavyweight James Berger warned in 2004:
Statistics owes its central presence in science and life to the facts that (i) it is enormously useful for prediction; (ii) it is viewed as providing an objective validation for science. Discarding the latter would lead to a considerable downgrading of the value placed on statistics.

At the time of his warning, few non-statisticians were using subjective Bayesian models, and even then, they seemed to want "objective" reference priors which conveyed no prior information and yielded nearly identical results to Frequentist models. It is a bit ironic that the opposite seems to have happened: the crisis of reproducibility has been unveiled under the watch of the so-called objective school of statistics, not of subjectively biased Bayesian models. Now that the tables have turned, and more and more researchers are turning to Bayesian models, Berger's 2004 warning sits, waiting to be realized. The solution is probably for everyone to adopt a prediction mentality. Dolphin ecologists have already sort of adopted this practice by using the AIC for model selection, but have yet to grapple with its philosophical underpinnings, or with rival ideas.

Hifime UH1 DAC under Linux and Jack

The Hifime UH1 USB DAC and Headphone Amp combo has some great features like:
- reference-quality DAC chip: Texas Instruments PCM5102 (a flagship chip used in other >$1000 DACs)
- sample rates of 44.1, 48, 88.2, 96, 176.4, 192 and 384kHz, and bit depths of 16, 24 and 32
- optional headphone amplifier (TPA6120, about which I couldn't find much information), as well as a direct line-out straight from the DAC itself. This is a great feature for someone who may have to drive high-impedance headphones, but still wants the flexibility of using other neutral amplifiers.
- asynchronous USB transfer with two TCXOs as master clock
- under $200 USD, ships from Hong Kong (not from the USA-based HifimeDIY)

What's really nice is that Hifime explicitly mentions support for Linux (without drivers!), a major selling point for me as I don't want to play around with kernel patches, etc.

Using with GNU/Linux JACK, Pulseaudio, and ALSA.
The DAC shows up as HiFimeDIY SA9227 USB Audio. On most Linux distributions, one will have to contend with Pulseaudio to get any sound out of the DAC. The easiest way is to use 'pavucontrol' (a lightweight Pulseaudio GUI) to explicitly change the 'Audio Interfaces'.

Unfortunately, Pulseaudio automatically down-samples audio. So, for bit-perfect audio, one needs to use a music player with ALSA support. I use gmusicbrowser, which is highly customizable (reminiscent of my only Win7 program, Foobar2000). Click on Main (menu) -> Settings -> Audio (tab) -> Output device: set to "Alsa"; click "advanced options" and enter "hw:Audio" as the Alsa option. Voilà! Enjoy audio at 192kHz and 24-bit!
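If you'd rather make the DAC the system-wide ALSA default (so any ALSA-capable player bypasses Pulseaudio's resampling), a ~/.asoundrc along these lines should work — a sketch I haven't tested on this DAC, with the card id "Audio" taken from the "hw:Audio" device string above:

```
# ~/.asoundrc -- route the default ALSA PCM and control devices
# straight to the DAC's hardware device (card id "Audio")
pcm.!default {
    type hw
    card Audio
}
ctl.!default {
    type hw
    card Audio
}
```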

See more at: http://hifimediy.com/hifime-uh1

Tuesday, 23 February 2016

Bayesian Mark-Recapture: Parsimony and Pollock's Closed Robust Design

At the Biennial Marine Mammal Conference 2016, I presented a Hierarchical Bayesian version of Pollock's Closed Robust Design (see presentation here). This was motivated by some bad behaviour of Program MARK in estimating detection probabilities and migration parameters for "sparse data" -- I say sparse, but for my sample sizes with dolphins, failure is all too common. In both real data and an extensive simulation exercise, I found that MARK was highly prone to "boundary-level" MLEs for survival, migration, and sometimes detection probabilities, leading to erratic abundance estimates.

Interestingly, mildly weak Bayesian priors (Beta(1,1)) help smooth over this bad behaviour ("objective" priors do not).

The image to the right shows simulations of the PCRD, comparing Maximum Likelihood estimation versus a Bayesian model... notice the infinite CIs and boundary-level MLEs from Program MARK! (See the full article link below.)

Hierarchical Bayesian PCRD
The world is going Bayesian, but few people understand why. Here, the true power of Bayesian inference is revealed in a hierarchical specification for the survival, detection, and migration parameters: in particular, we shrink between "time-invariant theta(.)" models and "time-varying theta(t)" models. I found that applying a particular type of shrinkage-inducing hyperprior, called the "Scaled Half Student-t distribution", led to somewhat similar estimates as "model-averaging" over many MLE-based fixed-effects models (run in MARK).

Hierarchical Bayes provides a type of automatic "Occam's Razor" (Berger 2004): when there is sparse/weak data, the hyperprior shrinks the model to a simple model, but when there is a lot of evidence in the data or a large sample size, the hyperprior is easily overwhelmed and the model takes on a lot of complexity. Reassuringly, the final estimates seem insensitive to the exact hyper-parameter values. As shown in the following animation...



I decided to upload an R & JAGS tutorial on github: https://github.com/faraway1nspace/PCRD_JAGS_demo Users can run simple fixed-effects Bayesian PCRD models, or they can run a Hierarchical Bayesian version that includes: i) individual heterogeneity in detection probabilities; ii) recruitment processes; and iii) a hyperprior that applies a type of "multimodel" inference. All things seem to unify in HB.

JAGS code: Fixed effects Bayesian PCRD
Here is JAGS code to run the simple fixed-effects model of the PCRD: phi(.)gamma'(.)gamma''(t)p(t,s). This is just for interested readers; serious users should consult the github demonstration. This implements the PCRD as a Bayesian HMM. Anyone who uses this JAGS code should cite: "Rankin RW, Nicholson K, Allen SJ, Krützen M, Bejder L, Pollock KH. 2016. A full-capture Hierarchical Bayesian model of Pollock's Closed Robust Design and application to dolphins. Frontiers in Marine Science, in Press. DOI: 10.3389/fmars.2016.00025". The article is available here.

model{
# priors
phi ~ dbeta(1,1) # apparent survival probability
gamma1 ~ dbeta(1.2,1.2) # temp migration: probability stays out conditional on being out
# loop through primary periods: parameters for detection probability pd_t, and gamma2_t
for(t_ in 1:T){ # T primary periods...
pd_a[t_] ~ dgamma(3,2) # hyperprior on detect probs
pd_b[t_] ~ dgamma(3,2) # hyperprior on detect probs
for(tt_ in 1:T2[t_]){ # loop through secondary periods...
pd[t_,tt_] ~ dbeta(pd_a[t_],pd_b[t_]) # secondary-period-level detection probs
}
gamma2[t_] ~ dbeta(1,1) # temp migration: prob migrates outside conditional on being inside
# recruitment process from 'eigenvector decomposition'
lambda[1,t_] <- (1-gamma1)/(gamma2[t_]-gamma1+1) # recruitment ratio, or long-term prob of being inside
lambda[2,t_] <- 1-lambda[1,t_] #
psi[t_] ~ dunif(0,1) # inclusion probability
# trmat: transition matrix for Markovian latent-states
# 1 =not yet in population;2=dead;3=offsite;4=onsite (only observable state)
# transition are from the column --> rows
# trmat[row,column,time] = [state at time=t_; state at time t_-1; time=t_]
trmat[1,1,t_] <- 1-psi[t_] # excluded from pop
trmat[2,1,t_] <- 0 # dead
trmat[3,1,t_] <- psi[t_]*lambda[2,t_] # inclusion into pop, outside study area
trmat[4,1,t_] <- psi[t_]*lambda[1,t_] # inclusion into pop, inside study area
trmat[1,2,t_]<- 0
trmat[2,2,t_]<- 1 # stay dead
trmat[3,2,t_]<- 0
trmat[4,2,t_]<- 0
trmat[1,3,t_]<- 0
trmat[2,3,t_]<- 1-phi # dies outside
trmat[3,3,t_]<- gamma1*phi # stays outside | outside
trmat[4,3,t_]<- (1-gamma1)*phi # reenters study area | outside
trmat[1,4,t_]<- 0 #
trmat[2,4,t_]<- 1-phi # dies inside
trmat[3,4,t_]<- gamma2[t_]*phi # leaves study area | inside
trmat[4,4,t_]<- (1-gamma2[t_])*phi # stays inside | inside
} # t_
# likelihood: loop through M individuals, both real and pseudo-individuals
for (i in 1:M){
#draw latent state at primary period 1:
# ... by definition, everyone starts in z=1 (not-in-population) at time=0
z[i,1]~ dcat(trmat[1:4,1,1]) # first z strictly excluded from pop
# likelihood for first primary period
for(tt_ in 1:T2[1]){ # loop through secondary periods
# Bernoulli process, conditional on z=4, otherwise no observation
y[i,tt_,1] ~ dbern(pd[1,tt_]*equals(z[i,1],4))
}
alive_i[i,1] <- step(z[i,1]-3) # count if i is alive or not
Nin_i[i,1] <- equals(z[i,1],4) # count if i is within study area
# loop through primary periods after 1st primary periods
for(t_ in 2:T){
# state process: draw z(t_) conditional on z(t_-1)
z[i,t_] ~ dcat(trmat[1:4, z[i,t_-1] , t_])
# likelihood: loop through secondary period observations
for(tt_ in 1:T2[t_]){
# Bernoulli observation error, conditional on z=4
y[i,tt_,t_] ~ dbern(pd[t_,tt_]*equals(z[i,t_],4))
}
# tally population size
alive_i[i,t_] <- step(z[i,t_]-3) # check alive or not
Nin_i[i,t_] <- equals(z[i,t_],4) # count if i is within study area
} # t_
} # i
# estimate population size per primary periods
for(t_ in 1:T){
alive[t_] <- sum(alive_i[,t_]) # number alive
Nin[t_] <- sum(Nin_i[,t_]) # number in study area
} # t_
}


JAGS code: Hierarchical Bayesian PCRD
Here is JAGS code for a Hierarchical Bayesian Model, phi(t)gamma'(t)gamma''(t)p(t,s). The Hierarchical Part uses 'hyperpriors' to deliberately shrink time-varying specifications, like phi(t), to a simpler time-invariant specifications, phi(.). The result is a type of "model parsimony", with a similar type of smoothing as "model averaging". The following is just a visual demo for the casual reader; serious users should refer to the github page and download the full R code. Anyone who uses this JAGS code should cite: "Rankin RW, Nicholson K, Allen SJ, Krützen M, Bejder L, Pollock KH. 2016. A full-capture Hierarchical Bayesian model of Pollock's Closed Robust Design and application to dolphins. Frontiers in Marine Science, in Press. DOI:10.3389/fmars.2016.00025". The article is available here
model{
#hyperpriors: logit(theta_mu)~Normal and theta_sd ~ half-t
phi.mu ~ dunif(pr.phiunif[1],pr.phiunif[2]) #mean survival with a Uniform prior
sigma.phi ~ dt(0,pr.tauphi[1],pr.tauphi[2]) T(0,) #mean survival dispersion, half-t hyperprior
g1.mu ~ dnorm(0,pr.g1mu) #mean gamma1, temp migration out | out
sigma.g1~dt(0,pr.taug1[1],pr.taug1[2]) T(0,) #mean gamma1 dispersion, half-t hyperprior
g2.mu ~ dnorm(0,pr.g2mu) #mean gamma2, temp migration out | in
sigma.g2~dt(0,pr.taug2[1],pr.taug2[2]) T(0,) #mean gamma2 dispersion, half-t hyperprior
pd.mu ~ dnorm(0,pr.pdmu) #mean detection prob, overall
sigma.pdmu~dt(0,pr.taupdmu[1],pr.taupdmu[2]) T(0,) #primary period detection prob dispersion
sigma.pd2nd~dt(0,pr.taupd2nd[1],pr.taupd2nd[2]) T(0,) #secondary periods detection prob dispersion
sigma.eps ~ dt(0,pr.taueps[1],pr.taueps[2]) T(0,) #individual detection prob dispersion
#time-variant parameters
for(t_ in 1:T){ #loop through primary periods
pd_mu[t_]~dnorm(pd.mu,pow(sigma.pdmu,-2)) #primary period mean detection prob (logit)
lgamma1[t_]~dnorm(g1.mu,pow(sigma.g1,-2)) #prob migrate out|out (logit)
gamma1[t_] <- ilogit(lgamma1[t_]) #prob migrate out|out (probability)
lgamma2[t_]~dnorm(g2.mu,pow(sigma.g2,-2)) #prob migrate out|in (logit)
gamma2[t_] <- ilogit(lgamma2[t_]) #prob migrate out|in (probability)
#RECRUITMENT: psi is the 'inclusion probability' and lambda is the 'recruitment ratio'
psi[t_]~dunif(0,1) #inclusion probability
lambda[t_] <- (1-gamma1[t_])/(gamma2[t_]-gamma1[t_]+1) #long-term probability inside study area
#NOTE, lambda could also be a random variable with a beta prior
#secondary-period detection probabilities
for(tt_ in 1:T2[t_]){ #loop through secondary periods
pd[t_,tt_] ~ dnorm(pd_mu[t_],pow(sigma.pd2nd,-2))
} #tt_
} #t_
#first state transition (pure nuisance; strictly from outside-pop to part of marked-pop)
trmat0[1] <- (1-psi[1]) #remains not-yet-in-pop
trmat0[2] <- 0
trmat0[3] <- psi[1]*(1-lambda[1]) #inclusion into pop, goes outside study area
trmat0[4] <- psi[1]*lambda[1] #inclusion into pop, goes inside study
#state transitions (2:T)
for(t_ in 1:(T-1)){
lphi[t_]~dnorm(log(phi.mu/(1-phi.mu)), pow(sigma.phi,-2)) #survival prob (logit)
phi[t_]<-ilogit(lphi[t_])
#state transitions
#trmat: transition matrix for Markovian latent-states
#1 =not yet in population; 2=dead;3=offsite;4=onsite (only observable state)
#transition are from the column --> rows
#trmat[row,column,time] = [state at time=t_; state at time t_-1; time=t_]
#notice that the primary periods are offset by 1 (because we already dealt with T=1)
trmat[1,1,t_]<- 1-psi[t_+1] #excluded from pop
trmat[2,1,t_] <- 0 #dead
trmat[3,1,t_] <- psi[t_+1]*(1-lambda[t_+1]) #inclusion into pop,outside study
trmat[4,1,t_] <- psi[t_+1]*lambda[t_+1] #inclusion into pop,inside study
trmat[1,2,t_]<- 0
trmat[2,2,t_]<- 1 #stay dead
trmat[3,2,t_]<- 0
trmat[4,2,t_]<- 0
trmat[1,3,t_]<- 0
trmat[2,3,t_]<- 1-phi[t_] #dies outside
trmat[3,3,t_]<- gamma1[t_+1]*phi[t_] #stays outside | outside
trmat[4,3,t_]<- (1-gamma1[t_+1])*phi[t_] #reenters study area | outside
trmat[1,4,t_]<- 0 #
trmat[2,4,t_]<- 1-phi[t_] #dies inside
trmat[3,4,t_]<- gamma2[t_+1]*phi[t_] #leaves study area | inside
trmat[4,4,t_]<- (1-gamma2[t_+1])*phi[t_] #stays inside | inside
} #t_
#loop through M individuals
for (i in 1:M){
#state transitions and likelihood for the first primary period
z[i,1]~ dcat(trmat0) #z at time 0 is strictly 'not-yet-in-pop'
alive_i[i,1] <- step(z[i,1]-3) #count if i is alive or not
Nin_i[i,1] <- equals(z[i,1],4) #count if i is within study area
eps_i[i] ~ dnorm(0,pow(sigma.eps,-2)) #random effects at individual levels
#Observation error y[i,tt_,t_] ~ Bernoulli conditional on being inside z=4
for(tt_ in 1:T2[1]){ #loop through secondary periods
y[i,tt_,1] ~ dbern(equals(z[i,1],4)/(1+exp(-pd[1,tt_]-eps_i[i]))) #inverse-logit transform
}
#state transition and likelihood for primary periods 2:T
for(t_ in 2:T){
#State process: draw z(t_) conditional on z(t_-1)
z[i,t_] ~ dcat(trmat[1:4, z[i,t_-1] , t_-1])
#Observation error y[i,tt_,t_] ~ Bernoulli condition on being inside z=4
for(tt_ in 1:T2[t_]){ #loop through secondary periods
y[i,tt_,t_] ~ dbern(equals(z[i,t_],4)/(1+exp(-pd[t_,tt_]-eps_i[i]))) #inverse-logit transform
}
#check whether i individual is alive and inside
alive_i[i,t_] <- step(z[i,t_]-3) #check alive
Nin_i[i,t_] <- equals(z[i,t_],4) #count if i is within study area
} #t_
} #i
#tally population size
for(t_ in 1:T){
alive[t_] <- sum(alive_i[,t_]) #number alive
Nin[t_] <- sum(Nin_i[,t_]) #number in study area
} #t_
} #end model

Thursday, 24 December 2015

The Bandcamp scavenger hunt for jobs: API, cat, and wav

I was bored and jetlagged this Xmas, so I decided to follow Bandcamp's job-hunting scavenger hunt: they lead prospective hires through several Unix-savvy tasks across the internet. I'd love to work in an applied musicology setting, but because I'm an ecologist, it's highly unlikely they'd actually hire me. So, as pre-rejection revenge, I'm going to spoil the scavenger hunt! Well, not really, but it's a great demonstration for my students of what skills are needed in the massive-information economy, and how unconventional (and applied) a hiring process can be.

Task 1: source code
On the Bandcamp jobs site, prospects are asked to look at the website's source code. Tool: wget
$ wget -O /tmp/bd.html https://bandcamp.com/jobs/#systems
... Upon downloading, one must open the file and scour the annotations for the next instruction...

Task 2: Bandcamp API
"Job hunters, use the Bandcamp API (use API key 'buithagsitannagalladrodlask') to get the url of band id 1 and go there. (Hint: not all API versions are the same."
This is an annoying detour through the particulars of their API: it's a simple URL-based information-retrieval system. Send a URL, receive JSON. Answer: http://api.bandcamp.com/api/band/1/info?band_id=1&key=buithagsitannagalladrodlask which leads us to http://bandcamp.bandcamp.com for our next clue.
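The query could just as easily be scripted. A sketch of building that request URL programmatically (the endpoint and key are the ones quoted above; the API has long since been retired, so only the URL construction is shown):

```python
# Assemble the Bandcamp API request URL from its parts (sketch only).
from urllib.parse import urlencode

base = "http://api.bandcamp.com/api/band/1/info"
params = {"band_id": 1, "key": "buithagsitannagalladrodlask"}
url = base + "?" + urlencode(params)
print(url)
# http://api.bandcamp.com/api/band/1/info?band_id=1&key=buithagsitannagalladrodlask
```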

Task 3: ruby decrypting
http://bandcamp.bandcamp.com invites the job-hunter to download a music file, like any other Bandcamp download site. However, the output is not a FLAC: it is 7 large (seemingly) junk files in some scrambled order, variously called "Bandcamp - You’re Getting Warm - part51.txt". Presumably, the mystery files are something to be concatenated, in some order, the clue being a string of ruby code: "srand(42); 7.times {puts rand(100)}". This seemingly random operation is seeded, for a reproducible output: 51,92,14,71,60,20,82

Task 4: cat a wav
The ruby code spits out the order of the files. Now what? Let's use Unix 'cat' to make a new file. The header in the first file, '51', is "RIFF": some quick internet-searching reveals that the intended format is a WAV audio file. So we output the concatenation as a .wav file and it plays. A synth-voice tells us where to send resumes and the secret codeword to add so they don't delete our application.
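The same concatenation step can be sketched in Python, for those without cat handy (the file-name template is my guess from the part names above; the order list is the ruby output):

```python
# Reassemble the download from its parts, in the seeded order that the
# ruby snippet printed. The template/paths below are illustrative.
ORDER = [51, 92, 14, 71, 60, 20, 82]

def reassemble(template, order, out_path):
    """Concatenate the numbered part files, in order, into one output file."""
    with open(out_path, "wb") as out:
        for n in order:
            with open(template.format(n), "rb") as part:
                out.write(part.read())

# e.g. reassemble("Bandcamp - You’re Getting Warm - part{}.txt", ORDER, "warm.wav")
```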

Hurray! That's a lot of work for an ecologist. But what's a more impressive prospect? An ecologist who knows Linux and APIs, or a comp-sci student who is idle enough to follow a silly scavenger hunt?

Saturday, 2 August 2014

remote connect R, SSH, EMACS, ESS + persistent sessions

I'm heading to Duke University on an exchange to study statistics, and hope to maintain access to my high-performance desktop at Murdoch University in Australia (a.k.a. 'The Legend'), which runs Ubuntu. After googling around a couple of options (and thanks in particular to this helpful exchange with Dirk Eddelbuettel at stackoverflow) I've set up a means to keep R processes running persistently over ssh, even after logging out. I thank the Great Spaghetti Monster that I decided to learn EMACS! (See below for an alternative way to create persistent R sessions without using EMACS.)

The following instructions assume that the remote computer is running in a Linux Environment, is available to the outside world with an IP address, has an ssh server installed and IS running (I installed openssh-server), and of course has a version of EMACS 24 installed.

On the local machine, open a terminal and enter the command:

ssh -X user@hostipaddress

... then go through prompts for passwords and passkeys.
On the first log-in, activate the emacs-daemon and start emacs buffers

emacs --daemon
emacsclient -nw


The daemon is what allows processes to keep running after disconnection and lets you resume the session later when you log back in. The second line opens EMACS while displaying the buffers in the ssh terminal. One can proceed to open R scripts and start R sessions, etc., using familiar commands such as M-x R-mode.

Now, you can simply exit EMACS (C-x C-c) and logout of the remote computer (logout), but the buffers and processes will persist! To reconnect simply restart ssh access to the remote computer and redo the emacsclient -nw.

ssh -X user@hostipaddress
emacsclient -nw


... R should still be running :) For some reason, my C-space and C-S-space keybindings do not work (though all other customized keybindings do), so be prepared to use C-c C-n and C-c C-j.

Reconnecting a remote R session (without EMACS)

For those of you who don't use EMACS, I found tmux to be an easy alternative way to disconnect from and reconnect to R sessions (this assumes the remote computer has tmux installed). While in a remote ssh session, simply start a tmux session by typing the command "tmux", then type "R" to initialize an R instance. When you want to disconnect the ssh connection, type "CTRL-b d" to detach from tmux (the R session nonetheless persists), then you're free to logout and close the remote connection. Later, when you want to reconnect to R, reestablish the ssh connection and type "tmux attach" to re-enter tmux. Your R session will still be there, chugging along as if nothing happened.

Sunday, 1 June 2014

Planned contrasts and ANCOVA in Excel / LibreOffice Calc

I just finished TA'ing Advanced Quantitative Methods in Psychology at Murdoch University, and I'd like to share an Excel/Calc file (note it's an .ods) to help students visualize ANCOVA/regression. Users can tweak regression coefficients to minimize the Sum of Squared Residuals and/or find the lines of best fit. For the math enthusiasts, it also demos the underlying matrix multiplication needed to solve the problem analytically. The Experiment: we are interested in whether treatments of sucrose (8%, 32%, 64%) affect rats' speed at completing a maze (the DV) while controlling for each rat's age (a.k.a. the 'covariate' in ANCOVA).

One should never 'do stats' in Excel (so I also provide R code below), but this is a great, quick way to visualize several statistics concepts that many non-statisticians may not understand, including:
- how a mix of continuous and categorical variables can be visualized as different parallel lines
- how categorical variables are represented as indicator variables in the model matrix
- how Planned Contrasts determine the type of indicator columns (the default in the .ods file uses the 'first' treatment as the reference treatment, like in R, but users can code their own. Try the Helmert!)
- how to solve least-squares regression coefficients via matrix multiplication

If I were a teacher in an intro stats course, I'd ask students to reproduce the above. One operation I couldn't figure out was how to calculate the S.E. of the regression coefficients: this requires more matrix operations than Excel offers.

# ANCOVA in R
# data: DV is speed to complete maze
# categorical variables: 3 sucrose treatments
# covariate: rat AGE
library(car)
d<-data.frame(speed=c(1.4,2,3.2,1.4,2.3,4,5,4.7,6.2,3.1,3.2,4,4.5,6.4,4.4,4.1,5.8,6.6,6.5,5.9,5.9,3,5.9,5.6), trt=factor(rep(c(0,1,2),each=8),labels=c("sucrose8","sucrose32","sucrose64")), age=c(12,5,8,7,11,2,3,4,7,5,12,12,8,4,3,2,5,8,12,2,7,11,4,3))
# calculate regression coefficients and statistics manually
# then compare to worksheet and R
X <- model.matrix(~age+trt,data=d) # model matrix
betas<-solve(t(X)%*%X)%*%(t(X)%*%d$speed) # regression coefficients
SSm<-sum(((X%*%betas) - mean(d$speed))**2) # explained variation
SSr<- sum(((X%*%betas) - d$speed)**2) # residual sum of squares
df_m<-ncol(X)-1; df_r<-nrow(X)-ncol(X) # degrees of freedom
MSm<-SSm/df_m # Mean Square of Model
MSr <-SSr/df_r # means square error / estimate of variance
F.stat <- MSm/MSr

# standard error of regression coefficients
betas.se<- sqrt(diag(solve(t(X)%*%X)*MSr)) # MSr is an estimate of sigma^2
betas.t <- betas/betas.se # chance of seeing this large of effect ~ t distr
betas.p <- 2*(1-pt(q=abs(betas.t),df=df_r)) # p-values of regression coefs


# redo above, but with R's linear model function
m1 <- lm(speed~age+trt,d)
F.stat1<-summary(m1)$fstatistic # compare F.stat and F.stat1
coef(m1) # compare with betas
summary(m1)$coefficients # compare with betas.se,t,and p

# try with different Planned Constrasts:
# use Helmert contrast to test the hypothesis that 8% is different from
# 32% and 64% is different from both 8 and 32%
m.helm <- lm(speed~ age+trt, data=d, contrasts = list(trt = "contr.Helmert"))

Tuesday, 18 March 2014

lmms: powerful Industrial music studio

Last Tuesday I saw Nine Inch Nails live in Perth (with a guest appearance by How to Destroy Angels!), and they inspired me to finally compose synth / industrial music. This is surprisingly possible when you have a GNU/Linux distro and the industry-grade software lmms, Linux MultiMedia Studio, something akin to the commercial Pro Tools or FL Studio. You can engineer your own sounds and compose your own music with a large collection of instruments, effects, and synthesizers. For many years I have played around with fragments of melodies and drumbeats, but here is my first "song". With a dose of creativity and OCD, this shows what you can do in a mere 24 hours.



For the uninitiated, the lingo and layout of lmms or Ardour3 or other Digital Audio workstations can be very intimidating. I hardly know where to start, nor how to give advice. Thanks to the rise of Dubstep, there are hundreds of very helpful youtube tutorials such as: http://www.youtube.com/watch?v=naQW2RkpfYY. I like this one because it is a nice visual intro to detuning and combing saw-tooth sound waves via an 'oscillator' to make a sound (instrument?) recognizable in, say, Skrillex.

Composing music with a DAW has revolutionized how I listen to songs: I can't help but listen for the subtle signatures of familiar saw-waves or moog-filters or peculiar sounds and then try to recreate them. It feels great, like becoming an active participant rather than just a passive consumer.