Wednesday, 6 September 2017

Workshop on Bayesian Capture-Recapture: 2017 Conference of the Society for Marine Mammalogy, Halifax, NS, Canada

My colleague Krista Nicholson and I will be running a full-day practical workshop on Bayesian Capture-Recapture. The workshop will be held at the 2017 Conference of the Society for Marine Mammalogy in Halifax, NS, Canada, on Sunday, October 29, 2017.

More information about registration and pricing can be found here. The workshop is only available to conference attendees.

Description. This workshop is run in response to recent advances in CR modelling showing that there are a number of advantages to implementing Bayesian CR models over other popular methods. These advantages will be highlighted and discussed throughout the workshop.

The workshop will include an introduction to Bayesian theory and beginner/intermediate tutorials on R and JAGS. JAGS is a popular and flexible Bayesian scripting language that can code CR models and much more! The workshop is suitable for students, researchers, professors, veteran Program MARK users, and anyone who is already familiar with CR and would like to learn how to implement Bayesian CR models. Intermediate familiarity with R and CR is expected.

Outline of the workshop: i) Bayesian philosophy; ii) introduction to the BUGS/JAGS language; iii-v) practical tutorials to implement common CR models for cetaceans (POPAN, PCRD, individual-covariate models); vi) Bayesian model-selection; vii) Hierarchical Bayesian models; viii) open-session to be determined by pre-workshop participant outreach (spatial capture-recapture, cloud-computing and high-performance computing, multi-event).

You can register for this workshop online at the conference website http://www.smmconference.org/.

We welcome participants to express their interest in any particular topics to be covered or discussed during the workshop and invite questions on and submissions of difficult data sets, which we may be able to work on and discuss in the workshop.

If you have any inquiries about the workshop or its content, please do not hesitate to contact Dr. Robert Rankin at robertw.rankin@gmail.com

Tuesday, 10 January 2017

A Primer on the Bias-Variance Trade-off in ecology

This blog post is adapted from the Appendix of Rankin RW. 2016. EM and component-wise boosting for Hidden Markov Models: a machine-learning approach to capture-recapture. bioRxiv. doi:10.1101/052266.


I will use simulations to illustrate the "bias-variance trade-off" for ecologists: one of the most important ideas in statistical modelling of the past half-century. The goal is to show how the AICc and another model-selection/model-averaging technique called "boosting" each negotiate the trade-off in order to improve estimation, specifically, to minimize the expected error of estimating survival phi over T capture periods. The trade-off is fundamental to understanding the optimality of Frequentist shrinkage estimators and AIC model-selection. The illustrations are inspired by Figure 6.5 in Murphy 2012, but are adapted to a type of capture-mark-recapture model called the Cormack-Jolly-Seber model.

The trade-off is an old idea without a citable origin (Geman et al. 1992 is often considered the definitive reference, but the phenomenon was clearly discussed as early as 1970 by Hoerl in the context of Ridge Regression, and likely earlier). Despite being an old and fundamental concept of statistical estimation, I have noticed that it is poorly understood among academics and government scientists. In particular, it is my experience that ecologists are unduly wedded to the idea of being unbiased (in estimation), such that when they are presented with visual and quantitative evidence about the optimality of biased shrinkage estimators, they recoil at the sight of systematic bias and ignore the crucial role of variance. Of course, bias is not desirable in and of itself, but so long as the bias goes to zero at a rate proportional to that of the variance, we may be able to improve our overall estimation performance by incurring a little bias.

In the following simulations, the goal is to minimize the expected error of estimating survival, as quantified by the Mean Square Error (MSE). The MSE is an abstract, population-level quantity that can only be measured in simulations, when we know the "true" process. It is Frequentist in the sense that we hope to minimize the error over all possible data-sets that one might sample from the true population Y. These multiple realizations are shown as grey lines in Figures 1 and 2. Of course, an analyst only has one dataset, and their goal is to get their estimates as close as possible to the truth.

Figure 1: Decomposing the error of estimation (MSE) into its bias and variance components. An estimation procedure will negotiate the bias and variance so as to minimize the MSE. Top, a simulation of a true survival process (red line). Each grey line represents one dataset sampled from the population and an analyst's attempt to estimate survival using multi-model inference procedures, such as boosting. The dashed black line is the mean estimate over all 30 independent grey lines. Middle, a visualization of the variance component, showing the variability of point-wise estimates due to randomness in the sampled data and a procedure's sensitivity to such differences. Bottom, a visualization of the bias: the expected difference between the truth and the procedure's estimates, over all realizations of the data.


The bias-variance trade-off arises from a classic decomposition of the expected error: MSE = (E[phi_hat] - phi_true)^2 + Var[phi_hat] + constant. Figure 1 also shows this decomposition. The first term is the squared bias, where the bias is the expected difference between an estimate and the true value. This difference is visualized as the red polygon in Figure 1. In the same figure, the bias manifests as shrinkage from the true red line towards a flat global mean. Quantifying the bias requires knowledge of the truth phi_true, and is therefore inaccessible in real-life situations. The second term is the variance, and it does not depend on knowledge of the truth. Rather, it arises due to the vagaries of random sampling as well as the complexity of the estimation procedure: overly complex models which "over-fit" one dataset will vary wildly when fitted to a new dataset sampled from the same population. The variance can be visualized as the spread of the grey lines, or the green polygon in Figure 1.
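
For readers who want to see the decomposition in action, below is a minimal R sketch (my own toy example, not taken from the manuscript) that checks that squared bias plus variance approximately equals the Monte-Carlo MSE for a deliberately biased, shrinkage-style estimator of a single survival probability:

set.seed(1)
phi.true <- 0.8                            # true survival probability
n <- 30                                    # individuals per simulated dataset
n.sims <- 10000                            # number of simulated datasets
# a shrinkage-style estimator: pull the sample proportion towards 0.5
estimates <- replicate(n.sims, {
  y <- rbinom(1, n, phi.true)              # number of survivors in one dataset
  (y + 5*0.5) / (n + 5)                    # biased, but lower-variance, estimate of phi
})
bias2 <- (mean(estimates) - phi.true)^2    # squared bias
vari  <- var(estimates)                    # variance of the estimator
mse   <- mean((estimates - phi.true)^2)    # Monte-Carlo MSE
round(c(bias2 = bias2, variance = vari, bias2.plus.variance = bias2 + vari, MSE = mse), 5)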

The MSE decomposition has a naive meaning: in order to optimize our estimation performance, we should reduce the bias and/or the variance. Clearly, most ecologists see the value of tackling either of these two terms. But the nature of a trade-off has a more elusive importance: we cannot, in general, minimize both terms for a given sample size, and we may deliberately increase one term in order to decrease the other. Shrinkage estimators incur a little bias and have lower variance (i.e., the red polygon is bigger but the green polygon is smaller). This strategy results in much smaller MSE values than complex unbiased estimators. In contrast, the MLEs of the complex full-model are unbiased but typically have very high variance. This strategy is often worse at minimizing the MSE for small-to-moderate sample sizes.

The following simulations show how different statistical methods have different strategies in negotiating the bias-variance trade-off. Imagine an analyst who is confronted with four different methods to estimate survival. The first is estimation by Maximum Likelihood using the full-model p(t)phi(t). The second method is AICc model-selection, and the third is AICc model-averaging; both use the following fixed-effects models: p(.)phi(.), p(t)phi(.), p(.)phi(t), and p(t)phi(t) with obvious constraints on the last terms. The fourth method is called CJSboost, from statistical boosting. It uses submodels called "base-learners" equivalent to the aforementioned fixed-effect models (but without the previous constraints). The AICc-methods should theoretically do best because they are fundamentally motivated by trying to minimize an objective function that is very closely related to MSE called the KL-loss (see Akaike 1979 and Akaike 1998). Likewise, CJSboost is trying to minimize a related generalization-error called the negative Expected log-Likelihood, which is approximated through bootstrapping the capture-histories.

The fake data-sets were generated according to the following. The time-varying survival values were:
phi_t = cos((t-2.3)/1.2)/11 + 0.75. The time-varying capture-probabilities p_t were drawn from a beta distribution with shape parameters A=12 and B=12, resulting in an average capture-probability of 0.5. The p_t values were the same for all simulations. MLE and AICc analyses were run in Program MARK. For CJSboost, a ten-times 70-fold bootstrap-validation exercise was run per dataset to tune the CJSboost regularization parameters. The CJSboost program is available on my Github site. The simulations and analyses were repeated 40 times for three scenarios pertaining to the number of capture-histories n = {50, 200, 800}.
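
For concreteness, the data-generating values described above can be reconstructed in R as follows (a sketch of the generating quantities only, not the full capture-history simulator):

set.seed(42)
T <- 10                                    # number of capture periods
t <- 1:T
phi.true <- cos((t - 2.3)/1.2)/11 + 0.75   # time-varying survival
p.true   <- rbeta(T, 12, 12)               # capture probabilities, average 0.5 (held fixed across simulations)
round(rbind(phi = phi.true, p = p.true), 3)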

Figure 2: Visualizing the bias-variance trade-off and the error of estimating survival in a Cormack-Jolly-Seber analysis, using four procedures (panel rows): i) the shrinkage estimator CJSboost; ii) AICc model-averaging based on four fixed-effect models of time-varying vs. time-constant survival and capture-probabilities; iii) the best AICc model; and iv) the Maximum Likelihood Estimate using the full-model p(t)phi(t). Panel columns are different sample sizes (number of capture-histories) over T=10 primary periods. The red lines show the true survival. Each grey line is an independently sampled dataset and an analyst's attempt to estimate survival. The dashed lines represent each procedure's average estimate over 40 simulated data-sets and analyses. The best estimation procedure has the lowest MSE (turquoise for emphasis). Each procedure may have a high/low bias or low/high variance, but generally cannot succeed at minimizing both. The bias is the difference between the red and dashed line. The variance is represented by the dispersion among grey lines. At small sample sizes, the AICc methods and boosting are very biased but have better MSE.

The results clearly show the trade-off (Figure 2). At high sample sizes (n=800), the shrinkage estimator CJSboost has the lowest MSE and therefore wins at estimating survival. However, it has the highest bias. How can it be considered a better estimator than the other methods when it is biased? The answer is obvious when looking at the grey lines in Figure 2, where each line is an estimate of phi from an independent realization of data: compared to the other methods, each grey line from CJSboost is much more likely to be closer to the truth, despite systematic bias. In contrast, using the MLEs, one can only claim to be unbiased over all possible realizations of the data as shown by the closeness of the dashed black line to the true red line. But, for any one realization (a single grey line) the MLEs can be very far away from the truth due to much higher variance.

At smaller sample sizes, we see that the bias becomes much more extreme for both AICc methods and CJSboost. In the case of the AICc methods, the model with the most support is often phi(.), in which case the estimates are a single flat line. This is also the case in CJSboost, where shrinkage is so extreme as to force a flat line. Therefore, at low sample sizes, we are much better off, in terms of MSE, to use the flat-lined phi(.) estimates rather than the full-model MLEs, which vary so wildly as to be useless.

This primer is meant to illustrate the role of bias and variance in estimation errors. Simulations show how shrinkage estimators (CJSboost) and model-selection (by AICc) each negotiate the trade-off between bias and variance to try and minimize the Expected Error. CJSboost does particularly better by incurring a little bias.

The original article can be accessed at: http://dx.doi.org/10.1101/052266. Original blog post at mucru.org/a-primer-on-the-bias-variance-trade-off-in-ecology/.

Tuesday, 10 May 2016

New pre-print: EM and component-wise boosting for Hidden Markov Models: a machine-learning approach to capture-recapture

Rankin RW (2016) "EM and component-wise boosting for Hidden Markov Models: a machine-learning approach to capture-recapture", bioRxiv pre-print, doi:10.1101/052266, URL:http://github.com/faraway1nspace/HMMboost


A new pre-print article is available online at: http://dx.doi.org/10.1101/052266. The study proposes a new way to fit capture-recapture models based on boosting: a machine-learning method which iteratively fits simple, non-parametric learners and builds a strong prediction function. This ensemble method is a type of "multi-model inference" and "model-parsimony" technique, not unlike model-averaging by AICc. The new method, called CJSboost, is motivated by several deficiencies in the AICc model-selection approach, such as over-fitting the data and a tendency to produce ridiculous estimates (e.g., 100% survival). In contrast, boosting has the following benefits:

  • automatic variable selection and step-wise multimodel inference (without the sometimes-impossible task of fitting all possible fixed-effects models, as in AIC-based model averaging);

  • regularization and sparse estimation, which deflates the influence of unimportant covariates (especially important during the current crisis of reproducibility);

  • shrinkage of estimates away from extreme values and inadmissible values (e.g., survival = 100%);

  • highly extensible (see the wide variety of base-learners available under the "mboost" R package);

  • inference based on predictive performance.


  • One disadvantage of the machine-learning paradigm for capture-recapture is the computational burden of running 50 or 100 bootstrap-validations to find the optimal regularization parameters that minimize the generalization error.

    The Prediction Perspective

    Capture-recapture practitioners are perennially obsessed with model-selection and model-parsimony: the challenge of revealing important patterns in the data, without being tricked by frivolous false discoveries, i.e., not over-fitting models. The machine-learning community has addressed this challenge through the idea of generalization error: minimizing the cost of bad predictions on new, unseen data. Models that are too simple (e.g., the classic phi(.)p(.) model) are unlikely to make good predictions, whereas overly-complex models are likewise unlikely to generalize to new data.

    The boosting framework controls the complexity of the model (called regularization) by trying to minimize the generalization error (aka expected loss), and hence avoid over-fitting the data. Interestingly, most capture-recapture practitioners have been implicitly using a prediction criterion, the AIC, to rank and select models. Few ecologists seem to recognize the connections between AIC-selection and prediction, and this manuscript tries to make these connections more obvious. However, while the AIC was developed in the context of General Linear Models with Gaussian error, the CJSboost method proposed in this manuscript is crafted specifically for capture-recapture models.
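
    As a toy illustration of that connection (my own example, not from the manuscript), the R snippet below shows that AIC rankings of nested linear models track their average predictive log-likelihood on new data drawn from the same process:

    set.seed(1)
    n <- 60
    x <- rnorm(n)
    make.data <- function() data.frame(x = x, y = 0.5*x + rnorm(n))   # true effect of x is 0.5
    train <- make.data()
    fits <- list(null   = lm(y ~ 1, data = train),
                 linear = lm(y ~ x, data = train),
                 poly6  = lm(y ~ poly(x, 6), data = train))            # deliberately over-fit
    AICs <- sapply(fits, AIC)
    # average log-likelihood of 200 new datasets from the same data-generating process
    pred.loglik <- sapply(fits, function(f) {
      mean(replicate(200, {
        newdat <- make.data()
        sum(dnorm(newdat$y, mean = predict(f, newdata = newdat), sd = summary(f)$sigma, log = TRUE))
      }))
    })
    round(rbind(AIC = AICs, out.of.sample.loglik = pred.loglik), 1)    # lower AIC goes with higher predictive log-likelihood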

    Sparsity and Prediction

    One of the philosophical divisions between prediction-optimal regularizers, like boosting or the AICc, and other criteria, like the BIC/SBC, is the belief in a sparse "true model": whether there is a small number of truly influential variables that have large/medium effects, or whether there are an infinite number of influential variables, each with a decreasingly important influence on our data. This division is rooted in whether we want to minimize our risk of making bad numerical predictions, or whether we want to recover the "truth". The AIC and CJSboost adopt the former perspective: they both reduce the influence of unimportant variables, but there is still some positive weight placed on unimportant variables (this is the consequence of being prediction-optimal!).

    If prediction is not one's focus, and one is interested in finding a sparse set of truly influential variables, then CJSboosted coefficients may be "hard-thresholded": discard unimportant covariates and rerun the algorithm. But how does one judge which covariates are important? In the manuscript, I describe the technique of "stability selection" (Meinshausen and Bühlmann, 2010) for estimating (approximate) inclusion probabilities, based on repeatedly resampling the data and re-running the CJSboost algorithm. Probabilities lead to straightforward inference: covariates with high inclusion probabilities are probably important; covariates with low probabilities are probably not important. An interesting thing about l1-regularizers, like boosting algorithms, is that by hard-thresholding the covariates with stability-selection probabilities, we can transform a prediction-optimal procedure into a "model-selection consistent" procedure. This may not be exactly true for capture-recapture models, but the manuscript explores the issue with simulations.
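
    To make the resampling mechanics concrete, here is a small, self-contained R sketch of stability selection that uses the lasso (package "glmnet") as a stand-in regularizer in place of CJSboost, on a toy logistic-regression problem with 3 truly influential covariates and 18 noise covariates (the online tutorial covers the actual CJSboost workflow):

    library(glmnet)
    set.seed(1)
    n <- 300; p <- 21
    X <- matrix(rnorm(n*p), n, p)
    beta <- c(1.5, -1.5, 1.0, rep(0, p - 3))           # only the first 3 covariates matter
    y <- rbinom(n, 1, plogis(X %*% beta))
    B <- 100                                           # number of resamples
    lambda <- 0.05                                     # a fixed, fairly strong regularization
    selected <- matrix(0, B, p)
    for(b in 1:B){
      idx <- sample(n, n/2)                            # subsample half the data
      fit <- glmnet(X[idx,], y[idx], family = "binomial", lambda = lambda)
      selected[b,] <- as.numeric(coef(fit)[-1] != 0)   # did each covariate enter the model?
    }
    round(colMeans(selected), 2)                       # approximate inclusion probabilities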

    The animation below shows 30 simulations and their stability selection pathways. Each slide is a different simulated dataset with 3 truly influential covariates (in red) and 18 other frivolous covariates, just to try and trick the algorithm. The pathways show that as the regularization gets weaker (larger m, to the right), there is a larger probability that any covariate will be selected by the algorithm. At strong regularization (small m, left), only the 3 most influential covariates (in red) enter the algorithm. Visually, this is an interesting way to discriminate between important and unimportant covariates, and can also help one achieve model-selection consistency.


    Simulation: How to discriminate between important covariates (in red) and unimportant covariates (in grey) with stability-selection curves. By repeatedly re-sampling the capture-histories (bootstrapping) and re-training the CJSboost model, we can estimate the (approximate) probability that a covariate is included in the model.

    Online Tutorial

    The manuscript includes an online tutorial in R at: http://github.com/faraway1nspace/HMMboost. Interested readers can step through the analyses in the manuscript.

    Peer-review

    Readers should note that the manuscript is only a preprint: it is currently in peer-review.

    @article{Rankin052266,
    author = {Rankin, Robert William},
    title = {EM and component-wise boosting for Hidden Markov Models: a machine-learning approach to capture-recapture},
    year = {2016},
    doi = {10.1101/052266},
    publisher = {Cold Spring Harbor Labs Journals},
    abstract = {This study presents a new boosting method for capture-recapture models, rooted in predictive-performance and machine-learning. The regularization algorithm combines Expectation-Maximization and boosting to yield a type of multimodel inference, including automatic variable selection and control of model complexity. By analyzing simulations and a real dataset, this study shows the qualitatively similar estimates between AICc model-averaging and boosted capture-recapture for the CJS model. I discuss a number of benefits of boosting for capture-recapture, including: i) ability to fit non-linear patterns (regression-trees, splines); ii) sparser, simpler models that are less prone to over-fitting, singularities or boundary-value estimates than conventional methods; iii) an inference paradigm that is rooted in predictive-performance and free of p-values or 95\% confidence intervals; and v) estimates that are slightly biased, but are more stable over multiple realizations of the data. Finally, I discuss some philosophical considerations to help practitioners motivate the use of either prediction-optimal methods (AIC, boosting) or model-consistent methods. The boosted capture-recapture framework is highly extensible and could provide a rich, unified framework for addressing many topics in capture-recapture, such as spatial capture-recapture, individual heterogeneity, and non-linear effects.},
    URL = {http://dx.doi.org/10.1101/052266},
    eprint = {http://www.biorxiv.org/content/early/2016/05/09/052266.full.pdf},
    journal = {bioRxiv}
    }

    Monday, 21 March 2016

    New publication: A Bayesian Hierarchical Pollock's closed robust design

    Rankin, R. W., Nicholson, K. E., Allen, S. J., Krützen, M., Bejder, L., & Pollock, K. H. (2016). A full-capture Hierarchical Bayesian model of Pollock’s Closed Robust Design and application to dolphins. Frontiers in Marine Science, 3(25). doi:10.3389/fmars.2016.00025

    We are pleased to announce our new publication on Shark Bay bottlenose dolphins, which benchmarks model-averaging in Program MARK against a Bayesian Hierarchical model for temporary-migration Robust Design mark-recapture models. See below for the citation, a link to the free full-text PDF, and an online R/JAGS tutorial for Bayesian mark-recapture.

    Alternative to AIC Model-Averaging

    The paper will be of interest to cetacean researchers who use Program MARK for temporary-migration Robust Design models. In particular, we show that a Hierarchical Bayesian model can yield similar estimates to model-averaging by AICc, the latter being the current best practice for dealing with the vast number of 'fixed-effects' models that one typically considers. Model-averaging and Bayesian frameworks have some similar philosophical underpinnings, such as conditioning on the data (Burnham and Anderson 2014). However, the HB framework is also highly extensible and can deal with other challenges where the AIC is undefined, such as random-effects and individual-level heterogeneity in capture-probabilities.

    Mark-Recapture and low-sample sizes: the Bayesian Answer

    Bayesian models are a solid answer to a perennial dilemma among cetacean researchers: photo-ID datasets are typically sparse or have low sample sizes. At the same time, researchers typically want complex, data-hungry models to increase ecological realism. For example, a simple temporary-migration model or individual-heterogeneity model will demand >30-70 variables for a mid-sized dataset. Frequentist and AICc-based inference will be over-confident in such situations, and yield ridiculous estimates such as 100% detection, 0% migration, or 100% survival, or just fail altogether. Alternatively, Hierarchical Bayesian models provide exact inference under low sample sizes: they just depend more on the prior distributions, which, if set up thoughtfully, are more conservative, make better predictions, and can automatically safeguard against over-parametrization (Berger 2006, Gelman 2013).

    Individual Heterogeneity

    As the distribution of individual capture probabilities gets wider, the least-capturable individuals are more likely to be missed altogether (grey area), leading to a biased-high estimate of the mean capture parameter (p-hat) and biased-low population abundance estimates (top-right corner)
    Individual heterogeneity in capture probabilities will result in biased-low population abundance estimates (see animation), and therefore it is a primary preoccupation of most capture-recapture practitioners. Under a Hierarchical Bayesian full-capture framework, it is trivial to model individuals as coming from a distribution, without a large increase in complexity. In contrast, the comparable fixed-effect version in Program MARK, the 'two-point finite mixture model', typically yields over-parametrized models and unreliable capture-estimates (e.g., p=1).

    R and JAGS code

    See our online R/JAGS tutorial at Github for code to run the Hierarchical Bayesian Pollock's Closed Robust Design. The tutorial includes an example photo-ID bottlenose dolphin dataset from Nicholson et al. 2012. We use the flexible BUGS-like Bayesian syntax called "JAGS", which makes Bayesian models accessible to almost anyone with rudimentary scripting skills.

    Key Findings


  • full-capture, non-hierarchical Bayesian PCRD models had slightly better estimation performance than equivalent fixed-effects Maximum-Likelihood estimation (in MARK), mainly due to the latter's susceptibility to singularities (although there was no clear champion);
  • we propose a Hierarchical Bayesian PCRD which can lead to similar estimates as AICc model-averaging and serve as a type of multi-model inference;
  • we showed how heterogeneity in detection probabilities can lead to an 8-24% increase in bottlenose dolphin abundance estimates, as compared to ML and Bayesian models that assume homogeneous detection probabilities;
  • we explored the partial non-identifiability and high correlation among parameter estimates, especially between survival and temporary-migration, which has serious consequences for one's ability to use these parameters for inference, and which should influence researchers' study design and modelling strategies;
  • we proposed two posterior predictive checks to help diagnose poor model fitting, in lieu of a formal goodness-of-fit procedure in popular CMR software.


    But Aren't Bayesians Biased?

    Some MARK users who are new to Bayesian inference may worry about prior information and the inherent bias of subjective Bayesian models. A subjective Prior contains information. This information competes with the information in the data to influence the Posterior distribution (the probability distribution which is used to make conclusions about a variable of interest, like population abundance or survival). The more information in a Prior, the more a posterior expectation will be pulled away from the "objective" MLE and towards the prior expectation; conversely, the more information in our data (e.g., a large sample size), the less influential the prior becomes, and inferences are practically the same between ML-based and Bayesian paradigms.

    But there is strong evidence from the machine-learning and predictive-analytics community that slightly conservative, biased models yield better predictions, especially in the face of low sample sizes and very complex models (Murphy KP, 2012). In the Learning community, this is called "regularization", as in the Lasso, Ridge Regression or Boosting: these techniques impose a penalty on model complexity and favour simpler models than those estimated by "objective" ML. Interestingly, many of the Learning community's regularization techniques are actually a type of Bayesian model (Hooten and Hobbs 2015).

    The phenomenon is nicely illustrated with a coin toss: imagine you want to estimate the fairness of a coin (H vs T), but you flip it only 3 times. Let's say you observed TTT (a rare, but not impossible event). The Maximum Likelihood Estimate would be that the probability of a tail is 100%, and the coin is definitely not fair. Of course, we know this is a terrible estimate and a stupid example, because we have solid prior information about the behaviour of coins. However, the sad fact is that many dolphin mark-recapture researchers are running similarly stupid experiments, under small sample sizes and overly complex Frequentist models which estimate 100% detection probability, 100% survival, or 0% movement.

    In contrast, a little bit of skepticism encoded in a Bayesian prior (like Beta(1,1)) can temper our estimates. For our coin example, the posterior would say that the probability of a tail is just 80%, given the observed sequence TTT. That's still not the true 0.5, but it is a better prediction than the frequentist's 100%.
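
    A quick back-of-the-envelope check of that number in R, assuming a Beta(1,1) prior on the probability of a tail and the observed sequence TTT:

    tails <- 3; flips <- 3
    a <- 1 + tails                      # posterior is Beta(4, 1)
    b <- 1 + (flips - tails)
    c(MLE = tails/flips,                # 1.00 -- "the coin is definitely unfair"
      posterior.mean = a/(a + b))       # 0.80 -- tempered by the prior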

    Abstract

    We present a Hierarchical Bayesian version of Pollock's Closed Robust Design for studying the survival, temporary-migration, and abundance of marked animals. Through simulations and analyses of a bottlenose dolphin photo-identification dataset, we compare several estimation frameworks, including Maximum Likelihood estimation (ML), model-averaging by AICc, as well as Bayesian and Hierarchical Bayesian (HB) procedures. Our results demonstrate a number of advantages of the Bayesian framework over other popular methods. First, for simple fixed-effect models, we show the near-equivalence of Bayesian and ML point-estimates and confidence/credibility intervals. Second, we demonstrate how there is an inherent correlation among temporary-migration and survival parameter estimates in the PCRD, and while this can lead to serious convergence issues and singularities among MLEs, we show that the Bayesian estimates were more reliable. Third, we demonstrate that a Hierarchical Bayesian model with carefully thought-out hyperpriors, can lead to similar parameter estimates and conclusions as multi-model inference by AICc model-averaging. This latter point is especially interesting for mark-recapture practitioners, for whom model-uncertainty and multi-model inference have become a major preoccupation. Lastly, we extend the Hierarchical Bayesian PCRD to include full-capture histories (i.e., by modelling a recruitment process) and individual-level heterogeneity in detection probabilities, which can have important consequences for the range of phenomena studied by the PCRD, as well as lead to large differences in abundance estimates. For example, we estimate 8%-24% more bottlenose dolphins in the western gulf of Shark Bay than previously estimated by ML and AICc-based model-averaging. Other important extensions are discussed. Our Bayesian PCRD models are written in the BUGS-like JAGS language for easy dissemination and customization by the community of capture-mark-recapture practitioners.


    Import into Reference Manager


    Copy the following Bibtex into your favourite Reference Manager.

    @article{rankin_full-capture_2016, title = {A full-capture {Hierarchical} {Bayesian} model of {Pollock}'s {Closed} {Robust} {Design} and application to dolphins}, volume = {3}, url = {http://journal.frontiersin.org/article/10.3389/fmars.2016.00025}, doi = {10.3389/fmars.2016.00025}, abstract = {We present a Hierarchical Bayesian version of Pollock's Closed Robust Design for studying the survival, temporary migration, and abundance of marked animals. Through simulations and analyses of a bottlenose dolphin photo-identification dataset, we compare several estimation frameworks, including Maximum Likelihood estimation (ML), model-averaging by AICc, as well as Bayesian and Hierarchical Bayesian (HB) procedures. Our results demonstrate a number of advantages of the Bayesian framework over other popular methods. First, for simple fixed-effect models, we show the near-equivalence of Bayesian and ML point-estimates and confidence/credibility intervals. Second, we demonstrate how there is an inherent correlation among temporary migration and survival parameter estimates in the PCRD, and while this can lead to serious convergence issues and singularities among MLEs, we show that the Bayesian estimates were more reliable. Third, we demonstrate that a Hierarchical Bayesian model with carefully thought-out hyperpriors, can lead to similar parameter estimates and conclusions as multi-model inference by AICc model-averaging. This latter point is especially interesting for mark-recapture practitioners, for whom model-uncertainty and multi-model inference have become a major preoccupation. Lastly, we extend the Hierarchical Bayesian PCRD to include full-capture histories (i.e., by modelling a recruitment process) and individual-level heterogeneity in detection probabilities, which can have important consequences for the range of phenomena studied by the PCRD, as well as lead to large differences in abundance estimates. For example, we estimate 8\%-24\% more bottlenose dolphins in the western gulf of Shark Bay than previously estimated by ML and AICc-based model-averaging. Other important extensions are discussed. Our Bayesian PCRD models are written in the BUGS-like JAGS language for easy dissemination and customization by the community of capture-mark-recapture practitioners.}, number = {25}, journal = {Frontiers in Marine Science}, author = {Rankin, Robert W. and Nicholson, Krista E. and Allen, Simon J. and Krützen, Michael and Bejder, Lars and Pollock, Kenneth H.}, year = {2016} }

    Friday, 26 February 2016

    Hierarchical Bayesian: Automatic Occam's Razor

    There is a crisis of "reproducibility" right now in science (Prinz et al 2011; Simmons et al 2011; Begley et al 2012): conservative estimates are that more than 50% of scientific articles are false and cannot be reproduced. Bad estimates are rampant, and false-positives are the norm! This new 'digital dark age' has many culprits. One contributor is, I think, a misuse and misunderstanding of the dominant statistical paradigm called "Frequentism". Frequentism is a great intellectual achievement, with such lofty ideals as 'objectivity', a.k.a. letting the data speak for itself. Let's not inject our own hopes and desires into our scientific research. Sounds great, doesn't it? Why would we want otherwise?

    Well, for one thing, researchers (especially dolphin researchers) rarely have enough data to use Frequentist tools. Thus, we violate the core principle of Frequentism, from which its name is derived:

    In frequentism, it is the observed data that is the random variable. Had we repeated the experiment in a parallel universe, we would get different data. We don't care about the actual data values (like individual dolphin's measurements); instead, we care about the true population-level process which gives rise to this data. We try to make inferences about this true process by sampling from the data distribution, and then try to generalize over the entire data distribution.

    The core principle is that the Frequentist is trying to use a small sample to characterize the entire data distribution. This requires a lot of data. And when we lack such data (because, heck, dolphins are difficult to find!), Frequentist estimates start to take on ridiculous values. For example, I can't tell you how many times a dolphin researcher using a Mark-Recapture model has estimated a survival of 100%. I am serious! There is a big problem with our supposedly 'objective' statistics when they lead us to believe that dolphins are immortal. But here is the catch: my knowledge that dolphins age and die is a type of "prior information", and to use this prior biological knowledge leads us to another rival paradigm called Bayesianism.

    I won't get into the philosophy of Bayesianism (partially because I don't quite understand it). The best I can do is imagine a Bayesian as a poker player: they don't consider the data in their hand (e.g., their current cards) to be random: they consider it a given, and they want to make the best predictions possible about the next cards to be drawn, conditional on the evidence in their hand. Seems reasonable enough, but it turns out this is in direct opposition to Frequentism.

    Parsimony and Hierarchical Bayes


    But, one nice feature about Bayesianism is that we can encode into our statistical models an old ideal from Occam: "parsimony". Basically, we say "Don't tell me a dolphin is immortal unless you have a lot of data!" Technically, we can use "hyper-prior" distributions on parameters like survival to force simplicity in lieu of strong evidence, but allow for more complex models if there is a lot of data and larger sample size. This is demonstrated nicely in the following animation.


    Look at the red Maximum Likelihood Estimates on the right (a Frequentist tool). They are estimating a probability p from a series of 0-1 outcomes (like the number of dolphins captured from N available), and this is done in parallel for 10 different "groups" (like 10 independent experiments). Notice that the ML Estimates stay the same over a wide range of data/evidence (even though the "true" mean is around 0.5), and they vary from p=0.2 to p=1. However, the Hierarchical Bayesian posteriors (in purple) behave very differently. When the sample size is small (5 observations per group), each group is shrunk to the overall mean at 0.5. It is only as the sample size increases that each group is allowed to take on its own individual value; and eventually, with a lot of data, each group's posterior expectation becomes nearly the same as the MLE.

    This is a type of "model parsimony": with only a little bit of data, our model is simple (all groups have the same estimate, and the estimate is the group mean at 0.5). With more data, our model becomes more complex (all groups share a mean, but they can also have individual variation, each with their own parameter p).
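
    The same partial-pooling behaviour can be reproduced with a small, self-contained R/JAGS sketch (my own toy beta-binomial example using the "rjags" package; this is not the code behind the animation): with only 5 observations per group, the hierarchical posterior means are pulled towards the overall mean, while the per-group MLEs scatter widely.

    library(rjags)
    set.seed(1)
    G <- 10; n <- 5                                    # 10 groups, 5 observations each
    p.true <- rbeta(G, 10, 10)                         # true probabilities, centred near 0.5
    y <- rbinom(G, n, p.true)
    mle <- y/n                                         # per-group MLEs (can hit 0 or 1)
    hb.model <- "
    model{
      mu ~ dbeta(1,1)                        # shared mean
      kappa ~ dgamma(1, 0.1)                 # shared concentration (hyperprior)
      for(g in 1:G){
        p[g] ~ dbeta(mu*kappa, (1-mu)*kappa) # group-level probabilities, partially pooled
        y[g] ~ dbin(p[g], n)
      }
    }"
    model.file <- tempfile(fileext = ".jags"); writeLines(hb.model, model.file)
    jm <- jags.model(model.file, data = list(y = y, n = n, G = G), n.chains = 2, quiet = TRUE)
    update(jm, 2000)                                   # burn-in
    post <- coda.samples(jm, "p", n.iter = 5000)
    round(cbind(MLE = mle, HB.posterior.mean = colMeans(as.matrix(post))), 2)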

    The abuse of MLE is nicely illustrated with the following experiment: imagine you want to estimate the fairness of a coin, but flip it only 3 times. Let's say you observed TTT (a rare event, but not impossible). The Maximum Likelihood Estimate would be that the probability of a tail is 100%, and the coin is definitely not fair. Of course, we know this is a terrible estimate and a stupid example, because we have solid prior information about the behaviour of coins. However, the sad fact is that many dolphin mark-recapture researchers are running similarly stupid experiments, under small sample sizes and overly complex Frequentist models which estimate 100% detection probability or 100% survival or 0% movement. These terrible estimates are polluting our scientific literature.

    The MLEs are indeed the most 'objective', but they make bad predictions for new data and give us ridiculous results under low sample sizes. In contrast, a little bit of skepticism encoded in a Bayesian prior (like Beta(1,1)) can temper our estimates. Returning to the coin-flipping example, the posterior would say that the probability of a tail is just 80%, given the observed sequence TTT. That's still not the true 0.5, but it is a better prediction than the frequentist's 100%.

    Unified Under Hierarchical Bayes


    The poor performance of MLE was noticed by non-statisticians a long time ago in the Machine Learning and predictive-analytics communities. They obsessed over "regularization": finding ways to constrain the complexity of models and deliberately bias the estimates towards simplicity. Techniques like the LASSO, Ridge Regression, the Elastic Net, Boosting, AIC optimization, Bayesian model averaging, etc., were devised to make better predictions on new data, rather than to satisfy philosophical notions of probability and objectivity.

    The eerie thing is that most of these techniques have a Bayesian interpretation. The famous Lasso is just a Bayesian model with a Double Exponential prior on the regression coefficients. The idea is explored in an accessible article by Hooten and Hobbs 2015.

    Subjective Pseudo Bayesian and the Degradation of Science


    If ecologists are willing to inject a little bit of bias into their analyses in order to get better predictions and sensible results, then what of objectivity? What is to stop one from 'cooking the books'? Talking about cavalier 'pseudo'-Bayesian analysis, the Bayesian heavy-weight James Berger wrote in 2004:
    Statistics owes its central presence in science and life to the facts that (i) it is enormously useful for prediction; (ii) it is viewed as providing an objective validation for science. Discarding the latter would lead to a considerable downgrading of the value placed on statistics.

    At the time of his warning, few non-statisticians were using subjective Bayesian models, and even then, they seemed to want "objective" reference priors which conveyed no prior information and yielded nearly identical results to Frequentist models. It is a bit ironic that the opposite seems to have happened: the crisis of reproducibility was unveiled under the watch of the so-called objective school of statistics, and not because of subjectively biased Bayesian models. Now that the tables have turned, and more and more researchers are turning to Bayesian models, Berger's 2004 warning sits, waiting to be realized. The solution is probably for everyone to adopt a prediction mentality. Dolphin ecologists have already sort of adopted this practice by using the AIC for model selection, but have yet to grapple with its philosophical underpinnings, or rival ideas.

    Hifime UH1 DAC under Linux and Jack

    The Hifime UH1 USB DAC and Headphone Amp combo has some great features like:
    - reference-quality DAC chip: Texas Instruments PCM5102 (a flagship chip used in other >$1000 DACs)
    - sample rates of 44.1, 48, 88.2, 96, 176.4, 192 and 384 kHz, and bit depths of 16, 24 and 32 bits
    - optional headphone amplifier (TPA6120, about which I couldn't find much information), as well as a direct line-out straight from the DAC itself. This is a great feature for someone who may have to drive high-impedance headphones, but still wants the flexibility of using other neutral amplifiers
    - asynchronous USB transfer with two TCXOs as master clock
    - under $200 USD, ships from Hong Kong (not from the USA-based HifimeDIY)

    What's really nice is that Hifime explicitly mentions support for Linux (without drivers!), a major selling point for me as I don't want to play around with kernel patches, etc.

    Using with GNU/Linux JACK, Pulseaudio, and ALSA.
    The DAC shows up as HiFimeDIY SA9227 USB Audio. For most Linux distributions, one will have to contend with Pulseaudio to get any sound out of the DAC. The easiest way is to use 'pavucontrol' (a lightweight Pulseaudio GUI) to explicitly change the 'Audio Interfaces'.

    Unfortunately, Pulseaudio automatically down-samples audio. So, for bit-perfect audio, one needs to use a music player with ALSA support. I use gmusicbrowser, which is highly customizable (reminiscent of my only Win7 program, Foobar2000). Click Main (menu) -> Settings -> Audio (tab) -> Output device: set to "Alsa"; click "advanced options" and enter "hw:Audio" as the Alsa option. Voilà! Enjoy audio at 192000 Hz and 24 bits!

    - See more at: http://hifimediy.com/hifime-uh1

    Tuesday, 23 February 2016

    Bayesian Mark-Recapture: Parsimony and Pollock's Closed Robust Design

    At the Biennial Marine Mammal Conference 2016, I presented a Hierarchical Bayesian version of Pollock's Closed Robust Design (see presentation here). This was motivated by some bad behaviour of Program MARK in estimating detection probabilities and migration parameters for "sparse data" -- I say sparse, but for my sample sizes with dolphins, failure is all too common. In both real data and an extensive simulation exercise, I found that MARK was highly prone to "boundary-level" MLEs for survival, migration, and sometimes detection probabilities, leading to erratic abundance estimates.

    Interestingly, mildly weak Bayesian priors (Beta(1,1)) help smooth over this bad behaviour ("objective" priors do not).

    The image to the right shows simulations of the PCRD, comparing Maximum Likelihood estimation versus a Bayesian model... notice the infinite CIs and boundary-level MLEs from Program MARK! (See the full article link below.)

    Hierarchical Bayesian PCRD
    The world is going Bayesian, but few people understand why. Here, the true power of Bayesian inference is revealed in a hierarchical specification for the survival, detection and migration parameters: in particular, we shrink between "time-invariant theta(.)" models and "time-varying theta(t)" models. I found that applying a particular type of shrinkage-inducing hyperprior, called the "Scaled Half Student-t distribution", led to somewhat similar estimates as "model-averaging" over many MLE-based fixed-effect models (run in MARK).

    Hierarchical Bayes provides a type of automatic "Occam's Razor" (Berger 2004): when the data are sparse or weak, the hyperprior shrinks the model to a simple model, but when there is a lot of evidence in the data or a large sample size, the hyperprior is easily overwhelmed and the model takes on a lot of complexity. Reassuringly, the final estimates seem insensitive to the exact hyper-parameter values, as shown in the following animation...



    I decided to upload an R & JAGS tutorial on github: https://github.com/faraway1nspace/PCRD_JAGS_demo. Users can run simple fixed-effects Bayesian PCRD models, or they can run a Hierarchical Bayesian version that includes: i) individual heterogeneity in detection probabilities; ii) a recruitment process; and iii) a hyperprior that applies a type of "multimodel" inference. All things seem to unify in HB.

    JAGS code: Fixed effects Bayesian PCRD
    Here is JAGS code to run the simple fixed-effects model of the PCRD: phi(.)gamma'(.)gamma''(t)p(t,s). This is just for interested readers; serious users should consult the github demonstration. This implements the PCRD as a Bayesian HMM. Anyone who uses this JAGS code should cite: "Rankin RW, Nicholson K, Allen SJ, Krützen M, Bejder L, Pollock KH. 2016. A full-capture Hierarchical Bayesian model of Pollock's Closed Robust Design and application to dolphins. Frontiers in Marine Science, in press. DOI: 10.3389/fmars.2016.00025". The article is available here. A sketch of how to call this model from R follows the code block.

    model{
    # priors
    phi ~ dbeta(1,1) # apparent survival probability
    gamma1 ~ dbeta(1.2,1.2) # temp migration: probability stays out conditional on being out
    # loop through primary periods: parameters for detection probability pd_t, and gamma2_t
    for(t_ in 1:T){ # T primary periods...
    pd_a[t_] ~ dgamma(3,2) # hyperprior on detect probs
    pd_b[t_] ~ dgamma(3,2) # hyperprior on detect probs
    for(tt_ in 1:T2[t_]){ # loop through secondary periods...
    pd[t_,tt_] ~ dbeta(pd_a[t_],pd_b[t_]) # secondary-period-level detection probs
    }
    gamma2[t_] ~ dbeta(1,1) # temp migration: prob migrates outside conditional on being inside
    # recruitment process from 'eigenvector decomposition'
    lambda[1,t_] <- (1-gamma1)/(gamma2[t_]-gamma1+1) # recruitment ratio, or long-term prob of being inside
    lambda[2,t_] <- 1-lambda[1,t_] #
    psi[t_] ~ dunif(0,1) # inclusion probability
    # trmat: transition matrix for Markovian latent-states
    # 1 =not yet in population;2=dead;3=offsite;4=onsite (only observable state)
    # transition are from the column --> rows
    # trmat[row,column,time] = [state at time=t_; state at time t_-1; time=t_]
    trmat[1,1,t_] <- 1-psi[t_] # excluded from pop
    trmat[2,1,t_] <- 0 # dead
    trmat[3,1,t_] <- psi[t_]*lambda[2,t_] # inclusion into pop, outside study area
    trmat[4,1,t_] <- psi[t_]*lambda[1,t_] # inclusion into pop, inside study area
    trmat[1,2,t_]<- 0
    trmat[2,2,t_]<- 1 # stay dead
    trmat[3,2,t_]<- 0
    trmat[4,2,t_]<- 0
    trmat[1,3,t_]<- 0
    trmat[2,3,t_]<- 1-phi # dies outside
    trmat[3,3,t_]<- gamma1*phi # stays outside | outside
    trmat[4,3,t_]<- (1-gamma1)*phi # reenters study area | outside
    trmat[1,4,t_]<- 0 #
    trmat[2,4,t_]<- 1-phi # dies inside
    trmat[3,4,t_]<- gamma2[t_]*phi # leaves study area | inside
    trmat[4,4,t_]<- (1-gamma2[t_])*phi # stays inside | inside
    } # t_
    # likelihood: loop through M individuals, both real and pseudo-individuals
    for (i in 1:M){
    #draw latent state at primary period 1:
    # ... by definition, everyone starts in z=1 (not-in-population) at time=0
    z[i,1]~ dcat(trmat[1:4,1,1]) # first z strictly excluded from pop
    # likelihood for first primary period
    for(tt_ in 1:T2[1]){ # loop through secondary periods
    # Bernoulli process, conditional on z=4, otherwise no observation
    y[i,tt_,1] ~ dbern(pd[1,tt_]*equals(z[i,1],4))
    }
    alive_i[i,1] <- step(z[i,1]-3) # count if i is alive or not
    Nin_i[i,1] <- equals(z[i,1],4) # count if i is within study area
    # loop through primary periods after 1st primary periods
    for(t_ in 2:T){
    # state process: draw z(t_) conditional on z(t_-1)
    z[i,t_] ~ dcat(trmat[1:4, z[i,t_-1] , t_])
    # likelihood: loop through secondary period observations
    for(tt_ in 1:T2[t_]){
    # Bernoulli observation error, conditional on z=4
    y[i,tt_,t_] ~ dbern(pd[t_,tt_]*equals(z[i,t_],4))
    }
    # tally population size
    alive_i[i,t_] <- step(z[i,t_]-3) # check alive or not
    Nin_i[i,t_] <- equals(z[i,t_],4) # count if i is within study area
    } # t_
    } # i
    # estimate population size per primary periods
    for(t_ in 1:T){
    alive[t_] <- sum(alive_i[,t_]) # number alive
    Nin[t_] <- sum(Nin_i[,t_]) # number in study area
    } # t_
    }
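
    To show how the model above might be called from R, here is a minimal, self-contained sketch using the "rjags" package and simulated data (the file name "pcrd_fixed.jags", the fake detection array, and the initial values below are all my own illustrative assumptions; the github demonstration contains the real workflow):

    library(rjags)
    set.seed(1)
    M <- 150; T <- 5; T2 <- rep(3, T)                    # augmented individuals; primary and secondary periods
    real <- c(rep(1, 100), rep(0, 50))                   # first 100 rows are 'real' animals, the rest are all-zero augmentation
    present <- matrix(rbinom(M*T, 1, 0.6), M, T) * real  # crude 'inside the study area' indicator
    y <- array(0, dim = c(M, max(T2), T))                # detections: individuals x secondary x primary
    for(t_ in 1:T) for(tt_ in 1:T2[t_]) y[, tt_, t_] <- rbinom(M, 1, 0.4) * present[, t_]
    # initialize the latent states consistently with the data: z=4 ('onsite') whenever detected, else 3 ('offsite')
    z.init <- matrix(3, M, T); z.init[apply(y, c(1,3), sum) > 0] <- 4
    jm <- jags.model("pcrd_fixed.jags",                  # the model block above, saved to a text file
                     data = list(y = y, M = M, T = T, T2 = T2),
                     inits = list(z = z.init), n.chains = 2)
    update(jm, 2000)                                     # burn-in
    post <- coda.samples(jm, c("Nin", "alive", "phi", "gamma1", "gamma2"), n.iter = 10000)
    summary(post)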


    JAGS code: Hierarchical Bayesian PCRD
    Here is JAGS code for a Hierarchical Bayesian model, phi(t)gamma'(t)gamma''(t)p(t,s). The hierarchical part uses 'hyperpriors' to deliberately shrink time-varying specifications, like phi(t), towards simpler time-invariant specifications, like phi(.). The result is a type of "model parsimony", with a similar type of smoothing as "model averaging". The following is just a visual demo for the casual reader; serious users should refer to the github page and download the full R code. Anyone who uses this JAGS code should cite: "Rankin RW, Nicholson K, Allen SJ, Krützen M, Bejder L, Pollock KH. 2016. A full-capture Hierarchical Bayesian model of Pollock's Closed Robust Design and application to dolphins. Frontiers in Marine Science, in press. DOI: 10.3389/fmars.2016.00025". The article is available here
    model{
    #hyperpriors: logit(theta_mu)~Normal and theta_sd ~ half-t
    phi.mu ~ dunif(pr.phiunif[1],pr.phiunif[2]) #mean survival with a Uniform prior
    sigma.phi ~ dt(0,pr.tauphi[1],pr.tauphi[2]) T(0,) #mean survival dispersion, half-t hyperprior
    g1.mu ~ dnorm(0,pr.g1mu) #mean gamma1, temp migration out | out
    sigma.g1~dt(0,pr.taug1[1],pr.taug1[2]) T(0,) #mean gamma1 dispersion, half-t hyperprior
    g2.mu ~ dnorm(0,pr.g2mu) #mean gamma2, temp migration out | in
    sigma.g2~dt(0,pr.taug2[1],pr.taug2[2]) T(0,) #mean gamma2 dispersion, half-t hyperprior
    pd.mu ~ dnorm(0,pr.pdmu) #mean detection prob, overall
    sigma.pdmu~dt(0,pr.taupdmu[1],pr.taupdmu[2]) T(0,) #primary period detection prob dispersion
    sigma.pd2nd~dt(0,pr.taupd2nd[1],pr.taupd2nd[2]) T(0,) #secondary periods detection prob dispersion
    sigma.eps ~ dt(0,pr.taueps[1],pr.taueps[2]) T(0,) #individual detection prob dispersion
    #time-variant parameters
    for(t_ in 1:T){ #loop through primary periods
    pd_mu[t_]~dnorm(pd.mu,pow(sigma.pdmu,-2)) #primary period mean detection prob (logit)
    lgamma1[t_]~dnorm(g1.mu,pow(sigma.g1,-2)) #prob migrate out|out (logit)
    gamma1[t_] <- ilogit(lgamma1[t_]) #prob migrate out|out (probability)
    lgamma2[t_]~dnorm(g2.mu,pow(sigma.g2,-2)) #prob migrate out|in (logit)
    gamma2[t_] <- ilogit(lgamma2[t_]) #prob migrate out|in (probability)
    #RECRUITMENT: psi is the 'inclusion probability' and lambda is the 'recruitment ratio'
    psi[t_]~dunif(0,1) #inclusion probability
    lambda[t_] <- (1-gamma1[t_])/(gamma2[t_]-gamma1[t_]+1) #long-term probability inside study area
    #NOTE, lambda could also be a random variable with a beta prior
    #secondary-period detection probabilities
    for(tt_ in 1:T2[t_]){ #loop through secondary periods
    pd[t_,tt_] ~ dnorm(pd_mu[t_],pow(sigma.pd2nd,-2))
    } #tt_
    } #t_
    #first state transition (pure nuisance; strictly from outside-pop to part of marked-pop)
    trmat0[1] <- (1-psi[1]) #remains not-yet-in-pop
    trmat0[2] <- 0
    trmat0[3] <- psi[1]*(1-lambda[1]) #inclusion into pop, goes outside study area
    trmat0[4] <- psi[1]*lambda[1] #inclusion into pop, goes inside study
    #state transitions (2:T)
    for(t_ in 1:(T-1)){
    lphi[t_]~dnorm(log(phi.mu/(1-phi.mu)), pow(sigma.phi,-2)) #survival prob (logit)
    phi[t_]<-ilogit(lphi[t_])
    #state transitions
    #trmat: transition matrix for Markovian latent-states
    #1 =not yet in population; 2=dead;3=offsite;4=onsite (only observable state)
    #transition are from the column --> rows
    #trmat[row,column,time] = [state at time=t_; state at time t_-1; time=t_]
    #notice that the primary periods are offset by 1 (because we already dealt with T=1)
    trmat[1,1,t_]<- 1-psi[t_+1] #excluded from pop
    trmat[2,1,t_] <- 0 #dead
    trmat[3,1,t_] <- psi[t_+1]*(1-lambda[t_+1]) #inclusion into pop,outside study
    trmat[4,1,t_] <- psi[t_+1]*lambda[t_+1] #inclusion into pop,inside study
    trmat[1,2,t_]<- 0
    trmat[2,2,t_]<- 1 #stay dead
    trmat[3,2,t_]<- 0
    trmat[4,2,t_]<- 0
    trmat[1,3,t_]<- 0
    trmat[2,3,t_]<- 1-phi[t_] #dies outside
    trmat[3,3,t_]<- gamma1[t_+1]*phi[t_] #stays outside | outside
    trmat[4,3,t_]<- (1-gamma1[t_+1])*phi[t_] #reenters study area | outside
    trmat[1,4,t_]<- 0 #
    trmat[2,4,t_]<- 1-phi[t_] #dies inside
    trmat[3,4,t_]<- gamma2[t_+1]*phi[t_] #leaves study area | inside
    trmat[4,4,t_]<- (1-gamma2[t_+1])*phi[t_] #stays inside | inside
    } #t_
    #loop through M individuals
    for (i in 1:M){
    #state transitions and likelihood for the first primary period
    z[i,1]~ dcat(trmat0) #z at time 0 is strictly 'not-yet-in-pop'
    alive_i[i,1] <- step(z[i,1]-3) #count if i is alive or not
    Nin_i[i,1] <- equals(z[i,1],4) #count if i is within study area
    eps_i[i] ~ dnorm(0,pow(sigma.eps,-2)) #random effects at individual levels
    #Observation error y[i,tt_,t_] ~ Bernoulli conditional on being inside z=4
    for(tt_ in 1:T2[1]){ #loop through secondary periods
    y[i,tt_,1] ~ dbern(equals(z[i,1],4)/(1+exp(-pd[1,tt_]-eps_i[i]))) #inverse-logit transform
    }
    #state transition and likelihood for primary periods 2:T
    for(t_ in 2:T){
    #State process: draw z(t_) conditional on z(t_-1)
    z[i,t_] ~ dcat(trmat[1:4, z[i,t_-1] , t_-1])
    #Observation error y[i,tt_,t_] ~ Bernoulli condition on being inside z=4
    for(tt_ in 1:T2[t_]){ #loop through secondary periods
    y[i,tt_,t_] ~ dbern(equals(z[i,t_],4)/(1+exp(-pd[t_,tt_]-eps_i[i]))) #inverse-logit transform
    }
    #check whether i individual is alive and inside
    alive_i[i,t_] <- step(z[i,t_]-3) #check alive
    Nin_i[i,t_] <- equals(z[i,t_],4) #count if i is within study area
    } #t_
    } #i
    #tally population size
    for(t_ in 1:T){
    alive[t_] <- sum(alive_i[,t_]) #number alive
    Nin[t_] <- sum(Nin_i[,t_]) #number in study area
    } #t_
    } #end model