Thursday, 4 October 2018

NLP and LSTM for Classifying Customer Complaints about Financial Products


Some of this work is available as python code at https://github.com/faraway1nspace/NLP_topic_embeddings

Topic Embeddings

One of the most interesting concepts to arise out of machine learning in the last decade has been the concept of "Embeddings" (see, for example, "word2vec" developed by Tomas Mikolov at Google). The general idea is to take categories that would otherwise be discrete, independent objects and "embed" them into a low-dimensional vector space. Each object/category is no longer its own discrete entity, but a vector in some space. In the word2vec example, each English word is a vector of size 300, such that variation along each of the 300 dimensions has inferred semantic importance (e.g., "girl" vs. "boy" vary along a dimension, just as "queen" vs. "king" do). Another example is the famous Instacart model, where the analysts sought to embed 10 million grocery items into 10 dimensions. For my Insight Data project, I sought an embedding for customer-feedback topics from a proprietary customer-survey dataset (e.g., to organize the various things customers complain about).

From the perspective of a data analyst, the immediate utility of the embedding approach is as an alternative means of vectorizing categorical variables, and, in particular, finding a vectorization that is low-dimensional. Consider the traditional route of vectorizing categorical data via "one-hot coding": each category becomes a binary dummy variable, yielding a matrix with K-1 columns. In contrast, the embedding approach uses deep learning to vectorize each category into a continuous latent vector, where the embedding dimension D is typically much lower than the number of categories (D < K). In my case, I was dealing with about 200 consumer-feedback categories (and growing exponentially!), while my embedding space had just 6 dimensions (see Fig 1 below).
Fig1: Example embedding: various consumer-complaint categories vectorized into 6 dimensions.
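To make the contrast concrete, here is a minimal sketch comparing one-hot coding with a Keras Embedding layer; the dimensions and category indices are illustrative, not from the proprietary data.

import numpy as np
import keras

K = 200   # number of feedback categories (illustrative)
D = 6     # embedding dimension

# One-hot coding: each category becomes a K-length indicator vector
one_hot = np.eye(K)[[3, 17, 42]]   # three example category indices
print(one_hot.shape)               # (3, 200)

# Embedding: each category index is mapped to a dense D-dimensional vector,
# whose values are learned jointly with the rest of the network
cat_input = keras.layers.Input(shape=(1,), dtype='int32')
cat_embed = keras.layers.Embedding(input_dim=K, output_dim=D)(cat_input)
toy_model = keras.models.Model(cat_input, cat_embed)
print(toy_model.predict(np.array([[3], [17], [42]])).shape)   # (3, 1, 6)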

Why all the fuss? In each discipline, there is a flurry of creative uses for the embedding technology. For my use-case, I was interested in embeddings for two applications:
  • finding redundant categories; and
  • grouping categories into meaningful super-groups (clustering).
For the first goal, redundant/related categories can be found according to the following intuition: vectors that are close in the embedding space are strongly related, or even redundant. For the second, we can group these putatively different categories into super-categories through clustering algorithms, like K-means.
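A hypothetical sketch of that second application is below; the embedding matrix and cluster count are placeholders, not the real learned embeddings.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
category_vectors = rng.normal(size=(200, 6))   # placeholder for a learned (K x 6) embedding matrix

# Redundancy check: pairs of categories whose vectors are nearly parallel are merge candidates
similarity = cosine_similarity(category_vectors)

# Super-categories: K-means on the embedding space
super_group = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(category_vectors)
print(super_group[:10])   # a cluster label per category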

I will walk through a particular problem which invites an embedding solution, and another which perhaps does not benefit from embeddings.

Example 1: Multi-class Classification

The following example comes from a proprietary dataset, so the context and results are kept general, and the code cannot be shared. Nonetheless, the results are interesting for two reasons: the use of embeddings, and the pivotal role of the multi-class loss.

Goal

While studying ML & data science with Insight Data 2018, I played consultant for a local startup company in Toronto. They had a large data-set of customer feedback in text form. Each row of text was a customer complaint or recommendation regarding a large variety of products, companies and shopping experiences. They employed an army of humans to go through each text feedback and extract customer sentiments (-,0,+). The problem was that these categories/labels were increasing exponentially in time: the number of labels was growing from 50 to 200 (to perhaps 1000?) in a matter of months. How could I scale up this classification & sentiment analysis?

Fig 2: Left: exponential increase in the number of labels assigned to the customer-feedback text. Right: extreme skew in the distribution of labels in the data: most are rare (perhaps redundant?), while a few are well represented.


The scaling issue became very apparent after trying a generic 'go-to' model with a simple natural language processing (NLP) component and a deep-learning model. Using Python, TensorFlow, and the Keras API, the 'go-to' model had the following pipeline: pre-process the text (stemming words, removing stopwords, etc.); vectorize the words of the text with a word-embedding (like word2vec, but trained within the context of this problem); run the word-vectors through a recurrent neural network (e.g., an LSTM); finally, use multi-class classification (and conditional sentiment analysis) to assign category-labels to each customer review.
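A minimal sketch of that baseline pipeline follows, assuming the cleaned text lives in a list of strings called train_texts (an assumed name) and targeting only the top-20 categories with a single softmax; all hyper-parameters are illustrative.

import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_features = 3000   # vocabulary size
maxlen = 198          # padded sequence length
n_classes = 20        # the most-abundant categories

# tokenize and pad the cleaned text
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(train_texts)   # train_texts: list of pre-processed strings (assumed)
X_train = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=maxlen)

# word-embedding -> LSTM -> softmax over the category labels
baseline = keras.models.Sequential([
    keras.layers.Embedding(input_dim=max_features, output_dim=194, input_length=maxlen),
    keras.layers.LSTM(100),
    keras.layers.Dense(n_classes, activation='softmax'),
])
baseline.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])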

Problem

The go-to model performed well for the 20 most abundant categories, but it didn't scale well to the entire dataset of >200 categories. My worry was that the exponential increase in the number of categories meant that there was considerable redundancy in the ever-increasing quantity of (putative) categories.

Solution: Embedding

Inspired by Instacart and the lessons at fast.ai, I thought it might be helpful to try to reduce the number of "effective" categories through embeddings. Basically, the algorithm learns an embedding space, different regions of the space have "meaning", and each category is merely a point in this space. Related categories like "free samples" and "sign-up incentives" are close together in this space, while categories like "website layout" and "shipping costs" occupy a different part of the space.

The neural architecture is shown in figure 3. And the keras API looks like:

# INPUTS and OUTPUTS
# X_train: the vectorized text data (training set)
# Y_train: a 3D tensor of conditional targets:
# ... : the rows represent observations
# ... : the cols represent categories
# ... : the slices represent the sentiments per category [NA,-,0,+]
# X_topics_train : just a vector of integers 1:N_labels
# W_train : sample-weights for unbalanced design

embed_dim_lstm = 194 # word-embedding dimensions
lstm_OutputDim = 100 # output dimensions of the LSTM
embed_dim_topic = 6 # category embedding dimensions
batch_size = 128 # mini-batch size
hidden_nodes_final = (np.linspace(lstm_OutputDim,Ymultclass.shape[1],3).round().astype(int))[1] # final hidden-layer size: midpoint between the LSTM output size and the width of the multi-class target matrix

lstm_input_layer = keras.layers.Input(shape=(X_train.shape[1],), dtype='int32', name='lstm_input',) # lstm input
lstm_embed_layer = keras.layers.Embedding(input_dim=max_features, output_dim=embed_dim_lstm, input_length=X_train.shape[1])(lstm_input_layer) # input_dim = vocab size,
lstm_output = keras.layers.LSTM(lstm_OutputDim)(lstm_embed_layer) # the output has dimension (None, 100)
# need to Repeat the LSTM output from a 2D matrix to a 3D tensor in order to concatenate to the category-embedding
lstm_reshape = keras.layers.RepeatVector(nTopics)(lstm_output)
topic_input_layer = keras.layers.Input(shape=(X_topics_train.shape[1],), dtype='int32', name='topic_input') # input the topics
topic_embed_layer = keras.layers.Embedding(input_dim=nTopics, output_dim = embed_dim_topic, input_length=X_topics_train.shape[1])(topic_input_layer) # topic embedding
x = keras.layers.concatenate([lstm_reshape,topic_embed_layer],axis=2) # concatenate along axis 2, the feature axis: (None, nTopics, 100) + (None, nTopics, 6) -> (None, nTopics, 106)
hidden_layer = keras.layers.Dense(hidden_nodes_final, activation='relu')(x)
#x = keras.layers.BatchNormalization()(x)
main_output = keras.layers.Dense(4, activation='softmax', name='main_output')(hidden_layer) # main output for categories
model = keras.models.Model(inputs=[lstm_input_layer,topic_input_layer], outputs=[main_output])
model.compile(loss = "categorical_crossentropy", optimizer='adam',metrics = ['accuracy'])
print(model.summary())
model.fit({'lstm_input': X_train, 'topic_input': X_topics_train }, # inputs
          {'main_output': Y_train}, # targets
          epochs = 50,
          batch_size=batch_size, verbose = 2,
          validation_data=([X_val, X_topics_val], Y_val), # validation data
          sample_weight = W_train) # sample-weights for unbalanced design

After compiling the model, the keras API returns the following list of layers and their shape.

______________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
lstm_input (InputLayer)         (None, 198)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 198, 194)     582000      lstm_input[0][0]                 
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 100)          118000      embedding_1[0][0]                
__________________________________________________________________________________________________
topic_input (InputLayer)        (None, 145)          0                                            
__________________________________________________________________________________________________
repeat_vector_1 (RepeatVector)  (None, 145, 100)     0           lstm_1[0][0]                     
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 145, 6)       870         topic_input[0][0]                
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 145, 106)     0           repeat_vector_1[0][0]            
                                                                 embedding_2[0][0]                
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 145, 122)     13054       concatenate_1[0][0]              
__________________________________________________________________________________________________
main_output (Dense)             (None, 145, 4)       492         dense_1[0][0]                    
==================================================================================================
Total params: 714,416
Trainable params: 714,416
Non-trainable params: 0
__________________________________________________________________________________________________
None

Figure 3: Diagram of the above Keras API model. The top arm is a generic text-classification model (word-tokens -> word embedding -> LSTM), while the bottom arm includes the "category embeddings". The outputs of the LSTM and the category-embeddings are concatenated before running through a final Dense layer. The final targets are multi-class labels (x-axis) and their conditional sentiments (NA,-,0,+) along the z-axis

According to the out-of-sample AUC statistics, the model performed quite well (>0.95). What is especially interesting is the learned relationships among the different categories, as visualized in figure 4 (below). The fact that related categories cluster together in space demonstrates that the model does indeed learn meaningful embeddings of the categories.

Fig 4: Consumer-feedback categories and their learned positions in a low-dimensional embedding space (here, visualized in 2D with t-SNE).
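For reference, a figure like this can be made by pulling the learned weights out of the topic-embedding layer and running them through t-SNE. The sketch below assumes the layer kept the auto-generated name 'embedding_2' shown in the model summary above, and that topic_names (an assumed variable) holds the category labels.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

topic_vectors = model.get_layer('embedding_2').get_weights()[0]   # shape (nTopics, 6)
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(topic_vectors)

plt.scatter(coords[:, 0], coords[:, 1], s=10)
for i, name in enumerate(topic_names):   # topic_names: list of category labels (assumed)
    plt.annotate(name, coords[i], fontsize=6)
plt.show()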
There are a couple interesting things to note about the neural architecture:

  1. the 'generic' LSTM model is still present in this embedding model: it merely represents the chain of layers: `lstm_input_layer`, `lstm_embed_layer`, `lstm_output` (representing the word-embedding and the LSTM layers).
  2. the category-embedding goes through the chain: `topic_input_layer`,  `topic_embed_layer`
  3. in order to combine outputs of the LSTM chain with the outputs from the category-embedding chain, I had to repeat the outputs of the LSTM so that it had the same dimensions as the category-embedding in the X and Y axes (i.e., concatenating along the Z axis): `lstm_reshape = keras.layers.RepeatVector(nTopics)(lstm_output)`

Loss function: Importance of the Multi-class Loss


Notice also that both the final layer (`hidden_layer`) and the Y targets are 3-dimensional. This has a very important implication for the `categorical_crossentropy` loss function: the unit-vector which sums to 1 is taken along the z-axis; in the model specification, this z-axis represents the 4 different sentiment states for a particular category [NA = not-present; negative; neutral; positive]. The columns are just the different possible categories, each with its own sentiment unit-vector heading off in the z-axis direction. This is summarized in figure 5. Two important points follow:

  • First, the entire analysis is "multi-class" classification, such that each piece of text (a row in the matrix) can have multiple labels (which makes sense, because people can and do talk about different things in their customer feedback).
  • Second, the rows (different texts) and columns (different categories) are essentially treated as independent observations, according to the loss function!

Figure 5: structure of the data-targets and the loss function. In Keras, if the targets are 3D (here, text on the y-axis, categories on the x-axis, and conditional sentiments on the z-axis) and you choose a "categorical_crossentropy" loss function, then the distribution is assumed multinomial over the z-axis. This means the probability of each element in the z-axis sums to one (sentiments), while the columns and rows are all treated as independent.
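A small numpy check of this point (shapes are illustrative): the cross-entropy is computed over the sentiment axis for every (text, category) cell, and those cell-wise terms are simply averaged, so each cell contributes independently to the loss.

import numpy as np

n_texts, n_categories, n_sentiments = 3, 5, 4   # sentiments are [NA, -, 0, +]
rng = np.random.default_rng(1)

# one-hot sentiment target for every (text, category) cell
Y = np.zeros((n_texts, n_categories, n_sentiments))
Y[np.arange(n_texts)[:, None], np.arange(n_categories), rng.integers(0, 4, (n_texts, n_categories))] = 1

# predicted probabilities: softmax over the sentiment (last) axis
logits = rng.normal(size=Y.shape)
P = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# per-cell cross-entropy; Keras averages these terms, so every cell is an independent multinomial
cell_loss = -(Y * np.log(P)).sum(axis=-1)   # shape (n_texts, n_categories)
print(cell_loss.shape, cell_loss.mean())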


Therefore, we see why the embeddings may help the predictive performance of the model, and why the model is able to learn meaningful embeddings of the categories: were the model not to learn the associations among the categories (via the embeddings), then each category would contribute an independent amount to the loss function, irrespective of all the other categories, per sentence. With the embeddings, the model is learning a structure that helps it predict which columns/categories are related.

The specification of this loss function, such that each category is independent in the eyes of the loss, is the crux of the embeddings' usefulness. In the next section, I will do a seemingly similar analysis with a different loss function, and with very different results.

Example 2: Multinomial Classification

Due to the secretive nature of the company I was working with, I wanted to show an alternative analysis using publicly available data for text classification. I settled on the USA Consumer Financial Protection Bureau's financial complaint database.


Monday, 25 September 2017

Small Workshop on JAGS and Bayesian Inference

Southern Cross University

Download JAGS files:
- for Windows: R_jags_tutorial_MSDOS.R
- for Mac/Linux: R_jags_tutorial.R

Current version of presentation (PDF)
- Bayesian JAGS tutorial presentation

Wednesday, 6 September 2017

Workshop on Bayesian Capture-Recapture: 2017 Conference for the Society of Marine Mammology, Halifax NS Canada

My colleague Krista Nicholson and I will be running a full-day practical workshop on Bayesian Capture-Recapture. The workshop will be held at the 2017 Conference for the Society of Marine Mammology in Halifax, NS, Canada, on Sunday, October 29, 2017.

More information about registration and the price can be found here. The workshop is only available for conference attendees.

Description. This workshop is run in response to recent advances in CR modelling showing that there are a number of advantages to implementing Bayesian CR models over other popular methods. These advantages will be highlighted and discussed throughout the workshop.

The workshop will include an introduction to Bayesian theory and beginner/intermediate tutorials on R and JAGS. JAGS is a popular and flexible Bayesian scripting language that can code CR models and much more! The workshop is suitable for students, researchers, professors, veteran Program MARK users, and anyone who is already familiar with CR and would like to learn how to implement Bayesian CR models. Intermediate familiarity with R and CR is expected.

Outline of the workshop: i) Bayesian philosophy; ii) introduction to the BUGS/JAGS language; iii-v) practical tutorials to implement common CR models for cetaceans (POPAN, PCRD, individual-covariate models); vi) Bayesian model-selection; vii) Hierarchical Bayesian models; viii) open-session to be determined by pre-workshop participant outreach (spatial capture-recapture, cloud-computing and high-performance computing, multi-event).

You can register for this workshop online at the conference website http://www.smmconference.org/.

We welcome participants to express their interest in any particular topics to be covered or discussed during the workshop and invite questions on and submissions of difficult data sets, which we may be able to work on and discuss in the workshop.

If you have any inquiries about the workshop or its content, please do not hesitate to contact Dr. Robert Rankin at robertw.rankin@gmail.com

Tuesday, 10 January 2017

A Primer on the Bias-Variance Trade-off in ecology

This blog post is originally from the Appendix of Rankin RW. 2016. EM and component-wise boosting for Hidden Markov Models: a machine-learning approach to capture-recapture. bioRxiv. doi: 10.1101/052266. Available from 10.1101/052266


I will use simulations to illustrate the "bias-variance trade-off" for ecologists, which is one of the most important ideas in statistical modelling of the past half-century. The goal is to show how the AICc and another model-selection/model-averaging technique called "boosting" each negotiate the trade-off in order to improve estimation; specifically, in order to minimize the expected error of estimating survival phi over T capture periods. The trade-off is fundamental to understanding the optimality of Frequentist shrinkage estimators and AIC model-selection. The illustrations are inspired by figure 6.5 in Murphy 2012, but are adapted to a type of capture-mark-recapture model called the Cormack-Jolly-Seber model.

The trade-off is an old idea without a citable origin (Geman et al. 1992 is often considered the definitive reference, but the phenomenon was clearly discussed as early as 1970 by Hoerl in the context of Ridge Regression, and likely earlier). Despite being an old and fundamental concept of statistical estimation, I have noticed that it is poorly understood among academics and government scientists. In particular, it is my experience that ecologists are unduly wedded to the idea of being unbiased (in estimation), such that when they are presented with visual and quantitative evidence about the optimality of biased shrinkage estimators, they recoil at the sight of systematic bias and ignore the crucial role of variance. Of course, bias is not desirable in and of itself, but so long as the bias goes to zero at a rate proportional to that of the variance, we may be able to improve our overall estimation performance by incurring a little bias.

In the following simulations, the goal is to minimize the expected error of estimating survival, as quantified by the Mean Square Error (MSE). It is a population-level, abstract quantity that can only be measured in simulations where we know the "true" process. It is Frequentist in the sense that we hope to minimize the error over all possible data-sets that one might sample from the true population Y. These multiple realizations are shown as grey lines in Figures 1 and 2. Of course, an analyst only has one dataset, and their goal is to get their estimates as close as possible to the truth.

Figure 1: Decomposing the error of estimation (MSE) into its bias and variance components. An estimation procedure will negotiate the bias and variance so to minimize the MSE. Top, a simulation of a true survival process (red line). Each grey line represents one dataset sampled from the population and an analyst's attempt to estimate survival using multi-model inference procedures, such as boosting. The dashed black line is the mean estimate over all 30 independent grey-lines. Middle, a visualization of the variance component, showing the variability of point-wise estimates due to randomness in the sampled data and a procedure's sensitivity to such differences. Bottom, a visualization of the bias: the expected difference between the truth and the procedure's estimates, over all realizations of the data.


The bias-variance trade-off arises from a classic decomposition of the expected error: MSE = (E[phi_hat - phi_true])^2 + Var[phi_hat] + constant, i.e., squared bias plus variance plus an irreducible term. Figure 1 also shows this decomposition. The first term is the (squared) expected difference between an estimate and the true value, i.e. the bias. This difference is visualized as the red polygon in Figure 1. In the same figure, the bias manifests as shrinkage from the true red line towards a flat global mean. Quantifying the bias requires knowledge of the truth phi_true, and is therefore inaccessible in real-life situations. The second term is the variance, and it does not depend on knowledge of the truth. Rather, it arises due to the vagaries of random sampling as well as the complexity of the estimation procedure: overly complex models which "over-fit" one dataset will vary wildly when fitted to a new dataset sampled from the same population. The variance can be visualized as the spread of the grey lines, or the green polygon in Figure 1.
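A toy numpy illustration of this decomposition (a simple binomial mean, not the CJS model): shrinking the unbiased estimate toward 0.5 adds a little bias but removes enough variance that, for this sample size, the overall MSE is lower.

import numpy as np

rng = np.random.default_rng(2)
true_phi = 0.6                 # the quantity being estimated
n, n_datasets = 10, 5000

# many realizations of the data; each row is one "dataset" (one grey line)
y = rng.binomial(1, true_phi, size=(n_datasets, n))

mle = y.mean(axis=1)               # unbiased estimator
shrunk = 0.7 * mle + 0.3 * 0.5     # shrinkage estimator, pulled toward 0.5

for name, est in [("MLE      ", mle), ("shrinkage", shrunk)]:
    bias2 = (est.mean() - true_phi) ** 2
    var = est.var()
    mse = ((est - true_phi) ** 2).mean()
    print(name, "bias^2 =", round(bias2, 4), "variance =", round(var, 4), "MSE =", round(mse, 4))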

The MSE decomposition has a naive meaning: in order to optimize our estimation performance, we should reduce the bias and/or the variance. Clearly, most ecologists see the value of tackling either of these two terms. But the nature of a trade-off has a more elusive importance: we cannot, in general, minimize both terms for a given sample size, and we may deliberately increase one term in order to decrease the other. Shrinkage estimators incur a little bias and have lower variance (i.e., the red polygon is bigger but the green polygon is smaller). This strategy results in much smaller MSE values than complex unbiased estimators. In contrast, the MLEs of the complex full-model are unbiased but typically have very high variance. This strategy is often worse at minimizing the MSE for small-to-moderate sample sizes.

The following simulations show how different statistical methods have different strategies in negotiating the bias-variance trade-off. Imagine an analyst who is confronted with four different methods to estimate survival. The first is estimation by Maximum Likelihood using the full-model p(t)phi(t). The second method is AICc model-selection, and the third is AICc model-averaging; both use the following fixed-effects models: p(.)phi(.), p(t)phi(.), p(.)phi(t), and p(t)phi(t) with obvious constraints on the last terms. The fourth method is called CJSboost, from statistical boosting. It uses submodels called "base-learners" equivalent to the aforementioned fixed-effect models (but without the previous constraints). The AICc-methods should theoretically do best because they are fundamentally motivated by trying to minimize an objective function that is very closely related to MSE called the KL-loss (see Akaike 1979 and Akaike 1998). Likewise, CJSboost is trying to minimize a related generalization-error called the negative Expected log-Likelihood, which is approximated through bootstrapping the capture-histories.

The fake data-sets were generated as follows. The time-varying survival values were:
phi_t = cos((t-2.3)/1.2)/11 + 0.75. The time-varying capture-probabilities p_t were drawn from a beta distribution with shape parameters A=12 and B=12, resulting in an average capture-probability of 0.5. The p_t values were the same for all simulations. MLE and AICc analyses were run in Program MARK. For CJSboost, a ten-times 70-fold bootstrap-validation exercise was run per dataset to tune the CJSboost regularization parameters. The CJSboost program is available on my Github site. The simulations and analyses were repeated 40 times for three scenarios pertaining to the number of capture-histories n = {50, 200, 800}.
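A sketch of those simulation inputs (the random seed is arbitrary):

import numpy as np

T = 10                                        # primary capture periods
t = np.arange(1, T + 1)
phi_t = np.cos((t - 2.3) / 1.2) / 11 + 0.75   # true time-varying survival

rng = np.random.default_rng(3)
p_t = rng.beta(12, 12, size=T)                # capture probabilities, mean ~0.5, held fixed across simulations

print(np.round(phi_t, 3))
print(np.round(p_t, 3))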

Figure 2: Visualizing the bias-variance trade-off and the error of estimating survival in a Cormack-Jolly-Seber analysis, using four procedures (panel rows): i) the shrinkage estimator CJSboost; ii) AICc model-averaging based on four fixed-effect models of time-varying vs. time-constant survival and capture-probabilities; iii) the best AICc model; and iv) the Maximum Likelihood Estimate using the full-model p(t)phi(t). Panel columns are different sample sizes (number of capture-histories) over T=10 primary periods. The red-lines show the true survival. Each grey line is an independently sampled dataset and an analyst's attempt to estimate survival. The dashed-lines represent each procedure's average estimate over 40 simulated data-sets and analyses. The best estimation procedure has the lowest MSE (turquoise for emphasis). Each procedure may have a high/low bias or low/high variance, but generally cannot succeed at minimizing both. The bias is the difference between the red and dashed line. The variance is represented by the dispersion among grey lines. At small sample sizes, the AICc methods and boosting are very biased but have better MSE.

The results clearly show the trade-off (Figure 2). At high sample sizes (n=800), the shrinkage estimator CJSboost has the lowest MSE and therefore wins at estimating survival. However, it has the highest bias. How can it be considered a better estimator than the other methods when it is biased? The answer is obvious when looking at the grey lines in Figure 2, where each line is an estimate of phi from an independent realization of data: compared to the other methods, each grey line from CJSboost is much more likely to be closer to the truth, despite systematic bias. In contrast, using the MLEs, one can only claim to be unbiased over all possible realizations of the data as shown by the closeness of the dashed black line to the true red line. But, for any one realization (a single grey line) the MLEs can be very far away from the truth due to much higher variance.

At smaller sample sizes, we see that the bias becomes much more extreme for both the AICc methods and CJSboost. In the case of the AICc methods, the model with most support is often phi(.), in which case the estimates are a single flat line. This is also the case for CJSboost, where shrinkage is so extreme as to force a flat line. Therefore, at low sample sizes, we are much better off, in terms of MSE, to use the flat-lined phi(.) estimates rather than the full-model MLEs, which vary so wildly as to be useless.

This primer is meant to illustrate the role of bias and variance in estimation errors. Simulations show how shrinkage estimators (CJSboost) and model-selection (by AICc) each negotiate the trade-off between bias and variance to try and minimize the Expected Error. CJSboost does particularly better by incurring a little bias.

The original article can be accessed at: http://dx.doi.org/10.1101/052266. Original blog post at mucru.org/a-primer-on-the-bias-variance-trade-off-in-ecology/.

Tuesday, 10 May 2016

New pre-print: EM and component-wise boosting for Hidden Markov Models: a machine-learning approach to capture-recapture

Rankin RW (2016) "EM and component-wise boosting for Hidden Markov Models: a machine-learning approach to capture-recapture", bioRxiv pre-print, doi:10.1101/052266, URL: http://github.com/faraway1nspace/HMMboost


A new pre-print article is available online at: http://dx.doi.org/10.1101/052266. The study proposes a new way to fit capture-recapture models based on boosting: a machine-learning method which iteratively fits simple, non-parametric learners to build a strong prediction function. The ensemble method is a type of "multi-model inference" and "model-parsimony" technique, not unlike model-averaging by AICc. This new method, called CJSboost, is motivated by several deficiencies in the AICc model-selection approach, such as over-fitting the data and a tendency to produce ridiculous estimates, such as 100% survival. In contrast, boosting has the following benefits:

  • automatic variable selection and step-wise multimodel inference (without the sometimes-impossible task of fitting all possible fixed-effects models, as in AIC-based model averaging);

  • regularization and sparse estimation, which deflates the influence of unimportant covariates (especially important during the current crisis of reproducibility);

  • shrinkage of estimates away from extreme values and inadmissible values (e.g., survival= 100%);

  • highly extensible (see the wide variety of base-learners available under the "mboost" R package);

  • inference based on predictive performance.


  • One disadvantage of the machine-learning paradigm for capture-recapture is the computational burden of running 50 or 100 bootstrap-validation runs to find the optimal regularization parameters that minimize our generalization error.

    The Prediction Perspective

    Capture-recapture practitioners are perennially obsessed with model-selection and model-parsimony: the challenge of revealing important patterns in the data, without being tricked by frivolous false discoveries, i.e., not over-fitting models. The machine-learning community has addressed this challenge through the idea of generalization error: minimizing the cost of bad predictions on new, unseen data. Models that are too simple (e.g., the classic phi(.)p(.) model) are unlikely to make good predictions, whereas overly-complex models are likewise unlikely to generalize to new data.

    The boosting framework controls the complexity of the model (called regularization) by trying to minimize the generalization error (a.k.a. expected loss), and hence avoid overfitting the data. Interestingly, most capture-recapture practitioners have been implicitly using a prediction criterion, the AIC, to rank and select models. Few ecologists seem to recognize the connections between AIC-selection and prediction, and this manuscript tries to make these connections more obvious. However, while the AIC was developed in the context of General Linear Models with Gaussian error, the CJSboost method, proposed in this manuscript, is crafted specifically for capture-recapture models.

    Sparsity and Prediction

    One of the philosophical divisions between prediction-optimal regularizers, like boosting or the AICc, and other criteria, like the BIC/SBC, is the belief in a sparse "true model": whether there is a small number of truly influential variables that have large/medium effects, or whether there are an infinite number of influential variables, each with a decreasingly important influence on our data. This division is rooted in whether we want to minimize our risk of making bad numerical predictions, or whether we want to recover the "truth". The AIC and CJSboost adopt the former perspective: they both reduce the influence of unimportant variables, but there is still some positive weight placed on unimportant variables (this is the consequence of being prediction optimal!).

    If prediction is not one's focus, and one is interested in finding a sparse set of truly influential variables, then CJSboosted coefficients may be "hard-thresholded": discard unimportant covariates and rerun the algorithm. But how to judge which covariates are important? In the manuscript, I describe the technique of "stability selection" (Meinshausen and Bühlmann, 2010) for estimating (approximate) inclusion probabilities, based on repeatedly resampling the data and re-running the CJSboost algorithm. Probabilities lead to straightforward inference: covariates with high inclusion probabilities are probably important; covariates with low probabilities are probably not important. An interesting thing about l1-regularizers, like boosting algorithms, is that by hard-thresholding the covariates with stability-selection probabilities, we can transform a prediction-optimal procedure into a "model-selection consistent" procedure. This may not be exactly true for capture-recapture models, but the manuscript explores the issue with simulations.
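    The CJSboost implementation itself is on Github; purely as a generic illustration of the resampling idea, the sketch below runs stability selection with scikit-learn's Lasso as a stand-in learner on synthetic data (3 truly influential covariates out of 21).

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 200, 21
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.5, -1.0, 0.8]                # only the first 3 covariates matter
y = X @ beta + rng.normal(size=n)

n_resamples, alpha = 100, 0.1
selected = np.zeros(p)
for _ in range(n_resamples):
    idx = rng.choice(n, size=n // 2, replace=False)   # random half-sample
    fit = Lasso(alpha=alpha).fit(X[idx], y[idx])
    selected += (fit.coef_ != 0)

inclusion_prob = selected / n_resamples
print(np.round(inclusion_prob, 2))   # expected: near 1 for the first 3 covariates, low for the rest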

    The animation below shows 30 simulations and their stability selection pathways. Each slide is a different simulated dataset with 3 truly influential covariates (in red) and 18 other frivolous covariates, just to try and trick the algorithm. The pathways show that as the regularization gets weaker (larger m, to the right), there is a larger probability that any covariate will be selected by the algorithm. At strong regularization (small m, left), only the 3 most influential covariates (in red) enter the algorithm. Visually, this is an interesting way to discriminate between important and unimportant covariates, and can also help one achieve model-selection consistency.


    Simulation: how to discriminate between important covariates (in red) and unimportant covariates (in grey) with stability-selection curves. By repeatedly re-sampling the capture-histories (bootstrapping) and re-training the CJSboost model, we can estimate the probability that a covariate is included in the model.

    Online Tutorial

    The manuscript includes an online tutorial in R at: http://github.com/faraway1nspace/HMMboost. Interested readers can step through the analyses in the manuscript.

    Peer-review

    Readers should note that the manuscript is only a preprint: it is currently in peer-review.

    @article{Rankin052266,
    author = {Rankin, Robert William},
    title = {EM and component-wise boosting for Hidden Markov Models: a machine-learning approach to capture-recapture},
    year = {2016},
    doi = {10.1101/052266},
    publisher = {Cold Spring Harbor Labs Journals},
    abstract = {This study presents a new boosting method for capture-recapture models, rooted in predictive-performance and machine-learning. The regularization algorithm combines Expectation-Maximization and boosting to yield a type of multimodel inference, including automatic variable selection and control of model complexity. By analyzing simulations and a real dataset, this study shows the qualitatively similar estimates between AICc model-averaging and boosted capture-recapture for the CJS model. I discuss a number of benefits of boosting for capture-recapture, including: i) ability to fit non-linear patterns (regression-trees, splines); ii) sparser, simpler models that are less prone to over-fitting, singularities or boundary-value estimates than conventional methods; iii) an inference paradigm that is rooted in predictive-performance and free of p-values or 95\% confidence intervals; and v) estimates that are slightly biased, but are more stable over multiple realizations of the data. Finally, I discuss some philosophical considerations to help practitioners motivate the use of either prediction-optimal methods (AIC, boosting) or model-consistent methods. The boosted capture-recapture framework is highly extensible and could provide a rich, unified framework for addressing many topics in capture-recapture, such as spatial capture-recapture, individual heterogeneity, and non-linear effects.},
    URL = {http://dx.doi.org/10.1101/052266},
    eprint = {http://www.biorxiv.org/content/early/2016/05/09/052266.full.pdf},
    journal = {bioRxiv}
    }

    Monday, 21 March 2016

    New publication: A Bayesian Hierarchical Pollock's closed robust design

    Rankin, R. W., Nicholson, K. E., Allen, S. J., Krützen, M., Bejder, L., & Pollock, K. H. (2016). A full-capture Hierarchical Bayesian model of Pollock’s Closed Robust Design and application to dolphins. Frontiers in Marine Science, 3(25). doi:10.3389/fmars.2016.00025

    We are pleased to announce our new publication on Shark Bay bottlenose dolphins, which benchmarks model-averaging in Program MARK against a Bayesian Hierarchical model for temporary-migration Robust Design mark-recapture models. See below for the citation, a link to the free full-text PDF, and an online R/JAGS tutorial for Bayesian mark-recapture.

    Alternative to AIC Model-Averaging

    The paper will be of interest to cetacean researchers who use Program MARK for temporary-migration Robust Design models. In particular, we show that a Hierarchical Bayesian model can yield similar estimates as model-averaging by AICc, the latter being the current best practice to deal with the vast number of 'fixed-effects' models that one typically considers. Model-averaging and Bayesian frameworks have some similar philosophical underpinnings, such as conditioning on the data (Burnham and Anderson 2014). However, the HB framework is also highly extensible and can deal with other challenges where the AIC is undefined, such as random-effects and individual-level heterogeneity in capture-probabilities.

    Mark-Recapture and low-sample sizes: the Bayesian Answer

    Bayesian models are a solid answer to a perennial dilemma among cetacean researchers: photo-ID datasets are typically sparse or have low sample sizes. In contrast, researchers typically want complex, data-hungry models to increase ecological realism. For example, a simple temporary-migration model or individual-heterogeneity model will demand >30-70 variables for a mid-sized dataset. Frequentist and AICc-based inference will be overly confident in such situations, and yield ridiculous estimates such as 100% detection, or 0% migration, or 100% survival, or just fail altogether. Alternatively, Hierarchical Bayesian models provide exact inference under low sample sizes: they just depend more on the prior distributions, which, if set up thoughtfully, are more conservative, make better predictions, and can automatically safeguard against over-parametrization (Berger 2006, Gelman 2013).

    Individual Heterogeneity

    As the distribution of individual capture probabilities gets wider, the least-capturable individuals are more likely to be missed altogether (grey area), leading to a biased-high estimate of the mean capture parameter (p-hat), and biased-low population abundance estimates (top-right corner).
    Individual heterogeneity in capture probabilities will result in biased-low population abundance estimates (see animation), and therefore it is a primary preoccupation of most capture-recapture practitioners. Under a Hierarchical Bayesian full-capture framework, it is trivial to model individuals as coming from a distribution, without a large increase in complexity. In contrast, the comparable fixed-effect version in Program MARK, the 'two-point finite mixture model', typically yields over-parametrized models and unreliable capture-estimates (e.g., p=1).
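    A toy simulation of that mechanism, using a simple two-occasion Lincoln-Petersen estimator rather than the PCRD (all numbers are illustrative): the same mean capture probability gives roughly unbiased abundance estimates when everyone shares it, and biased-low estimates when it is spread across individuals.

import numpy as np

rng = np.random.default_rng(5)
N = 500   # true abundance

def lincoln_petersen(p_i):
    """Simulate two capture occasions and return the abundance estimate."""
    cap1 = rng.random(N) < p_i
    cap2 = rng.random(N) < p_i
    n1, n2, m = cap1.sum(), cap2.sum(), (cap1 & cap2).sum()
    return n1 * n2 / m

p_homogeneous = np.full(N, 0.3)           # every individual has p = 0.3
p_heterogeneous = rng.beta(1.5, 3.5, N)   # same mean (0.3), but widely spread

print("homogeneous  :", np.round(np.mean([lincoln_petersen(p_homogeneous) for _ in range(500)]), 1))
print("heterogeneous:", np.round(np.mean([lincoln_petersen(p_heterogeneous) for _ in range(500)]), 1))   # biased low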

    R and JAGS code

    See our online R/JAGS tutorial at Github for code to run the Hierarchical Bayesian Pollock's Closed Robust Design. The tutorial includes an example photo-ID bottlenose dolphin dataset from Krista et al. 2012. We use the flexible BUGS-like Bayesian syntax called "JAGS", which makes Bayesian models accessible to almost anyone with rudimentary scripting skills.

    Key Findings


  • full-capture, non-hierarchical Bayesian PCRD models had slightly better estimation performance than equivalent fixed-effects Maximum-Likelihood estimation (in MARK), mainly due to the latter's susceptibility to singularities (although there was no clear champion);
  • we propose a Hierarchical Bayesian PCRD which can lead to similar estimates as AICc model-averaging and serve as a type of multi-model inference;
  • we showed how heterogeneity in detection probabilities can lead to an 8-24% increase in bottlenose dolphin abundance estimates, as compared to ML and Bayesian models that assume homogeneous detection probabilities;
  • we explored the partial non-identifiability and high correlation among parameter estimates, especially between survival and temporary-migration, which has serious consequences for one's ability to use these parameters for inference, and which should influence researchers' study design and modelling strategies;
  • we proposed two posterior predictive checks to help diagnose poor model fitting, in lieu of a formal goodness-of-fit procedure in popular CMR software.


  • But Aren't Bayesians Biased?

    Some MARK users who are new to Bayesian inference may worry about prior information and the inherent bias of subjective Bayesian models. A subjective prior contains information. This information competes with the information in the data to influence the posterior distribution (the probability distribution which is used to make conclusions about a variable of interest, like population abundance or survival). The more information in a prior, the more a posterior expectation will be pulled away from the "objective" MLE and towards the prior expectation; conversely, the more information in our data (e.g., a large sample size), the less influential the prior becomes, and inferences are practically the same between ML-based and Bayesian paradigms.

    But there is strong evidence from the machine-learning and predictive-analytics community that slightly conservatively biased models yield better predictions, especially in the face of low sample sizes and very complex models (Murphy KP, 2012). In the Learning community, this is called "regularization", as in the Lasso, Ridge Regression, or Boosting: these techniques impose a penalty on model complexity and favour simpler models than the "objective" ML estimates. Interestingly, many of the Learning community's regularization techniques are actually a type of Bayesian model (Hooten and Hobbs 2015).

    The phenomenon is nicely illustrated with a coin toss: imagine you want to estimate the fairness of a coin (H vs T), but you flip it only 3 times. Let's say you observed TTT (a rare, but not impossible, event). The Maximum Likelihood Estimate would be that the probability of a tail is 100%, and the coin is definitely not fair. Of course, we know this is a terrible estimate and a stupid example, because we have solid prior information about the behaviour of coins. However, the sad fact is that many dolphin mark-recapture researchers are running similarly stupid experiments, under small sample sizes and overly complex Frequentist models which estimate 100% detection probability or 100% survival or 0% movement.

    In contrast, a little bit of skepticism encoded in a Bayesian prior (like Beta(1,1)) can temper our estimates. For our coin example, the posterior would say that the probability of a tail is just 80%, given the observed sequence TTT. That's still not the true 0.5, but it is a better prediction than the frequentist's 100%.
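    In numbers: a Beta(1,1) prior updated with 3 tails gives a Beta(4,1) posterior, whose mean is 0.8.

n_tails, n_heads = 3, 0
a_prior, b_prior = 1, 1   # Beta(1, 1) prior

mle = n_tails / (n_tails + n_heads)                                              # 1.0
posterior_mean = (a_prior + n_tails) / (a_prior + b_prior + n_tails + n_heads)   # 0.8
print(mle, posterior_mean)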

    Abstract

    We present a Hierarchical Bayesian version of Pollock's Closed Robust Design for studying the survival, temporary-migration, and abundance of marked animals. Through simulations and analyses of a bottlenose dolphin photo-identification dataset, we compare several estimation frameworks, including Maximum Likelihood estimation (ML), model-averaging by AICc, as well as Bayesian and Hierarchical Bayesian (HB) procedures. Our results demonstrate a number of advantages of the Bayesian framework over other popular methods. First, for simple fixed-effect models, we show the near-equivalence of Bayesian and ML point-estimates and confidence/credibility intervals. Second, we demonstrate how there is an inherent correlation among temporary-migration and survival parameter estimates in the PCRD, and while this can lead to serious convergence issues and singularities among MLEs, we show that the Bayesian estimates were more reliable. Third, we demonstrate that a Hierarchical Bayesian model with carefully thought-out hyperpriors, can lead to similar parameter estimates and conclusions as multi-model inference by AICc model-averaging. This latter point is especially interesting for mark-recapture practitioners, for whom model-uncertainty and multi-model inference have become a major preoccupation. Lastly, we extend the Hierarchical Bayesian PCRD to include full-capture histories (i.e., by modelling a recruitment process) and individual-level heterogeneity in detection probabilities, which can have important consequences for the range of phenomena studied by the PCRD, as well as lead to large differences in abundance estimates. For example, we estimate 8%-24% more bottlenose dolphins in the western gulf of Shark Bay than previously estimated by ML and AICc-based model-averaging. Other important extensions are discussed. Our Bayesian PCRD models are written in the BUGS-like JAGS language for easy dissemination and customization by the community of capture-mark-recapture practitioners.


    Import into Reference Manager


    Copy the following Bibtex into your favourite Reference Manager.

    @article{rankin_full-capture_2016, title = {A full-capture {Hierarchical} {Bayesian} model of {Pollock}'s {Closed} {Robust} {Design} and application to dolphins}, volume = {3}, url = {http://journal.frontiersin.org/article/10.3389/fmars.2016.00025}, doi = {10.3389/fmars.2016.00025}, abstract = {We present a Hierarchical Bayesian version of Pollock's Closed Robust Design for studying the survival, temporary migration, and abundance of marked animals. Through simulations and analyses of a bottlenose dolphin photo-identification dataset, we compare several estimation frameworks, including Maximum Likelihood estimation (ML), model-averaging by AICc, as well as Bayesian and Hierarchical Bayesian (HB) procedures. Our results demonstrate a number of advantages of the Bayesian framework over other popular methods. First, for simple fixed-effect models, we show the near-equivalence of Bayesian and ML point-estimates and confidence/credibility intervals. Second, we demonstrate how there is an inherent correlation among temporary migration and survival parameter estimates in the PCRD, and while this can lead to serious convergence issues and singularities among MLEs, we show that the Bayesian estimates were more reliable. Third, we demonstrate that a Hierarchical Bayesian model with carefully thought-out hyperpriors, can lead to similar parameter estimates and conclusions as multi-model inference by AICc model-averaging. This latter point is especially interesting for mark-recapture practitioners, for whom model-uncertainty and multi-model inference have become a major preoccupation. Lastly, we extend the Hierarchical Bayesian PCRD to include full-capture histories (i.e., by modelling a recruitment process) and individual-level heterogeneity in detection probabilities, which can have important consequences for the range of phenomena studied by the PCRD, as well as lead to large differences in abundance estimates. For example, we estimate 8\%-24\% more bottlenose dolphins in the western gulf of Shark Bay than previously estimated by ML and AICc-based model-averaging. Other important extensions are discussed. Our Bayesian PCRD models are written in the BUGS-like JAGS language for easy dissemination and customization by the community of capture-mark-recapture practitioners.}, number = {25}, journal = {Frontiers in Marine Science}, author = {Rankin, Robert W. and Nicholson, Krista E. and Allen, Simon J. and Krützen, Michael and Bejder, Lars and Pollock, Kenneth H.}, year = {2016} }

    Friday, 26 February 2016

    Hierarchical Bayesian: Automatic Occam's Razor

    There is a crisis of "reproducibility" right now in science (Prinz et al 2011; Simmons et al 2011; Begley et al 2012): conservative estimates are that more than 50% of scientific articles are false and cannot be reproduced. Bad estimates are rampant, and false-positives are the norm! This new 'digital dark age' has many culprits. One contributor is, I think, a misuse and misunderstanding of the dominant statistical paradigm called "Frequentism". Frequentism is a great intellectual achievement, and has such lofty ideals as 'objectivity', a.k.a. letting the data speak for itself. Let's not inject our own hopes and desires into our scientific research. Sounds great, doesn't it? Why would we want otherwise?

    Well, for one thing, researchers (especially dolphin researchers) rarely have enough data to use Frequentist tools. Thus, we violate the core principle of Frequentism, for which its name is derived:

    In frequentism, it is the observed data that is the random variable. Had we repeated the experiment in a parallel universe, we would get different data. We don't care about the actual data values (like individual dolphin's measurements); instead, we care about the true population-level process which gives rise to this data. We try to make inferences about this true process by sampling from the data distribution, and then try to generalize over the entire data distribution.

    The core principle is that the Frequentist is trying to use a small sample to characterize the entire data distribution. This requires a lot of data. And when we lack such data (because, heck, dolphins are difficult to find!), Frequentist estimates start to take on ridiculous values. For example, I can't tell you how many times a dolphin researcher using a mark-recapture model has estimated a survival of 100%. I am serious! There is a big problem with our supposedly 'objective' statistics when they lead us to believe that dolphins are immortal. But here is the catch: my knowledge that dolphins age and die is a type of "prior information", and using this prior biological knowledge leads us to a rival paradigm called Bayesianism.

    I won't get into the philosophy of Bayesianism (partially because I don't quite understand it). The best I can do is imagine a Bayesian as a poker player: they don't consider the data in their hand (e.g., their current cards) to be random: they consider it a given, and they want to make the best predictions possible about the next cards to be drawn, conditional on the evidence in their hand. Seems reasonable enough, but it turns out this is in direct opposition to Frequentism.

    Parsimony and Hierarchical Bayes


    But, one nice feature about Bayesianism is that we can encode into our statistical models an old ideal from Occam: "parsimony". Basically, we say "Don't tell me a dolphin is immortal unless you have a lot of data!" Technically, we can use "hyper-prior" distributions on parameters like survival to force simplicity in lieu of strong evidence, but allow for more complex models if there is a lot of data and larger sample size. This is demonstrated nicely in the following animation.


    Look at the red Maximum Likelihood Estimates on the right (a Frequentist tool). They estimate a probability p from a series of 0-1 outcomes (like the number of dolphins captured out of N available), and this is done in parallel for 10 different "groups" (like 10 independent experiments). Notice that the ML Estimates stay the same over a wide range of data/evidence (even though the "true" mean is around 0.5), and they vary from p=0.2 to p=1. However, the Hierarchical Bayesian posteriors (in purple) behave very differently. When the sample size is small (5 observations per group), each group is shrunk to the overall mean at 0.5. It is only as the sample size increases that each group is allowed to take on its own individual value; eventually, with a lot of data, each group's posterior expectation becomes nearly the same as the MLE.

    This is a type of "model parsimony": with only a little bit of data, our model is simple (all groups have the same estimate, and the estimate is the group mean at 0.5). With more data, our model becomes more complex (all groups share a mean, but they can also have individual variation, each with their own parameter p).
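    A rough empirical-Bayes sketch of that behaviour (a fixed shared Beta(5,5) prior stands in for the fitted hyper-prior of a true hierarchical model): with 5 observations per group the posterior means are pulled strongly toward the shared mean of 0.5, while with 500 observations they essentially match the MLEs.

import numpy as np

rng = np.random.default_rng(6)
true_p = rng.uniform(0.2, 0.8, size=10)   # 10 groups, each with its own true probability
a, b = 5.0, 5.0                           # shared prior (stand-in for a fitted hyper-prior), mean 0.5

for n in (5, 50, 500):                    # observations per group
    successes = rng.binomial(n, true_p)
    mle = successes / n
    posterior_mean = (a + successes) / (a + b + n)
    print("n =", n)
    print("  MLE      :", np.round(mle, 2))
    print("  posterior:", np.round(posterior_mean, 2))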

    The abuse of MLE is nicely illustrated with the following experiment: imagine you want to estimate the fairness of a coin, but flip it only 3 times. Let's say you observed TTT (a rare event, but not impossible). The Maximum Likelihood Estimate would be that the probability of a tail is 100%, and the coin is definitely not fair. Of course, we know this is a terrible estimate and a stupid example, because we have solid prior information about the behaviour of coins. However, the sad fact is that many dolphin mark-recapture researchers are making similarly stupid experiments, under small sample sizes and overly complex Frequentist models which estimate 100% detection probability or 100% survival or 0% movement. These terrible estimates are polluting our scientific literature.

    The MLEs are indeed the most 'objective', but they make bad predictions for new data and give us ridiculous results under low sample sizes. In contrast, a little bit of skepticism encoded in a Bayesian prior (like Beta(1,1)) can temper our estimates. Returning to the coin-flipping example, the posterior would say that the probability of a tail is just 80%, given the observed sequence TTT. That's still not the true 0.5, but it is a better prediction than the frequentist's 100%.

    Unified Under Hierarchical Bayes


    The poor performance of MLE was noticed by non-statisticians a long time ago in the Machine Learning and predictive analytics communities. They obsessed over "regularization": finding ways to constrain the complexity of models, and deliberately bias the estimates towards simplicity. Techniques like the LASSO, Ridge Regression, Elastic Net, Boosting, AIC optimization, Bayesian Model averaging, etc, were devised to make better predictions on new data, rather than philosophical notions of probability and objectivity.

    The eerie thing is that most of these techniques have a Bayesian interpretation. The famous Lasso is just a Bayesian model with a Double-Exponential prior on the regression coefficients. The idea is explored in an accessible article by Hooten and Hobbs 2015.

    Subjective Pseudo Bayesian and the Degradation of Science


    If ecologists are willing to inject a little bit of bias into their analyses in order to get better predictions and sensible results, then what of objectivity? What is to stop one from 'cooking the books'? Talking about cavalier 'pseudo'-Bayesian analysis, the Bayesian heavy-weight James Berger wrote in 2004:
    Statistics owes its central presence in science and life to the facts that (i) it is enormously useful for prediction; (ii) it is viewed as providing an objective validation for science. Discarding the latter would lead to a considerable downgrading of the value placed on statistics.

    At the time of his warning, few non-statisticians were using subjective Bayesian models, and even then, they seemed to want "objective" reference priors which conveyed no prior information and yielded nearly identical results as Frequentist models. It is a bit ironic that the opposite seems to have happened: the crisis of reproducibility has been unveiled, and it was under the watch of the so-called objective school of statistics, not of subjectively biased Bayesian models. Now that the tables have turned, and more and more researchers are turning to Bayesian models, Berger's 2004 warning sits, waiting to be realized. The solution is probably for everyone to adopt a prediction mentality. Dolphin ecologists have already sort of adopted this practice by using the AIC for model selection, but have yet to grapple with its philosophical underpinnings, or with rival ideas.