Wednesday, 2 October 2013

Western Australia Statistical Society -- sexy and optimistic

As an ecologist, I'm used to conferences being a bit grim: audiences and presenters lament that there is no funding, extinction continues, and we don't really know what's going on. In contrast, the WA Young Statistician's Workshop was a rare treat of optimism and self-satisfaction. Triumphant, in demand, and beloved as nerds, data professionals are 'the new rock stars' and 'sexy' (according to conference attendees). Most of the attendees were preoccupied with jobs and skill acquisition, but I found the following panel question to be a noteworthy insight: what stats/math principles & techniques are under-realized now, but will have a huge windfall of applications in the future? Or put another way, if there were a Nobel Prize in Statistics, where would you place your bets today? Some answers:
* finite-mixture models
* techniques to deal with high-dimensionality
* entropy under minimal constraints (mine)

Other recurring themes were:
* learn R;
* R tops Python for data science, but Python is easier to learn and has a lot to offer.

Saturday, 28 September 2013

Starting a PhD in Australia

Time flies! I've been in Australia for two months now, and I've been lucky to have been out on surveys for Bottlenose Dolphins in Adelaide (South Australia) and Bunbury (South West Australia) and for Snubfin Dolphins (in Northwest Australia). Also, I've had the all-Aussie holiday experience of vacationing in Bali; I've built a recycled chicken coop with my 'permie' housemates in White Gum Valley near Perth; and I even shepherd a happy sheep named Euwan around the huge backyard garden, lush with kale and chard and peppers. My super(ish) computer arrived the other day and, despite some crashes with Gnome-Ubuntu, things are looking very promising for a well-rounded life of cerebral PhD'ing, homey animal husbandry, and wildlife encounters as only Australia can deliver: today a >2m long Manta ray was slowly bowriding our survey boat in the shallows of Cygnet Bay!

(future) chicken coop

Euwan the sheep

Adelaide Bottlenose dolphins on survey

Snubfin dolphins in Cygnet Bay, North Western Australia

Friday, 21 June 2013

Solstice and the Southern Hemisphere!

This June 21st, I'm off to the Southern Hemisphere -- much like the trajectory of the Solar Equator, which starts moving South at the height of the Summer Solstice. Ok, it's a bit of a stretch, but there is something neat about inverting hemispheres on the Solstice. It's also a goodbye to my 2+ years working and living in the United States. Farewell Hawaii, DC (and Canada). I'll be starting PhD studies at Murdoch University and the Cetacean Research Unit. It's a bit of a risk leaving NOAA for a student position, but it is something I've always wanted to do. Coming on the heels of this blog post about the plight of ecology PhDs in an era of science divestment, I'm looking at it as a 'life accomplishment' rather than a serious career move. Plus, there's the chance to explore a whole new continent and bizarro ecosystem.

Tuesday, 18 June 2013

Zotero and R: automatically find relevant scientific articles with the Microsoft Academic Search API

Zotero is a powerful scientific-article manager that is part of my 'cannot-live-without' toolbox for research. Its one drawback compared to ReadCube or Mendeley is the lack of a 'find relevant articles' engine to smartly expand your library. But, unlike the aforementioned rivals, Zotero is an open-source project, and is naturally hackable and open to exploration with programs such as R. To make up for the deficiency, I made a script that i) provides R access to your Zotero database; ii) loops through articles and finds them online via the Microsoft Academic Search API; iii) finds the references and 'cited by' articles; and iv) outputs html files with links to Google Scholar, and the publisher's website to make it as easy as possible to get high-relevancy articles into Zotero.

The principle is simple: find the articles that are most frequently cited by the authors in your Zotero database (and which you haven't yet read), and find other articles that cite the articles in your database. Rather than using keyword-similarity algorithms, this script just assumes that authors who think and read similarly to you probably know what's relevant.
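To make the tallying idea concrete, here is a minimal base-R sketch on made-up data. The `refs` list, the `have` vector and the `tally_unread` helper are hypothetical names for illustration only; the IDs stand in for whatever identifiers a citation service returns.

```r
# Hypothetical reference lists: each element holds the IDs cited by one
# article that is already in the library (names are the library's own IDs)
refs <- list(a1 = c("p1", "p2", "p3"),
             a2 = c("p2", "p3"),
             a3 = c("p3", "a1"))
have <- names(refs)  # articles already held, to be excluded from the tally

tally_unread <- function(refs, have) {
  counts <- table(unlist(refs, use.names = FALSE))  # how often each ID is cited
  counts <- counts[!(names(counts) %in% have)]      # drop what is already held
  counts <- sort(counts, decreasing = TRUE)         # most-cited first
  data.frame(id = names(counts), zcount = as.integer(counts),
             stringsAsFactors = FALSE)
}

top <- tally_unread(refs, have)
print(top)
```

The article cited by the most library items floats to the top, which is the whole recommendation engine in miniature.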

Below is the code, which can be yanked into your R terminal. There is one source-code file that should be downloaded into your working directory (download from here) and called "SQL_zotero_query.txt" (thank you Royce). The sections which need to be customized are the locations of: a) your Zotero database folder (which has zotero.sqlite) and b) the folder to save the output html files. If you use this frequently, you should also get your own free MSA API key (mine is provided below but allows only a limited number of queries).

Enjoy! Please send me any suggestions and questions!

# Microsoft Academic Search API (please get your own)
apikey <- "88cab0fa-2dd9-4eef-9d07-67d4e0a5c933"

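options(gsubfn.engine = "R")
library(RSQLite) # dbConnect, dbSendQuery, fetch
library(sqldf) # SQL joins on data frames
library(rjson) # fromJSON (assumed: the JSON package is not named in this post)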

stem <- ifelse(Sys.info()['sysname']=="Linux","/home/rob","C:/Users/Rob") # your user directory
msadir <- paste(stem,"/Documents/",sep="") # working directory
zot_db <- paste(stem,"/Documents/Literature/zotero.sqlite.bak",sep="") # original Zotero database to copy
file.copy(from=zot_db, to=msadir, overwrite=TRUE) # work on a backup copy of the Zotero database
db <- paste(msadir,"zotero.sqlite.bak",sep="")

# SQL command
SQLquery_txt_fname <- "SQL_zotero_query.txt" # source code for the SQL command
SQLquery_txt <- readLines(SQLquery_txt_fname)
conn <- dbConnect(SQLite(), dbname = db)
dbListTables(conn) # list tables in zotero database
res2 <- dbSendQuery(conn, statement = paste(SQLquery_txt,collapse=" ") )
zd <- fetch(res2, n=-1) # fetch all rows
# FILTER to only journal articles
zd <- zd[which(zd$TYPE %in% c("journalArticle")),]
# remove TAGS and abstract
zd <- zd[,-(grep("TAG_",names(zd)))]
zd <- zd[,-(grep("ABSTRACT",names(zd)))]

zd$TITLE <- gsub('<i>',"",zd$TITLE) # strip italic tags
zd$TITLE <- gsub('</i>',"",zd$TITLE)
zd$TITLE <- gsub(':',"",zd$TITLE) # remove colons
zd$TITLE <- gsub(',',"",zd$TITLE) # remove commas
zd$TITLE <- gsub('-'," ",zd$TITLE) # replace hyphens with spaces

gettits <- function(rmas){ lapply( rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx){ ti <- lx$Title
gsub('</i>',"", gsub('<i>',"", gsub(':',"",gsub(',',"",gsub('-'," ",ti)))))
}) } # get the titles of the returned objects
getDOI <- function(rmas){lapply( rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx) lx[["DOI"]]) } # get the DOI of the returned objects
getMSID <- function(rmas){lapply( rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx) lx[["ID"]]) } # MAS internal ID
getauths <- function(rmas){ lapply(rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx) {
ret<-unlist(lapply(lx[["Author"]], function(al) al[["LastName"]])); ifelse(is.null(ret),NA,ret)}
)} # last names of authors
getyear <- function(rmas){ lapply(rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx) lx[["Year"]])}
geturl <- function(rmas){ lapply(rmas$d[[8]][[5]],function(lx){ret<-lx$FullVersionURL; if(is.null(ret) | length(ret)==0){ret<-NA}; ret[1]})}
getpub <- function(rmas){ lapply(rmas$d[[8]][[5]],function(lx){ret<-c(lx$Journal[c("FullName","ShortName")],NA); ret[[which(!is.null(ret))]]})}
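# fuzzy matching by agrep
fuzzycor <- function(s1,s2){
sv <- list(one= unlist(strsplit(s1," ")),two=unlist(strsplit(s2," ")))
sv$one <- sv$one[sv$one!=""]; sv$two <- sv$two[sv$two!=""]
# count the words of each title that approximately match the other title (agrep's default tolerance assumed)
find1in2 <- sum(sapply(sv[[1]],function(ss) length(agrep(ss,s2))>0))
find2in1 <- sum(sapply(sv[[2]],function(ss) length(agrep(ss,s1))>0))
# assumed return value: the share of words matched in both directions (0 to 1)
(find1in2+find2in1)/(length(sv$one)+length(sv$two))
}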

# STEP one: match local zotero holdings with the MSAR numbers
authcol <- grep("AUTHOR_[[:digit:]]{1}_LAST",names(zd))
iddbf <- paste(msadir,"msID_db.csv",sep="") # local file of MSA id's
iddb <- data.frame(ITEMID=-1,msID=NA,ztitle=NA,mstitle=NA,zdoi=NA,msdoi=NA)

for(i in 1:nrow(zd)){

msID <- NULL # a handler for the final matching option in retrieval
ctit <- cdoi <- mj <- NA

# zotero title
ztit <-zd$TITLE[i]
zauths <-zd[i,authcol][which(!is.na(zd[i,authcol]))]
zdoi <- zd[i,"DOI"]
zyear <- strsplit(zd$DATE[i],split="(\\ )|-" ,perl=TRUE)[[1]][1]
print(paste("MSA query for '",substring(ztit,1,50),"..."))
textcall <- paste("",apikey,"&TitleQuery=",gsub(" ","+",ztit),"&ResultObjects=publication&PublicationContent=AllInfo&StartIdx=1&EndIdx=5",sep="")
con<- url(description=textcall)
rmas <- fromJSON(readLines(con))

# check if the resource is in MSA
if(length(rmas$d[[8]][[5]])==0){ # assumed test for an empty result list
msID <-NA; ctit<-"not_found"; cdoi<-NA; mj <- 1
print(paste("no MSA results for '",substring(ztit,1,50),"..."))
} else {
print(paste("found MSA results for '",substring(ztit,1,50),"..."))
# authors
ctit <- gettits(rmas)
cauths <- getauths(rmas)

# first try matching by the doi
cdoi <- unlist(getDOI(rmas))
mj <- which(cdoi == zdoi)[1]
if(!is.na(mj)){ msID <- getMSID(rmas)[[mj]] }
if(is.null(msID)){ # no DOI match: fall back on title, authors and year
# title correlation
Rtit <- unlist(lapply(ctit, function(mstit,ztit) fuzzycor(mstit,ztit), ztit=ztit))
# check how many of the Zotero authors are in the MSA listing
Rauths <- unlist(lapply(cauths, function(msauths,zauths){
mean(sapply(zauths, function(auth){ length(agrep(auth, msauths))>0}))},
zauths=zauths))
# year correlations
Ryear <- unlist(lapply(getyear(rmas), function(cyr){ 1*(cyr==zyear)}))
Rs <- data.frame(do.call('cbind', list(Rtit,Rauths,Ryear)))
mincrit <- apply(Rs, 1, function(rw) all(rw > 0.8))
if(any(mincrit, na.rm=TRUE)){
# best-scoring candidate, indexed against the full result list
mj <- which(mincrit)[which.max(rowMeans(Rs[which(mincrit),]))]
msID <- getMSID(rmas)[[mj]]
} else {
msID <- NA
}
} #is.null(msID)
} # resource found
iddb <- rbind(iddb, data.frame(ITEMID=zd$ITEMID[i], msID=msID,ztitle=as.character(ztit),mstitle=as.character(ctit[mj]),zdoi=zdoi,msdoi=cdoi[mj]))
} # end of loop over Zotero items
iddb2 <- iddb[which(!is.na(iddb$msID)),] # remove not found

# STEP TWO: query the MS id and learn all the "cited" and "cited by" references
# loop through found MSA records
refs <- list(cites=as.list(rep(NA,nrow(iddb2))),cited=as.list(rep(NA,nrow(iddb2))))
names(refs$cited) <- names(refs$cites) <- iddb2$ITEMID
# make a db to store the journal results
pdb <- data.frame(msID=NULL, cited=NULL, title=NULL, author1=NULL, year=NULL, journal=NULL, doi=NULL,url=NULL)
# one row per returned publication, in the column order of pdb
# (assumed reconstruction: 'cited' is left NA and missing fields become NA)
na1 <- function(lst) sapply(lst, function(v) if(is.null(v) || length(v)==0) NA else as.character(v[[1]]))
pubrows <- function(rmas){
if(length(rmas$d[[8]][[5]])==0) return(NULL)
data.frame(msID=na1(getMSID(rmas)), cited=NA, title=na1(gettits(rmas)),
author1=na1(getauths(rmas)), year=na1(getyear(rmas)), journal=na1(getpub(rmas)),
doi=na1(getDOI(rmas)), url=na1(geturl(rmas)), stringsAsFactors=FALSE)
}

for(i in 1:nrow(iddb2)){

# find other works which cite the current article
msID <- iddb2[i,"msID"]
textcall <- paste("",apikey,"&PublicationID=",msID,"&ReferenceType=Citation&PublicationContent=AllInfo&StartIdx=1&EndIdx=50&OrderType=CitationCount",sep="")
con<- url(description=textcall)
rmas <- fromJSON(readLines(con))
refs$cites[[i]] <- unlist(getMSID(rmas))
pdb <- rbind(pdb, pubrows(rmas))
# find other works which are cited by the current article
textcall <- paste("",apikey,"&PublicationID=",msID,"&ReferenceType=Reference&PublicationContent=AllInfo&StartIdx=1&EndIdx=50&OrderType=CitationCount",sep="")
con<- url(description=textcall)
rmas <- fromJSON(readLines(con))
refs$cited[[i]] <- unlist(getMSID(rmas))
pdb <- rbind(pdb, pubrows(rmas))
} # end of loop over matched records
names(refs$cited) <- names(refs$cites) <- iddb2$ITEMID
pdb <- unique(pdb)

# STEP THREE: tally results
citetally <- list(cites=NA,cited=NA) # storage for results (back/forwards citations)
for(i in 1:length(refs)){
tpt <- table(unlist(refs[[i]]))
tpt <- tpt[order(tpt,decreasing=TRUE)]
alreadyhave <- names(tpt)[which(names(tpt) %in% iddb2$msID)]
tpt <- tpt[which(names(tpt) %in% alreadyhave==FALSE)]
tptdb <- data.frame(msID=names(tpt),zcount=as.numeric(tpt))
citetally[[i]] <- sqldf("SELECT pdb.*,tptdb.zcount as 'zcount' FROM tptdb LEFT JOIN pdb ON tptdb.msID=pdb.msID")
} # end of loop over cites / cited

# STEP FOUR A: save final output as CSV
write.csv(citetally[[1]], paste(msadir,"most_citedby.csv",sep=""),row.names=FALSE)
write.csv(citetally[[2]], paste(msadir,"most_cited.csv",sep=""),row.names=FALSE)

# STEP FOUR B: save final output as html to open links
# best way to get things into firefox :)
# assumed details: a Google Scholar title search for the first link, and output file names that mirror the CSVs
gscholar <- "https://scholar.google.com/scholar?q="
htmlrows <- function(tally){ apply(tally,1,function(x){
paste("<tr><td>",x[9],"</td><td><a href='",
paste(gscholar,gsub(" ","+",x[3]),collapse="",sep=""),
"' target='_blank'>",x[3],"</a></td><td><a href='",x[7],"' target='_blank'>",x[7],
"</a></td><td><a href='",x[8],"' target='_blank'>website</a></td></tr>\n",sep="")}) }
htmlhead <- paste(c("zcount","title","doi","publisher"),collapse="</th><th>")
cat(paste("<html><body>zcount is the number of articles in your database cited by the focal article<br><table><tr><th>",
htmlhead,"</th></tr>\n",paste(htmlrows(citetally[[1]]),collapse=""),"</table></body></html>",sep=""),
file=paste(msadir,"most_citedby.html",sep=""))
# 2nd webpage
cat(paste("<html><body>zcount is the number of articles in your database which cite the focal article<br><table><tr><th>",
htmlhead,"</th></tr>\n",paste(htmlrows(citetally[[2]]),collapse=""),"</table></body></html>",sep=""),
file=paste(msadir,"most_cited.html",sep=""))

And some example output...

Why R? The above script just serves as a one-stop shop for SQL and JSON processing. On the side, I also use R's wonderful visualization tools and matrix processing facilities to play around with authors and keywords. But really, the above script could probably be run more efficiently in Python or Java.

A special thanks to Royce Kimmons, whose post supplied the SQL command to access Zotero databases.

BTW, in case you're wondering why I'm using two open-source projects with a Microsoft project: other free online tools such as CiteULike or CiteSeerX do NOT provide the needed forwards-citation or backwards-citation information, either through an API or through web scraping. I'd love some alternatives.

Sunday, 19 May 2013

Happy hens of Haiku and the Boo Boo Zoo

Lindsay and I celebrated this past Earth Day by finally getting some backyard chickens in Haiku, HI. What could be better than combining composting, egg production, useless yardspace, and companionship!

In typical Maui fashion, our hens are a bit strange, being rescue hens of unknown provenance from Maui's 'boo boo zoo'. This private rehab center for domestic animals is one of many controversial private animal sanctuaries, notorious for a strict no-kill policy. Broken limbs, blindness, illness: no euthanasia whatsoever.

While such a policy is perhaps cruel, the greater tragedy is how large amounts of private money (often from one or two rich donors) flow into so many dubious domestic animal sanctuaries, while programs targeted at endangered Hawaiian species go underfunded. Hawaii has lost the majority of its endemic species and more are on the way out. How great it would be if one or two rich Americans made it their 'pet cause' to fund the conservation efforts aimed at the Maui parrotbill (only ~500 left) rather than rescuing species that are invasive and at zero risk of extinction.

Despite good intentions, such suboptimal outcomes highlight the fallacy of the Libertarian (and American?) narrative that we should facilitate rich people having more money through low taxation and let them reinvest the money into the economy as they see fit, rather than tax and use the resources for needed projects. The extremely rich just funnel scarce resources into crazy pet projects of dubious value.

BTW, volunteering / donating to the Maui Parrotbill project is very fun, even more fun than chickens, and extremely valuable. Check out

Tuesday, 19 February 2013

Advice for ecology grad students... (Science)

ECOLOG is the largest listserv devoted to ecology-based topics (also one of the most bizarre listservs I've encountered). There has been a lively discussion recently about a publication by Blickley et al (2013). They provided a thoughtful analysis of the skills needed by ecology/conservation employers, inferred from job postings on the web. The community's response has been hearty surprise at the study's emphasis on 'project management / interpersonal skills' over technical skills. This conclusion seems consistent with a quick head-count around the NOAA office: most of my colleagues are administrators, managers, coordinators (with impressive technical qualifications) while dedicated Quants are few and far between. In contrast, the community gave a resounding 'learn GIS' rejoinder to the study.

And because I love ordination diagrams, below is a closer look at the Blickley analysis: quantitative/technical skills and the management/interpersonal fields seem to be on opposing ends of the 1st Principal Component, suggesting that grad students may have to decide early whether to bet on succeeding as a technical person / senior scientist, or as a manager type. The study clearly states what it thinks is a winning strategy: “. . . there are a lot of things you can learn, but [interpersonal skills are] the hardest to teach.”

I think the quantitative side of my brain has cannibalized the portion devoted to interpersonal skills, so I have no advice to give on this matter. But on the resounding call for GIS savviness, here are my two cents:
1) everything in ecology has a space-time context, and colleagues without basic GIS facilities are frustratingly difficult to work or communicate with.

2) if you are serious about working with large ecological data or serious about taking up GIS, beware of classes/programmes that are little more than ESRI tutorials: you will be set up with a platform of limitation and disappointment. Even at the highest echelons of ArcMastery (and expensive licenses), you'll inevitably end up having to tell your superiors that you couldn't complete such-and-such a task because 'ArcGIS doesn't do that.' (But hey, that's a good looking map!)

Getting really good at ArcGIS is like becoming a master of Macromedia right before Flash came out: they jump from Avenue, to VB, to Python, to .... what's next? Instead, if you use R for GIS, there is always a way to do what you want. It may be difficult, but mastering R for a difficult GIS task yields transferable skills in a host of disciplines. It used to be a huge pain, but recent libraries like 'rgeos' (mixed with 'rgdal' and 'raster') give users most of the cookie-cutter facilities familiar to ESRI users. And it's free and open-source (more on this later...)

I hope to have a little tutorial on GIS'ing in R. Until then, the already R-acquainted can leap into the subject with the following advice:

Getting started with GIS in R

1) for any questions, always start your Google/DuckDuckGo queries with 'R-sig-geo': the listserv archives are replete with questions and answers to the issues you will inevitably have (and far better than ESRI documentation).

2) get acquainted with the internal data structures of gridded data and vector data in the 'sp' package, e.g.,
> ?SpatialGridDataFrame
> ?SpatialPointsDataFrame
... to the point of being able to reconstruct the structures from scratch. HINT: they are lists of lists of lists of...

3) learn about the 'Proj.4' syntax for defining projections/coordinate systems. There is a much larger context to this project, but as a starting point one can just bookmark the more usual coordinate systems and projections: for example, WGS84 is "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs", and UTM zone 9N (e.g., for British Columbia, Canada) could be "+proj=utm +zone=9 +ellps=GRS80 +units=m +no_defs".

4) learn the examples in the help files of the following core GIS libraries:
rgeos: vector data basic operations, like unions, buffers, spatial sampling, etc.
rgdal: GDAL library to read and write a variety of raster datasets (see 'writeGDAL(...)') : GeoTIFFS, ESRI grids, floats, etc. It also provides the ability to reproject vector data (see 'spTransform').
maptools: basic GIS facilities, including 'readShapePoly(...)' for easy import of ESRI shapefiles.
raster: clip, shrink, reproject, resample, stack rasters -- a parallel (and better) way of representing gridded data (seemingly a rival to the SpatialGridDataFrame?). Despite the one-line-of-code annoyance of switching between SGDF class and raster class, this package takes the cake for handling of rasters. (For those of you taking note, you'll notice that, yes, there are TWO different libraries for projecting vector versus raster data).

5) learn about plotting maps with the spplot(...) function. An entire book could be written on spplot(), but start with col.regions=terrain.colors(100) for decent colours.
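
As a small aid for step 3, the UTM string can be built from a longitude instead of being looked up. This is only a sketch in base R: `utm_zone` and `utm_proj4` are hypothetical helper names, the formula is the standard 6-degree zone rule, and it assumes the northern hemisphere (southern zones would also need '+south') and longitudes from -180 up to, but not including, 180.

```r
# Standard UTM zone rule: zones are 6 degrees wide, zone 1 starts at 180W
utm_zone <- function(lon) floor((lon + 180) / 6) + 1

# Assemble a Proj.4 string in the same form as the zone 9N example above
# (northern hemisphere only; 'ellps' defaults to GRS80 as in that example)
utm_proj4 <- function(lon, ellps = "GRS80") {
  paste0("+proj=utm +zone=", utm_zone(lon),
         " +ellps=", ellps, " +units=m +no_defs")
}

# A longitude on the coast of British Columbia
bc_proj <- utm_proj4(-129)
print(bc_proj)
```

The resulting string can be handed to whatever accepts Proj.4 definitions, for instance as the argument of CRS() in 'sp'.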

Linux Users
There are a few extra steps to get the GIS libraries running in Linux, in particular, installing the libraries upon which 'rgeos', 'maptools', and 'rgdal' depend. Even though the dependencies are documented on the respective packages' pages, I still found it a bit tricky. First, the libraries you want are often the 'dev' versions (e.g., libproj-dev), as explained in this post. I was generally successful in Ubuntu with:
> sudo apt-get install libproj-dev libgdal1-dev

Mercifully, there is a dedicated repository of GIS libraries for Ubuntu and Debian flavours. You can add the repository to your source list by entering the following in your terminal:
> sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
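
Once the system libraries are in place, a quick check from within R shows which of the packages named in this post can actually be loaded. A rough sketch using only base R; the package list simply repeats the names mentioned above, and `gis_pkgs`, `gis_ok` and `missing_pkgs` are made-up variable names.

```r
# Packages mentioned in this post
gis_pkgs <- c("sp", "rgeos", "rgdal", "maptools", "raster")

# TRUE where the package is installed and its namespace loads cleanly
gis_ok <- vapply(gis_pkgs, requireNamespace, logical(1), quietly = TRUE)
print(gis_ok)

# Names of the packages that still need install.packages()
missing_pkgs <- names(gis_ok)[!gis_ok]
print(missing_pkgs)
```

Anything listed in `missing_pkgs` either was never installed or failed to load, which on Linux usually points back at a missing 'dev' library.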

1 Blickley, Jessica L., Kristy Deiner, Kelly Garbach, Iara Lacher, Mariah H. Meek, Lauren M. Porensky, Marit L. Wilkerson, Eric M. Winford, and Mark W. Schwartz. “Graduate Student’s Guide to Necessary Skills for Nonacademic Conservation Careers.” Conservation Biology 27, no. 1 (2013): 24–34. doi:10.1111/j.1523-1739.2012.01956.x.

Saturday, 16 February 2013


This Valentine's I'm saddened by the surprise demise of a duck--likely taken by a fox on one of these ill-weather days while I was in Canada. I live on a small hobby farm outside Washington DC, in Boyds, where simple chickens, goats, horses and a duck have been companions to ground my psyche in the here and now, away from the abstract coding at NOAA. The duck never quite fit in among the other animals, having lost his conspecifics to a fox raid early in the year. I liked him the best: he'd actually wake me up in the mornings begging for bananas and oats.

And so my mind thinks of a cyber outsider that has me excited: DuckDuckGo. It is described as a hybrid search engine, pulling results from Yahoo, Wolfram Alpha, Wikipedia, and its own crawler. It is generating lots of buzz, perhaps being the only contender against the Google. For me, its open-source, tweakable platform tickles my nagging sense of indignation with Google and its censor-happy tendencies. DuckDuckGo promises zero tracking, privacy, and decent searches. For me, it does well for R and science related searches, which is important (although Google is still the best for pulling results from academia, Stack Overflow, and relevant science sites, etc).

Has it occurred to you that Google has turned to the dark side? For me, the realization came a long time ago, before SOPA and other aggressive attacks on internet freedom. Consider the story of the deceased and way back in 2006: these websites hosted free (& pirated) college textbooks, and were blacklisted from the Google results. To add insult to injury, when was taken down in early 2012, the domain name actually redirected you to Books.Google.Com (wow!): now, directs you to, which is a great study of double-think.
For me, I'll give DuckDuckGo a good chance.
My cannot-do-without Search tools:

Firefox InstaFox Add-on: to do all sorts of search engine searches from your browser address bar. Tailor to use d+space for DuckDuckGo. I have 'ci' for, 'me' for Mendeley, etc.

GNOME Do: For Linux users, why point and click when Super+Space allows you access to web searches, applications, recent files, music, system settings, etc.?
(I miss the duck).

Wednesday, 13 February 2013

The Random Whale Wiki App - Android

Ever want to just stumble upon random Cetacean information from Wikipedia? Here is an App for that! I made it as a fun way to lazily learn about marine mammals on my tablet while in bed: just lie back and learn about the fascinating behaviour, evolution, phylogeny and all the mangled bytes on Wikipedia's Cetacean portal.

> Download the .apk file here
> Move the .apk onto your android device, via bluetooth or usb or whatever. Move it anywhere in the android file structure, but make sure you can find it again within your Android file manager (I use ES File Manager).
> Detach from the computer and turn on your Android. Browse to the .apk file and tap it. Follow the instructions to install!
> To locate the Random Whale Wiki icon, you'll need to manually drag it onto your 'desktop' through Settings--> Apps.
The app is admittedly primitive: just a refresh button to get a new marine mammal article, plus some facilities for directed browsing. Enjoy! The real engine is the Wikipedia Random project.

Personally, I think Wikipedia is great for learning about science. Recent innovations like the organized Portals are downright exciting! I like the Biology Portal. Who would want the linear bore of an intro college textbook, when you can have the organic cluster of articles that allow you to click-through and venture out into the Wikispace as far as you are willing to go?  Just be mindful of Wikipedia's biases and reliability.

Are you a fan of Wikipedia random-browsing? How do you use the web's 2.0 resources for edutainment?