Tuesday, 18 June 2013

Zotero and R: automatically find relevant scientific articles with the Microsoft Academic Search API

Zotero is a powerful scientific-article manager that is part of my 'cannot-live-without' toolbox for research. Its one drawback compared to ReadCube or Mendeley is the lack of a 'find relevant articles' engine to smartly expand your library. But, unlike the aforementioned rivals, Zotero is an open-source project, and is naturally hackable and open to exploration with programs such as R. To make up for the deficiency, I made a script that i) provides R access to your Zotero database; ii) loops through articles and finds them online via the Microsoft Academic Search API; iii) finds the references and 'cited by' articles; iv) and outputs html files with links to Google Scholar, dx.doi.org, and the publisher's website to make it as easy as possible to get high-relevancy articles into Zotero.

The principle is simple: find articles that are most frequently cited by the authors in your Zotero database (and which you haven't yet read) and find other articles that cite the articles in your database. Rather than using keyword similarity algorithms, this script just assumes that Authors who think and read similarly as you do probably know what's relevant.

Below is the code which can be yanked to your R terminal. There is one source-code file that should be downloaded into your working directory (download from here) and called: "SQL_zotero_query.txt" (thank you Royce). The sections which need to be customized are the location of: a) your Zotero database folder (which has zotero.sqlite) and b) the folder to save the output html files. If you use this frequently, you should also get your own free MSA API key (mine is provided below but has limited amount of queries allowed).

Enjoy! Please send me any suggestions and questions!

# Microsoft Academic Search API (please get your own)
apikey <- "88cab0fa-2dd9-4eef-9d07-67d4e0a5c933"


options(gsubfn.engine = "R")
library(sqldf)
library(RSQLite)
library(rjson)


# WORKING DIRECTORIES AND WHERE TO FIND ZOTERO, etc
stem <- ifelse(Sys.info()['sysname']=="Linux","//home/rob","C:/Users/Rob") # your user directory
msadir <- paste(stem,"/Documents/",sep="") # working directory
setwd(msadir)
zot_db <- paste(stem,"/Documents/Literature/zotero.sqlite.bak",sep="") # zotero original databaset to copy
file.copy(from=zot_db, to=msadir, overwrite=TRUE) # notice backup of zotero databse
db <- paste(msadir,"/zotero.sqlite.bak",sep="")


# SQL command
SQLquery_txt_fname <- paste("SQL_zotero_query.txt",sep="") # source code for sqlcommand
SQLquery_txt <- readLines(SQLquery_txt_fname)
# CONNECT TO ZOTERO DATABASE AND QUERY
conn <- dbConnect("SQLite", dbname = "zotero.sqlite.bak")
dbListTables(conn)#list tables in zotero database
res2 <- dbSendQuery(conn, statement = paste(SQLquery_txt,collapse=" ") )
zd <- fetch(res2, n=-1)#
dbClearResult(res2)
# FILTER to only journal articles
zd <- zd[which(zd$TYPE %in% c("journalArticle")),]
# remove TAGS and abstract
zd <- zd[,-(grep("TAG_",names(zd)))]
zd <- zd[,-(grep("ABSTRACT",names(zd)))]


# CLEAN UP UGLY CHARACTERS
zd$TITLE <- gsub('<i>',"",zd$TITLE)
zd$TITLE <- gsub('<i>',"",zd$TITLE)
zd$TITLE <- gsub(':',"",zd$TITLE) # remove some other
zd$TITLE <- gsub(',',"",zd$TITLE) # remove some other
zd$TITLE <- gsub('-'," ",zd$TITLE) # remove some other
# DONE PREPARING THE DATA BASE


# DEFINE SPECIAL FUNCTIONS FOR THE NEXT STEPS
gettits <- function(rmas){ lapply( rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx){ ti <- lx$Title
gsub('<i>',"", gsub('<i>',"", gsub(':',"",gsub(',',"",gsub('-'," ",ti)))))
}) } # get the titles of the returned objects
getDOI <- function(rmas){lapply( rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx) lx[["DOI"]]) } # get the DOI of the returned objects
getMSID <- function(rmas){lapply( rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx) lx[["ID"]]) } # MAS internal ID
getauths <- function(rmas){ lapply(rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx) {
ret<-unlist(lapply(lx[["Author"]], function(al) al[["LastName"]])); ifelse(is.null(ret),NA,ret)}
)} # last names of authors
getyear <- function(rmas){ lapply(rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx) lx[["Year"]])}
geturl <- function(rmas){ lapply(rmas$d[[8]][[5]],function(lx){ret<-lx$FullVersionURL; if(is.null(ret) | length(ret)==0){ret<-NA}; ret[1]})}
getpub <- function(rmas){ lapply(rmas$d[[8]][[5]],function(lx){ret<-c(lx$Journal[c("FullName","ShortName")],NA); ret[[which(!is.null(ret))]]})}
# fuzzy matching by agrep
fuzzycor <- function(s1,s2){
sv <- list(one= unlist(strsplit(s1," ")),two=unlist(strsplit(s2," ")))
sv$one <- sv$one[sv$one!=""]; sv$two <- sv$two[sv$two!=""]
find1in2 <- sum(sapply(sv[[1]],function(ss) length(agrep(ss,s2,ignore.case=TRUE))>0))
find2in1 <- sum(sapply(sv[[2]],function(ss) length(agrep(ss,s1,ignore.case=TRUE))>0))
(find1in2+find2in1)/(length(sv[[1]])+length(sv[[2]]))}


# STEP one: match local zotero holdings with the MSAR numbers
authcol <- grep("AUTHOR_[[:digit:]]{1}_LAST",names(zd))
iddbf <- paste(msadir,"msID_db.csv",sep="") # local file of MSA id's
iddb <- data.frame(ITEMID=-1,msID=NA,ztitle=NA,mstitle=NA,zdoi=NA,msdoi=NA)


for(i in 1:nrow(zd)){

msID <- NULL # a handler for the final matching option in retrieval
ctit <- cdoi <- mj <- NA


# zotero title
ztit <-zd$TITLE[i]
zauths <-zd[i,authcol][which(!is.na(zd[i,authcol]))]
zdoi <- zd[i,"DOI"]
zyear <- strsplit(zd$DATE[i],split="(\ )|-" ,perl=TRUE)[[1]][1]
print(paste("MSA query for '",substring(ztit,1,50),"..."))
textcall <- paste("http://academic.research.microsoft.com/json.svc/search?AppId=",apikey,"&TitleQuery=",gsub(" ","+",ztit),"&ResultObjects=publication&PublicationContent=AllInfo&StartIdx=1&EndIdx=5",sep="")
con<- url(description=textcall)
rmas <- fromJSON(readLines(con))
close(con)


# check if the resource is in MSA
if(length(rmas$d$Publication$Result)==0){

msID <-NA; ctit<-"not_found"; cdoi<-NA; mj <- 1
print(paste("no MSA results for '",substring(ztit,1,50),"..."))
} else {
print(paste("found MSA results for '",substring(ztit,1,50),"..."))
# authors
ctit <- gettits(rmas)
cauths <- getauths(rmas)


# first try matching by the doi
if(!is.na(zdoi)){
cdoi <- unlist(getDOI(rmas))
mj <- which(cdoi == zdoi)[1]
if(length(mj>0)){
msID <- getMSID(rmas)[[mj]]
}
}
if(is.null(msID)){
# title correlation
Rtit <- unlist(lapply(ctit, function(mstit,ztit) fuzzycor(mstit,ztit), ztit=ztit))
# check the number of Zotero authrs are in the MSA listing
Rauths <- unlist(lapply(cauths, function(msauths,zauths){
mean(sapply(zauths, function(auth){ length(agrep(auth, msauths,ignore.case=TRUE))>0}))},
zauths=zauths))
# year correlations
Ryear <- unlist(lapply(getyear(rmas), function(cyr){ 1*(cyr==zyear)}))
Rs <- data.frame(do.call('cbind', list(Rtit,Rauths,Ryear)))
mincrit <- apply(Rs, 1, function(rw) all(rw > 0.8))
Rs <- Rs[which(mincrit),]
if(nrow(Rs)!=0){
mj <- which.max(rowMeans(Rs))[1]
msID <- getMSID(rmas)[[mj]]
} else {
msID <- NA
}
} #is.null(msID)
} # resource found
iddb <- rbind(iddb, data.frame(ITEMID=zd$ITEMID[i], msID=msID,ztitle=as.character(ztit),mstitle=as.character(ctit[mj]),zdoi=zdoi,msdoi=cdoi[mj]))
}
iddb2 <- iddb[which(!is.na(iddb$msID)),] # remove not found


# STEP TWO: query the MS id and learn all the "cited" and "cited by" references
# loop through found MSA records
refs <- list(cites=as.list(rep(NA,nrow(iddb2))),cited=as.list(rep(NA,nrow(iddb2))))
names(refs$cited) <- names(refs$cites) <- iddb2$ITEMID
# make a db to store the journal results
pdb <- data.frame(msID=NULL, cited=NULL, title=NULL, author1=NULL, year=NULL, journal=NULL, doi=NULL,url=NULL)


for(i in 1:nrow(iddb2)){

# find other works which cite the current article
msID <- iddb2[i,"msID"]
textcall <- paste("http://academic.research.microsoft.com/json.svc/search?AppId=",apikey,"&PublicationID=",msID,"&ReferenceType=Citation&PublicationContent=AllInfo&StartIdx=1&EndIdx=50&OrderType=CitationCount",sep="")
con<- url(description=textcall)
rmas <- fromJSON(readLines(con))
close(con)
refs$cites[[i]] <- unlist(getMSID(rmas))
if(!is.null(unlist(getMSID(rmas)))){
pdb <- rbind(pdb,
data.frame(msID=unlist(getMSID(rmas)),
cited=unlist(lapply(rmas$d[[8]][[5]],function(lx){lx$CitationCount})),
title=unlist(gettits(rmas)),
author1=unlist(lapply(getauths(rmas), function(lx2) lx2[[1]])),
year=unlist(getyear(rmas)),
journal=unlist(getpub(rmas)),
doi=unlist(getDOI(rmas)),
url=unlist(geturl(rmas))
))
}
# find other works which are cited by the current article
msID <- iddb2[i,"msID"]
textcall <- paste("http://academic.research.microsoft.com/json.svc/search?AppId=",apikey,"&PublicationID=",msID,"&ReferenceType=Reference&PublicationContent=AllInfoIdx=1&EndIdx=50&OrderType=CitationCount",sep="")
con<- url(description=textcall)
rmas <- fromJSON(readLines(con))
close(con)
refs$cited[[i]] <- unlist(getMSID(rmas))
if(!is.null(unlist(getMSID(rmas)))){
pdb <- rbind(pdb,
data.frame(msID=unlist(getMSID(rmas)),
cited=unlist(lapply(rmas$d[[8]][[5]],function(lx){lx$CitationCount})),
title=unlist(gettits(rmas)),
author1=unlist(lapply(getauths(rmas), function(lx2) lx2[[1]])),
year=unlist(getyear(rmas)),
journal=unlist(getpub(rmas)),
doi=unlist(getDOI(rmas)),
url=unlist(geturl(rmas))
))
}
}
names(refs$cited) <- names(refs$cites) <- iddb2$ITEMID
pdb <- unique(pdb)


# STEP THREE: tally results
citetally <- list(cites=NA,cited=NA) # storage for results (back/forwards citations)
for(i in 1:length(refs)){
tpt <- table(unlist(refs[[i]]))
tpt <- tpt[order(tpt,decreasing=TRUE)]
alreadyhave <- names(tpt)[which(names(tpt) %in% iddb2$msID)]
tpt <- tpt[which(names(tpt) %in% alreadyhave==FALSE)]
tptdb <- data.frame(msID=names(tpt),zcount=as.numeric(tpt))
citetally[[i]] <- sqldf("SELECT pdb.*,tptdb.zcount as 'zcount' FROM tptdb LEFT JOIN pdb ON tptdb.msID=pdb.msID")
rm(tpt,tptdb,alreadyhave)
}


# STEP FOUR A: save final output as CSV
write.csv(citetally[[1]], paste(msadir,"most_citedby.csv",sep=""),row.names=FALSE)
write.csv(citetally[[2]], paste(msadir,"most_cited.csv",sep=""),row.names=FALSE)


# STEP FOUR B: save final output as html to open links
# best way to get things into firefox :)
sink(paste(msadir,"mostcitedby.html",sep=""))
cat(paste("<html><body>zcount is the number of articles in your database cited by the focal article<br><table><tr><th>",
paste(names(citetally[[1]])[c(9,2,4,5,3,7,8)],collapse="</th><th>"),"</th></tr>"))
htmltxt <- apply(citetally[[1]],1,function(x){
paste("<tr><td>",x[9],"</td><td>",x[2],"</td><td>",x[4],"</td><td>",x[5],
paste("</td><td><a href='http://scholar.google.com/scholar?hl=en&q=",
gsub(" ","+",x[3]),collapse="",sep=""),
"' target='_blank'>",x[3],"</a></td><td><a href='http://dx.doi.org/",x[7],"' target='_blank'>",x[7],
"</a></td><td><a href='",x[8],"' target='_blank'>website</a></td></tr>\n",sep="")})
cat(paste(htmltxt,collapse="",sep=""))
cat("</table></body></html>")
sink()
# 2nd webpage
sink(paste(msadir,"mostcited.html",sep=""))
cat(paste("<html><body>zcount is the number of articles in your database which cite the focal article<table><tr><th>",
paste(names(citetally[[1]])[c(9,2,4,5,3,7,8)],collapse="</th><th>"),"</th></tr>"))
htmltxt <- apply(citetally[[2]],1,function(x){
paste("<tr><td>",x[9],"</td><td>",x[2],"</td><td>",x[4],"</td><td>",x[5],
paste("</td><td><a href='http://scholar.google.com.proxy.lib.uiowa.edu/scholar?hl=en&q=",
gsub(" ","+",x[3]),collapse="",sep=""),
"' target='_blank'>",x[3],"</a></td><td><a href='http://dx.doi.org/",x[7],"' target='_blank'>",x[7],
"</a></td><td><a href='",x[8],"' target='_blank'>website</a></td></tr>\n",sep="")})
cat(paste(htmltxt,collapse="",sep=""))
cat("</table></body></html>")
sink()

And some example output...


Why R? The above script just serves as a one-stop-shop for SQL and JSON processing. One the side, I also use R's wonderful visualization tools and matrix processing facilities to play around with author and keywords. But really, the above script could probably be run more efficiently in Python or Java.

A special thanks to the post by Royce Kimmons at http://royce.kimmons.me/tutorials/zotero_to_excel for the SQL command to access Zotero databases.

BTW, in case you're wondering why I'm using two open-source projects with a Microsoft project: other online and free-tools such as CiteUlike or CiteseerX do NOT provide the needed forwards-citation or backwards-citation information, neither through an API or thourgh webscrapping. I'd love some alternatives

No comments:

Post a Comment