text - Bigrams instead of single words in a term-document matrix using R and RWeka

I've found a way to use bigrams instead of single tokens in a term-document matrix. The solution has been posted on Stack Overflow here: findAssocs for multiple terms in R

The idea goes something like this:

    library(tm)
    library(RWeka)
    data(crude)

    # Tokenizer for n-grams, passed on to the term-document matrix constructor
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

However, the final line gives me this error:

    Error in rep(seq_along(x), sapply(tflist, length)) :
      invalid 'times' argument
    In addition: Warning message:
    In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

If I remove the tokenizer from the last line, it creates a regular TDM, so I guess the problem is somewhere in the BigramTokenizer function, although this is the same example that the Weka site gives here: http://tm.r-forge.r-project.org/faq.html#bigrams.
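To check what a bigram tokenizer is expected to return without involving RWeka at all, a minimal base-R sketch might look like the following. The helper name `simple_bigrams` is hypothetical and not part of tm or RWeka, and it splits on whitespace only, so it is a rough stand-in for the Weka tokenizer rather than a replacement for it.

```r
# Hypothetical helper (not part of tm or RWeka): whitespace-based bigrams.
simple_bigrams <- function(text) {
  # Split on runs of whitespace and drop any empty strings
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # Fewer than two words means there are no bigrams
  if (length(words) < 2) return(character(0))
  # Pair each word with its successor
  paste(head(words, -1), tail(words, -1))
}

simple_bigrams("crude oil prices rose")
# [1] "crude oil"   "oil prices"  "prices rose"
```

If the term-document matrix builds with a plain function like this but fails with the RWeka-based tokenizer, that would point at the RWeka call rather than at tm itself.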

Inspired by Anthony's comment, I found out that you can specify the number of threads that the parallel library uses by default (specify it before you call NGramTokenizer):

    # Sets the default number of threads to use
    options(mc.cores = 1)

Since NGramTokenizer seems to hang on the parallel::mclapply call, changing the number of threads seems to work around it.
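Putting the pieces together, the whole script with the workaround applied might look like this. It is a sketch that assumes tm and RWeka (plus a working Java installation) are available, and the final `Terms()` call is only there to look at a few of the resulting bigrams.

```r
library(tm)
library(RWeka)
data(crude)

# Limit the parallel package to a single core BEFORE any tokenizer call,
# so that parallel::mclapply does not fork around the Java-backed tokenizer
options(mc.cores = 1)

# Tokenizer for bigrams (min = max = 2), passed to the TDM constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

# Look at a few of the bigram terms
head(Terms(txtTdmBi))
```

Note that `options(mc.cores = 1)` applies to the whole R session, so any other code that relies on that option will also run on a single core until it is changed back.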

