Bigrams instead of single words in a term-document matrix using R and RWeka
I've found a way to use bigrams instead of single tokens in a term-document matrix. The solution was posted on Stack Overflow here: findAssocs for multiple terms in R
The idea goes like this:
    library(tm)
    library(RWeka)
    data(crude)

    # Tokenizer for n-grams, passed on to the term-document matrix constructor
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
However, the final line (the TermDocumentMatrix call) gives me this error:
    Error in rep(seq_along(x), sapply(tflist, length)) :
      invalid 'times' argument
    In addition: Warning message:
    In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
If I remove the tokenizer from the last line, it creates a regular TDM, so I guess the problem is somewhere in the BigramTokenizer function, although this is the same example that the tm FAQ gives here: http://tm.r-forge.r-project.org/faq.html#bigrams.
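For reference, a bigram tokenizer should turn a text into overlapping two-word terms. The sketch below uses only base R to show the kind of output one would expect. It is not how RWeka does it: `simple_bigrams` is a hypothetical helper that splits naively on whitespace and expects a single string.

```r
# Hypothetical helper: naive bigram tokenizer using only base R.
# Expects a single character string, not a whole corpus.
simple_bigrams <- function(text) {
  # Split on runs of whitespace and drop empty strings.
  words <- unlist(strsplit(text, "[[:space:]]+"))
  words <- words[nzchar(words)]
  # Fewer than two words means there are no bigrams.
  if (length(words) < 2) return(character(0))
  # Pair each word with its successor.
  paste(head(words, -1), tail(words, -1))
}

simple_bigrams("crude oil prices fell")
# [1] "crude oil"   "oil prices"  "prices fell"
```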
Inspired by Anthony's comment, I found out that you can specify the number of threads that the parallel library uses by default (specify it before you call NGramTokenizer):
    # Sets the default number of threads to use
    options(mc.cores = 1)
Since NGramTokenizer seems to hang on the parallel::mclapply call, changing the number of threads seems to work around it.
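Putting the pieces together, a minimal sketch of the workaround might look like this. It assumes the tm and RWeka packages (and the Java installation RWeka depends on) are available. The requireNamespace() guard is an addition for this sketch, so the snippet simply skips the tokenizing step when they are missing.

```r
# Limit the parallel package to a single core *before* any tokenizing happens.
options(mc.cores = 1)

# Guard (added for this sketch): only run if both packages can be loaded.
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("RWeka", quietly = TRUE)) {
  library(tm)
  library(RWeka)
  data(crude)

  # Same tokenizer as in the question: n-grams with min = max = 2.
  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

  # Peek at a few terms; each should be a two-word phrase.
  print(head(Terms(txtTdmBi)))
}
```

The important part is the ordering: options(mc.cores = 1) has to be set before the TermDocumentMatrix call that triggers the tokenizer.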