node.js - Google Ngram sorting?
From what I understand, each file in Google's ngram dataset contains a list of ngrams, sorted alphabetically and then numerically by year. However, assuming the data is UTF-8 (which the file command says is correct), и is code point 1080 and I is 73, so I don't understand why использовал_num comes before I'academie_pron. The relevant lines from the file (starting at line #131356):
использовал_num 2005 4 1
I'academie_pron 1813 1 1
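To make the expectation concrete, here is a minimal sketch of the comparison the question has in mind. It assumes plain UTF-8 byte order (which matches code point order) and uses the two ngrams quoted above; the variable and function names are made up for illustration.

```javascript
// Code points of the two leading characters.
const cyrillicCodePoint = "и".codePointAt(0); // U+0438
const latinCodePoint = "I".codePointAt(0); // U+0049

// Compare two strings by their raw UTF-8 bytes.
// Buffer.compare returns -1, 0 or 1.
function compareUtf8Bytes(a, b) {
  return Buffer.compare(Buffer.from(a, "utf8"), Buffer.from(b, "utf8"));
}

const earlierInFile = "использовал_num";
const laterInFile = "I'academie_pron";

// A positive result means the first argument sorts AFTER the second,
// which is the opposite of the order seen in the file.
const byteOrder = compareUtf8Bytes(earlierInFile, laterInFile);

console.log(cyrillicCodePoint, latinCodePoint, byteOrder);
```

Under this ordering the Cyrillic ngram should come after the Latin one, which is why the order in the file looks wrong.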
Here's ngram-sort-test.js, with the broken comparison function highlighted. To run it, download this file from Google and un-gzip it in the same directory as ngram-sort-test.js.
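The original script is not reproduced here, so what follows is only a rough sketch of how such a check could look, not the author's ngram-sort-test.js. It assumes each row is tab-separated with the ngram first and the year second, and it compares ngrams by UTF-8 bytes and then years numerically. All function names are hypothetical.

```javascript
const fs = require("node:fs");
const readline = require("node:readline");

// Assumption: rows look like "ngram<TAB>year<TAB>match_count<TAB>volume_count".
function parseRow(line) {
  const [ngram, year] = line.split("\t");
  return { ngram, year: Number(year) };
}

// Order by ngram bytes first, then by year.
function compareRows(a, b) {
  const byNgram = Buffer.compare(
    Buffer.from(a.ngram, "utf8"),
    Buffer.from(b.ngram, "utf8")
  );
  return byNgram !== 0 ? byNgram : a.year - b.year;
}

// Returns a function that reports whether each new line is in order
// relative to the line it was given just before.
function createOrderChecker() {
  let previous = null;
  return function isInOrder(line) {
    const row = parseRow(line);
    const ok = previous === null || compareRows(previous, row) <= 0;
    previous = row;
    return ok;
  };
}

// Streams the file and resolves to the 1-based number of the first
// out-of-order line, or -1 if none is found.
async function findFirstOutOfOrderLine(path) {
  const lines = readline.createInterface({
    input: fs.createReadStream(path, { encoding: "utf8" }),
    crlfDelay: Infinity,
  });
  const isInOrder = createOrderChecker();
  let lineNumber = 0;
  for await (const line of lines) {
    lineNumber += 1;
    if (!isInOrder(line)) {
      lines.close();
      return lineNumber;
    }
  }
  return -1;
}
```

It could be called as findFirstOutOfOrderLine("googlebooks-eng-all-1gram-20120701-i").then(console.log) once the file has been un-gzipped into the working directory.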
This is not an answer, but as a workaround you can manually sort the file using:
LC_ALL=C sort <googlebooks-eng-all-1gram-20120701-i >googlebooks-eng-all-1gram-20120701-i.sorted
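For small samples, the same byte-wise ordering can be approximated inside Node. This is only a sketch that assumes the lines fit in memory, and the function name is hypothetical. The default Array.prototype.sort comparison works on UTF-16 code units rather than UTF-8 bytes, so the buffers are compared explicitly.

```javascript
// Sort whole lines by their raw UTF-8 bytes, similar in spirit to
// running sort under the C locale. Buffers are built once per line.
function sortLinesByBytes(lines) {
  return lines
    .map((line) => ({ line, bytes: Buffer.from(line, "utf8") }))
    .sort((a, b) => Buffer.compare(a.bytes, b.bytes))
    .map((entry) => entry.line);
}

// The two rows from the question, assuming tab-separated fields.
const sampleLines = [
  "использовал_num\t2005\t4\t1",
  "I'academie_pron\t1813\t1\t1",
];
const sortedSample = sortLinesByBytes(sampleLines);

// The Latin-initial ngram is printed first.
console.log(sortedSample.map((line) => line.split("\t")[0]));
```

For the full dataset file, the shell command above is the more practical route, since it does not need to hold every line in memory as a JavaScript string.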