node.js - Google Ngram sorting?
From what I understand, each file in Google's Ngram dataset contains a list of ngrams, sorted alphabetically and then numerically by year. However, assuming the data is UTF-8 (which file says is correct), и is code point 1080 and I is 73, so I don't understand why использовал_NUM comes before I'academie_PRON. The relevant lines from the file (starting at line #131356):

использовал_NUM 2005 4 1
I'academie_PRON 1813 1 1

Here's ngram-sort-test.js, with the broken comparison function highlighted. To run it, download this file from Google and un-gzip it in the same directory as ngram-sort-test.js.
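The code-point claim in the question can be checked directly in Node.js. The following is a minimal sketch, not the original ngram-sort-test.js; compareByCodePoint is a hypothetical helper introduced only for illustration, and the ngrams are written with the capital I that code point 73 implies.

```javascript
// Code points of the first character of each ngram.
const cyrillic = "и".codePointAt(0); // U+0438
const latin = "I".codePointAt(0);    // U+0049

// Hypothetical helper: compare two strings code point by code point.
function compareByCodePoint(a, b) {
  // Array.from iterates a string by code point, not by UTF-16 code unit.
  const as = Array.from(a);
  const bs = Array.from(b);
  const n = Math.min(as.length, bs.length);
  for (let i = 0; i < n; i++) {
    const d = as[i].codePointAt(0) - bs[i].codePointAt(0);
    if (d !== 0) return d;
  }
  // All shared positions are equal, so the shorter string sorts first.
  return as.length - bs.length;
}

console.log(cyrillic, latin); // 1080 73
// A positive result means the Cyrillic ngram should sort after the Latin one.
console.log(compareByCodePoint("использовал_NUM", "I'academie_PRON") > 0); // true
```

Under a plain code-point ordering the Cyrillic ngram sorts after the Latin one, so the order seen in the file does not follow code points.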
This is not an answer, but a workaround is to manually sort the file using LC_ALL=C sort <googlebooks-eng-all-1gram-20120701-i >googlebooks-eng-all-1gram-20120701-i.sorted.
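The same byte-wise ordering that LC_ALL=C sort applies can be approximated inside Node.js by comparing the UTF-8 bytes of each line. This is a sketch under that assumption; compareBytes is a hypothetical name, and the two sample lines are the ones quoted in the question (with the capital I implied by code point 73), not read from the real file.

```javascript
// Compare two lines by their raw UTF-8 bytes, similar to what `LC_ALL=C sort` does.
function compareBytes(a, b) {
  // Buffer.compare returns -1, 0, or 1 based on byte-by-byte comparison.
  return Buffer.compare(Buffer.from(a, "utf8"), Buffer.from(b, "utf8"));
}

const lines = [
  "использовал_NUM 2005 4 1",
  "I'academie_PRON 1813 1 1",
];

// Copy before sorting so the original order stays available for comparison.
const sorted = [...lines].sort(compareBytes);
console.log(sorted);
```

For valid UTF-8, byte order and code-point order agree, so this puts the line starting with I (byte 0x49) ahead of the line starting with и (bytes 0xD0 0xB8). For a file of this size, streaming the lines or using the shell command above is likely more practical than sorting an in-memory array.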