nlp - tf-idf using data on unigram frequency from Google
I'm trying to identify the important terms in a set of government documents. Generating the term frequencies is no problem.
For the document frequency, I was hoping to use the handy Python scripts and accompanying data that Peter Norvig posted for his chapter in "Beautiful Data", which include the frequencies of unigrams in a huge corpus of data from the web.
My understanding of tf-idf, however, is that "document frequency" refers to the number of documents containing a term, not the share of total words that are that term, which is what the Norvig script gives. Can I still use this data for a crude tf-idf operation?
Here's some sample data:
word     tf      global frequency
china    1684    0.000121447
the      352385  0.022573582
economy  6602    0.0000451130774123
and      160794  0.012681757
iran     2779    0.0000231482902018
romney   1159    0.000000678497795593
Simply dividing tf by gf gives "the" a higher score than "economy," which can't be right. Is there some basic math I'm missing, perhaps?
As I understand it, the global frequency corresponds to the "inverse total term frequency" mentioned here by Robertson. From Robertson's paper:
"One possible way to get away from this problem would be to make a radical replacement for IDF (that is, radical in principle, although it may not be so radical in terms of its practical effects). .... the probability from the event space of documents to the event space of term positions in the concatenated text of the documents in the collection. Then we have a new measure, called here inverse total term frequency: ... On the whole, experiments with inverse total term frequency weights have tended to show that they are not as effective as IDF weights."
According to this text, you can use the inverse global frequency as the IDF term, albeit a cruder one than the standard one.
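As a rough illustration of that idea, here is a minimal sketch that multiplies each tf by the log of the inverse global frequency. The names `SAMPLE`, `ittf_weight` and `crude_tfidf` are made up for this example, and the choice of the natural log is an assumption; it is not taken from Norvig's scripts or from Robertson's paper.

```python
import math

# A few rows from the question's sample data: word -> (tf, global frequency).
# The label "the" for the 352385 row is taken from the question's own wording.
SAMPLE = {
    "china": (1684, 0.000121447),
    "the": (352385, 0.022573582),
    "economy": (6602, 0.0000451130774123),
    "iran": (2779, 0.0000231482902018),
    "romney": (1159, 0.000000678497795593),
}


def ittf_weight(global_freq):
    """Log of the inverse global frequency, used here as a stand-in for IDF."""
    if not 0.0 < global_freq <= 1.0:
        raise ValueError("global frequency must be in (0, 1]")
    return math.log(1.0 / global_freq)


def crude_tfidf(tf, global_freq):
    """Term frequency scaled by the inverse-total-term-frequency weight."""
    return tf * ittf_weight(global_freq)


scores = {word: crude_tfidf(tf, gf) for word, (tf, gf) in SAMPLE.items()}

# Print the terms from highest to lowest score.
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:8s} {score:14.1f}")
```

Because tf here is a raw count, a very frequent word can still end up with a large score even though its weight is small, which is one reason this is cruder than a real document-frequency IDF.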
You are also missing stop word removal. Stop words, such as "the" in your sample, are used in practically all documents and therefore do not give any information. Before computing tf-idf, you should remove such stop words.
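A minimal sketch of that filtering step, assuming the term counts are held in a plain dict. The tiny `STOP_WORDS` set and the helper `remove_stop_words` are hypothetical and only for illustration; a real stop list would be much longer.

```python
# Illustrative stop list only; real stop lists are much longer.
STOP_WORDS = {"the", "and", "of", "a", "in", "to"}


def remove_stop_words(term_counts, stop_words=STOP_WORDS):
    """Return a copy of term_counts without the stop words (case-insensitive)."""
    return {
        term: count
        for term, count in term_counts.items()
        if term.lower() not in stop_words
    }


# Term frequencies taken from the question's sample data.
term_counts = {
    "china": 1684,
    "the": 352385,
    "economy": 6602,
    "iran": 2779,
    "romney": 1159,
}

filtered = remove_stop_words(term_counts)

# Remaining terms, most frequent first.
print(sorted(filtered, key=filtered.get, reverse=True))
```

Filtering before scoring means the stop words never reach the tf-idf step at all, so their very large counts cannot dominate the ranking.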