Correlation of Term Count and Document Frequency for Google N-Grams
Martin Klein and Michael L. Nelson
Old Dominion University
{mklein,mln}@cs.odu.edu
ECIR 2009, Toulouse, France, 04/08/2009
Background & Motivation
• Term frequency (TF) – inverse document frequency (IDF) is a well-known term weighting concept
• Used, among other things, to generate lexical signatures (LSs)
• TF is not hard to compute; IDF is, since it depends on global knowledge about the corpus
• When the entire web is the corpus, IDF can only be estimated!
• Most text corpora provide term count (TC) values
• Example documents: D1 = “Please, Please Me”, D2 = “Can’t Buy Me Love”, D3 = “All You Need Is Love”, D4 = “Long, Long, Long”
• TC >= DF, but is there a correlation? Can we use TC to estimate DF?
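The difference between TC and DF can be made concrete with the four example titles from the slide. The following is a minimal sketch, not the authors' code: the tokenizer (lowercasing, splitting on anything that is not a letter or apostrophe) and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# The four example documents from the slide.
DOCS = {
    "D1": "Please, Please Me",
    "D2": "Can't Buy Me Love",
    "D3": "All You Need Is Love",
    "D4": "Long, Long, Long",
}


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def term_count_and_doc_freq(docs):
    """Return (TC, DF): total occurrences per term and number of documents containing it."""
    tc = Counter()
    df = Counter()
    for text in docs.values():
        tokens = tokenize(text)
        tc.update(tokens)       # every occurrence counts toward TC
        df.update(set(tokens))  # each document counts at most once toward DF
    return tc, df


tc, df = term_count_and_doc_freq(DOCS)
for term in sorted(tc):
    print(f"{term:8s} TC={tc[term]} DF={df[term]}")
```

For "long" this gives TC = 3 but DF = 1, because all three occurrences sit in D4, while "love" has TC = DF = 2. This is why TC >= DF always holds and why the two values can diverge for terms that repeat within a document.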
Experimental Setup & Results
• Investigate correlation between TC and DF within “Web as Corpus” (WaC)
• Rank similarity of all terms
Experimental Setup & Results
• Investigate correlation between TC and DF within “Web as Corpus” (WaC)
• Spearman’s ρ and Kendall’s τ
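The two rank correlation measures named on this slide can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: ties are handled with average ranks for Spearman's ρ, and the τ-b variant of Kendall's τ is used because TC and DF values contain many ties (the slide does not say which variant was applied). The sample TC and DF lists are hypothetical.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values 1..n in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the (average) ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b, computed over all pairs in O(n^2) time."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator factors
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + tied_y_only
    untied_in_y = concordant + discordant + tied_x_only
    return (concordant - discordant) / sqrt(untied_in_x * untied_in_y)


# Hypothetical TC and DF values for six terms (not data from the study).
tc_values = [120, 95, 95, 40, 12, 3]
df_values = [80, 90, 60, 35, 12, 3]
print("Spearman rho:", spearman_rho(tc_values, df_values))
print("Kendall tau-b:", kendall_tau_b(tc_values, df_values))
```

Both measures compare only the ordering of the terms, not the raw counts, which suits the question of whether ranking terms by TC approximates ranking them by DF. For vocabularies of realistic size, a library routine with a faster algorithm would be preferable to the quadratic pair loop shown here.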
Experimental Setup & Results
• Show similarity between WaC-based TC and Google N-Gram-based TC
• TC frequencies
Experimental Setup & Results
• Top 10 terms in decreasing order of their TF/IDF values
• ∪ = 14, ∩ = 6
• Strong indicator that TC can be used to estimate DF for web pages!
• Google: screen scraping DF (?) values from the Google web interface
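The top-10 comparison on this slide can be sketched as ranking terms by a TF-IDF score and then measuring how many terms two such rankings share. The slide does not state the exact weighting, so the score TF * log(N / DF) below is an assumption (one common formulation), and all function names and term labels are hypothetical placeholders rather than data from the study.

```python
import math


def tf_idf_ranking(tf, df, num_docs, k=10):
    """Return the k terms with the highest TF * log(N / DF) score (assumed weighting)."""
    scores = {
        term: tf[term] * math.log(num_docs / df[term])
        for term in tf
        if df.get(term, 0) > 0  # skip terms without a usable DF estimate
    }
    # Sort by descending score; break ties alphabetically for a stable result.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


def overlap(ranking_a, ranking_b):
    """Return (size of intersection, size of union) of two term lists."""
    set_a, set_b = set(ranking_a), set(ranking_b)
    return len(set_a & set_b), len(set_a | set_b)


# Placeholder top-10 lists, e.g. one built from DF values and one from TC-based estimates.
top_from_df = [f"term{i}" for i in range(0, 10)]   # term0 .. term9
top_from_tc = [f"term{i}" for i in range(4, 14)]   # term4 .. term13
print(overlap(top_from_df, top_from_tc))  # (6, 14)
```

Two lists of ten terms that share six terms always have a union of 10 + 10 - 6 = 14, which matches the pair of numbers reported on the slide.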
Thank You & Come See My Poster!!!
Questions?
Martin Klein and Michael L. Nelson, Old Dominion University, {mklein,mln}@cs.odu.edu