
Correlation of Term Count and Document Frequency for Google N-Grams

Martin Klein and Michael L. Nelson, Old Dominion University, {mklein,mln}@cs.odu.edu. ECIR 2009, Toulouse, France, 04/08/2009.





Presentation Transcript


  1. Correlation of Term Count and Document Frequency for Google N-Grams
     • Martin Klein and Michael L. Nelson
     • Old Dominion University, {mklein,mln}@cs.odu.edu
     • ECIR 2009, Toulouse, France, 04/08/2009

  2. Background & Motivation
     • Term frequency (TF) – inverse document frequency (IDF) is a well-known term weighting concept
     • Used (among other things) to generate lexical signatures (LSs)
     • TF is easy to compute; IDF is not, since it depends on global knowledge about the corpus
     • When the entire web is the corpus, IDF can only be estimated!
     • Most text corpora provide term count (TC) values rather than document frequency (DF)
     • Example: D1 = “Please, Please Me”, D2 = “Can’t Buy Me Love”, D3 = “All You Need Is Love”, D4 = “Long, Long, Long”
     • TC >= DF, but is there a correlation? Can we use TC to estimate DF?
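To make the TC versus DF distinction on slide 2 concrete, here is a minimal Python sketch that counts both over the four example titles. The tokenizer (lowercasing, splitting on non-letters) is an assumption for illustration; the slides do not say how terms were extracted.

```python
import re
from collections import Counter

# The four example "documents" from slide 2.
docs = {
    "D1": "Please, Please Me",
    "D2": "Can't Buy Me Love",
    "D3": "All You Need Is Love",
    "D4": "Long, Long, Long",
}

def tokenize(text):
    # Assumed, simplistic tokenizer: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

tc = Counter()  # term count: total occurrences across the whole corpus
df = Counter()  # document frequency: number of documents containing the term

for text in docs.values():
    tokens = tokenize(text)
    tc.update(tokens)       # every occurrence counts
    df.update(set(tokens))  # each document counts at most once per term

for term in sorted(tc, key=lambda t: (-tc[t], t)):
    print(f"{term:7s} TC={tc[term]} DF={df[term]}")
```

"Long" occurs three times but only in D4, so its TC is 3 while its DF is 1; "Love" occurs once each in D2 and D3, so TC and DF are both 2. Every document that contains a term contributes at least one occurrence, which is why TC can never fall below DF.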

  3. Experimental Setup & Results
     • Investigate correlation between TC and DF within “Web as Corpus” (WaC)
     • Rank similarity of all terms

  4. Experimental Setup & Results
     • Investigate correlation between TC and DF within “Web as Corpus” (WaC)
     • Spearman’s ρ and Kendall τ
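The two rank correlation measures named on slide 4 can be sketched in plain Python as follows. This is an illustrative implementation (Spearman's ρ as the Pearson correlation of average ranks, Kendall's τ in its tie-aware τ-b form), not necessarily the variant or tooling the authors used; the function names are hypothetical, and the demo values are the TC and DF counts of the slide 2 toy titles rather than WaC data.

```python
from itertools import combinations
from math import sqrt

def average_ranks(values):
    # Rank 1 = largest value; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    # Spearman's rho: Pearson correlation computed on the ranks.
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    # Kendall's tau-b: (concordant - discordant) pairs, normalized with tie correction.
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both normalization terms
        elif dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + only_x_tied) * (pairs + only_y_tied))

# TC and DF of the slide 2 toy titles, in the order:
# long, love, me, please, all, buy, can't, is, need, you
tc_values = [3, 2, 2, 2, 1, 1, 1, 1, 1, 1]
df_values = [1, 2, 2, 1, 1, 1, 1, 1, 1, 1]

print("Spearman rho:", spearman_rho(tc_values, df_values))
print("Kendall tau-b:", kendall_tau_b(tc_values, df_values))
```

Both measures compare only the ordering of terms, not the raw counts, which suits the question on the slides: TC and DF differ in magnitude, so what matters is whether they rank terms similarly.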

  5. Experimental Setup & Results
     • Show similarity between WaC-based TC and Google N-Gram-based TC
     • TC frequencies

  6. Experimental Setup & Results
     • Top 10 terms in decreasing order of their TF/IDF values
     • ∪ = 14, ∩ = 6
     • Strong indicator that TC can be used to estimate DF for web pages!
     • Google: screen scraping DF (?) values from the Google web interface
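The comparison on slide 6 can be read as: rank a page's terms by TF-IDF using two different frequency sources, take the top 10 from each, and report the size of the union and intersection of the two lists. The sketch below follows that reading under stated assumptions: the scoring formula `tf * log(n_docs / df)` is one common TF-IDF variant, and the function names, term frequencies, frequency tables, and corpus size are all hypothetical illustration values, not data from the talk.

```python
from math import log

def tfidf_top_k(tf, df, n_docs, k=10):
    # Score each term by tf * log(n_docs / df); terms without a frequency entry are skipped.
    scores = {t: f * log(n_docs / df[t]) for t, f in tf.items() if df.get(t, 0) > 0}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

def overlap(list_a, list_b):
    # Return (size of union, size of intersection) of two term lists.
    a, b = set(list_a), set(list_b)
    return len(a | b), len(a & b)

# Hypothetical term frequencies for a single page.
tf = {"ngram": 5, "corpus": 3, "the": 10, "web": 2}

# Hypothetical frequency tables from two sources (e.g. one DF-based, one TC-based).
freq_source_a = {"ngram": 10, "corpus": 50, "the": 1000, "web": 400}
freq_source_b = {"ngram": 20, "corpus": 40, "the": 990, "web": 300}
n_docs = 1000  # hypothetical corpus size

top_a = tfidf_top_k(tf, freq_source_a, n_docs, k=2)
top_b = tfidf_top_k(tf, freq_source_b, n_docs, k=2)
print(top_a, top_b, overlap(top_a, top_b))
```

For two lists of 10 terms each, the union and intersection sizes always sum to 20, so the reported ∪ = 14 and ∩ = 6 are consistent: six terms appear in both top-10 lists and four are unique to each.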

  7. Thank You & Come See My Poster!!!
     • Questions
