
A Comparison of Document, Sentence, and Term Event Spaces

Catherine Blake, School of Information and Library Science, University of North Carolina at Chapel Hill, North Carolina, NC 27599-3360, cablake@email.unc.edu


Presentation Transcript


  1. A Comparison of Document, Sentence, and Term Event Spaces Catherine Blake School of Information and Library Science University of North Carolina at Chapel Hill North Carolina, NC 27599-3360 cablake@email.unc.edu

  2. Classic Information Retrieval • Diagram: Information Need, Query, Document, Representation, Match • Matching: exact match = Boolean Model; weighted match = Vector Model

  3. Term Weighting • Goal: favor discriminating terms • Commonly used: TF x IDF • IDF(t_i) = log2(N) - log2(n_i) + 1 • N = total number of documents in the corpus • t_i = a term (typically a stemmed word) • n_i = number of documents that contain at least one occurrence of the term t_i • References: Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11-21. Salton, G. & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-23.

  4. Practical Motivations • Systems are moving toward sub-document retrieval • Document summarization: why not use Inverse Sentence Frequency (ISF)? • Question answering: why not use Inverse Term Frequency (ITF)? • Calculating IDF is problematic • How many documents are needed for stable IDF estimates? • Corpora have changed since the initial experiments: number of documents, vocabulary size, number of terms per document

  5. Theoretical Motivations • TF x IDF combines two different event spaces • TF – number of terms • IDF – number of documents • Are the limits of these spaces really the same ? • Foundational theories use the term space • Zipf’s Law (Zipf, 1949) • Shannon’s Theory (Shannon, 1948)

  6. Goal : Compare and Contrast • Raw term comparison • Zipf Law comparison • Direct IDF, ISF, and ITF comparison • Abstract versus full-text comparison • IDF Sensitivity

  7. Corpora • Full text scientific articles in chemistry • Initial corpus: • 103,262 articles • Published in 27 journals over the last 4 years • Two journals excluded due to formatting inconsistencies • These experiments: • 100,830 articles • 16,538,655 sentences • 526,025,066 total unstemmed terms • 2,001,730 distinct unstemmed terms • 1,391,763 distinct stemmed terms (Porter algorithm) • Table 1. Corpus summary.

  8. Example IDF, ISF, ITF • IDF(t_i) = log2(N) - log2(n_i) + 1
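
The formula on this slide can be tried out on a toy corpus. The sketch below is only an illustration: it assumes that ISF and ITF reuse the IDF formula with sentence counts and term-occurrence counts in place of document counts (the slide shows only the IDF form), and the function name `inverse_frequencies` and the nested-list input layout are hypothetical choices, not taken from the presentation.

```python
import math
from collections import Counter


def inverse_frequencies(docs):
    """Return {term: (IDF, ISF, ITF)} for a corpus.

    docs is a list of documents; each document is a list of sentences;
    each sentence is a list of (already stemmed) terms.
    Assumption: ISF and ITF apply the IDF formula to the sentence and
    term event spaces respectively.
    """
    n_docs = len(docs)
    n_sents = sum(len(doc) for doc in docs)
    n_terms = sum(len(sent) for doc in docs for sent in doc)

    doc_freq, sent_freq, term_freq = Counter(), Counter(), Counter()
    for doc in docs:
        # Count each term once per document it appears in.
        doc_freq.update({t for sent in doc for t in sent})
        for sent in doc:
            # Count each term once per sentence it appears in.
            sent_freq.update(set(sent))
            # Count every occurrence of the term.
            term_freq.update(sent)

    def inv(total, count):
        # log2(N) - log2(n_i) + 1, as on the slide.
        return math.log2(total) - math.log2(count) + 1

    return {
        t: (
            inv(n_docs, doc_freq[t]),
            inv(n_sents, sent_freq[t]),
            inv(n_terms, term_freq[t]),
        )
        for t in term_freq
    }


if __name__ == "__main__":
    toy = [
        [["acid", "base"], ["acid"]],
        [["base", "salt"]],
    ]
    for term, (idf, isf, itf) in sorted(inverse_frequencies(toy).items()):
        print(f"{term}: IDF={idf:.3f} ISF={isf:.3f} ITF={itf:.3f}")
```

A term that occurs in every document gets IDF = 1 under this formula, while its ISF and ITF can still be larger, which is one way to see that the three event spaces have different limits.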

  9. 1) Raw term comparison • Document vs Sentence Frequency (log scales)

  10. 1) Raw term comparison • Document vs Term Frequency (log scales)

  11. Luhn • Image source: Van Rijsbergen, 1979

  12. 1) Raw term comparison • Sentence vs Term Frequency (log scales)

  13. 2) Zipf Law comparison • Zipf’s Law: the frequency of terms in a corpus conforms to a power law distribution K/j^θ, where j is the rank of a term and θ is close to 1 (Zipf, 1949) • Term distributions followed a power law • θ differed between the event spaces • Average θ in document space = -1.65 • Average θ in sentence space = -1.73 • Average θ in term space = -1.73
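
One common way to estimate the exponent of a power law such as K/j^θ is a least-squares line fit on log-log axes. The sketch below shows that approach only as an illustration: the slides do not say how θ was fitted, and the function name `zipf_slope` is hypothetical. Under K/j^θ the fitted slope equals -θ, so the negative values reported on the slide are presumably slopes of this kind (an assumption).

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    frequencies: positive counts for each distinct term (any order),
    with at least two entries. Under frequency = K / rank**theta the
    returned slope is -theta.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic frequencies that follow K / j exactly (theta = 1).
    ideal = [1000 / j for j in range(1, 51)]
    print(zipf_slope(ideal))
```

The same function can be applied to document frequencies, sentence frequencies, or raw term counts, which is what makes a per-event-space comparison of θ possible.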

  14. 2) Example Document Distribution

  15. 2) θ Comparison of all journals

  16. 3) Direct IDF vs ISF comparison

  17. 3) Direct IDF vs ITF comparison

  18. 3) Direct ISF vs ITF comparison

  19. 4) Abstract versus full-text

  20. 5) IDF Sensitivity

  21. 5) IDF Sensitivity

  22. Conclusions • Raw document frequencies differ from sentence and term frequencies, in particular around the areas of important terms • It is difficult to perform a linear transformation from the document to a sub-document space • Raw term frequencies correlate well with the sentence frequencies • IDF, ISF and ITF are highly correlated

  23. Conclusions • IDF values are surprisingly stable with respect to random samples at 10% of the total corpus • Average IDF values based on only a 20% random stratified sample correlated almost perfectly with IDF • Journal-based IDF samples did not correlate well with the global IDF • The language used in abstracts is systematically different from the language used in the body of a full-text scientific document
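
The stability claim can be explored with a small experiment: compute IDF on a random sample of documents and correlate it with IDF from the full collection. The sketch below is a minimal version under stated assumptions. It uses a simple (not stratified) random sample and Pearson correlation over the terms that appear in the sample; the slides do not specify the correlation measure, and the names `idf_table`, `pearson` and `sample_idf_correlation` are hypothetical.

```python
import math
import random
from collections import Counter


def idf_table(docs):
    """IDF(t_i) = log2(N) - log2(n_i) + 1 for every term in docs.

    docs is a list of documents, each a list of terms.
    """
    n = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    return {t: math.log2(n) - math.log2(c) + 1 for t, c in doc_freq.items()}


def pearson(xs, ys):
    """Pearson correlation; both inputs must have non-zero variance."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def sample_idf_correlation(docs, fraction, seed=0):
    """Correlate IDF from a random sample with IDF from all docs.

    Assumption: simple random sampling without replacement; only terms
    that occur in the sample are compared.
    """
    rng = random.Random(seed)
    k = max(1, round(len(docs) * fraction))
    sample = rng.sample(docs, k)
    full, part = idf_table(docs), idf_table(sample)
    terms = list(part)  # every sampled term also occurs in the full corpus
    return pearson([full[t] for t in terms], [part[t] for t in terms])
```

On a real corpus one would call `sample_idf_correlation(docs, 0.10)` for several seeds and compare the results; a per-journal variant would replace the random sample with the documents of a single journal.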
