A Comparison of Document, Sentence, and Term Event Spaces Catherine Blake School of Information and Library Science University of North Carolina at Chapel Hill North Carolina, NC 27599-3360 cablake@email.unc.edu
Classic Information Retrieval • [Diagram: Information Need, Document, Representation, Query, Match] • Matching - Exact match = Boolean Model - Weighted match = Vector Model
Term Weighting • Goal: Favor discriminating terms • Commonly used: TF x IDF • IDF(ti) = log2(N) – log2(ni) + 1 • N = total number of documents in the corpus • ti = a term (typically a stemmed word) • ni = number of documents that contain at least one occurrence of the term ti Sparck Jones, K. (1972) A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11-21. Salton, G. & Buckley, C. (1988) Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-23.
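The IDF formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names and the four-document toy corpus are made up for the example, and tokens are assumed to be already stemmed.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """IDF(t_i) = log2(N) - log2(n_i) + 1, as written on the slide."""
    if doc_freq <= 0 or doc_freq > n_docs:
        raise ValueError("doc_freq must be between 1 and n_docs")
    return math.log2(n_docs) - math.log2(doc_freq) + 1


def document_frequencies(docs):
    """For each term, count the documents that contain it at least once."""
    df = {}
    for tokens in docs:
        for term in set(tokens):  # set(): repeated terms count once per document
            df[term] = df.get(term, 0) + 1
    return df


# Hypothetical toy corpus: each document is a list of (already stemmed) terms.
toy_docs = [
    ["acid", "base", "acid"],
    ["acid", "salt"],
    ["base", "salt"],
    ["acid", "ion"],
]

df = document_frequencies(toy_docs)
weights = {term: idf(len(toy_docs), n) for term, n in df.items()}

for term in sorted(weights, key=weights.get, reverse=True):
    print(f"{term:5s} n_i={df[term]}  IDF={weights[term]:.3f}")
```

The rarest term ("ion", found in one of four documents) receives the largest weight, and a term found in every document would receive the minimum weight of 1 because of the "+1" in the formula.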
Practical Motivations • Systems moving toward sub-document retrieval • Document Summarization – Why not use Inverse Sentence Frequency (ISF)? • Question Answering – Why not use Inverse Term Frequency (ITF)? • Calculating IDF is problematic • How many documents are needed for stable IDF estimates? • Corpora have changed since initial experiments • # documents • Vocabulary size • # terms per document
Theoretical Motivations • TF x IDF combines two different event spaces • TF – number of terms • IDF – number of documents • Are the limits of these spaces really the same ? • Foundational theories use the term space • Zipf’s Law (Zipf, 1949) • Shannon’s Theory (Shannon, 1948)
Goal : Compare and Contrast • Raw term comparison • Zipf Law comparison • Direct IDF, ISF, and ITF comparison • Abstract versus full-text comparison • IDF Sensitivity
Corpora • Full text scientific articles in chemistry • Initial corpus: • 103,262 articles • Published in 27 journals over the last 4 years • Two journals excluded due to formatting inconsistencies • These experiments: • 100,830 articles • 16,538,655 sentences • 526,025,066 total unstemmed terms • 2,001,730 distinct unstemmed terms • 1,391,763 distinct stemmed terms (Porter algorithm) Table 1. Corpus summary.
Example IDF, ISF, ITF IDF(ti)=log2(N)–log2(ni)+1
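The three event spaces can be compared on a toy corpus with the sketch below. Only the IDF formula appears on the slide; applying the same log2(total) – log2(count) + 1 form to sentence counts (ISF) and to raw term occurrences (ITF) is an assumption made here for illustration, as are the function names and the data.

```python
import math
from collections import Counter


def inverse_frequency(total: int, count: int) -> float:
    """log2(total) - log2(count) + 1 (slide's IDF form, reused for ISF/ITF by assumption)."""
    return math.log2(total) - math.log2(count) + 1


def event_space_weights(corpus):
    """corpus: list of documents; a document is a list of sentences;
    a sentence is a list of terms."""
    doc_freq, sent_freq, term_freq = Counter(), Counter(), Counter()
    n_docs = n_sents = n_terms = 0
    for doc in corpus:
        n_docs += 1
        doc_terms = set()
        for sent in doc:
            n_sents += 1
            n_terms += len(sent)
            term_freq.update(sent)       # every occurrence counts
            sent_freq.update(set(sent))  # once per sentence
            doc_terms.update(sent)
        doc_freq.update(doc_terms)       # once per document
    return {
        term: {
            "idf": inverse_frequency(n_docs, doc_freq[term]),
            "isf": inverse_frequency(n_sents, sent_freq[term]),
            "itf": inverse_frequency(n_terms, term_freq[term]),
        }
        for term in term_freq
    }


# Hypothetical corpus: 2 documents, 4 sentences, 8 term occurrences.
toy_corpus = [
    [["acid", "base"], ["acid", "salt"]],
    [["salt", "ion"], ["base", "base"]],
]

weights = event_space_weights(toy_corpus)
for term, w in sorted(weights.items()):
    print(term, {k: round(v, 3) for k, v in w.items()})
```

Because the totals differ by event space (documents, sentences, term occurrences), the three weights sit on different scales even when they rank terms similarly.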
1) Raw term comparison • Document vs Sentence Frequency (log scales)
1) Raw term comparison • Document vs Term Frequency (log scales)
Luhn Image Source: Van Rijsbergen, 1979
1) Raw term comparison • Sentence vs Term Frequency (log scales)
2) Zipf Law comparison • Zipf’s Law: The frequency of terms in a corpus conforms to a power law distribution K/j^θ, where j is the rank of the term and θ is close to 1 (Zipf, 1949) • Term distributions followed a power law • θ differed between the event spaces • Average θ in document space = -1.65 • Average θ in sentence space = -1.73 • Average θ in term space = -1.73
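The slides do not say how θ was estimated. One common approach, assumed here, is an ordinary least-squares fit of log(frequency) against log(rank). Under K/j^θ the fitted slope equals -θ, so the negative values listed above may be slopes of this kind, although that reading is also an assumption. The function name and synthetic data are illustrative.

```python
import math


def fit_power_law_slope(frequencies):
    """Least-squares slope of log(frequency) versus log(rank).

    For frequencies of the form K / j**theta the slope is -theta.
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic frequencies that follow K / j**1.7 exactly (K = 1000, 200 ranks).
synthetic = [1000.0 / j ** 1.7 for j in range(1, 201)]
slope = fit_power_law_slope(synthetic)
print(f"fitted slope = {slope:.3f}")
```

On real term counts the fit is usually sensitive to the very rare terms in the tail, so the range of ranks included in the regression is a design choice worth reporting alongside the estimate.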
Conclusions • Raw document frequencies differ from sentence and term frequencies, particularly around the areas of important terms • It is difficult to perform a linear transformation from the document to a sub-document space • Raw term frequencies correlate well with the sentence frequencies • IDF, ISF and ITF are highly correlated
Conclusions • IDF values are surprisingly stable with respect to random samples at 10% of the total corpus • Average IDF values based on only a 20% random stratified sample correlated almost perfectly with IDF • Journal-based IDF samples did not correlate well with the global IDF • The language used in abstracts is systematically different from the language used in the body of a full-text scientific document
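A sensitivity check of this kind can be sketched as follows. This is not the procedure used in the study: it draws a simple random sample rather than a stratified one, uses Pearson correlation (the slides do not name the measure), and runs on a synthetic corpus. All names and parameters are hypothetical.

```python
import math
import random


def idf_table(docs):
    """IDF(t) = log2(N) - log2(n_t) + 1 for every term in docs."""
    n = len(docs)
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return {t: math.log2(n) - math.log2(c) + 1 for t, c in df.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sample_idf_correlation(docs, fraction, seed=0):
    """Correlate IDF from a simple random sample with IDF from all docs,
    over the terms that occur in the sample."""
    rng = random.Random(seed)
    k = max(2, round(len(docs) * fraction))
    sample = rng.sample(docs, k)
    full, part = idf_table(docs), idf_table(sample)
    shared = sorted(part)  # every sampled term also occurs in the full corpus
    return pearson([full[t] for t in shared], [part[t] for t in shared])


# Synthetic corpus: term t_i appears in a document with probability 1/i.
gen = random.Random(42)
vocab = [f"t{i}" for i in range(1, 41)]
toy_docs = [
    [t for i, t in enumerate(vocab, start=1) if gen.random() < 1.0 / i]
    for _ in range(200)
]

for fraction in (0.1, 0.2, 1.0):
    r = sample_idf_correlation(toy_docs, fraction)
    print(f"{fraction:.0%} sample: r = {r:.3f}")
```

Restricting the comparison to terms present in the sample is one design choice among several; terms missing from a small sample have no sample IDF at all, which is itself part of the stability question.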