A Comparison of Document, Sentence, and Term Event Spaces

Catherine Blake, School of Information and Library Science, University of North Carolina at Chapel Hill, North Carolina, NC 27599-3360, cablake@email.unc.edu

Classic Information Retrieval • [Diagram with the labels: Information Need, Representation, Query, Document, Representation, Match] • Matching - Exact match = Boolean Model - Weighted match = Vector Model

Term Weighting • Goal: Favor discriminating terms • Commonly used: TF x IDF • $\mathrm{IDF}(t_i) = \log_2(N) - \log_2(n_i) + 1$ • $N$ = total number of documents in the corpus • $t_i$ = a term (typically a stemmed word) • $n_i$ = number of documents that contain at least one occurrence of the term $t_i$
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11-21.
Salton, G. & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-23.
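The IDF definition above can be sketched in a few lines of Python. This is a minimal illustration only: the toy corpus, the `idf` and `document_frequencies` helper names, and the assumption that documents arrive already tokenized and stemmed are all hypothetical and not taken from the slides.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """IDF(t_i) = log2(N) - log2(n_i) + 1, following the slide's definition."""
    if not 1 <= doc_freq <= num_docs:
        raise ValueError("doc_freq must be between 1 and num_docs")
    return math.log2(num_docs) - math.log2(doc_freq) + 1


def document_frequencies(docs):
    """Count, for each term, the number of documents containing it at least once."""
    df = {}
    for doc in docs:
        for term in set(doc):  # set() so repeated terms count once per document
            df[term] = df.get(term, 0) + 1
    return df


# Hypothetical toy corpus: each document is a list of already-stemmed terms.
docs = [
    ["acid", "base", "reaction"],
    ["acid", "catalyst"],
    ["polymer", "reaction", "acid"],
    ["polymer", "solvent"],
]

df = document_frequencies(docs)
idf_values = {term: idf(len(docs), n) for term, n in df.items()}
```

A term that occurs in every document receives the minimum weight of 1, and the weight grows as the term becomes rarer, which is the sense in which IDF favors discriminating terms.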

Practical Motivations • Systems moving toward sub-document retrieval • Document Summarization – Why not use Inverse Sentence Frequency (ISF)? • Question Answering – Why not use Inverse Term Frequency (ITF)? • Calculating IDF is problematic • How many documents are needed for stable IDF estimates? • Corpora have changed since initial experiments • # documents • Vocabulary size • # terms per document

Theoretical Motivations • TF x IDF combines two different event spaces • TF – number of terms • IDF – number of documents • Are the limits of these spaces really the same? • Foundational theories use the term space • Zipf’s Law (Zipf, 1949) • Shannon’s Theory (Shannon, 1948)

Goal : Compare and Contrast • Raw term comparison • Zipf Law comparison • Direct IDF, ISF, and ITF comparison • Abstract versus full-text comparison • IDF Sensitivity

Corpora • Full text scientific articles in chemistry • Initial corpus: • 103,262 articles • Published in 27 journals over the last 4 years • Two journals excluded due to formatting inconsistencies • These experiments: • 100,830 articles • 16,538,655 sentences • 526,025,066 total unstemmed terms • 2,001,730 distinct unstemmed terms • 1,391,763 distinct stemmed terms (Porter algorithm) Table 1. Corpus summary.

Example IDF, ISF, ITF • $\mathrm{IDF}(t_i) = \log_2(N) - \log_2(n_i) + 1$
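The slide gives only the IDF formula. The sketch below assumes that ISF and ITF are defined by analogy, swapping the document counts for sentence counts (ISF) or term-occurrence counts (ITF). That analogy, the toy corpus, and the `inverse_frequency` and `event_space_weights` names are assumptions made for illustration.

```python
import math
from collections import Counter


def inverse_frequency(total: int, count: int) -> float:
    """Shared form log2(total) - log2(count) + 1, applied to any event space."""
    return math.log2(total) - math.log2(count) + 1


def event_space_weights(corpus):
    """Compute IDF, ISF, and ITF for every term.

    `corpus` is a list of documents, each a list of sentences, each a list
    of terms. ISF and ITF are ASSUMED to mirror the IDF formula.
    """
    doc_counts, sent_counts, term_counts = Counter(), Counter(), Counter()
    n_docs, n_sents, n_terms = len(corpus), 0, 0
    for doc in corpus:
        doc_terms = set()
        for sent in doc:
            n_sents += 1
            n_terms += len(sent)
            term_counts.update(sent)       # every occurrence counts
            sent_counts.update(set(sent))  # once per sentence
            doc_terms.update(sent)
        doc_counts.update(doc_terms)       # once per document
    return {
        term: {
            "idf": inverse_frequency(n_docs, doc_counts[term]),
            "isf": inverse_frequency(n_sents, sent_counts[term]),
            "itf": inverse_frequency(n_terms, term_counts[term]),
        }
        for term in term_counts
    }


# Hypothetical toy corpus: 2 documents, 3 sentences, 10 term occurrences.
corpus = [
    [["acid", "reacts", "with", "base"], ["acid", "is", "strong"]],
    [["base", "is", "weak"]],
]
weights = event_space_weights(corpus)
```

Because the three weights differ only in which events are counted, the same term can rank differently in each space. Here "acid" appears in one of the two documents but in two of the three sentences.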

1) Raw term comparison • Document vs Sentence Frequency (log scales)

1) Raw term comparison • Document vs Term Frequency (log scales)

Luhn Image Source: Van Rijsbergen, 1979

1) Raw term comparison • Sentence vs Term Frequency (log scales)

2) Zipf Law comparison • Zipf’s Law: The frequency of terms in a corpus conforms to a power law distribution $K / j^{\theta}$, where θ is close to 1 (Zipf, 1949) • Term distributions followed a power law • θ differed between the event spaces • Average θ in document space = -1.65 • Average θ in sentence space = -1.73 • Average θ in term space = -1.73
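One common way to estimate the exponent of a power law $K / j^{\theta}$ is to fit a straight line to log frequency against log rank. The sketch below takes that approach, which is an assumption rather than the fitting procedure used for the slides. The slope it returns equals $-\theta$, which would be consistent with the negative values reported above if those are log-log slopes. The `zipf_slope` name and the synthetic data are hypothetical.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    For frequencies that follow K / j**theta exactly, the slope is -theta.
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic frequencies generated from K / j**theta with K = 1000, theta = 1.5.
synthetic = [1000 / j ** 1.5 for j in range(1, 201)]
slope = zipf_slope(synthetic)
```

The same function can be applied to document, sentence, or term frequencies to compare the fitted exponent across event spaces.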

2) Example Document Distribution

2) θ Comparison of all journals

3) Direct IDF vs ISF comparison

3) Direct IDF vs ITF comparison

3) Direct ISF vs ITF comparison

4) Abstract versus full-text

5) IDF Sensitivity

Conclusions • Raw document frequencies differ from sentence and term frequencies around the areas of important terms • It is difficult to perform a linear transformation from the document to a sub-document space • Raw term frequencies correlate well with sentence frequencies • IDF, ISF, and ITF are highly correlated

Conclusions • IDF values are surprisingly stable with respect to random samples at 10% of the total corpus • Average IDF values based on only a 20% random stratified sample correlated almost perfectly with the IDF • Journal-based IDF samples did not correlate well with the global IDF • The language used in abstracts is systematically different from the language used in the body of a full-text scientific document
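The kind of sensitivity check described above can be sketched as follows: compute IDF on the full corpus and on a random sample, then correlate the two over the terms they share. This is an illustration under stated assumptions. It uses simple (not stratified) random sampling, Pearson correlation, and a synthetic corpus, none of which is confirmed as the procedure behind the slides. All function names are hypothetical.

```python
import math
import random


def idf_table(docs):
    """IDF per term using log2(N) - log2(n_i) + 1."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {t: math.log2(n) - math.log2(c) + 1 for t, c in df.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def sample_idf_correlation(docs, fraction, seed=0):
    """Correlate full-corpus IDF with IDF from a simple random sample."""
    rng = random.Random(seed)
    k = max(2, round(len(docs) * fraction))
    sample = rng.sample(docs, k)
    full, part = idf_table(docs), idf_table(sample)
    shared = sorted(set(full) & set(part))  # terms seen in both tables
    return pearson([full[t] for t in shared], [part[t] for t in shared])


# Synthetic corpus (assumption): 500 documents of 30 terms each, drawn from
# a 200-term vocabulary with Zipf-like weights 1 / rank.
rng = random.Random(42)
vocab = [f"t{i}" for i in range(1, 201)]
term_weights = [1 / i for i in range(1, 201)]
docs = [rng.choices(vocab, weights=term_weights, k=30) for _ in range(500)]

r = sample_idf_correlation(docs, fraction=0.10)
```

Terms missing from the sample are dropped before correlating, so the result describes only the shared vocabulary. A real replication would also need to decide how to treat terms that the sample never observes.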
