Term Weighting Approaches in Automatic Text Retrieval. Presented by Ehsan
References • Modern Information Retrieval (textbook) • Slides on the Vectorial Model by Dr. Rada • The paper itself: Salton and Buckley, "Term-weighting approaches in automatic text retrieval" (1988)
The main idea • A text indexing system based on weighted single terms outperforms one based on more complex text representations • Effective term weighting is therefore of crucial importance
Basic IR • Attach content identifiers to both stored texts and user queries • A content identifier (term) is a word or a group of words extracted from the documents/queries • Underlying assumption • The semantics of the documents and queries can be expressed by these terms
Two things to consider • What is an appropriate content identifier? • Are all identifiers of the same importance? • If not, how can we discriminate one term from the others?
Choosing a content identifier • Use a single term/word as an individual identifier • Use a more complex text representation as the identifier • An example • "Industry is the mother of good luck" • Mother said, "Good luck" • Both sentences share the single terms "mother", "good", and "luck", yet mean very different things, so single terms alone cannot tell them apart
Complex text representations • A set of related terms based on statistical co-occurrence • Term phrases consisting of one or more governing terms (the head of the phrase) together with the corresponding dependent terms • Grouping words under a common heading, as in a thesaurus • Constructing a knowledge base to represent the content of the subject area
What is better: single or complex terms? • Constructing complex text representations is inherently difficult • It requires sophisticated syntactic/statistical analysis programs • An example • Using term phrases yields up to a 20% improvement in some cases • In other cases the results are quite discouraging • Knowledge bases • Effective vocabulary tools covering subject areas of reasonable scope are still under development • Conclusion • Using single terms as content identifiers is preferable
The second issue • How do we discriminate among terms? • With term weights, of course! • Effectiveness of an IR system • Relevant documents must be retrieved • Irrelevant/extraneous documents must be rejected
Precision and Recall • Recall • The number of relevant documents retrieved divided by the total number of relevant documents • Precision • Out of the documents retrieved, how many are relevant • Our goal • High recall, to retrieve as many relevant documents as possible • High precision, to reject extraneous documents • Basically, it is a trade-off (see the sketch below)
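To make the two measures concrete, here is a minimal Python sketch; the document IDs and relevance judgments are invented for illustration.

```python
# Minimal sketch of recall and precision; the sets below are invented.
relevant = {"d1", "d2", "d3", "d4"}     # all relevant documents in the collection
retrieved = {"d1", "d2", "d5"}          # documents the system returned

hits = relevant & retrieved             # relevant documents that were retrieved
recall = len(hits) / len(relevant)      # 2 / 4 = 0.50
precision = len(hits) / len(retrieved)  # 2 / 3 ≈ 0.67

print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```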
Weighting mechanisms • To get high recall • Term frequency, tf • When high-frequency terms are prevalent throughout the whole collection, a high tf alone retrieves nearly every document • To get high precision • Inverse document frequency, idf • Varies inversely with the number of documents n in which the term appears • idf is given by log2(N/n), where N is the total number of documents • To discriminate terms • We use tf × idf (a small sketch follows below)
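Below is a small sketch of this scheme, using the slide's definition idf = log2(N/n) and raw term counts for tf; the three-document collection is invented.

```python
import math
from collections import Counter

# A minimal sketch of tf x idf weighting with idf = log2(N / n).
docs = [
    "industry is the mother of good luck",
    "mother said good luck to the child",
    "the steel industry grew",
]

N = len(docs)                           # total number of documents
tokenized = [d.split() for d in docs]

def idf(term):
    n = sum(1 for doc in tokenized if term in doc)  # docs containing the term
    return math.log2(N / n)

def tf_idf(doc_index, term):
    tf = Counter(tokenized[doc_index])[term]        # raw term frequency
    return tf * idf(term)

print(tf_idf(2, "steel"))  # rare term: idf = log2(3/1), high weight
print(tf_idf(0, "the"))    # term in every doc: idf = log2(3/3) = 0
```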
Two more things to consider • The current tf × idf mechanism favors longer documents • Introduce a normalizing factor in the weight to equalize document lengths • The probabilistic model • The term weight is the proportion of relevant documents in which a term occurs divided by the proportion of irrelevant documents in which the term occurs • It is given by log((N − n)/n) • Both ideas are sketched below
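A hedged sketch of both refinements, with invented inputs: cosine length normalization divides each weight by the vector length, and the probabilistic weight follows the formula log((N − n)/n) given above.

```python
import math

def cosine_normalize(weights):
    """Divide each weight by the vector length, so longer documents
    do not dominate shorter ones."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()}

def probabilistic_weight(N, n):
    """log((N - n) / n): high for terms concentrated in few documents,
    negative for terms occurring in more than half of them."""
    return math.log((N - n) / n)

print(cosine_normalize({"steel": 3.0, "industry": 1.0}))
print(probabilistic_weight(N=1000, n=10))  # ~4.6: a discriminating term
```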
Term weighting components • Term frequency components: b (binary), t (raw tf), n (augmented normalized tf) • Collection frequency components: x (none), f (idf), p (probabilistic) • Normalization components: x (none), c (cosine) • What would be the weighting system given by tfc.nfx? (See the sketch below)
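To answer the question: in the paper's notation the first triple weights documents and the second weights queries, so tfc.nfx means raw tf × idf with cosine normalization for documents, and augmented tf × idf with no normalization for queries. The sketch below assumes those component definitions; the idf values and term counts are invented.

```python
import math

def tfc(tf_counts, idf):
    """t: raw tf, f: multiply by idf, c: cosine-normalize the result."""
    weights = {t: tf * idf[t] for t, tf in tf_counts.items()}
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()}

def nfx(tf_counts, idf):
    """n: augmented tf (0.5 + 0.5 * tf / max_tf), f: multiply by idf,
    x: no normalization."""
    max_tf = max(tf_counts.values())
    return {t: (0.5 + 0.5 * tf / max_tf) * idf[t]
            for t, tf in tf_counts.items()}

idf = {"steel": 1.58, "industry": 0.58}       # invented idf values
print(tfc({"steel": 2, "industry": 1}, idf))  # document weighting
print(nfx({"steel": 1, "industry": 3}, idf))  # query weighting
```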
Experimental evidence • Query vectors • For tf • Short queries: use n • Long queries: use t • For idf • Use f • For normalization • Use x
Experimental evidence • Document vectors • For tf • Technical vocabulary: use n • More varied vocabulary: use t • For idf • Use f in general • For documents from differing domains: use x • For normalization • Documents of heterogeneous lengths: use c • Homogeneous documents: use x
Conclusion • Best document weighting: tfc, nfc (or tpc, npc) • Best query weighting: nfx, tfx, bfx (or npx, tpx, bpx) • Questions?