Term-Weighting Approaches in Automatic Text Retrieval. Presented by Ehsan
References • Modern Information Retrieval (textbook) • Slides on the Vectorial Model by Dr. Rada • The paper itself
The main idea • A text indexing system based on weighted single terms performs better than one based on more complex text representations • Effective term weighting is therefore of crucial importance.
Basic IR • Attach content identifiers to both the stored texts and the user queries • A content identifier (term) is a word or a group of words extracted from the documents/queries • Underlying assumption • The semantics of the documents and queries can be expressed by these terms
Two things to consider • What is an appropriate content identifier? • Are all identifiers of the same importance? • If not, how can we discriminate one term from the others?
Choosing a content identifier • Use single terms/words as individual identifiers • Or use a more complex text representation as the identifier • An example • "Industry is the mother of good luck" • Mother said, "Good luck". • With single terms, both sentences index to the same words ("mother", "good", "luck") even though their meanings differ
Complex text representations • Sets of related terms based on statistical co-occurrence • Term phrases consisting of one or more governing terms (the heads of the phrase) together with the corresponding dependent terms • Grouping words under a common heading, as in a thesaurus • Constructing a knowledge base to represent the content of the subject area
What is better: single or complex terms? • Constructing complex text representations is inherently difficult • It requires sophisticated syntactic/statistical analysis programs • An example • Using term phrases gives about a 20% improvement in some cases • In other cases the results are quite discouraging • Knowledge bases • Effective vocabulary tools covering subject areas of reasonable scope are still under development • Conclusion • Using single terms as content identifiers is preferable
The second issue • How do we discriminate terms? • With term weights, of course! • Effectiveness of an IR system • Relevant documents must be retrieved • Irrelevant/extraneous documents must be rejected
Precision and recall • Recall • The number of relevant documents retrieved divided by the total number of relevant documents • Precision • Out of the documents retrieved, the fraction that are relevant • Our goal • High recall, to retrieve as many relevant documents as possible • High precision, to reject extraneous documents • Basically, it is a trade-off (see the sketch below)
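A minimal sketch of the two measures, using made-up document IDs (nothing here comes from the paper):

    # Toy retrieval result: which documents the system returned,
    # and which documents are actually relevant.
    retrieved = {1, 2, 3, 4}
    relevant = {2, 4, 5}

    hits = retrieved & relevant              # relevant documents we retrieved
    precision = len(hits) / len(retrieved)   # 2 / 4 = 0.50
    recall = len(hits) / len(relevant)       # 2 / 3 ~= 0.67

    print(precision, recall)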
Weighting mechanism • To get high recall • Use term frequency, tf • But when high-frequency terms are prevalent across the whole document collection, tf alone retrieves nearly every document, hurting precision • To get high precision • Use inverse document frequency, idf • It varies inversely with the number of documents n in which the term appears • idf is given by log2(N / n), where N is the total number of documents • To discriminate terms • We use tf × idf (see the sketch below)
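A minimal sketch of tf × idf on a toy corpus (the corpus and function name are illustrative, not from the paper):

    import math

    docs = [
        "industry is the mother of good luck",
        "mother said good luck",
        "luck favors the prepared",
    ]

    def tf_idf(term, doc, docs):
        tf = doc.split().count(term)               # raw term frequency in this document
        n = sum(term in d.split() for d in docs)   # number of documents containing the term
        idf = math.log2(len(docs) / n)             # idf = log2(N / n)
        return tf * idf

    # "luck" occurs in every document, so its idf (and weight) is 0;
    # "industry" occurs in only one document, so it gets the highest weight.
    print(tf_idf("luck", docs[0], docs))      # 0.0
    print(tf_idf("industry", docs[0], docs))  # log2(3 / 1) ~= 1.58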
Two more things to consider • The current tf × idf mechanism favors longer documents • So introduce a normalizing factor into the weight to equalize document lengths • Probabilistic model • The term weight is the proportion of relevant documents in which a term occurs divided by the proportion of irrelevant documents in which it occurs • It is given by log((N - n) / n) • Both refinements are sketched below
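A sketch of both refinements (the function names are mine, and since the slide does not fix the log base, the natural log is assumed):

    import math

    def cosine_normalize(weights):
        # Divide each term weight by the vector's Euclidean length so that
        # long documents are not favored merely for having more/larger weights.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {t: w / norm for t, w in weights.items()}

    def prob_weight(N, n):
        # Probabilistic collection weight log((N - n) / n),
        # with N = total documents and n = documents containing the term.
        return math.log((N - n) / n)

    print(cosine_normalize({"mother": 2.0, "luck": 1.0}))
    print(prob_weight(N=1000, n=10))   # log(990 / 10) ~= 4.6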
Term-weighting components • Term frequency components: b, t, n • Collection frequency components: x, f, p • Normalization components: x, c • What weighting system is given by tfc.nfx? (see the sketch below)
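One way to read the three-letter codes, assuming the standard SMART-style component definitions (b = binary, t = raw tf, n = augmented tf; x = no collection weight, f = idf, p = probabilistic; x = no normalization, c = cosine); the function below is an illustrative sketch, not the paper's code:

    import math

    def component_weight(code, tf, max_tf, N, n):
        # First letter: term frequency component.
        tf_part = {"b": 1.0 if tf > 0 else 0.0,
                   "t": float(tf),
                   "n": 0.5 + 0.5 * tf / max_tf}[code[0]]
        # Second letter: collection frequency component.
        cf_part = {"x": 1.0,
                   "f": math.log(N / n),
                   "p": math.log((N - n) / n)}[code[1]]
        # The third letter (x or c) only says whether the finished vector is
        # cosine-normalized; it does not change an individual weight.
        return tf_part * cf_part

    # tfc.nfx: documents weighted with raw tf * idf and cosine-normalized,
    # queries with augmented tf * idf and no normalization.
    print(component_weight("tfc", tf=3, max_tf=5, N=1000, n=10))
    print(component_weight("nfx", tf=2, max_tf=4, N=1000, n=10))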
Experimental evidence • Query vectors • For tf • For short queries, use n • For long queries, use t • For idf • Use f • For normalization • Use x
Experimental evidence • Document vectors • For tf • For technical vocabulary, use n • For more varied vocabulary, use t • For idf • Use f in general • For documents from different domains, use x • For normalization • For documents of heterogeneous lengths, use c • For homogeneous documents, use x
Conclusion • Best document weightings: tfc, nfc (or tpc, npc) • Best query weightings: nfx, tfx, bfx (or npx, tpx, bpx) • Questions?