Information Retrieval and Vector Space Model
Computational Linguistics Course
Instructor: Professor Cercone
Presenter: Morteza Zihayat
Outline
• Introduction to IR
• IR System Architecture
• Vector Space Model (VSM)
• How to Assign Weights?
• TF-IDF Weighting
• Example
• Advantages and Disadvantages of VS Model
• Improving the VS Model
Introduction to IR
The world's total yearly production of unique information stored in the form of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth (Lyman & Varian, 2000).
Growth of textual information
[Figure: sources of text, labeled Literature, Desktop, Email, WWW, Blog, News, Intranet]
How can we help manage and exploit all this information?
Information overflow
What is Information Retrieval (IR)?
• Narrow sense:
  • IR = search engine technologies (e.g., Google, library information systems)
  • IR = text matching/classification
• Broad sense: IR = text information management
  • General problem: how to manage text information?
  • How to find useful information? (retrieval)
    • Example: Google
  • How to organize information? (text classification)
    • Example: automatically assigning emails to different folders
  • How to discover knowledge from text? (text mining)
    • Example: discovering correlations between events
Formalizing IR Tasks
Source: This slide is borrowed from [1]
• Vocabulary: V = {w_1, w_2, …, w_T} of a language
• Query: q = q_1, q_2, …, q_m, where q_i ∈ V
• Document: d_i = d_i1, d_i2, …, d_im_i, where d_ij ∈ V
• Collection: C = {d_1, d_2, …, d_N}
• Relevant document set: R(q) ⊆ C, generally unknown and user-dependent
• The query provides a “hint” on which documents should be in R(q)
• IR: find the approximate relevant document set R’(q)
Evaluation measures
• The quality of many retrieval systems depends on how well they manage to rank relevant documents.
• How can we evaluate rankings in IR?
  • IR researchers have developed evaluation measures specifically designed to evaluate rankings.
  • Most of these measures combine precision and recall in a way that takes account of the ranking.
Precision & Recall
Source: This slide is borrowed from [1]
In other words:
• Precision is the percentage of returned items that are relevant.
• Recall is the percentage of all relevant documents in the collection that appear in the returned set.
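The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not taken from the slides: the function name `precision_recall` and the document IDs are hypothetical, and the returned and relevant sets are assumed to be known.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a returned set against a relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were returned
    # Guard against empty sets; treating those cases as 0.0 is an assumption.
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
returned_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7"]
p, r = precision_recall(returned_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Two of the four returned documents are relevant (precision 2/4), and two of the three relevant documents were returned (recall 2/3).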
Evaluating Retrieval Performance
Source: This slide is borrowed from [1]
IR System Architecture
[Figure: architecture diagram with the components INDEXING, SEARCHING, INTERFACE, and QUERY MODIFICATION; labels: docs, Doc Rep, query, Query Rep, Ranking, results, User, judgments, Feedback]
Indexing Document
Searching
• Given a query, score documents efficiently
• The basic question: given a query, how do we know if document A is more relevant than document B?
  • If document A uses more query words than document B
  • If word usage in document A is more similar to that in the query
  • …
• We need a way to compute the relevance between a query and documents
The Notion of Relevance
Relevance can be modeled in three ways:
• Similarity: Δ(Rep(q), Rep(d)); different representations and similarity measures
  • Vector space model (Salton et al., 75) [today’s lecture]
  • Prob. distr. model (Wong & Yao, 89)
  • …
• Probability of relevance: P(r=1|q,d), r ∈ {0,1}
  • Regression model (Fox 83)
  • Generative model
    • Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
    • Query generation: LM approach (Ponte & Croft, 98), (Lafferty & Zhai, 01a)
• Probabilistic inference: P(d→q) or P(q→d); different inference systems
  • Prob. concept space model (Wong & Yao, 95)
  • Inference network model (Turtle & Croft, 91)
Relevance = Similarity
• Assumptions
  • Query and document are represented similarly
  • A query can be regarded as a “document”
  • Relevance(d,q) ∝ similarity(d,q)
• R(q) = {d ∈ C | f(d,q) > θ}, f(q,d) = Δ(Rep(q), Rep(d))
• Key issues
  • How to represent query/document?
    • Vector Space Model (VSM)
  • How to define the similarity measure Δ?
Vector Space Model (VSM)
• The vector space model is one of the most widely used models for ad-hoc retrieval.
• It is used in information filtering, information retrieval, indexing, and relevancy ranking.
VSM
• Represent a doc/query by a term vector
  • Term: basic concept, e.g., word or phrase
  • Each term defines one dimension
  • N terms define a high-dimensional space
  • E.g., d = (x_1, …, x_N), where x_i is the “importance” of term i
• Measure relevance by the distance between the query vector and the document vector in the vector space
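As a rough sketch of the representation described above, the following Python maps a text to a term vector with one dimension per vocabulary term. Everything here is an illustrative assumption: the helper name `term_vector`, the three-term vocabulary, the sample texts, whitespace tokenization, and the use of raw counts as the "importance" value.

```python
def term_vector(text, vocabulary):
    """Map a text to a list of raw term counts, one entry per vocabulary term."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    return [tokens.count(term) for term in vocabulary]


# Hypothetical vocabulary and texts; each term is one dimension of the space.
vocabulary = ["java", "microsoft", "starbucks"]
doc_vec = term_vector("Java coffee at Starbucks and more Java", vocabulary)
query_vec = term_vector("java starbucks", vocabulary)
print(doc_vec)    # [2, 0, 1]
print(query_vec)  # [1, 0, 1]
```

The query is vectorized with the same function as the document, which reflects the assumption that a query can be treated as a short document.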
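VS Model: illustration
[Figure: documents D1 to D11 and the query plotted in a term space with the axes Starbucks, Java, and Microsoft; question marks indicate that the document positions and the query's nearest documents are still to be determined]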
Some Issues about VS Model
• There is no consistent definition of the basic concept (term)
• The model itself does not specify how to assign weights to words
• Weight in query indicates importance of term
How to Assign Weights?
• Different terms have different importance in a text
• A term weighting scheme plays an important role in the similarity measure
  • Higher weight = greater impact
• We now turn to the question of how to weight words in the vector space model
There are three components in a weighting scheme:
• g_i: the global weight of the ith term
• t_ij: the local weight of the ith term in the jth document
• d_j: the normalization factor for the jth document
TF Weighting
• Idea: a term is more important if it occurs more frequently in a document
• Formulas: let f(t,d) be the frequency count of term t in doc d
  • Raw TF: TF(t,d) = f(t,d)
  • Log TF: TF(t,d) = log f(t,d)
  • Maximum frequency normalization: TF(t,d) = 0.5 + 0.5 * f(t,d) / MaxFreq(d)
• Normalization of TF is very important!
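The three TF formulas above could be written as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the log is taken as the natural log (the slide does not give a base), and the value returned for a term with zero count in the log variant is a choice made here because log 0 is undefined.

```python
import math
from collections import Counter


def raw_tf(term, tokens):
    """Raw TF: the count f(t,d) of the term in the document."""
    return Counter(tokens)[term]


def log_tf(term, tokens):
    """Log TF: log f(t,d); returns 0.0 when the term is absent (assumption)."""
    f = raw_tf(term, tokens)
    return math.log(f) if f > 0 else 0.0


def max_norm_tf(term, tokens):
    """Maximum frequency normalization: 0.5 + 0.5 * f(t,d) / MaxFreq(d)."""
    counts = Counter(tokens)
    if not counts:
        return 0.0  # empty document (assumption)
    return 0.5 + 0.5 * counts[term] / max(counts.values())


tokens = "delivery of silver arrived in a silver truck".split()
print(raw_tf("silver", tokens))       # 2
print(max_norm_tf("silver", tokens))  # 1.0
print(max_norm_tf("truck", tokens))   # 0.75
```

Note that with the log formula exactly as written, a term that occurs once gets weight log 1 = 0, which is one reason many variants of TF exist.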
TF Methods
IDF Weighting
• Idea: a term is more discriminative if it occurs in fewer documents
• Formula: IDF(t) = 1 + log(n/k)
  • n: total number of docs
  • k: number of docs containing term t (document frequency)
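A small sketch of the IDF formula above, assuming documents are given as token lists and the log is the natural log (the slide does not state a base). The function name `idf` and the sample collection are hypothetical.

```python
import math


def idf(term, docs):
    """IDF(t) = 1 + log(n / k), with n docs in total and k docs containing t."""
    n = len(docs)
    k = sum(1 for tokens in docs if term in tokens)
    if k == 0:
        # The formula is undefined for unseen terms; raising is an assumption.
        raise ValueError(f"term {term!r} does not occur in the collection")
    return 1 + math.log(n / k)


# Hypothetical three-document collection.
docs = [
    ["gold", "in", "fire"],
    ["silver", "in", "truck"],
    ["gold", "in", "truck"],
]
print(idf("in", docs))      # 1.0, the term occurs in every document
print(idf("silver", docs))  # about 2.10, the term occurs in one document
```

A term that appears in every document gets the minimum value of 1, while rarer terms get larger values.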
IDF weighting Methods
TF Normalization
• Why?
  • Document length variation
  • “Repeated occurrences” are less informative than the “first occurrence”
• Two views of document length
  • A doc is long because it uses more words
  • A doc is long because it has more content
• Generally penalize long docs, but avoid over-penalizing
TF-IDF Weighting
• TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
  • Common in doc → high TF → high weight
  • Rare in collection → high IDF → high weight
• Imagine a word count profile: what kind of terms would have high weights?
How to Measure Similarity?
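Two similarity measures that appear elsewhere in these slides are the dot product and the cosine. The sketch below shows both on plain Python lists; the function names and the example vectors are illustrative assumptions, not the slide's original content.

```python
import math


def dot(u, v):
    """Dot product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


# Hypothetical vectors: long_doc is short_doc repeated ten times over.
query = [1.0, 1.0, 0.0]
short_doc = [2.0, 1.0, 0.0]
long_doc = [20.0, 10.0, 0.0]
print(dot(query, short_doc), dot(query, long_doc))  # 3.0 30.0
print(cosine(query, short_doc), cosine(query, long_doc))
```

The dot product rewards the longer document by a factor of ten, while the cosine gives both documents essentially the same score because it normalizes by vector length.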
VS Example: Raw TF & Dot Product
query = “information retrieval”
doc1: information retrieval search engine information
doc2: travel information map travel
doc3: government president congress

Each cell shows the raw TF followed by the TF*IDF weight in parentheses; “-” marks a term that does not occur.

             Info.    Retrieval  Travel   Map      Search   Engine   Govern.  President  Congress
IDF (fake)   2.4      4.5        2.8      3.3      2.1      5.4      2.2      3.2        4.3
Doc1         2(4.8)   1(4.5)     -        -        1(2.1)   1(5.4)   -        -          -
Doc2         1(2.4)   -          2(5.6)   1(3.3)   …
Doc3         -        -          -        -        -        -        1(2.2)   1(3.2)     1(4.3)
Query        1(2.4)   1(4.5)     -        -        -        -        -        -          -

Sim(q,doc1) = 4.8*2.4 + 4.5*4.5
Sim(q,doc2) = 2.4*2.4
Sim(q,doc3) = 0
Example
Q: “gold silver truck”
• D1: “Shipment of gold delivered in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
• Document frequency of the jth term: df_j
• Inverse document frequency: idf_j = log10(n / df_j)
Tf*idf is used as the term weight here.
Example (Cont’d)
Example (Cont’d)
Tf*idf is used here, and this SC uses the dot product.
SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) = 0.031
SC(Q, D2) = 0.486
SC(Q, D3) = 0.062
The ranking would be D2, D3, D1.
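The scores in this example can be reproduced with a short script. This is a sketch under a few assumptions that the slides do not spell out: text is lowercased and split on whitespace, no stop-word removal or stemming is applied (so "delivered" and "delivery" are distinct terms), and the query is weighted with the same tf*idf scheme as the documents. Variable and function names are illustrative.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold delivered in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"


def tokenize(text):
    return text.lower().split()


doc_tokens = {name: tokenize(text) for name, text in docs.items()}
n = len(docs)
vocab = sorted({t for tokens in doc_tokens.values() for t in tokens})
# df: number of documents containing each term; idf = log10(n / df)
df = {t: sum(t in tokens for tokens in doc_tokens.values()) for t in vocab}
idf = {t: math.log10(n / df[t]) for t in vocab}


def weights(tokens):
    """tf*idf weight for every in-vocabulary term of a token list."""
    counts = Counter(tokens)
    return {t: counts[t] * idf[t] for t in counts if t in idf}


q_w = weights(tokenize(query))
scores = {}
for name, tokens in doc_tokens.items():
    d_w = weights(tokens)
    # Dot product over the query terms; absent terms contribute 0.
    scores[name] = sum(w * d_w.get(t, 0.0) for t, w in q_w.items())

ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 3))
# D2 0.486
# D3 0.062
# D1 0.031
```

D2 wins because "silver" occurs only in D2 (idf 0.477) and occurs there twice, so its contribution (0.954)(0.477) dominates the other terms.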
Advantages of VS Model
• Empirically effective! (Top TREC performance)
• Intuitive
• Easy to implement
• Well-studied/most evaluated
• The SMART system
  • Developed at Cornell: 1960-1999
  • Still widely used
• Warning: many variants of TF-IDF!
Disadvantages of VS Model
• Assumes term independence
• Assumes query and document are the same kind of object
• Requires lots of parameter tuning!
Improving the VSM Model
• We can improve the model by:
  • Reducing the number of dimensions
    • eliminating all stop words and very common terms
    • stemming terms to their roots
    • Latent Semantic Analysis
  • Not retrieving documents below a defined cosine threshold
  • Normalized frequency of a term i in document j is given by [1]:
    • Normalized Document Frequencies
    • Normalized Query Frequencies
Stop List
• Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, …
• Stop list: contains stop words, which are not used as index terms
  • Prepositions
  • Articles
  • Pronouns
  • Some adverbs and adjectives
  • Some frequent words (e.g., document)
• The removal of stop words usually improves IR effectiveness
• A few “standard” stop lists are commonly used
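Stop-word removal can be sketched as a simple filter applied before indexing. The stop list below contains only the words named on this slide, not one of the "standard" lists, and the function name and whitespace tokenization are assumptions made for illustration.

```python
# Only the stop words listed on the slide; a real stop list would be longer.
STOP_WORDS = {"of", "not", "to", "or", "in", "about", "with", "i", "be"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop tokens found in the stop list."""
    return [t for t in text.lower().split() if t not in stop_words]


print(remove_stop_words("Shipment of gold arrived in a truck"))
# ['shipment', 'gold', 'arrived', 'a', 'truck']
```

The article "a" survives here only because it is not in this short illustrative list; a list that covers articles, as the slide suggests, would remove it too.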
Stemming
• Reason:
  • Different word forms may bear similar meaning (e.g., search, searching): create a “standard” representation for them
• Stemming:
  • Removing some endings of words
  • Example: dancer, dancers, dance, danced, dancing → dance
Stemming (Cont’d)
• Two main methods:
  • Linguistic/dictionary-based stemming
    • high stemming accuracy
    • high implementation and processing costs and higher coverage
  • Porter-style stemming
    • lower stemming accuracy
    • lower implementation and processing costs and lower coverage
    • usually sufficient for IR
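To make the idea of "removing some endings" concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm and not a dictionary-based stemmer; the suffix list, the minimum stem length, and the function name are all assumptions chosen so that the word forms from the stemming examples collapse to one index key.

```python
# Suffixes are tried in this order, longer ones first (illustrative list only).
SUFFIXES = ("ers", "ing", "er", "ed", "s", "e")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["dancer", "dancers", "dance", "danced", "dancing"]:
    print(w, "->", naive_stem(w))  # every form maps to "danc"
print(naive_stem("search"), naive_stem("searching"))  # search search
```

The resulting stem "danc" is not a dictionary word, which is typical of rule-based stemmers: for retrieval it only matters that related forms share the same key.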
Latent Semantic Indexing (LSI) [3]
• Reduces the dimensions of the term-document space
• Attempts to address synonymy and polysemy
• Uses Singular Value Decomposition (SVD)
  • identifies patterns in the relationships between the terms and concepts contained in an unstructured collection of text
• Based on the principle that words that are used in the same contexts tend to have similar meanings
LSI Process
• In general, the process involves:
  • constructing a weighted term-document matrix
  • performing a Singular Value Decomposition on the matrix
  • using the matrix to identify the concepts contained in the text
• LSI statistically analyses the patterns of word usage across the entire document collection
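In symbols, the steps above are commonly written as follows. The notation is an assumption introduced here, not the slides' own: X is the weighted term-document matrix with t terms and d documents, k is the number of retained concepts, and q is a query vector in the original term space.

```latex
% Step 2: singular value decomposition of the weighted term-document matrix
X = U \Sigma V^{\top}

% Step 3: keep only the k largest singular values (rank-k approximation)
X_k = U_k \Sigma_k V_k^{\top}, \qquad k \ll \min(t, d)

% A query is mapped ("folded") into the same k-dimensional concept space
\hat{q} = \Sigma_k^{-1} U_k^{\top} q
```

The columns of U_k play the role of concepts, and documents and queries can then be compared in the k-dimensional space, for example with the cosine.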
References
• Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze
• https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/2.pdf
• https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/ir4up.pdf
• https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/e09-3009.pdf
• https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/07models-vsm.pdf
• https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/03vectorspaceimplementation-6per.pdf
• https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture02.ppt
• https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/vector_space_model-updated.ppt
• https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture_13_ir_and_vsm_.ppt
• Document Classification based on Wikipedia Content, http://www.iicm.tugraz.at/cguetl/courses/isr/opt/classification/Vector_Space_Model.html?timestamp=1318275702299