Computational Linguistics Course. Instructor: Professor Cercone. Presenter: Morteza Zihayat

Presentation Transcript


  1. Information Retrieval and Vector Space Model. Computational Linguistics Course. Instructor: Professor Cercone. Presenter: Morteza Zihayat

  2. Outline
     • Introduction to IR
     • IR System Architecture
     • Vector Space Model (VSM)
     • How to Assign Weights?
     • TF-IDF Weighting
     • Example
     • Advantages and Disadvantages of VS Model
     • Improving the VS Model

  4. Introduction to IR
     “The world's total yearly production of unique information stored in the form of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.” (Lyman & Varian, 2000)

  5. Growth of textual information
     [Figure: sources of text include literature, desktop, email, WWW, blogs, news, and intranets]
     How can we help manage and exploit all this information?

  6. Information overflow

  7. What is Information Retrieval (IR)?
     • Narrow sense:
       • IR = search engine technologies (e.g., Google, library information systems)
       • IR = text matching/classification
     • Broad sense: IR = text information management
       • General problem: how to manage text information?
       • How to find useful information? (retrieval) Example: Google
       • How to organize information? (text classification) Example: automatically assigning emails to different folders
       • How to discover knowledge from text? (text mining) Example: discovering correlations between events

  9. Formalizing IR Tasks (Source: this slide is borrowed from [1])
     • Vocabulary: V = {w_1, w_2, …, w_T} of a language
     • Query: q = q_1, q_2, …, q_m, where q_i ∈ V
     • Document: d_i = d_{i1}, d_{i2}, …, d_{i m_i}, where d_{ij} ∈ V
     • Collection: C = {d_1, d_2, …, d_N}
     • Relevant document set: R(q) ⊆ C, generally unknown and user-dependent
     • The query provides a “hint” on which documents should be in R(q)
     • IR: find the approximate relevant document set R’(q)

  10. Evaluation measures
      • The quality of many retrieval systems depends on how well they manage to rank relevant documents.
      • How can we evaluate rankings in IR?
      • IR researchers have developed evaluation measures specifically designed to evaluate rankings.
      • Most of these measures combine precision and recall in a way that takes account of the ranking.

  11. Precision & Recall (Source: this slide is borrowed from [1])

  12. In other words:
      • Precision is the percentage of items in the returned set that are relevant.
      • Recall is the percentage of all relevant documents in the collection that appear in the returned set.
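
A minimal sketch of the two definitions above, assuming the relevant set is known for the query (in practice it comes from human judgments). The function name `precision_recall` and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant (assumed known here)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 documents returned, 3 relevant documents in the collection.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Both measures ignore the order of the returned set, which is why the ranking-aware measures mentioned earlier build on them rather than use them directly.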

  13. Evaluating Retrieval Performance (Source: this slide is borrowed from [1])

  14. IR System Architecture
      [Diagram; labels: docs, INDEXING, Doc Rep, query, Query Rep, SEARCHING, Ranking, results, User, INTERFACE, judgments, Feedback, QUERY MODIFICATION]

  15. Indexing Document

  16. Searching
      • Given a query, score documents efficiently
      • The basic question: given a query, how do we know if document A is more relevant than document B?
        • If document A uses more query words than document B
        • If word usage in document A is more similar to that in the query
        • …
      • We need a way to compute relevance between the query and the documents

  17. The Notion of Relevance
      Relevance can be modeled as:
      • Similarity, Δ(Rep(q), Rep(d)): different representations and similarity measures
        • Vector space model (Salton et al., 75) (today’s lecture)
        • Prob. distr. model (Wong & Yao, 89)
        • …
      • Probability of relevance, P(r=1|q,d), r ∈ {0,1}
        • Regression model (Fox 83)
        • Generative model
          • Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
          • Query generation: LM approach (Ponte & Croft, 98), (Lafferty & Zhai, 01a)
      • Probabilistic inference, P(d→q) or P(q→d): different inference systems
        • Prob. concept space model (Wong & Yao, 95)
        • Inference network model (Turtle & Croft, 91)

  18. Relevance = Similarity
      • Assumptions
        • Query and document are represented similarly
        • A query can be regarded as a “document”
        • Relevance(d,q) is measured by similarity(d,q)
      • R(q) = {d ∈ C | f(d,q) > θ}, f(q,d) = Δ(Rep(q), Rep(d))
      • Key issues
        • How to represent the query/document? Vector Space Model (VSM)
        • How to define the similarity measure Δ?

  20. Vector Space Model (VSM)
      • The vector space model is one of the most widely used models for ad-hoc retrieval.
      • It is used in information filtering, information retrieval, indexing, and relevancy ranking.

  21. VSM
      • Represent a doc/query by a term vector
        • Term: basic concept, e.g., a word or phrase
        • Each term defines one dimension
        • N terms define a high-dimensional space
        • E.g., d = (x_1, …, x_N), where x_i is the “importance” of term i
      • Measure relevance by the distance between the query vector and the document vector in the vector space
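
The representation above can be sketched as follows. This is a minimal illustration that assumes terms are lowercase whitespace-separated words and that “importance” is the raw count; the helper names and the two example documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(texts):
    """Sorted list of distinct lowercase words; each word is one dimension."""
    return sorted({word for text in texts for word in text.lower().split()})


def to_vector(text, vocabulary):
    """Raw-count term vector of `text` over the given vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical two-document collection and a one-word query.
docs = ["java coffee starbucks", "java microsoft software java"]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
query_vector = to_vector("java", vocab)

print(vocab)
print(vectors, query_vector)
```

The query goes through the same `to_vector` call as the documents, which reflects the assumption that a query can be treated as a short document.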

  22. VS Model: illustration
      [Figure: documents D1 to D11 and the query plotted in a space whose axes are the terms Java, Starbucks, and Microsoft]

  23. Some Issues about VS Model
      • There is no consistent definition of a “basic concept” (term)
      • The model does not specify how weights should be assigned to words
      • A weight in the query indicates the importance of the term

  25. How to Assign Weights?
      • Different terms have different importance in a text
      • A term weighting scheme plays an important role in the similarity measure
      • Higher weight = greater impact
      • We now turn to the question of how to weight words in the vector space model

  26. There are three components in a weighting scheme:
      • g_i: the global weight of the ith term
      • t_ij: the local weight of the ith term in the jth document
      • d_j: the normalization factor for the jth document

  28. TF Weighting
      • Idea: a term is more important if it occurs more frequently in a document
      • Formulas: let f(t,d) be the frequency count of term t in doc d
        • Raw TF: TF(t,d) = f(t,d)
        • Log TF: TF(t,d) = log f(t,d)
        • Maximum frequency normalization: TF(t,d) = 0.5 + 0.5*f(t,d)/MaxFreq(d)
      • Normalization of TF is very important!
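
The three TF formulas above can be sketched as below. Several details are assumptions not stated on the slide: the logarithm is natural, a term absent from the document gets weight 0 in every variant, and the function names are hypothetical. Note that log f(t,d) is 0 for a term that occurs exactly once, which is why many systems use 1 + log f(t,d) instead.

```python
import math
from collections import Counter


def term_counts(text):
    """f(t, d) for every term t in the document text."""
    return Counter(text.lower().split())


def raw_tf(term, counts):
    return counts[term]


def log_tf(term, counts):
    f = counts[term]
    # Assumption: natural log, and 0.0 for absent terms (log 0 is undefined).
    return math.log(f) if f > 0 else 0.0


def max_norm_tf(term, counts):
    f = counts[term]
    if f == 0:
        return 0.0  # assumption: absent terms get no weight
    return 0.5 + 0.5 * f / max(counts.values())


counts = term_counts("silver truck silver")
for t in ("silver", "truck"):
    print(t, raw_tf(t, counts), log_tf(t, counts), max_norm_tf(t, counts))
```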

  29. TF Methods

  30. IDF Weighting
      • Idea: a term is more discriminative if it occurs in fewer documents
      • Formula: IDF(t) = 1 + log(n/k)
        • n: total number of docs
        • k: number of docs containing term t (document frequency)
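
A small sketch of this IDF formula, assuming a natural logarithm (the slide does not give the base) and documents stored as sets of terms. The function name and the four-document collection are hypothetical.

```python
import math


def idf(term, docs):
    """IDF(t) = 1 + log(n / k), with n documents in total and k containing t."""
    n = len(docs)
    k = sum(1 for doc in docs if term in doc)
    if k == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return 1 + math.log(n / k)


# Hypothetical collection; each document is a set of terms.
docs = [
    {"information", "retrieval"},
    {"information", "travel"},
    {"information", "map"},
    {"government"},
]
print(idf("information", docs))  # appears in 3 of 4 docs: low IDF
print(idf("retrieval", docs))    # appears in 1 of 4 docs: high IDF
```

With this variant, a term that occurs in every document still gets IDF 1 rather than 0, so it is down-weighted but not removed.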

  31. IDF Weighting Methods

  32. TF Normalization
      • Why?
        • Document length variation
        • “Repeated occurrences” are less informative than the “first occurrence”
      • Two views of document length
        • A doc is long because it uses more words
        • A doc is long because it has more content
      • Generally penalize long docs, but avoid over-penalizing

  33. TF-IDF Weighting
      • TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
        • Common in doc → high TF → high weight
        • Rare in collection → high IDF → high weight
      • Imagine a word count profile: what kind of terms would have high weights?

  34. How to Measure Similarity?

  36. VS Example: Raw TF & Dot Product
      query = “information retrieval”
      doc1 = “information retrieval search engine information”
      doc2 = “travel information map travel”
      doc3 = “government president congress”
      IDF (fake): Info. 2.4, Retrieval 4.5, Travel 2.8, Map 3.3, Search 2.1, Engine 5.4, Govern. 2.2, President 3.2, Congress 4.3
      Term vectors, written as raw TF (TF*IDF weight):
      • Doc1: Info. 2 (4.8), Retrieval 1 (4.5), Search 1 (2.1), Engine 1 (5.4)
      • Doc2: Info. 1 (2.4), Travel 2 (5.6), Map 1 (3.3), …
      • Doc3: Govern. 1 (2.2), President 1 (3.2), Congress 1 (4.3)
      • Query: Info. 1 (2.4), Retrieval 1 (4.5)
      Sim(q, doc1) = 4.8*2.4 + 4.5*4.5
      Sim(q, doc2) = 2.4*2.4
      Sim(q, doc3) = 0
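
The dot products in this example can be checked with a short sketch. The weights are the slide's fake TF*IDF values; storing each vector as a sparse dictionary and the helper name `dot` are choices made here for illustration.

```python
def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)


# TF*IDF weights copied from the example (the IDF values are fake).
query = {"info": 2.4, "retrieval": 4.5}
doc_vectors = {
    "doc1": {"info": 4.8, "retrieval": 4.5, "search": 2.1, "engine": 5.4},
    "doc2": {"info": 2.4, "travel": 5.6, "map": 3.3},
    "doc3": {"govern": 2.2, "president": 3.2, "congress": 4.3},
}

scores = {name: dot(query, vec) for name, vec in doc_vectors.items()}
for name, score in scores.items():
    print(name, round(score, 2))
```

Only terms shared by the query and the document contribute, so doc3, which shares no term with the query, scores 0.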

  37. Example
      Q: “gold silver truck”
      • D1: “Shipment of gold delivered in a fire”
      • D2: “Delivery of silver arrived in a silver truck”
      • D3: “Shipment of gold arrived in a truck”
      • Document frequency of the jth term: df_j
      • Inverse document frequency: idf_j = log10(n / df_j)
      • tf*idf is used as the term weight here

  38. Example (Cont’d)

  39. Example (Cont’d)
      tf*idf is used here.
      SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) = 0.031
      SC(Q, D2) = 0.486
      SC(Q, D3) = 0.062
      The ranking would be D2, D3, D1.
      • This SC uses the dot product.
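
The whole example can be reproduced with a short script. This is a sketch under the example's stated choices (raw term frequency, idf = log10(n/df), dot-product score); lowercasing and whitespace tokenization are assumptions made here, and the variable names are hypothetical.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold delivered in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"


def tokens(text):
    return text.lower().split()


n = len(docs)
# Document frequency: in how many documents each term occurs.
df = Counter(term for text in docs.values() for term in set(tokens(text)))
idf = {term: math.log10(n / k) for term, k in df.items()}


def weights(text):
    """tf*idf weight of every term of `text` that occurs in the collection."""
    tf = Counter(tokens(text))
    return {term: f * idf[term] for term, f in tf.items() if term in idf}


q = weights(query)
scores = {
    name: sum(w * q.get(term, 0.0) for term, w in weights(text).items())
    for name, text in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

Terms such as “of”, “in”, and “a” occur in all three documents, so their idf is log10(3/3) = 0 and they drop out of every score without an explicit stop list.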

  41. Advantages of VS Model
      • Empirically effective! (Top TREC performance)
      • Intuitive
      • Easy to implement
      • Well-studied/most evaluated
      • The SMART system
        • Developed at Cornell: 1960-1999
        • Still widely used
      • Warning: many variants of TF-IDF!

  42. Disadvantages of VS Model
      • Assumes term independence
      • Assumes the query and the documents are the same kind of object
      • Requires a lot of parameter tuning!

  44. Improving the VSM Model
      • We can improve the model by:
        • Reducing the number of dimensions
          • eliminating all stop words and very common terms
          • stemming terms to their roots
          • Latent Semantic Analysis
        • Not retrieving documents below a defined cosine threshold
      • Normalized frequency of a term i in document j is given by [1]:
        • Normalized document frequencies
        • Normalized query frequencies
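
The cosine threshold mentioned above can be sketched as follows. The slides do not define the cosine measure, so this uses the standard definition (dot product divided by the product of the vector lengths); the function names, the example vectors, and the threshold value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, doc_vectors, threshold):
    """Documents whose cosine with the query reaches the threshold, best first."""
    scored = [(name, cosine(query, vec)) for name, vec in doc_vectors.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical weighted vectors.
doc_vectors = {
    "x": {"gold": 1.0},
    "y": {"gold": 1.0, "truck": 1.0},
    "z": {"truck": 1.0},
}
results = retrieve({"gold": 1.0}, doc_vectors, threshold=0.5)
print(results)
```

Because cosine divides by the vector lengths, it also acts as a form of document length normalization, unlike the plain dot product.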

  45. Stop List
      • Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, …
      • Stop list: contains stop words, which are not used as index terms
        • Prepositions
        • Articles
        • Pronouns
        • Some adverbs and adjectives
        • Some frequent words (e.g., document)
      • The removal of stop words usually improves IR effectiveness
      • A few “standard” stop lists are commonly used
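
A minimal sketch of stop-word removal. The first nine words are the examples given on the slide; the articles are added here as an assumption, and a real system would use one of the standard stop lists instead. The names `STOP_WORDS` and `remove_stop_words` are hypothetical.

```python
# Examples from the slide, plus articles (an assumption made for illustration).
STOP_WORDS = {"of", "not", "to", "or", "in", "about", "with", "i", "be",
              "a", "an", "the"}


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop words on the stop list."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


index_terms = remove_stop_words("Shipment of gold delivered in a fire")
print(index_terms)
```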

  46. Stemming
      • Reason: different word forms may bear similar meaning (e.g., search, searching), so create a “standard” representation for them
      • Stemming: removing some endings of a word
        • dancer, dancers, dance, danced, dancing → dance

  47. Stemming (Cont’d)
      • Two main methods:
        • Linguistic/dictionary-based stemming
          • high stemming accuracy
          • high implementation and processing costs and higher coverage
        • Porter-style stemming
          • lower stemming accuracy
          • lower implementation and processing costs and lower coverage
          • usually sufficient for IR
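
To illustrate the idea of removing endings, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix list, the minimum stem length, and the function name are all hypothetical choices for this sketch.

```python
# Hypothetical suffix list, checked in this order (longer endings first).
SUFFIXES = ("ing", "ers", "er", "ed", "s", "e")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


forms = ["dancer", "dancers", "dance", "danced", "dancing"]
stems = {form: naive_stem(form) for form in forms}
print(stems)
```

All five forms collapse to the same stem, which is what matters for matching; the stem itself need not be a dictionary word. A rule set this crude will conflate unrelated words, which is the accuracy cost noted above for rule-based stemming.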

  48. Latent Semantic Indexing (LSI) [3]
      • Reduces the dimensions of the term-document space
      • Attempts to address synonymy and polysemy
      • Uses Singular Value Decomposition (SVD)
        • identifies patterns in the relationships between the terms and concepts contained in an unstructured collection of text
      • Based on the principle that words that are used in the same contexts tend to have similar meanings

  49. LSI Process
      • In general, the process involves:
        • constructing a weighted term-document matrix
        • performing a Singular Value Decomposition on the matrix
        • using the matrix to identify the concepts contained in the text
      • LSI statistically analyses the patterns of word usage across the entire document collection
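
In the usual notation (the symbols below are not from the slides and are assumptions for illustration), let A be the weighted term-document matrix with m terms and n documents. LSI keeps only the k largest singular values of its SVD, and a query vector q is mapped into the same k-dimensional concept space before it is compared with the documents:

```latex
A = U \Sigma V^{\top}, \qquad
A_k = U_k \Sigma_k V_k^{\top}, \quad k \ll \min(m, n), \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q
```

Here U_k and V_k hold the first k left and right singular vectors and Σ_k is the diagonal matrix of the k largest singular values. The columns of V_k^T give the documents' coordinates in concept space.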

  50. References
      • Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze
      • https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/2.pdf
      • https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/ir4up.pdf
      • https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/e09-3009.pdf
      • https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/07models-vsm.pdf
      • https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/03vectorspaceimplementation-6per.pdf
      • https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture02.ppt
      • https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/vector_space_model-updated.ppt
      • https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture_13_ir_and_vsm_.ppt
      • Document Classification based on Wikipedia Content, http://www.iicm.tugraz.at/cguetl/courses/isr/opt/classification/Vector_Space_Model.html?timestamp=1318275702299
