
Under The Hood [Part II] Web-Based Information Architectures


Presentation Transcript


  1. Under The Hood [Part II] Web-Based Information Architectures MSEC 20-760 Mini II Jaime Carbonell

  2. Today’s Topics • Term weighting in detail • Generalized Vector Space Model (GVSM) • Maximal Marginal Relevance • Summarization as Passage Retrieval

  3. Term Weighting Revisited (1) Definitions • wi ("ith term"): a word, stemmed word, or indexed phrase • Dj ("jth document"): a unit of indexed text, e.g. a web page, a news report, an article, a patent, a legal case, a book, a chapter of a book, etc.

  4. Term Weighting Revisited (2) Definitions • C ("the collection"): the full set of indexed documents (e.g. the New York Times archive, the Web, ...) • Tf(wi, Dj) ("term frequency"): the number of times wi occurs in document Dj. Tf is sometimes normalized by dividing by the frequency of the most frequent non-stop term in the document [Tfnorm = Tf / Tfmax].

  5. Term Weighting Revisited (3) Definitions • Df(wi, C) ("document frequency"): the number of documents from C in which wi occurs. Df may be normalized by dividing it by the total number of documents in C. • IDf(wi, C) ("inverse document frequency"): [Df(wi, C)/size(C)]^(-1). Most often log2(IDf) is used, rather than IDf directly.

  6. Term Weighting Revisited (4) TfIDf Term Weights In general: TfIDf(wi, Dj, C) = F1(Tf(wi, Dj)) * F2(IDf(wi, C)) Usually F1 = 0.5 + log2(Tf), or Tf/Tfmax, or 0.5 + 0.5*Tf/Tfmax Usually F2 = log2(IDf) In the SMART IR system: TfIDf(wi, Dj, C) = [0.5 + 0.5*Tf(wi, Dj)/Tfmax(Dj)] * log2(IDf(wi, C))
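The SMART-style weight above can be sketched in a few lines of Python (a minimal illustration; the function name and the toy numbers are mine, not from the slides):

```python
import math

def tfidf(tf, tf_max, df, num_docs):
    """SMART-style TfIDf weight: [0.5 + 0.5*Tf/Tfmax] * log2(N/Df).
    tf: raw count of the term in the document,
    tf_max: count of the most frequent term in the document,
    df: number of documents in the collection containing the term,
    num_docs: size of the collection."""
    tf_part = 0.5 + 0.5 * tf / tf_max           # normalized term frequency
    idf_part = math.log2(num_docs / df)         # log2 of inverse document frequency
    return tf_part * idf_part

# A term occurring 3 times (most frequent term occurs 6 times) in a
# 1000-document collection where the term appears in 10 documents:
w = tfidf(3, 6, 10, 1000)   # 0.75 * log2(100), roughly 4.98
```

Rare terms (small Df) get a large IDf boost, while the 0.5 offset keeps any occurring term from being weighted near zero.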

  7. Term Weighting beyond TfIDf (1) Probabilistic Models • Old style (see textbooks): improves precision-recall slightly • Full statistical language modeling (CMU): improves precision-recall more significantly, but is difficult to compute efficiently

  8. Term Weighting beyond TfIDf (2) Neural Networks • Theoretically attractive • Do not scale up at all, unfortunately Fuzzy Sets • Not deeply researched, scaling difficulties

  9. Term Weighting beyond TfIDf (3) Natural Language Analysis • Analyze and understand the D's & Q first • The ultimate IR method, in theory • But NL understanding is generally an unsolved problem • Scale-up challenges, even if we could do it • Still, shown to improve IR in very limited domains

  10. Generalized Vector Space Model (1) Principles • Define terms by their occurrence patterns in documents • Define query terms in the same way • Compute similarity by document-pattern overlap for terms in D and Q • Use standard Cos similarity and either binary or TfIDf weights

  11. Generalized Vector Space Model (2) Advantages • Automatically calculates partial similarity If "heart disease" and "stroke" and "ventricular" co-occur in many documents, then if the query contains only one of these terms, documents containing the other will receive partial credit proportional to their document co-occurrence ratio. • No need to do query expansion or relevance feedback

  12. Generalized Vector Space Model (3) Disadvantages • Computationally expensive • Performance is roughly that of the vector space model plus query expansion

  13. GVSM, How it Works (1) Represent the collection as vector of documents: Let C = [D1, D2, ..., Dm ] Represent each term by its distributional frequency: Let ti = [Tf(ti, D1), Tf(ti, D2 ), ..., Tf(ti, Dm )] Term-to-term similarity is computed as: Sim(ti, tj) = cos(vec(ti), vec(tj)) Hence, highly co-occurring terms like "Arafat" and "PLO" will be treated as near-synonyms for retrieval
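The distributional term representation can be sketched as follows (the toy collection and function names are hypothetical):

```python
import math

def term_vector(term, docs):
    """Represent a term by its frequency in each document of the
    collection: [Tf(t, D1), Tf(t, D2), ..., Tf(t, Dm)]."""
    return [doc.count(term) for doc in docs]

def cosine(u, v):
    """Standard cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy collection of tokenized documents (hypothetical):
docs = [["arafat", "plo", "peace"],
        ["arafat", "plo", "talks"],
        ["stock", "market", "crash"]]

# "arafat" and "plo" occur in exactly the same documents, so their
# distributional vectors coincide and GVSM treats them as near-synonyms:
sim = cosine(term_vector("arafat", docs), term_vector("plo", docs))  # 1.0
```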

  14. GVSM, How it Works (2) And query-document similarity is computed as before: Sim(Q, D) = cos(vec(Q), vec(D)), except that instead of the dot-product calculation, we use a function of the term-to-term similarity computation above. For instance: Sim(Q, D) = Σi [Maxj sim(qi, dj)], or a variant of this sum, Simnorm(Q, D), normalized for document and query length.
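The Σi Maxj combination can be sketched as follows (the term-to-term similarity table here is hand-made for illustration; in GVSM it would come from the distributional cosines of the previous slide):

```python
def gvsm_sim(query_terms, doc_terms, term_sim):
    """Sim(Q, D) = sum over query terms q of the best term-to-term
    match max_d term_sim(q, d) among the document's terms."""
    return sum(max(term_sim(q, d) for d in doc_terms) for q in query_terms)

# Hypothetical term-to-term similarities (stand-ins for distributional cosines):
SIMS = {("stroke", "ventricular"): 0.6, ("heart", "ventricular"): 0.7}

def term_sim(a, b):
    if a == b:
        return 1.0
    return SIMS.get((a, b), SIMS.get((b, a), 0.0))

# Neither query term occurs in the document, yet the document still
# receives partial credit through co-occurrence similarity:
score = gvsm_sim(["stroke", "heart"], ["ventricular"], term_sim)  # 0.6 + 0.7
```

This is exactly the "partial similarity" advantage of slide 11: no explicit query expansion is needed.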

  15. GVSM, How it Works (3) Primary problem: More computation (sparse => dense) Primary benefit: Automatic term expansion by corpus

  16. A Critique of Pure Relevance (1) IR Maximizes Relevance • Precision and recall are relevance measures • Quality of documents retrieved is ignored

  17. A Critique of Pure Relevance (2) Other Important Factors • What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, ...? • In IR, we really want to maximize: P(U(f1, ..., fn) | Q & {C} & U & H), where Q = query, {C} = collection set, U = user profile, H = interaction history • ...but we don't yet know how. Darn.

  18. Maximal Marginal Relevance (1) • A crude first approximation: novelty => minimal-redundancy • Weighted linear combination: (redundancy = cost, relevance = benefit) • Free parameters: k and λ

  19. Maximal Marginal Relevance (2) MMR(Q, C, R) = Argmax(k) over di ∈ C of [ λ·S(Q, di) − (1 − λ)·max over dj ∈ R of S(di, dj) ]

  20. Maximal Marginal Relevance (MMR) (3) Computation of MMR Reranking 1. Standard IR retrieval of top-N docs: let Dr = IR(D, Q, N) 2. Rank the di ∈ Dr with max sim(di, Q) as the top doc, i.e. let Ranked = {di} 3. Let Dr = Dr \ {di} 4. While Dr is not empty, do: a. Find the di with max MMR(Dr, Q, Ranked) b. Let Ranked = Ranked.di c. Let Dr = Dr \ {di}
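The reranking loop above can be sketched in Python (the Jaccard word-overlap similarity and the toy documents are stand-ins of mine, not part of the slides):

```python
def mmr_rerank(query, docs, sim, lam=0.5, k=3):
    """Iteratively pick the document maximizing
    lam * S(Q, d) - (1 - lam) * max over already-ranked r of S(d, r)."""
    remaining = list(docs)
    ranked = []
    while remaining and len(ranked) < k:
        best = max(remaining, key=lambda d:
                   lam * sim(query, d)
                   - (1 - lam) * max((sim(d, r) for r in ranked), default=0.0))
        ranked.append(best)
        remaining.remove(best)
    return ranked

def overlap(a, b):
    """Jaccard word overlap as a stand-in for the similarity S."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

docs = ["heart disease treatment",
        "heart disease therapy treatment",   # near-duplicate of the first
        "stroke prevention"]

# With lam = 0.5, the novel third document outranks the redundant second:
order = mmr_rerank("heart disease", docs, overlap, lam=0.5, k=3)
# order == ["heart disease treatment", "stroke prevention",
#           "heart disease therapy treatment"]
```

Raising λ toward 1 recovers the plain relevance ranking; lowering it pushes redundant documents further down.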

  21. MMR Ranking vs Standard IR [Figure: documents arranged around a query; standard IR ranks purely by relevance to the query, while MMR traces a spiral through the document space, with λ controlling the curl of the spiral.]

  22. Maximal Marginal Relevance (MMR) (4) Applications: • Ranking retrieved documents from IR Engine • Ranking passages for inclusion in Summaries

  23. Document Summarization in a Nutshell (1) Types of Summaries

  24. Document Summarization in a Nutshell (2) Other Dimensions • Single- vs multi-document summarization • Genre-adaptive vs one-size-fits-all • Single-language vs translingual • Flat summary vs hyperlinked pyramid • Text-only vs multi-media • ...

  25. Summarization as Passage Retrieval (1) For Query-Driven Summaries 1. Divide the document into passages, e.g., sentences, paragraphs, FAQ-pairs, .... 2. Use the query to retrieve the most relevant passages, or better, use MMR to avoid redundancy. 3. Assemble the retrieved passages into a summary.

  26. Summarization as Passage Retrieval (2) For Generic Summaries 1. Use the title or the top-k TfIDf terms as the query. 2. Proceed as in query-driven summarization.

  27. Summarization as Passage Retrieval (3) For Multidocument Summaries 1. Cluster documents into topically-related groups. 2. For each group, divide the documents into passages and keep track of the source of each passage. 3. Use MMR to retrieve the most relevant non-redundant passages (MMR is necessary for multiple docs). 4. Assemble a summary for each cluster.
