Under The Hood [Part II] Web-Based Information Architectures MSEC 20-760, Mini II Jaime Carbonell
Today’s Topics • Term weighting in detail • Generalized Vector Space Model (GVSM) • Maximal Marginal Relevance • Summarization as Passage Retrieval
Term Weighting Revisited (1) Definitions • wi, the "ith term": a word, stemmed word, or indexed phrase • Dj, the "jth document": a unit of indexed text, e.g. a web page, a news report, an article, a patent, a legal case, a book, a chapter of a book, etc.
Term Weighting Revisited (2) Definitions • C, "the collection": the full set of indexed documents (e.g. the New York Times archive, the Web, ...) • Tf(wi, Dj), "term frequency": the number of times wi occurs in document Dj. Tf is sometimes normalized by dividing by the frequency of the most frequent non-stop term in the document [Tfnorm = Tf / Tfmax].
Term Weighting Revisited (3) Definitions • Df(wi, C), "document frequency": the number of documents in C in which wi occurs. Df may be normalized by dividing it by the total number of documents in C. • IDf(wi, C), "inverse document frequency": [Df(wi, C)/size(C)]^(-1), i.e. size(C)/Df(wi, C). Most often log2(IDf) is used rather than IDf directly.
Term Weighting Revisited (4) TfIDf Term Weights • In general: TfIDf(wi, Dj, C) = F1(Tf(wi, Dj)) * F2(IDf(wi, C)) • Usually F1 = 0.5 + log2(Tf), or Tf/Tfmax, or 0.5 + 0.5·Tf/Tfmax • Usually F2 = log2(IDf) • In the SMART IR system: TfIDf(wi, Dj, C) = [0.5 + 0.5·Tf(wi, Dj)/Tfmax(Dj)] * log2(IDf(wi, C))
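To make the SMART formula concrete, here is a minimal Python sketch, assuming documents are plain token lists; the helper names (idf, smart_tfidf) and the toy collection are illustrative, not from the lecture.

```python
import math
from collections import Counter

def idf(term, collection):
    """IDf(w, C) = size(C) / Df(w, C); 0.0 if the term never occurs."""
    df = sum(1 for doc in collection if term in doc)
    return len(collection) / df if df else 0.0

def smart_tfidf(term, doc, collection):
    """[0.5 + 0.5*Tf(w,D)/Tfmax(D)] * log2(IDf(w,C)), per the SMART formula."""
    counts = Counter(doc)
    tf_part = 0.5 + 0.5 * counts[term] / max(counts.values())
    idf_val = idf(term, collection)
    return tf_part * math.log2(idf_val) if idf_val > 0 else 0.0

docs = [["heart", "disease", "stroke"],
        ["stroke", "therapy", "stroke"],
        ["heart", "heart", "attack"]]
print(smart_tfidf("heart", docs[2], docs))  # frequent in D3, in 2 of 3 docs
```

Note that a term occurring in every document gets weight 0 (log2(1) = 0), which is exactly the intended behavior for uninformative terms.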
Term Weighting beyond TfIDf (1) Probabilistic Models • Old style (see textbooks): improves precision-recall slightly • Full statistical language modeling (CMU): improves precision-recall more significantly, but is difficult to compute efficiently
Term Weighting beyond TfIDf (2) Neural Networks • Theoretically attractive • Do not scale up at all, unfortunately Fuzzy Sets • Not deeply researched, scaling difficulties
Term Weighting beyond TfIDf (3) Natural Language Analysis • Analyze and understand D's & Q first • The ultimate IR method, in theory • In general, NL understanding is an unsolved problem • Scale-up challenges remain, even if we could do it • But it has been shown to improve IR for very limited domains
Generalized Vector Space Model (1) Principles • Define terms by their occurrence patterns in documents • Define query terms in the same way • Compute similarity by document-pattern overlap for terms in D and Q • Use standard cosine similarity and either binary or TfIDf weights
Generalized Vector Space Model (2) Advantages • Automatically calculates partial similarity If "heart disease" and "stroke" and "ventricular" co-occur in many documents, then if the query contains only one of these terms, documents containing the other will receive partial credit proportional to their document co-occurrence ratio. • No need to do query expansion or relevance feedback
Generalized Vector Space Model (3) Disadvantages • Computationally expensive • Performance ≈ standard vector space + query expansion
GVSM, How it Works (1) • Represent the collection as a vector of documents: let C = [D1, D2, ..., Dm] • Represent each term by its distributional frequency: let ti = [Tf(ti, D1), Tf(ti, D2), ..., Tf(ti, Dm)] • Term-to-term similarity is computed as: Sim(ti, tj) = cos(vec(ti), vec(tj)) • Hence, highly co-occurring terms like "Arafat" and "PLO" are treated as near-synonyms for retrieval
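A minimal sketch of these term vectors, assuming numpy and token-list documents; term_vector and cos_sim are illustrative names, not from the lecture.

```python
import numpy as np

def term_vector(term, collection):
    """Distributional vector t_i = [Tf(t_i, D_1), ..., Tf(t_i, D_m)]."""
    return np.array([doc.count(term) for doc in collection], dtype=float)

def cos_sim(u, v):
    """Standard cosine similarity; 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

docs = [["arafat", "plo", "talks"], ["arafat", "plo"], ["stock", "market"]]
print(cos_sim(term_vector("arafat", docs), term_vector("plo", docs)))  # 1.0
```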
GVSM, How it Works (2) Query-document similarity is computed as before: Sim(Q, D) = cos(vec(Q), vec(D)), except that instead of the direct dot-product calculation we use a function of the term-to-term similarities above. For instance: Sim(Q, D) = Σi [maxj sim(qi, dj)], or, normalizing for document and query length: Simnorm(Q, D) = Σi [maxj sim(qi, dj)] / (|Q| · |D|)
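The scoring functions translate directly into code, reusing term_vector and cos_sim from the previous sketch; dividing by |Q|·|D| is one plausible reading of the length-normalized variant, stated here as an assumption.

```python
def gvsm_sim(query_terms, doc_terms, collection):
    """Sim(Q, D) = sum_i max_j sim(q_i, d_j) over term-to-term similarities."""
    return sum(
        max(cos_sim(term_vector(q, collection), term_vector(d, collection))
            for d in doc_terms)
        for q in query_terms)

def gvsm_sim_norm(query_terms, doc_terms, collection):
    """Length-normalized variant: divide the sum by |Q| * |D| (assumed)."""
    return gvsm_sim(query_terms, doc_terms, collection) / (
        len(query_terms) * len(doc_terms))
```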
GVSM, How it Works (3) Primary problem: more computation (term vectors go from sparse to dense) Primary benefit: automatic term expansion driven by the corpus
A Critique of Pure Relevance (1) IR Maximizes Relevance • Precision and recall are relevance measures • Quality of documents retrieved is ignored
A Critique of Pure Relevance (2) Other Important Factors • What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, ...? • In IR, we really want to maximize: P(U(f1, ..., fn) | Q & {C} & U & H), where Q = query, {C} = collection set, U = user profile, H = interaction history • ...but we don't yet know how. Darn.
Maximal Marginal Relevance (1) • A crude first approximation: novelty => minimal-redundancy • Weighted linear combination: (redundancy = cost, relevance = benefit) • Free parameters: k and λ
Maximal Marginal Relevance (2) MMR(Q, C, R) = Argmax[di ∈ C] [ λ·S(Q, di) − (1−λ)·max[dj ∈ R] S(di, dj) ], applied iteratively to select the top k documents
Maximal Marginal Relevance (MMR) (3) COMPUTATION OF MMR RERANKING 1. Standard IR retrieval of top-N docs: let Dr = IR(D, Q, N) 2. Rank the di ∈ Dr with maximal sim(di, Q) as the top document, i.e. let Ranked = {di} 3. Let Dr = Dr \ {di} 4. While Dr is not empty, do: a. Find the di with maximal MMR(Dr, Q, Ranked) b. Append di to Ranked c. Let Dr = Dr \ {di}
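The reranking loop above transcribes directly into Python; sim(a, b) is any similarity function (e.g. cosine over TfIDf vectors), docs is the retrieved set Dr, and the names are illustrative.

```python
def mmr_rerank(query, docs, sim, lam=0.7, k=10):
    """Greedy MMR reranking of the top-N retrieved set."""
    remaining = list(docs)                      # Dr
    top = max(remaining, key=lambda d: sim(query, d))
    ranked = [top]                              # step 2: seed with best match
    remaining.remove(top)                       # step 3
    while remaining and len(ranked) < k:        # step 4
        def mmr(d):  # lambda*S(Q,d) - (1-lambda)*max over ranked of S(d, dj)
            return lam * sim(query, d) - (1 - lam) * max(sim(d, r) for r in ranked)
        best = max(remaining, key=mmr)          # 4a
        ranked.append(best)                     # 4b
        remaining.remove(best)                  # 4c
    return ranked
```

With λ = 1 this degenerates to standard relevance ranking; lowering λ trades relevance for novelty.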
MMR Ranking vs Standard IR [Figure: documents plotted around a query; standard IR ranks strictly by relevance to the query, while MMR spirals outward through distinct regions of the document space; λ controls the spiral curl.]
Maximal Marginal Relevance (MMR) (4) Applications: • Ranking retrieved documents from IR Engine • Ranking passages for inclusion in Summaries
Document Summarization in a Nutshell (1) Types of Summaries
Document Summarization in a Nutshell (2) Other Dimensions • Single- vs multi-document summarization • Genre-adaptive vs one-size-fits-all • Single-language vs translingual • Flat summary vs hyperlinked pyramid • Text-only vs multi-media • ...
Summarization as Passage Retrieval (1) For Query-Driven Summaries 1. Divide the document into passages, e.g., sentences, paragraphs, FAQ-pairs, .... 2. Use the query to retrieve the most relevant passages, or better, use MMR to avoid redundancy. 3. Assemble the retrieved passages into a summary.
Summarization as Passage Retrieval (2) For Generic Summaries 1. Use the title or the top-k TfIDf terms as the query. 2. Proceed as in query-driven summarization.
Summarization as Passage Retrieval (3) For Multidocument Summaries 1. Cluster documents into topically-related groups. 2. For each group, divide each document into passages and keep track of the source of each passage. 3. Use MMR to retrieve the most relevant non-redundant passages (MMR is essential for multiple docs). 4. Assemble a summary for each cluster.
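A minimal sketch tying the query-driven and generic recipes together, reusing mmr_rerank from above; passages are token lists, sim is the same similarity function, and the plain term-frequency query in generic_summary is a crude stand-in for true top-TfIDf terms.

```python
from collections import Counter

def passage_summary(passages, query, sim, k=3, lam=0.7):
    """Query-driven summary: MMR-select k passages, re-emit in source order."""
    picked = mmr_rerank(query, passages, sim, lam=lam, k=k)
    return [p for p in passages if p in picked]

def generic_summary(passages, sim, k=3, top_terms=5):
    """Generic summary: use the document's most salient terms as the query."""
    counts = Counter(t for p in passages for t in p)
    query = [t for t, _ in counts.most_common(top_terms)]  # crude Tf proxy
    return passage_summary(passages, query, sim, k=k)
```

Restoring source order after selection matters: MMR picks passages by marginal relevance, but a readable summary presents them in the order they appeared in the document.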