Information Retrieval (13)
Prof. Dragomir R. Radev
radev@umich.edu
IR Winter 2010: 23. Text summarization
Separate presentation (SIGIR 2004 tutorial)
IR Winter 2010: 24. Collaborative filtering. Recommendation systems.
Examples
• http://www.netflix.com
  Given “Pulp Fiction”, it recommends:
  • Apocalypse Now
  • Reservoir Dogs
  • Kill Bill: Vol. 1
  • Kill Bill: Vol. 2
  • American Beauty
• http://www.amazon.com
  Given Philip Ball’s “Critical Mass”, here are Amazon’s recommendations:
  • The Wisdom of Crowds by James Surowiecki
  • The Paradox of Choice: Why More Is Less by Barry Schwartz
  • Why Life Speeds Up As You Get Older: How Memory Shapes Our Past by Douwe Draaisma
  • The Origin of Wealth: Evolution, Complexity, and the Radical Remaking of Economics by Eric D. Beinhocker
  • Freakonomics [Revised and Expanded]: A Rogue Economist Explores the Hidden Side of Everything by Steven D. Levitt
Examples
• http://www.pandora.com/
• http://www.google.com/search?hl=en&q=related:www.umich.edu/
Main approaches:
• Vector-based: represent each user as a vector of ratings
• Graph-based: random walks on bipartite user-item graphs
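As a sketch of the vector-based approach, here is a minimal user-user collaborative filter over a hypothetical rating matrix (all user names and ratings below are invented for illustration; this is not any site's actual algorithm):

```python
import math

# Hypothetical user-item rating matrix (sparse: missing = unrated).
ratings = {
    "alice": {"Pulp Fiction": 5, "Reservoir Dogs": 4, "Kill Bill: Vol. 1": 5},
    "bob":   {"Pulp Fiction": 5, "Reservoir Dogs": 5, "American Beauty": 4},
    "carol": {"American Beauty": 5, "Kill Bill: Vol. 1": 2},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    shared = set(u) & set(v)
    num = sum(u[i] * v[i] for i in shared)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def recommend(user, k=2):
    """Rank items the user has not rated by similarity-weighted scores."""
    scores = {}
    for other, vec in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], vec)
        for item, r in vec.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("alice"))  # items alice has not rated, best first
```

Real systems replace the raw similarity-weighted sum with mean-centered ratings and neighborhood truncation, but the vector representation of each user is the same idea.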
IR Winter 2010: 25. Burstiness. Self-triggerability.
Slides by Zhuoran Chen
Burstiness
• Given the average per-document frequency of a word in a collection, can we predict how many times it will appear in a given document?
• Church’s example: how many instances of “Noriega” will we see in a document?
• The probability of the first occurrence depends on DF, but the probability of a second occurrence does not!
• The adaptive language model: the degree of adaptation depends on lexical content and is independent of frequency.
• “Word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph” -- Church & Gale
The 2-Poisson Model (Bookstein and Swanson)
• Intuition: content-bearing words cluster in relevant documents; non-content words occur randomly.
• Method: model word occurrence counts as a linear combination (mixture) of two Poisson distributions.
• Surprisingly, the two-Poisson model can account for the occupancy distribution of most words.
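A minimal sketch of the two-Poisson mixture: with some probability a document is "elite" for the term (high Poisson rate), otherwise it draws from a low background rate. The mixing weight and rates below are invented for illustration, not fitted to any corpus:

```python
import math

def poisson(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def two_poisson(k, alpha, lam_elite, lam_bg):
    """Mixture: with prob. alpha the document is 'elite' for the term
    (rate lam_elite); otherwise it uses the background rate lam_bg."""
    return alpha * poisson(k, lam_elite) + (1 - alpha) * poisson(k, lam_bg)

# Illustrative (assumed) parameters: 10% of documents are elite for the
# term and use it ~5 times on average; elsewhere it is rare.
probs = [two_poisson(k, alpha=0.1, lam_elite=5.0, lam_bg=0.1) for k in range(8)]
```

The heavy tail produced by the elite component is what lets the mixture match the bursty occupancy distributions that a single Poisson cannot.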
Term Burstiness
Definitions of word frequency:
• Term frequency (TF): count of a word’s occurrences in a given document
• Document frequency (DF): count of documents in the corpus in which the word occurs
• Generalized document frequency (DFj): like DF, but the word must occur at least j times
• DF/N: given a word, the chance we will see it in a document (the p in Church 2000)
• ∑TF/N: given a word, the average number of times we will see it in a document
• Key question: given that we have seen a word in one document, what is the chance that we will see it again?
Adaptive model
Church’s formulas:
• Cache model: Pr(w) = λ Prlocal(w) + (1-λ) Prglobal(w)
• History-test division; positive and negative adaptation:
  Pr(+adapt) = Pr(w in test | w in history)
  Pr(-adapt) = Pr(w in test | w not in history)
  Observation: Pr(+adapt) >> Pr(prior) > Pr(-adapt)
• Generalized DF: dfj = number of documents with j or more instances of w
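The adaptation probabilities can be estimated by splitting each document into a history half and a test half, as in Church's history-test division. A rough sketch (the function name and the even half-split heuristic are assumptions for illustration):

```python
def adaptation_probs(docs, word):
    """Estimate Pr(+adapt) = Pr(w in test | w in history) and
    Pr(-adapt) = Pr(w in test | w not in history) over a collection,
    splitting each document into a history half and a test half."""
    pos_n = pos_d = neg_n = neg_d = 0
    for doc in docs:
        toks = doc.split()
        history, test = toks[: len(toks) // 2], toks[len(toks) // 2 :]
        in_hist, in_test = word in history, word in test
        if in_hist:
            pos_d += 1
            pos_n += in_test          # bool counts as 0/1
        else:
            neg_d += 1
            neg_n += in_test
    p_pos = pos_n / pos_d if pos_d else 0.0
    p_neg = neg_n / neg_d if neg_d else 0.0
    return p_pos, p_neg
```

On real newswire data this recovers the slide's observation that Pr(+adapt) is far larger than Pr(-adapt) for content words like "Noriega".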
IR Winter 2010: 26. Information Extraction. Hidden Markov Models.
Information Extraction
• Extracting database records from unstructured and semi-structured inputs
• Examples:
  • Recognizing names of people in text
  • Extracting prices from tables
  • Linking companies with products
  • Identifying positive vs. negative opinions
• Main steps:
  • Segmentation
  • Classification
  • Association
  • Clustering
FDA expands pet food recall

The nationwide pet food recall was expanded Wednesday to include products containing rice protein laced with melamine, a toxic agent, the Food and Drug Administration said. Before this latest announcement, the FDA attributed pet illness and deaths to recalled pet food with wheat gluten found to contain melamine, a component of fertilizers and plastic utensils. Also on Wednesday, Menu Foods, the company that recalled more than 60 million cans and pouches of wet cat and dog food on March 15, added one of its Natural Life brand products to its recall list. It added two product dates to eight of its already recalled pet foods. The FDA has recorded 16 animal deaths related to the wheat gluten-pet food recall. However, other organizations have put the death toll in the thousands. After consumer complaints to Natural Balance of Pacoima, California, reporting kidney failure in several cats and dogs after eating the company's venison products, the firm issued a nationwide recall of its venison and brown rice canned and bagged dog foods and treats, and venison and green pea dry cat food, the FDA said.

Extracted entities:
• FDA – organization
• Food and Drug Administration – organization
• Menu Foods – company
• Natural Life – brand
• Natural Balance – company
• Pacoima – location
• California – location
Landscape of IE Techniques (slide by William Cohen)
Approaches illustrated on the running example “Abraham Lincoln was born in Kentucky.”:
• Lexicons: membership test against a list (e.g., state names: Alabama, Alaska, …, Wisconsin, Wyoming)
• Sliding window: classify each window of text (which class?), trying alternate window sizes
• Classify pre-segmented candidates
• Boundary models: classify BEGIN/END positions
• Finite state machines: find the most likely state sequence
• Context-free grammars: find the most likely parse (NNP NNP V V P NP; NP, PP, VP, S)
Our focus today: finite state machines (HMMs).
Markov Property
States: S1 = rain, S2 = cloud, S3 = sun.
The state of a system at time t+1, qt+1, is conditionally independent of {qt-1, qt-2, …, q1, q0} given qt.
In other words, the current state determines the probability distribution for the next state.
(The slide shows a state-transition diagram with probabilities such as 1/2, 1/3, 2/3, and 1 on the arcs.)
Slide by Yunyao Li
Markov Property
States: S1 = rain, S2 = cloud, S3 = sun; the state-transition probabilities A are given by the diagram on the slide.
Q: given that today is sunny (i.e., q1 = S3), what is the probability of observing “sun, cloud” next under the model?
Slide by Yunyao Li
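Under the Markov property, the probability of any state sequence given the start state is just a product of one-step transition probabilities. A sketch with an assumed transition matrix (the numbers below are placeholders for illustration, not the slide's exact values):

```python
# Assumed 3-state transition matrix (rain, cloud, sun); each row sums to 1.
A = [
    [0.4, 0.3, 0.3],   # from rain
    [0.2, 0.5, 0.3],   # from cloud
    [0.1, 0.4, 0.5],   # from sun
]
RAIN, CLOUD, SUN = 0, 1, 2

def seq_prob(states, A):
    """P(q2, ..., qT | q1) under the Markov property: the product of
    one-step transition probabilities along the sequence."""
    p = 1.0
    for s, t in zip(states, states[1:]):
        p *= A[s][t]
    return p

# Given today is sunny, probability of "sun, cloud" on the next two days:
p = seq_prob([SUN, SUN, CLOUD], A)   # A[sun][sun] * A[sun][cloud]
```

With these placeholder numbers the answer is 0.5 * 0.4 = 0.2; with the slide's own matrix the computation is identical.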
Hidden Markov Model
States: S1 = rain, S2 = cloud, S3 = sun. In addition to the state-transition probabilities, each state emits observations O1, …, O5 with state-dependent probabilities (e.g., 9/10 and 1/10, 4/5 and 1/5, 7/10 and 3/10 on the slide). Only the observations are visible; the state sequence is hidden.
Slide by Yunyao Li
IE with Hidden Markov Models
Given a sequence of observations:
  CS 6998 is held weekly at IPB.
and a trained HMM with states such as course name, location name, background, and person name, find the most likely state sequence (Viterbi) for
  CS 6998 is held weekly at IPB
Any words generated by the designated “course name” state are extracted as a course name:
  Course name: CS 6998
Slide by Yunyao Li
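The Viterbi decoding step can be sketched as follows. The toy states and probabilities are invented for illustration (they are not trained parameters), but the dynamic program is the standard algorithm:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for an observation sequence."""
    # V[t][s] = (best probability of any path ending in s at time t, backpointer)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-9), None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t-1][p][0] * trans_p[p][s])
            V[t][s] = (V[t-1][best_prev][0] * trans_p[best_prev][s]
                       * emit_p[s].get(obs[t], 1e-9), best_prev)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

# Toy model: course-name state, background state, location state.
states = ["course", "bg", "loc"]
start = {"course": 0.4, "bg": 0.5, "loc": 0.1}
trans = {
    "course": {"course": 0.6, "bg": 0.35, "loc": 0.05},
    "bg":     {"course": 0.1, "bg": 0.7,  "loc": 0.2},
    "loc":    {"course": 0.05, "bg": 0.8, "loc": 0.15},
}
emit = {
    "course": {"CS": 0.5, "6998": 0.5},
    "bg":     {"is": 0.3, "held": 0.3, "weekly": 0.2, "at": 0.2},
    "loc":    {"IPB": 1.0},
}
obs = ["CS", "6998", "is", "held", "weekly", "at", "IPB"]
print(viterbi(obs, states, start, trans, emit))
```

Words decoded into the "course" state ("CS 6998") are extracted as the course name, exactly as on the slide.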
Named Entity Extraction [Bikel et al. 1998]
Hidden states: Person, Org, and five other name classes, plus Other, with designated start-of-sentence and end-of-sentence states.
Slide by Yunyao Li
Named Entity Extraction
• Transition probabilities: P(st | st-1, ot-1)
• Observation probabilities: P(ot | st, st-1) or P(ot | st, ot-1)
Generation proceeds in three steps:
(1) Generate the first word of a name class
(2) Generate the rest of the words in the name class
(3) Generate “+end+” to leave the name class
Slide by Yunyao Li
Training: Estimating Probabilities
Slide by Yunyao Li
Back-Off
For “unknown words” and insufficient training data, back off to less specific models:
• Transition probabilities: P(st | st-1) → P(st)
• Observation probabilities: P(ot | st) → P(ot)
Slide by Yunyao Li
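One simple way to realize this back-off is linear interpolation of the specific estimate with its less specific back-offs. The function name and the weights below are illustrative assumptions, not Bikel et al.'s exact scheme:

```python
def smoothed(word, state, emit_counts, state_total, word_counts, corpus_total,
             vocab_size, lam1=0.8, lam2=0.15):
    """P(word | state) with back-off: the state-conditional estimate,
    backed off to the corpus unigram P(word), backed off to a uniform
    1/V floor so unknown words never get zero probability."""
    p_cond = emit_counts.get((state, word), 0) / state_total[state]
    p_uni = word_counts.get(word, 0) / corpus_total
    return lam1 * p_cond + lam2 * p_uni + (1 - lam1 - lam2) / vocab_size
```

In practice the interpolation weights are themselves estimated from held-out data rather than fixed, but the chain of progressively less conditioned distributions is the same.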
HMM Experimental Results
Trained on ~500k words of newswire text. Results: (table on the original slide)
Slide by Yunyao Li
IR Winter 2010: 27. Probabilistic models of IR. Language models.
Slides by Manning, Schuetze, Raghavan
Probability Ranking Principle
Let x be a document in the collection. Let R represent relevance of a document w.r.t. a given (fixed) query, and let NR represent non-relevance. Relevance is binary: R = 1 (relevant) vs. R = 0 (non-relevant).
We need to find p(R|x), the probability that a document x is relevant.
• p(R), p(NR): prior probability of retrieving a (non-)relevant document
• p(x|R), p(x|NR): probability that if a relevant (non-relevant) document is retrieved, it is x
Binary Independence Model
• Traditionally used in conjunction with the PRP
• “Binary” = Boolean: documents are represented as binary incidence vectors of terms (cf. lecture 1): xi = 1 iff term i is present in document x
• “Independence”: terms occur in documents independently
• Different documents can be modeled as the same vector
• Equivalent to the Bernoulli Naive Bayes model (cf. text categorization!)
Binary Independence Model
• Queries: binary term incidence vectors
• Given query q, for each document d we need to compute p(R|q,d)
• Replace this with computing p(R|q,x), where x is the binary term incidence vector representing d
• We are interested only in ranking, so we will use odds and Bayes’ Rule:
  O(R|q,x) = p(R|q,x) / p(NR|q,x) = [p(R|q) / p(NR|q)] · [p(x|R,q) / p(x|NR,q)]
Binary Independence Model
Using the independence assumption:
  p(x|R,q) / p(x|NR,q) = ∏i p(xi|R,q) / p(xi|NR,q)
So:
  O(R|q,x) = O(R|q) · ∏i p(xi|R,q) / p(xi|NR,q)
The first factor is constant for a given query; only the product needs estimation.
Binary Independence Model
Let pi = p(xi=1|R,q) and ri = p(xi=1|NR,q).
Assume pi = ri for all terms not occurring in the query (qi=0); this can be changed (e.g., in relevance feedback).
Since xi is either 0 or 1:
  O(R|q,x) = O(R|q) · ∏{qi=1, xi=1} pi/ri · ∏{qi=1, xi=0} (1-pi)/(1-ri)
Then...
Binary Independence Model
Multiplying and dividing by (1-pi)/(1-ri) over the matching terms regroups the product:
  O(R|q,x) = O(R|q) · ∏{xi=qi=1} [pi(1-ri)] / [ri(1-pi)] · ∏{qi=1} (1-pi)/(1-ri)
The first product is over all matching terms (query terms present in the document); the second is over all query terms and does not depend on the document.
Binary Independence Model
• Retrieval Status Value:
  RSV = log ∏{xi=qi=1} [pi(1-ri)] / [ri(1-pi)] = ∑{xi=qi=1} ci,  where ci = log [pi(1-ri)] / [ri(1-pi)]
The remaining factor is constant for each query; the ci are the only quantities to be estimated for ranking.
Binary Independence Model
• Everything boils down to computing the RSV. So, how do we compute the ci’s from our data?
Binary Independence Model
• Estimating the RSV coefficients: for each term i, look at this table of document counts:

            relevant   non-relevant     total
  xi = 1    s          df - s           df
  xi = 0    S - s      N - df - S + s   N - df
  total     S          N - S            N

• Estimates: pi ≈ s/S and ri ≈ (df - s)/(N - S), so ci ≈ log [s (N - df - S + s)] / [(df - s)(S - s)]
• For now, assume no zero counts. More in MSR12.
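A sketch of the resulting scoring, using the standard 0.5-smoothed estimates from the contingency table (notation: N documents total, df containing term i, S relevant, s relevant and containing term i):

```python
import math

def rsv_weight(N, df, S, s):
    """BIM term weight c_i from the relevance contingency table,
    with 0.5 add-on smoothing to avoid zero counts."""
    p = (s + 0.5) / (S + 1)            # estimate of p_i = P(x_i = 1 | R)
    r = (df - s + 0.5) / (N - S + 1)   # estimate of r_i = P(x_i = 1 | NR)
    return math.log(p * (1 - r) / (r * (1 - p)))

def rsv(doc_terms, query_terms, weights):
    """Retrieval Status Value: sum of c_i over matching query terms."""
    return sum(weights[t] for t in query_terms if t in doc_terms and t in weights)
```

With no relevance information (S = s = 0), the weight reduces to an IDF-like quantity, log((N - df + 0.5) / (df + 0.5)): rare terms score high, terms in half the collection score near zero.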
IR Winter 2010: 28. Adversarial IR. Spamming and anti-spamming methods.
Adversarial IR
• We looked at spamming in the context of Naïve Bayes
• Let’s now consider spamming of hyperlinked IR
• The main idea: artificially increase your in-degree
• Link farms: groups of pages that point to each other
• Google penalizes sites that belong to link farms
IR Winter 2010: 29. Human behavior on the Web.
Sample tasks
• Identifying sessions in query logs
• Predicting accesses to a given page (e.g., for caching)
• Recognizing human vs. automated queries
Analysis of Search Engine Query Logs
This slide is from Pierre Baldi
Main Results
• The average number of terms in a query ranges from a low of 2.2 to a high of 2.6
• The most common number of terms in a query is 2
• The majority of users don’t refine their query
• The share of users who viewed only a single page of results increased from 29% (1997) to 51% (2001) (Excite)
• 85% of users viewed only the first page of search results (AltaVista)
• 45% of queries (2001) were about Commerce, Travel, Economy, or People (up from 20% in 1997)
• Queries about adult content or entertainment decreased from 20% (1997) to around 7% (2001)
This slide is from Pierre Baldi
Main Results
• All four studies produced a generally consistent set of findings about user behavior in a search engine context:
  • most users view relatively few pages per query
  • most users don’t use advanced search features
(Figure: query length distributions (bars) with a fitted Poisson model (dots and lines).)
This slide is from Pierre Baldi
Power-Law Characteristics
• Frequency f(r) of queries with rank r follows a power law (a straight line in log-log space)
• Data: 110,000 queries from Vivisimo; 1.9 million queries from Excite
• There are strong regularities in the patterns of behavior in how we search the Web
This slide is from Pierre Baldi
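The power-law claim can be checked by fitting a line to log f(r) versus log r: for f(r) ~ C·r^(-β), the slope of that line is -β. A minimal least-squares sketch on synthetic Zipf-like data (the data below is generated, not from the Vivisimo or Excite logs):

```python
import math

def powerlaw_slope(freqs):
    """Least-squares slope of log f(r) vs. log r, for ranks r = 1..n.
    For a power law f(r) ~ C * r**(-beta) the slope recovers -beta."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic rank-frequency data with beta = 1 (Zipf-like):
freqs = [1000 / r for r in range(1, 101)]
slope = powerlaw_slope(freqs)
```

Real query logs only approximate a straight line (the head and tail usually bend), so fitted slopes are typically reported over a truncated rank range.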