Search Engines and Question Answering Giuseppe Attardi Università di Pisa (some slides borrowed from C. Manning, H. Schütze)
Overview • Information Retrieval Models • Boolean and vector-space retrieval models; ranked retrieval; text-similarity metrics; TF-IDF (term frequency/inverse document frequency) weighting; cosine similarity; performance metrics: precision, recall, F-measure. • Indexing and Search • Indexing and inverted files; Compression; Postings Lists; Query languages • Web Search • Search engines; Architecture; Crawling: parallel/distributed, focused; Link analysis (Google PageRank); Scaling • Text Categorization and Clustering • Question Answering • Information extraction; Named Entity Recognition; Natural Language Processing; Part of Speech tagging; Question analysis and semantic matching
References • Modern Information Retrieval, R. Baeza-Yates, B. Ribeiro-Neto, Addison Wesley, 1999. • Managing Gigabytes, 2nd Edition, I.H. Witten, A. Moffat, T.C. Bell, Morgan Kaufmann, 1999. • Foundations of Statistical Natural Language Processing, C. Manning, H. Schütze, MIT Press, 1999.
Adaptive Computing • Desktop Metaphor: highly successful in making computers popular • See Alan Kay’s 1975 presentation in Pisa • Limitations: • Point and click involves very elementary actions • People are required to perform more and more clerical tasks • We have become: bank clerks, typographers, illustrators, librarians
Illustrative problem • Add a table to a document with results from the latest benchmarks and send it to my colleague Antonio: • 7–8 point&clicks just to get to the document • 7–8 point&clicks to get to the data • Lengthy fiddling with table layout • 3–4 point&clicks to retrieve the mail address • Etc.
Success story • Do I care where a document is stored? • Shall I need a secretary for filing my documents? • Search Engines prove that you don’t
Overcoming the Desktop Metaphor • Raise the level of interaction with computers • How? • Could think of just one possibility: use natural language
Adaptiveness • My language is different from yours • Should be learned from user interaction • See: Steels’ Talking Heads language games • Through implicit interactions • Many potential sources (e.g. filing a message in a folder implies a classification)
Research Goal • Question Answering • Techniques: • Traditional IR tools • NLP tools (POS tagging, parser) • Complement Knowledge Bases with massive data sets of usage (Web) • Knowledge extraction tools (NE tagging) • Continuous learning
IXE Framework • [architecture diagram: a layered framework — OS Abstraction (Files, Mem Mapping, Threads, Synchronization); core utilities (Unicode, RegExp, Tokenizer, Suffix Trees, Readers, Text, Object Store); indexing and retrieval (Indexer, Search, Crawler, Passage Index); machine learning (MaxEntropy, GIS, EventStream, ContextStream); NLP tools (POS Tagger, NE Tagger, Sentence Splitter, Clustering); Web Service Wrappers with Python, Perl, Java bindings]
Information Retrieval Models • A model is an embodiment of the theory in which we define a set of objects about which assertions can be made and restrict the ways in which classes of objects can interact • A retrieval model specifies the representations used for documents and information needs, and how they are compared (Turtle & Croft, 1992)
Information Retrieval Model • Provides an abstract description of the representation used for documents, the representation of queries, the indexing process, the matching process between a query and the documents and the ranking criteria
Formal Characterization • An Information Retrieval model is a quadruple ⟨D, Q, F, R⟩ where • D is a set of representations for the documents in the collection • Q is a set of representations for the user information needs (queries) • F is a framework for modelling document representations, queries, and their relationships • R: Q × D → ℝ is a ranking function which associates a real number with a query qi ∈ Q and a document representation dj ∈ D (Baeza-Yates & Ribeiro-Neto, 1999)
Information Retrieval Models • Three ‘classic’ models: • Boolean Model • Vector Space Model • Probabilistic Model • Additional models • Extended Boolean • Fuzzy matching • Cluster-based retrieval • Language models
[Diagram: retrieval process — an information need is parsed into a query; text input from the collections is pre-processed and indexed; the query is then matched and ranked against the index]
Boolean Model • [Venn diagram over terms t1, t2, t3: documents D1–D11 are placed in the regions of the diagram; queries q1–q8 denote the eight Boolean conjunctions of t1, t2, t3 and their negations, each selecting one region]
Boolean Searching • Information need: “Measurement of the width of cracks in prestressed concrete beams” • Concepts: Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P) • Formal query: cracks AND beams AND width_measurement AND prestressed_concrete • Relaxed query (any three of the four concepts): (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
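As an illustration, Boolean retrieval reduces to set operations over the postings of an inverted index. This is a minimal sketch, not the slides' own code; the document IDs and postings below are invented:

```python
# Minimal sketch of Boolean retrieval over an inverted index.
# The postings (sets of document IDs) are invented for illustration.
index = {
    "cracks":               {1, 2, 4, 7},
    "beams":                {1, 3, 4, 7, 9},
    "width_measurement":    {2, 4, 5},
    "prestressed_concrete": {1, 4, 6, 7},
}

C, B, W, P = (index[t] for t in
              ("cracks", "beams", "width_measurement", "prestressed_concrete"))

# Formal query: all four concepts must co-occur (AND = set intersection).
formal = C & B & W & P

# Relaxed query: any three of the four concepts (OR = set union).
relaxed = (C & B & P) | (C & B & W) | (C & W & P) | (B & W & P)

print(sorted(formal))   # [4]
print(sorted(relaxed))  # [1, 4, 7] -- the relaxed query recalls more documents
```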
Boolean Problems • Disjunctive (OR) queries lead to information overload • Conjunctive (AND) queries lead to reduced, often empty, result sets • Conjunctive queries imply a reduction in recall
Boolean Model: Assessment • Advantages: • Complete expressiveness for any identifiable subset of the collection • Exact and simple to program • The whole panoply of Boolean algebra is available • Disadvantages: • Complex query syntax is often misunderstood (if understood at all) • Problems of null output and information overload • Output is not ordered in any useful fashion
Boolean Extensions • Fuzzy Logic • Adds weights to each term/concept • ta AND tb is interpreted as MIN(w(ta), w(tb)) • ta OR tb is interpreted as MAX(w(ta), w(tb)) • Proximity/adjacency operators • Interpreted as additional constraints on Boolean AND • Verity TOPIC system • Uses various weighted forms of Boolean logic and proximity information in calculating Robertson Selection Values (RSV)
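A minimal sketch of the fuzzy interpretation, assuming per-document term weights in [0, 1] (the weights and function names below are mine, for illustration only):

```python
# Fuzzy Boolean: AND -> minimum of term weights, OR -> maximum.
# Example weights w(term, doc), invented for illustration.
doc_weights = {"ta": 0.7, "tb": 0.4}

def fuzzy_and(w_a: float, w_b: float) -> float:
    return min(w_a, w_b)

def fuzzy_or(w_a: float, w_b: float) -> float:
    return max(w_a, w_b)

print(fuzzy_and(doc_weights["ta"], doc_weights["tb"]))  # 0.4
print(fuzzy_or(doc_weights["ta"], doc_weights["tb"]))   # 0.7
```

Unlike strict Boolean matching, a document now receives a graded score, so results can be ranked rather than merely accepted or rejected.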
Vector Space Model • Documents are represented as vectors in term space • Terms are usually stems • Documents represented by binary vectors of terms • Queries represented the same as documents • Query and Document weights are based on length and direction of their vector • A vector distance measure between the query and documents is used to rank retrieved documents
Documents in Vector Space • [figure: documents D1–D11 plotted as vectors along term axes t1, t2, t3]
Vector Space Documents and Queries • [figure: documents and a query plotted in the same term space; retrieval selects the documents whose vectors lie closest to the query vector]
Similarity Measures (for binary term sets Q and D) • Simple matching (coordination level match): |Q ∩ D| • Dice’s Coefficient: 2|Q ∩ D| / (|Q| + |D|) • Jaccard’s Coefficient: |Q ∩ D| / |Q ∪ D| • Cosine Coefficient: |Q ∩ D| / (|Q|½ · |D|½) • Overlap Coefficient: |Q ∩ D| / min(|Q|, |D|)
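Read as operations on the term sets Q and D, the coefficients above can be sketched directly. A hedged illustration, not library code; the example sets are invented:

```python
import math

# Set-based similarity coefficients between a query Q and a document D,
# each represented as a set of terms.
def simple_matching(Q, D): return len(Q & D)
def dice(Q, D):            return 2 * len(Q & D) / (len(Q) + len(D))
def jaccard(Q, D):         return len(Q & D) / len(Q | D)
def cosine(Q, D):          return len(Q & D) / math.sqrt(len(Q) * len(D))
def overlap(Q, D):         return len(Q & D) / min(len(Q), len(D))

Q = {"cracks", "beams", "width"}
D = {"cracks", "beams", "concrete", "prestressed"}
print(jaccard(Q, D))  # 2/5  = 0.4
print(cosine(Q, D))   # 2/sqrt(12) ~ 0.577
```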
Vector Space with Term Weights • Di = (wdi1, wdi2, …, wdit), Q = (wq1, wq2, …, wqt) • [figure: two-dimensional example with axes Term A and Term B; Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)]
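Working the slide's example through the cosine measure (a quick check, not from the original slide):

```python
import math

def cos(a, b):
    """Cosine of the angle between two 2-d vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
print(cos(Q, D1))  # ~0.733
print(cos(Q, D2))  # ~0.983 -> D2 is ranked above D1: its direction is closer to Q
```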
Problems with Vector Space • There is no real theoretical basis for the assumption of a term space • it is more for visualization than having any real basis • most similarity measures work about the same regardless of model • Terms are not really orthogonal dimensions • Terms are not independent of all other terms
Probabilistic Retrieval • Goes back to the 1960s (Maron and Kuhns) • Robertson’s “Probabilistic Ranking Principle” • Retrieved documents should be ranked in decreasing probability that they are relevant to the user’s query • How to estimate these probabilities? • Several methods (Model 1, Model 2, Model 3) with different emphasis on how estimates are done
Probabilistic Models: Notation • D = all present and future documents • Q = all present and future queries • (di, qj) = a document–query pair • x ⊆ D = class of similar documents • y ⊆ Q = class of similar queries • Relevance is a relation: R = {(di, qj) | di ∈ D, qj ∈ Q, di is judged relevant by the user submitting qj}
Probabilistic model • Given D, estimate P(R|D) and P(NR|D) • P(R|D) = P(D|R)·P(R)/P(D) (P(D), P(R) constant), so rank by P(D|R) • With D = {t1 = x1, t2 = x2, …}, xi ∈ {0,1}, and assuming term independence: P(D|R) = ∏i pi^xi (1 − pi)^(1−xi), where pi = P(xi = 1 | R)
Prob. model (cont’d) • For document ranking, take the log odds of relevance and drop document-independent terms: g(D) = log [P(D|R) / P(D|NR)] ∝ Σ{i : xi = 1} log [pi(1 − qi) / (qi(1 − pi))], where qi = P(xi = 1 | NR)
Prob. model (cont’d) • How to estimate pi and qi? • From a sample of N judged documents, R of them relevant, where term ti occurs in ni documents, ri of them relevant: pi = ri / R, qi = (ni − ri) / (N − R)
Prob. model (cont’d) • Smoothing (Robertson–Sparck Jones formula): pi = (ri + 0.5) / (R + 1), qi = (ni − ri + 0.5) / (N − R + 1) • When no sample is available: pi = 0.5, qi = (ni + 0.5) / (N + 0.5) ≈ ni/N • May be implemented as VSM
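A sketch of the smoothed estimates and the resulting term weight, assuming sample counts N, R, ni, ri as defined above (the counts in the example are invented):

```python
import math

def rsj_weight(N, R, n_i, r_i):
    """Robertson-Sparck Jones term weight with 0.5 smoothing.

    N   = number of judged sample documents, R of them relevant;
    n_i = documents containing term i, r_i of them relevant.
    """
    p_i = (r_i + 0.5) / (R + 1)            # P(term present | relevant)
    q_i = (n_i - r_i + 0.5) / (N - R + 1)  # P(term present | non-relevant)
    return math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))

# With no relevance sample, p_i = 0.5 and q_i ~ n_i / N,
# so the weight behaves like an idf factor.
print(rsj_weight(N=1000, R=10, n_i=50, r_i=8))  # ~4.3
```

Summing this weight over the query terms present in a document gives the document's ranking score g(D) from the previous slide.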
Probabilistic Models • Model 1 – Probabilistic Indexing, P(R | y, di) • Model 2 – Probabilistic Querying, P(R| qj, x) • Model 3 – Merged Model, P(R | qj, di) • Model 0 – P(R | y, x) • Probabilities are estimated based on prior usage or relevance estimation
Probabilistic Models • Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query • Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle) • Relies on accurate estimates of probabilities for accurate results
Vector and Probabilistic Models • Support “natural language” queries • Treat documents and queries the same • Support relevance feedback searching • Support ranked retrieval • Differ primarily in theoretical basis and in how the ranking is calculated • Vector assumes relevance • Probabilistic relies on relevance judgments or estimates
Ranking models in IR • Key idea: • We wish to return in order the documents most likely to be useful to the searcher • To do this, we want to know which documents best satisfy a query • An obvious idea is that if a document talks about a topic more, then it is a better match • A query should then just specify terms that are relevant to the information need, without requiring that all of them be present • A document is relevant if it contains many of the query terms
Binary term presence matrices • Record whether a document contains a word: document is a binary vector in {0,1}^v • What we have mainly assumed so far • Idea: query satisfaction = overlap measure: |Q ∩ D|
Overlap matching • What are the problems with the overlap measure? • It doesn’t consider: • Term frequency in document • Term scarcity in collection (document mention frequency) • Length of documents • (AND queries: score not normalized)
Overlap matching • One can normalize in various ways: • Jaccard coefficient: |Q ∩ D| / |Q ∪ D| • Cosine measure: |Q ∩ D| / √(|Q| · |D|) • What documents would score best using Jaccard against a typical query? • Does the cosine measure fix this problem?
Count term-document matrices • We haven’t considered the frequency of a word • Count of a word in a document: • Bag of words model • Document is a vector in ℕ^v • Normalization: Calpurnia vs. Calphurnia (spelling variants should be conflated to one term)
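A bag-of-words vector is just a mapping from terms to counts; a minimal sketch (tokenization here is a crude regex, an assumption of this example, not the course's tokenizer):

```python
import re
from collections import Counter

# Bag of words: a document becomes a vector of term counts
# (word order is discarded, only frequencies are kept).
def bag_of_words(text: str) -> Counter:
    return Counter(re.findall(r"\w+", text.lower()))

print(bag_of_words("Caesar praised Calpurnia and Calpurnia smiled"))
# Counter({'calpurnia': 2, 'caesar': 1, 'praised': 1, 'and': 1, 'smiled': 1})
```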
Weighting term frequency: tf • What is the relative importance of • 0 vs. 1 occurrence of a term in a doc • 1 vs. 2 occurrences • 2 vs. 3 occurrences … • Unclear: it seems that more is better, but a lot isn’t necessarily better than a few • Can just use the raw score • Another option commonly used in practice: wf(t,d) = 1 + log tf(t,d) if tf(t,d) > 0, and 0 otherwise
Dot product matching • Match is dot product of query and document • [Note: 0 if orthogonal (no words in common)] • Rank by match • It still doesn’t consider: • Term scarcity in collection (document mention frequency) • Length of documents and queries • Not normalized
Weighting should depend on the term overall • Which of these tells you more about a doc? • 10 occurrences of hernia? • 10 occurrences of the? • Suggest looking at collection frequency (cf) • But document frequency (df) may be better:
Word        cf      df
try         10422   8760
insurance   10440   3997
• Document frequency weighting is only possible in a known (static) collection
tf x idf term weights • tf x idf measure combines: • term frequency (tf) • measure of term density in a doc • inverse document frequency (idf) • measure of informativeness of a term: its rarity across the whole corpus • could just be the raw count of the number of documents the term occurs in (idfi = 1/dfi) • but by far the most commonly used version is: idfi = log(n / dfi) • See Kishore Papineni, NAACL 2, 2002 for theoretical justification
Summary: tf x idf • Assign a tf.idf weight to each term i in each document d: wi,d = tfi,d × log(n / dfi) • tfi,d = frequency of term i in document d • n = total number of documents • dfi = number of documents that contain term i • Increases with the number of occurrences within a doc • Increases with the rarity of the term across the whole corpus • What is the weight of a term that occurs in all of the docs? (zero: log(n/n) = 0)
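Putting tf and idf together over a toy corpus; a hedged sketch under the definitions above (real systems add length normalization, smoothing, and often the log-scaled tf variant):

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "cracks in prestressed concrete beams",
    "measurement of beams",
    "concrete beams and cracks",
]
tokenized = [d.split() for d in docs]
n = len(tokenized)

# df_i = number of documents that contain term i
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(doc_tokens):
    tf = Counter(doc_tokens)
    # w_{i,d} = tf_{i,d} * log(n / df_i)
    return {t: tf[t] * math.log(n / df[t]) for t in tf}

# 'beams' occurs in all 3 docs, so its weight is log(3/3) = 0,
# matching the quiz question on the slide.
print(tf_idf(tokenized[0]))
```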
Real-valued term-document matrices • Function (scaling) of the count of a word in a document: • Bag of words model • Each document is a vector in ℝ^v • [example matrix omitted: log-scaled tf.idf weights]
Documents as vectors • Each doc j can now be viewed as a vector of tfidf values, one component for each term • So we have a vector space • terms are axes • docs live in this space • even with stemming, may have 20,000+ dimensions • (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live – transposable data)
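With documents as tf.idf vectors, ranking a query reduces to cosine similarity in this space. A minimal sketch using sparse {term: weight} dictionaries (the helper names and example weights are mine, not the slides'):

```python
import math

# Cosine similarity between sparse vectors stored as {term: weight} dicts.
def cosine(u: dict, v: dict) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Rank documents (tf.idf dicts) against a query vector built the same way.
def rank(query_vec: dict, doc_vecs: list):
    return sorted(enumerate(doc_vecs),
                  key=lambda pair: cosine(query_vec, pair[1]),
                  reverse=True)

q  = {"cracks": 1.0, "beams": 0.5}
d1 = {"cracks": 0.4, "concrete": 0.9}
d2 = {"beams": 0.7, "measurement": 0.2}
print(rank(q, [d1, d2]))  # best-matching document first
```

Using the angle between vectors rather than the raw dot product means long documents are not automatically favoured over short ones.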