Explore information retrieval models, Boolean and vector-space retrieval, TF-IDF weighting, search engines architecture, text categorization, and question answering techniques. Learn about indexing, compression, query languages, and more.
Search Engines and Question Answering Giuseppe Attardi Università di Pisa (some slides borrowed from C. Manning, H. Schütze)
Overview • Information Retrieval Models • Boolean and vector-space retrieval models; ranked retrieval; text-similarity metrics; TF-IDF (term frequency/inverse document frequency) weighting; cosine similarity; performance metrics: precision, recall, F-measure. • Indexing and Search • Indexing and inverted files; Compression; Postings Lists; Query languages • Web Search • Search engines; Architecture; Crawling: parallel/distributed, focused; Link analysis (Google PageRank); Scaling • Text Categorization and Clustering • Question Answering • Information extraction; Named Entity Recognition; Natural Language Processing; Part of Speech tagging; Question analysis and semantic matching
References • Modern Information Retrieval, R. Baeza-Yates, B. Ribeiro-Neto, Addison Wesley, 1999. • Managing Gigabytes, 2nd Edition, I.H. Witten, A. Moffat, T.C. Bell, Morgan Kaufmann, 1999. • Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze, MIT Press, 1999.
Adaptive Computing • Desktop Metaphor: highly successful in making computers popular • See Alan Kay 1975 presentation in Pisa • Limitations: • Point and click involves very elementary actions • People are required to perform more and more clerical tasks • We have become: bank clerks, typographers, illustrators, librarians
Illustrative problem • Add a table to a document with results from latest benchmarks and send it to my colleague Antonio: • 7-8 point&click just to get to the document • 7-8 point&click to get to the data • Lengthy fiddling with table layout • 3-4 point&click to retrieve mail address • Etc.
Success story • Do I care where a document is stored? • Do I need a secretary to file my documents? • Search Engines prove that you don’t
Overcoming Desktop Metaphor • Could think of just one possibility: • Raise the level of interaction with computers • How? • Could think of just one possibility: • Use natural language
Adaptiveness • My language is different from yours • Should be learned from user interaction • See: Steels’ talking heads language games • Through implicit interactions • Many potential sources (e.g. file a message in a folder classification)
Research Goal • Question Answering • Techniques: • Traditional IR tools • NLP tools (POS tagging, parser) • Complement Knowledge Bases with massive data sets of usage (Web) • Knowledge extraction tools (NE tagging) • Continuous learning
IXE Framework • [Figure: architecture diagram. Components shown: Passage Index, NE Tagger, Python, Perl, Java, EventStream, ContextStream, GIS, POS Tagger, Clustering, Sent. Splitter, Web Service Wrappers, MaxEntropy, Files, Mem Mapping, Threads, Synchronization, Unicode, RegExp, Tokenizer, Suffix Trees, Readers, Indexer, Search, Crawler, Text, Object Store, OS Abstraction]
Information Retrieval Models • A model is an embodiment of the theory in which we define a set of objects about which assertions can be made and restrict the ways in which classes of objects can interact • A retrieval model specifies the representations used for documents and information needs, and how they are compared (Turtle & Croft, 1992)
Information Retrieval Model • Provides an abstract description of the representation used for documents, the representation of queries, the indexing process, the matching process between a query and the documents and the ranking criteria
Formal Characterization • An Information Retrieval model is a quadruple $\langle D, Q, F, R \rangle$ where • D is a set of representations for the documents in the collection • Q is a set of representations for the user information needs (queries) • F is a framework for modelling document representations, queries, and their relationships • $R: Q \times D \to \mathbb{R}$ is a ranking function which associates a real number with a query $q_i \in Q$ and document representation $d_j \in D$ (Baeza-Yates & Ribeiro-Neto, 1999)
Information Retrieval Models • Three ‘classic’ models: • Boolean Model • Vector Space Model • Probabilistic Model • Additional models • Extended Boolean • Fuzzy matching • Cluster-based retrieval • Language models
[Figure: retrieval process diagram. Boxes shown: Information need, text input, Parse, Query, Collections, Pre-process, Index, Rank or Match]
Boolean Model • [Figure: Venn diagram of three terms t1, t2, t3 with documents D1 to D11 placed in its regions; the regions are labelled q1 to q8, each a Boolean combination of t1, t2 and t3]
Boolean Searching • Information need: “Measurement of the width of cracks in prestressed concrete beams” • Concepts: Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P) • Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete • Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
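The strict and relaxed queries can be sketched with Python sets. This is only an illustration: the four-document index is made-up data, and `index`, `strict` and `relaxed` are hypothetical names that do not come from the slides.

```python
from itertools import combinations

# Hypothetical toy inverted index: concept -> set of document ids (made-up data).
index = {
    "cracks": {1, 2, 3},
    "beams": {1, 2, 4},
    "width_measurement": {1, 3, 4},
    "prestressed_concrete": {1, 2, 3, 4},
}

def strict(index, terms):
    """AND of all terms: intersect their postings sets."""
    return set.intersection(*(index[t] for t in terms))

def relaxed(index, terms, k):
    """OR over every AND-combination of k of the given terms."""
    result = set()
    for combo in combinations(terms, k):
        result |= strict(index, combo)
    return result

terms = list(index)
print(sorted(strict(index, terms)))      # documents matching all four concepts
print(sorted(relaxed(index, terms, 3)))  # documents matching any three concepts
```

On this toy index the strict query returns a single document while the relaxed query returns all four, which is the recall gain the relaxation is after.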
Boolean Problems • Disjunctive (OR) queries lead to information overload • Conjunctive (AND) queries lead to reduced, and commonly zero, results • Conjunctive queries imply reduction in Recall
Boolean Model: Assessment • Advantages • Complete expressiveness for any identifiable subset of collection • Exact and simple to program • The whole panoply of Boolean Algebra available • Disadvantages • Complex query syntax is often misunderstood (if understood at all) • Problems of Null output and Information Overload • Output is not ordered in any useful fashion
Boolean Extensions • Fuzzy Logic • Adds weights to each term/concept • ta AND tb is interpreted as MIN(w(ta), w(tb)) • ta OR tb is interpreted as MAX(w(ta), w(tb)) • Proximity/Adjacency operators • Interpreted as additional constraints on Boolean AND • Verity TOPIC system • Uses various weighted forms of Boolean logic and proximity information in calculating Robertson Selection Values (RSV)
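A minimal sketch of the fuzzy-logic interpretation, assuming each document carries a weight in [0, 1] per term. The weights and the names `weights`, `fuzzy_and` and `fuzzy_or` are invented for illustration.

```python
# Hypothetical per-document term weights in [0, 1] (made-up values).
weights = {
    "D1": {"ta": 0.9, "tb": 0.2},
    "D2": {"ta": 0.5, "tb": 0.6},
}

def fuzzy_and(w, a, b):
    """ta AND tb -> MIN of the two term weights (missing term counts as 0)."""
    return min(w.get(a, 0.0), w.get(b, 0.0))

def fuzzy_or(w, a, b):
    """ta OR tb -> MAX of the two term weights (missing term counts as 0)."""
    return max(w.get(a, 0.0), w.get(b, 0.0))

for doc, w in weights.items():
    print(doc, fuzzy_and(w, "ta", "tb"), fuzzy_or(w, "ta", "tb"))
```

Unlike the plain Boolean model, the scores give an ordering: with these weights D2 ranks first for the AND query and D1 ranks first for the OR query.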
Vector Space Model • Documents are represented as vectors in term space • Terms are usually stems • Documents represented by binary vectors of terms • Queries represented the same as documents • Query and Document weights are based on length and direction of their vector • A vector distance measure between the query and documents is used to rank retrieved documents
Documents in Vector Space • [Figure: documents D1 to D11 plotted as points in a three-dimensional term space with axes t1, t2, t3]
Vector Space Documents and Queries • [Figure: documents D1 to D11 in the term space with axes t1, t2, t3]
Similarity Measures • Simple matching (coordination level match) • Dice’s Coefficient • Jaccard’s Coefficient • Cosine Coefficient • Overlap Coefficient
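For binary (set-of-terms) representations, the five measures are commonly defined as below. This is a sketch using the usual textbook set forms, which may differ in notation from the formulas on the original slide. The query and document are made-up examples, and both sets are assumed to be non-empty.

```python
import math

def simple_matching(q, d):
    """Coordination level match: number of shared terms."""
    return len(q & d)

def dice(q, d):
    return 2 * len(q & d) / (len(q) + len(d))

def jaccard(q, d):
    return len(q & d) / len(q | d)

def cosine(q, d):
    return len(q & d) / math.sqrt(len(q) * len(d))

def overlap(q, d):
    return len(q & d) / min(len(q), len(d))

# Made-up query and document, each viewed as a set of terms.
q = {"cracks", "beams"}
d = {"cracks", "beams", "concrete", "width"}

for measure in (simple_matching, dice, jaccard, cosine, overlap):
    print(measure.__name__, round(measure(q, d), 3))
```

All five share the same numerator (the number of common terms) and differ only in how they normalize for the sizes of the two sets.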
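Vector Space with Term Weights • $D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})$ • $Q = (w_{q_{i1}}, w_{q_{i2}}, \ldots, w_{q_{it}})$ • Example in two dimensions: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7) • [Figure: Q, D1 and D2 plotted as vectors; horizontal axis Term A, vertical axis Term B, both from 0 to 1.0]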
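The example vectors can be compared numerically. The sketch below assumes the cosine coefficient as the similarity measure (the slide itself only gives the vectors); the function name `cosine` is illustrative.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

print(round(cosine(Q, D1), 2))  # 0.73
print(round(cosine(Q, D2), 2))  # 0.98
```

Under this measure D2 is ranked above D1, since it points in nearly the same direction as Q.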
Problems with Vector Space • There is no real theoretical basis for the assumption of a term space • it is more for visualization than having any real basis • most similarity measures work about the same regardless of model • Terms are not really orthogonal dimensions • Terms are not independent of all other terms
Probabilistic Retrieval • Goes back to the 1960s (Maron and Kuhns) • Robertson’s “Probability Ranking Principle” • Retrieved documents should be ranked in decreasing probability that they are relevant to the user’s query • How to estimate these probabilities? • Several methods (Model 1, Model 2, Model 3) with different emphasis on how estimates are done
Probabilistic Models: Notation • D = all present and future documents • Q = all present and future queries • $(d_i, q_j)$ = a document query pair • $x \subseteq D$ = class of similar documents • $y \subseteq Q$ = class of similar queries • Relevance is a relation: $R = \{(d_i, q_j) \mid d_i \in D,\ q_j \in Q,\ d_i \text{ is judged relevant by the user submitting } q_j\}$
Probabilistic model • Given D, estimate P(R|D) and P(NR|D) • $P(R|D) = P(D|R) \, P(R) / P(D) \propto P(D|R)$ (P(D), P(R) constant) • $D = \{t_1 = x_1, t_2 = x_2, \ldots\}$
Prob. model (cont’d) For document ranking
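A worked form of the ranking function, offered as a sketch under the usual binary independence assumption rather than as the exact formula from the slide. The symbols are assumptions: $x_i \in \{0, 1\}$ marks the presence of term $t_i$ in D, $p_i = P(t_i = 1 \mid R)$, $q_i = P(t_i = 1 \mid NR)$, and terms are taken to be independent given R or NR.

```latex
\begin{align*}
P(D \mid R)  &= \prod_i p_i^{x_i} (1 - p_i)^{1 - x_i}, \qquad
P(D \mid NR)  = \prod_i q_i^{x_i} (1 - q_i)^{1 - x_i} \\
\log \frac{P(D \mid R)}{P(D \mid NR)}
  &= \sum_i x_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
   + \sum_i \log \frac{1 - p_i}{1 - q_i}
\end{align*}
```

The second sum does not depend on the document, so documents can be ranked by the first sum alone: each term present in D contributes the weight $\log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}$.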
Prob. model (cont’d) • How to estimate pi and qi? • A set of N relevant and irrelevant samples:
Prob. model (cont’d) • Smoothing (Robertson-Sparck-Jones formula) • When no sample is available: $p_i = 0.5$, $q_i = (n_i + 0.5)/(N + 0.5) \approx n_i/N$ • May be implemented as VSM
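A small sketch of the no-sample estimates. It assumes the standard log-odds term weight log[p(1-q) / (q(1-p))] of the binary independence model and takes n_i to be the number of documents containing term i out of N. The function names and the value of N are illustrative.

```python
import math

def term_weight(p, q):
    """Log-odds weight of a term: log[p(1-q) / (q(1-p))] (assumed standard form)."""
    return math.log(p * (1 - q) / (q * (1 - p)))

def no_sample_estimates(n_i, N):
    """Estimates used when no relevance sample is available."""
    p = 0.5
    q = (n_i + 0.5) / (N + 0.5)
    return p, q

N = 1000  # hypothetical collection size
for n_i in (10, 100, 900):
    p, q = no_sample_estimates(n_i, N)
    print(n_i, round(term_weight(p, q), 3))
```

With p fixed at 0.5 the weight reduces to log((N - n_i) / (n_i + 0.5)), which falls as the term becomes more common. It behaves much like an idf weight, which is one way to read the remark that the model may be implemented as a vector space model.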
Probabilistic Models • Model 1 – Probabilistic Indexing, P(R | y, di) • Model 2 – Probabilistic Querying, P(R| qj, x) • Model 3 – Merged Model, P(R | qj, di) • Model 0 – P(R | y, x) • Probabilities are estimated based on prior usage or relevance estimation
Probabilistic Models • Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query • Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle) • Relies on accurate estimates of probabilities for accurate results
Vector and Probabilistic Models • Support “natural language” queries • Treat documents and queries the same • Support relevance feedback searching • Support ranked retrieval • Differ primarily in theoretical basis and in how the ranking is calculated • Vector assumes relevance • Probabilistic relies on relevance judgments or estimates
Ranking models in IR • Key idea: • We wish to return in order the documents most likely to be useful to the searcher • To do this, we want to know which documents best satisfy a query • An obvious idea is that if a document talks about a topic more, then it is a better match • A query should then just specify terms that are relevant to the information need, without requiring that all of them must be present • A document is relevant if it has a lot of the terms
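Binary term presence matrices • Record whether a document contains a word: document is binary vector in $\{0,1\}^v$ • What we have mainly assumed so far • Idea: Query satisfaction = overlap measure: $|Q \cap D|$, the number of terms shared by query Q and document D (each viewed as a set of terms)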
Overlap matching • What are the problems with the overlap measure? • It doesn’t consider: • Term frequency in document • Term scarcity in collection (document mention frequency) • Length of documents • (AND queries: score not normalized)
Overlap matching • One can normalize in various ways (Q, D = term sets of query and document): • Jaccard coefficient: $|Q \cap D| / |Q \cup D|$ • Cosine measure: $|Q \cap D| / \sqrt{|Q| \times |D|}$ • What documents would score best using Jaccard against a typical query? • Does the cosine measure fix this problem?
Count term-document matrices • We haven’t considered frequency of a word • Count of a word in a document: • Bag of words model • Document is a vector in $\mathbb{N}^v$ • Normalization: Calpurnia vs. Calphurnia
Weighting term frequency: tf • What is the relative importance of • 0 vs. 1 occurrence of a term in a doc • 1 vs. 2 occurrences • 2 vs. 3 occurrences … • Unclear: it seems that more is better, but a lot isn’t necessarily better than a few • Can just use raw score • Another option commonly used in practice: $wf_{t,d} = 1 + \log tf_{t,d}$ if $tf_{t,d} > 0$, and 0 otherwise
Dot product matching • Match is dot product of query and document • [Note: 0 if orthogonal (no words in common)] • Rank by match • It still doesn’t consider: • Term scarcity in collection (document mention frequency) • Length of documents and queries • Not normalized
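Dot product matching can be sketched as follows. The example assumes a log-dampened term frequency, 1 + log(tf) for tf > 0, as the weight on both the query and the document side. The documents, the query and the names `wf` and `dot_match` are made up for illustration.

```python
import math
from collections import Counter

def wf(tf):
    """Dampened term frequency: 1 + log(tf) for tf > 0, else 0 (assumed weighting)."""
    return 1 + math.log(tf) if tf > 0 else 0.0

def dot_match(query, doc):
    """Dot product of the query and document weight vectors."""
    q, d = Counter(query.split()), Counter(doc.split())
    return sum(wf(q[t]) * wf(d[t]) for t in q if t in d)

# Made-up documents and query.
docs = {
    "d1": "the hernia repair the hernia",
    "d2": "the insurance claim",
    "d3": "weather report",
}
query = "hernia insurance"

ranked = sorted(docs, key=lambda name: dot_match(query, docs[name]), reverse=True)
for name in ranked:
    print(name, round(dot_match(query, docs[name]), 3))
```

Document d3 shares no word with the query and scores 0, the orthogonal case. The score is not normalized, so longer documents with more repetitions tend to score higher.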
Weighting should depend on the term overall • Which of these tells you more about a doc? • 10 occurrences of hernia? • 10 occurrences of the? • Suggest looking at collection frequency (cf) • But document frequency (df) may be better: • try: cf = 10422, df = 8760 • insurance: cf = 10440, df = 3997 • Document frequency weighting is only possible in a known (static) collection
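The difference between the two statistics is easy to see on a toy corpus. The three documents below are made up, and the variable names `cf` and `df` simply mirror the abbreviations used above.

```python
from collections import Counter

# Made-up corpus of three tiny documents.
docs = [
    "the hernia and the surgery",
    "the the the weather",
    "hernia hernia hernia",
]

cf = Counter()  # collection frequency: total occurrences across the corpus
df = Counter()  # document frequency: number of documents containing the term
for text in docs:
    tokens = text.split()
    cf.update(tokens)
    df.update(set(tokens))  # count each term at most once per document

for term in ("the", "hernia", "surgery"):
    print(term, "cf =", cf[term], "df =", df[term])
```

Collection frequency counts every occurrence, so a term repeated many times in one document inflates it, while document frequency counts each document only once.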
tf x idf term weights • tf x idf measure combines: • term frequency (tf) • measure of term density in a doc • inverse document frequency (idf) • measure of informativeness of term: its rarity across the whole corpus • could just be the inverse of the raw count of documents the term occurs in ($idf_i = 1/df_i$) • but by far the most commonly used version is: $idf_i = \log(n / df_i)$, where n is the total number of documents • See Kishore Papineni, NAACL 2, 2002 for theoretical justification
Summary: tf x idf • Assign a tf.idf weight to each term i in each document d: $w_{i,d} = tf_{i,d} \times \log(n / df_i)$ • $tf_{i,d}$ = frequency of term i in document d • n = total number of documents • $df_i$ = number of documents that contain term i • Increases with the number of occurrences within a doc • Increases with the rarity of the term across the whole corpus • What is the wt of a term that occurs in all of the docs?
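A minimal sketch of the weighting, assuming the form w = tf × log(n / df) with a natural logarithm (the base is an assumption and only rescales the weights). The corpus and the function name `tf_idf` are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with w = tf * log(n / df)."""
    n = len(docs)
    counts = [Counter(text.split()) for text in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document counts once per term
    return [{t: tf * math.log(n / df[t]) for t, tf in c.items()} for c in counts]

# Made-up corpus.
docs = ["the hernia the surgery", "the insurance claim", "the hernia insurance"]
weights = tf_idf(docs)

for t, w in sorted(weights[0].items()):
    print(t, round(w, 3))
```

In this corpus "the" occurs in every document, so log(n / df) = log(1) = 0 and its weight is 0 however often it is repeated, while the rarer "surgery" outweighs "hernia".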
Real-valued term-document matrices • Function (scaling) of count of a word in a document: • Bag of words model • Each is a vector in $\mathbb{R}^v$ • Here log scaled tf.idf
Documents as vectors • Each doc j can now be viewed as a vector of tfidf values, one component for each term • So we have a vector space • terms are axes • docs live in this space • even with stemming, may have 20,000+ dimensions • (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live – transposable data)
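The two views of the matrix can be illustrated on a tiny made-up corpus. For brevity the sketch uses raw counts rather than tf.idf values; the variable names are hypothetical.

```python
# Made-up corpus; the vocabulary gives the axes of the vector space.
docs = ["hernia surgery", "insurance claim", "hernia insurance"]
vocab = sorted({t for text in docs for t in text.split()})

# Rows are documents, columns are terms (raw counts stand in for tf.idf here).
doc_vectors = [[text.split().count(t) for t in vocab] for text in docs]

# Transposing gives one vector per term, with one component per document.
term_vectors = [list(col) for col in zip(*doc_vectors)]

print(vocab)
for row in doc_vectors:
    print(row)
for term, row in zip(vocab, term_vectors):
    print(term, row)
```

The same matrix therefore supports both document similarity (comparing rows) and term similarity (comparing columns).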