250 likes | 262 Views
Learn about different information retrieval models, with a focus on the Boolean retrieval model. Discover how to optimize queries and evaluate search results.
E N D
INFORMATION RETRIEVAL TECHNIQUES BYDR. ADNAN ABID Lecture # 2 Introduction Information Retrieval Models Boolean Retrieval Model
ACKNOWLEDGEMENTS The presentation of this lecture has been taken from the following sources • “Introduction to information retrieval” by PrabhakarRaghavan, Christopher D. Manning, and Hinrich Schütze • “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell • “Modern information retrieval” by Baeza-Yates Ricardo, • “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline • What is Information Retrieval ? ? ? • IR Models • The Boolean Model • Considerations on the Boolean Model
What is Information Retrieval ? ? ? • Information retrieval (IR) deals with the representation, storage, organization of, and access to information items. Baeza-Yates, Ribeiro-Nieto, 1999 • Information retrieval (IR) is devoted to finding relevant documents, not finding simple match to patterns. Grossman - Frieder, 2004 • Information retrieval (IR) is finding material (usually documents) of an unstructured nature(usually text) that satisfy an information need from within large collections (usually stored on computers). Manning et al., 2007
Document corpus Query String 1. Doc1 2. Doc2 3. Doc3 . . Ranked Documents Information Retrieval IR System
IR Models • Modeling in IR is a complex process aimed at producing a ranking function • Ranking function: a function that assigns scores to documents with regard to a given query • This process consists of two main tasks: • The conception of a logical framework for representing documents and queries • The definition of a ranking function that allows quantifying the similarities among documents and queries
IR Models • An IR model is a quadruple [D, Q, F, R(qi, dj)] where • 1. D is a set of logical views for the documents in the collection • 2. Q is a set of logical views for the user queries • 3. F is a framework for modeling documents and queries • 4. R(qi, dj) is a ranking function
Retrieval: Ad Hoc x Filtering • Ad Hoc Retrieval:
Retrieval: Ad Hoc x Filtering • Filtering
A Taxonomy of IR Models Set Theoretic Fuzzy Extended Boolean Classic Models Algebraic Boolean Vector Probabilistic Generalized Vector Lat.Semantic Index Neural Networks Ad-Hoc Retrieval Filtering Probabilistic Structured Models Inference Network Belief Network Non-Overlapping Lists Proximal Nodes
The Boolean Model • The Boolean retrieval model is a simple retrieval model based on set theory and Boolean algebra • Index term’s Significance represented by binary weights • Wkj ∈{0,1} is associated to the tuple (tk, dj) • Rdj set of index terms for a document • Rti set of document for an index term • Queries are defined as Boolean expressions over index terms (using Boolean operators AND, OR and NOT) • Brutus AND Caesar but NOT Calpurnia?
The Boolean Model (Cont.….) tb ta (1,1,0) (1,0,0) • Relevance is modeled as a binary property of the documents (SC = 0 or SC=1) • Closed world assumption: the absence of a term t in the representation of a document d is equivalent to the presence of (not t) in the same representation (1,1,1) tc
The Boolean Model d1 = [1, 1,1]T d2 = [1, 0,0]T d3 = [0, 1,0]T Rt1 = {d1, d2} Rt2 = {d1, d3} Rt3 = {d1} q = t1 q = t1 AND t2 q = t1 OR t2 q = NOT t1 Rt1 = {d1, d2} Rt1 ∩ Rt2 = d1 Rt1 ∪ Rt2 = {d1, d2,d3} ⌐Rt1 = d3
The Boolean Model • Consider processing the query: • q = ta tb • Locate ta in the Dictionary; • Retrieve its postings. • Locate tb in the Dictionary; • Retrieve its postings. • “Merge” the two postings: ta 2 4 8 16 32 64 128 tb 1 34 2 3 5 8 13 21
Posting list intersection • The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms. • If the list lengths are x and y, the merge takes O(x+y) operations. • Crucial: postings sorted by docID.
Evaluation of Boolean queries • keyword: retrieval of all the documents containing a keyword in the inverted list and build of a document list • OR: creation of a list associated with the node which is the union of the lists associated with the left and right sub-trees • AND: creation of a list associated with the node which is the intersection of the lists associated with the left and right sub-trees • BUT = AND NOT: creation of a list associated with the difference between the lists related with the left and right sub-trees
Query optimization • Is the process of selecting how to organize the work of answering a query • GOAL: minimize the total amount of work performed by the system.
Order of evaluation t1 = {d1, d3, d5, d7} t2 = {d2, d3, d4, d5, d6} q = t1 AND t2 OR t3 t3 = {d4, d6, d8} • From the left:{d3, d5} {d4, d6, d8}= {d3, d5,d4, d6, d8} • From the right: {d2,d3,d4,d5,d6,d8} {d1, d3, d5, d7}={d3,d5} • Standard evaluation priority: and, not, or A and B or C and B (A and B) or (C and B) AND OR t1 t2 t3
Considerations on the Boolean Model • Strategy is based on a binary decision criterion (i.e., a document is predicted to be either relevant or non-relevant) without any notion of a grading scale • The Boolean model is in reality much more a data (instead of information) retrieval model • Pros: • Boolean expressions have precise semantics • Structured queries • For expert users, intuitivity • Simple and neat formalism great attention in past years and was adopted by many of the early commercial bibliographic systems
Drawbacks of the Boolean Model • Retrieval based on binary decision criteria with no notion of partial matching • No ranking of the documents is provided (absence of a grading scale) • Information need has to be translated into a Boolean expression, which most users find awkward • The Boolean queries formulated by the users are most often too simplistic • The model frequently returns either too few or too many documents in response to a user query
Resources • Modern Information Retrieval • Chapter 1 of IIR • Resources at http://ifnlp.org/ir • Boolean Retrieval