This lecture covers the fundamentals of Boolean and Extended Boolean principles in Information Retrieval, including IR components, Inverted Files, Boolean Model, and more. Understand the structure of an IR system, how Inverted Files are created, and their importance in facilitating fast search for terms. Study Boolean operations on sets, query languages, and Boolean queries with connectors like AND, OR, and NOT. Explore different query expressions and rules for Boolean query formation.
Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00 pm Spring 2007 http://courses.ischool.berkeley.edu/i240/s07 Lecture 5: Boolean and Extended Boolean Principles of Information Retrieval
Today • Review • IR Components • Inverted Files • IR Models • The Boolean Model • Fuzzy sets, Rubric, P-norm, etc.
Structure of an IR System • Storage Line: Documents & data → Indexing (Descriptive and Subject) → Storage of Documents → Store2: Document representations • Search Line: Interest profiles & Queries → Formulating query in terms of descriptors → Storage of profiles → Store1: Profiles/Search requests • Comparison/Matching of Store1 against Store2 → Potentially Relevant Documents • Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language) • Adapted from Soergel, p. 19
Boolean Implementation: Inverted Files • We will look at “Vector files” in detail later. But conceptually, an Inverted File is a vector file “inverted” so that rows become columns and columns become rows
How Are Inverted Files Created • Documents are parsed to extract words (or stems) and these are saved with the Document ID. • Doc 1: Now is the time for all good men to come to the aid of their country • Doc 2: It was a dark and stormy night in the country manor. The time was past midnight • [Diagram: text processing steps applied to both documents]
How Inverted Files are Created • After all documents have been parsed, the inverted file is sorted
How Inverted Files are Created • Multiple term entries for a single document are merged and frequency information added
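The three steps above (parse, sort, merge with frequencies) can be sketched in a few lines of Python. This is a minimal illustration rather than any particular system's implementation: the tokenizer (lowercased runs of letters, no stemming) and the function names `parse` and `build_inverted_file` are assumptions made for the example.

```python
import re
from collections import Counter

# The two example documents from the slides.
docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

def parse(doc_id, text):
    """Step 1: emit one (term, doc_id) pair per word occurrence."""
    return [(word, doc_id) for word in re.findall(r"[a-z]+", text.lower())]

def build_inverted_file(docs):
    pairs = []
    for doc_id, text in docs.items():
        pairs.extend(parse(doc_id, text))
    pairs.sort()             # Step 2: sort by term, then by document ID
    counts = Counter(pairs)  # Step 3: merge repeated pairs, keeping a frequency
    inverted = {}
    for (term, doc_id), freq in sorted(counts.items()):
        inverted.setdefault(term, []).append((doc_id, freq))
    return inverted

inverted = build_inverted_file(docs)
print(inverted["country"])  # [(1, 1), (2, 1)]
print(inverted["the"])      # [(1, 2), (2, 2)]
```

Each entry maps a term to a list of (document ID, within-document frequency) pairs, sorted by document ID.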
Inverted Files • The file is commonly split into a Dictionary and a Postings file
Inverted files • Permit fast search for individual terms • The search result for each term is a list of document IDs (and optionally, frequency and/or positional information) • These lists can be used to solve Boolean queries: • country: d1, d2 • manor: d2 • country and manor: d2
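As a rough sketch of the dictionary/postings split and of the lookup described above, the example below keeps all postings in one flat list and lets the dictionary store an offset and a count per term. The layout, the sample data, and the helper name `lookup` are assumptions for illustration only.

```python
# Postings file: one flat list of document IDs, grouped by term.
postings_file = [1, 2, 2, 1, 2]

# Dictionary: term -> (offset into the postings file, number of postings).
dictionary = {
    "country": (0, 2),
    "manor": (2, 1),
    "time": (3, 2),
}

def lookup(term):
    """Return the list of document IDs for a term (empty if unknown)."""
    if term not in dictionary:
        return []
    offset, count = dictionary[term]
    return postings_file[offset:offset + count]

country = lookup("country")
manor = lookup("manor")
# "country AND manor": keep the IDs that appear in both lists.
both = [doc_id for doc_id in country if doc_id in set(manor)]

print(country)  # [1, 2]
print(manor)    # [2]
print(both)     # [2]
```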
Inverted Files • Lots of alternative implementations • E.g.: Cheshire builds within-document frequency using a hash table during document parsing. Then Document IDs and frequency info are stored in a BerkeleyDB B-tree index keyed by the term.
Btree (conceptual) • Root node keys: F | P | Z • Second-level node keys: B | D | F, then H | L | P, then R | S | Z • Leaf entries: Aces, Boilers, Cars, Devils, Flyers, Hawkeyes, Hoosiers, Minors, Panthers, Seminoles
Btree with Postings • Root node keys: F | P | Z • Second-level node keys: B | D | F, then H | L | P, then R | S | Z • Leaf entries: Aces, Boilers, Cars, Devils, Flyers, Hawkeyes, Hoosiers, Minors, Panthers, Seminoles • Each leaf entry points to a postings list of document IDs (e.g., 2,4,8,12 or 8,120 or 5,7,200)
Today • Review • IR Components • Inverted Files • IR Models • The Boolean Model • Fuzzy sets, Rubric, P-norm, etc.
IR Models • Set Theoretic Models • Boolean • Fuzzy • Extended Boolean • Vector Models (Algebraic) • Probabilistic Models (probabilistic) • Others (e.g., neural networks, etc.)
Boolean Model for IR • Based on Boolean Logic (Algebra of Sets). • Fundamental principles established by George Boole in the 1850’s • Deals with set membership and operations on sets • Set membership in IR systems is usually based on whether (or not) a document contains a keyword (term)
Boolean Operations on Sets • Intersection – Boolean ‘AND’ • Union – Boolean ‘OR’ • Negation – Boolean ‘NOT’ • Usually means “AND NOT” in IR • Exclusive OR – ‘XOR’ – seldom used, • Instead
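These operations map directly onto Python's built-in set type. The sketch below uses two made-up sets of document IDs, `A` and `B`; the last line checks the standard identity that XOR can be written with the other operators as (A OR B) AND NOT (A AND B).

```python
# Hypothetical result sets: document IDs containing term A and term B.
A = {1, 2, 3, 5}
B = {2, 3, 4}

print(sorted(A & B))  # AND (intersection): [2, 3]
print(sorted(A | B))  # OR (union): [1, 2, 3, 4, 5]
print(sorted(A - B))  # AND NOT (difference): [1, 5]
print(sorted(A ^ B))  # XOR (symmetric difference): [1, 4, 5]

# XOR expressed with the more common operators.
xor_via_others = (A | B) - (A & B)
print(xor_via_others == A ^ B)  # True
```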
Boolean Logic • [Venn diagram of two sets, A and B]
Query Languages • A way to express the query (formal expression of the information need) • Types: • Boolean • Natural Language • Stylized Natural Language • Form-Based (GUI)
Simple query language: Boolean • Terms + Connectors • terms • words • normalized (stemmed) words • phrases • thesaurus terms • connectors • AND • OR • NOT • parentheses (for grouping operations)
Boolean Queries • Cat • Cat OR Dog • Cat AND Dog • (Cat AND Dog) • (Cat AND Dog) OR Collar • (Cat AND Dog) OR (Collar AND Leash) • (Cat OR Dog) AND (Collar OR Leash)
Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • Each of the following combinations works:
Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • None of the following combinations works:
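To see which term combinations satisfy (Cat OR Dog) AND (Collar OR Leash), one option is simply to enumerate all sixteen presence/absence patterns. The sketch below does that; the function name `matches` and the variables `works` and `fails` are illustrative only.

```python
from itertools import product

terms = ("Cat", "Dog", "Collar", "Leash")

def matches(cat, dog, collar, leash):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) for one document."""
    return (cat or dog) and (collar or leash)

works, fails = [], []
for values in product([True, False], repeat=4):
    present = [term for term, flag in zip(terms, values) if flag]
    (works if matches(*values) else fails).append(present)

print(len(works), len(fails))  # 9 7
for combo in works:
    print("matches:", combo)
```

A document matches only if it has at least one term from each parenthesized group, so a document containing only Cat and Dog, for example, does not match.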
Boolean Queries • Usually expressed as INFIX operators in IR • ((a AND b) OR (c AND b)) • NOT is UNARY PREFIX operator • ((a AND b) OR (c AND (NOT b))) • AND and OR can be n-ary operators • (a AND b AND c AND d) • Some rules - (De Morgan revisited) • NOT(a) AND NOT(b) = NOT(a OR b) • NOT(a) OR NOT(b)= NOT(a AND b) • NOT(NOT(a)) = a
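The three rules at the end of the slide can be checked exhaustively, since each variable has only two possible values. A minimal sketch (the function name `rules_hold` is an assumption for the example):

```python
from itertools import product

def rules_hold():
    """Check the three rewrite rules for every truth assignment of a and b."""
    for a, b in product([False, True], repeat=2):
        if ((not a) and (not b)) != (not (a or b)):
            return False
        if ((not a) or (not b)) != (not (a and b)):
            return False
        if (not (not a)) != a:
            return False
    return True

print(rules_hold())  # True
```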
Boolean Searching • Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete • Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P) • [Venn diagram of four concepts: Cracks, Beams, Width measurement, Prestressed concrete]
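The relaxed query asks for any three of the four concepts. One way to generate such a query mechanically is to OR together every k-term AND group, as in the sketch below; `relaxed_query` and `satisfies` are hypothetical helper names, not part of any system described in the lecture.

```python
from itertools import combinations

def relaxed_query(terms, k):
    """Build an OR of every AND-group containing k of the given terms."""
    groups = ["(" + " AND ".join(group) + ")" for group in combinations(terms, k)]
    return " OR ".join(groups)

def satisfies(doc_terms, terms, k):
    """Equivalent test: the document holds at least k of the terms."""
    return sum(1 for term in terms if term in doc_terms) >= k

print(relaxed_query(["C", "B", "W", "P"], 3))
# (C AND B AND W) OR (C AND B AND P) OR (C AND W AND P) OR (B AND W AND P)
```

The groups come out in a different order from the slide, but the query is logically the same.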
Boolean Logic • [Venn diagram of three terms t1, t2, t3 with documents D1–D11 placed in the eight regions m1–m8, where each region (minterm) is a conjunction of t1, t2, t3 or their negations]
Precedence Ordering • In what order do we evaluate the components of the Boolean expression? • Parentheses are evaluated first • (a or b) and (c or d) • (a or (b and c) or d) • Usually start from the left and work right (in case of ties) • Usually (if there are no parentheses) • NOT before AND • AND before OR
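The usual ordering (parentheses first, then NOT, then AND, then OR, left to right on ties) can be sketched as a small recursive-descent parser. This is an illustration under assumptions: operators are matched case-insensitively, terms are single words, error handling is minimal, and the nested-tuple tree format is made up for the example.

```python
import re

def tokenize(query):
    """Split a query into parentheses and words."""
    return re.findall(r"\(|\)|[^\s()]+", query)

def parse(query):
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos].upper() if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():  # lowest precedence
        node = parse_and()
        while peek() == "OR":
            advance()
            node = ("OR", node, parse_and())
        return node

    def parse_and():  # binds tighter than OR
        node = parse_not()
        while peek() == "AND":
            advance()
            node = ("AND", node, parse_not())
        return node

    def parse_not():  # unary prefix, binds tighter than AND
        if peek() == "NOT":
            advance()
            return ("NOT", parse_not())
        return parse_atom()

    def parse_atom():  # a term or a parenthesized sub-expression
        token = advance()
        if token == "(":
            node = parse_or()
            if advance() != ")":
                raise ValueError("expected closing parenthesis")
            return node
        return token.lower()

    return parse_or()

print(parse("a OR b AND NOT c"))
# ('OR', 'a', ('AND', 'b', ('NOT', 'c')))
print(parse("(a or b) and (c or d)"))
# ('AND', ('OR', 'a', 'b'), ('OR', 'c', 'd'))
```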
Faceted Boolean Query • Strategy: break query into facets (polysemous with earlier meaning of facets) • conjunction of disjunctions: (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4) • each facet expresses a topic: (“rain forest” OR jungle OR amazon) AND (medicine OR remedy OR cure) AND (Smith OR Zhou)
Ordering of Retrieved Documents • Pure Boolean has no ordering • In practice: • order chronologically • order by total number of “hits” on query terms • What if one term has more hits than others? • Is it better to have one of each term or many of one term? • Fancier methods have been investigated • p-norm is most famous • usually impractical to implement • usually hard for user to understand
Faceted Boolean Query • Query still fails if one facet missing • Alternative: • Coordination level ranking • Order results in terms of how many facets (disjuncts) are satisfied • Also called Quorum ranking, Overlap ranking, and Best Match • Problem: Facets still undifferentiated • Alternative: • Assign weights to facets
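Coordination level ranking can be sketched as counting how many facets a document satisfies and sorting on that count; passing per-facet weights gives the weighted alternative. The documents, the weights, and the names `facet_score` and `rank` below are invented for illustration.

```python
# Facets from the rain forest example, lowercased; each facet is a set of terms.
facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

# Hypothetical documents, each represented by its set of index terms.
docs = {
    "d1": {"jungle", "cure", "smith"},
    "d2": {"amazon", "remedy"},
    "d3": {"zhou"},
    "d4": {"weather"},
}

def facet_score(doc_terms, facets, weights=None):
    """Sum the weights (default 1 each) of the facets the document satisfies."""
    weights = weights or [1] * len(facets)
    return sum(w for facet, w in zip(facets, weights) if doc_terms & facet)

def rank(docs, facets, weights=None):
    """Order documents by descending score; drop documents matching no facet."""
    scored = [(doc_id, facet_score(terms, facets, weights)) for doc_id, terms in docs.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

print(rank(docs, facets))
# [('d1', 3), ('d2', 2), ('d3', 1)]
print(rank(docs, facets, weights=[1, 1, 5]))
# [('d1', 7), ('d3', 5), ('d2', 2)]
```

A strict faceted Boolean query would return only d1; the ranking also surfaces partial matches, and the weights let one facet count for more than the others.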
Boolean Processing • Boolean Processing (classic Boolean) • Data structures for Query representation and Boolean Operations • Boolean processing logic and algorithms • Extended Boolean Models • Fuzzy Logic • Others
Boolean Processing • All processing takes place on postings lists • Different methods can be used for sorted or unsorted postings lists
Boolean Query Processing • The query must be parsed to determine what the following are and how they relate to one another: • Search words • Optional field or index qualifications • Boolean operators • Typical parsing uses lexical analysers (like lex or flex) along with parser generators like YACC, BISON or Llgen • These produce code to be compiled into programs. • Example…
Z39.50 Query Structure (ASN.1 Notation)
    -- Query Definitions
    Query ::= CHOICE{
        type-0    [0]   ANY,
        type-1    [1]   IMPLICIT RPNQuery,
        type-2    [2]   OCTET STRING,
        type-100  [100] OCTET STRING,
        type-101  [101] IMPLICIT RPNQuery,
        type-102  [102] OCTET STRING}
Z39.50 RPN Query (ASN.1 Notation)
    -- Definitions for RPN query
    RPNQuery ::= SEQUENCE{
        attributeSet  AttributeSetId,
        rpn           RPNStructure}
RPN Structure
    RPNStructure ::= CHOICE{
        op        [0] Operand,
        rpnRpnOp  [1] IMPLICIT SEQUENCE{
            rpn1  RPNStructure,
            rpn2  RPNStructure,
            op    Operator }
        }
Operand
    Operand ::= CHOICE{
        attrTerm    AttributesPlusTerm,
        resultSet   ResultSetId,
        -- If version 2 is in force:
        --  - If query type is 1, one of the above two must be chosen;
        --  - resultAttr (below) may be used only if query type is 101.
        resultAttr  ResultSetPlusAttributes}
Operator
    Operator ::= [46] CHOICE{
        and      [0] IMPLICIT NULL,
        or       [1] IMPLICIT NULL,
        and-not  [2] IMPLICIT NULL,
        -- If version 2 is in force:
        --  - For query type 1, one of the above three must be chosen;
        --  - prox (below) may be used only if query type is 101.
        prox     [3] IMPLICIT ProximityOperator}
Parse Result (Query Tree) • Z39.50 queries… • Query: Title XXX and Subject YYY • Operator: AND • left: Operand: Index = Title, Value = XXX • right: Operand: Index = Subject, Value = YYY
Parse Results • Query: Subject XXX and (title yyy and author zzz) • Op: AND (left: Oper: Index: Subject, Value: XXX; right: Op: AND (left: Oper: Index: Title, Value: YYY; right: Oper: Index: Author, Value: ZZZ))
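Once a query has been parsed into a tree of operators and operands, it can be evaluated bottom-up. The sketch below models the tree with two small Python classes loosely inspired by the Operand/Operator split; the class layout, the `INDEXES` table, and its document IDs are all assumptions made for the example, not the Z39.50 data structures themselves.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Operand:
    index: str  # e.g. "Subject", "Title", "Author"
    value: str  # the search term

@dataclass
class Operator:
    op: str  # "and", "or", or "and-not"
    left: "Node"
    right: "Node"

Node = Union[Operand, Operator]

# Hypothetical indexes: (index name, term) -> set of matching document IDs.
INDEXES = {
    ("subject", "xxx"): {1, 2, 3, 4},
    ("title", "yyy"): {2, 3, 5},
    ("author", "zzz"): {3, 4, 5},
}

def evaluate(node):
    """Evaluate a query tree recursively, returning a set of document IDs."""
    if isinstance(node, Operand):
        return INDEXES.get((node.index.lower(), node.value.lower()), set())
    left, right = evaluate(node.left), evaluate(node.right)
    if node.op == "and":
        return left & right
    if node.op == "or":
        return left | right
    if node.op == "and-not":
        return left - right
    raise ValueError(f"unknown operator: {node.op}")

# Subject XXX and (title yyy and author zzz)
tree = Operator(
    "and",
    Operand("Subject", "XXX"),
    Operator("and", Operand("Title", "YYY"), Operand("Author", "ZZZ")),
)
print(evaluate(tree))  # {3}
```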
Boolean AND (Sorted) Algorithm • Choose the shortest list • Create new list the same length as the short list • For each item in the short list • Compare it with the next item in the longer list • If the short-list item is greater: go to next item in longer list • If equal: add to new list and go to next item in both lists • If the short-list item is less: go to next item in short list
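A minimal Python sketch of this merge-style intersection, assuming both postings lists are sorted in ascending order and contain no duplicates (the function name `and_sorted` is illustrative):

```python
def and_sorted(a, b):
    """Intersect two ascending postings lists with a single linear pass."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    result = []
    i = j = 0
    while i < len(short) and j < len(long_):
        if short[i] > long_[j]:
            j += 1  # long-list item is too small: advance the longer list
        elif short[i] == long_[j]:
            result.append(short[i])  # match: keep it and advance both lists
            i += 1
            j += 1
        else:
            i += 1  # short-list item is not in the longer list: skip it
    return result

print(and_sorted([2, 4, 8, 12], [1, 2, 3, 8, 9, 12, 15]))  # [2, 8, 12]
```

A Python list grows on demand, so the sketch does not pre-allocate the result; the result can never be longer than the short list.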
Boolean AND Algorithm = AND
Boolean OR (Sorted) Algorithm • Choose the longer list • Create new list the same length as both lists combined • For each item in the longer list • If less than or equal to the first item in the short list • Add to new list • Otherwise • Add item from short list • Compare next items in short and long lists • If long item is less than short item, add long item and go to next long item • Otherwise – add from short list and go to next short item • Once the short list runs out, add the remaining items in the long list
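A minimal Python sketch of a merge-style union, assuming both lists are sorted ascending. One detail is an assumption beyond the slide: when the same document ID appears in both lists, the sketch emits it once. The function name `or_sorted` is illustrative.

```python
def or_sorted(a, b):
    """Merge two ascending postings lists into one ascending list."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])
            i += 1
        elif a[i] > b[j]:
            result.append(b[j])
            j += 1
        else:
            result.append(a[i])  # same ID in both lists: emit it once
            i += 1
            j += 1
    # One list has run out; append whatever remains of the other.
    result.extend(a[i:])
    result.extend(b[j:])
    return result

print(or_sorted([2, 4, 8, 12], [5, 7, 8, 200]))  # [2, 4, 5, 7, 8, 12, 200]
```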
Boolean OR Algorithm = OR
Boolean AND NOT (Sorted) Algorithm • Create new list the same length as the left-hand list • For each item in the left-hand list • Compare it with the next item in the NOT list • If the left-hand item is greater: go to next item in NOT list • If equal: go to next item in both lists • If the left-hand item is less: add it to new list and go to next item in left-hand list • Once the NOT list runs out, add the remaining items in the left-hand list
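A minimal Python sketch of a merge-style AND NOT, assuming both lists are sorted ascending: a left-hand ID is kept only once the NOT list has moved past it without matching. The function name `and_not_sorted` is illustrative.

```python
def and_not_sorted(left, not_list):
    """Return the IDs in `left` that do not occur in `not_list` (both ascending)."""
    result = []
    i = j = 0
    while i < len(left) and j < len(not_list):
        if left[i] > not_list[j]:
            j += 1  # NOT item is too small to matter: advance the NOT list
        elif left[i] == not_list[j]:
            i += 1  # excluded: skip the item in both lists
            j += 1
        else:
            result.append(left[i])  # NOT list has passed this ID: keep it
            i += 1
    result.extend(left[i:])  # NOT list exhausted: everything left survives
    return result

print(and_not_sorted([2, 4, 8, 12], [4, 5, 12]))  # [2, 8]
```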
Boolean AND NOT Algorithm = AND NOT
Hashed Boolean AND (unsorted) • Put each item in shortest list into hash table • For each item in other lists • If hash entry exists, set flag in hash table entry (or increment counter) • Scan hash table contents • If flag set (or counter == number of lists) add to new list
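A minimal sketch of the hashed AND, using a Python dict as the hash table and the counter variant of the flag. The function name `and_hashed` is illustrative, and converting each list to a set first is an added safeguard (not on the slide) so that a repeated ID within one list is counted only once.

```python
def and_hashed(*lists):
    """Intersect any number of unsorted postings lists using a hash table."""
    if not lists:
        return []
    shortest_idx = min(range(len(lists)), key=lambda i: len(lists[i]))
    # Seed the table from the shortest list; each entry starts with a count of 1.
    counts = {item: 1 for item in lists[shortest_idx]}
    for idx, postings in enumerate(lists):
        if idx == shortest_idx:
            continue
        for item in set(postings):
            if item in counts:
                counts[item] += 1
    # Keep only the items that were seen in every list.
    return [item for item, n in counts.items() if n == len(lists)]

print(sorted(and_hashed([12, 2, 8, 4], [8, 120, 2], [2, 8, 5])))  # [2, 8]
```

Seeding the table from the shortest list keeps it as small as possible, since no ID outside that list can appear in the result.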
Hashed Boolean OR (unsorted) • Put each item in EACH list into hash table • If match increment counter (optional) • Scan hash table contents and add to new list
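A minimal sketch of the hashed OR with the optional match counter, again using a Python dict as the hash table (the function name `or_hashed` is illustrative):

```python
def or_hashed(*lists):
    """Union any number of unsorted postings lists using a hash table."""
    counts = {}
    for postings in lists:
        for item in postings:
            # Optional counter: number of times each ID was seen.
            counts[item] = counts.get(item, 0) + 1
    return list(counts)  # each ID once, in first-seen order

print(or_hashed([12, 2], [8, 2], [5]))  # [12, 2, 8, 5]
```

The counters are not needed for the union itself, but they could be reused for ordering results by number of matching terms.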
Hashed Boolean AND NOT (unsorted) • Put each item in left-hand list into hash table • For each item in NOT list • If hash entry exists, remove it • Scan hash table contents and add to new list
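A minimal sketch of the hashed AND NOT: load the left-hand list into a hash table, delete every entry that also occurs in the NOT list, and return what remains. The function name `and_not_hashed` is illustrative.

```python
def and_not_hashed(left, not_list):
    """Return the IDs in `left` that are absent from `not_list` (unsorted input)."""
    table = dict.fromkeys(left)  # hash table keyed by document ID
    for item in not_list:
        table.pop(item, None)  # remove the entry if it exists, else do nothing
    return list(table)

print(and_not_hashed([12, 2, 8, 4], [8, 120, 2]))  # [12, 4]
```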