This article discusses the Boolean Information Retrieval (IR) model, which is based on set theoretic principles established by George Boole in the 1850s. It explores how this model deals with set membership and operations on sets, using Boolean expressions for queries and returning documents that satisfy the expressions. The model's simplicity and ease of understanding make it conceptually straightforward, but it has limitations in terms of speed and complex operations.
Information retrieval 2019/2020
IR Models • Set Theoretic Models • Boolean • Fuzzy • Extended Boolean • Vector Models (Algebraic) • Probabilistic Models (probabilistic) • Others (e.g., neural networks, etc.)
Boolean IR Model • Based on Boolean Logic (Algebra of Sets) • Fundamental principles established by George Boole in the 1850’s • Deals with set membership and operations on sets • Set membership in IR systems is usually based on whether (or not) a document contains a keyword (term)
Boolean IR Model • Queries are Boolean expressions, e.g., ‘Caesar and Brutus’ (predicates for term occurrence) • Search engine returns all documents that satisfy the Boolean expression • Conceptually simple and easy to understand • Basic operation: • identify set of documents containing a certain term
Boolean IR Model • Which plays of Shakespeare contain the words BRUTUS and CAESAR, but NOT CALPURNIA? • One could grep all of Shakespeare’s plays for BRUTUS and CAESAR, then strip out lines containing CALPURNIA. • Why is grep not the solution? • Slow (for large collections) • “NOT CALPURNIA” is non-trivial • Other operations (e.g., find the word Romans near countryman) not feasible • Ranked retrieval (best documents to return)
Term-document matrix [figure: one axis lists the documents, the other lists the terms]
We have a 0/1 vector for each term (characteristic functions for term occurrence predicates) • To answer the query BRUTUS AND CAESAR AND NOT CALPURNIA: • Take the vectors for BRUTUS, CAESAR, and CALPURNIA • Complement the vector of CALPURNIA • Do a (bitwise) AND on the three vectors • 110100 AND 110111 AND 101111 = 100100
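The bitwise evaluation above can be sketched in a few lines of Python. This is only an illustration: the vectors are stored as integers, the leftmost bit is assumed to stand for the first document, and the CALPURNIA vector is assumed to be 010000 so that its complement is the 101111 shown on the slide.

```python
# Incidence vectors over six documents, written as bit strings
# (assumption: leftmost bit = document 1).
N_DOCS = 6
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000  # assumed, so that its complement is 101111

# Python integers are unbounded, so mask the complement to six bits.
mask = (1 << N_DOCS) - 1
not_calpurnia = ~calpurnia & mask

result = brutus & caesar & not_calpurnia
result_bits = format(result, f"0{N_DOCS}b")

# Translate the set bits back into 1-based document numbers.
matching_docs = [i + 1 for i, bit in enumerate(result_bits) if bit == "1"]
print(result_bits, matching_docs)
```

The masking step matters because `~` on a Python integer yields a negative number rather than a fixed-width bit pattern.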
Sometimes we need to • Process large collections quickly • Support more specific operations (not only AND, OR, NOT), e.g., NEAR • Rank (sort) the retrieved documents
Inverted index • Dictionary → Posting list • Brutus → 1 2 4 11 31 45 173 174 • Caesar → 1 2 4 5 6 16 57 132 • Calpurnia → 2 31 54 101 • We need variable-size postings lists • On disk, a continuous run of postings is normal and best • In memory, can use linked lists or variable length arrays
Concepts • Dictionary of terms – list of all terms in memory • Posting list – list of term occurrences in documents • Posting – one item in the occurrence list
How to create a posting list • Acquire the collection of documents. • “Mier všetkým národom sveta.” (Slovak: “Peace to all nations of the world.”) • Tokenize content of documents. • Mier, všetkým, národom, sveta • Perform text operations (which? here: lowercasing and lemmatization). • mier, všetko, národ, svet • Alphabetically sort and find all documents containing terms.
Dictionary (sorted): mier • národ • svet • všetko
Query • “mier všetkým” • -> mier AND všetko
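The indexing steps and the query above can be sketched as follows. This is a minimal illustration under stated assumptions: `LEMMAS` is a hypothetical lookup table standing in for a real Slovak lemmatizer, tokenization is a plain whitespace split, and document 2 is made up so that the AND query has something to exclude.

```python
from collections import defaultdict

# Hypothetical lemma table standing in for a real lemmatizer (assumption).
LEMMAS = {"všetkým": "všetko", "národom": "národ", "sveta": "svet"}

def normalize(text):
    """Tokenize on whitespace, strip punctuation, lowercase, then lemmatize."""
    tokens = (tok.strip('.,!?"“”').lower() for tok in text.split())
    return [LEMMAS.get(tok, tok) for tok in tokens if tok]

def build_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            postings[term].add(doc_id)
    # Sort the dictionary alphabetically and each posting list by doc ID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

docs = {
    1: "Mier všetkým národom sveta.",
    2: "Mier sveta.",  # made-up second document
}
index = build_index(docs)

# The query goes through the same text operations as the documents.
query_terms = normalize("mier všetkým")
hits = set(index.get(query_terms[0], []))
for term in query_terms[1:]:
    hits &= set(index.get(term, []))
print(list(index), sorted(hits))
```

Normalizing the query with the same function as the documents is what turns “mier všetkým” into mier AND všetko.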
Merging lists
INTERSECT(p1, p2)
  answer = []
  while p1 <> null and p2 <> null do
    if docID(p1) = docID(p2) then
      ADD(answer, docID(p1))
      p1 = next(p1)
      p2 = next(p2)
    else if docID(p1) < docID(p2) then
      p1 = next(p1)
    else
      p2 = next(p2)
  return answer
Intersection example • Brutus AND Caesar AND Calpurnia • Brutus AND Caesar: 1, 3, 5, 15 • (Brutus AND Caesar) AND Calpurnia: 1
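The INTERSECT pseudocode can be sketched in Python as follows. Postings are plain sorted lists of document IDs instead of linked lists, and the three lists are made-up values chosen only so that the outputs agree with the results shown in the example (1, 3, 5, 15 and then 1); they are not taken from the slides.

```python
def intersect(p1, p2):
    """Merge two sorted posting lists, keeping doc IDs present in both."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that sits on the smaller doc ID
        else:
            j += 1
    return answer

# Hypothetical posting lists (assumption, for illustration only).
brutus = [1, 3, 5, 7, 15]
caesar = [1, 2, 3, 5, 15, 20]
calpurnia = [1, 4, 9]

brutus_and_caesar = intersect(brutus, caesar)
final = intersect(brutus_and_caesar, calpurnia)
print(brutus_and_caesar, final)
```

Each pointer only moves forward, which is why both lists must be sorted by document ID.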
Intersection example - frequencies • Brutus AND Caesar AND Calpurnia -> Calpurnia AND Brutus AND Caesar • Process the terms in order of increasing posting list size
Results • Sorted vs. not sorted • 3 vs 7 • Sort posting lists by size – it is useful to store document frequencies
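A minimal sketch of frequency-ordered processing, assuming the posting lists from the inverted index slide and using list length as the stored document frequency. The helper names `intersect` and `intersect_terms` are illustrative, not from the slides.

```python
def intersect(p1, p2):
    """Two-pointer merge of two sorted posting lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_terms(terms, index):
    """AND all terms together, starting with the shortest posting list."""
    if not terms:
        return [], []
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, index.get(term, []))
    return ordered, result

index = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}
ordered, result = intersect_terms(["Brutus", "Caesar", "Calpurnia"], index)
print(ordered, result)
```

Starting with the rarest term keeps every intermediate result no longer than the shortest list, so later merges have less to scan.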
Faster Posting lists • If the list lengths are m and n, the merge takes O(m+n) operations. • Can we do better? • Yes (if the index isn’t changing too fast).
Augment postings with skip pointers (at indexing time) • To skip postings that will not figure in the search results.
Process • Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance. • We then have 41 on the upper list and 11 on the lower. 11 is smaller. • Instead of advancing to 17, we look at the skip successor of 11 on the lower list, which is 31. It is still smaller than 41, so we can skip ahead.
Intersect with skips
INTERSECTWITHSKIPS(p1, p2)
  answer = []
  while p1 <> null and p2 <> null do
    if docID(p1) = docID(p2) then
      ADD(answer, docID(p1))
      p1 = next(p1)
      p2 = next(p2)
    else if docID(p1) < docID(p2) then
      while hasSkip(p1) and (docID(skip(p1)) < docID(p2)) do
        p1 = skip(p1)
      p1 = next(p1)
    else
      while hasSkip(p2) and (docID(skip(p2)) < docID(p1)) do
        p2 = skip(p2)
      p2 = next(p2)
  return answer
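The pseudocode above can be sketched in Python as follows. Skip pointers are modelled as a dictionary from a list index to the index it skips to, which is an assumption of this sketch. The two posting lists and their skip maps are hypothetical, chosen to be consistent with the numbers in the walkthrough (8, 11, 17, 31, 41).

```python
def intersect_with_skips(p1, skips1, p2, skips2):
    """Intersect sorted lists; skipsN maps an index to its skip target index."""
    answer = []
    skips_taken = 0
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow skips while the target is still below the other doc ID.
            while i in skips1 and p1[skips1[i]] < p2[j]:
                i = skips1[i]
                skips_taken += 1
            i += 1
        else:
            while j in skips2 and p2[skips2[j]] < p1[i]:
                j = skips2[j]
                skips_taken += 1
            j += 1
    return answer, skips_taken

# Hypothetical lists and skip pointers (assumption, for illustration only).
upper = [2, 4, 8, 41, 48, 64, 128]
upper_skips = {0: 3, 3: 6}   # 2 -> 41, 41 -> 128
lower = [1, 2, 3, 8, 11, 17, 21, 31]
lower_skips = {0: 4, 4: 7}   # 1 -> 11, 11 -> 31

answer, skips_taken = intersect_with_skips(upper, upper_skips, lower, lower_skips)
print(answer, skips_taken)
```

Because a skip is taken only when its target is strictly smaller than the other list's current document ID, the following `next` step cannot jump over a match.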
Where to place skips? • Tradeoff: • More skips → shorter skip spans ⇒ more likely to skip, but lots of comparisons to skip pointers. • Fewer skips → fewer pointer comparisons, but long skip spans ⇒ few successful skips.
Placing skips • Simple heuristic: for posting lists of length L, use evenly spaced skip pointers • This takes into account the distribution of query terms in a simple way: • the larger the doc frequency of a term the larger the number of skip pointers • Easy if the index is relatively static; harder if postings keep changing because of updates • This definitely used to help; with modern hardware it may not (Bahle et al. 2002) unless you’re memory-based: • because the I/O cost of loading a bigger index structure can outweigh the gains from quicker in-memory merging!
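A small sketch of evenly spaced skip placement. The slide does not say how many pointers to use; the spacing of roughly √L used here is an assumption, based on a commonly cited rule of thumb, and `place_skips` is a hypothetical helper name.

```python
import math

def place_skips(postings_length):
    """Return {index: target_index} with a skip roughly every sqrt(L) postings."""
    step = math.isqrt(postings_length)  # assumed spacing, not from the slides
    if step < 2:
        return {}  # a skip of one position is the same as next()
    return {i: i + step for i in range(0, postings_length - step, step)}

print(place_skips(16))
print(len(place_skips(100)))
```

Longer posting lists get more pointers under this scheme, which matches the remark that terms with a larger document frequency receive more skip pointers.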
Phrase retrieval • 10% of queries are phrases • Occurrence posting list is not enough • Biword index • Positional index
Biword index • Mier ľuďom celého sveta (Slovak: “Peace to the people of the whole world”) • mier ľudia • ľudia celý • celý svet • Longer phrases are transformed into biwords • May lead to wrong results (false positives, which lower precision)
Biword – false positive • Query: “faculty of informatics and information technologies” • Doc1 • The Faculty of Informatics is one of eight faculties at the Vienna University of Technology, which carries out research in information, technology and business informatics. • faculty informatics, informatics one, …, information technology • Doc2 • Information technology is the application of computers and telecommunications equipment to store, retrieve, transmit and manipulate data. • information technology, technology application, application computers • Doc3 • Faculty of Informatics and Information Technologies (FIIT) is one of the seven faculties of the Slovak University of Technology in Bratislava (SUT). • faculty informatics, informatics information, information technology
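How a biword index produces a false positive can be seen in a minimal sketch. The two documents and the query below are made up for illustration and are deliberately simpler than the slide's example: there is no stop-word removal or lemmatization, and the final substring check merely stands in for a real phrase verification step.

```python
from collections import defaultdict

def biwords(text):
    """Return the consecutive word pairs of a lowercased text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def build_biword_index(docs):
    """Map each biword to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pair in biwords(text):
            index[pair].add(doc_id)
    return index

def phrase_candidates(phrase, index):
    """Documents containing every biword of the phrase (not the phrase itself)."""
    pairs = biwords(phrase)
    if not pairs:
        return set()
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result

# Made-up documents (assumption, for illustration only).
docs = {
    1: "new york university campus",
    2: "york university opened a new york office",
}
index = build_biword_index(docs)
candidates = phrase_candidates("new york university", index)
true_matches = {d for d in candidates if "new york university" in docs[d]}
print(sorted(candidates), sorted(true_matches))
```

Document 2 contains both biwords, “new york” and “york university”, but never the three words in sequence, so it is returned as a candidate without being a real match.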
Positional index (with freq) • <term, number of docs containing term; Doc1, term-freq in Doc1: position1, position2 … ; Doc2, term-freq in Doc2: position1, position2 … ; etc.>
Positional index example (without freq) <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367,…> • For phrase queries, we use a merge algorithm recursively at the document level • But we need to deal with more than just equality
Positional index processing a query • “to be or not to be” • Extract inverted index entries for to, be, or, not • Merge their doc:position lists to enumerate all positions with “to be or not to be” • to: • 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191… • be: • 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101… • Same general method for proximity queries (distance between words)
Intersect with positional indexes
POSITIONALINTERSECT(p1, p2, k)
  answer = []
  while p1 <> null and p2 <> null do
    if docID(p1) = docID(p2) then
      l = []
      pp1 = positions(p1)
      pp2 = positions(p2)
      while pp1 <> null do
        while pp2 <> null do
          if abs(pos(pp1) - pos(pp2)) ≤ k then
            ADD(l, pos(pp2))
          else if pos(pp2) > pos(pp1) then
            break
          pp2 = next(pp2)
        while l <> [] and abs(l[0] - pos(pp1)) > k do
          delete(l[0])
        for each ps ∈ l do
          ADD(answer, (docID(p1), pos(pp1), ps))
        pp1 = next(pp1)
      p1 = next(p1)
      p2 = next(p2)
    else if docID(p1) < docID(p2) then
      p1 = next(p1)
    else
      p2 = next(p2)
  return answer
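A Python sketch of the positional intersection follows. Postings are assumed to be lists of (docID, positions) pairs sorted by docID, and the data is the visible part of the “to” and “be” lists from the query slide (the truncated tails are left out). The final filter for the phrase “to be” is an addition of this sketch, since the algorithm itself reports any two positions within distance k in either order.

```python
def positional_intersect(p1, p2, k):
    """Return (doc_id, pos1, pos2) triples whose positions are within k."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        doc1, positions1 = p1[i]
        doc2, positions2 = p2[j]
        if doc1 == doc2:
            window = []  # positions of term 2 close to the current position
            jj = 0
            for a in positions1:
                while jj < len(positions2):
                    if abs(a - positions2[jj]) <= k:
                        window.append(positions2[jj])
                    elif positions2[jj] > a:
                        break
                    jj += 1
                # Drop positions that are now too far behind.
                while window and abs(window[0] - a) > k:
                    window.pop(0)
                for b in window:
                    answer.append((doc1, a, b))
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return answer

to = [(2, [1, 17, 74, 222, 551]), (4, [8, 16, 190, 429, 433]), (7, [13, 23, 191])]
be = [(1, [17, 19]), (4, [17, 191, 291, 430, 434]), (5, [14, 19, 101])]

near = positional_intersect(to, be, k=1)
# For the phrase "to be", keep only pairs where "be" directly follows "to".
phrase_hits = [(d, a, b) for d, a, b in near if b - a == 1]
print(phrase_hits)
```

Only document 4 appears in both lists, and in it “be” follows “to” at four places.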
Rules • A positional index is 2-4 times as large as a non-positional index • Positional index size is 35-50% of the volume of the original text
Dictionary data structures for inverted indexes • An array of struct:
Dictionary data structures • Two main choices: • Hash table • Tree • Some IR systems use hashes, some trees
Hashes • Each vocabulary term is hashed to an integer • (We assume you’ve seen hash tables before) • Pros: • Lookup is faster than for a tree • Cons: • No easy way to find minor variants: • judgment/judgement • No prefix search ("bar*") [tolerant retrieval] • If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything
Tree: B-tree • Definition: Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].
Trees • Simplest: binary tree • More usual: B-trees • Trees require a standard ordering of characters and hence strings … but we have one – lexicographic • Unless we are dealing with Chinese (no unique ordering) • Pros: • Solves the prefix problem (terms starting with 'hyp') • Cons: • Slower: O(log M) [and this requires balanced tree] • Rebalancing binary trees is expensive • But B-trees mitigate the rebalancing problem.
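The contrast between the two dictionary structures can be sketched with Python's built-in types. A `dict` plays the role of the hash table, and a sorted list searched with `bisect` stands in for an ordered tree; it is not a B-tree, only an assumption used to show what lexicographic ordering buys. The vocabulary is made up, and the `"\uffff"` sentinel assumes terms contain no higher code points.

```python
import bisect

# Made-up vocabulary (assumption, for illustration only).
vocabulary = ["hypothesis", "bar", "judgment", "hyperbole", "barn",
              "judgement", "hypnosis"]

# Hash-style dictionary: exact lookup only.
hashed = {term: [] for term in vocabulary}  # term -> (empty) posting list

# Ordered dictionary: terms kept in lexicographic order.
ordered = sorted(vocabulary)

def prefix_search(sorted_terms, prefix):
    """Return all terms starting with prefix using two binary searches."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

print("hypnosis" in hashed)           # exact match works in the hash table
print(prefix_search(ordered, "hyp"))  # prefix query needs the ordering
print(prefix_search(ordered, "bar"))
```

With only the hash table, answering 'hyp*' would mean scanning every key, whereas the ordered structure finds the matching range in O(log M) steps.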