Information retrieval

Information retrieval 2019/2020

IR process

IR Models • Set-theoretic models: Boolean, fuzzy, extended Boolean • Vector models (algebraic) • Probabilistic models • Others (e.g., neural networks)

Boolean IR Model • Based on Boolean logic (algebra of sets) • Fundamental principles established by George Boole in the 1850s • Deals with set membership and operations on sets • Set membership in IR systems is usually based on whether or not a document contains a keyword (term)

Boolean IR Model • Queries are Boolean expressions, e.g., ‘Caesar and Brutus’ (predicates for term occurrence) • Search engine returns all documents that satisfy the Boolean expression • Conceptually simple and easy to understand • Basic operation: • identify set of documents containing a certain term

Boolean IR Model • Which plays of Shakespeare contain the words BRUTUS and CAESAR, but NOT CALPURNIA? • One could grep all of Shakespeare’s plays for BRUTUS and CAESAR, then strip out lines containing CALPURNIA. • Why is grep not the solution? • Slow (for large collections) • “NOT CALPURNIA” is non-trivial • Other operations (e.g., find the word Romans near countryman) are not feasible • Ranked retrieval (best documents to return) is not supported

Term-document matrix (axes: document, term)

We have a 0/1 vector for each term (characteristic functions for term occurrence predicates) • To answer the query BRUTUS AND CAESAR AND NOT CALPURNIA: • Take the vectors for BRUTUS, CAESAR, and CALPURNIA • Complement the vector of CALPURNIA • Do a (bitwise) AND on the three vectors • 110100 AND 110111 AND 101111 = 100100
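The bitwise evaluation above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the vectors are stored as integers with one bit per document, the CALPURNIA vector `010000` is inferred from its complement `101111` given on the slide, and the names `N_DOCS`, `vectors` and `complement` are hypothetical.

```python
# Incidence vectors as 6-bit integers, one bit per document
# (leftmost bit = first document). BRUTUS and CAESAR are taken from the slide;
# CALPURNIA is inferred from its complement 101111.
N_DOCS = 6
vectors = {
    "BRUTUS": 0b110100,
    "CAESAR": 0b110111,
    "CALPURNIA": 0b010000,
}


def complement(vec: int, n_docs: int = N_DOCS) -> int:
    """Bitwise NOT restricted to the n_docs lowest bits."""
    return ~vec & ((1 << n_docs) - 1)


# BRUTUS AND CAESAR AND NOT CALPURNIA
result = vectors["BRUTUS"] & vectors["CAESAR"] & complement(vectors["CALPURNIA"])
result_bits = format(result, f"0{N_DOCS}b")
print(result_bits)  # 100100
```

The mask in `complement` matters: Python integers are unbounded, so a plain `~vec` would produce a negative number rather than a 6-bit vector.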

Sometimes we need to • Process large document collections quickly • Support more specific operations (not only AND, OR, NOT), e.g., NEAR • Sort (rank) the retrieved documents

Inverted index • We need variable-size posting lists • On disk, a continuous run of postings is normal and best • In memory, can use linked lists or variable-length arrays • Dictionary → posting list: • Brutus → 1 2 4 11 31 45 173 174 • Caesar → 1 2 4 5 6 16 57 132 • Calpurnia → 2 31 54 101

Concepts • Dictionary of terms – list of all terms, kept in memory • Posting list – list of the documents in which a term occurs • Posting – one item in the posting list • Example: Brutus → 1 2 4 11 31 45 173 174 • Caesar → 1 2 4 5 6 16 57 132 • Calpurnia → 2 31 54 101

How to create a posting list • Acquire the collection of documents. • “Mier všetkým národom sveta.” (Slovak: “Peace to all nations of the world.”) • Tokenize the content of the documents. • Mier, všetkým, národom, sveta • Perform text operations (which? here, lowercasing and lemmatization). • mier, všetko, národ, svet • Sort the terms alphabetically and find all documents containing each term.
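The indexing steps above can be sketched as follows. This is only an illustration under stated assumptions: `LEMMAS` is a hypothetical toy lookup table standing in for a real Slovak lemmatizer, document 2 is invented for the example, and the alphabetical sort uses Python's default code-point order rather than Slovak collation.

```python
from collections import defaultdict

# Hypothetical toy lemma table; a real system would use a proper lemmatizer.
LEMMAS = {"všetkým": "všetko", "národom": "národ", "sveta": "svet"}


def tokenize(text: str) -> list[str]:
    """Split on whitespace and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.split())
    return [tok for tok in tokens if tok]


def normalize(token: str) -> str:
    """Lowercase, then map to a lemma if the toy table knows one."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Return term -> sorted list of docIDs, with terms in sorted order."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Document 1 is the slide's sentence; document 2 is made up for illustration.
docs = {1: "Mier všetkým národom sveta.", 2: "Mier sveta."}
index = build_index(docs)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Storing docIDs in a set during construction avoids duplicate postings when a term occurs several times in one document.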

Dictionary after sorting • mier • národ • svet • všetko

Query • “mier všetkým” • → mier AND všetko • Dictionary: mier, národ, svet, všetko

Merging lists
INTERSECT(p1, p2)
  answer = []
  while p1 <> null and p2 <> null do
    if docID(p1) = docID(p2) then
      ADD(answer, docID(p1))
      p1 = next(p1)
      p2 = next(p2)
    else if docID(p1) < docID(p2) then
      p1 = next(p1)
    else
      p2 = next(p2)
  return answer
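A runnable counterpart of INTERSECT, as a minimal sketch: the posting lists are assumed to be Python lists of docIDs sorted in increasing order, and list indices play the role of the pointers p1 and p2. The example docIDs are the Brutus and Caesar lists from the inverted-index figure.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Merge-style intersection of two sorted posting lists in O(m + n)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID on both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))  # [1, 2, 4]
```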

Intersection example • Brutus AND Caesar AND Calpurnia • 1, 3, 5, 15 • 1

Intersection example - frequencies • Brutus AND Caesar AND Calpurnia → • Calpurnia AND Brutus AND Caesar (rarest term first)

Results • Sorted vs. not sorted • 3 vs 7 • Process posting lists in order of increasing size; for this it is useful to store frequencies
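Ordering by frequency can be sketched as below, assuming the index is a plain dictionary from term to sorted docID list (so `len` gives the document frequency). The function name `intersect_terms` is hypothetical, and the example lists are the ones from the inverted-index figure.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Merge-style intersection of two sorted posting lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_terms(terms: list[str], index: dict[str, list[int]]) -> list[int]:
    """AND together several terms, processing the shortest posting list first."""
    if not terms:
        return []
    ordered = sorted(terms, key=lambda term: len(index[term]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:
            break  # nothing left to match, no need to scan longer lists
        result = intersect(result, index[term])
    return result


index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_terms(["brutus", "caesar", "calpurnia"], index))  # [2]
```

Starting with the rarest term keeps every intermediate result no longer than the shortest list, which is why storing the frequency next to each dictionary entry pays off.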

Faster posting lists • If the list lengths are m and n, the merge takes O(m+n) operations. • Can we do better? • Yes (if the index isn’t changing too fast).

Augment postings with skip pointers (at indexing time) • To skip postings that will not figure in the search results.

Process • Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance. • We then have 41 on the upper list and 11 on the lower. 11 is smaller. • But instead of advancing to 17, we note that the skip successor of 11 on the lower list is 31, which is also smaller than 41, so we can skip ahead.

Intersect with skips
INTERSECTWITHSKIPS(p1, p2)
  answer = []
  while p1 <> null and p2 <> null do
    if docID(p1) = docID(p2) then
      ADD(answer, docID(p1))
      p1 = next(p1)
      p2 = next(p2)
    else if docID(p1) < docID(p2) then
      while hasSkip(p1) and (docID(skip(p1)) < docID(p2)) do
        p1 = skip(p1)
      p1 = next(p1)
    else
      while hasSkip(p2) and (docID(skip(p2)) < docID(p1)) do
        p2 = skip(p2)
      p2 = next(p2)
  return answer
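A possible Python rendering of INTERSECTWITHSKIPS. Several things here are assumptions of the sketch, not part of the slides: posting lists are sorted Python lists, skip pointers are simulated by a fixed stride of ⌊√L⌋ positions instead of stored pointers, and the helper `skip_target` and the two example lists are hypothetical.

```python
import math


def skip_target(i: int, n: int, step: int):
    """Index reached by the skip pointer at position i, or None if there is none."""
    if step > 1 and i % step == 0 and i + step < n:
        return i + step
    return None


def intersect_with_skips(p1: list[int], p2: list[int]) -> list[int]:
    """Intersect two sorted posting lists, following skips where they are safe."""
    s1, s2 = math.isqrt(len(p1)), math.isqrt(len(p2))  # assumed skip strides
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow skips on p1 while the skip target is still below p2[j].
            target = skip_target(i, len(p1), s1)
            while target is not None and p1[target] < p2[j]:
                i = target
                target = skip_target(i, len(p1), s1)
            i += 1
        else:
            # Symmetric case: follow skips on p2.
            target = skip_target(j, len(p2), s2)
            while target is not None and p2[target] < p1[i]:
                j = target
                target = skip_target(j, len(p2), s2)
            j += 1
    return answer


upper = [2, 4, 8, 41, 48, 64, 128]
lower = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(upper, lower))  # [2, 8]
```

Because a skip is only followed when its target is strictly smaller than the docID on the other list, the posting landed on cannot be a match, so the unconditional advance after the inner loop never jumps over a result.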

Where to place skips? • Tradeoff: • More skips → shorter skip spans ⇒ more likely to skip, but lots of comparisons to skip pointers. • Fewer skips → fewer pointer comparisons, but long skip spans ⇒ few successful skips.

Placing skips • Simple heuristic: for a posting list of length L, use √L evenly spaced skip pointers • This takes into account the distribution of query terms in a simple way: the larger the document frequency of a term, the larger the number of skip pointers • Easy if the index is relatively static; harder if postings keep changing because of updates • This definitely used to help; with modern hardware it may not (Bahle et al. 2002) unless you’re memory-based, because the I/O cost of loading a bigger index structure can outweigh the gains from quicker in-memory merging!
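Evenly spaced placement can be sketched as below. The stride of ⌊√L⌋ is an assumption of this sketch (a commonly cited choice), and `place_skips` is a hypothetical helper that returns the pointers as a mapping from source index to target index.

```python
import math


def place_skips(postings: list[int]) -> dict[int, int]:
    """Return {source index: target index} for evenly spaced skip pointers.

    Uses a stride of floor(sqrt(L)); lists too short for a useful stride
    get no skip pointers at all.
    """
    n = len(postings)
    step = math.isqrt(n)
    if step < 2:
        return {}
    # A pointer is placed only if its target still lies inside the list.
    return {i: i + step for i in range(0, n - step, step)}


postings = list(range(100, 116))  # 16 hypothetical docIDs
print(place_skips(postings))  # {0: 4, 4: 8, 8: 12}
```

Longer lists get more pointers automatically, which matches the observation that terms with a higher document frequency receive more skips. After insertions or deletions the mapping has to be recomputed, which is the maintenance cost mentioned above.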

Phrase retrieval • 10% of queries are phrase queries • A posting list of document occurrences is not enough • Biword index • Positional index

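Biword index • Index every pair of consecutive terms as a phrase • “Mier ľuďom celého sveta” (Slovak: “Peace to the people of the whole world”) • mier ľudia • ľudia celý • celý svet • Longer phrases are broken into biwords • May lead to wrong results (false positives)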

Biword – false positive • Query: “faculty of informatics and information technologies” • Doc1: The Faculty of Informatics is one of eight faculties at the Vienna University of Technology, which carries out research in information, technology and business informatics. • Biwords: faculty informatics, informatics one, …, information technology • Doc2: Information technology is the application of computers and telecommunications equipment to store, retrieve, transmit and manipulate data. • Biwords: information technology, technology application, application computers • Doc3: Faculty of Informatics and Information Technologies (FIIT) is one of the seven faculties of the Slovak University of Technology in Bratislava (SUT). • Biwords: faculty informatics, informatics information, information technology
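How a biword index produces false positives can be shown with a small sketch. Everything here is illustrative: the two documents are invented, tokenization is a plain lowercase whitespace split with no stop-word removal or lemmatization, and the function names are hypothetical.

```python
from collections import defaultdict


def biwords(tokens: list[str]) -> list[tuple[str, str]]:
    """All pairs of consecutive tokens."""
    return list(zip(tokens, tokens[1:]))


def build_biword_index(docs: dict[int, str]) -> dict[tuple[str, str], set[int]]:
    """Map each biword to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pair in biwords(text.lower().split()):
            index[pair].add(doc_id)
    return index


def phrase_candidates(phrase: str, index) -> set[int]:
    """Documents containing every biword of the phrase (candidates only)."""
    pairs = biwords(phrase.lower().split())
    if not pairs:
        return set()  # a one-word phrase has no biwords in this sketch
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result


docs = {
    1: "an information retrieval system ranks documents",
    2: "information retrieval is hard but a retrieval system helps",
}
index = build_biword_index(docs)
hits = phrase_candidates("information retrieval system", index)
print(sorted(hits))  # [1, 2]
```

Document 2 is returned although it never contains the full phrase: it holds "information retrieval" and "retrieval system" in different places. Candidates therefore need to be verified against the text, or a positional index used instead.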

Positional index (with freq) • <term, number of docs containing term; Doc1, term-freq in Doc1: position1, position2, …; Doc2, term-freq in Doc2: position1, position2, …; etc.>

Positional index example (without freq) • <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …> • For phrase queries, we use a merge algorithm recursively at the document level • But we need to deal with more than just equality
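A minimal sketch of how such entries could be built, assuming lowercase whitespace tokenization and positions counted from 1. The two documents and the helper names are hypothetical; `format_entry` prints an entry in the angle-bracket notation used above, with the number of documents containing the term after the colon.

```python
from collections import defaultdict


def build_positional_index(docs: dict[int, str]) -> dict[str, dict[int, list[int]]]:
    """Return term -> {docID: [positions]}, with positions counted from 1."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        tokens = docs[doc_id].lower().split()
        for pos, term in enumerate(tokens, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def format_entry(term: str, index) -> str:
    """Render one entry as <term: doc count; doc: positions; ...>."""
    postings = index[term]
    body = "; ".join(
        f"{doc_id}: {', '.join(map(str, positions))}"
        for doc_id, positions in postings.items()
    )
    return f"<{term}: {len(postings)}; {body}>"


docs = {1: "to be or not to be", 2: "let it be"}
index = build_positional_index(docs)
print(format_entry("be", index))  # <be: 2; 1: 2, 6; 2: 3>
```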

Positional index: processing a query • “to be or not to be” • Extract inverted index entries for to, be, or, not • Merge their doc:position lists to enumerate all positions with “to be or not to be” • to: 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191… • be: 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101… • Same general method for proximity queries (distance between words)

Intersect with positional indexes
POSITIONALINTERSECT(p1, p2, k)
  answer = []
  while p1 <> null and p2 <> null do
    if docID(p1) = docID(p2) then
      l = []
      pp1 = positions(p1)
      pp2 = positions(p2)
      while pp1 <> null do
        while pp2 <> null do
          if abs(pos(pp1) - pos(pp2)) ≤ k then
            ADD(l, pos(pp2))
          else if pos(pp2) > pos(pp1) then
            break
          pp2 = next(pp2)
        while l <> [] and abs(l[0] - pos(pp1)) > k do
          delete(l[0])
        for each ps ∈ l do
          ADD(answer, (docID(p1), pos(pp1), ps))
        pp1 = next(pp1)
      p1 = next(p1)
      p2 = next(p2)
    else if docID(p1) < docID(p2) then
      p1 = next(p1)
    else
      p2 = next(p2)
  return answer
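POSITIONALINTERSECT translated into Python, as a sketch. The representation is an assumption: each posting list is a list of `(docID, sorted positions)` pairs. The example data are the positions listed for "to" and "be" in the query example; only the listed positions are used, nothing beyond the ellipses is assumed.

```python
def positional_intersect(p1, p2, k):
    """Return (docID, pos1, pos2) triples where the two terms are within k positions."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        doc1, positions1 = p1[i]
        doc2, positions2 = p2[j]
        if doc1 == doc2:
            window = []  # positions of term 2 close to the current term-1 position
            b = 0
            for pos1 in positions1:
                # Pull in term-2 positions up to the first one beyond the window.
                while b < len(positions2):
                    if abs(pos1 - positions2[b]) <= k:
                        window.append(positions2[b])
                    elif positions2[b] > pos1:
                        break
                    b += 1
                # Drop positions that are now too far behind pos1.
                while window and abs(window[0] - pos1) > k:
                    window.pop(0)
                for pos2 in window:
                    answer.append((doc1, pos1, pos2))
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return answer


to = [(2, [1, 17, 74, 222, 551]), (4, [8, 16, 190, 429, 433]), (7, [13, 23, 191])]
be = [(1, [17, 19]), (4, [17, 191, 291, 430, 434]), (5, [14, 19, 101])]
print(positional_intersect(to, be, 1))
# [(4, 16, 17), (4, 190, 191), (4, 429, 430), (4, 433, 434)]
```

The function reports matches within k positions in either direction. For the exact phrase "to be" one would additionally keep only the triples with `pos2 - pos1 == 1`, which happens to hold for all four results here.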

Rules of thumb • A positional index is 2–4 times as large as a non-positional index • Positional index size is 35–50% of the volume of the original text

Dictionary data structures for inverted indexes • An array of struct:

Dictionary data structures • Two main choices: • Hash table • Tree • Some IR systems use hashes, some use trees

Hashes • Each vocabulary term is hashed to an integer • (We assume you’ve seen hash tables before) • Pros: • Lookup is faster than for a tree • Cons: • No easy way to find minor variants: • judgment/judgement • No prefix search ("bar*") [tolerant retrieval] • If the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

Tree: binary tree

Tree: B-tree • Definition: Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].

Trees • Simplest: binary tree • More usual: B-trees • Trees require a standard ordering of characters and hence strings … but we have one: lexicographic • Unless we are dealing with Chinese (no unique ordering) • Pros: • Solves the prefix problem (terms starting with 'hyp') • Cons: • Slower: O(log M) [and this requires a balanced tree] • Rebalancing binary trees is expensive • But B-trees mitigate the rebalancing problem.
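The difference between the two dictionary choices can be illustrated in Python. Note the substitution: Python's standard library has no B-tree, so a sorted list searched with `bisect` stands in for an ordered tree here. It gives the same O(log M) lookup and the same prefix behaviour, but not a tree's cheap insertions. The term list and function name are hypothetical.

```python
from bisect import bisect_left

terms = sorted(["hyena", "hyperbola", "hypnosis", "hypothesis", "judgment", "sickle"])

# Hash-based dictionary: fast exact lookup, but no notion of neighbouring terms.
hash_dict = {term: term_id for term_id, term in enumerate(terms)}
print("judgment" in hash_dict, "judgement" in hash_dict)  # True False


def prefix_search(sorted_terms: list[str], prefix: str) -> list[str]:
    """All terms starting with prefix: binary search, then a short scan."""
    matches = []
    i = bisect_left(sorted_terms, prefix)  # first term >= prefix
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


print(prefix_search(terms, "hyp"))  # ['hyperbola', 'hypnosis', 'hypothesis']
```

With the hash table, the only way to answer "hyp*" is to scan every key, since hashing scatters terms that are adjacent in lexicographic order.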
