450 likes | 618 Views
CMPS 561 Boolean Retrieval. Ryan Benton August 30, 2010. Agenda. Indices IR System Models Processing Boolean Query Algorithms for Intersection. Indices. Indices. Question: How do we store documents and terms such that we can retrieve documents Efficiently Effectively
E N D
CMPS 561Boolean Retrieval Ryan Benton August 30, 2010
Agenda • Indices • IR System Models • Processing Boolean Query • Algorithms for Intersection
Indices • Question: • How do we store documents and terms such that we can retrieve documents • Efficiently • Effectively • With reasonable space requirements?
Term-Document Matrix • Create table • Rows: Terms • Columns: Document Ids • Official Name: • Term-Document Incidence Matrix • Also called: Inverted View of Collection
Term-Document Matrix • Why • Rows Vectors of documents containing term X. • Columns Vectors of terms contained by document Y. • Technically, take the transpose of Matrix to get the columns vectors.
Term-Document Matrix • Naïve Way of Building • Create and store the Matrix • Some Calculations • 500,000 Terms • 1,000,000 Documents • ½ trillion entries : 500,000,000,00 • All 0’s and 1’s. • Memory Impact • As documents and/or term list grows • Can’t keep in memory
Term-Document Matrix • Observation: • Term-Document Matrix Sparse • Typically, only a small number of terms in any given document. • If typical document contains 1,000 terms • Matrix, in previous example, has 1 billion 1’s • 1,000,000,000 • Thus, 99.8% of matrix has 0’s.
Inverted Index • Also called: Inverted File • Dictionary of Terms • Vocabulary • Lexicon • Each term • List of documents in which it appears. • Each document sometimes called a posting.
Inverted Index Term 1 Record 1 Record 3 Term 2 Record 1 Record 2 Term 3 Record 2 Record 3 Record 4 Term 4 Record 1 Record 2 Record 3 Record 4
Inverted Index • Note: • Dictionary sorted alphabetically • Each ‘posting list’ sorted by ID • Storage: • Dictionary kept in memory • Postings Depends on space. • In memory • on disk.
Inverted Index, Some Change Term 1: 2 Record 1 Record 3 Term 2: 2 Record 1 Record 2 Term 3: 3 Record 2 Record 3 Record 4 Term 4: 4 Record 1 Record 2 Record 3 Record 4
Model • S = (D, Q, T, V, F) • d Î D • q Î Q • F: DC Q V • Fq: D V • Retrieval Status Values (RSV) • T: “index terms”
Model • S = (D, Q, T, V, F) • £ defined over elements of V is simpleorder • £¢ defined over elements of D by F is weakorder • Breaks element of D into number of subsets • Each subset are simplyordered
Subject Catalog Model • S = (D, Q, T, V, F) • T = set of subject headings • Q = T • D = 2T • V = { 0, 1 } • Fq(d) where q Î Q, d ÎD • 1, if q Î d • 0, otherwise
Coordination Level System • S = (D, Q, T, V, F) • Q = 2T • D = 2T • V = { 0, 1 } • Fq(d) where • 1, if q Í d • 0, otherwise • F¢q(d) where • 1, if |q Ç d| > k • 0, otherwise
Boolean Systems • S = (D, Q, T, V, F) • D = 2T • Q = E • V = { 0, 1 } • Fq(d) where • 1, if q evaluates to True • With respect to document • 0, otherwise
What is E? • Let t Î T, Then • t Î E • If e Î E, Then • Øe Î E • If e1, e2Î E, Then • e1Ú e2 Î E • e1Ù e2 Î E • Nothing else is in E!
Document Representation • Set of Document IDs • D = {da} a=1,2,…,p • Set of all term IDs: • T = {ti} i = 1,2,…,n
Document Representation • Relation • D = { < da, ti, mD(da, ti)> } • mD: D x T {0,1} • mD(da, ti) • 1, if da contains ti • 0, otherwise • D t = {daÎ D | mD(da, t) = 1} • d º D d = {tiÎ T | mD(d, ti) = 1}
Retrieval Function • Retrieval Status Value (RSV) º F • RSVt(da) = mD(da, t) • RSVØe(da) = 1 - RSVe(da) • RSVe1Ùe2(da) = RSVe1(da) Ù RSVe2(da) • RSVe1Úe2(da) = RSVe1(da) Ú RSVe2(da)
Boolean example q = Ø(d Ú e) Ù (c Ú (a Ù b)) Ù Ø Ú Ú c Ù b d e a
Ù Ø Ú Ú c Ù b d e a Boolean Query Example(Method 1- based on documents ) q = Ø(d Ú e) Ù (c Ú (a Ù b)) Dda = {a,c} RSVq(da) = 1 1 1 1 0 0 0 0 0 1
Boolean Query Example(Method 2- based on inverted lists) • D t1Ùt2 = {daÎ D | daÎ Dt1Ù da Î Dt2} • D t1Út2 = {daÎ D | daÎ Dt1Ú da Î Dt2} • Dt = set of Documents containing term t • T = {a, b, c, d, e} • Da, Db, Dc, Dd, De,
Ù a b Boolean Query Example(Method 2) • Output • DaÇ Db • Input • a Ù b
Output Dt De1 De2 De1Ùe2 De1Úe2 D \ De1 Query t e1 e2 e1Ùe2 e1Úe2 Øe1 Processing Boolean Query (Method 2)
Boolean Queries (Method 2) • and-queries (tiÙtj) • Construct a merged list M for Dti and Dtj. • Transfer all duplicated records Od on merge list to output • or-queries (tiÚtj) • Construct a merged list M for Dti and Dtj. • Transfer all unique records Ou on merge list to output.
Boolean Queries (Method 2) • not-queries (tiÙØtj) • Construct a merged list M for Dti and Dtj. • FIRST_List • Remove all items appearing only once from First List • Transfer remaining items to output (i.e. Od ). • Create merge list composed of First_List and list composed of Dti • SECOND_List • Remove items appearing more than once from SECOND_List • Transfer remaining items to output (i.e. Oa).
Reminder - Inverted Index Term 1 Record 1 Record 3 Term 2 Record 1 Record 2 Term 3 Record 2 Record 3 Record 4 Term 4 Record 1 Record 2 Record 3 Record 4
Example Query • ((t1Út2) ÙØ t3) • Let’s do the first part (t1Út2) • Dt1: {R1, R3) • Dt2: {R1, R2) • M(Dt1, Dt2) : {R1, R1, R2, R3} • Ou(t1Út2) : {R1, R2, R3} • M Merging Operation, O Output Selection
Example Query (cont’d) • Now, let’s handle the second part • ((t1Út2) ÙØ t3) • Dt3: {R2, R3, R4) • M(Dt1Út2, Dt3) : {R1, R2, R2, R3, R3, R4} • Od((t1Út2)Ùt3) : {R2 , R3} • M(Dt1Út2, D(t1Út2)Ùt3) : {R1, R2, R2, R3, R3} • Oa((t1Út2)ÙØt3) : {R1}
Algorithms – Basic Intersection(aka Merging) • Intersect(p1, p2) • answer {} • While (p1 != NIL) and (p2 != NIL) Do • if docID(p1) = docID(p2) • Then ADD(answer, docID(p1)) • p1 next(p1) • p2 next(p2) • Else if (docID(p1) < docID(p2)) • Then p1 next(p1) • Else p2 next(p2) • Return answer
Algorithms – Intersection • Complexity: O(x + y) • For any given two posting lists • List A has size x • List B has size y • Note, this is upper bound. • Formally, Complexity: Q(N) • N can be either • Number of documents in collection • Note, this is a tight bound.
Observation • In many cases, Boolean queries • Conjunctive in nature • Allows for a possible improvement based on posting size (term frequency)
Algorithms – Conjunctive Query Merging • IntersectConjunct(t1, t2, …, tz) • Terms SortByIncreasingFrequency((t1, t2, …, tz)) • Results postings(first(Terms)) • Terms rest(Terms) • while (Terms != NIL) and (Results != NIL) Do • Results Intersect(result, postings(first(Terms))) • Terms rest(Terms) • Return Results
Why? • By using least frequent term • All results guaranteed to be no larger than least frequent term • In practice • The ‘intermediate’ list always places upper bounds on the size.
Variations on Boolean • Extended Boolean • Has standard operations: • AND, OR and NOT • Plus • Term Proximity • Within X words, sentences, paragraphs • Wildcard Matching • Fuzzy • Allow for range • Function F no longer restricted to {0,1}
Thank-you Questions?
References • Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Chapter 1, 2008. • Abraham Bookstein and William Cooper, “A General Mathematical Model for Information Retrieval Systems”, The Library Quarterly, Vol 26, no. 2, pp 153-67. • Vijay V. Raghavan’s Notes/Lecture Material • http://www.cacs.louisiana.edu/~cmps561/561/notes/Model.pdf • Material in Slides ued with permission