
CMPS 561 Boolean Retrieval

Ryan Benton. August 30, 2010.



  1. CMPS 561 Boolean Retrieval • Ryan Benton • August 30, 2010

  2. Agenda • Indices • IR System Models • Processing Boolean Query • Algorithms for Intersection

  3. Indices

  4. Indices • Question: • How do we store documents and terms such that we can retrieve documents • Efficiently • Effectively • With reasonable space requirements?

  5. Term-Document Matrix • Create table • Rows: Terms • Columns: Document Ids • Official Name: • Term-Document Incidence Matrix • Also called: Inverted View of Collection

  6. Term-Document Incidence Matrix

  7. Term-Document Matrix • Why • Rows → Vectors of documents containing term X. • Columns → Vectors of terms contained by document Y. • Technically, take the transpose of the matrix to get the column vectors.

  8. Document-Term Incidence Matrix

  9. Term-Document Matrix • Naïve Way of Building • Create and store the Matrix • Some Calculations • 500,000 Terms • 1,000,000 Documents • ½ trillion entries: 500,000,000,000 • All 0’s and 1’s. • Memory Impact • As documents and/or term list grows • Can’t keep in memory

  10. Term-Document Matrix • Observation: • Term-Document Matrix → Sparse • Typically, only a small number of terms in any given document. • If a typical document contains 1,000 terms • Matrix, in previous example, has 1 billion 1’s • 1,000,000,000 • Thus, 99.8% of the matrix is 0’s.
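The sparsity figure can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted on the slides; the variable names are illustrative.

```python
# Numbers taken from the slides: 500,000 terms, 1,000,000 documents,
# and roughly 1,000 distinct terms per document.
terms = 500_000
documents = 1_000_000
terms_per_document = 1_000

total_entries = terms * documents          # cells in the incidence matrix
ones = documents * terms_per_document      # cells that hold a 1
zero_fraction = 1 - ones / total_entries   # share of cells that hold a 0

print(f"entries: {total_entries:,}")
print(f"ones:    {ones:,}")
print(f"zeros:   {zero_fraction:.1%}")
```

Only 0.2% of the cells carry information, which is why storing just the 1’s is attractive.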

  11. Inverted Index • Also called: Inverted File • Dictionary of Terms • Vocabulary • Lexicon • Each term • List of documents in which it appears. • Each document sometimes called a posting.

  12. Term-Document Incidence Matrix

  13. Inverted Index • Term 1 → Record 1, Record 3 • Term 2 → Record 1, Record 2 • Term 3 → Record 2, Record 3, Record 4 • Term 4 → Record 1, Record 2, Record 3, Record 4

  14. Inverted Index • Note: • Dictionary sorted alphabetically • Each ‘posting list’ → sorted by ID • Storage: • Dictionary kept in memory • Postings → Depends on space. • In memory • On disk.
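A minimal sketch of how such an index might be built in memory. The four-record collection below is an assumption, chosen so that the resulting posting lists match the Term 1 to Term 4 example; `build_inverted_index` is an illustrative name, not something from the lecture.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of record IDs that contain it.

    docs: dict of record ID -> iterable of terms (hypothetical input format).
    """
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    # Dictionary sorted alphabetically, each posting list sorted by ID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Assumed toy collection (not from the slides).
docs = {
    1: ["term1", "term2", "term4"],
    2: ["term2", "term3", "term4"],
    3: ["term1", "term3", "term4"],
    4: ["term3", "term4"],
}
index = build_inverted_index(docs)
for term, posting_list in index.items():
    print(term, posting_list)
```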

  15. Inverted Index, Some Change • Term 1: 2 → Record 1, Record 3 • Term 2: 2 → Record 1, Record 2 • Term 3: 3 → Record 2, Record 3, Record 4 • Term 4: 4 → Record 1, Record 2, Record 3, Record 4
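The number next to each term equals the length of its posting list. One possible way to keep that count alongside the postings is sketched below; the tuple layout is an assumption, not a prescribed format.

```python
# Posting lists from the example index.
index = {
    "term1": [1, 3],
    "term2": [1, 2],
    "term3": [2, 3, 4],
    "term4": [1, 2, 3, 4],
}

# Store (posting count, posting list) per term so the count can be read
# without touching the postings, which may live on disk.
index_with_counts = {
    term: (len(posting_list), posting_list)
    for term, posting_list in index.items()
}

for term, (count, posting_list) in index_with_counts.items():
    print(f"{term}: {count} -> {posting_list}")
```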

  16. IR System Models

  17. Model • S = (D, Q, T, V, F) • d ∈ D • q ∈ Q • F: D × Q → V • F_q: D → V • Retrieval Status Values (RSV) • T: “index terms”

  18. Model • S = (D, Q, T, V, F) • ≤ defined over elements of V is a simple order • ≤′ defined over elements of D by F is a weak order • Breaks the elements of D into a number of subsets • The subsets are simply ordered

  19. Subject Catalog Model • S = (D, Q, T, V, F) • T = set of subject headings • Q = T • D = 2^T • V = {0, 1} • F_q(d), where q ∈ Q, d ∈ D • 1, if q ∈ d • 0, otherwise

  20. Coordination Level System • S = (D, Q, T, V, F) • Q = 2^T • D = 2^T • V = {0, 1} • F_q(d) where • 1, if q ⊆ d • 0, otherwise • F′_q(d) where • 1, if |q ∩ d| > k • 0, otherwise
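The retrieval functions of the subject catalog and coordination level models translate almost directly into set operations. A rough sketch, with documents and queries represented as Python sets of terms; the function names and the sample document are illustrative.

```python
def subject_catalog(q, d):
    """F_q(d) for the subject catalog model: q is a single heading."""
    return 1 if q in d else 0

def coordination_subset(q, d):
    """F_q(d) for the coordination level system: 1 if q is a subset of d."""
    return 1 if q <= d else 0

def coordination_threshold(q, d, k):
    """F'_q(d): 1 if more than k query terms occur in the document."""
    return 1 if len(q & d) > k else 0

# Assumed example document and queries.
document = {"retrieval", "boolean", "index"}
print(subject_catalog("boolean", document))                       # 1
print(coordination_subset({"boolean", "ranking"}, document))      # 0
print(coordination_threshold({"boolean", "ranking", "index"}, document, 1))  # 1
```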

  21. Boolean Systems • S = (D, Q, T, V, F) • D = 2^T • Q = E • V = {0, 1} • F_q(d) where • 1, if q evaluates to True • With respect to the document • 0, otherwise

  22. What is E? • Let t ∈ T, Then • t ∈ E • If e ∈ E, Then • ¬e ∈ E • If e1, e2 ∈ E, Then • e1 ∨ e2 ∈ E • e1 ∧ e2 ∈ E • Nothing else is in E!

  23. Document Representation • Set of Document IDs: • D = {d_a}, a = 1, 2, …, p • Set of all term IDs: • T = {t_i}, i = 1, 2, …, n

  24. Document Representation • Relation • D = { <d_a, t_i, μ_D(d_a, t_i)> } • μ_D: D × T → {0, 1} • μ_D(d_a, t_i) • 1, if d_a contains t_i • 0, otherwise • D_t = {d_a ∈ D | μ_D(d_a, t) = 1} • d ≡ D_d = {t_i ∈ T | μ_D(d, t_i) = 1}
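The membership function and the two derived sets can be sketched as follows. The set of (document, term) pairs is an assumed toy relation, and the helper names are illustrative.

```python
# Assumed toy relation: the (document, term) pairs for which the
# membership function equals 1.
contains = {("d1", "a"), ("d1", "c"), ("d2", "b"), ("d2", "c")}
documents = {"d1", "d2"}
terms = {"a", "b", "c"}

def mu(d, t):
    """Membership function: 1 if document d contains term t, else 0."""
    return 1 if (d, t) in contains else 0

def docs_with_term(t):
    """D_t: the set of documents containing term t."""
    return {d for d in documents if mu(d, t) == 1}

def terms_of_doc(d):
    """D_d: the set of terms contained in document d."""
    return {t for t in terms if mu(d, t) == 1}

print(docs_with_term("c"))   # both documents
print(terms_of_doc("d1"))    # terms a and c
```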

  25. Retrieval Function • Retrieval Status Value (RSV) ≡ F • RSV_t(d_a) = μ_D(d_a, t) • RSV_¬e(d_a) = 1 − RSV_e(d_a) • RSV_{e1∧e2}(d_a) = RSV_e1(d_a) ∧ RSV_e2(d_a) • RSV_{e1∨e2}(d_a) = RSV_e1(d_a) ∨ RSV_e2(d_a)

  26. Processing Boolean Queries

  27. Boolean example • q = ¬(d ∨ e) ∧ (c ∨ (a ∧ b)) • [Figure: expression tree for q, with ∧ at the root over the subtrees ¬(d ∨ e) and c ∨ (a ∧ b)]

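  28. Boolean Query Example (Method 1 - based on documents) • q = ¬(d ∨ e) ∧ (c ∨ (a ∧ b)) • D_{d_a} = {a, c} • RSV_q(d_a) = 1 • [Figure: expression tree for q with each node labelled by its value for d_a: a = 1, b = 0, c = 1, d = 0, e = 0, a ∧ b = 0, c ∨ (a ∧ b) = 1, d ∨ e = 0, ¬(d ∨ e) = 1, root = 1]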
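Method 1 amounts to walking the expression tree once per document. A minimal sketch, assuming queries are encoded as nested tuples (an encoding chosen here for illustration, not one given in the lecture):

```python
def term(name):
    """Leaf node for a single index term."""
    return ("term", name)

def rsv(expr, doc_terms):
    """Retrieval status value of a Boolean expression for one document."""
    op = expr[0]
    if op == "term":
        return 1 if expr[1] in doc_terms else 0
    if op == "not":
        return 1 - rsv(expr[1], doc_terms)
    if op == "and":
        return rsv(expr[1], doc_terms) & rsv(expr[2], doc_terms)
    if op == "or":
        return rsv(expr[1], doc_terms) | rsv(expr[2], doc_terms)
    raise ValueError(f"unknown operator: {op!r}")

# q = NOT(d OR e) AND (c OR (a AND b))
q = ("and",
     ("not", ("or", term("d"), term("e"))),
     ("or", term("c"), ("and", term("a"), term("b"))))

print(rsv(q, {"a", "c"}))  # document containing a and c
```

Every document in the collection has to be visited, which is the main cost of this method.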

  29. Boolean Query Example (Method 2 - based on inverted lists) • D_{t1∧t2} = {d_a ∈ D | d_a ∈ D_t1 ∧ d_a ∈ D_t2} • D_{t1∨t2} = {d_a ∈ D | d_a ∈ D_t1 ∨ d_a ∈ D_t2} • D_t = set of documents containing term t • T = {a, b, c, d, e} • D_a, D_b, D_c, D_d, D_e

  30. Boolean Query Example (Method 2) • Input • a ∧ b • Output • D_a ∩ D_b

  31. Processing Boolean Query (Method 2) • Query → Output • t → D_t • e1 → D_e1 • e2 → D_e2 • e1 ∧ e2 → D_{e1∧e2} • e1 ∨ e2 → D_{e1∨e2} • ¬e1 → D \ D_e1
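The query-to-output mapping of Method 2 can be sketched with Python sets standing in for the inverted lists. The tuple encoding of queries and the name `evaluate` are assumptions made for this illustration; the posting data is the four-term example index.

```python
def evaluate(expr, index, all_docs):
    """Return the set of documents satisfying a Boolean expression."""
    op = expr[0]
    if op == "term":
        return set(index.get(expr[1], ()))          # D_t
    if op == "not":
        return all_docs - evaluate(expr[1], index, all_docs)   # D \ D_e1
    left = evaluate(expr[1], index, all_docs)
    right = evaluate(expr[2], index, all_docs)
    if op == "and":
        return left & right                          # intersection
    if op == "or":
        return left | right                          # union
    raise ValueError(f"unknown operator: {op!r}")

index = {"t1": [1, 3], "t2": [1, 2], "t3": [2, 3, 4], "t4": [1, 2, 3, 4]}
all_docs = {1, 2, 3, 4}

print(evaluate(("and", ("term", "t1"), ("term", "t2")), index, all_docs))
print(evaluate(("or", ("term", "t1"), ("term", "t2")), index, all_docs))
print(evaluate(("not", ("term", "t1")), index, all_docs))
```

Note that negation needs the full document set D, which a pure inverted index does not store per term.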

  32. Boolean Queries (Method 2) • and-queries (t_i ∧ t_j) • Construct a merged list M for D_ti and D_tj. • Transfer all duplicated records O_d on the merged list to output • or-queries (t_i ∨ t_j) • Construct a merged list M for D_ti and D_tj. • Transfer all unique records O_u on the merged list to output.

  33. Boolean Queries (Method 2) • not-queries (t_i ∧ ¬t_j) • Construct a merged list M for D_ti and D_tj (FIRST_List) • Remove all items appearing only once from FIRST_List • Transfer the remaining items to an intermediate output (i.e. O_d) • Create a merged list composed of that intermediate output and D_ti (SECOND_List) • Remove items appearing more than once from SECOND_List • Transfer the remaining items to output (i.e. O_a).

  34. Reminder - Inverted Index • Term 1 → Record 1, Record 3 • Term 2 → Record 1, Record 2 • Term 3 → Record 2, Record 3, Record 4 • Term 4 → Record 1, Record 2, Record 3, Record 4

  35. Example Query • ((t1 ∨ t2) ∧ ¬t3) • Let’s do the first part (t1 ∨ t2) • D_t1: {R1, R3} • D_t2: {R1, R2} • M(D_t1, D_t2): {R1, R1, R2, R3} • O_u(t1 ∨ t2): {R1, R2, R3} • M → Merging Operation, O → Output Selection

  36. Example Query (cont’d) • Now, let’s handle the second part • ((t1 ∨ t2) ∧ ¬t3) • D_t3: {R2, R3, R4} • M(D_{t1∨t2}, D_t3): {R1, R2, R2, R3, R3, R4} • O_d((t1 ∨ t2) ∧ t3): {R2, R3} • M(D_{t1∨t2}, D_{(t1∨t2)∧t3}): {R1, R2, R2, R3, R3} • O_a((t1 ∨ t2) ∧ ¬t3): {R1}
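The merge-and-select procedure can be reproduced in a few lines. This is a sketch under the assumption that each posting list is sorted and contains no repeated record IDs; the function names are illustrative, and records R1 to R4 are written as the integers 1 to 4.

```python
import heapq
from collections import Counter

def merge(a, b):
    """M: merge two sorted posting lists, keeping duplicates."""
    return list(heapq.merge(a, b))

def select_duplicated(merged):
    """O_d: records that appear more than once on the merged list."""
    return sorted(r for r, n in Counter(merged).items() if n > 1)

def select_unique(merged):
    """O_u: every record on the merged list, listed once."""
    return sorted(set(merged))

def select_appearing_once(merged):
    """O_a: records that appear exactly once on the merged list."""
    return sorted(r for r, n in Counter(merged).items() if n == 1)

def and_query(a, b):
    return select_duplicated(merge(a, b))

def or_query(a, b):
    return select_unique(merge(a, b))

def and_not_query(a, b):
    both = select_duplicated(merge(a, b))          # records in a AND b
    return select_appearing_once(merge(a, both))   # records in a only

d_t1, d_t2, d_t3 = [1, 3], [1, 2], [2, 3, 4]
first_part = or_query(d_t1, d_t2)
print(merge(d_t1, d_t2), first_part)
print(and_not_query(first_part, d_t3))
```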

  37. Algorithms for Intersection

  38. Algorithms – Basic Intersection (aka Merging) • Intersect(p1, p2) • answer ← {} • While (p1 != NIL) and (p2 != NIL) Do • If docID(p1) = docID(p2) • Then ADD(answer, docID(p1)) • p1 ← next(p1) • p2 ← next(p2) • Else if (docID(p1) < docID(p2)) • Then p1 ← next(p1) • Else p2 ← next(p2) • Return answer
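A Python rendering of the Intersect pseudocode, as a sketch: it assumes posting lists are Python lists sorted by document ID, and uses list indices in place of the linked-list pointers p1 and p2.

```python
def intersect(p1, p2):
    """Intersect two posting lists that are sorted by document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same document in both lists: keep it and advance both.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the list with the smaller ID
        else:
            j += 1
    return answer

print(intersect([1, 3], [1, 2]))
print(intersect([2, 3, 4], [1, 2, 3, 4]))
```

Each loop iteration advances at least one index, so the loop runs at most len(p1) + len(p2) times.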

  39. Algorithms – Intersection • Complexity: O(x + y) • For any given two posting lists • List A has size x • List B has size y • Note, this is an upper bound. • Formally, Complexity: Θ(N) • N can be either • Number of documents in collection • Note, this is a tight bound.

  40. Observation • In many cases, Boolean queries • Conjunctive in nature • Allows for a possible improvement based on posting size (term frequency)

  41. Algorithms – Conjunctive Query Merging • IntersectConjunct(t1, t2, …, tz) • Terms ← SortByIncreasingFrequency((t1, t2, …, tz)) • Results ← postings(first(Terms)) • Terms ← rest(Terms) • While (Terms != NIL) and (Results != NIL) Do • Results ← Intersect(Results, postings(first(Terms))) • Terms ← rest(Terms) • Return Results
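A possible Python version of IntersectConjunct, sketched under the assumption that the index is a dict from term to sorted posting list and that every query term is present in it. Term frequency is taken to be the posting-list length.

```python
def intersect(p1, p2):
    """Linear merge intersection of two sorted posting lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_conjunct(terms, index):
    """AND together several terms, starting with the rarest one."""
    if not terms:
        return []
    ordered = sorted(terms, key=lambda t: len(index[t]))
    results = list(index[ordered[0]])
    for t in ordered[1:]:
        if not results:
            break  # nothing can match any more, stop early
        results = intersect(results, index[t])
    return results

index = {"t1": [1, 3], "t2": [1, 2], "t3": [2, 3, 4], "t4": [1, 2, 3, 4]}
print(intersect_conjunct(["t4", "t3", "t1"], index))
```

Processing the shortest list first keeps every intermediate result small, so later intersections touch fewer entries.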

  42. Why? • By using the least frequent term first • All results are guaranteed to be no larger than the posting list of the least frequent term • In practice • The ‘intermediate’ list always places an upper bound on the size of the result.

  43. Variations on Boolean • Extended Boolean • Has standard operations: • AND, OR and NOT • Plus • Term Proximity • Within X words, sentences, paragraphs • Wildcard Matching • Fuzzy • Allow for range • Function F no longer restricted to {0,1}

  44. Thank-you Questions?

  45. References • Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Chapter 1, 2008. • Abraham Bookstein and William Cooper, “A General Mathematical Model for Information Retrieval Systems”, The Library Quarterly, Vol. 46, no. 2, pp. 153-67. • Vijay V. Raghavan’s Notes/Lecture Material • http://www.cacs.louisiana.edu/~cmps561/561/notes/Model.pdf • Material in slides used with permission
