CMPS 561 Boolean Retrieval

CMPS 561 Boolean Retrieval Ryan Benton August 30, 2010

Agenda • Indices • IR System Models • Processing Boolean Query • Algorithms for Intersection

Indices

Indices • Question: • How do we store documents and terms such that we can retrieve documents • Efficiently • Effectively • With reasonable space requirements?

Term-Document Matrix • Create table • Rows: Terms • Columns: Document Ids • Official Name: • Term-Document Incidence Matrix • Also called: Inverted View of Collection

Term-Document Incidence Matrix

Term-Document Matrix • Why • Rows → Vectors of documents containing term X. • Columns → Vectors of terms contained by document Y. • Technically, take the transpose of the matrix to get the column vectors.
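
The row and column views described above can be sketched in a few lines of Python. The toy collection below is hypothetical (its contents simply mirror the four-term, four-record index used later in these slides), and the names `docs`, `term_doc`, and `doc_term` are illustrative assumptions rather than anything defined in the lecture.

```python
# Hypothetical toy collection: document ID -> set of terms it contains.
docs = {
    1: {"t1", "t2", "t4"},
    2: {"t2", "t3", "t4"},
    3: {"t1", "t3", "t4"},
    4: {"t3", "t4"},
}

terms = sorted({t for ts in docs.values() for t in ts})
doc_ids = sorted(docs)

# Term-document incidence matrix: one row per term, one column per document.
term_doc = [[1 if t in docs[d] else 0 for d in doc_ids] for t in terms]

# Transposing gives the document-term view: one row per document.
doc_term = [list(col) for col in zip(*term_doc)]

for t, row in zip(terms, term_doc):
    print(t, row)
```

Each row of `term_doc` is the document vector for one term; each row of `doc_term` is the term vector for one document.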

Document-Term Incidence Matrix

Term-Document Matrix • Naïve Way of Building • Create and store the Matrix • Some Calculations • 500,000 Terms • 1,000,000 Documents • ½ trillion entries: 500,000,000,000 • All 0’s and 1’s. • Memory Impact • As the document collection and/or term list grows • Can’t keep the matrix in memory

Term-Document Matrix • Observation: • Term-Document Matrix → Sparse • Typically, only a small number of terms in any given document. • If a typical document contains 1,000 terms • Matrix, in previous example, has 1 billion 1’s • 1,000,000,000 • Thus, 99.8% of the matrix entries are 0’s.
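
The sparsity figure can be checked with a few lines of arithmetic. This is only a sketch that reuses the numbers on the slides; the variable names are illustrative.

```python
# Figures taken from the slides: vocabulary size, collection size, and
# the assumed number of distinct terms in a typical document.
num_terms = 500_000
num_docs = 1_000_000
terms_per_doc = 1_000

total_entries = num_terms * num_docs        # cells in the incidence matrix
nonzero_entries = terms_per_doc * num_docs  # one 1 per (term, document) pair
zero_fraction = 1 - nonzero_entries / total_entries

print(f"total entries:   {total_entries:,}")
print(f"nonzero entries: {nonzero_entries:,}")
print(f"fraction of 0s:  {zero_fraction:.1%}")
```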

Inverted Index • Also called: Inverted File • Dictionary of Terms • Vocabulary • Lexicon • Each term • List of documents in which it appears. • Each document sometimes called a posting.

Term-Document Incidence Matrix

Inverted Index • Term 1 → Record 1, Record 3 • Term 2 → Record 1, Record 2 • Term 3 → Record 2, Record 3, Record 4 • Term 4 → Record 1, Record 2, Record 3, Record 4

Inverted Index • Note: • Dictionary sorted alphabetically • Each ‘posting list’ → sorted by ID • Storage: • Dictionary kept in memory • Postings → Depends on space. • In memory • On disk.

Inverted Index, Some Change • Term 1: 2 → Record 1, Record 3 • Term 2: 2 → Record 1, Record 2 • Term 3: 3 → Record 2, Record 3, Record 4 • Term 4: 4 → Record 1, Record 2, Record 3, Record 4
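
A minimal sketch of building such an index in Python follows. It assumes each document is already reduced to a set of terms; the function name `build_inverted_index` and the toy `docs` mapping are hypothetical, chosen so the output reproduces the four-term example above, with the number of postings printed next to each term.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    # Dictionary sorted alphabetically; each posting list sorted by ID.
    return {term: sorted(postings[term]) for term in sorted(postings)}


# Hypothetical collection matching the example index on the slides.
docs = {
    1: {"t1", "t2", "t4"},
    2: {"t2", "t3", "t4"},
    3: {"t1", "t3", "t4"},
    4: {"t3", "t4"},
}

index = build_inverted_index(docs)
for term, plist in index.items():
    print(f"{term}: {len(plist)} -> {plist}")
```

Keeping the posting count alongside each term is what later allows a conjunctive query to be processed starting from the shortest list.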

IR System Models

Model • S = (D, Q, T, V, F) • d ∈ D • q ∈ Q • F: D × Q → V • F_q: D → V • Retrieval Status Values (RSV) • T: “index terms”

Model • S = (D, Q, T, V, F) • ≤ defined over elements of V is a simple order • ≤′ defined over elements of D by F is a weak order • Breaks the elements of D into a number of subsets • The subsets are simply ordered

Subject Catalog Model • S = (D, Q, T, V, F) • T = set of subject headings • Q = T • D = 2^T • V = {0, 1} • F_q(d) where q ∈ Q, d ∈ D • 1, if q ∈ d • 0, otherwise

Coordination Level System • S = (D, Q, T, V, F) • Q = 2^T • D = 2^T • V = {0, 1} • F_q(d) where • 1, if q ⊆ d • 0, otherwise • F′_q(d) where • 1, if |q ∩ d| > k • 0, otherwise
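
The retrieval functions of the subject catalog and coordination level models translate almost directly into set operations. The sketch below assumes documents and queries are Python sets of terms; the function names and the sample document are hypothetical.

```python
def subject_catalog_rsv(q, d):
    """Subject catalog model: q is a single heading, d a set of headings."""
    return 1 if q in d else 0


def coordination_subset_rsv(q, d):
    """Coordination level model, F_q(d): 1 when every query term is in d."""
    return 1 if q <= d else 0


def coordination_level_rsv(q, d, k):
    """Coordination level model, F'_q(d): 1 when more than k terms overlap."""
    return 1 if len(q & d) > k else 0


# Hypothetical document described by three index terms.
doc = {"boolean", "index", "retrieval"}

print(subject_catalog_rsv("boolean", doc))
print(coordination_subset_rsv({"boolean", "fuzzy"}, doc))
print(coordination_level_rsv({"boolean", "fuzzy", "index"}, doc, k=1))
```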

Boolean Systems • S = (D, Q, T, V, F) • D = 2^T • Q = E • V = {0, 1} • F_q(d) where • 1, if q evaluates to True • With respect to document • 0, otherwise

What is E? • Let t ∈ T, then • t ∈ E • If e ∈ E, then • ¬e ∈ E • If e1, e2 ∈ E, then • e1 ∨ e2 ∈ E • e1 ∧ e2 ∈ E • Nothing else is in E!

Document Representation • Set of document IDs: • D = {d_a}, a = 1, 2, …, p • Set of all term IDs: • T = {t_i}, i = 1, 2, …, n

Document Representation • Relation • D = {⟨d_a, t_i, μ_D(d_a, t_i)⟩} • μ_D: D × T → {0, 1} • μ_D(d_a, t_i) • 1, if d_a contains t_i • 0, otherwise • D_t = {d_a ∈ D | μ_D(d_a, t) = 1} • d ≡ D_d = {t_i ∈ T | μ_D(d, t_i) = 1}

Retrieval Function • Retrieval Status Value (RSV) ≡ F • RSV_t(d_a) = μ_D(d_a, t) • RSV_{¬e}(d_a) = 1 − RSV_e(d_a) • RSV_{e1∧e2}(d_a) = RSV_{e1}(d_a) ∧ RSV_{e2}(d_a) • RSV_{e1∨e2}(d_a) = RSV_{e1}(d_a) ∨ RSV_{e2}(d_a)

Processing Boolean Queries

Boolean example • q = ¬(d ∨ e) ∧ (c ∨ (a ∧ b)) • [Figure: expression tree for q, with ∧ at the root and subtrees ¬(d ∨ e) and c ∨ (a ∧ b)]

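Boolean Query Example (Method 1: based on documents) • q = ¬(d ∨ e) ∧ (c ∨ (a ∧ b)) • D_{d_a} = {a, c} • [Figure: expression tree for q annotated with node values for d_a: a = 1, b = 0, c = 1, d = 0, e = 0; d ∨ e = 0; ¬(d ∨ e) = 1; a ∧ b = 0; c ∨ (a ∧ b) = 1] • RSV_q(d_a) = 1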
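
Method 1 amounts to evaluating the query's expression tree against one document at a time, following the recursive definition of RSV. The sketch below is one possible encoding, not the lecture's own: expressions are nested tuples, and the helper names `term` and `rsv` are assumptions made for illustration.

```python
def term(t):
    """Leaf node of a query expression (hypothetical tuple encoding)."""
    return ("term", t)


def rsv(expr, doc_terms):
    """Retrieval status value (0 or 1) of expr for one document's term set."""
    op = expr[0]
    if op == "term":
        return 1 if expr[1] in doc_terms else 0
    if op == "not":
        return 1 - rsv(expr[1], doc_terms)
    if op == "and":
        return rsv(expr[1], doc_terms) & rsv(expr[2], doc_terms)
    if op == "or":
        return rsv(expr[1], doc_terms) | rsv(expr[2], doc_terms)
    raise ValueError(f"unknown operator: {op!r}")


# q = NOT(d OR e) AND (c OR (a AND b)), the query from the slide.
q = (
    "and",
    ("not", ("or", term("d"), term("e"))),
    ("or", term("c"), ("and", term("a"), term("b"))),
)

print(rsv(q, {"a", "c"}))  # document d_a from the slide
```

Answering a query this way requires visiting every document in the collection, which is the motivation for the inverted-list approach of Method 2.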

Boolean Query Example (Method 2: based on inverted lists) • D_{t1∧t2} = {d_a ∈ D | d_a ∈ D_t1 ∧ d_a ∈ D_t2} • D_{t1∨t2} = {d_a ∈ D | d_a ∈ D_t1 ∨ d_a ∈ D_t2} • D_t = set of documents containing term t • T = {a, b, c, d, e} • D_a, D_b, D_c, D_d, D_e

Boolean Query Example (Method 2) • Input • a ∧ b • Output • D_a ∩ D_b

Processing Boolean Query (Method 2) • Query → Output • t → D_t • e1 → D_e1 • e2 → D_e2 • e1 ∧ e2 → D_{e1∧e2} • e1 ∨ e2 → D_{e1∨e2} • ¬e1 → D \ D_e1
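
Method 2 can be sketched with Python sets: each sub-expression is mapped to the set of documents that satisfy it, so AND, OR, and NOT become intersection, union, and complement with respect to the whole collection. The tuple encoding of queries, the function name `doc_set`, and the posting data below are all hypothetical.

```python
def doc_set(expr, postings, all_docs):
    """Return the set of document IDs satisfying a Boolean expression."""
    op = expr[0]
    if op == "term":
        return set(postings.get(expr[1], ()))
    if op == "not":
        return all_docs - doc_set(expr[1], postings, all_docs)
    left = doc_set(expr[1], postings, all_docs)
    right = doc_set(expr[2], postings, all_docs)
    if op == "and":
        return left & right
    if op == "or":
        return left | right
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical inverted lists for T = {a, b, c, d, e} over four documents.
postings = {"a": [1, 2], "b": [2], "c": [1, 3], "d": [3], "e": [4]}
all_docs = {1, 2, 3, 4}

# q = NOT(d OR e) AND (c OR (a AND b))
q = (
    "and",
    ("not", ("or", ("term", "d"), ("term", "e"))),
    ("or", ("term", "c"), ("and", ("term", "a"), ("term", "b"))),
)

print(sorted(doc_set(q, postings, all_docs)))
```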

Boolean Queries (Method 2) • and-queries (t_i ∧ t_j) • Construct a merged list M for D_ti and D_tj. • Transfer all duplicated records O_d on the merged list to output. • or-queries (t_i ∨ t_j) • Construct a merged list M for D_ti and D_tj. • Transfer all unique records O_u on the merged list to output.

Boolean Queries (Method 2) • not-queries (t_i ∧ ¬t_j) • Construct a merged list M for D_ti and D_tj (FIRST_List). • Remove all items appearing only once from FIRST_List. • Transfer the remaining items to output (i.e., O_d). • Create a merged list composed of O_d and D_ti (SECOND_List). • Remove all items appearing more than once from SECOND_List. • Transfer the remaining items to output (i.e., O_a).

Reminder - Inverted Index • Term 1 → Record 1, Record 3 • Term 2 → Record 1, Record 2 • Term 3 → Record 2, Record 3, Record 4 • Term 4 → Record 1, Record 2, Record 3, Record 4

Example Query • ((t1 ∨ t2) ∧ ¬t3) • Let’s do the first part (t1 ∨ t2) • D_t1: {R1, R3} • D_t2: {R1, R2} • M(D_t1, D_t2): {R1, R1, R2, R3} • O_u(t1 ∨ t2): {R1, R2, R3} • M → Merging Operation, O → Output Selection

Example Query (cont’d) • Now, let’s handle the second part • ((t1 ∨ t2) ∧ ¬t3) • D_t3: {R2, R3, R4} • M(D_{t1∨t2}, D_t3): {R1, R2, R2, R3, R3, R4} • O_d((t1 ∨ t2) ∧ t3): {R2, R3} • M(D_{t1∨t2}, D_{(t1∨t2)∧t3}): {R1, R2, R2, R3, R3} • O_a((t1 ∨ t2) ∧ ¬t3): {R1}
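
The merge-list procedure can be traced in Python. This sketch assumes each posting list is sorted and free of duplicates, represents records R1 to R4 as the integers 1 to 4, and uses hypothetical function names; `heapq.merge` from the standard library performs the merge of two sorted lists.

```python
from collections import Counter
from heapq import merge


def merged(a, b):
    """M: merge two sorted posting lists, keeping duplicates."""
    return list(merge(a, b))


def or_query(a, b):
    """O_u: every distinct record on the merged list."""
    return list(Counter(merged(a, b)))


def and_query(a, b):
    """O_d: records that appear more than once on the merged list."""
    counts = Counter(merged(a, b))
    return [r for r, n in counts.items() if n > 1]


def and_not_query(a, b):
    """O_a: records of a that are not in b, found with two merges."""
    o_d = and_query(a, b)             # first merge: a AND b
    counts = Counter(merged(a, o_d))  # second merge: a with (a AND b)
    return [r for r, n in counts.items() if n == 1]


d_t1, d_t2, d_t3 = [1, 3], [1, 2], [2, 3, 4]

t1_or_t2 = or_query(d_t1, d_t2)
result = and_not_query(t1_or_t2, d_t3)
print(t1_or_t2, result)
```

Because `Counter` keeps keys in first-seen order and the merged list is sorted, each output list stays sorted by record ID.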

Algorithms for Intersection

Algorithms – Basic Intersection (aka Merging) • Intersect(p1, p2) • answer ← {} • While (p1 != NIL) and (p2 != NIL) Do • If docID(p1) = docID(p2) • Then ADD(answer, docID(p1)) • p1 ← next(p1) • p2 ← next(p2) • Else if (docID(p1) < docID(p2)) • Then p1 ← next(p1) • Else p2 ← next(p2) • Return answer
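
A direct Python rendering of the Intersect pseudocode is sketched below. It assumes posting lists are Python lists of document IDs sorted in increasing order, and it uses list indices in place of the linked-list pointers p1 and p2; the sample lists are made up for illustration.

```python
def intersect(p1, p2):
    """Intersect two posting lists sorted by document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same document in both lists: keep it and advance both.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller ID
        else:
            j += 1
    return answer


print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))
```

Each loop iteration advances at least one index, so the number of iterations is bounded by the combined length of the two lists.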

Algorithms – Intersection • Complexity: O(x + y) • For any two given posting lists • List A has size x • List B has size y • Note, this is an upper bound. • Formally, Complexity: Θ(N) • N can be the number of documents in the collection • Note, this is a tight bound.

Observation • In many cases, Boolean queries • Conjunctive in nature • Allows for a possible improvement based on posting size (term frequency)

Algorithms – Conjunctive Query Merging • IntersectConjunct(t1, t2, …, tz) • Terms ← SortByIncreasingFrequency((t1, t2, …, tz)) • Results ← postings(first(Terms)) • Terms ← rest(Terms) • While (Terms != NIL) and (Results != NIL) Do • Results ← Intersect(Results, postings(first(Terms))) • Terms ← rest(Terms) • Return Results
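
The conjunctive procedure can be sketched in Python as follows. The sketch assumes the index is a dictionary from term to sorted posting list, uses the posting-list length as the term frequency, and repeats a basic `intersect` helper so the block stands alone; the function names and the sample index (taken from the four-term example in these slides) are illustrative.

```python
def intersect(p1, p2):
    """Intersect two posting lists sorted by document ID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_conjunct(terms, index):
    """AND together all terms, starting from the shortest posting list."""
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    if not ordered:
        return []
    results = list(index.get(ordered[0], []))
    for t in ordered[1:]:
        if not results:
            break  # an empty intermediate result can never grow again
        results = intersect(results, index.get(t, []))
    return results


index = {"t1": [1, 3], "t2": [1, 2], "t3": [2, 3, 4], "t4": [1, 2, 3, 4]}
print(intersect_conjunct(["t4", "t3", "t1"], index))
```

Stopping as soon as the intermediate result is empty mirrors the `Results != NIL` test in the loop condition of the pseudocode.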

Why? • By starting with the least frequent term • All results are guaranteed to be no larger than the posting list of the least frequent term • In practice • The ‘intermediate’ list always places an upper bound on the size of the result.

Variations on Boolean • Extended Boolean • Has standard operations: • AND, OR and NOT • Plus • Term Proximity • Within X words, sentences, paragraphs • Wildcard Matching • Fuzzy • Allow for range • Function F no longer restricted to {0,1}

Thank-you Questions?

References • Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Chapter 1, 2008. • Abraham Bookstein and William Cooper, “A General Mathematical Model for Information Retrieval Systems”, The Library Quarterly, Vol. 26, no. 2, pp. 153-67. • Vijay V. Raghavan’s Notes/Lecture Material • http://www.cacs.louisiana.edu/~cmps561/561/notes/Model.pdf • Material in slides used with permission
