110 likes | 125 Views
Explore how the extended Boolean model introduces ranking using partial matching, term weighting, and combines Vector model features with Boolean algebra properties. Generalize the model to consider Euclidean distances in t-dimensional space through p-norms.
E N D
Advanced Information Retreival Chapter. 02: Modeling (Extended Boolean Model)
Extended Boolean Model • Boolean model is simple and elegant. • But, no provision for a ranking • As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership • Extend the Boolean model with the notions of partial matching and term weighting • Combine characteristics of the Vector model with properties of Boolean algebra
The Idea • The extended Boolean model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra • Let, • q = kx ky • wxj = fxj * idf(x) associated with [kx,dj] max(idf(i)) • Further, wxj = x and wyj = y
The Idea: • qand = kx ky; wxj = x and wyj = y AND 2 2 sim(qand,dj) = 1 - sqrt( (1-x) + (1-y) ) 2 (1,1) ky dj+1 y = wyj dj (0,0) x = wxj kx
The Idea: • qor = kx ky; wxj = x and wyj = y OR 2 2 sim(qor,dj) = sqrt( x + y ) 2 (1,1) ky dj+1 dj y = wyj (0,0) x = wxj kx
Generalizing the Idea • We can extend the previous model to consider Euclidean distances in a t-dimensional space • This can be done using p-norms which extend the notion of distance to include p-distances, where 1 p is a new parameter • A generalized conjunctive query is given by • qor = k1 k2 . . . kt • A generalized disjunctive query is given by • qand = k1 k2 . . . kt p p p p p p
Generalizing the Idea 1 p p p p • sim(qor,dj) = (x1 + x2 + . . . + xm ) m 1 p • sim(qand,dj) = 1 - ((1-x1) + (1-x2) + . . . + (1-xm) ) m p p p
Properties 1 p p p p • sim(qor,dj) = (x1 + x2 + . . . + xm ) m • If p = 1 then (Vector like) • sim(qor,dj) = sim(qand,dj) = x1 + . . . +xm m • If p = then (Fuzzy like) • sim(qor,dj) = max (wxj) • sim(qand,dj) = min (wxj)
Properties • By varying p, we can make the model behave as a vector, as a fuzzy, or as an intermediary model • This is quite powerful and is a good argument in favor of the extended Boolean model • (k1 k2) k3 • k1 and k2 are to be used as in a vectorial retrieval while the presence of k3 is required. 2
Properties • q = (k1 k2) k3 • sim(q,dj) = ( (1 - ( (1-x1) + (1-x2) ) ) + x3 ) 2__________________ 2 2 1 1 p p p p p
Conclusions • Model is quite powerful • Properties are interesting and might be useful • Computation is somewhat complex • However, distributivity operation does not hold for ranking computation: • q1 = (k1 k2) k3 • q2 = (k1 k3) (k2 k3) • sim(q1,dj) sim(q2,dj)