500 likes | 1.16k Views
The Boolean Model. Simple model based on set theory Queries specified as boolean expressions precise semantics neat formalism q = k a (k b k c ) Terms are either present or absent. Thus, w ij {0,1} Consider q = k a (k b k c )
E N D
The Boolean Model • Simple model based on set theory • Queries specified as boolean expressions • precise semantics • neat formalism • q = ka (kb kc) • Terms are either present or absent. Thus, wij {0,1} • Consider • q = ka (kb kc) • vec(qdnf) = (1,1,1) (1,1,0) (1,0,0) • vec(qcc) = (1,1,0) is a conjunctive component • Each query can be transformed in DNF form
The Boolean Model Ka Kb • q = ka (kb kc) • sim(q,dj) = 1, if document satisfies the boolean query • 0 otherwise • - no in-between, only 0 or 1 (1,1,0) (1,0,0) (1,1,1) Kc
Exercise D1 = “computer information retrieval” D2 = “computer retrieval” D3 = “information” D4 = “computer information” Q1 = “information retrieval” Q2 = “information ¬computer”
Exercise กำหนด Index term ของแต่ละเอกสาร D1 = {love, need, person, possess, understand} D2 = {heart, listen, love, practice, suffer} D3 = {compassion, love, mind, person, practice} D4 = {death, health, languor, life, suffer} D5 = {energy, love, nourish, practice, teach} Q = {love ^ suffer}
Drawbacks of the Boolean Model • Retrieval based on binary decision criteria with no notion of partial matching • No ranking of the documents is provided (absence of a grading scale) • Information need has to be translated into a Boolean expression which most users find awkward • The Boolean queries formulated by the users are most often too simplistic • As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
Drawbacks of the Boolean Model • The Boolean model imposes a binary criterion for deciding relevance • The question of how to extend the Boolean model to accomodate partial matching and a ranking has attracted considerable attention in the past • Two extensions of boolean model: • Fuzzy Set Model • Extended Boolean Model
Algebraic Set Theoretic Generalized Vector Lat. Semantic Index Neural Networks Structured Models Fuzzy Extended Boolean Non-Overlapping Lists Proximal Nodes Classic Models Probabilistic Boolean Vector Probabilistic Inference Network Belief Network Browsing Flat Structure Guided Hypertext IR Models U s e r T a s k Retrieval: Adhoc Filtering Browsing
Set Theoretic Models • The Boolean model imposes a binary criterion for deciding relevance • The question of how to extend the Boolean model to accomodate partial matching and a ranking has attracted considerable attention in the past • Two set theoretic models for this: • Fuzzy Set Model • Extended Boolean Model
Fuzzy Set Model • Queries and docs represented by sets of index terms: matching is approximate from the start • This vagueness can be modeled using a fuzzy framework, as follows: • with each term is associated a fuzzy set • each doc has a degree of membership in this fuzzy set • This interpretation provides the foundation for many models for IR based on fuzzy theory • In here, we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)
Fuzzy Set Theory • Framework for representing classes whose boundaries are not well defined • Key idea is to introduce the notion of a degree of membership associated with the elements of a set • This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership • Thus, membership is now a gradual notion, contrary to the crispy notion enforced by classic Boolean logic
Fuzzy Set Theory • Model • A query term: a fuzzy set • A document: degree of membership in this test • Membership function • Associate membership function with the elements of the class • 0: no membership in the test • 1: full membership • 0 ~1: marginal elements of the test documents
Fuzzy Set Theory for query term a class • A fuzzy subsetA of a universe of discourse U is characterized by a membership function µA: U[0,1] which associates with each element u of U a number µA(u) in the interval [0,1] • complement: • union: • intersection: document collection
Examples • Assume U={d1, d2, d3, d4, d5, d6} • Let A and B be {d1, d2, d3} and {d2, d3, d4}, respectively. • Assume A={d1:0.8, d2:0.7, d3:0.6, d4:0, d5:0, d6:0} and B={d1:0, d2:0.6, d3:0.8, d4:0.9, d5:0, d6:0} • = {d1:0.2, d2:0.3, d3:0.4, d4:1, d5:1, d6:1} • = {d1:0.8, d2:0.7, d3:0.8, d4:0.9, d5:0, d6:0} • ={d1:0, d2:0.6, d3:0.6, d4:0, d5:0, d6:0}
Fuzzy Information Retrieval • basic idea • Expand the set of index terms in the query with related terms (from the thesaurus) such that additional relevant documents can be retrieved • A thesaurus can be constructed by defining a term-term correlation matrix c whose rows and columns are associated to the index terms in the document collection keyword connection matrix
Fuzzy Information Retrieval(Continued) • normalized correlation factor ci,l between two terms ki and kl (0~1) • In the fuzzy set associated to each index term ki, a document dj has a degree of membership µi,j ni is # of documents containing term ki where nl is # of documents containing term kl ni,l is # of documents containing ki and kl
Fuzzy Information Retrieval(Continued) • physical meaning • A document dj belongs to the fuzzy set associated to the term ki if its own terms are related to ki, i.e., i,j=1. • If there is at least one index term kl of dj which is strongly related to the index ki, then i,j1. ki is a good fuzzy index • When all index terms of dj are only loosely related to ki, i,j0. ki is not a good fuzzy index
Example q = (ka (kb kc))= (ka kb kc) (ka kb kc) (ka kb kc)= cc1+cc2+cc3 Da: the fuzzy set of documents associated to the index ka cc2 cc3 Da djDa has a degree of membership a,j > a predefined threshold K cc1 Db Da: the fuzzy set of documents associated to the index ka (the negation of index term ka) Dc
Example Query q=ka (kb kc) disjunctive normal form qdnf=(1,1,1) (1,1,0) (1,0,0) (1) the degree of membership in a disjunctive fuzzy set is computed using an algebraic sum (instead of max function) more smoothly (2) the degree of membership in a conjunctive fuzzy set is computed using an algebraic product (instead of min function) Recall
Fuzzy Set Model • Q: “gold silver truck”D1: “Shipment of gold damaged in a fire”D2: “Delivery of silver arrived in a silver truck”D3: “Shipment of gold arrived in a truck” • IDF (Select Keywords) • a = in = of = 0 = log 3/3arrived = gold = shipment = truck = 0.176 = log 3/2damaged = delivery = fire = silver = 0.477 = log 3/1 • 8 Keywords (Dimensions) are selected • arrived(1), damaged(2), delivery(3), fire(4), gold(5), silver(6), shipment(7), truck(8)
Fuzzy Set Model • Sim(q,d): Alternative 1 Sim(q,d3) > Sim(q,d2) > Sim(q,d1) • Sim(q,d): Alternative 2 Sim(q,d3) > Sim(q,d2) > Sim(q,d1)
Extended Boolean Model • Disadvantages of “Boolean Model” : • No term weight is used • Counterexample: query q=Kx AND Ky. Documents containing just one term, e,g, Kx is considered as irrelevant as another document containing none of these terms. • The size of the output might be too large or too small
Extended Boolean Model • The Extended Boolean model was introduced in 1983 by Salton, Fox, and Wu • The idea is to make use of term weight as vector space model. • Strategy: Combine Boolean query with vector space model. • Why not just use Vector Space Model? • Advantages: It is easy for user to provide query.
Extended Boolean Model • Each document is represented by a vector (similar to vector space model.) • Remember the formula. • Query is in terms of Boolean formula. • How to rank the documents?
Fig. Extended Boolean logicconsidering the space composed of two terms kx and ky only. ky ky kx kx
Extended Boolean Model • For query q=Kx or Ky, (0,0) is the point we try to avoid. Thus, we can use to rank the documents • The bigger the better.
Extended Boolean Model • For query q=Kx and Ky, (1,1) is the most desirable point. • We use to rank the documents. • The bigger the better.
Extend the idea to m terms • qor=k1 p k2 p … p Km • qand=k1 p k2 p … p km
Properties • The p norm as defined above enjoys a couple of interesting properties as follows. First, when p=1 it can be verified that • Second, when p= it can be verified that • Sim(qor,dj)=max(xi) • Sim(qand,dj)=min(xi)
Example • For instance, consider the query q=(k1 k2) k3. The similarity sim(q,dj) between a document dj and this query is then computed as • Any boolean can be expressed as a numeral formula.