This lecture covers non-binary independence models, term relationships in indexing, and fuzzy information retrieval.
CS533 Information Retrieval Dr. Michal Cutler Lecture #8 February 17, 1999
This lecture • Non-binary independence probabilistic models • Term relationship in indexing • Fuzzy information retrieval
The non-binary independence model • Yu, Meng and Park • wi depends on the term frequency di and on how the frequency of occurrence is distributed in the sets of relevant and nonrelevant documents
The non-binary independence model • When di = 0, wi(di = 0) is not necessarily 0 • We can modify the weights so that wi’ = wi - wi(di = 0) • This does not change the relevance order and makes the computation more efficient
The weight of a term - example • Assume: R = 6. N=14. • 3 relevant documents have di = 2, 2 have di = 1, and one has di = 0 • One nonrelevant document with di = 2, one with di = 1 and 6 with di = 0
The weight of a term - example • Let p2, p1, p0 denote the probabilities that relevant documents have 2, 1, 0 occurrences of a term • Let q2, q1, q0 denote the probabilities that nonrelevant documents have 2, 1, 0 occurrences of a term
The weight of a term - example • p2 = 3/6, p1 = 2/6, and p0 = 1/6 • q2 = 1/8, q1 = 1/8, and q0 = 6/8 • w2 = log((3/6)/(1/8)) = log 4 • w1 = log((2/6)/(1/8)) = log(8/3) • w0 = log((1/6)/(6/8)) = log(2/9) • w2’ = w2 - w0 = log 18, w1’ = w1 - w0 = log 12, w0’ = 0
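The arithmetic in this example can be checked with a short Python sketch. The function name `term_weights` and the dictionary layout are illustrative assumptions, not part of the lecture; the sketch applies no smoothing, so it assumes every occurrence count is observed at least once in both the relevant and the nonrelevant set.

```python
import math

def term_weights(rel_counts, nonrel_counts):
    """Return (w, w_shifted) for one term.

    rel_counts[k] / nonrel_counts[k] is the number of relevant /
    nonrelevant documents in which the term occurs k times.
    Assumes all counts are nonzero (no smoothing is applied).
    """
    R = sum(rel_counts.values())        # number of relevant documents
    NR = sum(nonrel_counts.values())    # number of nonrelevant documents
    # w_k = log(p_k / q_k)
    w = {k: math.log((rel_counts[k] / R) / (nonrel_counts[k] / NR))
         for k in rel_counts}
    # w_k' = w_k - w_0, so that an absent term contributes 0
    w_shifted = {k: w[k] - w[0] for k in w}
    return w, w_shifted

# Data from the example: R = 6, N = 14 (so 8 nonrelevant documents)
rel = {2: 3, 1: 2, 0: 1}
nonrel = {2: 1, 1: 1, 0: 6}
w, w_shifted = term_weights(rel, nonrel)
for k in (2, 1, 0):
    print(k, round(math.exp(w[k]), 4), round(math.exp(w_shifted[k]), 4))
```

Exponentiating the weights recovers the ratios inside the logarithms, which makes the comparison with log 4, log(8/3), log(2/9), log 18 and log 12 direct.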
Term relationship in indexing • Assume that terms do not occur in text independently
Fuzzy Boolean Models • Limitations of the Boolean model • Introduction to fuzzy sets • Fuzzy models • basic • MMM • Paice • p-norm
Boolean model limitations 1. AND query Given the query: • “fuzzy” AND “logic” AND “approximate” AND “reasoning” AND “possibility” AND “theory”, • D is not retrieved when indexed by all the terms except “possibility”
Boolean model limitations 2. OR query Given for example the query: • (“fuzzy” AND “logic”) OR (“approximate” AND “reasoning”), • D1 indexed by all the terms • D2 indexed only by “fuzzy” and “logic” • D1, D2 retrieved in arbitrary order
Boolean model limitations 3. Query term importance • Searchers can rate term importance • If query term A is more important than term B, D1 with only term A should rank higher than D2 that contains only B
Boolean model limitations • During Boolean indexing a term is either chosen to represent a document or not. • We would like to be able to represent the importance of a term to a document.
Introduction to fuzzy sets • We discuss • the difference between conventional (crisp) sets and fuzzy sets • fuzzy set operations
Crisp sets • Sets in which an object is either a member of a set or not are called crisp sets • In fuzzy sets an item may be a partial member of a set • Each object in the universe can be partially compatible with some attribute
Limitations of crisp sets • To decide membership in sets TALL, OLD, or WEALTHY, we need a threshold • $1,000,000 for a wealthy person • So a person with $999,999 is not wealthy (poor) • With fuzzy sets a person with $500,000 belongs to WEALTHY with some degree of membership
Fuzzy sets • The degree of membership of x in fuzzy set A is given by μA : X → [0,1] • where X is the universal set and • [0,1] denotes the interval of real numbers from 0 to 1
Example • The discrete fuzzy set TALL • Let the universe of heights be U={4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8} • TALL={0/4.5, 0.2/5, 0.5/5.5, 0.7/6, 1/6.5, 1/7, 1/7.5, 1/8} • The first number in each pair is the membership degree
Example • In reality height is a continuous variable • The next transparency describes the fuzzy set TALL as a continuous function
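A membership function • [Figure: membership function μTALL (vertical axis, ticks at 0, 0.5, 0.7, 1.0) plotted against height in feet (horizontal axis, ticks at 0, 4.5, 5.5, 6, 6.5)]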
Fuzzy set operations • Set operations can be defined in a variety of ways. • Most common ones are: • The membership function of A∩B is: μA∩B(x) = min{μA(x), μB(x)} or μA∩B(x) = μA(x)μB(x) for all x∈X
Fuzzy set operations • Usually fuzzy operations are compatible with crisp set operations • If A and B are crisp, μA∩B(x) = 1 iff x∈A∩B and μA∩B(x) = 0 iff x∉A∩B • This is satisfied by both definitions
Fuzzy set operations • The membership function of A∪B is: μA∪B(x) = max{μA(x), μB(x)} or μA∪B(x) = μA(x) + μB(x) - μA(x)μB(x) • If A and B are crisp, μA∪B(x) = 1 iff x∈A∪B and μA∪B(x) = 0 iff x∉A∪B
Fuzzy set operations • The membership function of the complement A’ is: μA’(x) = 1 - μA(x) • For crisp A, μA’(x) = 1 iff x∉A and μA’(x) = 0 iff x∈A
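The operations on these slides can be written as a few one-line Python functions, together with a check that each definition reduces to the ordinary set operation when memberships are restricted to 0 and 1. The function names are illustrative assumptions, not names used in the lecture.

```python
def and_min(a, b):
    """Intersection, min definition."""
    return min(a, b)

def and_product(a, b):
    """Intersection, product definition."""
    return a * b

def or_max(a, b):
    """Union, max definition."""
    return max(a, b)

def or_prob_sum(a, b):
    """Union, a + b - a*b definition."""
    return a + b - a * b

def complement(a):
    """Complement: 1 - membership."""
    return 1 - a

def crisp_compatible():
    """Check all definitions against Boolean logic on {0, 1}."""
    for a in (0, 1):
        if complement(a) != int(not a):
            return False
        for b in (0, 1):
            if and_min(a, b) != (a & b) or and_product(a, b) != (a & b):
                return False
            if or_max(a, b) != (a | b) or or_prob_sum(a, b) != (a | b):
                return False
    return True

print(crisp_compatible())
# Partial memberships: the two definitions of each operation differ
print(and_min(0.7, 0.5), and_product(0.7, 0.5))
print(or_max(0.7, 0.5), or_prob_sum(0.7, 0.5))
```

For partial memberships the product never exceeds the minimum and the second union definition is never below the maximum, so the choice of definition changes the resulting degrees but not the crisp behaviour.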
Information retrieval • A document D is represented by a weight vector (w1,…,wt) where wi = μTi(D) is the “degree of membership” of D in the fuzzy set for concept Ti • POLITICS={μpolitics(D1)/D1, μpolitics(D2)/D2, …, μpolitics(DN)/DN}
Information retrieval • User can specify a fuzzy value for each query term • To calculate fuzzy weights for document terms use statistical measures tf, idf, normalization, etc
Basic fuzzy Boolean model • The query (Ti AND Tj) is computed for document D by min(wi, wj), • (Ti OR Tj) is computed by max(wi, wj), • (NOT Ti) is computed by 1-wi
Retrieval examples • D1: elephant/0.8 + mammals/0.5 + Asia/0.2 + ... • D2: elephant/0.3 + mammals/0.5 + Asia/0.3 + ... • Q1= elephants • D1 similarity 0.8 • D2 similarity 0.3
Basic fuzzy Boolean model • The model does not solve the first three limitations of Boolean retrieval. • A document will not be retrieved for an AND query if one term has 0 weight • Single value dependency: the result depends only on the minimum (AND) or the maximum (OR) term weight
Retrieval examples • D1: elephant/1 + Asia/0.2 + ... • D2: elephant/0.2 + Asia/0.2 + ... • Q2= elephants AND Asia • D1 retrieved with min(1, 0.2)=0.2. • D2 retrieved with min(0.2,0.2)=0.2 • D1 is the better match, but both documents get the same value
Basic fuzzy Boolean model • A document with all OR query terms may be retrieved with a smaller weight than a document that contains only one query term • User’s subjective value of query terms ignored
Retrieval examples • D1: elephant/0.8 + hunting/0.1 + ... • D2: elephant/0.7 + hunting/0.7 + ... • Q3= elephants OR hunting • D1 retrieved with max(0.8, 0.1)=0.8, and • D2 with max(0.7, 0.7)=0.7 • D2 is the better match, but D1 gets the higher value
Retrieval examples • D1: mammals/0.5+Asia/0.2+... • D2: mammals/0.51+Asia/0.49+... • Q4 = (mammals AND NOT Asia) • D1 min(0.5, 1-0.2) = 0.5 • D2 min(0.51, 1-0.49) = 0.51
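The min/max/complement rules of the basic fuzzy Boolean model can be sketched as a small recursive evaluator and run on the example documents. The nested-tuple query representation and the name `evaluate` are assumptions made for illustration, and the sketch assumes the query term "elephants" has already been normalized to the index term "elephant".

```python
def evaluate(query, doc):
    """Evaluate a fuzzy Boolean query against a document.

    query: a term (str) or a tuple ("AND" | "OR" | "NOT", operand, ...).
    doc:   dict mapping index terms to weights in [0, 1];
           terms not in the dict have weight 0.
    """
    if isinstance(query, str):
        return doc.get(query, 0.0)
    op, *args = query
    if op == "AND":
        return min(evaluate(a, doc) for a in args)
    if op == "OR":
        return max(evaluate(a, doc) for a in args)
    if op == "NOT":
        return 1.0 - evaluate(args[0], doc)
    raise ValueError(f"unknown operator: {op}")

# Q2 = elephants AND Asia: both documents get the same value
q2 = ("AND", "elephant", "Asia")
print(evaluate(q2, {"elephant": 1.0, "Asia": 0.2}),
      evaluate(q2, {"elephant": 0.2, "Asia": 0.2}))

# Q3 = elephants OR hunting: the document with one strong term wins
q3 = ("OR", "elephant", "hunting")
print(evaluate(q3, {"elephant": 0.8, "hunting": 0.1}),
      evaluate(q3, {"elephant": 0.7, "hunting": 0.7}))

# Q4 = mammals AND NOT Asia
q4 = ("AND", "mammals", ("NOT", "Asia"))
print(evaluate(q4, {"mammals": 0.5, "Asia": 0.2}),
      evaluate(q4, {"mammals": 0.51, "Asia": 0.49}))
```

Because only the smallest (AND) or largest (OR) weight survives each step, the other term weights have no influence on the result, which is what the Q2 and Q3 examples illustrate.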
Mixed min and max model • The MMM model (Fox) takes into account the maximum value for an AND query and the minimum for an OR query. • Deals with missing term limitation of Boolean
Mixed min and max model • AND or OR query • QAND=(A1 AND A2 AND … AND An) • SIM(QAND, D)= CAND1*min(wA1, wA2,…, wAn)+ CAND2*max(wA1, wA2,…, wAn) • CAND1 > CAND2 and CAND1 + CAND2 =1
Mixed min and max model • QOR =(A1 OR A2 OR … OR An) • SIM(QOR ,D)= COR1*max(wA1, wA2,…, wAn)+ COR2*min(wA1, wA2,…, wAn) • COR1 > COR2 and COR1 + COR2 =1
Mixed min and max model • A document missing a term of an AND query is still retrieved, with a value that depends on the maximum. • Similarly, the value of an OR query is reduced if query terms are missing
Retrieval example • D1 fuzzy/0.8+logic/0.2+sets/0.2+... • D2 fuzzy/0.8+logic/0.7+sets/0.2+... • D3 fuzzy/0.8+logic/0.7+sets/0+... • CAND1=0.6, CAND2=0.4 • Q3= fuzzy AND logic AND sets • D1 and D2 same rank 0.6*0.2+0.4*0.8 = 0.44 (although D2 is better) • D3 rank is 0.6*0+0.4*0.8 = 0.32
Retrieval example • D1 fuzzy/0.8+logic/0.2+sets/0.2+... • D2 fuzzy/0.8+logic/0.7+sets/0.2+... • D3 fuzzy/0.8+logic/0.7+sets/0+... • COR1=0.6, COR2=0.4 • Q4= fuzzy OR logic OR sets • D1 and D2 same rank 0.6*0.8+0.4*0.2 = 0.56 • D3 rank is 0.6*0.8+0.4*0 = 0.48
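The two MMM formulas are easy to check numerically. The sketch below uses illustrative function names (`mmm_and`, `mmm_or`) and takes the second coefficient as one minus the first, following the constraint that the coefficients sum to 1; the default of 0.6 is the value used in the examples, not a recommended setting.

```python
def mmm_and(weights, c_and1=0.6):
    """SIM(QAND, D) = CAND1 * min(weights) + CAND2 * max(weights)."""
    c_and2 = 1.0 - c_and1
    return c_and1 * min(weights) + c_and2 * max(weights)

def mmm_or(weights, c_or1=0.6):
    """SIM(QOR, D) = COR1 * max(weights) + COR2 * min(weights)."""
    c_or2 = 1.0 - c_or1
    return c_or1 * max(weights) + c_or2 * min(weights)

# Document weights for the query terms (fuzzy, logic, sets)
docs = {
    "D1": [0.8, 0.2, 0.2],
    "D2": [0.8, 0.7, 0.2],
    "D3": [0.8, 0.7, 0.0],
}
for name, weights in docs.items():
    print(name, round(mmm_and(weights), 2), round(mmm_or(weights), 2))
```

Only the smallest and largest weights enter the formulas, so D1 and D2 cannot be told apart even though they differ in the weight of "logic".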
Paice model • Improves on both the basic model and the MMM model • Takes into account all query terms • AND or OR queries. • QAND=(A1 AND A2 AND … AND An), • QOR =(A1 OR A2 OR … OR An)
Paice model • Values are sorted in ascending order for AND queries • Descending order for OR queries. • Slower • sorting query terms (O(nlogn)) • and computing exponents
Paice model • When r = 1, Sim(Q, D) is the average of the term weights • When r < 1, Sim(Q, D) is determined mainly by the terms with low exponents (those that come first in the sorted order)
Paice Model • Experiments determined: • r=1 for AND queries (average) • r=0.7 for OR queries • No query weights.
Retrieval • D1 fuzzy/0.8+logic/0.2+sets/0.2+... • D2 fuzzy/0.8+logic/0.7+sets/0.2+... • D3 fuzzy/0.8+logic/0.7+sets/0+... • Q3= fuzzy AND logic AND sets (r=1) • D1 (0.8+0.2+0.2)/3 = 0.4 • D2 (0.8+0.7+0.2)/3 ≈ 0.57 • D3 (0.8+0.7+0)/3 = 0.5
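The similarity formula itself does not appear in the text above. The sketch below assumes the commonly cited form of the Paice model, Sim(Q, D) = (sum of r^(i-1) * wi) / (sum of r^(i-1)) for i = 1..n, with the weights wi sorted in ascending order for AND queries and descending order for OR queries. The function name `paice_sim` is illustrative.

```python
def paice_sim(weights, r, op):
    """Paice-style similarity under the assumed formula.

    weights: document weights of the query terms.
    r:       coefficient in (0, 1]; r = 1 gives the plain average.
    op:      "AND" (ascending sort) or "OR" (descending sort).
    """
    if op not in ("AND", "OR"):
        raise ValueError("op must be 'AND' or 'OR'")
    ordered = sorted(weights, reverse=(op == "OR"))
    coeffs = [r ** i for i in range(len(ordered))]   # r^0, r^1, ...
    return sum(c * w for c, w in zip(coeffs, ordered)) / sum(coeffs)

# AND query with r = 1 (the average), as in the example
docs = {
    "D1": [0.8, 0.2, 0.2],
    "D2": [0.8, 0.7, 0.2],
    "D3": [0.8, 0.7, 0.0],
}
for name, weights in docs.items():
    print(name, round(paice_sim(weights, 1.0, "AND"), 2))
```

Unlike MMM, every term weight contributes, so D2 now ranks above D1. With r < 1 the first weights in the sorted order receive the largest coefficients, which is why low AND weights and high OR weights dominate.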
Retrieval • D1 fuzzy/0.8+logic/0.2+sets/0.2+... • D2 fuzzy/0.8+logic/0.7+sets/0.2+... • D3 fuzzy/0.8+logic/0.7+sets/0+... • Q4= fuzzy OR logic OR sets • D1:
The P-norm model • Salton and Fox • Weights for query and document terms • Very good retrieval results • Drawback: computation time • http://www.individual.com/
Two Term Queries • OR query • If both query terms have 0 weights do not retrieve D • Similarity is the distance of the document vector from (0,0)
Two Term Queries • AND query • If both query terms have high weights (close to 1) retrieve • Similarity is 1 minus the distance of the document vector from (1,1)
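The geometric idea for two-term queries can be sketched as follows. The sketch makes several assumptions that the text does not state: Euclidean distance (p = 2), equal query term weights of 1, and division by sqrt(2) so that similarities stay in [0, 1]. The function names are illustrative.

```python
import math

def or_sim(w1, w2):
    """Normalized Euclidean distance of (w1, w2) from (0, 0)."""
    return math.sqrt((w1 ** 2 + w2 ** 2) / 2)

def and_sim(w1, w2):
    """1 minus the normalized Euclidean distance of (w1, w2) from (1, 1)."""
    return 1 - math.sqrt(((1 - w1) ** 2 + (1 - w2) ** 2) / 2)

# Corner cases: both terms absent, both terms fully present
print(or_sim(0, 0), or_sim(1, 1))
print(and_sim(0, 0), and_sim(1, 1))
# One term present, one absent: no longer all-or-nothing
print(or_sim(1, 0), and_sim(1, 0))
```

A document with only one of the two terms gets an intermediate value for both query types, so an AND query no longer rejects it outright and an OR query no longer scores it as highly as a document containing both terms.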