Fuzzy Set Theory. Definition A fuzzy subset A of a universe of discourse U is characterized by a membership function which associates with each element u of U a number in the interval [0,1]. Set Theory: A={a, b, c}. Subset of A: {a, c}.
Fuzzy Set Theory • Definition A fuzzy subset A of a universe of discourse U is characterized by a membership function μ_A which associates with each element u of U a number μ_A(u) in the interval [0,1]. • Set Theory: A={a, b, c}. Subset of A: {a, c}. • In classical set theory an element is either in a set or not in it, i.e., μ_A(u) is either 0 or 1.
Set Theory • Let U be the set of all elements (the universe). • There are three basic operations: • A ∪ B = {elements in A or in B}. • A ∩ B = {elements in both A and B}. • Not A = U − A.
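Definition Let U be the universe of discourse, A and B be two fuzzy subsets of U, and Ā be the complement of A relative to U. Also, let u be an element of U and μ_A(u) denote the membership of u in A. Then, μ_Ā(u) = 1 − μ_A(u), μ_{A∪B}(u) = max(μ_A(u), μ_B(u)), and μ_{A∩B}(u) = min(μ_A(u), μ_B(u)).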
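The three fuzzy operations can be sketched in a few lines. This is a minimal illustration that assumes the standard max/min/1−μ operators and represents a fuzzy subset as a dictionary from elements to membership values; the function names and the sample sets are hypothetical, not part of the slides.

```python
def fuzzy_complement(mu_a):
    """Membership in the complement of A: 1 - mu_A(u)."""
    return {u: 1.0 - m for u, m in mu_a.items()}


def fuzzy_union(mu_a, mu_b):
    """Membership in A union B: max(mu_A(u), mu_B(u)); absent elements count as 0."""
    elements = mu_a.keys() | mu_b.keys()
    return {u: max(mu_a.get(u, 0.0), mu_b.get(u, 0.0)) for u in elements}


def fuzzy_intersection(mu_a, mu_b):
    """Membership in A intersect B: min(mu_A(u), mu_B(u)); absent elements count as 0."""
    elements = mu_a.keys() | mu_b.keys()
    return {u: min(mu_a.get(u, 0.0), mu_b.get(u, 0.0)) for u in elements}


# Illustrative fuzzy subsets of the universe {u1, u2} (made-up values).
A = {"u1": 0.2, "u2": 0.7}
B = {"u1": 0.5, "u2": 0.4}

print(fuzzy_union(A, B))
print(fuzzy_intersection(A, B))
print(fuzzy_complement(A))
```

With memberships restricted to 0 and 1, these functions reduce to the classical union, intersection, and complement.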
Fuzzy Information Retrieval We first set up the term-term correlation matrix. For terms k_i and k_l, c_{i,l} = n_{i,l} / (n_i + n_l − n_{i,l}), where n_i is the number of documents containing k_i, n_l is the number of documents containing k_l, and n_{i,l} is the number of documents containing both k_i and k_l. Note that c_{i,i} = 1.
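A small sketch of how the correlation matrix could be computed from binary document vectors, assuming the usual form c_{i,l} = n_{i,l} / (n_i + n_l − n_{i,l}). The function name and the sample documents are hypothetical; returning 0 for a term that occurs in no document is an assumption made here to avoid dividing by zero.

```python
def term_correlation(docs):
    """Return the term-term correlation matrix for binary document vectors."""
    num_terms = len(docs[0])
    # n[i] = number of documents containing term i
    n = [sum(1 for d in docs if d[i]) for i in range(num_terms)]
    c = [[0.0] * num_terms for _ in range(num_terms)]
    for i in range(num_terms):
        for l in range(num_terms):
            # n_{i,l} = number of documents containing both terms
            both = sum(1 for d in docs if d[i] and d[l])
            denom = n[i] + n[l] - both
            # Assumption: a term pair that occurs in no document gets correlation 0.
            c[i][l] = both / denom if denom else 0.0
    return c


# Illustrative collection: 3 documents over 3 terms (made-up data).
docs = [(1, 1, 0), (1, 0, 0), (0, 1, 1)]
for row in term_correlation(docs):
    print(row)
```

Every term that occurs in at least one document gets a diagonal entry of 1, matching the note on the slide.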
Fuzzy Information Retrieval We define a fuzzy set for each term k_i. In the fuzzy set for k_i, a document d_j has a degree of membership μ_{i,j} computed as μ_{i,j} = 1 − ∏_{k_l ∈ d_j} (1 − c_{i,l}). Example: c_{1,2} = 0.1, c_{1,3} = 0.21. d_1 = (0, 1, 1, 0): μ_{1,1} = 1 − 0.9 × 0.79 = 0.289. d_2 = (1, 0, 0, 0): μ_{1,2} = 1 − 0 = 1 (since c_{1,1} = 1). What is μ_{1,3} for d_3 = (1, 0, 1, 0)?
Fuzzy Information Retrieval Whenever the document d_j contains a term that is strongly related to k_i, the document d_j belongs to the fuzzy set of term k_i, i.e., μ_{i,j} is very close to 1. Example: c_{1,2} = 0.9, d_1 = (0, 1, 0, 0). μ_{1,1} = 1 − (1 − 0.9) = 0.9.
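The membership computation from the two slides above can be checked with a short sketch. It assumes the formula μ_{i,j} = 1 − ∏ (1 − c_{i,l}) taken over the terms k_l present in d_j; the function name is hypothetical, and the value used for c_{1,4} is a placeholder because the slides do not give it.

```python
def membership(c_row, doc):
    """Degree of membership of a document in the fuzzy set of term k_i.

    c_row holds the correlations c_{i,l} for every term l;
    doc is the binary term vector of the document.
    """
    product = 1.0
    for c_il, present in zip(c_row, doc):
        if present:
            product *= 1.0 - c_il
    return 1.0 - product


# Row of the correlation matrix for k_1, using the values from the slide.
# c_{1,4} is not given in the slides; 0.0 is a placeholder that is never used below.
c_row_1 = [1.0, 0.1, 0.21, 0.0]

print(membership(c_row_1, (0, 1, 1, 0)))  # d_1: 1 - 0.9 * 0.79
print(membership(c_row_1, (1, 0, 0, 0)))  # d_2: 1 - 0, because c_{1,1} = 1
```

A single strongly correlated term is enough to push the product toward 0 and the membership toward 1, which is the behaviour described on the slide.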
Query: • The query is a Boolean formula, e.g., • q = k_a and (k_b or not k_c). • In disjunctive normal form, q = (1, 1, 1) or (1, 1, 0) or (1, 0, 0). • Suppose q is
Figure 1. Fuzzy document sets for the query q. Each cc_i is a conjunctive component; their union is the query fuzzy set.
Here μ_{i,j} is the membership of document d_j in the fuzzy set associated with term k_i, and μ_{q,j} is the membership of document d_j for query q.
Exercise: suppose there are 3 documents and 4 terms: d_1 = (1, 0, 1, 0), d_2 = (1, 1, 0, 0), and d_3 = (0, 1, 1, 0). (1) Compute the term-term correlation matrix c_{i,l}. (2) Compute μ_{i,j} (the membership of document j in the fuzzy set of term i). (3) If the query is q = (1, 0, 0, 0) or (1, 1, 0, 0), compute μ_{q,k} for each document d_k.
A change to the last slide: μ_{q,j} = μ_{cc1+cc2+cc3,j} = max{μ_{cc1,j}, μ_{cc2,j}, μ_{cc3,j}}, where μ_{cc1,j}, μ_{cc2,j}, μ_{cc3,j} are computed as before.
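The max-based query membership can be sketched as follows. The slides do not show how the membership of a document in a single conjunctive component is computed, so this sketch assumes the common choice of multiplying μ_{i,j} for each term required by the component and 1 − μ_{i,j} for each negated term. The function names and the sample membership values are hypothetical.

```python
def component_membership(component, mu_doc):
    """Membership of one document in one conjunctive component.

    Assumption (not stated in the slides): multiply mu for terms with bit 1
    and (1 - mu) for terms with bit 0.
    """
    result = 1.0
    for bit, mu in zip(component, mu_doc):
        result *= mu if bit else 1.0 - mu
    return result


def query_membership(components, mu_doc):
    """Membership of one document for the query: max over its conjunctive components."""
    return max(component_membership(cc, mu_doc) for cc in components)


# q = (1, 1, 1) or (1, 1, 0) or (1, 0, 0), as on the query slide.
q = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
# Made-up memberships of one document in the fuzzy sets of k_a, k_b, k_c.
mu_doc = (0.9, 0.5, 0.2)

print(query_membership(q, mu_doc))
```

Taking the maximum is the fuzzy union applied to the conjunctive components, so a document scores well as soon as it fits any one component.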
String Matching Allowing Errors • Problem: Given a short pattern P of length m, a long text T of length n, and a maximum allowed number of errors k, find all the text positions where the pattern occurs with at most k errors.
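Dynamic Programming • Let C[i,j] be the minimum number of errors needed to match the first i characters of the pattern against a substring of the text ending at position j; i and j are the indices for the pattern and the text. • Three kinds of error: mismatch (a, b), insertion (a, ε), and deletion (ε, a), where ε is the empty string.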
The matrix The dynamic programming algorithm searches for ‘survey’ in the text ‘surgery’ with at most two errors. Bold entries indicate matching positions. Running time is O(nm).
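The matrix itself can be regenerated with a short sketch. It assumes the standard recurrence for this problem, which the slides do not spell out: C[0][j] = 0, C[i][0] = i, and C[i][j] = C[i−1][j−1] on a character match, otherwise 1 + min(C[i−1][j], C[i][j−1], C[i−1][j−1]). The function name is hypothetical.

```python
def approximate_search(pattern, text, k):
    """Return the 1-based text positions where a match with at most k errors ends."""
    m, n = len(pattern), len(text)
    # C[i][j]: fewest errors to match pattern[:i] against a substring ending at text[:j].
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        C[i][0] = i  # matching i pattern characters against nothing costs i errors
    # Row 0 stays 0: a match may start at any text position.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pattern[i - 1] == text[j - 1]:
                C[i][j] = C[i - 1][j - 1]
            else:
                C[i][j] = 1 + min(C[i - 1][j], C[i][j - 1], C[i - 1][j - 1])
    return [j for j in range(1, n + 1) if C[m][j] <= k]


print(approximate_search("survey", "surgery", 2))  # [5, 6, 7]
```

Filling the full table takes O(nm) time, in line with the running time quoted on the slide.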
Exercise • Let the text be ABCABCDDABEDF and the pattern be ABCDAB. Find the occurrences of the pattern with at most 1 error.
String Matching Allowing Errors (Fast Algorithm) • Keep only the cells with value at most k. • This reduces the time complexity.
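One way to keep only the cells with value at most k is to process the matrix column by column and stop each column just below the last row that still held such a value. The sketch below follows that idea (a cut-off heuristic usually credited to Ukkonen); it is an illustration under that assumption and not necessarily the exact algorithm intended by the slide. The function name and the variable `last_active` are hypothetical.

```python
def approximate_search_cutoff(pattern, text, k):
    """Same result as the full dynamic program, but skips cells known to exceed k."""
    m = len(pattern)
    col = list(range(m + 1))   # column 0 of the matrix: C[i][0] = i
    last_active = min(k, m)    # largest row whose value is at most k
    ends = []
    for j, ch in enumerate(text, start=1):
        top = min(last_active + 1, m)  # only one row below the active area can become active
        diag = col[0]                  # C[0][j-1]
        col[0] = 0
        for i in range(1, top + 1):
            # Rows below the active area hold stale data; any value above k works for them.
            old = col[i] if i <= last_active else k + 1
            if pattern[i - 1] == ch:
                new = diag
            else:
                new = 1 + min(diag, old, col[i - 1])
            diag = old
            col[i] = new
        last_active = top
        while col[last_active] > k:    # col[0] is 0, so this loop stops for any k >= 0
            last_active -= 1
        if last_active == m:
            ends.append(j)
    return ends


print(approximate_search_cutoff("survey", "surgery", 2))  # [5, 6, 7]
```

The cut-off is safe because values never decrease along a diagonal, so a cell can only be at most k if the cell diagonally above it was.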
Regular Expression Matching • Regular expression: • Any letter x in Σ ∪ {ε} is a regular expression, where Σ is the set of all letters and ε is the empty string. • If A and B are regular expressions, then A|B, A.B, and (A)* are regular expressions.
Regular Expression Matching (Not Required) • Given a regular expression E and a string T, find all the substrings in T that match E. • Let d(i) be the set of all states in the automaton that can be reached after T1T2…Ti is read. • Given d(i), d(i+1) can be computed easily. • The automaton has a start state and a final state. • Whenever the final state is reached, we have found a substring in T that matches the expression.
Example: • E = (A|AA).(B|AB). • T = ABBAB. • d(1) = {a, b, d, c}. • d(2) = {a, b, d, e, f, g, i}. • d(3) = {a, b, c, e, f, g, i, h, l}. • d(4) = {a, b, d, c, j}. • d(5) = {a, b, d, e, f, g, i, k}.
Running time • O(n^2), where n is the size of the automaton, since each d(i) could contain O(n) states.
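The state-set idea can be illustrated on the example expression E = (A|AA).(B|AB). The sketch below uses a small hand-built automaton without ε-transitions; its numbered states are hypothetical and do not correspond to the lettered states a to l used in the example, whose automaton figure is not reproduced here. The start state is put back into the set after every character so that a match may begin anywhere in T.

```python
# Hand-built automaton for (A|AA).(B|AB); state numbers are illustrative only.
# 0: start, 1: read "A", 2: read "AA", 3: read the "A" of "AB", 4: final.
DELTA = {
    (0, "A"): {1},
    (1, "A"): {2, 3},
    (1, "B"): {4},
    (2, "A"): {3},
    (2, "B"): {4},
    (3, "B"): {4},
}
START, FINAL = 0, 4


def match_end_positions(text):
    """Return the 1-based positions in text where a substring matching E ends."""
    current = {START}
    ends = []
    for i, ch in enumerate(text, start=1):
        following = {START}  # a new match may start at the next character
        for state in current:
            following |= DELTA.get((state, ch), set())
        if FINAL in following:
            ends.append(i)
        current = following
    return ends


print(match_end_positions("ABBAB"))  # [2, 5]
```

Each step looks at every state in the current set, which is why the cost per text character grows with the size of the automaton.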