Chase Methods based on Knowledge Discovery

Chase Methods based on Knowledge Discovery Agnieszka Dardzinska & Zbigniew W. Ras agnadar@wp.pl & ras@uncc.edu

Algorithm Chase GIVEN:  Incomplete Information System (IIS) • Constraints (functional dependencies,..) which IIS satisfies [ Dept  Chair, Chair  Dept  Faculty-Name, … Dept(x1) =Dept(x2) ]

Tableau System for IIS– information system with null values replaced by variables

Variables in Tableaux System • distinguished variables, one for each attribute (if b is an attribute of interest, then vb is the corresponding distinguished variable) • nondistinguished variables (there are countably many of them: n1, n2, n3, ….)

Functional Dependencies: [Department → Chair] [Department *Chair → Faculty Name]

Algorithm Chase Input: tableaux system S and set of functional dependencies F Output: tableaux system CHASEF(S) Begin S1:=S; while there are t1, t2  S1 and (B  b)  F such that t1[B]= t2[B] and t1[b] < t2[b] do change all the occurrences of the value t2[b] in S1 to t1[b] CHASEF(S):=S1 End

Algorithm Chase Input: tableaux system S and set of functional dependencies F Output: tableaux system CHASEF(S) Begin S1:=S; while there are t1, t2  S1 and (B  b)  F such that t1[B]= t2[B] and t1[b] < t2[b] do change all the occurrences of the value t2[b] in S1 to t1[b] CHASEF(S):=S1 End The algorithm always terminates if applied to a finite tableaux system. If one execution of the algorithm generates a tableaux system satisfying F, then every execution of the algorithm generates the same tableaux system.

Algorithm Chase 1

Chase supported by rules extracted from IIS (Chase 1) • Chase 1 identifies all incomplete attributes (their values are called concepts) in IS . 2. Main Algorithm - Extraction of rules from IS describing these concepts, - Null values in IS are replaced by values suggested by these rules. 3. These two steps are repeated till fixpoint is reached.

Example (Chase1) X = {x1, x2, x3, x4, x5, x6, x7, x8, x9, x10} A = {b, c, d, e, f, g} Attribute b e2b1(support 2), c1f1 b1(support 2), g2 b2(support 2), c3 b3(support 1), c2 b2(support 2), g3d2 b2(support 1), e3d3 b3(support 1), f2d2 b2(support 1).

Example (Chase1) Attribute b Two null values in S: b(x6), b(x8) b(x6): e2b1(support 2), c1f1 b1(support 2), g2 b2(support 2), c3 b3(support 1), c2 b2(support 2), g3d2 b2(support 1), e3d3 b3(support 1), f2d2 b2(support 1).

Example (Chase1) Attribute b Two null values in S: b(x6), b(x8) b(x8): e2b1(support 2), c1f1 b1(support 2), g2 b2(support 2), c3 b3(support 1), c2 b2(support 2), g3d2 b2(support 1), e3d3 b3(support 1), f2d2 b2(support 1). b(x6) = b2

Example (Chase1) Two null values in S: c(x7), c(x8). c(x7): b1c1(support 2), e2 c1(support 1), f4 c1(support 1), g1 c1(support 2), b2 d2 c2(support 1), b2e1 c2(support 1), b2f2 c2(support 1), b2g3 c2(support 1), d2e1 c2(support 1), d2g3 c2(support 1). b(x6) = b2

Example (Chase1) Two null values in S: c(x7), c(x8). c(x8): b1c1(support 2), e2 c1(support 1), f4 c1(support 1), g1 c1(support 2), b2 d2 c2(support 1), b2e1 c2(support 1), b2f2 c2(support 1), b2g3 c2(support 1), d2e1 c2(support 1), d2g3 c2(support 1). b(x6) = b2 , c(x7) = c1

Algorithm Chase1(S, In(A), L(D)) Input:System S=(X, A, V) Set of incomplete attributes In(A)={a1, a2, …, ak} Set of rules L(D) Output:System Chase1(S) begin j:=1; whilej ≤ kdobegin Sj:=S for allvVajdo while there is xX and rule (t v)L(D) such that xNSj(t) and card(aj(x))≠1 begin a(x):=v; end j:=j+1 end S:={Sj:1 ≤ j ≤ k}, Chase1 (S, In(A), L(D)) end

A2 A5 A6 A7 A1 A3 A4 query q3 2 4 2 25 22 22 2 q3 2 3 8 22 28 23 2 q3 22 2 2 25 12 30 3 q3 2 2 32 27 22 2 14 q1 3 4 4 2 8 4 0 q1 4 9 3 8 5 1 3 q1 14 4 3 12 4 2 3 q1 1 3 17 4 14 4 5 q2 4 2 1 4 13 14 13 q2 4 15 8 13 13 1 3 q2 4 12 15 17 13 2 3 q2 14 27 16 3 4 4 13 A1- “the no. of different attributes used in a query" A2- “the percent of null values in a queried IS" A3- “the no. of objects returned by QAS when IS-complete" A4- “the no. of objects returned by QAS when IS-incomplete" (optimistic interpretation) A5- “the no. of objects returned by QAS based on rule-based chase algorithm" (pessimistic interpretation) A6- “the no. of bad objectsretrieved" A7- “the no. of passes of rule-based chase algorithm" ZOO Database

Rules Discovery from partiallyIncomplete Information Systems

Data (Incomplete) Information SystemS = ( X, A, V ) X - finite set of objects, A - finite set of attributes, - set of their values. Assumption 1. For any , 2. For any

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 Example

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 Algorithm ERID for Extracting Rules from partially Incomplete Information System Goal: Describe e in terms of {a,b,c,d}

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 Algorithm ERID Goal: Describe e in terms of {a,b,c,d} For the values of the decision attribute we have:

X a b c d e x1 x2 • Check the relationship “ ” between values of classification attributes {a,b,c,d} and values of decision attribute e x3 x4 x5 x6 x7 x8 Algorithm ERID Goal: Describe e in terms of {a,b,c,d}.

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 Algorithm ERID Goal: Describe e in terms of {a,b,c,d} Let , . We say that: support iff and confidence of the rule are above some threshold values.

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 Algorithm ERID Goal: Describe e in terms of {a,b,c,d} Let , . We say that: support iff and confidence of the rule are above some threshold values. How to define support and confidence of a rule ?

Definition of Support and Confidence (by example) To define support and confidence of the rule a1 e3 we compute: Support of the rule: Support of the term a1: Confidence of the rule:

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 Extracting Rules from partially Incomplete Information System (Algorithm ERID(λ1, λ2)) Goal: Describe e in terms of {a,b,c,d} Thresholds (provided by user): Minimal support (λ1 = 1) Minimal confidence (λ2 = ½) - marked negative - marked negative - marked positive

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 Algorithm ERID(λ1, λ2) but but but but but but but but but

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 Algorithm ERID(λ1, λ2) but but but but but but but but but They all are not marked

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 Algorithm ERID(λ1, λ2) and and

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 Algorithm ERID(λ1, λ2) and and They all are marked positive.

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 Algorithm ERID(λ1, λ2) and and They all are marked positive. They all are marked negative.

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 Algorithm ERID(λ1, λ2) The algorithm continues for terms of length 3, 4, … till all of them have either positive or negative marks. Rules are automatically constructed from relations marked positive.

Algorithm Chase 2 (for Partially IIS)

is defined for any , Algorithm Chase 2 • partially incomplete information system of type λ, if S is incomplete and the following three conditions hold:

Algorithm Chase 2 S1, S2 - partially incomplete, both of type λ and both classifying the same sets of objects (from X) using the same sets of attributes (A) Let and satisfies containment relation Ψ (or Ψ(S1)=S2) if: The pair (S1, S2) We also denote that fact by

System S1 System S2 X a b c d e X a b c d e x1 x1 x2 x2 x3 x3 x4 x4 x5 x5 x6 x6 x7 x7 c2 x8 x8

Assumptions: • - information system of type λ • - set of all pairwise independent rules extracted by ERID from S • NS(t) - standard interpretation of term t in S, meaning that: , for any where for any , we have: • In(A) = {a1, … , ak} - incomplete attributes in S

for all do begin if and is a maximal subset of rules from L(D) such that then if then begin end end pj:= pj +nj; end if /containment relation holds between aj(x), [bj(x)/pj]/ then Algorithm Chase 2(S, In(A), L(D)) .................. ..................

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 ERID(λ1, λ2) λ1=1, λ2=0.5 Incomplete Information System S of type λ = 0.3

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 ERID(λ1, λ2) λ1=1, λ2=0.5 Algorithm Chase 2 will try to replace by enew(x1) = {(e1, ), (e2, ), (e3, )}. We will show that Ψ(e(x1)) = enew(x1) (the value e(x1) will be changed by Chase 2). Incomplete Information System S of type λ = 0.3

X a b c d e x1 x2 x3 x4 x5 x6 x7 x8 For x1: we have: Because the confidence assigned to e3 is below the threshold λ, then only two values remain: (e1, 0.48), (e2, 0.97). The value of attribute e assigned to x1 is: {(e1, 0.33), (e2, 0.67)}. Incomplete Information System S of type λ = 0.3

Distributed Chase Algorithm (Chase 3)

Let: - distributed autonomous information systems - information system for any (I - set of sites) - knowledge-base at site • set of (k, i)-rules (constructed at site k and sent to site i)

Strategy for constructing knowledge-base and Algorithm Chase 3 Notation: qS=[a, b, c : d, e] - request by S for definitions of a, b, c with additional information that d, e are complete attributes in S.

Global Ontology S2 S1 r1 r2 qS qS=[a, c, d : b] S KB KBS

Global Ontology S2 S1 r1 r2 qS qS=[a, c, d : b] S r1 r2 KBS

Assumption: Di - granularity level of values of attributes used in rules from Di may differ from the granularity level of values of attribute used in descriptions of objects in Chase 3 algorithm to be applicable to Sihas to be based on rules from Di satisfying the following two conditions: . • attribute value used in the decision part of a rule has the granularity level either equal to or finer than the granularity level of the corresponding attribute in Si . • the granularity level of any attribute used in the classification part of a rule is either equal or softer than the granularity level of the corresponding attribute in Si

Chase Methods based on Knowledge Discovery