Discovering Patterns in Multiple Datasets
Raj Bhatnagar, University of Cincinnati
Nature of Distributed Datasets
• Horizontal Partitioning
• Vertical Partitioning
• Data components may be geographically distributed
Nature of Distributed Datasets
Multi-Domain Datasets
[Figure: example multi-domain datasets relating Diseases, Drugs, Adverse Reactions, and Genes]
Nature of Distributed Datasets
Multi-Domain Datasets
[Figure: example multi-domain datasets relating Keywords, Topics, Documents, and Cited-Documents]
Types of Patterns
• Decision Trees
• Association Rules
• Principal Component Analysis
• K-Nearest Neighbor Analysis
• Clusters
  • Hierarchical
  • K-Means
  • Subspace
Nature of Clusters
Patterns: unsupervised, data-driven clusters
Single-Domain Clustering
[Figure: two views of a Genes x Diseases matrix]
• Clusters of similar genes, in the context of diseases
• Clusters of similar diseases, in the context of genes
Clusters may be:
• Mutually exclusive
• Overlapping
Nature of Patterns
Simultaneous Two-Domain Clustering
[Figure: a Genes x Diseases matrix with a sub-rectangle D x G]
• A cluster of similar genes in a subspace of diseases; a cluster of similar diseases in a subspace of genes
Options:
• Exhaustive in one domain
• Exhaustive in both domains
• Mutually exclusive clusters in one or both domains
• Overlapping clusters/subspaces in both domains
Nature of Patterns
Simultaneous Three (Multi)-Domain Clustering
[Figure: a Genes x Diseases matrix alongside a Genes x Drugs matrix]
• Match "genes" subsets in two clusters
• Phase III of this research
Part-I Patterns in Vertically Distributed Databases
Learning Decision Trees
Vertically Partitioned Dataset
[Figure: component databases D1 (A, B, C), D2 (C, D, E), ..., Dn (A, E, G)]
• D = D1 x D2 x ... x Dn; D is implicitly specified
• Goal: build the decision tree for the implicit D, using the explicit Di's
• Geographically distributed databases
Limitations:
• Can't move the Di's to a common site (size, communication cost, privacy)
• Can't update the local databases
• Can't send actual data tuples
Explicit and Implicit Databases
[Figure: three explicit component databases, at Node 1 (A, B, C), Node 2 (D, C, F), and Node 3 (A, E, C), joined on the SharedSet {A, C} to form the implicit database over attributes A, B, C, D, E, F]
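To make the explicit/implicit relationship concrete, here is a minimal pandas sketch; the tables, values, and the SharedSet {A, C} are hypothetical stand-ins for the figure's contents. Joining the component databases on their shared attributes would materialize the implicit D, which the distributed setting forbids; the join below is purely illustrative.

```python
import pandas as pd

# Hypothetical component databases: Node 1 holds (A, B, C), Node 2 holds
# (D, C, F), Node 3 holds (A, E, C); A and C form the SharedSet.
d1 = pd.DataFrame({"A": [1, 2], "B": [6, 6], "C": [2, 1]})
d2 = pd.DataFrame({"D": [1, 1], "C": [2, 1], "F": [1, 2]})
d3 = pd.DataFrame({"A": [1, 2], "E": [2, 1], "C": [2, 1]})

# The implicit database D is never materialized in the distributed setting;
# this join only illustrates what D would contain.
implicit = d1.merge(d2, on="C").merge(d3, on=["A", "C"])
print(implicit)  # columns A, B, C, D, F, E
```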
Decomposition of Computations
[Figure: component databases D1 (A, B, C), D2 (C, D, E), ..., Dn (A, E, G)]
• Since D is implicit, a computation F(D) must be decomposed into a global function G and local functions g_i:
  F(D) = G(g1(D1), g2(D2), ..., gn(Dn))
• The decomposition depends on:
  • F
  • The Di's and the set of shared attributes
Count All Tuples in Implicit D
The SharedSet has L shared attributes with k values each, giving k^L possible shared tuples.
• cond_J: the Jth tuple in the SharedSet
• n: number of databases (Di's)
• N(Dt)|cond_J: count of tuples in Dt satisfying cond_J
• Local computation: g_t(Dt, cond_J) = N(Dt)|cond_J
• G is a sum of products: |D| = Σ_J Π_{t=1..n} N(Dt)|cond_J
• If each Di knows the "shared" values, then only one message per site is needed for #tuples
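A minimal Python sketch of this sum-of-products count, assuming each site has already computed its local counts g_t(Dt, cond_J); all names and values below are illustrative.

```python
from itertools import product

def count_implicit_tuples(local_counts, shared_domains):
    """local_counts[i] maps a shared-attribute tuple cond_J to
    N(D_i | cond_J), the local count g_i at site i."""
    total = 0
    for cond in product(*shared_domains.values()):   # every tuple in the SharedSet
        prod = 1
        for site in local_counts:                    # g_i(D_i, cond)
            prod *= site.get(cond, 0)
        total += prod                                # G: sum of products
    return total

# Toy example: one shared attribute "A" with values {1, 2}
shared_domains = {"A": [1, 2]}
site1 = {(1,): 3, (2,): 2}   # N(D_1 | A=1) = 3, N(D_1 | A=2) = 2
site2 = {(1,): 1, (2,): 4}
print(count_implicit_tuples([site1, site2], shared_domains))  # 3*1 + 2*4 = 11
```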
Learning Decision Trees: ID3 Algorithm
Entropy consists of various counts only.
[Figure: a test node "a = ?" with branches a1, a2, ..., ab and c classes in the dataset]
• N_bc (tuples in branch b with class c) and N_b can be computed using g and G, exactly as for #tuples
• One message per database is needed for computing each entropy value
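Because the entropy of a candidate split is a function of these counts alone, the learner can score the split once the N_bc values arrive, one message per database. A small sketch with hypothetical branch and class labels:

```python
from math import log2

def branch_entropy(counts_bc):
    """counts_bc[b][c] = N_bc, the number of tuples in branch b with class c.
    Returns the weighted entropy ID3 uses to score the split attribute."""
    n_total = sum(sum(classes.values()) for classes in counts_bc.values())
    weighted = 0.0
    for classes in counts_bc.values():
        n_b = sum(classes.values())                      # N_b
        h_b = -sum((n_bc / n_b) * log2(n_bc / n_b)
                   for n_bc in classes.values() if n_bc > 0)
        weighted += (n_b / n_total) * h_b
    return weighted

# Two branches of a candidate split, two classes
print(branch_entropy({"a1": {"+": 3, "-": 1}, "a2": {"+": 0, "-": 4}}))
```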
Compute Covariance Matrix for D
• The covariance matrix for D is needed for eigenvectors / principal components
• It needs second-order moments, i.e., terms of the type Σ_t x_t * y_t for attribute pairs (x, y)
• This matrix can be computed at one of the databases
G-and-g Decomposition for 2nd-Order Moments
• Sum of products for two attributes: S_xy = Σ_{t in D} x_t * y_t
• There are six different ways in which x and y may be distributed, and each requires a different decomposition:
  • Case 1: x is the same as y, and x belongs to the SharedSet.
  • Case 2: x is the same as y, and x does not belong to the SharedSet.
  • Case 3: x and y both belong to the SharedSet.
Sum of Products
  • Case 4: x belongs to the SharedSet and y does not.
  • Case 5: x and y don't belong to the SharedSet and reside on different nodes. For each tuple t in the SharedSet, obtain the local aggregates of x and of y conditioned on t, and then combine them as a sum of products over the shared tuples.
  • Case 6: x and y don't belong to the SharedSet and reside on the same node: S_xy = Σ_t N(cond-t) * Prod(t), where Prod(t) is the average of the product of x and y for cond-t of the SharedSet.
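The slide's case-5 formula is elided, so the sketch below follows one plausible reading: within a shared tuple cond-t the implicit tuples form a cross product, so the sum of x*y factorizes into (local sum of x) times (local sum of y) times the matching-tuple counts of any remaining sites. All data structures here are illustrative.

```python
def sum_of_products_case5(shared_tuples, xsums_i, ysums_j, other_counts):
    """Case 5 sketch: x lives at site i, y at site j, neither is shared.
    xsums_i[t] = sum of x over site-i tuples matching shared tuple t;
    ysums_j[t] = sum of y over site-j tuples matching t;
    other_counts[t] = product of matching-tuple counts at remaining sites.
    Within cond-t the implicit tuples are a cross product, so S_xy factorizes."""
    return sum(xsums_i.get(t, 0.0) * ysums_j.get(t, 0.0) * other_counts.get(t, 1)
               for t in shared_tuples)

# One shared attribute with values 1 and 2, and no third site
xsums_i = {(1,): 5.0, (2,): 3.0}
ysums_j = {(1,): 4.0, (2,): 6.0}
print(sum_of_products_case5([(1,), (2,)], xsums_i, ysums_j, {}))  # 5*4 + 3*6 = 38
```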
Nearest Neighbor Algorithm
• Find the nearest neighbor of r1 in D1, with virtual extensions in D, for all tuples in D1
• Need to compute all pairwise distances
• The same distance values can be used for clustering algorithms
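A NumPy sketch of the idea, under the simplifying assumption that the rows are already aligned across sites: the squared Euclidean distance over an implicit tuple is the sum of per-site squared distances, so each site ships only an n-by-n matrix of numbers, never actual tuples.

```python
import numpy as np

# Hypothetical local attribute blocks of the same three tuples at two sites
x1 = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])   # site 1: attrs A, B
x2 = np.array([[1.0], [0.0], [2.0]])                   # site 2: attr E

def local_sq_dists(x):
    # squared Euclidean distances between all pairs of local sub-tuples
    diff = x[:, None, :] - x[None, :, :]
    return (diff ** 2).sum(axis=-1)

# Each site ships only its n-by-n matrix; the learner adds them up.
d_sq = local_sq_dists(x1) + local_sq_dists(x2)
nearest = np.argsort(d_sq[0])[1]   # nearest neighbor of tuple r1 (skip self)
print(nearest, d_sq)
```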
Extracting the Communication Graph
• The learner is D1
• Covariance, k-NN, and similar algorithms were developed for this situation
Part-II Subspace Clusters and Lattice Organization
Clustering in Multi-Domains
• Example: a 3-D dataset with 4 clusters, each lying in a 2-D subspace
• Points from two subspace clusters can be very close, making traditional clustering algorithms inapplicable
• Clusters may overlap
Subspace Clustering
• "Interestingness" of a subspace cluster is domain-dependent / user-defined
• Similarity-based clusters
Subspace Clusters
• Number of subspaces: a dataset with d attributes has 2^d attribute subsets, exponential in d
Nature of Real Datasets
Examples: Genes-Diseases; Person-MovieRating; Document-TermFrequency
Lattice of Subspaces: Formal Concept Analysis
[Figure: lattice of attribute subsets, each annotated with its supporting row ids]
• Need algorithms to find interesting subspace clusters
• The lattice provides much more insight into the dataset, parallel to the ideas of Formal Concept Analysis
Clusters in Subspaces

Row | a b c d e
 1  | 1 1 1 0 1
 2  | 1 1 1 0 1
 3  | 1 0 1 1 1
 4  | 0 0 1 1 1
 5  | 1 0 1 1 1

• Clusters in overlapping subspaces
• Density = number of rows: an antimonotonic property
Value of (Anti)monotonic Properties
[Figure: subset lattice with pruned supersets]
If AB has less than the needed density, then so do all of its descendants, and they can be pruned.
Maximal and Closed Subspaces
[Figure: lattice with minimum support = 2, marking subspaces that are closed but not maximal and those that are closed and maximal]
• # Closed = 9
• # Maximal = 4
Siblings and Parents in the Lattice
[Same 5 x 5 binary matrix as above]
• Merge lattice nodes to find clusters with other properties
• Siblings in the lattice:
  C1 = <{1,2,3,4,5}, {a,c,d,e}>
  C2 = <{3,4,5}, {a,c,d,e}>
Goal: Subspace Clusters with Properties
Anti-monotonic properties:
• Minimum density: density(C = <O, A>) := |O| / total number of objects in the data
  e.g., density(C = <{o2,o3,o4}, {a.2}>) = 3/5 = 0.6
• Succinctness: the density is strictly smaller than that of all of its minimum generalizations
  e.g., C1 = <{o2,o3,o4}, {a.2 c.2}> is not succinct; C2 = <{o3,o4}, {b.2 c.2}> is succinct
• Numerical properties (row-wise): "max", "min"
  e.g., C2 = <{o1,o5}, {b.2 c.4}> satisfies "max > 3"
Weak anti-monotonic properties:
• "average >= δ", "average <= δ", "variance >= δ", "variance <= δ"
  e.g., C2 = <{o1,o5}, {b.2 c.4}> satisfies "average >= 3", but both C3 = <{o1,o2,o4,o5}, {b.2}> and C4 = <{o1,o5}, {b.2 c.4 d.2}> violate "average >= 3"
Levelwise Search
• Pruning with weak anti-monotonic properties
  e.g., if C2 = <{o1,o5}, {b.2 c.4}> satisfies "average >= 3", then o1 and o5 must be contained in at least one of its minimum generalizations that satisfies this constraint, such as C5 = <{o1,o5}, {c.4}>
• If an object is not contained in any cluster of size k that satisfies a weak anti-monotonic property, it cannot be contained in any cluster of size k+1 that satisfies this property
Levelwise Search for Subspace Clusters
• Anti-monotonic & weak anti-monotonic properties (see the sketch below)
• Candidate generation is based on anti-monotonic properties only
• Data reduction is based on weak anti-monotonic properties, such as "mean >= δ", "mean <= δ", "variance >= δ", "variance <= δ"
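An Apriori-style sketch of the levelwise search using the anti-monotonic density check only (the weak anti-monotonic data-reduction step is omitted); the toy matrix is the 5-row example from the earlier slide.

```python
from itertools import combinations

def levelwise_subspace_clusters(rows, attrs, min_density):
    """rows: dict row_id -> set of attributes that are 1.
    Density (fraction of rows containing A) is anti-monotonic, so once a
    candidate fails the check, none of its supersets is ever generated."""
    n = len(rows)
    results = {}
    level, k = [frozenset([a]) for a in attrs], 1
    while level:
        kept = []
        for cand in level:
            objs = frozenset(r for r, items in rows.items() if cand <= items)
            if len(objs) / n >= min_density:          # anti-monotonic pruning
                results[cand] = objs
                kept.append(cand)
        k += 1
        # candidate generation: join surviving sets, keep proper (k)-sets
        level = [c for c in {a | b for a, b in combinations(kept, 2)}
                 if len(c) == k]
    return results

# The 5-row binary matrix from the earlier slide, density threshold 0.6
rows = {1: {"a", "b", "c", "e"}, 2: {"a", "b", "c", "e"},
        3: {"a", "c", "d", "e"}, 4: {"c", "d", "e"}, 5: {"a", "c", "d", "e"}}
clusters = levelwise_subspace_clusters(rows, "abcde", 0.6)
print({tuple(sorted(a)): sorted(o) for a, o in clusters.items()})
```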
Performance Comparison
Optimizing techniques:
• Sorting the attributes
• Reusing previous results
• Keeping a stack of unpromising branches
• Checking the closure property
Distributed Subspace Clustering
• Discover closed subspace clusters from databases located at multiple sites
• Objectives:
  • Minimize local computation cost
  • Minimize communication cost
Distributed Subspace Clustering
Horizontally Partitioned Data
[Figure: dataset DS horizontally partitioned into D1 and D2]
Distributed Subspace Clustering
List of Closed Subspace Clusters
• Lemma 1: All locally closed attribute sets are also globally closed
• Lemma 2: The intersection of two locally closed attribute sets from two different sites is globally closed
  e.g., a = ac ∩ ab
Distributed Subspace Clustering
List of Closed Subspace Clusters
Compute the object set of a globally closed attribute set:
• Closed at both partitions: compute the union of the two object sets (e.g., cd)
• Closed in one of the partitions: take the union of two object sets whose attribute sets' intersection equals the target attribute set (e.g., c = c ∩ cd)
• Not closed in either partition: similar to case 2 (e.g., a = ac ∩ ab)
Distributed Subspace Clustering
List of Closed Subspace Clusters
• Problem: in both cases 2 and 3, the same set can arise from several intersections, e.g., a = ac ∩ ab and a = acd ∩ ab
• Solution: for each globally closed attribute set, keep track of the largest object set (or the size of the object set); see the sketch below
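A toy sketch of Lemmas 1 and 2, with the case-1-to-3 object-set bookkeeping folded into a simple union per globally closed set; the partition contents are made up.

```python
from itertools import product

def globally_closed(sites):
    """Each site reports {locally closed attribute set: local object set}.
    Globally closed sets = all locally closed sets (Lemma 1) plus all
    intersections of locally closed sets from different sites (Lemma 2).
    For each global set we accumulate every contributed object set, which
    covers cases 1-3 above and the largest-object-set bookkeeping."""
    out = {}
    def add(attrs, objs):
        if attrs:                                         # skip empty intersections
            out[attrs] = out.get(attrs, frozenset()) | objs
    for site in sites:
        for attrs, objs in site.items():
            add(attrs, objs)                              # Lemma 1
    for s1, s2 in product(sites, repeat=2):
        if s1 is not s2:
            for (a1, o1), (a2, o2) in product(s1.items(), s2.items()):
                add(a1 & a2, o1 | o2)                     # Lemma 2, e.g. a = ac ∩ ab
    return out

# Toy partitions; "a" becomes globally closed as ac ∩ ab
site1 = {frozenset("ac"): frozenset({1, 2}), frozenset("cd"): frozenset({3})}
site2 = {frozenset("ab"): frozenset({4}), frozenset("cd"): frozenset({5})}
for attrs, objs in globally_closed([site1, site2]).items():
    print("".join(sorted(attrs)), sorted(objs))
```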
Distributed Subspace Clustering
• Density constraint: δ >= 0.6
• Observation: the intersection of two elements both from the Ei's cannot have enough density
• Efficient computation: sort the Fi's and Ei's in decreasing order of density
Distributed Subspace Clustering
• Generalize to k > 2 sites
• k sites need k steps of communication and computation
• k sites have k types:
Part-III Multi-Domain Clusters
Introduction
[Figure: traditional clustering vs. bi-clustering vs. 3-clustering]
Why 3-Clusters?
• Correspondence between bi-clusters of two different lattices
• Sharpen local clusters with outside knowledge
• Alternative: "join the datasets, then search"
  • Does not capture the underlying interactions
  • Inefficient
  • Not always possible
Formal Definitions
• Pattern in Di
• Bi-cluster in Di
• 3-Cluster across D1 and D2
Defining 3-Clusters
• D1 is the "learner"
• A maximal rectangle of 1's, under a suitable permutation, in the learner
• Best correspondence to a rectangle of 1's in D2