Discovering Patterns in Multiple Datasets
Raj Bhatnagar, University of Cincinnati
Nature of Distributed Datasets
• Horizontal Partitioning
• Vertical Partitioning
• Data components may be geographically distributed
Nature of Distributed Datasets: Multi-Domain Datasets
(Figure: linked datasets over the domains Diseases, Drugs, Adverse Reactions, and Genes.)
Nature of Distributed Datasets: Multi-Domain Datasets
(Figure: linked datasets over the domains Keywords, Cited-Documents, Topics, and Documents.)
Types of Patterns
• Decision Trees
• Association Rules
• Principal Component Analysis
• K-Nearest Neighbor Analysis
• Clusters
  • Hierarchical
  • K-Means
  • Subspace
Nature of Cluster Patterns
• Patterns here ≡ unsupervised, data-driven clusters
Single-Domain Clustering (Genes × Diseases matrix):
• Clusters of similar genes, in the context of diseases
• Clusters of similar diseases, in the context of genes
• Clusters may be:
  • Mutually exclusive
  • Overlapping
Nature of Patterns: Simultaneous Two-Domain Clustering (Diseases × Genes)
• A cluster of similar genes, in a subspace of diseases
• A cluster of similar diseases, in a subspace of genes
Options:
• Exhaustive in one domain
• Exhaustive in both domains
• Mutually exclusive clusters in one or both domains
• Overlapping clusters/subspaces in both domains
Nature of Patterns: Simultaneous Three (Multi)-Domain Clustering
(Figure: two matrices, Diseases × Genes and Drugs × Genes.)
• Match the "genes" subsets in the two clusters
• Phase-III of this research
Part-I Patterns in Vertically Distributed Databases
Learning Decision Trees from a Vertically Partitioned Dataset
(Figure: component databases D1 over (A, B, C), D2 over (C, D, E), ..., Dn over (A, E, G).)
• D = D1 × D2 × . . . × Dn; D is implicitly specified
• Goal: build the decision tree for the implicit D, using the explicit Di's
• The databases are geographically distributed
• Limitations:
  • Can't move the Di's to a common site (size / communication cost / privacy)
  • Can't update local databases
  • Can't send actual data tuples
Explicit and Implicit Databases
(Figure: three explicit component databases, Node 1 over (A, B, C, D), Node 2 over (C, F), and Node 3 over (A, E, C); the attributes A and C form the SharedSet. Joining the component tables on the shared attributes yields the implicit database over (A, B, C, D, E, F).)
Decomposition of Computations
• Since D is implicit, a computation F(D) must be evaluated without materializing D
• Decompose F into a global combiner G and local computations g_i:
    F(D) = G( g1(D1), g2(D2), . . ., gn(Dn) )
• The decomposition depends on:
  • F
  • the Di's and the set of shared attributes
Count All Tuples in Implicit D
(Figure: SharedSet table over the shared attributes A and C, listing the possible tuples (1,1), (1,2), (2,1), (2,2).)
• L shared attributes, k values each, giving k^L shared tuples
• cond_J: the Jth tuple in the SharedSet
• n: number of databases (Di's)
• N(Di)|cond_J: count of tuples in Di satisfying cond_J
• Local computation: gi(Di, cond_J) = N(Di)|cond_J
• G is a sum-of-products:
    #tuples(D) = Σ_J Π_i N(Di)|cond_J, summing over all k^L shared tuples cond_J
• If each Di knows the "shared" values, then only one message per site is needed for the tuple count (see the sketch below)
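A minimal sketch of this sum-of-products count, assuming each site answers a one-message count query for a shared-tuple condition (all names here are illustrative, not from the slides):

```python
from itertools import product

def count_implicit_tuples(shared_domains, local_counts):
    """Count the tuples of the implicit database D without materializing it.

    shared_domains: dict mapping each shared attribute to its value set,
                    e.g. {"A": [1, 2], "C": [1, 2]} (L attributes, k values each)
    local_counts:   one callable per site; local_counts[i](cond) returns
                    N(Di)|cond, the site's count of tuples matching the
                    shared-attribute assignment cond (one message per site)
    """
    attrs = list(shared_domains)
    total = 0
    # Enumerate all k^L shared tuples cond_J ...
    for values in product(*(shared_domains[a] for a in attrs)):
        cond = dict(zip(attrs, values))
        # ... and add the product of the per-site counts for cond_J.
        term = 1
        for g_i in local_counts:
            term *= g_i(cond)
        total += term
    return total

# Toy usage with two sites sharing attributes A and C:
site1 = {(1, 1): 2, (1, 2): 1, (2, 1): 0, (2, 2): 3}
site2 = {(1, 1): 1, (1, 2): 2, (2, 1): 1, (2, 2): 1}
queries = [lambda c, t=t: t[(c["A"], c["C"])] for t in (site1, site2)]
print(count_implicit_tuples({"A": [1, 2], "C": [1, 2]}, queries))  # 2*1 + 1*2 + 0*1 + 3*1 = 7
```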
Learning Decision Trees: ID3 Algorithm
(Figure: a decision node "a = ?" with b branches a1, a2, . . ., ab, and c classes in the dataset.)
• The ID3 computation consists of various counts only:
    Entropy(branch b) = − Σ_c (N_bc / N_b) · log(N_bc / N_b)
  where N_b is the number of tuples on branch b and N_bc the number of those in class c
• N_bc and N_b can be computed using g and G, just as for the tuple count
• One message per database is needed for computing each entropy value
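Once the counts are in hand, the entropy itself is local arithmetic; a small sketch, assuming the N_bc values were gathered with the count decomposition above:

```python
import math

def branch_entropy(class_counts):
    """ID3 entropy of one branch from its per-class counts N_bc.

    class_counts: list of N_bc values for branch b, one per class c, each
    obtainable from the sum-of-products count with the branch and class
    conditions folded into cond_J.
    """
    n_b = sum(class_counts)
    if n_b == 0:
        return 0.0
    entropy = 0.0
    for n_bc in class_counts:
        if n_bc > 0:
            p = n_bc / n_b  # fraction of branch-b tuples in class c
            entropy -= p * math.log2(p)
    return entropy
```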
Compute Covariance Matrix for D
• The covariance matrix for D is needed for eigenvectors / principal components
• Needs second-order moments
• Helps compute terms of the type Σ_t x_t · y_t (sums of products over pairs of attributes x and y)
• This matrix can be computed at one of the databases
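As a small illustration (the names are assumptions), one covariance entry assembled from distributed aggregates:

```python
def covariance_entry(sum_xy, sum_x, sum_y, n):
    """cov(x, y) = E[xy] - E[x]E[y], where sum_xy comes from one of the
    G-and-g decompositions described below, and sum_x, sum_y, n come from
    first-order versions of the same distributed queries."""
    return sum_xy / n - (sum_x / n) * (sum_y / n)
```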
G-and-g Decomposition for 2nd-Order Moments
• Sum of products for two attributes x and y: Σ_t x_t · y_t
• There are six different ways in which x and y may be distributed; each requires a different decomposition:
  • Case 1: x same as y; x belongs to the SharedSet
  • Case 2: x same as y; x does not belong to the SharedSet
  • Case 3: x and y both belong to the SharedSet
Sum of Products: Remaining Cases
• Case 4: x belongs to the SharedSet and y does not
• Case 5: x and y don't belong to the SharedSet and reside on different nodes
  • For each tuple t in the SharedSet, obtain the local sums Σ x|cond_t and Σ y|cond_t, and then combine them: under cond_t every local x-tuple joins with every local y-tuple, so these sums, scaled by the matching counts at the remaining nodes, give the contribution of cond_t
• Case 6: x and y don't belong to the SharedSet and reside on the same node
  • The contribution of cond_t uses Prod(t), where Prod(t) is the average of the product of x and y for cond_t of the SharedSet
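A hedged sketch of the Case 5 combination (the dict-based representation and helper names are assumptions; the scaling by the remaining sites' counts mirrors the tuple-count decomposition above):

```python
def sum_xy_case5(shared_tuples, sum_x, sum_y, counts_other):
    """Sum of x*y over the implicit D when x and y live on different
    non-shared nodes.

    shared_tuples: iterable of shared-attribute tuples cond_t
    sum_x[t], sum_y[t]: local sums of x (resp. y) over tuples matching cond_t
    counts_other: list of dicts, one per remaining node, mapping cond_t to
                  that node's matching tuple count N(Dm)|cond_t
    """
    total = 0.0
    for t in shared_tuples:
        # Under cond_t, every local x-tuple joins with every local y-tuple
        # and with every matching tuple at each remaining node.
        contrib = sum_x.get(t, 0.0) * sum_y.get(t, 0.0)
        for counts_m in counts_other:
            contrib *= counts_m.get(t, 0)
        total += contrib
    return total
```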
Nearest Neighbor Algorithm
• Find the nearest neighbor of r1 in D1, with virtual extensions in D, for all tuples in D1
• Need to compute all pairwise distances
• The same distance values can be used for clustering algorithms
Extracting the Communication Graph
• The learner is D1
• Covariance, k-NN, and similar algorithms have been developed for this situation
Part-II Subspace Clusters and Lattice Organization
Clustering in Multiple Domains
• Example: a 3-D dataset with 4 clusters, each cluster lying in 2-D
• Points from two subspace clusters can be very close, making traditional clustering algorithms inapplicable
• Clusters may overlap
Subspace Clustering
• "Interestingness" of a subspace cluster:
  • Domain-dependent / user-defined
• Similarity-based clusters
Subspace Clusters
• Number of subspaces: grows exponentially with the number of attributes (2^d axis-aligned subspaces for d attributes)
Nature of Real Datasets
Examples: Genes–Diseases; Person–MovieRating; Document–TermFrequency
Lattice of Subspaces: Formal Concept Analysis
• Need algorithms to find interesting subspace clusters
• The lattice provides much more insight into the dataset
• Parallel to the ideas of Formal Concept Analysis
Clusters in Subspaces
Example binary matrix (rows 1–5, attributes a–e):

      a  b  c  d  e
  1   1  1  1  0  1
  2   1  1  1  0  1
  3   1  0  1  1  1
  4   0  0  1  1  1
  5   1  0  1  1  1

• Clusters in overlapping subspaces
• Density = number of rows (an anti-monotonic property)
Value of (Anti)monotonic Properties
• Pruned supersets: if AB has less than the needed density, then so do all of its descendants in the lattice (see the sketch below)
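A minimal levelwise sketch of this pruning over a binary matrix like the one above (bottom-up, Apriori-style; all names are illustrative, not the authors' exact algorithm):

```python
from itertools import combinations

def dense_subspaces(rows, attrs, min_density):
    """Enumerate attribute sets whose supporting row set is dense enough.

    rows:  list of dicts mapping attribute name -> 0/1
    attrs: list of attribute names
    Returns a dict: frozenset(attributes) -> frozenset(supporting row ids).
    """
    n = len(rows)

    def support(attr_set):
        return frozenset(i for i, r in enumerate(rows)
                         if all(r[a] == 1 for a in attr_set))

    result = {}
    frontier = [frozenset([a]) for a in attrs]  # level 1: single attributes
    while frontier:
        survivors = []
        for s in frontier:
            rows_s = support(s)
            if len(rows_s) / n >= min_density:  # anti-monotonic density test
                result[s] = rows_s
                survivors.append(s)
            # else: every superset of s fails too, so s generates no candidates
        # Level k+1 candidates: unions of surviving k-sets differing in one attribute.
        frontier = list({a | b for a, b in combinations(survivors, 2)
                         if len(a | b) == len(a) + 1})
    return result
```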
Maximal and Closed Subspaces
(Figure: lattice of attribute sets, with nodes marked "closed but not maximal" and "closed and maximal".)
• Minimum support = 2
• # Closed = 9; # Maximal = 4
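For reference, a small sketch of the two tests, assuming the dense_subspaces result from the previous sketch:

```python
def closed_and_maximal(dense):
    """Split discovered attribute sets into closed (no proper superset with
    the same supporting rows) and maximal (no dense proper superset at all).
    Any superset with the same support is itself dense, so checking the
    dense supersets suffices for closedness."""
    closed, maximal = set(), set()
    for s, rows_s in dense.items():
        supersets = [t for t in dense if s < t]  # dense proper supersets of s
        if all(dense[t] != rows_s for t in supersets):
            closed.add(s)
        if not supersets:
            maximal.add(s)
    return closed, maximal
```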
Siblings and Parents in the Lattice
(Same 5 × 5 example matrix as above.)
• Merge lattice nodes to find clusters with other properties
• Siblings in the lattice:
  C1 = <{1,2,3,4,5}, {a,c,d,e}>
  C2 = <{3,4,5}, {a,c,d,e}>
Goal: Subspace Clusters with Properties
Anti-monotonic properties:
• Minimum density(C = <O, A>) := |O| / total number of objects in the data
  e.g., density(C = <{o2,o3,o4}, {a.2}>) = 3/5 = 0.6
• Succinctness: density is strictly smaller than that of all of its minimum generalizations
  e.g., C1 = <{o2,o3,o4}, {a.2 c.2}> is not succinct; C2 = <{o3,o4}, {b.2 c.2}> is succinct
• Numerical properties (row-wise): "max", "min"
  e.g., C2 = <{o1,o5}, {b.2 c.4}> satisfies "max > 3"
Weak anti-monotonic properties:
• "average >= δ", "average <= δ", "variance >= δ", "variance <= δ"
  e.g., C2 = <{o1,o5}, {b.2 c.4}> satisfies "average >= 3", but both C3 = <{o1,o2,o4,o5}, {b.2}> and C4 = <{o1,o5}, {b.2 c.4 d.2}> violate "average >= 3"
Levelwise Search
• Pruning with weak anti-monotonic properties
  e.g., if C2 = <{o1,o5}, {b.2 c.4}> satisfies "average >= 3", then o1 and o5 must be contained in at least one of its minimum generalizations that satisfies this constraint, such as C5 = <{o1,o5}, {c.4}>
• If an object is not contained in any cluster of size k that satisfies a weak anti-monotonic property, it cannot be contained in any cluster of size k+1 that satisfies this property
Levelwise Search for Subspace Clusters
• Handles both anti-monotonic and weak anti-monotonic properties
• Candidate generation based on anti-monotonic properties only
• Data reduction based on weak anti-monotonic properties, such as "mean >= δ", "mean <= δ", "variance >= δ", "variance <= δ"
Performance Comparison: Optimizing Techniques
• Sorting the attributes
• Reusing previous results
• Keeping a stack of unpromising branches
• Checking the closure property
Distributed Subspace Clustering
• Discover closed subspace clusters from databases located at multiple sites
• Objectives:
  • Minimize local computation cost
  • Minimize communication cost
Distributed Subspace Clustering
• Horizontally partitioned data (Figure: partitions D1 and D2 of the full dataset DS.)
Distributed Subspace Clustering: List of Closed Subspace Clusters
• Lemma 1: All locally closed attribute sets are also globally closed
• Lemma 2: The intersection of two locally closed attribute sets from two different sites is globally closed
  e.g., a = ac ∩ ab
Distributed Subspace Clustering: List of Closed Subspace Clusters
• Compute the object set of each globally closed attribute set:
  • Closed in both partitions: compute the union of the two object sets (e.g., cd)
  • Closed in one of the partitions: the union of the two object sets whose attribute sets' intersection equals the target attribute set (e.g., c = c ∩ cd)
  • Not closed in either partition: similar to case 2 (e.g., a = ac ∩ ab)
Distributed Subspace Clustering: List of Closed Subspace Clusters
• Problem: in cases 2 and 3 the same attribute set can arise from several intersections, e.g., a = ac ∩ ab and a = acd ∩ ab
• Solution: for each globally closed attribute set, keep track of the largest object set (or the size of the object set), as in the sketch below
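A hedged two-site sketch combining Lemmas 1 and 2 with the largest-object-set rule (the dict representation and helper names are assumptions, not the paper's data structures):

```python
def merge_locally_closed(site1, site2):
    """Combine two sites' locally closed subspace clusters into the global
    list (sketch of Lemmas 1 and 2 with the largest-object-set rule).

    site1, site2: dict mapping frozenset(attributes) -> frozenset(object ids)
    Returns: dict mapping each globally closed attribute set to its object set.
    """
    global_closed = {}

    def record(attr_set, obj_set):
        # Keep the largest object set seen for this attribute set.
        if len(obj_set) > len(global_closed.get(attr_set, frozenset())):
            global_closed[attr_set] = obj_set

    # Lemma 1: every locally closed attribute set is globally closed; its
    # object set is the union over the partitions that contain it.
    for attrs, objs in site1.items():
        record(attrs, objs | site2.get(attrs, frozenset()))
    for attrs, objs in site2.items():
        record(attrs, objs | site1.get(attrs, frozenset()))

    # Lemma 2: intersections of locally closed sets from different sites
    # are globally closed; their object sets are unions of the parents'.
    for a1, o1 in site1.items():
        for a2, o2 in site2.items():
            inter = a1 & a2
            if inter:
                record(inter, o1 | o2)
    return global_closed
```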
Distributed Subspace Clustering
• Density constraint: δ >= 0.6
• Observation: the intersection of two elements both from the Ei's cannot have enough density
• Efficient computation: sort the Fi and Ei lists in decreasing order of density
Distributed Subspace Clustering
• Generalize to k > 2 sites
• k sites need k steps of communication and computation
• k sites have k types
Part-III Multi-Domain Clusters
Introduction
• Traditional clustering
• Bi-Clustering
• 3-Clustering
Why 3-Clusters?
• Correspondence between bi-clusters of two different lattices
• Sharpen local clusters with outside knowledge
• Alternative: "join datasets, then search", but this:
  • Does not capture underlying interactions
  • Is inefficient
  • Is not always possible
Formal Definitions
• Pattern in Di
• Bi-cluster in Di
• 3-Cluster across D1 and D2
Defining 3-Clusters
• D1 is the "learner"
• A maximal rectangle of 1's in the learner, under a suitable row/column permutation
• Best correspondence to a rectangle of 1's in D2
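A minimal sketch of growing one such rectangle in the learner via the standard closure operation (illustrative; assumes a 0/1 matrix and a seed that matches at least one row):

```python
def maximal_rectangle(matrix, seed_cols):
    """Grow a seed column set into a maximal all-1s rectangle (bi-cluster):
    take the rows supporting the seed columns, then all columns common to
    those rows. The result cannot be extended in either direction.

    matrix:    list of lists of 0/1
    seed_cols: iterable of column indices
    """
    n_cols = len(matrix[0])
    rows = [i for i, r in enumerate(matrix)
            if all(r[j] == 1 for j in seed_cols)]
    cols = [j for j in range(n_cols)
            if all(matrix[i][j] == 1 for i in rows)]
    return rows, cols
```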