
Discovering Patterns in Multiple Datasets



Presentation Transcript


  1. Discovering Patterns in Multiple Datasets Raj Bhatnagar University of Cincinnati

  2. Nature of Distributed Datasets Horizontal Partitioning Vertical Partitioning Data components may be Geographically Distributed

  3. Nature of Distributed Datasets Multi-Domain Datasets Diseases Drugs Adverse Reactions Genes Drugs Genes

  4. Nature of Distributed Datasets Multi-Domain Datasets Keywords Cited-Documents Topics Documents Keywords Documents

  5. Types of Patterns • Decision Trees • Association Rules • Principal Component Analysis • K-Nearest Neighbor Analysis • Clusters • Hierarchical • K-Means • Subspace

  6. Nature of Clusters Patterns ≡ Unsupervised, Data-Driven Clusters Single-Domain Clustering [Figure: a diseases × genes matrix, clustered along either axis] Clusters of similar genes, in the context of diseases; clusters of similar diseases, in the context of genes. Clusters may be: - Mutually Exclusive - Overlapping

  7. Nature of Patterns Simultaneous Two-Domain Clustering [Figure: diseases (D) × genes (G) matrix] A cluster of similar genes in a subspace of diseases; a cluster of similar diseases in a subspace of genes. Options: - Exhaustive in one domain - Exhaustive in both domains - Mutually exclusive clusters in one or both domains - Overlapping clusters/subspaces in both domains

  8. Nature of Patterns Simultaneous Three (Multi)-Domain Clustering [Figure: a diseases × genes matrix and a drugs × genes matrix] Match “genes” subsets in the two clusters. Phase-III of this research.

  9. Part-I Patterns in Vertically Distributed Databases

  10. Learning Decision Trees Vertically Partitioned Dataset [Figure: component databases D1(A,B,C), D2(C,D,E), …, Dn(A,E,G)] D = D1 × D2 × … × Dn; D is implicitly specified. Goal: build a decision tree for the implicit D, using the explicit Di’s. Geographically distributed databases. Limitations: - Can’t move the Di’s to a common site (size / communication cost / privacy) - Can’t update the local databases - Can’t send actual data tuples

  11. Explicit and Implicit Databases [Figure: three explicit component databases, Node 1 (A, B, C, D), Node 2 (C, F), and Node 3 (A, E, C), joined on the SharedSet attributes A and C to form the implicit database over (A, B, C, D, E, F)]

  12. Decomposition of Computations [Figure: component databases D1(A,B,C), D2(C,D,E), …, Dn(A,E,G)] - Since D is implicit, a computation F(D) must be decomposed into a global combining function G and local computations g_i - The decomposition depends on: - F - the Di’s and the set of shared attributes

  13. Count All Tuples in Implicit D SharedSet over attributes A and C; L shared attributes with k values each give k^L SharedSet tuples, here (A,C) ∈ {(1,1), (1,2), (2,1), (2,2)}. • cond_J: Jth tuple in the SharedSet • n: number of databases (Di’s) • N(Di)|cond_J: count of tuples in Di satisfying cond_J • Local computation: g_i(Di, cond_J) = N(Di)|cond_J • G is a sum-of-products: N(D) = Σ_J [N(D1)|cond_J × … × N(Dn)|cond_J] • If each Di knows the shared values, then only one message per site is needed for #tuples
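The sum-of-products decomposition above can be sketched in Python. This is a minimal illustration, assuming each site holds its component database as a list of attribute-value dicts; the function names (`local_count`, `implicit_count`) are illustrative, not from the slides:

```python
from itertools import product

def local_count(db, shared_attrs, cond):
    """g_i: count of tuples in one component database that match one
    combination cond_J of shared-attribute values."""
    return sum(all(row[a] == v for a, v in zip(shared_attrs, cond))
               for row in db)

def implicit_count(databases, shared_attrs, shared_values):
    """G: number of tuples in the implicit database D, as a sum over all
    shared-value combinations of the product of local counts.  In a
    distributed setting each site would return all of its counts in a
    single message."""
    total = 0
    for cond in product(*shared_values):   # each tuple cond_J in the SharedSet
        prod = 1
        for db in databases:
            prod *= local_count(db, shared_attrs, cond)
        total += prod
    return total
```

No raw tuples leave a site: each site contributes only a vector of counts, one entry per SharedSet tuple.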

  14. Learning Decision Trees The ID3 algorithm consists of various counts only: at a node testing a = ? with b branches (a1, a2, …, ab) over c classes in the dataset, the counts N_bc and N_b can be computed using g and G, just as for #tuples; one message per database is needed for computing each entropy value.
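Since ID3 needs only counts, the entropy of a candidate split can be written directly over count vectors. A sketch (the function names are mine); the count vectors N_bc would be obtained through the same g-and-G machinery as the tuple counts:

```python
from math import log2

def entropy_from_counts(class_counts):
    """Entropy of one branch, given the counts N_bc of each class c in
    that branch's (implicit) data.  No raw tuples are needed."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in class_counts if c > 0)

def split_entropy(branch_class_counts):
    """Weighted entropy of a split a = ?, one count vector per branch."""
    total = sum(sum(b) for b in branch_class_counts)
    return sum(sum(b) / total * entropy_from_counts(b)
               for b in branch_class_counts)
```

ID3 would pick, at each tree node, the attribute whose split minimizes this weighted entropy.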

  15. Compute Covariance Matrix for D • Covariance matrix for D • Needed for eigenvectors / principal components • Needs second-order moments • Helps compute terms of the type Σ x·y • This matrix can be computed at one of the databases

  16. G-and-g Decomposition for 2nd-order moments • Sum of products for two attributes: Σ_D x·y • There are six different ways in which x and y may be distributed, and each requires a different decomposition: • Case 1: x same as y; x belongs to the SharedSet. • Case 2: x same as y; x does not belong to the SharedSet. • Case 3: x and y both belong to the SharedSet.

  17. Sum of Products • Case 4: x belongs to the SharedSet and y does not. • Case 5: x and y don’t belong to the SharedSet and reside on different nodes. For each tuple t in the SharedSet, obtain the local aggregates of x and y, and then combine them across sites. • Case 6: x and y don’t belong to the SharedSet and reside on the same node; here Prod(t) is the average of the product of x and y for condition t of the SharedSet.
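Case 5 can be made concrete for two sites. Because every D1 tuple matching a SharedSet tuple t joins with every matching D2 tuple, the sum of x·y over the implicit D factors, per t, into a product of local sums. A sketch under that two-site assumption (names illustrative):

```python
def sum_xy_case5(d1, d2, shared_attrs, shared_tuples, x, y):
    """Case 5: attribute x lives on site 1, attribute y on site 2, and
    neither is shared.  For each SharedSet tuple t, the implicit join
    contributes (sum of x over D1 tuples matching t) times
    (sum of y over D2 tuples matching t)."""
    def matches(row, t):
        return all(row[a] == v for a, v in zip(shared_attrs, t))
    total = 0
    for t in shared_tuples:
        sx = sum(row[x] for row in d1 if matches(row, t))   # local to site 1
        sy = sum(row[y] for row in d2 if matches(row, t))   # local to site 2
        total += sx * sy
    return total
```

Each site ships one aggregate per SharedSet tuple, never its raw x or y values tuple-by-tuple.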

  18. Nearest Neighbor Algorithm • Find the nearest neighbor of r1 in D1, with virtual extensions in D, for all tuples in D1 • Need to compute all pairwise distances • The same distance values can be used for clustering algorithms

  19. Problem: Closed-Loops in Databases

  20. Extracting the Communication Graph The learner is D1. Covariance, k-NN, and similar algorithms have been developed for this situation.

  21. Part-II Subspace Clusters and Lattice Organization

  22. Clustering in Multi-Domains • Example: a 3-D dataset with 4 clusters • Each cluster lies in a 2-D subspace • Points from two subspace clusters can be very close, making traditional clustering algorithms inapplicable • Clusters may overlap

  23. Subspace Clustering • “Interestingness” of a subspace cluster: • Domain dependent / user defined • Similarity-based clusters

  24. Subspace Clusters • Number of subspaces: a dataset with d attributes has 2^d axis-parallel subspaces

  25. Nature of Real Datasets Examples: Genes--Diseases; person-MovieRating; Document-TermFrequency

  26. Lattice of Subspaces: Formal Concept Analysis [Figure: lattice of subspaces, nodes annotated with supporting row ids] • Need algorithms to find interesting subspace clusters • The lattice provides much more insight into the dataset • Parallel to the ideas of Formal Concept Analysis

  27. Clusters in Subspaces

  row  a b c d e
   1   1 1 1 0 1
   2   1 1 1 0 1
   3   1 0 1 1 1
   4   0 0 1 1 1
   5   1 0 1 1 1

  Clusters in overlapping subspaces. Density = number of rows: an anti-monotonic property.

  28. Value of (Anti)monotonic Properties Pruned supersets: if subspace AB falls below the needed density, then so do all of its descendants in the lattice.
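The pruning rule can be sketched as an Apriori-style levelwise search over a binary matrix: adding an attribute can only shrink a subspace's supporting row set, so any subspace below the density threshold is pruned together with all of its supersets. This is a generic sketch of the idea, not the authors' exact algorithm:

```python
def dense_subspaces(rows, attrs, min_rows):
    """Levelwise search for subspaces (attribute tuples) supported by at
    least min_rows rows of a binary matrix.  Anti-monotonicity: a
    candidate below threshold is dropped, and no superset of it is
    ever generated."""
    support = {(): set(range(len(rows)))}
    frontier = [()]
    result = {}
    while frontier:
        next_frontier = []
        for subspace in frontier:
            # extend only with attributes after the last one (no duplicates)
            start = attrs.index(subspace[-1]) + 1 if subspace else 0
            for a in attrs[start:]:
                cand = subspace + (a,)
                sup = {i for i in support[subspace] if rows[i][a] == 1}
                if len(sup) >= min_rows:       # else cand and all supersets pruned
                    support[cand] = sup
                    result[cand] = sup
                    next_frontier.append(cand)
        frontier = next_frontier
    return result
```

Run on the 5 × 5 example matrix of slide 27 with min_rows = 2, this keeps subspaces such as (a, b) with rows {1, 2} and (c, d, e) with rows {3, 4, 5}, and never generates supersets of the empty subspace (b, d).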

  29. Maximal and Closed Subspaces Minimum support = 2 [Figure: lattice with nodes marked “closed but not maximal” and “closed and maximal”] # Closed = 9; # Maximal = 4
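The closed/maximal distinction can be checked directly on a small binary matrix: a frequent subspace is closed if no superset keeps exactly the same supporting rows, and maximal if no superset is frequent at all. A brute-force sketch, exponential and for illustration only (names are mine):

```python
from itertools import combinations

def closed_and_maximal(rows, attrs, min_sup):
    """Enumerate all frequent attribute subsets of a binary matrix,
    then mark the closed ones (no superset with the same row set) and
    the maximal ones (no frequent superset).  Every maximal subspace
    is closed, but not vice versa."""
    def support(sub):
        return frozenset(i for i, r in enumerate(rows)
                         if all(r[a] == 1 for a in sub))
    frequent = {}
    for k in range(1, len(attrs) + 1):
        for combo in combinations(attrs, k):
            sup = support(combo)
            if len(sup) >= min_sup:
                frequent[frozenset(combo)] = sup
    closed = {s for s, sup in frequent.items()
              if not any(s < t and frequent[t] == sup for t in frequent)}
    maximal = {s for s in frequent if not any(s < t for t in frequent)}
    return closed, maximal
```

On the example matrix of slide 27 with minimum support 2, {c, d, e} (rows 3, 4, 5) comes out closed but not maximal, since its superset {a, c, d, e} is still frequent on rows 3 and 5.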

  30. Siblings and Parents in Lattice (the same example matrix as slide 27) Merge lattice nodes to find clusters with other properties. Siblings in the lattice: C1 = <{1,2,3,4,5}, {a,c,d,e}> C2 = <{3,4,5}, {a,c,d,e}>

  31. Goal: Subspace Clusters with Properties Anti-monotonic properties: • Minimum density(C=<O,A>) := |O| / total number of objects in the data, e.g.: density(<{o2,o3,o4}, {a.2}>) = 3/5 = 0.6 • Succinctness: density is strictly smaller than that of all of its minimum generalizations, e.g.: C1 = <{o2,o3,o4}, {a.2 c.2}> is not succinct; C2 = <{o3,o4}, {b.2 c.2}> is succinct • Numerical properties (row-wise): “max”, “min”, e.g.: C2 = <{o1,o5}, {b.2 c.4}> satisfies “max > 3” Weak anti-monotonic properties: • “average >= δ”, “average <= δ”, “variance >= δ”, “variance <= δ”, e.g.: C2 = <{o1,o5}, {b.2 c.4}> satisfies “average >= 3”, but both C3 = <{o1,o2,o4,o5}, {b.2}> and C4 = <{o1,o5}, {b.2 c.4 d.2}> violate “average >= 3”

  32. Levelwise Search • Pruning with weak anti-monotonic properties, e.g.: if C2 = <{o1,o5}, {b.2 c.4}> satisfies “average >= 3”, then o1 and o5 must be contained in at least one of its minimum generalizations that satisfies this constraint, such as C5 = <{o1,o5}, {c.4}> • If an object is not contained in any cluster of size k that satisfies a weak anti-monotonic property, it cannot be contained in any cluster of size k+1 that satisfies this property

  33. Levelwise Search for Subspace Clusters • Anti-monotonic & weak anti-monotonic properties • Candidate generation based on anti-monotonic properties only • Data reduction based on weak anti-monotonic properties, such as “mean >= δ”, “mean <= δ”, “variance >= δ”, “variance <= δ”

  34. Performance Comparison Optimizing techniques: • Sorting the attributes • Reusing previous results • A stack of unpromising branches • Checking the closure property

  35. Distributed Subspace Clustering • Discover closed subspace clusters from databases located at multiple sites • Objectives: • Minimize local computation cost • Minimize communication cost

  36. Distributed Subspace Clustering • Horizontally Partitioned Data [Figure: the dataset DS split into horizontal partitions D1 and D2]

  37. Distributed Subspace Clustering List of Closed Subspace Clusters Lemma 1: All locally closed attribute sets are also globally closed. Lemma 2: The intersection of two locally closed attribute sets from two different sites is globally closed, e.g.: a = ac ∩ ab

  38. Distributed Subspace Clustering List of Closed Subspace Clusters • Compute the object set: • Closed at both partitions: compute the union of the two object sets, e.g.: cd • Closed in only one of the partitions: the union of two object sets whose attribute sets’ intersection equals the target attribute set, e.g.: c = c ∩ cd • Not closed in any of the partitions: similar to case 2, e.g.: a = ac ∩ ab

  39. Distributed Subspace Clustering List of Closed Subspace Clusters Problem: cases 2 and 3 can produce the same attribute set in more than one way, e.g. a = ac ∩ ab and a = acd ∩ ab. Solution: for each globally closed attribute set, keep track of the largest object set (or the size of the object set).
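Lemmas 1 and 2 suggest a simple two-site procedure: collect each site's locally closed attribute sets, add all pairwise intersections across the sites, and sum the local supports. A brute-force sketch, illustrative only (function names are mine, not from the slides):

```python
from itertools import combinations

def closed_sets(rows, attrs):
    """Locally closed attribute sets of one horizontal partition: an
    attribute set is closed iff adding any further attribute strictly
    shrinks its supporting row count."""
    def sup(s):
        return sum(all(r[a] == 1 for a in s) for r in rows)
    subs = [frozenset(c) for k in range(1, len(attrs) + 1)
            for c in combinations(attrs, k)]
    return {s: sup(s) for s in subs
            if sup(s) > 0 and all(sup(s | {a}) < sup(s)
                                  for a in attrs if a not in s)}

def global_closed(part1, part2, attrs):
    """Lemma 1: locally closed sets are globally closed.  Lemma 2: the
    intersection of two locally closed sets from different sites is
    globally closed.  Global support is the sum of local supports."""
    c1, c2 = closed_sets(part1, attrs), closed_sets(part2, attrs)
    candidates = set(c1) | set(c2) | {a & b for a in c1 for b in c2 if a & b}
    def sup(rows, s):
        return sum(all(r[x] == 1 for x in s) for r in rows)
    return {s: sup(part1, s) + sup(part2, s) for s in candidates}
```

Since the same intersection can arise from several pairs, a full implementation would also keep, per attribute set, the largest object set found, as the slide's bookkeeping rule requires.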

  40. Distributed Subspace Clustering Density constraint: δ >= 0.6 Observation: the intersection of two elements both from the Ei’s cannot have enough density. Efficient computation: sort the Fi’s and Ei’s into decreasing order of density.

  41. Distributed Subspace Clustering

  42. Distributed Subspace Clustering • Generalize to k > 2 sites • k sites need k steps of communication and computation • k sites have k types:

  43. Distributed Subspace Clustering • K=3

  44. Distributed Subspace Clustering

  45. Distributed Subspace Clustering

  46. Part-III Multi-Domain Clusters

  47. Introduction Traditional clustering Bi-Clustering 3-Clustering

  48. Why 3-clusters? • Correspondence between bi-clusters of two different lattices • Sharpen local clusters with outside knowledge • Alternative: “join the datasets, then search” • Does not capture the underlying interactions • Inefficient • Not always possible

  49. Formal Definitions Pattern in Di Bi-cluster in Di 3-Cluster across D1 and D2

  50. Defining 3-clusters • D1 is the “learner” • Maximal rectangle of 1’s, under a suitable permutation, in the learner • Best correspondence to a rectangle of 1’s in D2
