Discovering Patterns in Multiple Datasets
Raj Bhatnagar, University of Cincinnati
Nature of Distributed Datasets
• Horizontal Partitioning
• Vertical Partitioning
• Data components may be geographically distributed
Nature of Distributed Datasets
Multi-Domain Datasets
[Figure: example multi-domain datasets relating Diseases, Drugs, Adverse Reactions, and Genes]
Nature of Distributed Datasets
Multi-Domain Datasets
[Figure: example multi-domain datasets relating Keywords, Topics, Documents, and Cited-Documents]
Types of Patterns
• Decision Trees
• Association Rules
• Principal Component Analysis
• K-Nearest Neighbor Analysis
• Clusters
  • Hierarchical
  • K-Means
  • Subspace
Nature of Clusters
Patterns: unsupervised, data-driven clusters
Single-Domain Clustering
[Figure: two views of a Genes x Diseases matrix]
• Clusters of similar genes, in the context of diseases
• Clusters of similar diseases, in the context of genes
Clusters may be:
• Mutually exclusive
• Overlapping
Nature of Patterns
Simultaneous Two-Domain Clustering
[Figure: a Genes x Diseases matrix with a sub-rectangle D x G]
• A cluster of similar genes in a subspace of diseases; a cluster of similar diseases in a subspace of genes
Options:
• Exhaustive in one domain
• Exhaustive in both domains
• Mutually exclusive clusters in one or both domains
• Overlapping clusters/subspaces in both domains
Nature of Patterns
Simultaneous Three (Multi)-Domain Clustering
[Figure: a Genes x Diseases matrix alongside a Genes x Drugs matrix]
• Match "genes" subsets in two clusters
• Phase III of this research
Part-I Patterns in Vertically Distributed Databases
Learning Decision Trees
Vertically Partitioned Dataset
[Figure: component databases D1 (A, B, C), D2 (C, D, E), ..., Dn (A, E, G)]
• D = D1 x D2 x ... x Dn; D is implicitly specified
• Goal: build the decision tree for the implicit D, using the explicit Di's
• Geographically distributed databases
Limitations:
• Can't move the Di's to a common site (size, communication cost, privacy)
• Can't update the local databases
• Can't send actual data tuples
Explicit and Implicit Databases
[Figure: three explicit component databases, at Node 1 (A, B, C), Node 2 (D, C, F), and Node 3 (A, E, C), joined on the SharedSet {A, C} to form the implicit database over attributes A, B, C, D, E, F]
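To make the explicit/implicit relationship concrete, here is a minimal pandas sketch; the tables, values, and the SharedSet {A, C} are hypothetical stand-ins for the figure's contents. Joining the component databases on their shared attributes would materialize the implicit D, which the distributed setting forbids; the join below is purely illustrative.

```python
import pandas as pd

# Hypothetical component databases: Node 1 holds (A, B, C), Node 2 holds
# (D, C, F), Node 3 holds (A, E, C); A and C form the SharedSet.
d1 = pd.DataFrame({"A": [1, 2], "B": [6, 6], "C": [2, 1]})
d2 = pd.DataFrame({"D": [1, 1], "C": [2, 1], "F": [1, 2]})
d3 = pd.DataFrame({"A": [1, 2], "E": [2, 1], "C": [2, 1]})

# The implicit database D is never materialized in the distributed setting;
# this join only illustrates what D would contain.
implicit = d1.merge(d2, on="C").merge(d3, on=["A", "C"])
print(implicit)  # columns A, B, C, D, F, E
```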
Decomposition of Computations
[Figure: component databases D1 (A, B, C), D2 (C, D, E), ..., Dn (A, E, G)]
• Since D is implicit, a computation F(D) must be decomposed into a global function G and local functions g_i:
  F(D) = G(g1(D1), g2(D2), ..., gn(Dn))
• The decomposition depends on:
  • F
  • The Di's and the set of shared attributes
Count All Tuples in Implicit D
The SharedSet has L shared attributes with k values each, giving k^L possible shared tuples.
• cond_J: the Jth tuple in the SharedSet
• n: number of databases (Di's)
• N(Dt)|cond_J: count of tuples in Dt satisfying cond_J
• Local computation: g_t(Dt, cond_J) = N(Dt)|cond_J
• G is a sum of products: |D| = Σ_J Π_{t=1..n} N(Dt)|cond_J
• If each Di knows the "shared" values, then only one message per site is needed for #tuples
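A minimal Python sketch of this sum-of-products count, assuming each site has already computed its local counts g_t(Dt, cond_J); all names and values below are illustrative.

```python
from itertools import product

def count_implicit_tuples(local_counts, shared_domains):
    """local_counts[i] maps a shared-attribute tuple cond_J to
    N(D_i | cond_J), the local count g_i at site i."""
    total = 0
    for cond in product(*shared_domains.values()):   # every tuple in the SharedSet
        prod = 1
        for site in local_counts:                    # g_i(D_i, cond)
            prod *= site.get(cond, 0)
        total += prod                                # G: sum of products
    return total

# Toy example: one shared attribute "A" with values {1, 2}
shared_domains = {"A": [1, 2]}
site1 = {(1,): 3, (2,): 2}   # N(D_1 | A=1) = 3, N(D_1 | A=2) = 2
site2 = {(1,): 1, (2,): 4}
print(count_implicit_tuples([site1, site2], shared_domains))  # 3*1 + 2*4 = 11
```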
Learning Decision Trees: ID3 Algorithm
Entropy consists of various counts only.
[Figure: a test node "a = ?" with branches a1, a2, ..., ab and c classes in the dataset]
• N_bc (tuples in branch b with class c) and N_b can be computed using g and G, exactly as for #tuples
• One message per database is needed for computing each entropy value
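Because the entropy of a candidate split is a function of these counts alone, the learner can score the split once the N_bc values arrive, one message per database. A small sketch with hypothetical branch and class labels:

```python
from math import log2

def branch_entropy(counts_bc):
    """counts_bc[b][c] = N_bc, the number of tuples in branch b with class c.
    Returns the weighted entropy ID3 uses to score the split attribute."""
    n_total = sum(sum(classes.values()) for classes in counts_bc.values())
    weighted = 0.0
    for classes in counts_bc.values():
        n_b = sum(classes.values())                      # N_b
        h_b = -sum((n_bc / n_b) * log2(n_bc / n_b)
                   for n_bc in classes.values() if n_bc > 0)
        weighted += (n_b / n_total) * h_b
    return weighted

# Two branches of a candidate split, two classes
print(branch_entropy({"a1": {"+": 3, "-": 1}, "a2": {"+": 0, "-": 4}}))
```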
Compute Covariance Matrix for D
• The covariance matrix for D is needed for eigenvectors / principal components
• It needs second-order moments, i.e., terms of the type Σ_t x_t * y_t for attribute pairs (x, y)
• This matrix can be computed at one of the databases
G-and-g Decomposition for 2nd-Order Moments
• Sum of products for two attributes: S_xy = Σ_{t in D} x_t * y_t
• There are six different ways in which x and y may be distributed, and each requires a different decomposition:
  • Case 1: x is the same as y, and x belongs to the SharedSet.
  • Case 2: x is the same as y, and x does not belong to the SharedSet.
  • Case 3: x and y both belong to the SharedSet.
Sum of Products
  • Case 4: x belongs to the SharedSet and y does not.
  • Case 5: x and y don't belong to the SharedSet and reside on different nodes. For each tuple t in the SharedSet, obtain the local aggregates of x and of y conditioned on t, and then combine them as a sum of products over the shared tuples.
  • Case 6: x and y don't belong to the SharedSet and reside on the same node: S_xy = Σ_t N(cond-t) * Prod(t), where Prod(t) is the average of the product of x and y for cond-t of the SharedSet.
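The slide's case-5 formula is elided, so the sketch below follows one plausible reading: within a shared tuple cond-t the implicit tuples form a cross product, so the sum of x*y factorizes into (local sum of x) times (local sum of y) times the matching-tuple counts of any remaining sites. All data structures here are illustrative.

```python
def sum_of_products_case5(shared_tuples, xsums_i, ysums_j, other_counts):
    """Case 5 sketch: x lives at site i, y at site j, neither is shared.
    xsums_i[t] = sum of x over site-i tuples matching shared tuple t;
    ysums_j[t] = sum of y over site-j tuples matching t;
    other_counts[t] = product of matching-tuple counts at remaining sites.
    Within cond-t the implicit tuples are a cross product, so S_xy factorizes."""
    return sum(xsums_i.get(t, 0.0) * ysums_j.get(t, 0.0) * other_counts.get(t, 1)
               for t in shared_tuples)

# One shared attribute with values 1 and 2, and no third site
xsums_i = {(1,): 5.0, (2,): 3.0}
ysums_j = {(1,): 4.0, (2,): 6.0}
print(sum_of_products_case5([(1,), (2,)], xsums_i, ysums_j, {}))  # 5*4 + 3*6 = 38
```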
Nearest Neighbor Algorithm
• Find the nearest neighbor of r1 in D1, with virtual extensions in D, for all tuples in D1
• Need to compute all pairwise distances
• The same distance values can be used for clustering algorithms
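A NumPy sketch of the idea, under the simplifying assumption that the rows are already aligned across sites: the squared Euclidean distance over an implicit tuple is the sum of per-site squared distances, so each site ships only an n-by-n matrix of numbers, never actual tuples.

```python
import numpy as np

# Hypothetical local attribute blocks of the same three tuples at two sites
x1 = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])   # site 1: attrs A, B
x2 = np.array([[1.0], [0.0], [2.0]])                   # site 2: attr E

def local_sq_dists(x):
    # squared Euclidean distances between all pairs of local sub-tuples
    diff = x[:, None, :] - x[None, :, :]
    return (diff ** 2).sum(axis=-1)

# Each site ships only its n-by-n matrix; the learner adds them up.
d_sq = local_sq_dists(x1) + local_sq_dists(x2)
nearest = np.argsort(d_sq[0])[1]   # nearest neighbor of tuple r1 (skip self)
print(nearest, d_sq)
```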
Extracting the Communication Graph
• The learner is D1
• Covariance, k-NN, and similar algorithms were developed for this situation
Part-II Subspace Clusters and Lattice Organization
Clustering in Multi-Domains
• Example: a 3-D dataset with 4 clusters, each lying in a 2-D subspace
• Points from two subspace clusters can be very close, making traditional clustering algorithms inapplicable
• Clusters may overlap
Subspace Clustering
• "Interestingness" of a subspace cluster is domain-dependent / user-defined
• Similarity-based clusters
Subspace Clusters
• Number of subspaces: a dataset with d attributes has 2^d attribute subsets, exponential in d
Nature of Real Datasets
Examples: Genes-Diseases; Person-MovieRating; Document-TermFrequency
Lattice of Subspaces: Formal Concept Analysis
[Figure: lattice of attribute subsets, each annotated with its supporting row ids]
• Need algorithms to find interesting subspace clusters
• The lattice provides much more insight into the dataset, parallel to the ideas of Formal Concept Analysis
Clusters in Subspaces

Row | a b c d e
 1  | 1 1 1 0 1
 2  | 1 1 1 0 1
 3  | 1 0 1 1 1
 4  | 0 0 1 1 1
 5  | 1 0 1 1 1

• Clusters in overlapping subspaces
• Density = number of rows: an antimonotonic property
Value of (Anti)monotonic Properties
[Figure: subset lattice with pruned supersets]
If AB has less than the needed density, then so do all of its descendants, and they can be pruned.
Maximal and Closed Subspaces
[Figure: lattice with minimum support = 2, marking subspaces that are closed but not maximal and those that are closed and maximal]
• # Closed = 9
• # Maximal = 4
Siblings and Parents in the Lattice
[Same 5 x 5 binary matrix as above]
• Merge lattice nodes to find clusters with other properties
• Siblings in the lattice:
  C1 = <{1,2,3,4,5}, {a,c,d,e}>
  C2 = <{3,4,5}, {a,c,d,e}>
Goal: Subspace Clusters with Properties
Anti-monotonic properties:
• Minimum density: density(C = <O, A>) := |O| / total number of objects in the data
  e.g., density(C = <{o2,o3,o4}, {a.2}>) = 3/5 = 0.6
• Succinctness: the density is strictly smaller than that of all of its minimum generalizations
  e.g., C1 = <{o2,o3,o4}, {a.2 c.2}> is not succinct; C2 = <{o3,o4}, {b.2 c.2}> is succinct
• Numerical properties (row-wise): "max", "min"
  e.g., C2 = <{o1,o5}, {b.2 c.4}> satisfies "max > 3"
Weak anti-monotonic properties:
• "average >= δ", "average <= δ", "variance >= δ", "variance <= δ"
  e.g., C2 = <{o1,o5}, {b.2 c.4}> satisfies "average >= 3", but both C3 = <{o1,o2,o4,o5}, {b.2}> and C4 = <{o1,o5}, {b.2 c.4 d.2}> violate "average >= 3"
Levelwise Search
• Pruning with weak anti-monotonic properties
  e.g., if C2 = <{o1,o5}, {b.2 c.4}> satisfies "average >= 3", then o1 and o5 must be contained in at least one of its minimum generalizations that satisfies this constraint, such as C5 = <{o1,o5}, {c.4}>
• If an object is not contained in any cluster of size k that satisfies a weak anti-monotonic property, it cannot be contained in any cluster of size k+1 that satisfies this property
Levelwise Search for Subspace Clusters
• Anti-monotonic & weak anti-monotonic properties (see the sketch below)
• Candidate generation is based on anti-monotonic properties only
• Data reduction is based on weak anti-monotonic properties, such as "mean >= δ", "mean <= δ", "variance >= δ", "variance <= δ"
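An Apriori-style sketch of the levelwise search using the anti-monotonic density check only (the weak anti-monotonic data-reduction step is omitted); the toy matrix is the 5-row example from the earlier slide.

```python
from itertools import combinations

def levelwise_subspace_clusters(rows, attrs, min_density):
    """rows: dict row_id -> set of attributes that are 1.
    Density (fraction of rows containing A) is anti-monotonic, so once a
    candidate fails the check, none of its supersets is ever generated."""
    n = len(rows)
    results = {}
    level, k = [frozenset([a]) for a in attrs], 1
    while level:
        kept = []
        for cand in level:
            objs = frozenset(r for r, items in rows.items() if cand <= items)
            if len(objs) / n >= min_density:          # anti-monotonic pruning
                results[cand] = objs
                kept.append(cand)
        k += 1
        # candidate generation: join surviving sets, keep proper (k)-sets
        level = [c for c in {a | b for a, b in combinations(kept, 2)}
                 if len(c) == k]
    return results

# The 5-row binary matrix from the earlier slide, density threshold 0.6
rows = {1: {"a", "b", "c", "e"}, 2: {"a", "b", "c", "e"},
        3: {"a", "c", "d", "e"}, 4: {"c", "d", "e"}, 5: {"a", "c", "d", "e"}}
clusters = levelwise_subspace_clusters(rows, "abcde", 0.6)
print({tuple(sorted(a)): sorted(o) for a, o in clusters.items()})
```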
Performance Comparison
Optimizing techniques:
• Sorting the attributes
• Reusing previous results
• Keeping a stack of unpromising branches
• Checking the closure property
Distributed Subspace Clustering
• Discover closed subspace clusters from databases located at multiple sites
• Objectives:
  • Minimize local computation cost
  • Minimize communication cost
Distributed Subspace Clustering
Horizontally Partitioned Data
[Figure: dataset DS horizontally partitioned into D1 and D2]
Distributed Subspace Clustering
List of Closed Subspace Clusters
• Lemma 1: All locally closed attribute sets are also globally closed
• Lemma 2: The intersection of two locally closed attribute sets from two different sites is globally closed
  e.g., a = ac ∩ ab
Distributed Subspace Clustering
List of Closed Subspace Clusters
Compute the object set of a globally closed attribute set:
• Closed at both partitions: compute the union of the two object sets (e.g., cd)
• Closed in one of the partitions: take the union of two object sets whose attribute sets' intersection equals the target attribute set (e.g., c = c ∩ cd)
• Not closed in either partition: similar to case 2 (e.g., a = ac ∩ ab)
Distributed Subspace Clustering
List of Closed Subspace Clusters
• Problem: in both cases 2 and 3, the same set can arise from several intersections, e.g., a = ac ∩ ab and a = acd ∩ ab
• Solution: for each globally closed attribute set, keep track of the largest object set (or the size of the object set); see the sketch below
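A toy sketch of Lemmas 1 and 2, with the case-1-to-3 object-set bookkeeping folded into a simple union per globally closed set; the partition contents are made up.

```python
from itertools import product

def globally_closed(sites):
    """Each site reports {locally closed attribute set: local object set}.
    Globally closed sets = all locally closed sets (Lemma 1) plus all
    intersections of locally closed sets from different sites (Lemma 2).
    For each global set we accumulate every contributed object set, which
    covers cases 1-3 above and the largest-object-set bookkeeping."""
    out = {}
    def add(attrs, objs):
        if attrs:                                         # skip empty intersections
            out[attrs] = out.get(attrs, frozenset()) | objs
    for site in sites:
        for attrs, objs in site.items():
            add(attrs, objs)                              # Lemma 1
    for s1, s2 in product(sites, repeat=2):
        if s1 is not s2:
            for (a1, o1), (a2, o2) in product(s1.items(), s2.items()):
                add(a1 & a2, o1 | o2)                     # Lemma 2, e.g. a = ac ∩ ab
    return out

# Toy partitions; "a" becomes globally closed as ac ∩ ab
site1 = {frozenset("ac"): frozenset({1, 2}), frozenset("cd"): frozenset({3})}
site2 = {frozenset("ab"): frozenset({4}), frozenset("cd"): frozenset({5})}
for attrs, objs in globally_closed([site1, site2]).items():
    print("".join(sorted(attrs)), sorted(objs))
```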
Distributed Subspace Clustering
• Density constraint: δ >= 0.6
• Observation: the intersection of two elements both from the Ei's cannot have enough density
• Efficient computation: sort the Fi's and Ei's in decreasing order of density
Distributed Subspace Clustering
• Generalize to k > 2 sites
• k sites need k steps of communication and computation
• k sites have k types:
Part-III Multi-Domain Clusters
Introduction
[Figure: traditional clustering vs. bi-clustering vs. 3-clustering]
Why 3-Clusters?
• Correspondence between bi-clusters of two different lattices
• Sharpen local clusters with outside knowledge
• Alternative: "join the datasets, then search"
  • Does not capture the underlying interactions
  • Inefficient
  • Not always possible
Formal Definitions
• Pattern in Di
• Bi-cluster in Di
• 3-Cluster across D1 and D2
Defining 3-Clusters
• D1 is the "learner"
• A maximal rectangle of 1's, under a suitable permutation, in the learner
• Best correspondence to a rectangle of 1's in D2