Clustering Uncertain Data • CS 290 Project • Nick Larusso, Brian Ruttenberg
Motivation • Many data acquisition tools produce uncertain data • e.g., sensor networks, image analysis • Records are no longer points in multidimensional space, but regions determined by the uncertainty of the data • New methods are required to manage and learn from such data
Probabilistic Analysis of Ganglion Cell Morphology • Bioimages are inherently uncertain • We would like to answer questions such as "how large is the cell soma?" and "how many dendrites are there, and how often do they branch?" • Each measurement must carry a level of confidence to avoid propagating errors into later analysis
Project Goal • Approximately 200 images of ganglion cells under various conditions • healthy cells and detached retinas (7d, 28d, 56d) • Probabilistic measurements of soma size, dendritic field size, and dendritic field density for each cell • We want to cluster these cells to determine the effect of retinal detachment on cell morphology
UK-means Algorithm • The k-means algorithm minimizes the sum of squared errors (SSE) • UK-means instead minimizes the expected sum of squared errors • Each uncertain object is reduced to its expected value in each dimension, and clustering then proceeds as in ordinary k-means
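The reduction above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the uncertain-object representation (a list of possible states with probabilities) and all function names are our own.

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point init."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dists))])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def uk_means(objects, k):
    """UK-means sketch: replace each uncertain object by its expected
    value per dimension, then cluster the expected values with ordinary
    k-means. Each object is a pair (values, probs): values is an (S, D)
    array of possible states, probs an (S,) array summing to 1."""
    expected = np.array([probs @ values for values, probs in objects])
    return kmeans(expected, k)
```

Because only the expected value survives the reduction, two objects with identical means but very different spreads are indistinguishable to UK-means, which is exactly the limitation the next slide raises.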
UK-Means • The expected value may be sufficient for Gaussian-like distributions, but what about arbitrary distributions? • It does not account for the variance in the data
All Possible Worlds (APW) Probabilistic Clustering • Instead of representing each object with a single value, consider all possible values of its distribution, weighted by their respective probabilities • This describes the data much better than the expected value alone
APW Example • Choose one state from object A • For each possible state of object B, calculate the distance • Repeat for each state of A
APW Clustering • Compute the probability of a possible world W as the product of the probabilities of the chosen states: P(W) = ∏ᵢ Pr[Xᵢ = x(i)], where x(i) is the value chosen for object i • Cluster the (now certain) objects using k-means • Combine the clustering results across the possible worlds, weighting each by P(W)
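The world-probability computation and the enumeration of worlds can be sketched as follows; the data representation and names are illustrative, not taken from the slides.

```python
import itertools
import numpy as np

def world_probability(states, choice):
    """P(world) = product over objects of the probability of the state
    chosen for that object. states[i] is a list of (value, prob) pairs
    for object i; choice[i] indexes into that list."""
    p = 1.0
    for i, c in enumerate(choice):
        p *= states[i][c][1]
    return p

def enumerate_worlds(states):
    """Yield (world_values, probability) for every possible world,
    i.e. every combination of one state per object."""
    for choice in itertools.product(*(range(len(s)) for s in states)):
        values = np.array([states[i][c][0] for i, c in enumerate(choice)])
        yield values, world_probability(states, choice)
```

Each yielded world is an ordinary (certain) dataset, so it can be fed directly to k-means; the probabilities across all worlds sum to 1.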
APW Computational Costs • N = # of uncertain objects • D = # of dimensions of each object • Assume each dimension is described by a distribution over a constant number of values, C • Each object then has C^D possible states, giving (C^D)^N = C^(D·N) possible worlds • Our data: D ≈ 3, C ≈ 15, N ≈ 200 ⇒ on the order of 15^600 possible worlds: we need a very fast computer!
Gibbs Sampling • Computationally infeasible to calculate all possible worlds, so sample from this space instead • Intuition: We really only care about the possible worlds that carry high probabilities, so we can weight our sampling toward these worlds
APW Clustering Using Gibbs Sampling • Randomly pick values for each dimension of each object • Iterate through each dimension of every object • For a given object and a given dimension • Pick a sample value weighted by the probability distribution • Calculate the probability of the resulting world • Cluster the objects via k-means • Clustering results are then binned according to how often each particular result appears across the samples
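The sampling loop above can be sketched as follows. This is a simplification of the slides' per-dimension Gibbs sweep: since the objects' dimensions are modeled independently here, each Gibbs conditional reduces to the marginal distribution, so the sketch samples marginals directly. The representation, names, and the pairwise co-clustering summary are our own choices.

```python
import numpy as np

def kmeans_labels(points, k, iters=25):
    """Minimal Lloyd's k-means (farthest-point init), returning labels."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def sampled_apw_coclustering(states, k, n_samples=200, seed=0):
    """Sampling-based APW clustering sketch: draw possible worlds by
    sampling each object's state from its distribution, cluster each
    sampled world with k-means, and tally how often each pair of
    objects lands in the same cluster. states[i] is (values, probs)
    for object i, with values an (S, D) array of possible states."""
    rng = np.random.default_rng(seed)
    n = len(states)
    co = np.zeros((n, n))
    for _ in range(n_samples):
        world = np.array([vals[rng.choice(len(probs), p=probs)]
                          for vals, probs in states])
        labels = kmeans_labels(world, k)
        co += labels[:, None] == labels[None, :]
    return co / n_samples
```

The resulting matrix entry co[i, j] estimates the probability that objects i and j are clustered together, which is one natural way to aggregate ("bin") the per-world clustering results.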
Preliminary Results • Interpreting the results is the biggest challenge • Preliminary results were run on 7-day ganglion cells • 30 cells each from detached and normal retinas • Both UK-means and the APW approach were run
7D Normal and Detached: APW • Spreadsheet…
Method Validation • Biologists manually cluster the normal cells based on previous studies [COOMBS06] and [SUN02] • Identify ganglion cell subtypes • Compare these manual results with the two clustering algorithms
Future Work • Use Earth Mover's Distance (EMD) as a distance metric between two distributions
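For one-dimensional discrete distributions, EMD has a simple closed form: it equals the integral of the absolute difference between the two CDFs over the combined support. A minimal sketch (the multidimensional case the project would actually need requires a transportation solver and is not shown):

```python
import numpy as np

def emd_1d(values_a, weights_a, values_b, weights_b):
    """Earth Mover's Distance between two 1-D discrete distributions,
    computed as the integral of |CDF_a - CDF_b| over the union of the
    two supports. Weights in each distribution should sum to 1."""
    support = np.sort(np.unique(np.concatenate([values_a, values_b])))

    def cdf(values, weights):
        w = np.zeros(len(support))
        for v, p in zip(values, weights):
            w[np.searchsorted(support, v)] += p
        return np.cumsum(w)

    diff = np.abs(cdf(values_a, weights_a) - cdf(values_b, weights_b))
    # Integrate the CDF gap over each interval between support points.
    return float(np.sum(diff[:-1] * np.diff(support)))
```

Unlike a distance between expected values, EMD reflects the full shape of each distribution, which is why it is a natural fit for comparing the probabilistic cell measurements described earlier.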