Clustering Uncertain Data • CS 290 Project • Nick Larusso, Brian Ruttenberg
Motivation • Many data acquisition tools produce uncertain data • e.g., sensor networks, image analysis • Records are no longer points in multidimensional space, but regions determined by the uncertainty of the data • New methods are required to manage and learn from such data
Probabilistic Analysis of Ganglion Cell Morphology • Bioimages are inherently uncertain • We would like to answer questions such as "how large is the cell soma?" and "how many dendrites are there, and how often do they branch?" • Each measurement must carry a level of confidence to avoid propagating errors into later analysis
Project Goal • Approximately 200 images of ganglion cells under various conditions • healthy cells and detached retinas (7d, 28d, 56d) • Probabilistic measurements of soma size, dendritic field size, and dendritic field density for each cell • We want to cluster these cells to determine the effect of retinal detachment on cell morphology
UK-means Algorithm • The k-means algorithm minimizes the sum of squared errors (SSE) • UK-means instead minimizes the expected sum of squared errors • Each uncertain object is reduced to its expected value in each dimension, and clustering then proceeds as in ordinary k-means
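The reduction above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the uncertain-object representation (a list of possible states with probabilities) and all function names are our own.

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point init."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dists))])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def uk_means(objects, k):
    """UK-means sketch: replace each uncertain object by its expected
    value per dimension, then cluster the expected values with ordinary
    k-means. Each object is a pair (values, probs): values is an (S, D)
    array of possible states, probs an (S,) array summing to 1."""
    expected = np.array([probs @ values for values, probs in objects])
    return kmeans(expected, k)
```

Because only the expected value survives the reduction, two objects with identical means but very different spreads are indistinguishable to UK-means, which is exactly the limitation the next slide raises.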
UK-Means • The expected value may be sufficient for Gaussian-like distributions, but what about arbitrary distributions? • It does not account for the variance in the data
All Possible Worlds (APW) Probabilistic Clustering • Instead of representing each object with a single value, consider all possible values of its distribution, weighted by their respective probabilities • This describes the data much better than the expected value alone
APW Example • Choose one state from object A • For each possible state of object B, calculate the distance • Repeat for each state of A
APW Clustering • Compute the probability of a possible world W as the product of the probabilities of the chosen states: P(W) = ∏ᵢ Pr[Xᵢ = x(i)], where x(i) is the value chosen for object i • Cluster the (now certain) objects using k-means • Combine the clustering results across the possible worlds, weighting each by P(W)
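The world-probability computation and the enumeration of worlds can be sketched as follows; the data representation and names are illustrative, not taken from the slides.

```python
import itertools
import numpy as np

def world_probability(states, choice):
    """P(world) = product over objects of the probability of the state
    chosen for that object. states[i] is a list of (value, prob) pairs
    for object i; choice[i] indexes into that list."""
    p = 1.0
    for i, c in enumerate(choice):
        p *= states[i][c][1]
    return p

def enumerate_worlds(states):
    """Yield (world_values, probability) for every possible world,
    i.e. every combination of one state per object."""
    for choice in itertools.product(*(range(len(s)) for s in states)):
        values = np.array([states[i][c][0] for i, c in enumerate(choice)])
        yield values, world_probability(states, choice)
```

Each yielded world is an ordinary (certain) dataset, so it can be fed directly to k-means; the probabilities across all worlds sum to 1.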
APW Computational Costs • N = # of uncertain objects • D = # of dimensions of each object • Assume each dimension is described by a distribution over a constant number of values, C • Each object then has C^D possible states, giving (C^D)^N = C^(D·N) possible worlds • Our data: D ≈ 3, C ≈ 15, N ≈ 200 ⇒ on the order of 15^600 possible worlds: we need a very fast computer!
Gibbs Sampling • Computationally infeasible to calculate all possible worlds, so sample from this space instead • Intuition: We really only care about the possible worlds that carry high probabilities, so we can weight our sampling toward these worlds
APW Clustering Using Gibbs Sampling • Randomly pick values for each dimension of each object • Iterate through each dimension of every object • For a given object and a given dimension • Pick a sample value weighted by the probability distribution • Calculate the probability of the resulting world • Cluster the objects via k-means • Clustering results are then binned according to how often each particular result appears across the samples
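The sampling loop above can be sketched as follows. This is a simplification of the slides' per-dimension Gibbs sweep: since the objects' dimensions are modeled independently here, each Gibbs conditional reduces to the marginal distribution, so the sketch samples marginals directly. The representation, names, and the pairwise co-clustering summary are our own choices.

```python
import numpy as np

def kmeans_labels(points, k, iters=25):
    """Minimal Lloyd's k-means (farthest-point init), returning labels."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def sampled_apw_coclustering(states, k, n_samples=200, seed=0):
    """Sampling-based APW clustering sketch: draw possible worlds by
    sampling each object's state from its distribution, cluster each
    sampled world with k-means, and tally how often each pair of
    objects lands in the same cluster. states[i] is (values, probs)
    for object i, with values an (S, D) array of possible states."""
    rng = np.random.default_rng(seed)
    n = len(states)
    co = np.zeros((n, n))
    for _ in range(n_samples):
        world = np.array([vals[rng.choice(len(probs), p=probs)]
                          for vals, probs in states])
        labels = kmeans_labels(world, k)
        co += labels[:, None] == labels[None, :]
    return co / n_samples
```

The resulting matrix entry co[i, j] estimates the probability that objects i and j are clustered together, which is one natural way to aggregate ("bin") the per-world clustering results.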
Preliminary Results • Interpreting the results is the biggest challenge • Preliminary results were run on 7-day ganglion cells • 30 cells each from detached and normal retinas • Both UK-means and the APW approach were run
7D Normal and Detached: APW • Spreadsheet…
Method Validation • Biologists manually cluster the normal cells based on previous studies [COOMBS06] and [SUN02] • Identify ganglion cell subtypes • Compare these manual results with the two clustering algorithms
Future Work • Use Earth Mover's Distance (EMD) as a distance metric between two distributions
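For one-dimensional discrete distributions, EMD has a simple closed form: it equals the integral of the absolute difference between the two CDFs over the combined support. A minimal sketch (the multidimensional case the project would actually need requires a transportation solver and is not shown):

```python
import numpy as np

def emd_1d(values_a, weights_a, values_b, weights_b):
    """Earth Mover's Distance between two 1-D discrete distributions,
    computed as the integral of |CDF_a - CDF_b| over the union of the
    two supports. Weights in each distribution should sum to 1."""
    support = np.sort(np.unique(np.concatenate([values_a, values_b])))

    def cdf(values, weights):
        w = np.zeros(len(support))
        for v, p in zip(values, weights):
            w[np.searchsorted(support, v)] += p
        return np.cumsum(w)

    diff = np.abs(cdf(values_a, weights_a) - cdf(values_b, weights_b))
    # Integrate the CDF gap over each interval between support points.
    return float(np.sum(diff[:-1] * np.diff(support)))
```

Unlike a distance between expected values, EMD reflects the full shape of each distribution, which is why it is a natural fit for comparing the probabilistic cell measurements described earlier.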