Clustering Uncertain Data

Presentation Transcript


  1. CS 290 Project • Nick Larusso, Brian Ruttenberg • Clustering Uncertain Data

  2. Motivation • Many data acquisition tools provide uncertain data • e.g., sensor networks, image analysis, etc. • Records are no longer points in multidimensional space, but regions based on the certainty of the data • New methods are required to manage and learn from this data

  3. Probabilistic Analysis of Ganglion Cell Morphology • Bioimages are inherently uncertain • We would like to answer questions like “how large is the cell soma?”, “how many dendrites are there, and how often do they branch?” • It is important to provide a level of confidence in each measurement to avoid error propagation

  4. Project Goal • Approximately 200 images of ganglion cells under various conditions • healthy cells, detached retina (7d, 28d, 56d) • Probabilistic measurements of soma size, dendritic field size, and dendritic field density for each cell • We want to cluster these cells to determine the effect of retinal detachment on cell morphology

  5. UK-means Algorithm • The k-means algorithm minimizes the sum of squared errors (SSE) • UK-means minimizes the expected sum of squared errors • Computed by taking the expected value of each dimension of each uncertain object
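A minimal sketch of the UK-means idea, assuming each uncertain object is given as one discrete (values, probabilities) distribution per dimension; the representation, the helper names, and the use of scikit-learn's KMeans are illustrative assumptions, not the authors' code:

import numpy as np
from sklearn.cluster import KMeans

def expected_point(obj):
    # obj: list of (values, probabilities) pairs, one per dimension
    return np.array([np.dot(vals, probs) for vals, probs in obj])

def uk_means(objects, k, seed=0):
    # Reduce every uncertain object to its vector of per-dimension expected
    # values, then run ordinary k-means on those (now certain) points.
    X = np.vstack([expected_point(o) for o in objects])
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)

# Toy usage: two 1-D uncertain objects whose expected values are 1.5 and 10.0.
objs = [
    [(np.array([1.0, 2.0]), np.array([0.5, 0.5]))],
    [(np.array([9.0, 11.0]), np.array([0.5, 0.5]))],
]
print(uk_means(objs, k=2))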

  6. UK-means • This idea may be sufficient for Gaussian-like distributions, but what about arbitrary distributions? • It does not account for the variance in the data

  7. All Possible Worlds (APW) Probabilistic Clustering • Instead of representing each uncertain object with a single value, consider all possible values of its distribution, weighted by their respective probabilities • Provides a much better description of the data than the expected value alone

  8. APW Example • Choose one state from object A • For each possible state of object B, calculate the distance • Repeat for every state of A
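The enumeration on this slide can be written down directly under the same assumed representation as above (independent per-dimension discrete distributions); this is a sketch of the idea, not the project's implementation:

import itertools
import numpy as np

def states(obj):
    # Yield (point, probability) for every joint state of one uncertain object,
    # assuming its dimensions are independent.
    dims = [list(zip(vals, probs)) for vals, probs in obj]
    for combo in itertools.product(*dims):
        point = np.array([v for v, _ in combo])
        prob = float(np.prod([p for _, p in combo]))
        yield point, prob

def expected_sq_distance(obj_a, obj_b):
    # Choose one state from object A, pair it with every state of object B,
    # and weight each squared distance by the joint probability of the pair.
    total = 0.0
    for a, pa in states(obj_a):
        for b, pb in states(obj_b):
            total += pa * pb * float(np.sum((a - b) ** 2))
    return total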

  9. APW Clustering • Compute the probability of a possible world as P(W) = Π_i P(X_i = x(i)), where x(i) is the value chosen for object i • Cluster the (now certain) objects using k-means • Combine the clustering results across all possible worlds
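Putting the pieces together, an exhaustive all-possible-worlds run looks roughly like the sketch below; it assumes the states() generator from the previous sketch is in scope. Combining the per-world k-means results into a probability-weighted co-association matrix is one plausible reading of "combine clustering results"; the combination actually used in the project is not spelled out on the slide:

import itertools
import numpy as np
from sklearn.cluster import KMeans

def apw_coassociation(objects, k, seed=0):
    # Enumerate every possible world, cluster it with k-means, and accumulate
    # P(world) * [i and j share a cluster] for every pair of objects.
    n = len(objects)
    coassoc = np.zeros((n, n))
    per_object_states = [list(states(o)) for o in objects]  # states() from the sketch above
    for world in itertools.product(*per_object_states):
        points = np.vstack([pt for pt, _ in world])
        p_world = float(np.prod([p for _, p in world]))      # P(W) = prod_i P(X_i = x(i))
        labels = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(points)
        coassoc += p_world * (labels[:, None] == labels[None, :])
    return coassoc   # coassoc[i, j] = probability that objects i and j co-cluster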

  10. APW Computational Costs • N = # of uncertain objects • D = # of dimensions per object • Assume each dimension is described by a distribution over a constant number of values, C • (D × C)^N possible worlds • Our data: D ~ 3, C ~ 15, N ~ 200 => we need a very fast computer!
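For scale (quick arithmetic, not on the original slide): (D × C)^N = (3 × 15)^200 = 45^200, which is on the order of 10^330 possible worlds, so enumerating them all is out of the question.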

  11. Gibbs Sampling • Computationally infeasible to calculate all possible worlds, so sample from this space instead • Intuition: We really only care about the possible worlds that carry high probabilities, so we can weight our sampling toward these worlds

  12. APW Clustering Using Gibbs Sampling • Randomly pick values for each dimension of each object • Iterate through each dimension of every object • For a given object and a given dimension • Pick a sample value weighted by the probability distribution • Calculate probability of world • Cluster objects via k-means • The objects are then binned according to how often a particular clustering result shows up
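A sketch of that sampling loop under the same assumed representation (independent discrete distributions per dimension). In this simplified setting, resampling one dimension at a time just means drawing from its own distribution, and the per-sample k-means results are again collected into a co-association matrix. This illustrates the procedure described on the slide; it is not the authors' implementation:

import numpy as np
from sklearn.cluster import KMeans

def sampled_apw(objects, k, n_samples=500, seed=0):
    rng = np.random.default_rng(seed)
    n = len(objects)
    # Start from a randomly chosen value in every dimension of every object.
    current = np.array([[rng.choice(vals, p=probs) for vals, probs in obj]
                        for obj in objects])
    coassoc = np.zeros((n, n))
    for _ in range(n_samples):
        # Sweep through every dimension of every object, redrawing its value
        # weighted by that dimension's probability distribution.
        for i, obj in enumerate(objects):
            for d, (vals, probs) in enumerate(obj):
                current[i, d] = rng.choice(vals, p=probs)
        # Cluster the sampled (certain) world and record which objects co-cluster.
        # The slide also computes P(world); when values are drawn exactly from
        # their distributions, worlds already appear in proportion to P(world),
        # so simple counting suffices in this sketch.
        labels = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(current)
        coassoc += labels[:, None] == labels[None, :]
    return coassoc / n_samples   # fraction of sampled worlds in which i and j co-cluster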

  13. Preliminary Results • Interpretation of the results is the biggest challenge • Preliminary results were run on 7-day ganglion cells • 30 cells each from detached and normal retinas • Both UK-means and the APW approach were run

  14. 7D Normal: UK-means

  15. 7D Normal: APW

  16. 7D Detached: UK-means

  17. 7D Detached: APW

  18. 7D Normal and Detached: UK-means

  19. 7D Normal and Detached: APW • Spreadsheet…

  20. Method Validation • Biologists manually cluster the normal cells based on previous studies [COOMBS06] & [SUN02] • Identify ganglion cell subtypes • Compare these results with the two clustering algorithms

  21. Future Work • Use the Earth Mover's Distance (EMD) as the distance metric between two distributions
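For one-dimensional distributions, the EMD coincides with the first Wasserstein distance, which SciPy computes directly; a full multi-dimensional EMD would instead need an optimal-transport solver. A small illustration with made-up values and weights:

from scipy.stats import wasserstein_distance

# Two discrete 1-D distributions given as support values and weights.
vals_a, weights_a = [1.0, 2.0, 3.0], [0.2, 0.5, 0.3]
vals_b, weights_b = [2.0, 4.0], [0.6, 0.4]

print(wasserstein_distance(vals_a, vals_b, weights_a, weights_b))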

  22. Questions
