Stephen Meehan¹, Darya Orlova¹, Wayne Moore¹, David Parks¹, Connor Meehan², Guenther Walther³ & Leonore Herzenberg¹
1. Department of Genetics, Stanford University
2. Department of Mathematics, California Institute of Technology
3. Department of Statistics, Stanford University
Cyto 2018, Prague, Czech Republic
Wednesday, May 2, 2018
What is AutoGate?
• Software application for flow cytometry analysis
• First released in March 2014
• Automates compensation & biexponential transformation, plus detecting, matching and differentiating gates
• User provides
  • Sample & parameter labels
  • Parameter & cluster choices within gating hierarchies
• Thus … AutoGate is semi-supervised
  • Once for diagnostic analysis. An assay-specific template makes AutoGate unsupervised for future occurrences of THAT assay. User “gates once and never again”.
  • Always for discovery analysis
AutoGate’s workflow illustrated
[Figure: conventional gating vs. AutoGate gating. AutoGate allows manual gates alongside cluster-picked gates.]
HOWEVER … The frequency of Cyto 2017 presentations showing tSNE plots clearly indicated that, to stay fit and survive, 2D-oriented AutoGate MUST evolve by doing MORE with LESS guidance … BUT evolve without inheriting the unfit reproducibility challenges of tSNE & other all-at-once clustering methods.
“More with Less” … NEW AutoGate features deliver unsupervised gating in all relevant dimensions
• Subset detection: Exhaustive Projection Pursuit (Epp) (added Jan 2018)
• Subset matching: QFMatch (added March 2018)
• Subset visualization: HiD Subset View like tSNE (added April 2018)
Subset detection: Exhaustive projection pursuit (Epp)
• If there exists an orthogonal projection on any pair of dimensions in which the data can be cleanly split, that cannot be wrong … and it divides the problem into two simpler parts.
• Applied recursively, a divide-and-conquer strategy would identify subpopulations until we are certain that no further splits are available.
• This strategy ensures every cell is found in exactly one final split … hence “No cell is left behind”
Step 1: For any set of cells examine all relevant pairs of FCS parameters (stain or scatter) in 2D projections
Step 2: For each 2D projection compute clusters using the density based merging (DBM) method
Step 4: Find all suitable candidate separatrices & pick the best one (i.e. least estimated classifier error)
Step 5: Pick the 2D projection with the best separatrix overall
Step 6: Take ALL cells on each side of the separatrix
Step 7: Repeat steps 1-7 on each new subset until no further splits are available.
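The recursion in the steps above can be sketched in a few lines of Python. This is only an illustration of the divide-and-conquer skeleton, not AutoGate's implementation: the function names `widest_gap_split` and `epp_leaves`, and the `min_gap` parameter, are hypothetical, and the DBM clustering and separatrix scoring are replaced by a much cruder stand-in (an axis-aligned cut at the widest empty gap in a projection).

```python
from itertools import combinations
import numpy as np

def widest_gap_split(points_2d, min_gap):
    """Hypothetical stand-in for DBM clustering plus separatrix search.

    Looks for the widest empty gap along either axis of a 2D projection.
    Returns (score, mask) for the best cut, or None if no gap >= min_gap.
    """
    best = None
    for axis in range(2):
        values = np.sort(points_2d[:, axis])
        gaps = np.diff(values)
        if gaps.size == 0:
            continue
        k = int(np.argmax(gaps))
        if gaps[k] >= min_gap and (best is None or gaps[k] > best[0]):
            threshold = (values[k] + values[k + 1]) / 2.0
            best = (float(gaps[k]), points_2d[:, axis] < threshold)
    return best

def epp_leaves(data, indices=None, min_gap=1.0):
    """Recursively split cells; returns a list of index arrays (the leaves)."""
    if indices is None:
        indices = np.arange(data.shape[0])
    best = None
    # Examine every pair of parameters and keep the best split overall.
    for i, j in combinations(range(data.shape[1]), 2):
        candidate = widest_gap_split(data[indices][:, [i, j]], min_gap)
        if candidate is not None and (best is None or candidate[0] > best[0]):
            best = candidate
    if best is None:
        return [indices]  # no further split available: this subset is a leaf
    mask = best[1]
    # Take ALL cells on each side of the cut and repeat on both subsets.
    return (epp_leaves(data, indices[mask], min_gap)
            + epp_leaves(data, indices[~mask], min_gap))
```

Because every split sends each cell to exactly one side, the returned leaves always partition the input, which is the "no cell is left behind" property.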
OK … so HOW do we make sense of Epp’s VOLUMINOUS output? The branches of the Epp gating hierarchy are not expected to be meaningful … just the leaves … THEY can be inspected one-by-one for biological meaningfulness using multi-dimensional visualization tools like 1D PathFinder!!!
A faster approach is overlaying known gates and finding the Epp gate with the best F-measure. BUT F-measures only compare subsets within the same sample and require previously known subsets.
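As a reminder of what the F-measure computes, here is a minimal sketch that treats each gate as a set of event indices from the same sample. The function names and the dictionary layout of `epp_gates` are assumptions made for illustration, not AutoGate's API.

```python
def f_measure(known_gate, candidate_gate):
    """Harmonic mean of precision and recall between two sets of event ids."""
    known, candidate = set(known_gate), set(candidate_gate)
    overlap = len(known & candidate)
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)  # share of the candidate that is correct
    recall = overlap / len(known)         # share of the known gate recovered
    return 2 * precision * recall / (precision + recall)

def best_matching_gate(known_gate, epp_gates):
    """Return the name of the Epp gate with the highest F-measure.

    epp_gates: dict mapping a gate name to an iterable of event ids.
    """
    return max(epp_gates, key=lambda name: f_measure(known_gate, epp_gates[name]))
```

The event ids only make sense inside one sample, which is why this comparison cannot be used across samples.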
Subset matching
• Epp needs a mechanism that compares subsets
  • On different samples … even replicates …
  • Without aid of prior known gates
• Solution: QFMatch
  Match & align leaves of gating hierarchies on all relevant dimensions (regardless of whether the gating hierarchies are manual, semi-supervised or unsupervised)
QFMatch Step 2: Do probability binning on the merger
We use Mario Roederer’s probability binning method to summarize the merged samples because his method is both scalable to high dimensions and non-parametric.
M. Roederer, W. Moore, A. Treister, R. R. Hardy, and L. A. Herzenberg. Probability binning comparison: a metric for quantitating multivariate distribution differences. Cytometry, 45(1):47–55, September 2001.
Bin sizes are not equal but cell counts within them are equal; similarity of high dimensional variance determines each cell’s bin location.
QFMatch Step 3: Keep the same binning pattern for each sample
This means bin cell counts are no longer equal
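Steps 2 and 3 can be sketched as follows. The binning here follows the usual description of probability binning (recursively halve the merged events at the median of the highest-variance parameter), but it is a simplified sketch rather than the published algorithm or AutoGate's code; `probability_bins`, `per_sample_counts` and `min_events` are hypothetical names.

```python
import numpy as np

def probability_bins(events, min_events):
    """Recursively halve the merged events; returns a list of index arrays.

    Each split is made at the median of the parameter with the largest
    variance inside the current bin, so sibling bins hold equal counts.
    """
    def split(idx):
        if idx.size < 2 * min_events:
            return [idx]
        subset = events[idx]
        dim = int(np.argmax(subset.var(axis=0)))
        order = idx[np.argsort(subset[:, dim], kind="stable")]
        half = order.size // 2
        return split(order[:half]) + split(order[half:])
    return split(np.arange(events.shape[0]))

def per_sample_counts(bins, sample_labels):
    """Count each sample's events per bin, keeping the merged binning pattern."""
    labels = np.asarray(sample_labels)
    return {
        str(name): np.array([int((labels[b] == name).sum()) for b in bins])
        for name in np.unique(labels)
    }
```

The bins are equally populated only for the merged data; once the counts are taken per sample, they generally differ from bin to bin, which is what Step 3 points out.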
QFMatch Step 4: For each subset pair between samples, compute their dissimilarity using the quadratic form (QF) distance metric
https://pdfs.semanticscholar.org/b81a/d9c60add0b0101ebe5c34473fc8f0fa91724.pdf
$$D^2(h, f) = (h - f)^T A\,(h - f) = \sum_{i}\sum_{j} a_{ij}\,(h_i - f_i)(h_j - f_j)$$
where h and f are vectors of bins where the subset pair occurs within the superset of sample-merged bins. Each h and f bin value is the subset’s count, normalized so that $0 \le f_i, h_i \le 1$.
Here we define $a_{ij}$ as $1 - |dM_{ij}| / d_{\max}$, where $dM_{ij}$ is the distance between the centers of mass of the ith and jth bins, and $d_{\max}$ is the distance between the centers of mass of the two most distant bins.
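A minimal NumPy sketch of this dissimilarity is shown below. It assumes Euclidean distances between bin centers of mass and expects `h` and `f` to be normalized by the caller; the function name `qf_distance` is hypothetical and this is not AutoGate's implementation.

```python
import numpy as np

def qf_distance(h, f, bin_centers):
    """Quadratic form distance sqrt((h - f)^T A (h - f)).

    h, f: normalized per-bin values for the two subsets (same bin order).
    bin_centers: array of shape (n_bins, n_dims) with each bin's center of mass.
    """
    h = np.asarray(h, dtype=float)
    f = np.asarray(f, dtype=float)
    centers = np.asarray(bin_centers, dtype=float)
    # Pairwise Euclidean distances between bin centers (an assumption here).
    diffs = centers[:, None, :] - centers[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=2))
    d_max = d.max()
    # a_ij = 1 - d_ij / d_max: nearby bins count as similar.
    a = 1.0 - d / d_max if d_max > 0 else np.ones_like(d)
    delta = h - f
    # Clip tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(delta @ a @ delta, 0.0)))
```

Unlike a bin-by-bin comparison, the matrix A gives partial credit when two subsets fall into neighboring bins, so a small shift between samples yields a small distance.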
QFMatch
Step 5: Set a row’s match to the column with the lowest dissimilarity
Step 6: If 2+ columns are most similar to the same row, then merge them to see if a better QF dissimilarity indicates that subsets either split or vanish between samples.
[Figure panels: Sample A, Sample B]
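The bookkeeping behind Steps 5 and 6 might look like the sketch below, assuming a dissimilarity matrix whose rows are subsets of one sample and whose columns are subsets of the other. The function name `match_subsets` is hypothetical, and the actual merge and re-scoring of candidate columns is left out.

```python
from collections import defaultdict
import numpy as np

def match_subsets(dissimilarity):
    """Return (row_match, merge_candidates) for a rows-by-columns QF matrix.

    row_match[r] is the column with the lowest dissimilarity to row r.
    merge_candidates maps a row to the 2+ columns that all prefer that row;
    these are the columns one would merge and re-score in Step 6.
    """
    d = np.asarray(dissimilarity, dtype=float)
    row_match = d.argmin(axis=1).tolist()
    groups = defaultdict(list)
    for col, row in enumerate(d.argmin(axis=0)):
        groups[int(row)].append(col)
    merge_candidates = {row: cols for row, cols in groups.items() if len(cols) >= 2}
    return row_match, merge_candidates
```

A row with several candidate columns suggests that one subset in the first sample may correspond to a split population in the second.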
Subset visualization: HiD Subset View
• AutoGate needs a way to quickly see any subset grouping’s HiD relatedness
  • The gating tree shows too many parent gates to capture HiD
  • The 1D PathFinder is better but not quick for many subsets
• Solution? Use conventional “multidimensional scaling” to fit final “leaf” gates (Epp or non-Epp) into a single 2D composite view of symbols where the symbol’s
  • Size reflects frequency of gated events
  • Position reflects similarity of parameter expression
  • Face color reflects QFMatch
  • Shape & border color reflect QFMatch quality in terms of standard deviation unit distance
https://www.coursera.org/learn/datavisualization/lecture/6ZQop/3-2-2-multidimensional-scaling
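For readers unfamiliar with multidimensional scaling, the sketch below implements the classical (Torgerson) variant with NumPy. Which MDS variant AutoGate uses, and how it measures distance between leaf gates, is not stated on the slide, so both the variant and the name `classical_mds` are assumptions.

```python
import numpy as np

def classical_mds(distances, n_components=2):
    """Embed items in n_components dimensions from a pairwise distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    # Keep the largest eigenvalues; clip negatives from non-Euclidean input.
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale
```

Given one row of the distance matrix per leaf gate (for example, distances between the gates' median expression vectors, which is an assumption), the two output columns can serve as the symbol positions, with size and color layered on top.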
HiD Subset View: Show subset matching in HiD to 2D view
Comparing Epp gates to known (non-Epp) gates …
HiD Subset View: Show subset matching in HiD to 2D view
Comparing Epp gates in different mouse strain samples …
HiD Subset View: Tracking quality of match by means/std deviations
Means >3 std deviation units are flagged
Conclusions
• AutoGate’s new methods have been run on a wide variety of data sets at the Herzenberg Lab
• In every case these methods show promise by
  • Confirming most of what is known
  • Causing us to notice new things
  • Being more reproducible than other subset-identification methods based on clustering in Hi-D space, or based on dimension-reduction techniques like tSNE
• But our optimism is preliminary; we need YOUR datasets
• So please download and use it for FREE from CytoGenie.org
• Being research software, AutoGate’s improvements and documentation are rapidly updated & released
• In exchange, contact us and tell us how it has been helpful and how it can improve
Thank You! Stephen Meehan Herzenberg Lab Stanford University School of Medicine Genetics Department Beckman Center, Room B013 279 Campus Drive, Stanford, CA 94305 swmeehan@stanford.edu
Stephen Meehan
• Has worked with Stanford University’s Herzenberg Lab since October 2000
• But has maintained his primary residence in Vancouver since 1961
• Oh … by the way … next year’s CYTO 2019 is in VANCOUVER !!
  • Where EVERYONE is welcome
  • Where there is great food, great wine, great beer, great friendships and great science!
• Be sure to bring your family and book a holiday around our wonderful city.
• Visit Whistler mountain, visit Vancouver Island, visit the Okanagan wine country, go whale watching, go salmon fishing … maybe take a cruise to Alaska … Enjoy food and culture from all over the world
The Herzenberg Lab