
### Scaling Multivariate Statistics and Algorithms to Massive Data: Approaches and Challenges ###

Explore the core methods of statistics, machine learning, and data mining for dealing with massive datasets. Learn about querying, density estimation, regression, classification, dimension reduction, outlier detection, clustering, time series analysis, feature selection, causality, fusion, and matching. Delve into efficient algorithms for various analytical tasks and understand the computational bottlenecks affecting scalability. Discover the "7 Giants" of data, including basic statistics, N-body problems, graph-theoretic problems, linear-algebraic problems, optimizations, integrations, and alignment problems. Uncover strategies to navigate challenges posed by vast datasets in a fast-paced digital world.


Presentation Transcript


  1. Scaling Multivariate Statistics to Massive Data
  Algorithmic problems and approaches
  Alexander Gray, Georgia Institute of Technology
  www.fast-lab.org

  2. Core methods of statistics / machine learning / mining
  • Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
  • Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)
  • Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3)
  • Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3)
  • Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
  • Outlier detection: by density estimation or dimension reduction
  • Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3)
  • Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n)
  • Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
  • Fusion and matching: sequence alignment, bipartite matching O(N^3), n-point correlation 2-sample testing O(N^n)
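
The quadratic costs above come from all-pairs computations. As a hedged illustration (not code from the talk), the sketch below evaluates a Gaussian kernel density estimate by brute force in NumPy: every query point is compared against every reference point, which is exactly the O(N^2) kernel summation that the tree-based methods on the next slides are designed to avoid.

```python
import numpy as np

def kde_brute_force(queries, references, bandwidth=0.5):
    """Gaussian KDE evaluated naively: O(N_q * N_r) kernel evaluations."""
    # Pairwise squared distances between every query and every reference point.
    d2 = ((queries[:, None, :] - references[None, :, :]) ** 2).sum(axis=-1)
    weights = np.exp(-0.5 * d2 / bandwidth**2)
    # Normalizing constant of a d-dimensional Gaussian kernel.
    d = references.shape[1]
    norm = (2.0 * np.pi * bandwidth**2) ** (d / 2.0)
    return weights.sum(axis=1) / (norm * len(references))

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 2))      # reference points (illustrative size)
densities = kde_brute_force(x, x)   # O(N^2): 4 million kernel evaluations
```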

  3. Now pretty fast (2011)…
  • Querying: spherical range-search O(log N)*, orthogonal range-search O(log N)*, spatial join O(N)*, nearest-neighbor O(log N), all-nearest-neighbors O(N)
  • Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N^(log 3))*
  • Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*
  • Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine
  • Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*
  • Outlier detection: by density estimation or dimension reduction
  • Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical clustering O(N log N)
  • Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^(log n))*
  • Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
  • Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N^(log n))*

  4. Things we made fast (fastest / fastest in some settings)
  • Querying: spherical range-search O(log N)*, orthogonal range-search O(log N)*, spatial join O(N)*, nearest-neighbor O(log N), all-nearest-neighbors O(N)
  • Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N^(log 3))*
  • Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*
  • Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine O(N)/O(N^2)
  • Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*
  • Outlier detection: by density estimation or dimension reduction
  • Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical (FoF) clustering O(N log N)
  • Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^(log n))*
  • Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
  • Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N^(log n))*

  5. Core computational problems
  What are the basic mathematical operations making things hard?
  • Alternative to speeding up each of the 1000s of statistical methods: treat the common computational bottlenecks
  • Divide up the space of problems (and associated algorithmic strategies), so we can examine the unique challenges and possible ways forward within each

  6. The “7 Giants” of data
  • Basic statistics
  • Generalized N-body problems
  • Graph-theoretic problems
  • Linear-algebraic problems
  • Optimizations
  • Integrations
  • Alignment problems

  7. The “7 Giants” of data
  1. Basic statistics
  • e.g. counts, contingency tables, means, medians, variances, range queries (SQL queries)
  2. Generalized N-body problems
  • e.g. nearest-neighbors (in NLDR, etc.), kernel summations (in KDE, GP, SVM, etc.), clustering, MST, spatial correlations

  8. The “7 Giants” of data
  3. Graph-theoretic problems
  • e.g. betweenness centrality, commute distance, graphical model inference
  4. Linear-algebraic problems
  • e.g. linear algebra, PCA, Gaussian process regression, manifold learning
  5. Optimizations
  • e.g. LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing

  9. The “7 Giants” of data
  6. Integrations
  • e.g. Bayesian inference
  7. Alignment problems
  • e.g. BLAST in genomics, string matching, phylogenies, SLAM, cross-match

  10. Back to our list (basic, N-body, graphs, linear algebra, optimization, integration, alignment)
  • Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
  • Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)
  • Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3)
  • Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3)
  • Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
  • Outlier detection: by density estimation or dimension reduction
  • Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3)
  • Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n)
  • Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
  • Fusion and matching: sequence alignment, bipartite matching O(N^3), n-point correlation 2-sample testing O(N^n)

  11. 5 settings
  • “Regular”: batch, in-RAM/core, one CPU
  • Streaming (non-batch)
  • Disk (out-of-core)
  • Distributed: threads/multi-core (shared memory)
  • Distributed: clusters/cloud (distributed memory)

  12. 4 common data types
  • Vector data, iid
  • Time series
  • Images
  • Graphs

  13. 3 desiderata
  • Fast experimental runtime/performance*
  • Fast theoretical (provable) runtime/performance*
  • Accuracy guarantees
  *Performance: runtime, memory, communication, disk accesses; time-constrained, anytime, etc.

  14. 7 general solution strategies
  • Divide and conquer (indexing structures)
  • Dynamic programming
  • Function transforms
  • Random sampling (Monte Carlo)
  • Non-random sampling (active learning)
  • Parallelism
  • Problem reduction
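
As one concrete instance of the "random sampling (Monte Carlo)" strategy, the hedged sketch below estimates a large kernel summation from a random subsample and reports a standard error, trading a small, quantified loss of accuracy for a large reduction in work. The function name, sample size, and kernel are illustrative assumptions, not material from the talk.

```python
import numpy as np

def monte_carlo_kernel_sum(x_query, references, bandwidth, m=1000, seed=0):
    """Estimate sum_j exp(-||x - r_j||^2 / (2 h^2)) from m sampled reference points."""
    rng = np.random.default_rng(seed)
    n = len(references)
    idx = rng.choice(n, size=min(m, n), replace=False)
    d2 = ((references[idx] - x_query) ** 2).sum(axis=1)
    contrib = np.exp(-0.5 * d2 / bandwidth**2)
    estimate = n * contrib.mean()                       # scale sample mean up to the full sum
    std_err = n * contrib.std(ddof=1) / np.sqrt(len(idx))
    return estimate, std_err

rng = np.random.default_rng(1)
refs = rng.normal(size=(200_000, 3))
est, err = monte_carlo_kernel_sum(np.zeros(3), refs, bandwidth=1.0)
```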

  15. 1. Summary statistics
  • Examples: counts, contingency tables, means, medians, variances, range queries (SQL queries)
  • What’s unique/challenges: streaming, new guarantees
  • Promising/interesting:
    • Sketching approaches
    • AD-trees
    • MapReduce/Hadoop (Aster, Greenplum, Netezza)
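
The streaming challenge called out above can be handled with one-pass update formulas. As a minimal sketch (my illustration, not the slide's own code), the class below maintains a running count, mean, and variance using Welford's update, so each data point is touched once and never stored.

```python
class RunningStats:
    """One-pass (streaming) count, mean, and variance via Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.push(value)
# stats.mean == 5.0, stats.variance() == 32/7
```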

  16. 2. Generalized N-body problems
  • Examples: nearest-neighbors (in NLDR, etc.), kernel summations (in KDE, GP, SVM, etc.), clustering, MST, spatial correlations
  • What’s unique/challenges: general dimension, non-Euclidean, new guarantees (e.g. in rank)
  • Promising/interesting:
    • Generalized/higher-order FMM: O(N^2) → O(N)
    • Random projections
    • GPUs
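
To illustrate a tree-based attack on one N-body problem, the sketch below uses SciPy's kd-tree to compute all-nearest-neighbors. This is a standard single-tree illustration of my own, not the dual-tree or FMM-style algorithms the slide alludes to, and it works best in low to moderate dimension.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.uniform(size=(100_000, 3))

# Build a kd-tree (divide and conquer over space), then query every point.
tree = cKDTree(points)
# k=2 because each point's nearest neighbor in the same set is itself at distance 0.
distances, indices = tree.query(points, k=2)
nn_index = indices[:, 1]       # index of the true nearest neighbor
nn_distance = distances[:, 1]  # distance to it

# The brute-force alternative would compare all ~10^10 pairs.
```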

  17. 3. Graph-theoretic problems
  • Examples: betweenness centrality, commute distance, graphical model inference
  • What’s unique/challenges: high interconnectivity (cliques), out-of-core
  • Promising/interesting:
    • Variational methods
    • Stochastic composite likelihood methods
    • MapReduce/Hadoop (Facebook, etc.)
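
Exact betweenness centrality (Brandes' algorithm) runs one shortest-path computation per node; one common scaling tactic is to sample source nodes and extrapolate. The NetworkX sketch below shows that sampling-based approximation; the graph model and sample size are my own illustrative choices, not anything prescribed by the slide.

```python
import networkx as nx

# A synthetic sparse graph standing in for a large real network.
G = nx.barabasi_albert_graph(n=5000, m=3, seed=0)

# Exact betweenness does one shortest-path pass per node.
# Passing k samples only k source nodes and rescales, trading accuracy for speed.
approx_bc = nx.betweenness_centrality(G, k=200, seed=0, normalized=True)

top5 = sorted(approx_bc, key=approx_bc.get, reverse=True)[:5]
```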

  18. 4. Linear-algebraic problems
  • Examples: linear algebra, PCA, Gaussian process regression, manifold learning
  • What’s unique/challenges: probabilistic guarantees, kernel matrices
  • Promising/interesting:
    • Sampling-based methods
    • Online methods
    • Approximate matrix-vector multiply via N-body
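
One of the sampling-based methods hinted at above is the randomized range finder for low-rank factorization (Halko–Martinsson–Tropp style). The NumPy sketch below is a generic illustration of that idea under assumed sizes, not code from the talk.

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    """Approximate truncated SVD via a random range finder."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # Sketch the column space of A with a thin Gaussian test matrix.
    omega = rng.standard_normal((n, rank + oversample))
    Q, _ = np.linalg.qr(A @ omega)             # orthonormal basis for the sketch
    B = Q.T @ A                                # small (rank + oversample) x n matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank]

rng = np.random.default_rng(1)
A = rng.standard_normal((5000, 30)) @ rng.standard_normal((30, 2000))  # rank-30 matrix
U, s, Vt = randomized_svd(A, rank=30)
```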

  19. 5. Optimizations
  • Examples: LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing
  • What’s unique/challenges: stochastic programming, streaming
  • Promising/interesting:
    • Reformulations/relaxations of various ML forms
    • Online, mini-batch methods
    • Parallel online methods
    • Submodular functions
    • Global optimization (non-convex)
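
The "online, mini-batch methods" bullet is easy to make concrete: the sketch below runs mini-batch stochastic gradient descent on a least-squares objective, touching only a small batch of rows per step instead of the full dataset. The step size, batch size, and epoch count are illustrative assumptions, not recommendations from the slides.

```python
import numpy as np

def minibatch_sgd_least_squares(X, y, batch_size=128, lr=0.01, epochs=5, seed=0):
    """Minimize ||Xw - y||^2 / N with mini-batch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad
    return w

rng = np.random.default_rng(2)
X = rng.standard_normal((100_000, 10))
true_w = rng.standard_normal(10)
y = X @ true_w + 0.1 * rng.standard_normal(100_000)
w_hat = minibatch_sgd_least_squares(X, y)   # approaches true_w without full-batch passes
```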

  20. 6. Integrations
  • Examples: Bayesian inference
  • What’s unique/challenges: general dimension
  • Promising/interesting:
    • MCMC
    • ABC (approximate Bayesian computation)
    • Particle filtering
    • Adaptive importance sampling, active learning
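
As a minimal stand-in for the MCMC bullet, the sketch below runs a random-walk Metropolis sampler on an unnormalized one-dimensional log-density. The target distribution, proposal scale, chain length, and burn-in are all illustrative assumptions.

```python
import numpy as np

def log_target(x):
    """Unnormalized log-density: a two-component Gaussian mixture."""
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def metropolis(log_p, n_samples=50_000, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal()
        # Accept with probability min(1, p(proposal)/p(x)); log scale for stability.
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal
        samples[i] = x
    return samples

draws = metropolis(log_target)
posterior_mean = draws[5_000:].mean()   # discard burn-in before summarizing
```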

  21. 7. Alignments
  • Examples: BLAST in genomics, string matching, phylogenies, SLAM, cross-match
  • What’s unique/challenges: greater heterogeneity, measurement errors
  • Promising/interesting:
    • Probabilistic representations
    • Reductions to generalized N-body problems
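
Alignment problems are classically solved by dynamic programming, one of the seven general strategies listed earlier. The sketch below computes a global Needleman–Wunsch alignment score for two short strings; the match, mismatch, and gap scores are arbitrary choices for illustration, not values from the talk.

```python
import numpy as np

def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score via the standard O(len(a) * len(b)) DP recurrence."""
    n, m = len(a), len(b)
    score = np.zeros((n + 1, m + 1))
    score[:, 0] = gap * np.arange(n + 1)   # align a prefix of `a` against gaps only
    score[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1, j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i, j] = max(diag, score[i - 1, j] + gap, score[i, j - 1] + gap)
    return score[n, m]

print(needleman_wunsch_score("GATTACA", "GCATGCU"))
```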

  22. Reductions/transformations between problems
  • Gaussian graphical models → linear algebra
  • Bayesian integration → MAP optimization
  • Euclidean graphs → N-body problems
  • Linear algebra on kernel matrices → N-body inside conjugate gradient
  • Can featurize a graph or any other structure → matrix-based ML problem
  • Create new ML methods with different computational properties
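
The "linear algebra on kernel matrices → N-body inside conjugate gradient" reduction can be sketched as follows: conjugate gradient only ever needs matrix-vector products with the kernel matrix, and each product is a kernel summation that a fast N-body method could accelerate. In the hedged sketch below the summation is done directly for clarity; the kernel, bandwidth, and ridge value are assumptions of mine.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = rng.standard_normal(1000)
bandwidth, ridge = 1.0, 1e-2

def kernel_matvec(v):
    # Compute (K + ridge*I) @ v, where K_ij = exp(-||x_i - x_j||^2 / (2 h^2)).
    # Each call is one kernel summation per point: exactly the generalized N-body
    # problem that tree/FMM-style methods evaluate quickly instead of this
    # direct O(N^2) pass, done here only for clarity.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    K = np.exp(-0.5 * d2 / bandwidth ** 2)
    return K @ v + ridge * v

A = LinearOperator((len(X), len(X)), matvec=kernel_matvec, dtype=float)
alpha, info = cg(A, y)   # kernel ridge regression weights without factorizing K
```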

  23. General conclusions
  • Algorithms can dramatically change the runtime order, e.g. O(N^2) to O(N)
  • High dimensionality is a persistent challenge
  • The non-default settings (e.g. streaming, disk) need more research work
  • Systems issues need more work, e.g. the connection to data storage/management
  • Hadoop does not solve everything

  24. General conclusions
  • There is no general theory yet for the tradeoff between statistical quality and computational cost (lower/upper bounds, etc.)
  • Richer notions of hardness, both statistical and computational, are needed
