Scaling Multivariate Statistics to Massive Data: Algorithmic problems and approaches
Alexander Gray, Georgia Institute of Technology
www.fast-lab.org
Core methods of statistics / machine learning / mining
• Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
• Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)
• Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3)
• Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3)
• Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
• Outlier detection: by density estimation or dimension reduction
• Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3)
• Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n)
• Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
• Fusion and matching: sequence alignment, bipartite matching O(N^3), n-point correlation 2-sample testing O(N^n)
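To make the quadratic entries above concrete, here is a minimal NumPy sketch (illustrative only, not code from the talk) of naive kernel density estimation: every one of the query points must visit all N reference points, hence O(N^2) for N queries against N references.

```python
import numpy as np

def kde_naive(queries, references, h):
    """Gaussian KDE estimate at each query; O(N_q * N_r) kernel evaluations."""
    dens = np.zeros(len(queries))
    for i, q in enumerate(queries):                  # outer loop: each query point
        d2 = np.sum((references - q) ** 2, axis=1)   # inner work: all N_r references
        dens[i] = np.mean(np.exp(-d2 / (2 * h * h)))
    return dens / (2 * np.pi * h * h) ** (references.shape[1] / 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
print(kde_naive(X[:5], X, h=0.5))   # density estimates at the first 5 points
```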
Now pretty fast (2011)…
• Querying: spherical range-search O(log N)*, orthogonal range-search O(log N)*, spatial join O(N)*, nearest-neighbor O(log N), all-nearest-neighbors O(N)
• Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N log^3 N)*
• Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*
• Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine
• Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*
• Outlier detection: by density estimation or dimension reduction
• Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical clustering O(N log N)
• Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^(log n))*
• Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
• Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N^(log n))*
Things we made fast (fastest, or fastest in some settings)
• Querying: spherical range-search O(log N)*, orthogonal range-search O(log N)*, spatial join O(N)*, nearest-neighbor O(log N), all-nearest-neighbors O(N)
• Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N log^3 N)*
• Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*
• Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine O(N)/O(N^2)
• Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*
• Outlier detection: by density estimation or dimension reduction
• Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical (FoF) clustering O(N log N)
• Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^(log n))*
• Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
• Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N^(log n))*
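As an illustration of the tree-based speedups claimed above, the sketch below uses SciPy's off-the-shelf kd-tree (not the fast-lab implementations) to answer nearest-neighbor queries in roughly O(log N) each after an O(N log N) build, and checks one answer against O(N) brute force.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
X = rng.uniform(size=(100_000, 3))   # reference set
q = rng.uniform(size=(10, 3))        # query points

tree = cKDTree(X)                    # O(N log N) build
dist, idx = tree.query(q, k=1)       # ~O(log N) per query

# brute-force check on the first query: O(N) per query
assert idx[0] == np.argmin(np.sum((X - q[0]) ** 2, axis=1))
print(dist)
```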
Core computational problems
What are the basic mathematical operations making things hard?
• An alternative to speeding up each of the thousands of statistical methods one at a time: treat the common computational bottlenecks
• Divide up the space of problems (and associated algorithmic strategies), so we can examine the unique challenges and possible ways forward within each
The “7 Giants” of data
1. Basic statistics
2. Generalized N-body problems
3. Graph-theoretic problems
4. Linear-algebraic problems
5. Optimizations
6. Integrations
7. Alignment problems
The “7 Giants” of data
1. Basic statistics
• e.g. counts, contingency tables, means, medians, variances, range queries (SQL queries)
2. Generalized N-body problems
• e.g. nearest-neighbors (in NLDR, etc.), kernel summations (in KDE, GP, SVM, etc.), clustering, MST, spatial correlations
The “7 Giants” of data
3. Graph-theoretic problems
• e.g. betweenness centrality, commute distance, graphical model inference
4. Linear-algebraic problems
• e.g. linear algebra, PCA, Gaussian process regression, manifold learning
5. Optimizations
• e.g. LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing
The “7 Giants” of data
6. Integrations
• e.g. Bayesian inference
7. Alignment problems
• e.g. BLAST in genomics, string matching, phylogenies, SLAM, cross-match
Back to our list (basic, N-body, graphs, linear algebra, optimization, integration, alignment)
• Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
• Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)
• Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3)
• Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3)
• Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
• Outlier detection: by density estimation or dimension reduction
• Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3)
• Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n)
• Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
• Fusion and matching: sequence alignment, bipartite matching O(N^3), n-point correlation 2-sample testing O(N^n)
5 settings
• “Regular”: batch, in-RAM/core, one CPU
• Streaming (non-batch)
• Disk (out-of-core)
• Distributed: threads/multi-core (shared memory)
• Distributed: clusters/cloud (distributed memory)
4 common data types
• Vector data, iid
• Time series
• Images
• Graphs
3 desiderata
• Fast experimental runtime/performance*
• Fast theoretical (provable) runtime/performance*
• Accuracy guarantees
*Performance: runtime, memory, communication, disk accesses; time-constrained, anytime, etc.
7 general solution strategies
• Divide and conquer (indexing structures)
• Dynamic programming
• Function transforms
• Random sampling (Monte Carlo)
• Non-random sampling (active learning)
• Parallelism
• Problem reduction
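As a small example of strategy 4 (random sampling / Monte Carlo), the sketch below estimates a sum over a million points from a 2,000-point uniform subsample, with a standard-error bar quantifying the accuracy traded for speed. The data and per-point function are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1_000_000)
f = np.exp(-x ** 2)                  # per-point contributions; exact sum is O(N)

m = 2_000                            # sample size << N
sample = rng.choice(f, size=m, replace=False)
est = f.size * sample.mean()         # unbiased estimate of the full sum
se = f.size * sample.std(ddof=1) / np.sqrt(m)   # standard error of the estimate

print(f"estimate {est:.0f} +/- {se:.0f}, exact {f.sum():.0f}")
```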
1. Summary statistics
• Examples: counts, contingency tables, means, medians, variances, range queries (SQL queries)
• What’s unique/challenges: streaming, new guarantees
• Promising/interesting:
  • Sketching approaches
  • AD-trees
  • MapReduce/Hadoop (Aster, Greenplum, Netezza)
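A minimal example of the streaming setting for basic statistics: Welford's classic one-pass algorithm maintains the mean and variance in O(1) memory, never storing the stream. (Standard textbook material, not code from the talk.)

```python
def welford(stream):
    """One pass, O(1) memory: running mean and sample variance."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)     # uses the already-updated mean
    return mean, m2 / (n - 1)

import random
random.seed(3)
data = (random.gauss(5.0, 2.0) for _ in range(100_000))
print(welford(data))                 # ~ (5.0, 4.0)
```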
2. Generalized N-body problems
• Examples: nearest-neighbors (in NLDR, etc.), kernel summations (in KDE, GP, SVM, etc.), clustering, MST, spatial correlations
• What’s unique/challenges: general dimension, non-Euclidean, new guarantees (e.g. in rank)
• Promising/interesting:
  • Generalized/higher-order FMM: O(N^2) → O(N)
  • Random projections
  • GPUs
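As a sketch of the random-projections idea, the snippet below applies a Gaussian Johnson-Lindenstrauss map from 500 to 50 dimensions; pairwise distances, and hence neighbor structure, are approximately preserved, which is what makes the trick useful inside N-body methods. Sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, k = 2_000, 500, 50
X = rng.normal(size=(N, d))

R = rng.normal(size=(d, k)) / np.sqrt(k)   # scaled Gaussian projection matrix
Y = X @ R                                  # the same N points in k dims

i, j = 0, 1
orig = np.linalg.norm(X[i] - X[j])
proj = np.linalg.norm(Y[i] - Y[j])
print(orig, proj)                          # agree to roughly 1/sqrt(k)
```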
3. Graph-theoretic problems
• Examples: betweenness centrality, commute distance, graphical model inference
• What’s unique/challenges: high interconnectivity (cliques), out-of-core
• Promising/interesting:
  • Variational methods
  • Stochastic composite likelihood methods
  • MapReduce/Hadoop (Facebook, etc.)
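For contrast with the hard cases, here is graphical-model inference where it is easy: exact sum-product message passing on a chain (a tree) computes every marginal in O(n s^2). The potentials are random placeholders; real instances become hard once the graph has cliques and loops, which is where the variational methods above come in.

```python
import numpy as np

n, s = 6, 3                                # 6 variables, 3 states each
rng = np.random.default_rng(5)
unary = rng.uniform(0.5, 1.5, size=(n, s)) # phi_i(x_i)
pair = rng.uniform(0.5, 1.5, size=(s, s))  # psi(x_i, x_{i+1}), shared

fwd = np.ones((n, s))                      # messages passed left-to-right
for i in range(1, n):
    fwd[i] = (fwd[i - 1] * unary[i - 1]) @ pair
bwd = np.ones((n, s))                      # messages passed right-to-left
for i in range(n - 2, -1, -1):
    bwd[i] = pair @ (bwd[i + 1] * unary[i + 1])

marg = fwd * unary * bwd                   # exact node marginals, unnormalized
marg /= marg.sum(axis=1, keepdims=True)
print(marg)                                # each row sums to 1
```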
4. Linear-algebraic problems
• Examples: linear algebra, PCA, Gaussian process regression, manifold learning
• What’s unique/challenges: probabilistic guarantees, kernel matrices
• Promising/interesting:
  • Sampling-based methods
  • Online methods
  • Approximate matrix-vector multiply via N-body
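A minimal sketch of one sampling-based linear-algebra method, in the spirit of the Halko-Martinsson-Tropp randomized SVD (chosen here as a representative example): project onto a random subspace, factor the small problem, and lift back.

```python
import numpy as np

rng = np.random.default_rng(6)
# synthetic, nearly rank-20 matrix: 5000 x 300
A = rng.normal(size=(5_000, 20)) @ rng.normal(size=(20, 300))
A += 0.01 * rng.normal(size=A.shape)

k, p = 20, 10                              # target rank plus oversampling
G = rng.normal(size=(A.shape[1], k + p))   # random test matrix
Q, _ = np.linalg.qr(A @ G)                 # orthonormal basis for range(A G)
B = Q.T @ A                                # small (k+p) x 300 problem
Ub, sv, Vt = np.linalg.svd(B, full_matrices=False)
U = Q @ Ub                                 # lift back to the big space

approx = (U[:, :k] * sv[:k]) @ Vt[:k]
print(np.linalg.norm(A - approx) / np.linalg.norm(A))   # small relative error
```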
5. Optimizations
• Examples: LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing
• What’s unique/challenges: stochastic programming, streaming
• Promising/interesting:
  • Reformulations/relaxations of various ML forms
  • Online, mini-batch methods
  • Parallel online methods
  • Submodular functions
  • Global optimization (non-convex)
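A small sketch of the online/mini-batch idea: stochastic gradient descent pays for a 64-point batch per update rather than all N points, so the per-step cost is independent of N. Least squares is chosen only as a simple stand-in objective.

```python
import numpy as np

rng = np.random.default_rng(7)
N, d = 100_000, 10
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(d)
batch, lr = 64, 0.05
for step in range(2_000):
    idx = rng.integers(0, N, size=batch)          # sample a mini-batch
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
    w -= lr * grad                                # O(batch * d) per step, not O(N d)

print(np.linalg.norm(w - w_true))                 # small residual error
```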
6. Integrations
• Examples: Bayesian inference
• What’s unique/challenges: general dimension
• Promising/interesting:
  • MCMC
  • ABC (approximate Bayesian computation)
  • Particle filtering
  • Adaptive importance sampling, active learning
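A minimal MCMC sketch: a random-walk Metropolis sampler estimates a posterior mean using only an unnormalized log-density, sidestepping the normalizing integral. The target is a toy Gaussian, not a model from the talk.

```python
import numpy as np

def log_post(x):                        # unnormalized log posterior: N(3, 1)
    return -0.5 * (x - 3.0) ** 2

rng = np.random.default_rng(8)
x, samples = 0.0, []
for _ in range(20_000):
    prop = x + rng.normal(scale=1.0)    # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(x):
        x = prop                        # accept; otherwise keep current state
    samples.append(x)

print(np.mean(samples[5_000:]))         # ~3.0 after discarding burn-in
```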
7. Alignments
• Examples: BLAST in genomics, string matching, phylogenies, SLAM, cross-match
• What’s unique/challenges: greater heterogeneity, measurement errors
• Promising/interesting:
  • Probabilistic representations
  • Reductions to generalized N-body problems
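To make this giant concrete, here is the classic Needleman-Wunsch dynamic program for global sequence alignment: an exact O(mn) table fill, the kind of cost that heuristic tools like BLAST are designed to avoid on genome-scale data. Scoring parameters are illustrative.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score via dynamic programming; O(len(a)*len(b))."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap               # align prefix of a against all gaps
    for j in range(1, n + 1):
        D[0][j] = j * gap               # align prefix of b against all gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + sub,   # match/substitute
                          D[i - 1][j] + gap,       # gap in b
                          D[i][j - 1] + gap)       # gap in a
    return D[m][n]

print(align_score("GATTACA", "GCATGCA"))
```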
Reductions/transformations between problems
• Gaussian graphical models → linear algebra
• Bayesian integration → MAP optimization
• Euclidean graphs → N-body problems
• Linear algebra on kernel matrices → N-body inside conjugate gradient
• Can featurize a graph or any other structure → matrix-based ML problem
• Create new ML methods with different computational properties
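The kernel-matrix reduction is worth spelling out: solving the Gaussian-process system (K + λI)α = y by conjugate gradient needs only matrix-vector products with K, and each such product is exactly the kernel summation an N-body method accelerates. In this sketch the matvec is done naively for clarity; the data and kernel bandwidth are made up.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(9)
N = 500
X = rng.normal(size=(N, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)
lam, h = 1e-2, 1.0

# Gaussian kernel matrix; a fast method would never materialize this
d2 = np.sum((X[:, None] - X[None]) ** 2, axis=-1)
K = np.exp(-d2 / (2 * h * h))

def matvec(v):
    # the bottleneck: K @ v is a kernel summation, i.e. an N-body problem
    return K @ v + lam * v

A = LinearOperator((N, N), matvec=matvec, dtype=float)
alpha, info = cg(A, y)                  # CG only ever calls matvec
print(info, np.linalg.norm(K @ alpha + lam * alpha - y))   # info 0 = converged
```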
General conclusions
• Algorithms can dramatically change the runtime order, e.g. O(N^2) to O(N)
• High dimensionality is a persistent challenge
• The non-default settings (e.g. streaming, disk) need more research work
• Systems issues need more work, e.g. the connection to data storage/management
• Hadoop does not solve everything
General conclusions
• There is no general theory yet for the tradeoff between statistical quality and computational cost (lower/upper bounds, etc.)
• Richer characterizations of hardness, both statistical and computational, are needed