Fast PCA and Bayesian Variable Selection for Large Data Sets Based on SQL and UDFs. Mario Navas, Carlos Ordonez, Veerabhadran Baladandayuthapani. KDD-LDMTA’10, July 25, 2010.
Introduction • Efficient computation on very large data sets for data mining (DM), machine learning (ML), and statistics. • Most of the work is done outside the DBMS. • External tools: • Process flat files (too big for RAM). • Are limited by disk I/O. • Pose security risks. • DBMSs are extensible through SQL and user-defined functions (UDFs).
Contributions • Extend the DBMS to perform PCA and Bayesian variable selection on large data sets. • Only one scan of the input table. • Exact results for the generated models. • Exploit multi-threaded aggregate functions for data summarization. • Optimal use of hardware resources: memory and multiple processors.
Preliminaries • Definitions • Input data X = {x_1, x_2, …, x_n}, with d dimensions. • Y is the response, or variable of interest. • X_a is the a-th dimension. • Data Summarization • Summarizes the essential matrix products XX^T, XY^T, YY^T, and Y 1_n in the statistics n, L, and Q.
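The summarization step can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes the usual definitions, in which n is the record count, L is the per-dimension sum of the records, and Q is the sum of their outer products (so Q corresponds to XX^T). The function name `summarize` and the sample rows are hypothetical.

```python
def summarize(rows):
    """Compute (n, L, Q) in a single pass over rows of d numbers each.

    Assumed definitions: n = record count, L[a] = sum of dimension a,
    Q[a][b] = sum over records of x[a] * x[b].
    """
    n, L, Q = 0, None, None
    for row in rows:
        if L is None:
            # Allocate the accumulators once the dimensionality is known.
            d = len(row)
            L = [0.0] * d
            Q = [[0.0] * d for _ in range(d)]
        n += 1
        for a in range(d):
            L[a] += row[a]
            for b in range(d):
                Q[a][b] += row[a] * row[b]
    return n, L, Q


# Hypothetical sample data: three records with d = 2.
rows = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
n, L, Q = summarize(rows)
```

Each record is read exactly once, and the size of the result depends only on d, not on n. If Y is appended to each record as an extra column, the same pass also yields the XY^T and YY^T terms.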
Preliminaries • Dimensionality reduction with PCA • Find a rotation U that projects the input data onto its principal components. • Bayesian Variable Selection • Stochastic search for the explanatory variables that best predict the output Y in the regression Y = β_0 + β_1 X_1 + … + β_d X_d. • Generate a sequence of S samples γ^(1), …, γ^(S) to estimate π(γ|Y,X), where X_a is part of the model M_γ if γ_a = 1.
Algorithms • Two-phase algorithms: • Summarization • Model computation • Large-scale processing with SQL and UDFs. • Hardware optimizations are considered.
Efficient Computation of PCA • The covariance and correlation matrices are computed from n, L, and Q. • The SVD of the correlation or covariance matrix yields the principal components.
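As a rough sketch of this model-computation phase, the code below derives both matrices from the summaries alone. It assumes L is the per-dimension sum, Q is the sum of outer products, and the covariance is the population version (divided by n); these are assumptions, not details stated in the slides. To stay dependency-free it replaces the SVD with plain power iteration, which recovers only the leading principal component. All function names are hypothetical.

```python
import math


def covariance(n, L, Q):
    """Population covariance: Q/n minus the outer product of the means."""
    d = len(L)
    return [[Q[a][b] / n - (L[a] / n) * (L[b] / n) for b in range(d)]
            for a in range(d)]


def correlation(n, L, Q):
    """Pearson correlation computed directly from n, L, and Q."""
    d = len(L)
    # n*Q[a][a] - L[a]^2 equals n^2 times the variance of dimension a.
    s = [math.sqrt(n * Q[a][a] - L[a] ** 2) for a in range(d)]
    return [[(n * Q[a][b] - L[a] * L[b]) / (s[a] * s[b]) for b in range(d)]
            for a in range(d)]


def leading_component(C, iterations=100):
    """Power iteration: a stand-in for SVD that finds one component only.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(C)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance explained by this component.
    eigenvalue = sum(v[a] * sum(C[a][b] * v[b] for b in range(d))
                     for a in range(d))
    return eigenvalue, v


# Hypothetical summaries for three records with d = 2.
n, L, Q = 3, [9.0, 15.0], [[35.0, 59.0], [59.0, 101.0]]
cov = covariance(n, L, Q)
corr = correlation(n, L, Q)
eigenvalue, u1 = leading_component(cov)
```

None of these functions touches the original records, so their cost depends on d but not on n.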
Efficient Bayesian Variable Selection • Input table: (X_1, …, X_d, Y) • Find M_γ, where X_γ = {X_a : γ_a = 1}. • Use Zellner’s g-prior to estimate π(γ|Y,X).
Efficient Bayesian Variable Selection • Selection probabilities are calculated from n, L, and Q.
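To illustrate how a model score can come from the summaries alone, the sketch below evaluates the standard closed-form Bayes factor of a model M_γ against the intercept-only model under Zellner's g-prior. The slides do not give the exact expression used in the talk, so this formula, the choice g = n, the function names, and the sample data are all assumptions. The last column of the input is taken to be Y.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [list(A[i]) + [b[i]] for i in range(m)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * m
    for i in reversed(range(m)):
        s = M[i][m] - sum(M[i][k] * x[k] for k in range(i + 1, m))
        x[i] = s / M[i][i]
    return x


def r_squared(n, L, Q, gamma):
    """R^2 of regressing Y (last column) on the columns listed in gamma."""
    if not gamma:
        return 0.0
    y = len(L) - 1

    def S(a, b):
        # Centered cross-product recovered from the summaries.
        return Q[a][b] - L[a] * L[b] / n

    Sxx = [[S(a, b) for b in gamma] for a in gamma]
    sxy = [S(a, y) for a in gamma]
    beta = solve(Sxx, sxy)
    return sum(bi * si for bi, si in zip(beta, sxy)) / S(y, y)


def log_bayes_factor(n, L, Q, gamma, g):
    """Log Bayes factor of M_gamma vs. the intercept-only model (assumed form)."""
    p = len(gamma)
    r2 = r_squared(n, L, Q, gamma)
    return (0.5 * (n - 1 - p) * math.log1p(g)
            - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2)))


# Hypothetical data: columns are (X_1, X_2, Y); Y tracks X_1 closely.
rows = [(1, 5, 2.1), (2, 3, 3.9), (3, 8, 6.2),
        (4, 1, 7.8), (5, 6, 10.1), (6, 2, 12.0)]
n, d = len(rows), len(rows[0])
L = [sum(r[a] for r in rows) for a in range(d)]
Q = [[sum(r[a] * r[b] for r in rows) for b in range(d)] for a in range(d)]

score_x1 = log_bayes_factor(n, L, Q, [0], g=n)
score_x2 = log_bayes_factor(n, L, Q, [1], g=n)
```

A stochastic search would evaluate a score like this for each candidate γ it visits; because every evaluation reads only n, L, and Q, no further scan of the input table is needed.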
SQL Optimizations • Data summarization is done with distributive aggregate functions. • The resulting table stores the terms of n, L, and Q.
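A minimal way to try this kind of query is through Python's built-in `sqlite3` module, used here only as a convenient stand-in for the DBMS in the talk. The table name `X`, its two columns, and the sample rows are hypothetical. A single `SELECT` built from `COUNT` and `SUM` returns every term of n, L, and the lower triangle of Q in one scan.

```python
import sqlite3

# In-memory database standing in for the real DBMS (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (x1 REAL, x2 REAL)")
conn.executemany("INSERT INTO X VALUES (?, ?)",
                 [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)])

# One scan: COUNT gives n, SUM(xa) gives L, SUM(xa*xb) gives Q.
row = conn.execute("""
    SELECT COUNT(*),
           SUM(x1), SUM(x2),
           SUM(x1 * x1), SUM(x1 * x2), SUM(x2 * x2)
    FROM X
""").fetchone()

n = row[0]
L = [row[1], row[2]]
# Q is symmetric, so only the lower triangle is computed and then mirrored.
Q = [[row[3], row[4]],
     [row[4], row[5]]]
conn.close()
```

`COUNT` and `SUM` are distributive: partial results computed on disjoint subsets of the table can be added together, which is the property that lets the summarization be split across threads or partitions.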
SQL Optimizations • UDFs avoid column length limitations. • Records are packed in a user-defined type. • Table-valued functions (TVFs) and stored procedures (SPs) return tables with the results for PCA and Bayesian variable selection.
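The idea of an aggregate UDF that consumes packed records can be approximated with `sqlite3.Connection.create_aggregate` from the Python standard library. This is an analogy, not the UDF API of the DBMS used in the talk: a comma-separated string stands in for the user-defined type, and the names `NLQ`, `nlq`, and `Xpacked` are hypothetical.

```python
import json
import sqlite3


class NLQ:
    """User-defined aggregate that accumulates n, L, and Q."""

    def __init__(self):
        self.n, self.L, self.Q = 0, None, None

    def step(self, packed):
        # Unpack one record; the whole vector travels in a single column,
        # so the table width does not grow with d.
        x = [float(v) for v in packed.split(",")]
        if self.L is None:
            d = len(x)
            self.L = [0.0] * d
            self.Q = [[0.0] * d for _ in range(d)]
        self.n += 1
        for a, xa in enumerate(x):
            self.L[a] += xa
            for b, xb in enumerate(x):
                self.Q[a][b] += xa * xb

    def finalize(self):
        # SQLite aggregates must return a scalar, so pack the result as JSON.
        return json.dumps({"n": self.n, "L": self.L, "Q": self.Q})


conn = sqlite3.connect(":memory:")
conn.create_aggregate("nlq", 1, NLQ)
conn.execute("CREATE TABLE Xpacked (i INTEGER, rec TEXT)")
conn.executemany("INSERT INTO Xpacked VALUES (?, ?)",
                 [(1, "1.0,2.0"), (2, "3.0,4.0"), (3, "5.0,9.0")])

result = json.loads(conn.execute("SELECT nlq(rec) FROM Xpacked").fetchone()[0])
conn.close()
```

Because the aggregate takes a single packed argument, the same SQL statement works for any d, whereas a query with one `SUM` per pair of columns grows quadratically with the number of dimensions.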
Hardware Optimizations • Multi-threading within TVFs and SPs, following the DBMS API. • Blocks of records are cached in main memory. • Worker threads compute the aggregations. • Control over the number of threads, the memory used for caching, and the workload waiting for a thread.
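The block-and-worker pattern can be sketched with `concurrent.futures.ThreadPoolExecutor`. This only illustrates the structure: the talk's implementation runs inside the DBMS, and in CPython the global interpreter lock keeps pure-Python arithmetic like this from running in parallel. The parameters `block_size` and `threads` and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_block(block):
    """Compute partial (n, L, Q) for one cached block of records."""
    d = len(block[0])
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for row in block:
        for a in range(d):
            L[a] += row[a]
            for b in range(d):
                Q[a][b] += row[a] * row[b]
    return len(block), L, Q


def merge(p, q):
    """Combine two partial summaries by element-wise addition."""
    n1, L1, Q1 = p
    n2, L2, Q2 = q
    L = [x + y for x, y in zip(L1, L2)]
    Q = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(Q1, Q2)]
    return n1 + n2, L, Q


def parallel_summarize(rows, block_size=2, threads=2):
    """Split non-empty rows into blocks, summarize each on a worker, merge."""
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(summarize_block, blocks))
    total = partials[0]
    for part in partials[1:]:
        total = merge(total, part)
    return total


# Hypothetical sample data: three records with d = 2.
rows = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
n, L, Q = parallel_summarize(rows)
```

The merge step is correct because the summaries are plain sums: adding the partial results of disjoint blocks gives the same n, L, and Q as a single sequential pass.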
Evaluation of Optimizations • Our optimizations show linear scalability for summarization.
Comparison with R • Comparison of PCA execution inside the DBMS against the statistical package R working on flat files. • Our implementations still produce solutions in the cases where R reaches its data-size limits.
Comparison with R • Our optimizations for stochastic search variable selection (SSVS) inside the DBMS prove efficient for processing large data sets. • Experiments were run with a time limit of 2,000 seconds; some could not be executed due to data-size limitations.
Conclusions • We have extended the DBMS functionality to include PCA and SSVS, using standard SQL and UDFs. • We have overcome the limitations of external tools for processing large data sets. • The model computation is not affected by the number of records; linear scalability in n and d. • Results are exact; accuracy is not compromised.
Future Work • Extend SSVS to high-dimensional data. • Alternatives to UDFs. • Further research on multi-threaded UDFs to solve a broader set of problems.