Fast PCA and Bayesian Variable Selection for Large Data Sets Based on SQL and UDFs

Mario Navas, Carlos Ordonez, Veerabhadran Baladandayuthapani. KDD-LDMTA'10, July 25, 2010

Introduction • Efficient computation on very large data sets for data mining (DM), machine learning (ML), and statistics. • Most of this work is done outside the DBMS. • External tools: • Process flat files (often too big for RAM). • Are limited by disk I/O. • Pose security risks. • DBMSs are extensible through SQL and user-defined functions (UDFs).

Contributions • Extend the DBMS with PCA and Bayesian variable selection for large data sets. • Only one scan of the input table. • The generated models are exact, not approximations. • Exploit multi-threaded aggregate functions for data summarization. • Optimal use of hardware resources: memory and multiple processors.

Preliminaries • Definitions • Input data X = {x1, x2, …, xn}, with d dimensions. • Y is the response, the variable of interest. • Xa is the a-th dimension. • Data Summarization • Used to summarize the essential matrix products XX^T, XY^T, YY^T, and Y 1_n as the terms n, L, and Q.
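For reference, the three summaries can be written as below. This is a sketch under an assumed convention that the slide does not spell out: X is taken as the d × n matrix whose columns are the points x_i, and for regression the response y_i is appended to each point as one extra dimension, so that the XY^T and YY^T blocks (and the sum of Y) come out of the same L and Q.

```latex
\[
n = |X|, \qquad
L = \sum_{i=1}^{n} x_i = X \mathbf{1}_n, \qquad
Q = \sum_{i=1}^{n} x_i x_i^{T} = X X^{T}
\]
% Assumed augmentation: with z_i = (x_i, y_i), the summary Q of Z
% contains the blocks X X^T, X Y^T, and Y Y^T, and L contains Y 1_n.
```

Each term is a plain sum over rows, which is why a single scan of the input table is enough.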

Preliminaries • Dimensionality reduction with PCA • Find a rotation U to project the input data onto its principal components. • Bayesian Variable Selection • Stochastic search for the explanatory variables that best predict the output Y in the regression Y = β0 + β1X1 + … + βdXd. • Generate a sequence γ1, …, γS to estimate π(γ|Y,X), where Xa is part of the model Mγ if γa = 1.

Algorithms • Two-phase algorithms: • Summarization • Model computation • Large-scale processing with SQL and UDFs. • Hardware optimizations are considered.

Efficient Computation of PCA • Covariance and correlation matrices are computed from n, L, and Q. • The principal components are obtained from the SVD of the correlation or covariance matrix.
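The step from summaries to principal components can be sketched as follows. The formulas assume the population covariance (dividing by n), and all function names are illustrative rather than taken from the slides. To stay dependency-free, the sketch swaps the full SVD for power iteration and recovers only the leading component; for a symmetric positive semidefinite matrix the leading eigenvector and the leading singular vector coincide.

```python
import math

def covariance(n, L, Q):
    """Population covariance from the summaries: Q/n - (L/n)(L/n)^T."""
    d = len(L)
    return [[Q[a][b] / n - (L[a] / n) * (L[b] / n) for b in range(d)]
            for a in range(d)]

def correlation(n, L, Q):
    """Pearson correlation matrix computed directly from n, L, Q."""
    d = len(L)
    s = [math.sqrt(n * Q[a][a] - L[a] ** 2) for a in range(d)]
    return [[(n * Q[a][b] - L[a] * L[b]) / (s[a] * s[b]) for b in range(d)]
            for a in range(d)]

def leading_component(M, iters=200):
    """Power iteration: leading eigenvalue and unit eigenvector of M.

    Stand-in for a full SVD; the fixed start vector is a simplification
    and could in principle be orthogonal to the leading direction.
    """
    d = len(M)
    u = [1.0 / (a + 1) for a in range(d)]
    for _ in range(iters):
        w = [sum(M[a][b] * u[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        u = [v / norm for v in w]
    # Rayleigh quotient u^T M u gives the eigenvalue for the unit vector u.
    lam = sum(u[a] * sum(M[a][b] * u[b] for b in range(d)) for a in range(d))
    return lam, u

# Summaries of the three points (1, 2), (3, 4), (5, 9).
n, L, Q = 3, [9.0, 15.0], [[35.0, 59.0], [59.0, 101.0]]
V = covariance(n, L, Q)
R = correlation(n, L, Q)
lam, u = leading_component(R)
```

Note that no step here touches the original rows: once n, L, and Q exist, the cost depends only on d.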

Efficient Bayesian Variable Selection • Input table: (X1,…,Xd,Y) • Find Mγ, where Xγ={Xa:γa=1}. • Use Zellner's g-prior to estimate π(γ|Y,X).

Efficient Bayesian Variable Selection • Selection probabilities are calculated from n, L, and Q.
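One way to see how a model score can depend on the data only through n, L, and Q is the sketch below. It is an assumption, not the formula from the slides: it uses the textbook closed-form Bayes factor of a model against the intercept-only model under Zellner's g-prior, (1+g)^((n-1-q)/2) / (1 + g(1-R²))^((n-1)/2), where q is the number of selected variables. The response Y is assumed to be the last column of the summarized table, and all function names are hypothetical.

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting.

    Assumes A is non-singular; no rank check is performed.
    """
    m = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * m
    for r in range(m - 1, -1, -1):
        z[r] = (M[r][m] - sum(M[r][k] * z[k] for k in range(r + 1, m))) / M[r][r]
    return z

def r_squared(n, L, Q, gamma):
    """R^2 of regressing Y (last column) on the columns with gamma[a] == 1."""
    y = len(L) - 1

    def s(a, b):
        # Centered cross-product, obtained from the summaries alone.
        return Q[a][b] - L[a] * L[b] / n

    idx = [a for a, flag in enumerate(gamma) if flag == 1]
    if not idx:
        return 0.0
    sxx = [[s(a, b) for b in idx] for a in idx]
    sxy = [s(a, y) for a in idx]
    beta = solve(sxx, sxy)
    return sum(bj * sj for bj, sj in zip(beta, sxy)) / s(y, y)

def log_bayes_factor(n, L, Q, gamma, g):
    """Log Bayes factor of model gamma versus the intercept-only model."""
    q = sum(gamma)
    r2 = r_squared(n, L, Q, gamma)
    return (0.5 * (n - 1 - q) * math.log(1 + g)
            - 0.5 * (n - 1) * math.log(1 + g * (1 - r2)))

# Toy table (X1, X2, Y); the summaries are built once, then reused per model.
rows = [(1.0, 2.0, 2.1), (2.0, 1.0, 3.9), (3.0, 4.0, 6.2),
        (4.0, 3.0, 7.8), (5.0, 5.0, 10.1)]
n = len(rows)
L = [sum(r[a] for r in rows) for a in range(3)]
Q = [[sum(r[a] * r[b] for r in rows) for b in range(3)] for a in range(3)]
g = float(n)  # assumed choice of g, not taken from the slides
models = [(0, 0), (1, 0), (0, 1), (1, 1)]
scores = {gamma: log_bayes_factor(n, L, Q, gamma, g) for gamma in models}
```

A stochastic search would evaluate such a score for each visited γ; because every evaluation reads only the small summary matrices, the S iterations never rescan the input table.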

SQL Optimizations • Data summarization is done with distributive aggregate functions. • The resulting table stores the terms of n, L, and Q.
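The summarization query can be illustrated with plain SUM and COUNT aggregates. The sketch below uses SQLite through Python's standard library purely as a stand-in, since the slides do not name the DBMS; the table name X, the column names, and the query generator are assumptions. Only the upper triangle of Q is requested because Q is symmetric.

```python
import sqlite3

def summarization_query(table, columns):
    """Build one SELECT that returns n, L, and the upper triangle of Q."""
    sums = [f"SUM({c})" for c in columns]
    for i, a in enumerate(columns):
        for b in columns[i:]:
            sums.append(f"SUM({a} * {b})")
    return f"SELECT COUNT(*), {', '.join(sums)} FROM {table}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (x1 REAL, x2 REAL)")
conn.executemany("INSERT INTO X VALUES (?, ?)",
                 [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)])

query = summarization_query("X", ["x1", "x2"])
# Result layout: n, L1, L2, Q11, Q12, Q22
summary = conn.execute(query).fetchone()
conn.close()
```

The generated statement grows with d(d+1)/2 terms, which is one reason a wide table can run into the column limits that the UDF-based variant is meant to avoid.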

SQL Optimizations • UDFs avoid column-length limitations. • Records are packed in a user-defined type. • Table-valued functions (TVFs) and stored procedures (SPs) return tables with the results for PCA and Bayesian variable selection.

Hardware Optimizations • Multi-threading within TVFs and SPs, following the DBMS API. • Blocks of records are cached in main memory. • Worker threads compute the aggregations. • Tunable: number of threads, memory used for caching, and workload waiting for a thread.
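The block-and-worker structure described above can be sketched as follows. This is an illustration of the pattern only, not of the DBMS API: the function names, the block size, and the thread count are hypothetical, and the input is assumed to be non-empty. Because the summaries are distributive, partial results from different blocks merge by simple addition.

```python
from concurrent.futures import ThreadPoolExecutor

def summarize_block(block):
    """Compute (n, L, Q) for one cached block of records."""
    d = len(block[0])
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in block:
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q

def merge(s, t):
    """Combine two partial summaries by element-wise addition."""
    n = s[0] + t[0]
    L = [a + b for a, b in zip(s[1], t[1])]
    Q = [[a + b for a, b in zip(row_s, row_t)]
         for row_s, row_t in zip(s[2], t[2])]
    return n, L, Q

def parallel_summarize(rows, block_size, max_threads):
    """Split rows into blocks, summarize each on a worker thread, then merge."""
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        partials = list(pool.map(summarize_block, blocks))
    total = partials[0]
    for part in partials[1:]:
        total = merge(total, part)
    return total

rows = [(float(i), float(i * i % 7)) for i in range(1, 11)]
result = parallel_summarize(rows, block_size=3, max_threads=4)
```

In CPython the threads will not speed up this pure-Python arithmetic; the point is the shape of the computation, where block size and thread count correspond to the tunable parameters listed on the slide.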

Evaluation of Optimizations • Our optimizations show linear scalability for summarization.

Comparison with R • Execution-time comparison of PCA inside the DBMS against the statistical package R working on flat files. • Our implementation still produces results in the cases where R reaches its data-size limitations.

Comparison with R • Our optimizations for stochastic search variable selection (SSVS) inside the DBMS prove efficient for processing large data. • Experiments were run with a time limit of 2,000 seconds; some experiments could not be executed due to the data-size limitations.

Conclusions • We have extended the DBMS functionality to include PCA and SSVS, using standard SQL and UDFs. • We have overcome the limitations of external tools for processing large data. • The model computation is not affected by the number of records; linear scalability in n and d. • Results are exact; accuracy is not compromised.

Future Work • Extend SSVS to high-dimensional data. • Alternatives to UDFs. • Further research on multi-threaded UDFs to solve a broader set of problems.

Thank you!!
