A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs

A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs Ph.D Showcase, Dept. of Computer Science Sasi Kumar Pitchaimalai Ph.D Candidate Database Systems Group, Department of Computer Science University of Houston Advisor: Dr. Carlos Ordonez

Motivation • Naïve Bayes Classifier(NB) • One of the most popular and important classifiers in Machine Learning • Robust, Powerful, Fast to Compute And Easy to Understand • Programming Inside A DBMS • SQL can easily handle complex computations • UDFs can use arrays and processed in memory

Data Mining Inside A DBMS Avoids Exporting the data outside the DBMS Major overhead Data Security Scales Linearly with large data sets Exploit parallelism provided by a DBMS Use optimized queries with simple database operations Objective: Push computations involving large data sets inside the DBMS

Bayesian Classifier Based On K-Means (BKM) • A Generalization Of Naïve Bayes(NB) • The Algorithm • Initialization: Randomly initialize k clusters per class from the data set. • E-Step: Compute Euclidean distance, find nearest cluster and then compute sufficient statistics. • M-Step: Re-compute cluster centers and radii. Check Convergence. • The E-Step and M-Step are repeated until model converges i.e clusters do not move

BKM: Finding the clusters per class

Database Optimizations • Five different query optimization techniques for distance computation were introduced. • User Defined Functions (UDFs) – Computing distance and nearest cluster in a single UDF. • Using CASE statement instead of aggregations. • Sufficient Statistics of the clusters were computed in a single table scan.

Comparing Accuracy – NB Vs BKM Vs DT • Global Accuracy: BKM better than NB and worse than DT(Decision Tree) in most cases • Class Breakdown Accuracy: • BKM better than NB except 2 cases proving class decomposition is a positive step towards increasing NB accuracy. DT performs poorly here and really worse in case of the bscale.

BKM Scalability- Varying n,d,k Times per Iteration. Defaults: d=4,k=4,n=100k

Comparing DBMS with MapReduce MapReduce: A distributed non-transactional high performance data intensive processing framework.

Incremental Mining • An UDF performing incremental data mining exploiting data parallelism • Minimizing the number of scans(1-3) on the data set • Provides an approximation of the model before we scan through the complete data set • Requires thread safe sharing of the model without affecting performance

Papers • Carlos Ordonez, Sasi K. Pitchaimalai: One-pass data mining algorithms in a DBMS with UDFs. SIGMOD Conference 2011: 1217-1220 • Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado : Comparing SQL and MapReduce to compute Naïve Bayes in a Single Table Scan, CloudDB, CIKM 2010 • Carlos Ordonez, Sasi K. Pitchaimalai: Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling, DKE 2010 • Carlos Ordonez, Sasi K. Pitchaimalai - Bayesian Classifiers Programmed in SQL, TKDE 2008 • Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado – Efficient Distance Computation Using SQL Queries and UDFs, ICDM 2008

A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs