Database Systems Research on Data Mining
Reference: Ordonez, C. and Garcia-Garcia, J., Database Systems Research on Data Mining, Proc. ACM SIGMOD 2010, p.000-999 (tutorial).
Global Outline
1. Data mining models and algorithms (JG, 15 min)
 1.1 Preprocess to get data set
 1.2 Data set
 1.3 Data mining models
 1.4 Data mining algorithms
2. Processing alternatives (JG, 35 min)
 2.1 Inside DBMS: SQL
 2.2 Outside DBMS: MapReduce
 2.3 Example
3. Storage and Optimizations (CO, 35 min)
 3.1 Layouts: Horizontal and Vertical
 3.2 Optimizations: Algorithmic and Systems
1. Data Mining Models & Algorithms
• Data set preparation
• Data set
• Data mining models and patterns
• Algorithms
1.1 Data set preparation [CDHHL1999,KDD] [GCBLRVPP1997,JDMKD] [O2004,DMKD]
• In practice, 80% of project time
 • significant SQL code writing; some tools help query writing
 • iterative: between modeling and data set preparation
• Little attention in the research literature
 • query optimization mostly in the OLAP context
 • new operators: PIVOT, horizontal aggregations
 • research issue: can algorithms directly analyze 3NF tables?
Data set preparation [GO2010,DKE] [OG2008,DSS]
• Overall goal: getting the data set for analysis
• Database processing generally required
 • normalized (many in 3NF) databases cannot be directly analyzed
 • joins, aggregations and pivoting (transposing); see the sketch below
• Data cleaning
 • remove outliers
 • null replacement
 • repair referential integrity
• Data transformation: categorical columns; rescale; code
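A minimal sketch of such a preparation query (hypothetical tables customer(cid, city) and txn(cid, dept, amount); names and columns are illustrative only): a join plus aggregation builds one record per customer, CASE expressions pivot the categorical dept column into numeric dimensions, and COALESCE replaces nulls.

-- Hypothetical source tables: customer(cid, city), txn(cid, dept, amount)
SELECT c.cid,
       COALESCE(SUM(t.amount), 0)                                     AS totalAmt,
       COALESCE(SUM(CASE WHEN t.dept = 'food'  THEN t.amount END), 0) AS foodAmt,
       COALESCE(SUM(CASE WHEN t.dept = 'music' THEN t.amount END), 0) AS musicAmt
FROM customer c
LEFT JOIN txn t ON t.cid = c.cid
GROUP BY c.cid;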
1.2 Data set
• Data set with n records
• Each record has attributes: numeric, discrete or both (mixed)
• Focus of the tutorial: d dimensions
• High d generally makes the problem mathematically more difficult
• Extra column G/Y (class/target)
Example of data set, horizontal layout: n=5, d=3 and G/Y (see the DDL sketch below)
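The example table itself is not reproduced here; the following DDL sketch (hypothetical column names) shows what such a horizontal-layout data set looks like: one row per point, one column per dimension, plus the class column.

-- Horizontal layout, d=3 numeric dimensions plus class column g (G/Y)
CREATE TABLE X (
  i  INT PRIMARY KEY,  -- point id, 1..n (n=5 in the slide's example)
  X1 FLOAT,
  X2 FLOAT,
  X3 FLOAT,
  g  INT               -- class/target column
);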
1.3 Data Mining Models [STA1998,SIGMOD]
• Models: coming mostly from statistics and machine learning
 • based on matrix computations, probability and calculus
 • time dimension not considered
• Patterns: mostly combinatorial
 • association rules, cubes, sequences and graphs
 • important, but not considered in this tutorial
 • quite different algorithms from models
 • no strong statistical foundation
Common data mining models [DLR1977,RSS]
• Unsupervised
 • math: simpler
 • task: clustering, dimensionality reduction
 • models: KM, EM, PCA/SVD, FA
 • statistical tests overlap both
• Supervised
 • math: more tuning and validation than unsupervised
 • task: classification, regression
 • models: decision trees, Naïve Bayes, Bayesian classifiers, linear/logistic regression, SVM, neural nets
Data mining models: characteristics
• Multidimensional
 • tens, hundreds of dimensions
 • feature selection and dimensionality reduction
• Represented & computed with matrices & vectors
 • data set: set of vectors or set of records; all numeric or mixed attributes
 • model: numeric = matrices, discrete = histograms
 • intermediate computations: matrices and histograms
Data mining: major tasks
• Model computation
 • focus of most research
 • generally requires matrix computations
 • complex and slow algorithms (iterative)
 • large n makes it slower
• Scoring a data set
 • assumes the model exists
 • useful for tuning, testing and model exchange
 • fast: generally requires only one pass over X
 • research issue: not studied enough in the literature
1.4 Data Mining Algorithms: input and output
• Input: data set X with n records, d dimensions
• Output: model, quality
• Parameters (representative; vary a lot):
 • k (clusters, principal components, discrete states)
 • epsilon for stopping (accuracy, convergence, local optima)
 • feature/variable selection (algorithm dependent; step-wise or, more recently, Bayesian statistics)
Data Mining Algorithms [ZRL1996,SIGMOD]
• Behavior with respect to data set X:
 • one pass, few passes
 • multiple passes, convergence: the bigger issue (most algorithms)
• Time complexity
• Research issues:
 • preserve time complexity in SQL/MapReduce
 • incremental learning
2. Processing alternatives
 2.1 Inside DBMS (SQL)
 2.2 Outside DBMS (MapReduce)
 2.3 Example
2.1 Inside DBMS • Assumption: • data records are in the DBMS; exporting slow • row-based storage (not column-based) • Programming alternatives: • SQL and UDFs: SQL code generation (JDBC), precompiled UDFs. Extra: SP, embedded SQL, cursors • Internal C Code (direct access to file system and mem) • DBMS advantages: • important: storage, queries, security • maybe: recovery, concurrency control, integrity, transactions
Inside DBMS: SQL code (CREATE + SELECT), consider layout [CDDHW2009,VLDB]
• Vertical layout: A(i,j,v), B(i,j,v)
• A*B:
 SELECT A.i, B.j, sum(A.v * B.v)
 FROM A JOIN B ON A.j = B.i
 GROUP BY A.i, B.j
• CREATE TABLE (see the sketch below)
 • Row storage: clustered (to group rows of pivoted tables), block size (for large tables)
 • Index: primary (generally for PK, critical for joins), secondary (may help joins & searches)
• SELECT
 • Basic mechanism to write queries; standard across DBMSs; arbitrarily complex queries, including arithmetic expressions
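A generic DDL sketch for the vertical-layout table above; clustered-storage and primary-index clauses are DBMS specific, so only standard SQL is shown, and the secondary index is an assumption about what may help the join.

CREATE TABLE A (
  i INT,               -- row subscript
  j INT,               -- column subscript
  v FLOAT,             -- matrix entry A[i,j]
  PRIMARY KEY (i, j)   -- primary index; critical for joins
);
CREATE INDEX A_j ON A (j);  -- secondary index: may help the join condition A.j = B.i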
Inside DBMS: User-Defined Function (UDF)
• Classification:
 • Scalar UDF
 • Aggregate UDF
 • Table UDF
• Programming:
 • Called in a SELECT statement
 • C code or similar language
 • API provided by the DBMS, in C/C++
 • Data type mapping
Inside DBMS: UDF pros and cons
• Advantages:
 • arrays and flow control
 • flexibility in code writing and no side effects
 • no need to modify DBMS internal code
 • in general, simple data types
• Limitations:
 • OS and DBMS architecture dependent, not portable
 • no I/O capability, no side effects
 • null handling and fixed memory allocation
 • memory leaks with arrays (matrices): fenced/protected mode
Inside DBMS: Scalar UDF [DNPT2006,SAC]
• Memory allocation in the stack
• Returns one value of a simple data type
• Basic SQL data types (e.g. int, float, char)
• May support UDTs
• Called, and returns a value, with every row
• Useful for vector operations (see the call sketch below)
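A call-site sketch with a hypothetical scalar UDF sqdist() (not a built-in): it is invoked once per row and returns one float, here the squared Euclidean distance of each point to a fixed centroid.

SELECT i, sqdist(X1, X2, X3, 0.5, 1.0, 2.0) AS dist2
FROM X;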
Inside DBMS: Aggregate UDF [JM1998,SIGMOD]
• Table scan
• Memory allocation in the heap
• GROUP BY extends their power
• Also requires handling nulls
• Advantage: parallel & multithreaded processing
• Drawback: returns a single value, not a table
• DBMSs: SQL Server, PostgreSQL, Teradata, Oracle, DB2, among others
• Useful for model computations
Inside DBMS: Aggregate User-Defined Function phases
1. Initialize: allocates variable storage
2. Accumulate: processes every record, aggregates some value (vector); the bottleneck
3. Merge: consolidates partial results from multiple threads
4. Terminate: final processing, returns the result
(A call-site sketch follows below.)
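A call-site sketch with a hypothetical aggregate UDF udf_nlq() (not a built-in) that accumulates the N, L, Q sufficient statistics per class: for each group the DBMS runs Initialize once, Accumulate once per row (possibly in parallel threads), Merge to consolidate partial results, and Terminate to return the single value.

SELECT g, udf_nlq(X1, X2, X3) AS nlq
FROM X
GROUP BY g;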
Inside DBMS: Table UDF [BRKPHK2008,SIGMOD]
• Main difference with aggregate UDF: returns a table (instead of a single value)
• Also, it can take several input values
• Called in the FROM clause of a SELECT (see the sketch below)
• Stream: no parallel processing, external file
• Computation power same as aggregate UDF
• Suitable for complex math operations and algorithms
• Since the result is a table it can be joined
• DBMSs: SQL Server, DB2, Oracle, PostgreSQL
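A call-site sketch with a hypothetical table UDF kmeans_udf() (Oracle-style TABLE(...) syntax; other DBMSs differ): it is called in the FROM clause, returns the model as a table of (cluster, dimension, value), and that result can be joined like any other table.

SELECT M.j, M.h, M.v
FROM TABLE(kmeans_udf(CURSOR(SELECT X1, X2, X3 FROM X), 3)) M;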
Inside DBMS: Internal C code [LTWZ2005,SIGMOD] [MYC2005,VLDB] [SD2001,CIKM]
• Advantages:
 • access to the file system (table record blocks), physical operators (scan, join, sort, search)
 • main memory, data structures, libraries
 • hardware optimizations: multithreading, multicore, caching in RAM, caching in L1/L2
• Disadvantages:
 • requires careful integration with the rest of the system
 • not available to end users and practitioners
 • may require exposing functionality with a DM language or SQL
Inside DBMS: Physical Operators [DG1992,CACM] [SMAHHH2007,VLDB] [WH2009,SIGMOD]
• Serial DBMS (one CPU, maybe RAID):
 • table scan
 • join: hash join, sort-merge join, nested loop
 • external merge sort
• Parallel DBMS (shared-nothing):
 • even row distribution, hashing
 • parallel table scan
 • parallel joins: large/large (sort-merge, hash); large/short (replicate short)
 • distributed sort
2.2 Outside DBMS • Alternatives: • MapReduce • Packages, libraries, Java/C++ • Issue: I/O bottleneck
Outside DBMS: MapReduce [DG2008,CACM]
• Parallel processing; simple; shared-nothing
• Functions are programmed in a high-level programming language (e.g. Java, Python); flexible
• <key,value> pairs processed in two phases:
 • map(): computation is distributed and evaluated in parallel; independent mappers
 • reduce(): partial results are combined/summarized
• Can be categorized as inside/outside DBMS, depending on the level of integration with the DBMS
• DBMS integration: Greenplum, Aster Data, Teradata...
Outside DBMS: MapReduce files and processing
• File types:
 • Text files: common storage (e.g. CSV files)
 • SequenceFiles: efficient processing
 • Custom InputFormat (rarely used)
• Processing:
 • Points are sorted by "key" before being sent to reducers
 • Small files should be merged
 • Partial results are stored in the file system
 • Intermediate files should be managed as SequenceFiles for efficiency
Outside DBMS: Packages, libraries, Java/C++ [ZHY2009,CIDR] [ZZY2010,ICDE]
• Statistical and data mining packages:
 • exported flat files; proprietary file formats
 • memory-based (processing data records, models, internal data structures)
• Programming languages:
 • arrays
 • flexibility of control statements
• Limitation: large number of records
• Packages: R, SAS, SPSS, KXEN, Matlab, WEKA
2.3 Naïve Bayes Example: horizontal layout
• NB, one pass, Gaussian, sufficient statistics (N, L, Q)
• Example in: SQL, UDF, MapReduce
• Data structures:
 public double N;
 public double[] L;
 public double[] Q;
Naïve Bayes SQL (2 passes; n, L, Q; triangular Q)
/* Lower triangular Q, as used for PCA and LR */
SELECT sum(X1*X1), null, ..., null
      ,sum(X2*X1), sum(X2*X2), ..., null
      ...
      ,sum(Xd*X1), sum(Xd*X2), ..., sum(Xd*Xd)
FROM X
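The query above produces the lower-triangular Q used for PCA and LR; Naïve Bayes only needs per-class counts, sums and diagonal sums of squares. A minimal per-class sketch for d=2, assuming the horizontal table X(i,X1,X2,g):

SELECT g,
       count(*)   AS N,
       sum(X1)    AS L1, sum(X2)    AS L2,
       sum(X1*X1) AS Q1, sum(X2*X2) AS Q2
FROM X
GROUP BY g;
-- per class: mean = L/N, variance = Q/N - (L/N)^2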
3. Storage and Optimizations • Storage layouts: • horizontal • vertical • Optimizations: • algorithmic: general • systems-oriented: SQL and MapReduce
Horizontal Layout: DBMS data set X(i,X1,…,Xd,G/Y)
• Most common format in DM
• Joins/aggregations/arithmetic expressions, pivot
Horizontal Layout: DBMS [DM2006,SIGMOD]
• Physical operator (most common):
 • table scan (default in SQL query or UDF)
• External sort or hash table:
 • SQL GROUP BY query
 • UDF GROUP BY
• Join algorithm:
 • SQL queries
 • not required in UDF
Horizontal Layout: DBMS
• Size of table: n rows
• Limited DDL control in SQL
 • limited number of columns
 • requires assigning a point id i (primary index)
• Clustered storage by block on i allows processing several rows at the same time
Horizontal Layout DBMS: X1+X2+X3
• No arrays; access to dimensions through SQL code generation (Java/C++)

SQL:
SELECT i, X1, X2, X3, X1+X2+X3 FROM X

Java (SQL generation):
String query = "SELECT i";
String sum = "";
for ( int h = 1; h <= d; h++ ) {
  query += ", X" + h;
  sum   += "X" + h + "+";
}
query += ", " + sum.substring(0, sum.length()-1) + " FROM X";

C++ (flat file scan):
double X[4], SUM = 0.0;
int i = 1, h = 0, d = 3, g;
/* each line assumed to hold X1 X2 X3 and the class column g */
while ( fscanf(fp, "%lf%lf%lf%d", &X[1], &X[2], &X[3], &g) != EOF ) {
  SUM = 0.0;
  for ( h = 1; h <= d; h++ ) {
    SUM += X[h];
  }
  printf("%d\t%g\r\n", i, SUM);
  i++;
}
Vertical Layout: DBMS, X(i,h,v,g)
• When d exceeds the DBMS column limits; d>n
• Index by point i and dimension h
• Clustered row storage by i (correctness in UDF, efficiency in SQL query)
• Queries require joins & aggregations
• Columns as subscripts
• Size <= dn rows (sparse)
• Two tables with n rows with the same PK can be joined in time O(n) using a hash join
Vertical Layout DBMS: X1+X2+X3
• Requires UDFs, or SQL statements using (indexed) joins and aggregations

SQL:
SELECT i, SUM(v) FROM X GROUP BY i

Java (SQL generation):
String query = "SELECT i, SUM(v) FROM X";
query += " GROUP BY i";

C++ (flat file scan, control break on i):
double X, SUM = 0.0;
int h, old_i = 0, i;
while ( fscanf(fp, "%d%d%lf", &i, &h, &X) != EOF ) {
  if ( old_i != i ) {
    if ( old_i != 0 )
      printf("%d\t%g\r\n", old_i, SUM);
    old_i = i;
    SUM = X;
  } else {
    SUM += X;
  }
}
printf("%d\t%g\r\n", i, SUM);
Horizontal Layout: MapReduce, X.csv
• Line number represents the point ID (implicit)
• No indexes in general
• Flat file
Horizontal vs Vertical
3.2 Optimizations: Algorithmic & Systems
• Algorithmic
 • 90% of research; many efficient algorithms
 • accelerate/reduce computations or convergence
 • database systems focus: reduce I/O
 • approximate solutions
• Systems (SQL, MapReduce)
 • Platform: parallel DBMS server vs cluster of computers
 • Programming: SQL/C++ versus Java
Algorithmic [ZRL1996,SIGMOD]
• Implementation: data set available as a flat file; a binary file is required for random access
• May require data structures working in main memory and on disk
• Programming not in SQL: C/C++ are the preferred languages, although Java is becoming common; MapReduce is becoming popular
• Assumption d<<n: n has received more attention
• Issue: d>n produces numerical issues and a large covariance/correlation matrix (larger than X)
Algorithmic Optimizations [STA1998,SIGMOD] [ZRL1996,SIGMOD] [O2007,SIGMOD]
• Exact model computation:
 • summaries: sufficient statistics (Gaussian pdf), histograms, discretization
 • accelerate convergence, reduce iterations
 • faster matrix operations: * +
• Approximate model computation:
 • sampling: efficient in time O(s)
 • incremental:
  • math: escape local optima (EM), reseed
  • database systems: favor table scan
Systems Optimizations: DBMS [O2006,TKDE] [ORD2010,TKDE]
• SQL query optimization
 • mathematical equations as queries (see the sketch below)
 • Turing-complete: SQL code generation from a programming language
• UDFs as optimization
 • substitute key mathematical operations
 • push processing into RAM
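A sketch of "mathematical equations as queries", assuming a horizontal data set X(i,X1,X2) and a hypothetical centroid table C(j,C1,C2) with d=2: the squared Euclidean distance from every point to every centroid is written as arithmetic expressions in a SELECT.

SELECT X.i, C.j,
       (X.X1 - C.C1)*(X.X1 - C.C1) + (X.X2 - C.C2)*(X.X2 - C.C2) AS dist2
FROM X CROSS JOIN C;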
Systems Optimizations: DBMS SQL query [O2004,DMKD]
• Denormalization
• Issue: query rewriting (the optimizer falls short)
• Index depends on layout
 • Horizontal layout: indexed by i; d may be an issue, thus vertical partitioning
 • Vertical layout: storage clustered by point; indexing by subscript
• Use a specific join algorithm
Systems Optimizations: DBMS SQL query [O2006,TKDE] [OP2010,TKDE] [OP2010,DKE] [MC2002,ICDM]
• Join:
 • denormalized storage: model, intermediate tables
 • favor hash joins over sort-merge: both tables with primary index on i
 • secondary indexing for join: sort-merge join
• Aggregation (compression):
 • push group-by before join (see the sketch below): watch out for nulls and high-cardinality columns like point i
 • synchronized table scans: several SELECTs on the same table; examples: unpivoting; 2+ models
• Sampling: O(s), random access, truly random; error
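A sketch of pushing the group-by below the join, with hypothetical tables XV(i,h,v) (vertical data) and XH(i,g) (one row per point with its class/cluster g); the rewrite is equivalent here because i is unique in XH, but nulls and high-cardinality grouping keys need care.

-- Join first, aggregate after:
SELECT XH.g, sum(XV.v) AS s
FROM XV JOIN XH ON XV.i = XH.i
GROUP BY XH.g;

-- Pre-aggregate XV by i (n rows instead of d*n), then join and re-aggregate:
SELECT XH.g, sum(P.s) AS s
FROM (SELECT i, sum(v) AS s FROM XV GROUP BY i) P
JOIN XH ON P.i = XH.i
GROUP BY XH.g;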