Parallel streaming decision trees
Yael Ben-Haim & Elad Yom-Tov
Presented by: Yossi Richter
Why decision trees?
• Simple classification model, short testing time
• Understandable by humans
• BUT: difficult to train on large data (need to sort each feature)
Previous work
• Presorting (SLIQ, 1996)
• Approximations (BOAT, 1999; CLOUDS, 1997)
• Parallel (e.g., SPRINT, 1996)
  • Vertical parallelism
  • Task parallelism
  • Hybrid parallelism
• Streaming
  • Minibatch (SPIES, 2003)
  • Statistics (pCLOUDS, 1999)
Iterative parallel decision tree
[Diagram: master/worker cycle over time]
• The master initializes the root of the tree
• Each worker builds histograms over its own portion of the data
• The master merges the workers' histograms and computes the node splits
• The cycle repeats until convergence
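As a rough illustration of this control flow only: the sketch below uses simple Counter-based stand-ins (histogram_of, merge_all, and best_split are hypothetical names, and tree growth is elided to one split per iteration); the actual streaming histograms are defined on the next slides.

```python
from collections import Counter

def histogram_of(shard, precision=1):
    # Stand-in worker summary: counts of rounded values.
    return Counter(round(x, precision) for x in shard)

def merge_all(histograms):
    merged = Counter()
    for h in histograms:
        merged.update(h)           # master: merge the workers' summaries
    return merged

def best_split(histogram):
    # Stand-in master decision: split at the approximate median.
    points = sorted(histogram.elements())
    return points[len(points) // 2]

shards = [[0.11, 0.52, 0.93, 0.35], [0.24, 0.58, 0.71, 0.66]]  # per-worker data
for level in range(3):                         # grow the tree level by level
    local = [histogram_of(s) for s in shards]  # workers build histograms
    split = best_split(merge_all(local))       # master merges, computes splits
    print(f"level {level}: split at {split}")
```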
Building an on-line histogram
• A histogram is a list of pairs (p1, m1) … (pn, mn)
• Initialize: c = 0, p = [ ], m = [ ]
• For each data point x:
  • If x == pj for some j <= c: mj = mj + 1
  • Otherwise:
    • Add a bin (x, 1) to the histogram
    • c = c + 1
    • If c > max_bins:
      • Merge the two closest bins in the histogram
      • c = max_bins
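A minimal Python sketch of this update procedure, assuming the bin list is kept sorted by value; the Histogram class name is illustrative, and the count-weighted merge rule is the one used in the authors' paper (the slide leaves the merged bin's value unspecified):

```python
import bisect

class Histogram:
    """Fixed-size approximate histogram: a sorted list of (value, count) bins."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [value, count] pairs

    def update(self, point):
        values = [v for v, _ in self.bins]
        i = bisect.bisect_left(values, point)
        if i < len(self.bins) and self.bins[i][0] == point:
            self.bins[i][1] += 1              # existing bin: bump its count
        else:
            self.bins.insert(i, [point, 1])   # new bin (point, 1), kept sorted
            if len(self.bins) > self.max_bins:
                self.merge_closest()          # shrink back to max_bins bins

    def merge_closest(self):
        # Fuse the adjacent pair with the smallest gap; the merged bin sits
        # at the count-weighted mean of the pair.
        i = min(range(len(self.bins) - 1),
                key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
        (v1, m1), (v2, m2) = self.bins[i], self.bins[i + 1]
        self.bins[i:i + 2] = [[(v1 * m1 + v2 * m2) / (m1 + m2), m1 + m2]]
```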
Merging two histograms
• Concatenate the two histogram lists, creating a list of length c
• Repeat until c <= max_bins:
  • Merge the two closest bins
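Continuing the sketch above, merging reuses the same closest-pair rule; the toy driver below simulates the master/worker cycle from the earlier diagram with two workers (merge_histograms is an illustrative name):

```python
import random

def merge_histograms(h1, h2, max_bins):
    """Concatenate two bin lists, then merge closest bins down to max_bins."""
    merged = Histogram(max_bins)
    merged.bins = sorted([list(b) for b in h1.bins + h2.bins])
    while len(merged.bins) > max_bins:
        merged.merge_closest()
    return merged

# Two "workers" summarize their local streams; the "master" merges them.
workers = [Histogram(50), Histogram(50)]
for h in workers:
    for _ in range(500):
        h.update(random.gauss(0.0, 1.0))
master = merge_histograms(workers[0], workers[1], max_bins=50)
assert len(master.bins) <= 50
```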
Example of the histogram
[Plot: a histogram with 50 bins built from 1,000 data points]
Pruning
• Taken from the MDL-based SLIQ algorithm
• Consists of two phases:
  • Tree construction
  • A bottom-up pass on the complete tree
• During tree construction, for each tree node set cleaf = 1 + (number of samples that reached the node and do not belong to the majority class)
• The bottom-up pass:
  • For each leaf, set cboth = cleaf
  • For each internal node for which cboth(left) and cboth(right) have been assigned, set cboth = 2 + cboth(left) + cboth(right)
• The subtree rooted at a node should be pruned when cleaf is small, i.e., when:
  • Only a few samples reach the node, or
  • A substantial portion of the samples that reach it belongs to the majority class
• If cleaf < cboth (i.e., the subtree does not contribute much information):
  • Prune the subtree
  • Set cboth = cleaf
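A minimal sketch of the bottom-up pass, on an illustrative Node type (not from the paper's code); c_leaf is assumed to have been set to 1 + (number of misclassified samples at the node) during tree construction:

```python
class Node:
    def __init__(self, c_leaf, left=None, right=None):
        self.c_leaf = c_leaf
        self.left, self.right = left, right

def prune(node):
    if node.left is None and node.right is None:   # leaf: c_both = c_leaf
        node.c_both = node.c_leaf
        return
    prune(node.left)                               # assign c_both bottom-up
    prune(node.right)
    node.c_both = 2 + node.left.c_both + node.right.c_both
    if node.c_leaf < node.c_both:                  # subtree adds little info:
        node.left = node.right = None              # prune it, turning the
        node.c_both = node.c_leaf                  # node into a leaf

# Example: a root whose children barely reduce the error gets collapsed.
root = Node(c_leaf=3, left=Node(c_leaf=2), right=Node(c_leaf=1))
prune(root)
assert root.left is None  # 3 < 2 + 2 + 1, so the subtree was pruned
```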
Shameless PR slide: the IBM Parallel Machine Learning toolbox
• A toolbox for conducting large-scale machine learning
• Supports architectures ranging from single machines with multiple cores to large distributed clusters
• Works by distributing the computations across multiple nodes
• Allows for rapid learning of very large datasets
• Includes state-of-the-art machine learning algorithms for:
  • Classification: support-vector machines (SVM), decision trees
  • Regression: linear and SVM
  • Clustering: k-means, fuzzy k-means, kernel k-means, Iclust
  • Feature reduction: principal component analysis (PCA) and kernel PCA
• Includes an API for adding algorithms
• Freely available from alphaWorks
• A joint project of the Haifa Machine Learning group and the Watson Data Analytics group
[Plot: K-means, Blue Gene]
Results: comparing single-node solvers
• Ten-fold cross-validation, unless a test/train partition exists
• No statistically significant difference between the solvers
Results: pruning
• 80% reduction in tree size
Speedup (strong scalability)
[Plots: speedup curves for Alpha and Beta]
• Speedup improves with data size!
Weak scalability
[Plots: weak scalability for Alpha and Beta]
• Scalability improves with the number of processors!
Summary
• An efficient new algorithm for parallel streaming decision trees
• Results as good as single-node trees, with scalability that improves with both the data size and the number of processors
• Ongoing work: a proof that the algorithm's output is only epsilon different from that of the standard decision tree algorithm
Thank You
• Thank You (English) · Gracias (Spanish) · Merci (French) · Danke (German) · Grazie (Italian) · Obrigado (Portuguese) · Kiitos (Finnish) · תודה / Toda (Hebrew) · plus thanks in Thai, Traditional and Simplified Chinese, Russian, Arabic, Japanese, Korean, and Danish