Parallel streaming decision trees
Yael Ben-Haim & Elad Yom-Tov
Presented by: Yossi Richter
Why decision trees?
• Simple classification model, short testing time
• Understandable by humans
• BUT: difficult to train on large data (need to sort each feature)
Previous work
• Presorting (SLIQ, 1996)
• Approximations (BOAT, 1999; CLOUDS, 1997)
• Parallel (e.g., SPRINT, 1996)
  • Vertical parallelism
  • Task parallelism
  • Hybrid parallelism
• Streaming
  • Minibatch (SPIES, 2003)
  • Statistics (pCLOUDS, 1999)
Iterative parallel decision tree
[Diagram: master/worker cycle over time]
• The master initializes the root of the tree
• Each worker builds histograms over its own portion of the data
• The master merges the workers' histograms and computes the node splits
• The cycle repeats until convergence
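As a rough illustration of this control flow only: the sketch below uses simple Counter-based stand-ins (histogram_of, merge_all, and best_split are hypothetical names, and tree growth is elided to one split per iteration); the actual streaming histograms are defined on the next slides.

```python
from collections import Counter

def histogram_of(shard, precision=1):
    # Stand-in worker summary: counts of rounded values.
    return Counter(round(x, precision) for x in shard)

def merge_all(histograms):
    merged = Counter()
    for h in histograms:
        merged.update(h)           # master: merge the workers' summaries
    return merged

def best_split(histogram):
    # Stand-in master decision: split at the approximate median.
    points = sorted(histogram.elements())
    return points[len(points) // 2]

shards = [[0.11, 0.52, 0.93, 0.35], [0.24, 0.58, 0.71, 0.66]]  # per-worker data
for level in range(3):                         # grow the tree level by level
    local = [histogram_of(s) for s in shards]  # workers build histograms
    split = best_split(merge_all(local))       # master merges, computes splits
    print(f"level {level}: split at {split}")
```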
Building an on-line histogram
• A histogram is a list of pairs (p1, m1) … (pn, mn)
• Initialize: c = 0, p = [ ], m = [ ]
• For each data point x:
  • If x == pj for some j <= c: mj = mj + 1
  • Otherwise:
    • Add a bin (x, 1) to the histogram
    • c = c + 1
    • If c > max_bins:
      • Merge the two closest bins in the histogram
      • c = max_bins
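A minimal Python sketch of this update procedure, assuming the bin list is kept sorted by value; the Histogram class name is illustrative, and the count-weighted merge rule is the one used in the authors' paper (the slide leaves the merged bin's value unspecified):

```python
import bisect

class Histogram:
    """Fixed-size approximate histogram: a sorted list of (value, count) bins."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [value, count] pairs

    def update(self, point):
        values = [v for v, _ in self.bins]
        i = bisect.bisect_left(values, point)
        if i < len(self.bins) and self.bins[i][0] == point:
            self.bins[i][1] += 1              # existing bin: bump its count
        else:
            self.bins.insert(i, [point, 1])   # new bin (point, 1), kept sorted
            if len(self.bins) > self.max_bins:
                self.merge_closest()          # shrink back to max_bins bins

    def merge_closest(self):
        # Fuse the adjacent pair with the smallest gap; the merged bin sits
        # at the count-weighted mean of the pair.
        i = min(range(len(self.bins) - 1),
                key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
        (v1, m1), (v2, m2) = self.bins[i], self.bins[i + 1]
        self.bins[i:i + 2] = [[(v1 * m1 + v2 * m2) / (m1 + m2), m1 + m2]]
```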
Merging two histograms
• Concatenate the two histogram lists, creating a list of length c
• Repeat until c <= max_bins:
  • Merge the two closest bins
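Continuing the sketch above, merging reuses the same closest-pair rule; the toy driver below simulates the master/worker cycle from the earlier diagram with two workers (merge_histograms is an illustrative name):

```python
import random

def merge_histograms(h1, h2, max_bins):
    """Concatenate two bin lists, then merge closest bins down to max_bins."""
    merged = Histogram(max_bins)
    merged.bins = sorted([list(b) for b in h1.bins + h2.bins])
    while len(merged.bins) > max_bins:
        merged.merge_closest()
    return merged

# Two "workers" summarize their local streams; the "master" merges them.
workers = [Histogram(50), Histogram(50)]
for h in workers:
    for _ in range(500):
        h.update(random.gauss(0.0, 1.0))
master = merge_histograms(workers[0], workers[1], max_bins=50)
assert len(master.bins) <= 50
```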
Example of the histogram
[Plot: a histogram with 50 bins built from 1,000 data points]
Pruning
• Taken from the MDL-based SLIQ algorithm
• Consists of two phases:
  • Tree construction
  • A bottom-up pass on the complete tree
• During tree construction, for each tree node set cleaf = 1 + (number of samples that reached the node and do not belong to the majority class)
• The bottom-up pass:
  • For each leaf, set cboth = cleaf
  • For each internal node for which cboth(left) and cboth(right) have been assigned, set cboth = 2 + cboth(left) + cboth(right)
• The subtree rooted at a node should be pruned when cleaf is small, i.e., when:
  • Only a few samples reach the node, or
  • A substantial portion of the samples that reach it belongs to the majority class
• If cleaf < cboth (i.e., the subtree does not contribute much information):
  • Prune the subtree
  • Set cboth = cleaf
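A minimal sketch of the bottom-up pass, on an illustrative Node type (not from the paper's code); c_leaf is assumed to have been set to 1 + (number of misclassified samples at the node) during tree construction:

```python
class Node:
    def __init__(self, c_leaf, left=None, right=None):
        self.c_leaf = c_leaf
        self.left, self.right = left, right

def prune(node):
    if node.left is None and node.right is None:   # leaf: c_both = c_leaf
        node.c_both = node.c_leaf
        return
    prune(node.left)                               # assign c_both bottom-up
    prune(node.right)
    node.c_both = 2 + node.left.c_both + node.right.c_both
    if node.c_leaf < node.c_both:                  # subtree adds little info:
        node.left = node.right = None              # prune it, turning the
        node.c_both = node.c_leaf                  # node into a leaf

# Example: a root whose children barely reduce the error gets collapsed.
root = Node(c_leaf=3, left=Node(c_leaf=2), right=Node(c_leaf=1))
prune(root)
assert root.left is None  # 3 < 2 + 2 + 1, so the subtree was pruned
```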
Shameless PR slide: the IBM Parallel Machine Learning toolbox
• A toolbox for conducting large-scale machine learning
• Supports architectures ranging from single machines with multiple cores to large distributed clusters
• Works by distributing the computations across multiple nodes
• Allows for rapid learning of very large datasets
• Includes state-of-the-art machine learning algorithms for:
  • Classification: support-vector machines (SVM), decision trees
  • Regression: linear and SVM
  • Clustering: k-means, fuzzy k-means, kernel k-means, Iclust
  • Feature reduction: principal component analysis (PCA) and kernel PCA
• Includes an API for adding algorithms
• Freely available from alphaWorks
• A joint project of the Haifa Machine Learning group and the Watson Data Analytics group
[Plot: K-means, Blue Gene]
Results: comparing single-node solvers
• Ten-fold cross-validation, unless a test/train partition exists
• No statistically significant difference between the solvers
Results: pruning
• 80% reduction in tree size
Speedup (strong scalability)
[Plots: speedup curves for Alpha and Beta]
• Speedup improves with data size!
Weak scalability
[Plots: weak scalability for Alpha and Beta]
• Scalability improves with the number of processors!
Summary
• An efficient new algorithm for parallel streaming decision trees
• Results as good as single-node trees, with scalability that improves with both the data size and the number of processors
• Ongoing work: a proof that the algorithm's output is only epsilon different from that of the standard decision tree algorithm
Thank You
• Thank You (English) · Gracias (Spanish) · Merci (French) · Danke (German) · Grazie (Italian) · Obrigado (Portuguese) · Kiitos (Finnish) · תודה / Toda (Hebrew) · plus thanks in Thai, Traditional and Simplified Chinese, Russian, Arabic, Japanese, Korean, and Danish