Scalable Regression Tree Learning on Hadoop using OpenPlanet
Wei Yin
Contributions • We implement OpenPlanet, an open-source implementation of the PLANET regression tree algorithm on the Hadoop MapReduce framework. • We tune and analyze the impact of two parameters, the HDFS block size and the threshold value for handing off between ExpandNode and InMemoryWeka tasks, to improve on OpenPlanet's default performance.
Motivation for large-scale Machine Learning • Models operate on large data sets • Large numbers of forecasting models must be trained • New data arrives constantly, imposing real-time training requirements
Regression Tree • A regression tree maps features → target variable (prediction) • The model is a binary tree structure • Each non-leaf node is a binary decision condition: one numeric or categorical feature sends an instance left or right in the tree • Leaf nodes contain a regression function or a single prediction value • Intuitive for domain users to understand: the effect of each feature is visible in the tree
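To make the structure concrete, here is a minimal Java sketch of prediction in such a tree. All class and field names here are illustrative, not OpenPlanet's actual API.

```java
// Minimal sketch of regression-tree prediction; names are hypothetical.
abstract class TreeNode {
    abstract double predict(double[] features);
}

class SplitNode extends TreeNode {
    int featureIndex;   // which feature the decision condition tests
    double threshold;   // numeric split point, e.g. f2 < 23
    TreeNode left, right;

    double predict(double[] features) {
        // route the instance left or right based on the decision condition
        return (features[featureIndex] < threshold)
                ? left.predict(features)
                : right.predict(features);
    }
}

class LeafNode extends TreeNode {
    double value;       // a single prediction value (a regression function in general)

    double predict(double[] features) {
        return value;
    }
}
```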
Google’s PLANET Algorithm • Uses distributed worker nodes, coordinated by a master node, to build the regression tree
OpenPlanet • Give an introduction about OpenPlanet • Introduce the differences between OpenPlanet and PLANET • Give specific re-implementation details (default ExpandNode/InMemWeka threshold value: 60,000)
Controller flow: • Start → Initialization → populate the queues • While the queues are NOT empty: • If MRExpandQueue is not empty: issue an MRInitial task, then an MRExpandNode task • If MRInMemQueue is not empty: issue an MRInMemGrow task • Update the model & populate the queues • End when both queues are empty
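The loop above can be summarized in a short Java sketch. The queue and task names follow the flowchart; the types and method bodies are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hedged sketch of the controller's scheduling loop, following the flowchart.
class Controller {
    final Queue<Integer> mrExpandQueue = new ArrayDeque<>(); // nodes expanded via MapReduce
    final Queue<Integer> mrInMemQueue  = new ArrayDeque<>(); // nodes small enough for Weka

    void run() {
        initialize(); // issue the MRInitial task and populate the queues
        while (!mrExpandQueue.isEmpty() || !mrInMemQueue.isEmpty()) {
            if (!mrExpandQueue.isEmpty()) {
                issueMRExpandNodeTask();        // preceded by MRInitial/InitHistogram
            }
            if (!mrInMemQueue.isEmpty()) {
                issueMRInMemGrowTask();         // train Weka models in memory
            }
            updateModelAndPopulateQueues();     // record splits, enqueue child nodes
        }
    }

    void initialize() { /* omitted */ }
    void issueMRExpandNodeTask() { /* omitted */ }
    void issueMRInMemGrowTask() { /* omitted */ }
    void updateModelAndPopulateQueues() { /* omitted */ }
}
```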
ModelFile • Contains the regression tree model, an update function, the current leaf nodes, etc. • An object that holds the regression model and supports the relevant operations, such as adding a node and checking node status. Advantages: • More convenient for updating the model and predicting target values than parsing an XML file • Loading and writing the model file == de-serializing and serializing a Java object
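A minimal sketch of the "load and write == serialize and de-serialize" point, assuming a Serializable ModelFile; the field layout is an assumption, not OpenPlanet's actual class.

```java
import java.io.*;

// Sketch: persisting the model is plain Java object (de)serialization.
class ModelFile implements Serializable {
    Object regressionTreeRoot;  // the regression tree model (node type omitted here)

    void write(String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(this);               // writing the model = serializing the object
        }
    }

    static ModelFile load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return (ModelFile) in.readObject();  // loading the model = de-serializing
        }
    }
}
```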
InitHistogram • A pre-processing step that finds candidate split points for ExpandNode tasks, so an ExpandNode task only needs to evaluate the points in the candidate set, without consulting other resources. • Numerical features (e.g., feat1, feat2): find a small set of candidate points from the huge data set at the expense of a small loss of accuracy; map tasks sample each block, and reduce tasks compute the boundaries of an equal-depth histogram (using Colt, a high-performance Java library). • Categorical features (e.g., feat3): all distinct values become candidates.
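Here is a hand-rolled sketch of the equal-depth (equi-frequency) histogram boundaries for one numeric feature, a stand-in for what InitHistogram's reducer computes from the mappers' samples (OpenPlanet itself uses Colt for this; the code below is an illustrative assumption).

```java
import java.util.Arrays;

// Equal-depth histogram boundaries as candidate split points for one numeric feature.
class EqualDepthHistogram {
    static double[] candidateSplits(double[] sampledValues, int numCandidates) {
        Arrays.sort(sampledValues);
        double[] boundaries = new double[numCandidates];
        for (int i = 0; i < numCandidates; i++) {
            // place boundaries so each bucket holds roughly the same number of samples
            int idx = (int) ((long) (i + 1) * sampledValues.length / (numCandidates + 1));
            boundaries[i] = sampledValues[Math.min(idx, sampledValues.length - 1)];
        }
        return boundaries;
    }
}
```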
ExpandNode • Input: a node and its data subset (e.g., node 3), plus the candidate split points. • Each map task scans its block and emits a locally optimal split point (e.g., sp1 = 23 in feature 2, or sp2 = 26 from another block); the reduce tasks compare the local optima and output the globally optimal split point (here sp1, value = 23). • The controller then updates the expanding node, e.g., node 3 becomes the decision f2 < 23.
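The slides do not spell out the split-scoring objective. A common PLANET-style choice is to minimize the summed squared error (SSE) of the two children, sketched below under that assumption. The identity sum((y − mean)²) = sumSq − sum²/n means each mapper only needs to emit (n, sum, sumSq) per side, which reducers can add up and compare.

```java
// Hypothetical split score: lower total SSE over the two children is better.
class SplitScore {
    static double splitError(long nL, double sumL, double sumSqL,
                             long nR, double sumR, double sumSqR) {
        double errL = sumSqL - (sumL * sumL) / nL;  // SSE of the left child
        double errR = sumSqR - (sumR * sumR) / nR;  // SSE of the right child
        return errL + errR;                         // the best split minimizes this
    }
}
```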
MRInMemWeka • Input: nodes whose subsets fit in memory, e.g., node 4 and node 5. • Map tasks route each node's instances from the blocks to a reducer; each reducer trains a Weka model for its node in memory. • The controller updates the tree nodes with the locations of the trained Weka models.
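A hedged sketch of what a single MRInMemWeka reducer does for one small node: train a Weka regression tree on the node's subset in memory, then write the model out so the controller can record its location. The slides do not name the Weka learner or file paths; REPTree is one Weka tree that supports numeric targets, and the paths here are hypothetical.

```java
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

// In-memory training of one node's subset with Weka (assumed learner: REPTree).
public class InMemGrow {
    public static void main(String[] args) throws Exception {
        Instances subset = DataSource.read("node4-subset.arff"); // hypothetical path
        subset.setClassIndex(subset.numAttributes() - 1);        // target variable
        REPTree model = new REPTree();
        model.buildClassifier(subset);                           // in-memory training
        SerializationHelper.write("node4-model.bin", model);     // location reported to controller
    }
}
```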
Distinctions between OpenPlanet and PLANET: • Sampling MapReduce method: InitHistogram • Broadcast (BC) function • Hybrid model: Weka models at the leaves
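One way the broadcast (BC) step can be realized on Hadoop 0.20/1.x is to ship the current model file (or candidate split points) to every mapper via the DistributedCache. Whether OpenPlanet uses exactly this mechanism is an assumption, and the HDFS path is hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

// Assumed broadcast mechanism: push a shared file to all mappers via DistributedCache.
class Broadcast {
    static void broadcastModel(Configuration conf) throws Exception {
        DistributedCache.addCacheFile(new URI("hdfs:///openplanet/modelfile.bin"), conf);
        // each mapper then reads its local cached copy in configure()/setup()
    }
}
```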
Performance Analysis and Tuning Method • Parallel performance for OpenPlanet with default settings; baselines for Weka, MATLAB, and OpenPlanet on a single machine. Questions: 1. For the 17-million-instance data set, there is very little difference between the 2x8 case and the 8x8 case. 2. Not much performance improvement overall, especially compared to the Weka baseline: only a 1.58x speed-up on the 17M data set, and no speed-up on small data sets (where Weka incurs no memory overhead).
Question 1: Why is performance similar between the 2x8 case and the 8x8 case?
Answer for Question 1: • In HDFS, the basic unit is the block (64 MB by default) • Each map instance processes one block at a time • Therefore, with N blocks, only N map instances can run in parallel. For our problem: • Size of training data: 17 million instances = 842 MB • Default block size = 64 MB • Number of blocks ≈ 842/64 ≈ 13 • For the 2x8 case, 13 maps run in parallel: utilization = 13/16 = 81% • For the 8x8 case, 13 maps run in parallel: utilization = 13/64 = 20% • In both cases only 13 maps run in parallel, which explains the similar performance. Solution: tune the block size so that the number of blocks ≥ the number of computing cores (see the sketch below). • What if the number of blocks >> the number of computing cores? That does not necessarily improve performance further, because network bandwidth becomes the limitation.
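A sketch of this tuning rule in code, using the slide's numbers. The property key "dfs.block.size" is the Hadoop 0.20/1.x-era name and applies to files written under this configuration; the class and variable names are illustrative.

```java
import org.apache.hadoop.conf.Configuration;

// Pick a block size so the number of blocks roughly matches the map slots.
public class BlockSizeTuning {
    public static void main(String[] args) {
        long dataMB = 842;                           // 17M-instance training set
        int cores = 8 * 8;                           // 8 nodes x 8 cores
        long blockMB = Math.max(1, dataMB / cores);  // ~13 MB -> ~64 blocks
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", blockMB * 1024 * 1024);
        long blocks = dataMB / blockMB;
        System.out.printf("blocks=%d, utilization=%.0f%%%n",
                blocks, 100.0 * Math.min(1.0, (double) blocks / cores));
    }
}
```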
Question 2: • Weka works better when there is no memory overhead • Observation from the chart: what if we balance the time spent in the two phases while still avoiding memory overhead in Weka? Solution: increase the threshold value for switching between ExpandNode tasks and InMemWeka (see the sketch below). By experiment, when the JVM for a reducer instance is 1 GB, the maximum workable threshold value is 2,000,000.
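A sketch of the handoff rule this threshold controls: node subsets at or below the threshold go to the in-memory Weka queue, larger ones to the MapReduce expand queue. The 2,000,000 figure is the tuned value from the slide (for a 1 GB reducer JVM); the scheduler class itself is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical scheduler showing the ExpandNode / InMemWeka handoff.
class NodeScheduler {
    static final long IN_MEMORY_THRESHOLD = 2000000L; // tuned value, 1 GB reducer JVM

    final Queue<long[]> mrExpandQueue = new ArrayDeque<>(); // entries: {nodeId, numInstances}
    final Queue<long[]> mrInMemQueue  = new ArrayDeque<>();

    void schedule(long nodeId, long numInstances) {
        if (numInstances <= IN_MEMORY_THRESHOLD) {
            mrInMemQueue.add(new long[]{nodeId, numInstances});  // InMemWeka task
        } else {
            mrExpandQueue.add(new long[]{nodeId, numInstances}); // ExpandNode task
        }
    }
}
```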
Performance Improvement • Total running time: 1,835,430 sec vs. 4,300,457 sec (tuned vs. default) • The two phases are balanced and the iteration count decreased • Speed-up = 4,300,457 / 1,835,430 = 2.34 • AVG total speed-up on the 17M data set using 8x8 cores: 4.93x over Weka, 14.3x over MATLAB • AVG Accuracy (CV-RMSE):
Summary: • OpenPlanet is an open-source implementation of the PLANET regression tree algorithm on the Hadoop MapReduce framework. • We tune and analyze the impact of parameters such as the HDFS block size and the threshold for the in-memory handoff to improve OpenPlanet's default performance. Future work: • Parallel execution of MRExpand and MRInMemWeka within each iteration • Issuing multiple OpenPlanet instances for different usages, which increases slot utilization • Optimal block size selection • Real-time model training methods • Moving to a Cloud platform and analyzing the resulting performance