Large-Scale Machine Learning Program for Energy Prediction
CEI Smart Grid
Wei Yin
Motivation • Why do we need a large-scale machine learning program? • The workload is large-data-processing-oriented: • a. city-scale energy prediction • b. the data is frequently updated • Memory on a single machine becomes the bottleneck
Solution • Process in parallel on a distributed system, e.g. a cluster or the cloud • Available and robust tool: Hadoop MapReduce
MapReduce • Parallelism • Data locality
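To make the paradigm concrete, here is a minimal sketch of the map/shuffle/reduce flow in plain Python; the function names and the energy-reading example are illustrative, not part of the Hadoop API:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the map function to every record; collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the reduce function independently to each key's group."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Example: per-building energy totals from (building, kWh) readings.
readings = [("A", 1.5), ("B", 2.0), ("A", 0.5), ("B", 1.0), ("B", 3.0)]
pairs = map_phase(readings, lambda r: [(r[0], r[1])])
totals = reduce_phase(shuffle(pairs), lambda k, vs: sum(vs))
# totals == {"A": 2.0, "B": 6.0}
```

Because each map call touches one record and each reduce call touches one key, both phases parallelize trivially, and Hadoop schedules map tasks on the machines that already hold the data (data locality).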
Regression Tree • Supervised learning algorithm: maps features to a target variable • The model is a binary tree • Each non-leaf node is a binary test with a decision condition (one feature, numeric or categorical) that sends a record to the left or right child • Leaf nodes contain the prediction value
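The structure above can be sketched as a small node class plus a root-to-leaf traversal; the class layout and the hour-of-day example tree are illustrative assumptions, not the document's actual implementation:

```python
class Node:
    def __init__(self, feature=None, threshold=None, categories=None,
                 left=None, right=None, value=None):
        self.feature = feature        # index of the feature tested at this node
        self.threshold = threshold    # numeric split: go left if x[feature] < threshold
        self.categories = categories  # categorical split: go left if x[feature] in categories
        self.left, self.right = left, right
        self.value = value            # prediction stored at a leaf

def predict(node, x):
    """Walk from the root to a leaf, applying each node's decision condition."""
    while node.value is None:                 # non-leaf: apply the binary test
        if node.threshold is not None:
            go_left = x[node.feature] < node.threshold
        else:
            go_left = x[node.feature] in node.categories
        node = node.left if go_left else node.right
    return node.value                         # leaf: return the stored prediction

# Tiny hand-built tree: split on hour of day, then on day of week.
tree = Node(feature=0, threshold=12,
            left=Node(value=3.0),
            right=Node(feature=1, categories={"Sat", "Sun"},
                       left=Node(value=5.0), right=Node(value=8.0)))
# predict(tree, [9, "Mon"]) == 3.0 ; predict(tree, [15, "Sun"]) == 5.0
```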
More details • Training data types: numerical variables; categorical variables, e.g. {M, T, W, Th, F, Sat, Sun}; a record consists of both variable types • Evaluation function when training: Max{ |D| × Var(D) − [ |DL| × Var(DL) + |DR| × Var(DR) ] } • Train the model in summation form: parallelizable with MapReduce
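The "summation form" matters because |D| × Var(D) can be computed from just three sufficient statistics per side of a split (count, sum, sum of squares), and triples from different mappers simply add together. A minimal sketch, with hypothetical helper names:

```python
def sse(n, s, sq):
    """|D| * Var(D): sum of squared deviations, from the sufficient statistics
    n = count, s = sum of y, sq = sum of y^2."""
    return sq - s * s / n

def split_gain(left, right):
    """Variance reduction |D|Var(D) - [|DL|Var(DL) + |DR|Var(DR)] of a split.
    Each side is a (count, sum, sum_of_squares) triple, so partial triples
    computed independently by mappers can be summed before scoring."""
    nL, sL, qL = left
    nR, sR, qR = right
    n, s, q = nL + nR, sL + sR, qL + qR
    return sse(n, s, q) - (sse(nL, sL, qL) + sse(nR, sR, qR))

def stats(ys):
    """Sufficient statistics of a list of target values."""
    return (len(ys), sum(ys), sum(y * y for y in ys))

# A perfect split of targets {1, 1, 5, 5} explains all the variance.
gain = split_gain(stats([1, 1]), stats([5, 5]))
# gain == 16.0  (total SSE is 16, each side's SSE is 0)
```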
PLANET • A MapReduce program for training regression tree models • Used to build models from massive datasets • Runs on a distributed file system, e.g. HDFS deployed in the cloud • Basic idea: • Distribute the data evenly across the nodes in the cloud and process each data set in parallel • Large data set: find a single split point per pass • Small data set: build the whole sub regression tree in memory
Controller • Controls the entire process • Checks the current tree status • Issues MapReduce jobs • Collects results from MapReduce jobs and chooses the best split for each leaf node • Updates the model
Model File • A file representing the model's current status
How to deal with large data efficiently? • A huge data set D* (> 1 TB) in HDFS • Several numerical features, where every value of a feature is a potential split point • Trade off performance against accuracy • Reduce each numerical feature's candidate set • Need a pre-filter task
MR_Initialization Task • Finds a comparably small set of candidate split points per numerical feature from the huge data set, at the expense of a small loss in accuracy
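The pre-filter is an equi-depth histogram: bucket boundaries that each cover roughly the same number of records become the candidate thresholds. A minimal single-machine sketch (PLANET computes the same histogram in a preliminary MapReduce pass; the function name is illustrative):

```python
import random

def equi_depth_candidates(values, num_buckets):
    """Approximate quantile boundaries of a numerical feature. Each bucket
    holds ~ the same number of records, so the boundaries serve as the
    reduced candidate set of split points."""
    ordered = sorted(values)
    n = len(ordered)
    step = max(1, n // num_buckets)
    # take every step-th value as a boundary, skipping the minimum
    return sorted({ordered[i] for i in range(step, n, step)})

# 1000 raw readings collapse to ~9 candidate thresholds.
random.seed(0)
readings = [random.uniform(0, 100) for _ in range(1000)]
candidates = equi_depth_candidates(readings, 10)
# len(candidates) == 9
```

Only these boundaries are evaluated as split points in later passes, which is where the small accuracy loss (and the large speedup) comes from.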
MR_InMemoryGrow • Used for data sets small enough to be processed efficiently on a single machine
Controller
(1) Initialization: check the tree status from the ModelFile
(2) Receive all sub-datasets; put large sets into MRQueue and small sets into InMemoryQueue
(3) Dequeue from MRQueue
 <a> Issue an MR_Initialization task to find the candidate set of best split points for each node
 <b> Receive each node's candidate set
 <c> Issue an MR_FindBestSplit task with the needed parameters
 <d> Receive all reducers' output files containing their local best split points for each node
 <e> Scan all reducers' output files and pick the best point for each node
 <f> Update the ModelFile
(4) Dequeue the current sub-dataset from InMemoryQueue
 <a> Issue an MR_InMemoryGrow task with the needed parameters
 <b> Receive the trained Weka regression tree model
 <c> Update the ModelFile
(5) Go back to step (1) until the model is fully built
(6) When finished, output the ModelFile (containing the Weka model) as the final regression tree model

Components (from the architecture diagram):
• MapReduce Initialization Task, inputs { ModelFile, Total Dataset }: (1) build an equi-depth histogram; (2) return candidates for the best split point
• MapReduce ExpandNode (FindBestSplit) Task, inputs { ModelFile, Candidate Set, Processed sub-dataset, Total Dataset }: Map filters already-processed data out of the repository, calculates the necessary statistics, and emits them to the reducers; Reduce calculates the best split point for each node and outputs its local result
• MapReduce InMemoryGrow Task, inputs { ModelFile, sub-dataset, Total Dataset }: Map filters already-processed data out of the repository and emits it to the reducers; Reduce receives all the needed data, calls the Weka training program, and returns the regression tree model
• Data Repository (file or DB): returns all data
• Model File: contains the latest tree status; returns the model's latest status
• Controller: issues the MR_Initialization task, asks the ModelFile for the current tree status, checks the ModelFile, fetches data from the Data Repository, and receives information about all sub-datasets that still need to be processed
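The control loop above can be sketched as a short Python driver. Everything here is a stand-in: the job-launching callables, the ModelFile methods (`update_splits`, `attach_subtree`), and the size threshold are hypothetical, not Hadoop or PLANET APIs, and the child-node re-enqueueing step is omitted for brevity:

```python
from collections import deque

IN_MEMORY_THRESHOLD = 100_000   # records; assumed cutoff between the two paths

def controller(model_file, datasets, run_find_best_split, run_in_memory_grow):
    """Sketch of the controller loop: large sub-datasets go through the
    MapReduce split-finding path, small ones are grown in memory."""
    mr_queue, mem_queue = deque(), deque()
    for ds in datasets:
        (mem_queue if len(ds) <= IN_MEMORY_THRESHOLD else mr_queue).append(ds)
    while mr_queue or mem_queue:
        if mr_queue:
            ds = mr_queue.popleft()
            best_splits = run_find_best_split(model_file, ds)   # MR jobs <a>-<e>
            model_file.update_splits(best_splits)               # step <f>
        if mem_queue:
            ds = mem_queue.popleft()
            subtree = run_in_memory_grow(model_file, ds)        # single-machine build
            model_file.attach_subtree(subtree)
    return model_file   # final regression tree model
```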
Summary and future work • A scalable machine learning program built on the MapReduce framework • Implemented PLANET to build regression trees over large data sets • Integrated five components • Future work: add a bookkeeping algorithm to PLANET to improve its performance, if necessary
Big thanks to Yogesh, Professor Prasanna, and all my colleagues