Large-Scale Machine Learning Program For Energy Prediction CEI Smart Grid Wei Yin


Presentation Transcript


  1. Large-Scale Machine Learning Program For Energy Prediction CEI Smart Grid Wei Yin

  2. Motivation • Why do we need a large-scale machine learning program? • Large-data-processing oriented: • a. City energy prediction • b. Frequently updated • Memory limitation is the bottleneck

  3. Solution • Process in parallel on a distributed system, e.g. a cluster or the cloud • Available and robust tool: Hadoop MapReduce

  4. MapReduce • Parallelism • Data locality

  5. Regression Tree • Classification algorithm (features, target variable) • Classifier using a BST structure • Each non-leaf node is a binary classifier with a decision condition (on one feature, numeric or categorical) that sends a record to the left or right side • Leaf nodes contain the prediction value
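The structure on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: the `Node` fields, the `predict` helper, and the example features (`day`, `temp`) and leaf values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One tree node: either a decision node or a leaf (hypothetical layout)."""
    feature: Optional[str] = None             # feature tested at a decision node
    threshold: Optional[float] = None         # numeric test: go left if value <= threshold
    left_values: Optional[frozenset] = None   # categorical test: go left if value in this set
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prediction: Optional[float] = None        # set only on leaf nodes


def predict(node: Node, record: dict) -> float:
    """Walk from the root to a leaf, going left or right at each decision node."""
    while node.prediction is None:
        value = record[node.feature]
        if node.threshold is not None:
            go_left = value <= node.threshold
        else:
            go_left = value in node.left_values
        node = node.left if go_left else node.right
    return node.prediction


# Made-up example tree: weekend vs. weekday first, then a temperature threshold.
tree = Node(
    feature="day",
    left_values=frozenset({"Sat", "Sun"}),
    left=Node(prediction=80.0),
    right=Node(
        feature="temp",
        threshold=25.0,
        left=Node(prediction=100.0),
        right=Node(prediction=140.0),
    ),
)

print(predict(tree, {"day": "Sat", "temp": 30.0}))
```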

  6. More details • Training data types: numerical variables and categorical variables, e.g. {M, T, W, Th, F, Sat, Sun} • A record consists of variables of these two types • Evaluation function when training: max { |D| × Var(D) − [ |D_L| × Var(D_L) + |D_R| × Var(D_R) ] }, where D is the data at a node and D_L, D_R are the left and right subsets produced by a candidate split • Train the model in summation form: parallelizable with MapReduce
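To make the evaluation function concrete, here is a small single-machine sketch that scores every threshold of one numeric feature by |D| × Var(D) − [ |D_L| × Var(D_L) + |D_R| × Var(D_R) ] and keeps the maximum. The function names and the toy data are assumptions for illustration only.

```python
def variance(values):
    """Population variance; defined as 0.0 for an empty list."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def split_gain(left, right):
    """|D| * Var(D) - (|D_L| * Var(D_L) + |D_R| * Var(D_R)) for one split."""
    parent = left + right
    return len(parent) * variance(parent) - (
        len(left) * variance(left) + len(right) * variance(right)
    )


def best_numeric_split(xs, ys):
    """Try each distinct feature value (except the largest) as threshold x <= t."""
    best_threshold, best_gain = None, float("-inf")
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = split_gain(left, right)
        if gain > best_gain:
            best_threshold, best_gain = t, gain
    return best_threshold, best_gain


# Toy data: the target jumps between x = 2 and x = 3.
xs = [1, 2, 3, 4]
ys = [10.0, 10.0, 20.0, 20.0]
print(best_numeric_split(xs, ys))
```

On this toy data the split `x <= 2` separates the two target levels perfectly, so both child variances are zero and the gain equals |D| × Var(D).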

  7. PLANET • MapReduce program for training regression tree models • Used to build models from massive datasets • Runs on a distributed file system, e.g. HDFS deployed on the cloud • Basic idea: • Distribute data equally across the nodes in the cloud and process each data set in parallel • Large data set: find a single split point • Small data set: build the sub regression tree
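The large/small distinction in the basic idea can be sketched as a routing step. The size limit, the node identifiers, and the function name below are hypothetical. The slides do not say how "small" is decided, so this sketch assumes a simple record-count threshold.

```python
from collections import deque


def route_nodes(node_sizes, in_memory_limit):
    """Send each tree node's data set to one of two work queues by record count.

    node_sizes: mapping of node id -> number of records reaching that node.
    in_memory_limit: assumed cutoff below which one machine can build the subtree.
    """
    mr_queue = deque()         # large sets: find a single split point with MapReduce
    in_memory_queue = deque()  # small sets: build the whole sub regression tree at once
    for node_id, n_records in node_sizes.items():
        if n_records <= in_memory_limit:
            in_memory_queue.append(node_id)
        else:
            mr_queue.append(node_id)
    return mr_queue, in_memory_queue


mr_queue, in_memory_queue = route_nodes(
    {"root.L": 5_000_000, "root.R": 20_000}, in_memory_limit=100_000
)
print(list(mr_queue), list(in_memory_queue))
```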

  8. PLANET

  9. Controller • Controls the entire process • Checks the current tree status • Issues MapReduce jobs • Collects results from MapReduce jobs and chooses the best split for each leaf node • Updates the model

  10. Model File • A file representing the model’s current status • Details

  11. How to deal with large data efficiently? • A huge data set D* (> 1 TB) in HDFS • Several numerical features, and each value of a feature is a potential split point • Trade-off between performance and accuracy • Reduce the number of candidate values per numerical feature • Needs a pre-filter task

  12. MR_Initialization Task • Find a comparatively small set of candidate split points for each numerical feature from the huge data set, at the expense of a small loss in accuracy
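A later slide states that this task builds an equi-depth histogram and returns the candidate split points. The single-machine sketch below shows that idea: sort the values, cut them into buckets holding roughly equal numbers of records, and use the bucket boundaries as candidates. The function name and bucket count are assumptions, and a real MapReduce job would build the histogram in a distributed way rather than sorting everything in memory.

```python
def equi_depth_candidates(values, num_buckets):
    """Return bucket boundaries of an equi-depth histogram as candidate thresholds.

    Each candidate t stands for the split "value <= t". Duplicate boundaries,
    which appear when one value fills several buckets, are emitted only once.
    """
    ordered = sorted(values)
    n = len(ordered)
    candidates = []
    for b in range(1, num_buckets):
        end = (b * n) // num_buckets   # number of records in the first b buckets
        if end == 0:
            continue                   # fewer records than buckets: nothing to cut yet
        boundary = ordered[end - 1]    # largest value in bucket b
        if not candidates or boundary != candidates[-1]:
            candidates.append(boundary)
    return candidates


# 12 distinct values and 4 buckets give 3 candidates instead of 11 possible thresholds.
values = [7, 3, 12, 1, 9, 5, 10, 2, 8, 4, 11, 6]
print(equi_depth_candidates(values, num_buckets=4))
```

Fewer buckets mean fewer candidates to evaluate in the later split-finding job, at the cost of possibly missing the exact best threshold.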

  13. MapReduce_Expand Task

  14. MR_InMemoryGrow • Used for small data sets, which can be processed efficiently by a single computer

  15. Controller
      • Initialization: check the tree status from ModelFile
      • Receive all sub-datasets; put large sets into MRQueue and small sets into InMemoryQueue
      • Dequeue from MRQueue
        <a> Issue the MR_Initialization task to find the candidate set of best split points for each node
        <b> Receive each node’s candidate set
        <c> Issue the MR_FindBestSplit task with the needed parameters
        <d> Receive all reducers’ output files containing their own best split points for each node
        <e> Scan all reducers’ output files and find the best point for each node
        <f> Update the ModelFile
      • Dequeue the current sub-dataset from InMemoryQueue
        <a> Issue the MR_InMemoryGrow task with the needed parameters
        <b> Receive the trained weka regression tree model
        <c> Update the ModelFile
      • Back to step (1) until the model is built
      • When finished, output the ModelFile (containing the weka model) as the final regression tree model
      MapReduce Initialization Task • (1) Build equi-depth histogram • (2) Return candidates of best split point
      MapReduceExpandNodeTask • Map: filter out processed data from the repository, calculate the necessary information and emit it to the reducer • Reducer: calculate the best split point for its data and output its local result
      MapReduceInMemoryGrowTask • Map: filter out processed data from the repository and emit it to the reducer • Reducer: receive all needed data, call the weka training program and output the RegTree model
      Diagram labels: Issue MR_Initialization task • Ask ModelFile about the tree’s current status: (1) check ModelFile, (2) fetch data from the Data Repository • {ModelFile, Candidate Set, Processed sub-dataset, Total Dataset} • {ModelFile, sub-dataset, Total dataset} • Return weka RegTree model • Return information on all sub-datasets that need to be processed • Data Repository (file or DB) • Model File (contains the latest tree status) • Return all data • Return the model’s latest status
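The map and reduce roles of the expand task can be simulated in plain Python to show why the summation form parallelizes. Each data partition only has to report a count, a sum, and a sum of squares of the target per candidate split, and those partial sums add up across partitions. This is a sketch under assumptions, not the presenter's Hadoop code: it handles one numeric feature at one tree node, uses a single reducer, and all function names are hypothetical.

```python
from collections import defaultdict

TOTAL = ("total",)  # key for statistics over all records at the node


def map_partition(records, candidates):
    """Map side: accumulate [count, sum_y, sum_y_squared] per candidate split.

    records: iterable of (x, y) pairs from one data partition.
    candidates: thresholds t, each standing for the split "x <= t".
    """
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for x, y in records:
        keys = [TOTAL] + [("left", t) for t in candidates if x <= t]
        for key in keys:
            entry = stats[key]
            entry[0] += 1
            entry[1] += y
            entry[2] += y * y
    return dict(stats)


def size_times_variance(count, s, s2):
    """|D| * Var(D) computed from sums only: sum(y^2) - (sum(y))^2 / |D|."""
    return s2 - s * s / count if count else 0.0


def reduce_best_split(partials, candidates):
    """Reduce side: add up the partial sums, then score every candidate split."""
    merged = defaultdict(lambda: [0, 0.0, 0.0])
    for part in partials:
        for key, (c, s, s2) in part.items():
            entry = merged[key]
            entry[0] += c
            entry[1] += s
            entry[2] += s2
    tc, ts, ts2 = merged[TOTAL]
    best_threshold, best_gain = None, float("-inf")
    for t in candidates:
        lc, ls, ls2 = merged[("left", t)]
        rc, rs, rs2 = tc - lc, ts - ls, ts2 - ls2  # right side = total - left side
        if lc == 0 or rc == 0:
            continue  # a split that leaves one side empty is not useful
        gain = size_times_variance(tc, ts, ts2) - (
            size_times_variance(lc, ls, ls2) + size_times_variance(rc, rs, rs2)
        )
        if gain > best_gain:
            best_threshold, best_gain = t, gain
    return best_threshold, best_gain


# Two partitions stand in for data blocks stored on different machines.
partitions = [[(1, 10.0), (2, 10.0)], [(3, 20.0), (4, 20.0)]]
candidates = [1, 2, 3]
partials = [map_partition(p, candidates) for p in partitions]
print(reduce_best_split(partials, candidates))
```

In the flow described on the slide, each reducer writes its own local best split and the controller scans all reducer outputs to pick the overall best one per node. The single `reduce_best_split` call here collapses those two steps for brevity.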

  16. MR_InMemoryGrow Energy Prediction Result

  17. Summary and future work • Scalable machine learning program via the MapReduce framework • Implemented PLANET to build regression trees from large data sets • Integrated five components • Try adding a bookkeeping algorithm to PLANET to improve its performance if necessary

  18. Big thanks to Yogesh, Professor Prasanna, and all my colleagues
