
Reduce Instrumentation Predictors Using Random Forests




  1. Reduce Instrumentation Predictors Using Random Forests • Presented by Bin Zhao • Department of Computer Science, University of Maryland • May 3, 2005

  2. Motivation • Crash report – program information is collected only once the program crashes, which is too late • Testing – large number of test cases. Can we focus on the failing cases?

  3. Motivation – failure prediction • Instrument the program to monitor its behavior • Predict whether the program is going to fail • Collect program data if the program is predicted to be likely to fail • Stop running the test if the test program is not likely to fail

  4. The problem • Large number of instrumentation predictors • Which instrumentation predictors should be picked?

  5. The questions to answer • Can a good model be found for predicting failing runs based on all available data? • Can an equally good model be created based on a random selection of k% of the predictors?

  6. Experiment • Instrumentation on a calculator program • 295 predictors • Instrumentation data collected every 50 milliseconds • 100 runs – 81 successes, 19 failures • Reduced predictor counts: 275, 250, 225, 200, 175, 150, 125, 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10

  7. Sample data
     • Pass
       Run  Res   Rec  0x40bda0  DataItem3  MSF-0x40bda0  MSF-DataItem3
       1    pass  1    3244      0          3244          0
       1    pass  2    3206      0          3206          0
       1    pass  3    3232      0          3232          0
       1    pass  4    3203      0          3203          0
       1    pass  5    3243      0          3243          0
     • Failure
       Run  Res   Rec  0x40bda0  DataItem3  MSF-0x40bda0  MSF-DataItem3
       10   fail  1    3200      0          3200          0
       10   fail  2    3200      0          3200          0
       10   fail  3    3251      0          3251          0
       10   fail  4    3251      0          3251          0
       10   fail  5    3248      0          3248          0

  8. Background – Random Forests • Many classification trees • Each tree gives a classification – vote • The classification is chosen by the most votes
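The voting step on this slide can be sketched in a few lines. This is a minimal illustration, not the author's code: the function name `forest_vote` and the list of votes are hypothetical, and ties are broken by whichever class was seen first, which is an assumption of this sketch.

```python
from collections import Counter

def forest_vote(tree_votes):
    """Return the class that received the most votes from the trees.

    Ties go to the class that appears first in tree_votes (a choice made
    for this sketch only).
    """
    counts = Counter(tree_votes)
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical votes from seven trees for a single test run.
votes = ["pass", "fail", "pass", "pass", "fail", "pass", "pass"]
print(forest_vote(votes))
```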

  9. Background – Random Forests • Need a training set to grow the forest • M predictors (mtry) are randomly selected at each node as candidates for splitting the node • About one-third of the training data is left out of each tree's sample (out-of-bag, oob) and is used to estimate the error
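The "one-third" figure for out-of-bag data can be checked with a small simulation, assuming each tree is grown on a bootstrap sample (drawn with replacement, same size as the training set). The record count of 5000 and the seed are arbitrary choices for this sketch. The expected out-of-bag fraction under that assumption is (1 - 1/n)^n, which approaches 1/e, roughly 0.37.

```python
import random

def oob_fraction(n_records, rng):
    """Draw one bootstrap sample of size n_records (with replacement) and
    return the fraction of records that were never drawn (out-of-bag)."""
    drawn = {rng.randrange(n_records) for _ in range(n_records)}
    return 1 - len(drawn) / n_records

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
fractions = [oob_fraction(5000, rng) for _ in range(20)]  # 20 simulated trees
mean_oob = sum(fractions) / len(fractions)
print(round(mean_oob, 3))
```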

  10. Background – Random Forests • To classify a test run as pass or fail • Sample model estimation
      OOB error rate: 0.0044
              fail   pass   class.error
      fail    933    17     0.0178947368421053
      pass    5      4045   0.00123456790123455
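The error figures in the sample estimation can be reproduced from the four counts. The sketch below assumes the usual layout, where rows are the true classes and columns are the predicted classes; the function names are hypothetical.

```python
# Counts from the slide. Assumed layout: outer key = true class,
# inner key = predicted class.
confusion = {
    "fail": {"fail": 933, "pass": 17},
    "pass": {"fail": 5, "pass": 4045},
}

def class_error(confusion, label):
    """Fraction of records of the given true class that were misclassified."""
    row = confusion[label]
    wrong = sum(count for predicted, count in row.items() if predicted != label)
    return wrong / sum(row.values())

def overall_error(confusion):
    """Fraction of all records that were misclassified."""
    total = sum(sum(row.values()) for row in confusion.values())
    wrong = sum(count
                for label, row in confusion.items()
                for predicted, count in row.items()
                if predicted != label)
    return wrong / total

fail_error = class_error(confusion, "fail")   # 17 / 950
pass_error = class_error(confusion, "pass")   # 5 / 4050
oob_error = overall_error(confusion)          # 22 / 5000
print(fail_error, pass_error, oob_error)
```

The fail class error is more than ten times the pass class error even though the overall rate is small, which is why the later slides report the fail error rate separately.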

  11. Background – R • Software for data manipulation, analysis and calculation • Provides scripting capability • Provides an implementation of Random Forests

  12. Experiment steps • Determine which slice of the data to use for modeling and testing • Find which parameters (ntree, mtry) affect the model • Find the optimal parameter values for all the random models • Build the random models by randomly picking N predictors • Verify the random models by prediction

  13. Find the good data

  14. Influential parameters in Random Forest • Two possible parameters – ntree and mtry • Build models by fixing either ntree or mtry and varying the other • ntree: 200 – 1000 • mtry: 10 – 295 • Only mtry matters

  15. Optimal mtry • Need to decide the optimal mtry for each number of predictors (N) • The default mtry is the square root of N • For different numbers of predictors (295 – 10): N/2 – 3N
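For reference, the default mtry for each predictor count used in the experiment can be tabulated as follows. The slide only says "square root of N"; rounding down to an integer is an assumption of this sketch, and `default_mtry` is a hypothetical helper name.

```python
import math

# Predictor counts from the experiment slide, starting with the full set.
predictor_counts = [295, 275, 250, 225, 200, 175, 150, 125, 100,
                    90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10]

def default_mtry(n_predictors):
    """Square root of the predictor count, rounded down (assumed), at least 1."""
    return max(1, math.isqrt(n_predictors))

defaults = {n: default_mtry(n) for n in predictor_counts}
for n in predictor_counts:
    print(n, defaults[n])
```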

  16. Random model • Randomly pick the predictors from the full set of predictors • Generate 5 sets of data for each number of predictors • Use the 5 sets of data to build the random forest models and average the results
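The random selection step might look like the sketch below. The predictor names (`P0` to `P294`), the seed, and the function name are placeholders, not taken from the experiment; only the counts (295 predictors, 5 sets per size) come from the slides.

```python
import random

def random_predictor_sets(predictors, n, n_sets=5, seed=0):
    """Return n_sets independent random subsets, each holding n distinct predictors."""
    rng = random.Random(seed)
    return [rng.sample(predictors, n) for _ in range(n_sets)]

# Placeholder names standing in for the 295 real predictors.
all_predictors = [f"P{i}" for i in range(295)]

sets_of_50 = random_predictor_sets(all_predictors, 50)
for subset in sets_of_50:
    print(len(subset), subset[:3])
```

Each subset would then be used to train one forest, and the five error rates averaged, as the slide describes.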

  17. Random prediction • For each trained random forest, do prediction on a totally different set of test data (records 401 – 450)

  18. Random Prediction Result

  19. Analysis of the random model • Why not linear?

  20. Important predictors • Random Forests can assign an importance to each predictor – the number of correct votes involving the predictor • Top 20 important predictors: DataItem11, RT-DataItem11, PC-DataItem11, MSF-DataItem11, AC-DataItem11, RT-DataItem9, RT-DataItem6, PC-DataItem6, AC-DataItem6, MSF-DataItem9, MSF-DataItem6, PC-DataItem9, DataItem9, AC-DataItem9, DataItem6, DataItem12, MSF-DataItem12, AC-DataItem12, RT-DataItem12, PC-DataItem12

  21. Top model • Pick the top important predictors from the full set of the predictors to build the model (top 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10)
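Selecting the top predictors amounts to ranking by importance and keeping the first k. In the sketch below the importance scores are made up for illustration, since the slides list the ranking but not the values, and `top_predictors` is a hypothetical helper.

```python
def top_predictors(importance, k):
    """Return the k predictor names with the highest importance scores."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k]

# Made-up scores for illustration only; the slides do not give the values.
importance = {
    "DataItem3": 0.02,
    "DataItem11": 0.31,
    "MSF-DataItem9": 0.12,
    "RT-DataItem11": 0.27,
    "DataItem12": 0.08,
}

top3 = top_predictors(importance, 3)
print(top3)
```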

  22. Top model prediction result

  23. Observation and analysis • The fail error rate is still high (> 30%) • Not all the runs fail at the same time • Fail:Success = 19:81 (too few fail cases to build a good model) • Some predictors are raw, while others are derived – MSF, AC, PC, RT

  24. Improvements • Use only the last N records of each run • For a set of data, randomly drop some pass data and duplicate the fail data • Randomly pick raw predictors, then include all of their derived predictors
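The three improvements can be sketched as small helpers. Everything here is an assumption made for illustration: the record layout (a dict with a `res` field), the keep ratio, the number of fail copies, and the function names. The only detail taken from the slides is the naming pattern in which a derived predictor is the raw name with an `MSF-`, `AC-`, `PC-` or `RT-` prefix.

```python
import random

DERIVED_PREFIXES = ("MSF-", "AC-", "PC-", "RT-")

def last_n_records(run_records, n):
    """Keep only the last n records of one run."""
    return run_records[-n:]

def rebalance(records, rng, keep_pass=0.5, fail_copies=2):
    """Randomly drop some pass records and duplicate every fail record."""
    balanced = []
    for record in records:
        if record["res"] == "fail":
            balanced.extend([record] * fail_copies)
        elif rng.random() < keep_pass:
            balanced.append(record)
    return balanced

def with_derived(raw_names, all_predictors):
    """Return the chosen raw predictors plus all of their derived predictors."""
    chosen = set(raw_names)
    for raw in raw_names:
        chosen.update(prefix + raw for prefix in DERIVED_PREFIXES)
    return [name for name in all_predictors if name in chosen]

rng = random.Random(0)  # arbitrary seed

# Toy data: one record per run, in the 81:19 ratio from the experiment.
records = [{"res": "pass"}] * 81 + [{"res": "fail"}] * 19
balanced = rebalance(records, rng)

raws = ["DataItem3", "DataItem6", "DataItem9"]
all_predictors = [prefix + raw for raw in raws for prefix in ("",) + DERIVED_PREFIXES]
selected = with_derived(rng.sample(raws, 2), all_predictors)
print(len(balanced), selected)
```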

  25. Improved random prediction result

  26. Improved top prediction result

  27. Conclusion so far • Random selection does not achieve a good error rate • Some predictors have stronger prediction power • A small set of important predictors can achieve a good error rate

  28. Future work • Why do some predictors have stronger prediction power? • Is there any pattern among the important predictors? • How many important predictors should we pick? • How soon can we predict a failing run before it actually fails?

  29. Random model estimation result

  30. Top model estimation result

  31. Improved random model
