Reduce Instrumentation Predictors Using Random Forests
Presented by Bin Zhao, Department of Computer Science, University of Maryland, May 3, 2005
Motivation
• Crash reports – program information is not collected until the program crashes, which is too late
• Testing – large number of test cases. Can we focus on the failing cases?
Motivation – failure prediction
• Instrument the program to monitor its behavior
• Predict whether the program is going to fail
• Collect program data if the run is predicted likely to fail
• Stop running the test if the run is not likely to fail
The problem
• Large number of instrumentation predictors
• Which instrumentation predictors should be picked?
The questions to answer
• Can a good model be found for predicting failing runs based on all available data?
• Can an equally good model be created based on a random selection of k% of the predictors?
Experiment
• Instrumentation on a calculator program
• 295 predictors
• Instrumentation data collected every 50 milliseconds
• 100 runs – 81 passing, 19 failing
• Predictor subset sizes: 275, 250, 225, 200, 175, 150, 125, 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10
Sample data
• Passing run:
Run  Res   Rec  0x40bda0  DataItem3  MSF-0x40bda0  MSF-DataItem3
  1  pass    1      3244          0          3244              0
  1  pass    2      3206          0          3206              0
  1  pass    3      3232          0          3232              0
  1  pass    4      3203          0          3203              0
  1  pass    5      3243          0          3243              0
• Failing run:
Run  Res   Rec  0x40bda0  DataItem3  MSF-0x40bda0  MSF-DataItem3
 10  fail    1      3200          0          3200              0
 10  fail    2      3200          0          3200              0
 10  fail    3      3251          0          3251              0
 10  fail    4      3251          0          3251              0
 10  fail    5      3248          0          3248              0
Background – Random Forests
• An ensemble of many classification trees
• Each tree casts a vote for a classification
• The final classification is the one with the most votes
Background – Random Forests
• Needs a training set to grow the forest
• At each node, mtry predictors are randomly selected as candidates for splitting that node
• About one-third of the training data (the out-of-bag, or OOB, sample) is left out of each tree and used to estimate the error
Background – Random Forests
• To classify a test run as pass or fail
• Sample model estimation:
OOB error rate: 0.0044
        fail   pass   class.error
fail     933     17        0.0179
pass       5   4045        0.0012
Background – R
• Software environment for data manipulation, analysis and calculation
• Provides scripting capability
• Provides an implementation of Random Forests
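As a minimal sketch (not the author's actual script), a model like this could be fit in R with the randomForest package; the data frame runs and its Res column are assumed names:

library(randomForest)
# Assumed: a data frame `runs` with a factor column Res ("pass"/"fail")
# and one column per instrumentation predictor.
rf <- randomForest(Res ~ ., data = runs,
                   ntree = 500,  # number of trees in the forest
                   mtry  = 17)   # predictors tried at each split, ~ sqrt(295)
print(rf)  # prints the OOB error rate and a confusion matrix
           # like the sample estimation above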
Experiment steps
• Determine which slice of the data to use for modeling and testing
• Find which parameters (ntree, mtry) affect the model
• Find the optimal parameter values for all the random models
• Build the random models by randomly picking N predictors
• Verify the random models by prediction
Influential parameters in Random Forests
• Two candidate parameters – ntree and mtry
• Build models by fixing one of ntree and mtry and varying the other
• ntree: 200 – 1000
• mtry: 10 – 295
• Only mtry matters
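A sketch of such a sweep, fixing ntree and varying mtry while recording each model's OOB error (the grid values and the runs/Res names are assumptions):

library(randomForest)
mtry_grid <- c(10, 50, 100, 150, 200, 295)
oob_err <- sapply(mtry_grid, function(m) {
  rf <- randomForest(Res ~ ., data = runs, ntree = 500, mtry = m)
  tail(rf$err.rate[, "OOB"], 1)  # OOB error after the last tree
})
# The symmetric sweep fixes mtry and varies ntree over 200 - 1000.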
Optimal mtry
• Need to determine the optimal mtry for each number of predictors (N)
• The default mtry is the square root of N
• For different numbers of predictors (295 – 10): N/2 – 3N
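One way to automate this search (a sketch, not necessarily the author's method) is the tuneRF helper in the randomForest package, which starts from the default sqrt(N) and multiplies or divides mtry while the OOB error keeps improving:

library(randomForest)
x <- runs[, setdiff(names(runs), "Res")]  # predictor columns (assumed layout)
best <- tuneRF(x, runs$Res,
               ntreeTry   = 500,   # trees grown per candidate mtry
               stepFactor = 2,     # multiply/divide mtry by 2 each step
               improve    = 0.01)  # continue while OOB error improves by 1%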
Random model
• Randomly pick the predictors from the full set of predictors
• Generate 5 sets of data for each number of predictors
• Build a random forest model from each of the 5 sets and average the results
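A sketch of one such replicate loop for a subset size k (the value of k and the runs/Res names are illustrative):

library(randomForest)
k     <- 100
preds <- setdiff(names(runs), "Res")
errs  <- replicate(5, {
  cols <- sample(preds, k)  # one random set of k predictors
  rf   <- randomForest(x = runs[, cols], y = runs$Res, ntree = 500)
  tail(rf$err.rate[, "OOB"], 1)
})
mean(errs)  # average OOB error over the 5 random models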
Random prediction
• For each trained random forest, predict on a completely separate set of test data (records 401 – 450)
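Continuing the sketch above, prediction on the held-out slice could look like this (test_runs is an assumed data frame holding records 401 – 450, with the same columns as runs):

cols <- sample(preds, k)  # predictors used to train one random model
rf   <- randomForest(x = runs[, cols], y = runs$Res, ntree = 500)
pred <- predict(rf, newdata = test_runs[, cols])
table(predicted = pred, actual = test_runs$Res)  # pass/fail confusion on test data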
Analysis of the random model
• Why is the error rate not a linear function of the number of predictors?
Important predictors
• Random Forests can assign an importance to each predictor – the number of correct votes involving the predictor
• Top 20 important predictors: DataItem11, RT-DataItem11, PC-DataItem11, MSF-DataItem11, AC-DataItem11, RT-DataItem9, RT-DataItem6, PC-DataItem6, AC-DataItem6, MSF-DataItem9, MSF-DataItem6, PC-DataItem9, DataItem9, AC-DataItem9, DataItem6, DataItem12, MSF-DataItem12, AC-DataItem12, RT-DataItem12, PC-DataItem12
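A sketch of extracting such a ranking with the randomForest package; permutation importance (enabled at fit time) is the built-in measure closest to the vote-based description above:

library(randomForest)
rf  <- randomForest(Res ~ ., data = runs, ntree = 500, importance = TRUE)
imp <- importance(rf, type = 1)  # type 1 = mean decrease in accuracy
top20 <- head(rownames(imp)[order(imp, decreasing = TRUE)], 20)
top20  # predictor names, most important first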
Top model
• Build the model from the most important predictors in the full set (top 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10)
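Continuing from the ranking above, the top-k models could be rebuilt as follows (a sketch; the sizes mirror the list above):

for (k in c(100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10)) {
  topk <- head(rownames(imp)[order(imp, decreasing = TRUE)], k)
  rf_k <- randomForest(x = runs[, topk], y = runs$Res, ntree = 500)
  cat(k, "predictors -> OOB error:", tail(rf_k$err.rate[, "OOB"], 1), "\n")
}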
Observations and analysis
• The fail error rate is still high (> 30%)
• Not all the runs fail at the same time
• Fail : pass = 19 : 81 – too few failing cases to build a good model
• Some predictors are raw, while others are derived – MSF, AC, PC, RT
Improvements
• Use only the last N records of a particular run
• For a set of data, randomly drop some passing records and duplicate the failing records
• Randomly pick the raw predictors, then include all of their derived predictors
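A sketch of the rebalancing idea, dropping passing records and duplicating failing ones before refitting (the proportions here are illustrative, not from the talk):

library(randomForest)
pass_rows <- which(runs$Res == "pass")
fail_rows <- which(runs$Res == "fail")
balanced  <- runs[c(sample(pass_rows, length(pass_rows) %/% 2),  # keep half the passes
                    rep(fail_rows, 3)), ]                        # triplicate the fails
rf_bal <- randomForest(Res ~ ., data = balanced, ntree = 500)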
Conclusions so far
• Random selection does not achieve a good error rate
• Some predictors have stronger prediction power
• A small set of important predictors can achieve a good error rate
Future work
• Why do some predictors have stronger prediction power?
• Is there a pattern among the important predictors?
• How many important predictors should we pick?
• How soon can we predict a failing run before it actually fails?