RandomForest as a Variable Selection Tool for Biomarker Data. Katja Remlinger, GlaxoSmithKline, RTP, NA. ICSA Applied Statistics Symposium, June 6th, 2007.
Outline • Introduction • Example: Liver Fibrosis • RandomForest Algorithm • Challenges with Variable Selection • External Cross Validation • Summary and Discussion
Introduction • With high dimensional data we want to reduce the number of variables • Remove “noise” variables • Ease model interpretation • Reduce cost by measuring a subset of variables • Biomarker data are typically high dimensional => excellent candidates for variable reduction
What is a Biomarker? • Characteristic that is objectively measured and evaluated as an indicator of • normal biological processes • pathogenic processes • pharmacologic responses to a therapeutic intervention • Types of biomarkers • Genes • Proteins • Lipids • Metabolites….
Example – Liver Fibrosis • 8th leading cause of death in the US • Scar formation that occurs as the liver tries to repair damaged tissue • Current approach: Liver biopsy to determine fibrosis stage • Goal: Identify small panel of biomarkers that can predict fibrosis stage of patient (mild or severe) => Prediction Problem with Variable Selection
Example – Liver Fibrosis • 384 Hepatitis C infected patients of various fibrosis stages • 61% Mild • 39% Severe • Collected 46 serum biomarkers • Select 5-10 biomarkers
Prediction & Variable Selection Tools • Stepwise Regression • PLS, PLS-DA • LARS/LASSO • Elastic Net • RandomForest
A Single Tree
• Candidate node: 10 Mild, 10 Severe, Gini Index = 0.5
• Split on Biomarker 4:
  • Biomarker 4 < 14.45: daughter node with 5 Mild, 0 Severe, Gini Index = 0 (Mild)
  • Biomarker 4 >= 14.45: daughter node with 5 Mild, 10 Severe, Gini Index = 0.44
• Split of the 5 Mild / 10 Severe node on Biomarker 32:
  • Biomarker 32 = 0: daughter node with 4 Mild, 1 Severe, Gini Index = 0.32 (Mild)
  • Biomarker 32 = 1: daughter node with 1 Mild, 9 Severe, Gini Index = 0.18 (Severe)
• New sample: Biomarker 4 = 28.65, Biomarker 32 = 0
• Node purity is measured by Gini Index
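The Gini values on this slide are consistent with the two-class Gini index 1 - p_mild^2 - p_severe^2. A minimal Python sketch that recomputes them from the node counts; the function name `gini_index` and the empty-node convention are illustrative assumptions, not taken from the talk:

```python
def gini_index(n_mild, n_severe):
    """Two-class Gini index of a node: 1 - p_mild^2 - p_severe^2."""
    total = n_mild + n_severe
    if total == 0:
        return 0.0  # convention for an empty node (assumption)
    p_mild = n_mild / total
    p_severe = n_severe / total
    return 1.0 - p_mild ** 2 - p_severe ** 2


# Node counts (Mild, Severe) as read off the slide.
nodes = {
    "candidate node": (10, 10),
    "5 Mild / 0 Severe": (5, 0),
    "5 Mild / 10 Severe": (5, 10),
    "4 Mild / 1 Severe": (4, 1),
    "1 Mild / 9 Severe": (1, 9),
}
for name, (mild, severe) in nodes.items():
    print(f"{name}: Gini = {gini_index(mild, severe):.2f}")
```

A pure node (all Mild or all Severe) scores 0, and an evenly mixed two-class node scores the maximum of 0.5.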
RandomForest
• Data: S1, S2, S3, S4, S5, S6, S7, S8, S9, S10
• Draw Bootstrap Samples:
  • S1, S2, S2, S3, S5, S6, S7, S8, S9, S9
  • S2, S3, S4, S4, S4, S5, S7, S8, S8, S10
  • S1, S1, S2, S3, S3, S4, S7, S8, S9, S10
  • …
  • S1, S6, S6, S6, S8, S8, S9, S9, S9, S9
• Grow Trees, one per bootstrap sample: Tree 1, Tree 2, Tree 3, …, Tree 5000
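Drawing a bootstrap sample means sampling the original cases with replacement until the sample is as large as the data. A small sketch of that step, using only the Python standard library; the helper name `draw_bootstrap_sample` and the seed are arbitrary choices for illustration:

```python
import random


def draw_bootstrap_sample(samples, rng):
    """Draw len(samples) items from samples, with replacement."""
    return [rng.choice(samples) for _ in samples]


samples = [f"S{i}" for i in range(1, 11)]  # S1 ... S10, as on the slide
rng = random.Random(2007)                  # arbitrary seed, for repeatability

# Four bootstrap samples; a real forest would draw one per tree (e.g. 5000).
bootstrap_samples = [draw_bootstrap_sample(samples, rng) for _ in range(4)]
for bootstrap in bootstrap_samples:
    # Sort by original position so repeated cases are easy to spot.
    print(sorted(bootstrap, key=samples.index))
```

Because sampling is with replacement, some cases appear several times in a bootstrap sample and others not at all.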
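Variable Importance
• Data: S1, S2, S3, S4, S5, S6, S7, S8, S9, S10
• Draw Bootstrap Samples; the samples not drawn for a tree are held back for that tree:
  • Tree 1: S1, S2, S2, S3, S5, S6, S7, S8, S9, S9; held back: S4, S10
  • Tree 2: S2, S3, S4, S4, S4, S5, S7, S8, S8, S10; held back: S1, S6, S9
  • Tree 3: S1, S1, S2, S3, S3, S4, S7, S8, S9, S10; held back: S5, S6
  • …
  • Tree 5000: S1, S6, S6, S6, S8, S8, S9, S9, S9, S9; held back: S2, S3, S4, S5, S7, S10
• Drop the held-back samples down the trees => Prediction Accuracy
• Permute (P) a variable in the held-back samples and drop them down the trees again => Permuted Prediction Accuracy
• Variable Importance compares Prediction Accuracy with Permuted Prediction Accuracy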
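The two ingredients of this slide can be sketched in a few lines of Python: finding the samples a bootstrap sample left out, and measuring how much accuracy drops when one variable is permuted. This is a sketch under stated assumptions, not the RandomForest implementation: `toy_tree` is a hypothetical stand-in for one grown tree (it reuses the 14.45 threshold from the single-tree slide), the marker values are made up, and a real forest would average the accuracy drop over all trees.

```python
import random


def out_of_bag(samples, bootstrap_sample):
    """Samples that were not drawn into the bootstrap sample."""
    drawn = set(bootstrap_sample)
    return [s for s in samples if s not in drawn]


def accuracy(predict, rows, labels):
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(predict, rows, labels, var, rng):
    """Accuracy before permuting variable `var` minus accuracy after."""
    baseline = accuracy(predict, rows, labels)
    column = [row[var] for row in rows]
    rng.shuffle(column)  # break the link between this variable and the labels
    permuted = [row[:var] + [value] + row[var + 1:]
                for row, value in zip(rows, column)]
    return baseline - accuracy(predict, permuted, labels)


samples = [f"S{i}" for i in range(1, 11)]
tree1_bootstrap = ["S1", "S2", "S2", "S3", "S5", "S6", "S7", "S8", "S9", "S9"]
tree1_oob = out_of_bag(samples, tree1_bootstrap)  # held-back samples of Tree 1


def toy_tree(row):
    """Hypothetical tree that only looks at variable 0."""
    return "Severe" if row[0] >= 14.45 else "Mild"


# Made-up held-back rows: [variable 0, variable 1].
oob_rows = [[3.0, 0], [28.65, 1], [9.2, 1], [20.1, 0]]
oob_labels = [toy_tree(row) for row in oob_rows]

rng = random.Random(0)
imp_used = permutation_importance(toy_tree, oob_rows, oob_labels, 0, rng)
imp_unused = permutation_importance(toy_tree, oob_rows, oob_labels, 1, rng)
print(tree1_oob, imp_used, imp_unused)
```

Permuting a variable the tree never uses cannot change its predictions, so its importance is zero; permuting a variable the tree relies on can only lower the accuracy in this toy setup.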
Making Prediction with RandomForest
• New sample with markers M1, M2, …, Mp is dropped down every tree (…, Tree 3, …, Tree 5000)
• Each tree votes Mild or Severe
• Results from all trees: Mild 70%, Severe 30%
• Majority vote: Mild
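The majority vote can be illustrated as follows. The vote counts are chosen to match the 70% / 30% split on the slide for a forest of 5000 trees; the function name `majority_vote` is illustrative.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent class and its share of all votes."""
    counts = Counter(votes)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(votes)


# One vote per tree: 3500 of 5000 trees say Mild, 1500 say Severe.
tree_votes = ["Mild"] * 3500 + ["Severe"] * 1500
prediction, share = majority_vote(tree_votes)
print(prediction, share)
```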
Challenges with Variable Selection • How many variables are important? • Which variables are important? • How do we validate the model? • What is the correct way of validating the model? • Is prediction accuracy significant? • Tools: External cross validation, Permutation test
A Common Variable Selection Approach is… • Use all data (Y, X) to select variables • Obtain prediction accuracy on the reduced data (Y, X*) • Introduces selection bias • Used in many publications
A Better Variable Selection Approach is … • Separate training and test set • External cross validation (ECV) • Avoid selection bias
External Cross Validation (Svetnik et al. 2004)
1. Partition data (Y: n x 1, X: n x p) for 5-fold Cross-Validation. Each part serves once as the Test Set while the remaining parts form the Training Set.
2. Build RandomForest for each training set.
3. Use importance measure to rank variables.
4. Record test set predictions.
[Figure: RF built on the Training Set (Y, X) yields a Variable Importance Ranking (1. Marker_6, 2. Marker_509, 3. Marker_906, …, 98. Marker_57, 99. Marker_2, 1000. Marker_49) and a Test Set Prediction]
5. Remove fraction of least important variables and rebuild RandomForest.
6. Record test set predictions.
7. Do not re-rank variables.
Repeat 5-7 with the remaining variables until a small number of variables is left.
[Figure: the least important variables are removed from the Variable Importance Ranking, RF is rebuilt on the Training Set, and a new Test Set Prediction is recorded]
8. Compute optimization criterion at each step of variable removal.
9. Replicate to “smooth” out variability.
10. Select p’ = number of variables in the model, based on optimization criterion.
[Figure: Optimization Criterion vs. No. of Variables, with mtry = sqrt(p). Annotations: 3 variables are very important; no additional gains by including more variables]
11. Pick the p’ most important variables.
12. Repeat 1-11 with permuted Y (permutation test).
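Steps 1-7 can be sketched in plain Python as below. This is a sketch under loud assumptions, not the procedure of Svetnik et al.: to stay self-contained it replaces RandomForest with a nearest-centroid classifier (`centroid_predict`) and replaces the RandomForest importance measure with a class-mean gap (`mean_gap_importance`); the data are synthetic; the defaults `drop_fraction=0.5` and `min_vars=1` are arbitrary. Replication (step 9) and the choice of p’ (step 10) are left out. What the sketch does show is the structure that avoids selection bias: variables are ranked inside each training set only, ranked once, and every reduced model is scored on the untouched test set.

```python
import random
from statistics import mean


def k_fold_indices(n, k, rng):
    """Step 1: shuffle the sample indices and deal them into k test folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def mean_gap_importance(X, y, j):
    """Stand-in importance (NOT RandomForest importance): class-mean gap."""
    group0 = [row[j] for row, label in zip(X, y) if label == 0]
    group1 = [row[j] for row, label in zip(X, y) if label == 1]
    return abs(mean(group0) - mean(group1))


def centroid_predict(X_train, y_train, X_test, cols):
    """Stand-in classifier (NOT a RandomForest): nearest class centroid."""
    centroids = {}
    for label in sorted(set(y_train)):
        rows = [row for row, t in zip(X_train, y_train) if t == label]
        centroids[label] = [mean(row[j] for row in rows) for j in cols]
    predictions = []
    for row in X_test:
        distance = {
            label: sum((row[j] - c) ** 2 for j, c in zip(cols, centre))
            for label, centre in centroids.items()
        }
        predictions.append(min(distance, key=distance.get))
    return predictions


def external_cv(X, y, k=5, drop_fraction=0.5, min_vars=1, seed=0):
    """Return {number of variables kept: cross-validated error rate}."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    wrong = {}
    for test_idx in k_fold_indices(n, k, rng):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_te, y_te = [X[i] for i in test_idx], [y[i] for i in test_idx]
        # Steps 2-3: rank variables once, using the training set only.
        kept = sorted(range(p), reverse=True,
                      key=lambda j: mean_gap_importance(X_tr, y_tr, j))
        while True:
            # Steps 4 and 6: score the current variable set on the test set.
            preds = centroid_predict(X_tr, y_tr, X_te, kept)
            errors = sum(a != b for a, b in zip(preds, y_te))
            wrong[len(kept)] = wrong.get(len(kept), 0) + errors
            if len(kept) <= min_vars:
                break
            # Steps 5 and 7: drop the least important variables, no re-ranking.
            kept = kept[:max(min_vars, int(len(kept) * (1 - drop_fraction)))]
    return {size: count / n for size, count in wrong.items()}


# Synthetic data (made up): column 0 separates the classes, the rest is noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[10 * label + data_rng.uniform(-1, 1)]
     + [data_rng.random() for _ in range(3)] for label in y]
error_by_size = external_cv(X, y)
print(error_by_size)
```

The toy data are deliberately trivial to separate, so the error rates themselves are uninteresting; the point is that no test-set sample ever influences which variables are kept for its own fold.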
We Discussed How to… • Use RandomForest to do variable selection • Use external cross validation to select variables in proper way and to validate model • Return to example – Identify small set of biomarkers that can predict mild or severe fibrosis stage
Liver Fibrosis – Comparison to Commercial Tests
ROC Curves
• TPR = Sensitivity = P(Pred sev. | Actual sev.)
• FPR = 1 - Specificity = P(Pred sev. | Actual mild)
• AUC: RF w. 11 markers 0.73; RF w. 3 markers 0.74; FibroTest 0.70; ActiTest 0.65
[Figure: ROC curves, True Positive Rate vs. False Positive Rate]
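As an illustration of the quantities on this slide, the sketch below computes TPR and FPR at one cut-off, and the AUC as the probability that a randomly chosen severe case scores higher than a randomly chosen mild case (ties counted as one half). The scores and labels are made up and the function names are illustrative; the AUC values on the slide come from the study data, not from this code.

```python
def tpr_fpr(scores, labels, cutoff):
    """Predict 'severe' (1) when score >= cutoff; return (TPR, FPR)."""
    predicted = [score >= cutoff for score in scores]
    n_severe = sum(labels)
    n_mild = len(labels) - n_severe
    true_pos = sum(p and label == 1 for p, label in zip(predicted, labels))
    false_pos = sum(p and label == 0 for p, label in zip(predicted, labels))
    return true_pos / n_severe, false_pos / n_mild


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    severe = [s for s, label in zip(scores, labels) if label == 1]
    mild = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((s > m) + 0.5 * (s == m) for s in severe for m in mild)
    return wins / (len(severe) * len(mild))


# Made-up example: 1 = severe, 0 = mild.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
tpr, fpr = tpr_fpr(scores, labels, cutoff=0.5)
area = auc(scores, labels)
print(tpr, fpr, area)
```

Sweeping the cut-off from high to low traces out the ROC curve, one (FPR, TPR) point per cut-off.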
Summary and Discussion • Approach has found sets of biomarkers that GSK can use to • Predict fibrosis stage • Monitor progression of patients in non-invasive manner • Save money • Avoid selection bias
Acknowledgements • Kwan Lee, GSK • Mandy Bergquist, GSK • Lei Zhu, GSK • Jack Liu, GSK • Terry Walker, GSK • Peter Leitner, GSK • Andy Liaw, Merck • Christopher Tong, Merck • Vladimir Svetnik, Merck • Duke University
References • DeLong, E., DeLong, D., and Clarke-Pearson, D. (1988), “Comparing the Areas Under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach,” Biometrics, 44, 837-845. • Svetnik, V., Liaw, A., Tong, C., and Wang, T. (2004), “Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules,” Multiple Classifier Systems, Fifth International Workshop, MCS 2004, Proceedings, 9-11 June 2004, Cagliari, Italy. F. Roli, J. Kittler, and T. Windeatt (eds.). Lecture Notes in Computer Science, vol. 3077. Berlin: Springer, pp. 334-343.
ROC Curves: Separation of Metavir F0/F1 from F2-F4 • FibroTest biomarkers: • alpha-2 macroglobulin • haptoglobin • ApoA1 • total bilirubin • GGT • Random Forest with all biomarkers: • hyaluronic acid • alpha-2 macroglobulin • VCAM-1 • GGT • RBP • ALT • AUC values shown in the figure: 0.70, 0.75, 0.70, 0.65
FibroTest and Random Forest Comparison • The “cut-off” is the algorithm score threshold used to predict that a subject is Metavir stage F2-F4
Random Input Variables
• Candidate node: 10 Mild, 10 Severe, Gini Index = 0.5
• Candidate splits (class counts in each daughter node):
  • X1: X1=0 6/6, X1=1 4/4, ΔGini = 0.00
  • X2: X2=0 7/6, X2=1 3/4, ΔGini = 0.05
  • X3: X3=0 9/1, X3=1 1/9, ΔGini = 0.32
  • X4: X4=0 6/5, X4=1 4/5, ΔGini = 0.01
  • X5: X5=0 4/6, X5=1 6/4, ΔGini = 0.02
  • X6: X6=0 9/4, X6=1 1/6, ΔGini = 0.17
• Usual tree algorithm chooses the best among all variables: X3
• RandomForest chooses the best among a random subset of variables: X6
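The difference between the two split rules can be sketched with the ΔGini values taken as given from the slide. The subset `["X1", "X4", "X6"]` is a hypothetical draw chosen for illustration; the slide does not say which variables were in the random subset, only that X6 was the best among them.

```python
import random

# Gini decrease of each candidate split, as listed on the slide.
delta_gini = {"X1": 0.00, "X2": 0.05, "X3": 0.32,
              "X4": 0.01, "X5": 0.02, "X6": 0.17}


def best_split(candidates):
    """Variable with the largest Gini decrease among the candidates."""
    return max(candidates, key=delta_gini.get)


def random_subset_split(mtry, rng):
    """RandomForest-style choice: best variable within a random subset."""
    subset = rng.sample(sorted(delta_gini), mtry)
    return best_split(subset), subset


# Usual tree: search all six variables.
usual_choice = best_split(delta_gini)

# RandomForest: search only a random subset (hypothetical draw shown here).
forest_choice = best_split(["X1", "X4", "X6"])

# The same idea with an actual random draw of mtry = 3 variables.
winner, subset = random_subset_split(3, random.Random(0))
print(usual_choice, forest_choice, winner, subset)
```

Restricting each split to a random subset means strong variables such as X3 are sometimes unavailable, which gives other variables a chance to be used.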
Example – Breast Cancer Study • Study Details • Breast cancer patients; stages II-IV • Control subjects; matched by • Age • Race • Smoking status • 42 serum biomarkers • Goal: Identify panel of biomarkers to • Monitor patient response in a non-invasive and longitudinal manner • Provide more information on underlying biology and mechanisms of drug action
Examples of Biomarkers • Laboratory Tests • Routine, non-routine and novel tests, novel applications, genes, proteins, metabolites, lipids, … • Electrophysiological Measures • ECG, EEG, … • Imaging • fMRI, PET, X-ray, BMD, Ultrasound, CT, … • Histological Analyses • Immunohistochemistry, electron microscopy, … • Physiological Measures • Heart rate, blood pressure, pupil size, … • Behavioral Tests • Cognitive function, motor performance, …
Biomarkers in the Research and Drug Discovery Process
• Phases: Targets, Drugs, Products
• Stages: Disease selection; Target family selection; Gene to function to target; Target to Lead; Lead to candidate Selection; Candidate selection to FTIH; FTIH to Proof of Concept; Proof of Concept to Phase III; Phase III File & Launch
• Milestones: Commit to product type; Target selection; Lead (CEDD entry); Candidate selected; Commit to FTIH; Proof of concept; Commit to phase III; Commit to file and launch
• Figure label: Biomarker
Prognostic & Predictive Biomarkers • Prognostic Biomarkers • Inform you about clinical outcome independent of therapeutic intervention • Stable during treatment course • Patient enrichment strategies • Predictive Biomarkers • Indicate that effect of new drug relative to control is related to biomarker • Change over course of treatment • High importance for successful drug discovery
RandomForest on Weight Loss Data: Protein Marker Model Based on Baseline Markers and Baseline Weight • The following markers were selected as having the highest median and mean importance ranking in the protein marker model: • Weight Week 0, IGFBP-3, CRP, TNF-α, CD40L, and MMP-9 • Lipid model not as good as Protein model, but still better than Weight model.
RandomForest on Weight Loss Data: Models Based on Early Changes in Markers, Baseline Weight, and Early Change in Weight • Error Rate = # subjects misclassified as fast weight losers / # subjects classified as fast weight losers • Note: All results in the table are median numbers based on 50 replicates
RandomForest on Weight Loss Data: Lipid Marker Model Based on Early Change in Markers, Baseline Weight, and Early Change in Weight • The following markers were selected as having the highest median and mean importance ranking in the lipid marker model: • Weight Week 0 – Week 3, Weight Week 0, and 2 lipid markers • Very similar results are obtained if Weight Week 0 is excluded. The following markers now have highest importance: • Weight Week 0 – Week 3, and three lipid markers
Obesity - Models Based on Baseline Markers and Baseline Weight Error Rate = # subjects misclassified as fast weight losers / # subjects classified as fast weight losers Permutation p-value for Protein Markers: 0.01 Note: All results in the table are median numbers based on 50 replicates
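A permutation p-value of this kind can be sketched as follows: recompute a performance statistic many times with the outcome Y shuffled, and report how often the shuffled result is at least as good as the observed one. Everything specific here is an assumption made for illustration: the statistic `class_mean_gap` is a hypothetical stand-in for the full modelling procedure, the data are made up, the number of permutations (199) is arbitrary, and the "+1" correction is one common convention; the slides do not say how the reported p-value was computed.

```python
import random
from statistics import mean


def permutation_p_value(statistic, X, y, n_permutations, rng):
    """Share of permuted-Y runs scoring at least as well as the real Y."""
    observed = statistic(X, y)
    at_least_as_good = 0
    for _ in range(n_permutations):
        y_permuted = list(y)
        rng.shuffle(y_permuted)  # break any real link between X and Y
        if statistic(X, y_permuted) >= observed:
            at_least_as_good += 1
    # The observed run counts as one permutation (a common convention).
    return (at_least_as_good + 1) / (n_permutations + 1)


def class_mean_gap(X, y):
    """Hypothetical statistic: gap between the class means of one marker."""
    group0 = [x for x, label in zip(X, y) if label == 0]
    group1 = [x for x, label in zip(X, y) if label == 1]
    return abs(mean(group0) - mean(group1))


# Made-up marker values: two clearly separated groups of 10 subjects each.
X = [0.1 * i for i in range(10)] + [5 + 0.1 * i for i in range(10)]
y = [0] * 10 + [1] * 10
p_value = permutation_p_value(class_mean_gap, X, y, 199, random.Random(0))
print(p_value)
```

In the procedure described in the talk, the statistic would be the prediction accuracy obtained by rerunning the whole variable selection and validation procedure on each permuted Y.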
Example – Obesity
• 50 obese patients (Average Weight = 266 lbs, Average BMI = 43)
• Several hundred protein and lipid biomarkers at different time points
• Weight at different time points
[Figure: study timeline with time points Week 1, Week 3, Week 6, Week 26, Week 52. Labels: Start 1200 calorie liquid diet; Start 900 calorie liquid diet; Subjects Reside in Clinic; Subjects return home and regain diet control; Sample Collection for Biomarkers]
Example – Obesity
• 50 obese patients
• Several hundred protein and lipid biomarkers at different time points
• Weight at different time points
[Figure: % Weight Change from Baseline at Week 3, Week 6, Week 26]
Example – Obesity • 50 obese patients • 266 lbs at baseline • BMI of 43 at baseline • Several hundred protein and lipid biomarkers at baseline • Weight at different time points