Challenges In Progressing Biomarkers To Clinical Use: Proteomic Experiences
Chris Harbron, Technical Lead For High Dimensional Data, AstraZeneca
FDA Industry Statistics Workshop, September 2006
Gap Between Published Biomarkers And Biomarkers Being Approved For Use
Why Might This Be? Challenges
• Pressures from the contextual environment
• High quality data is essential
• These are new technologies - not simple to use or analyse
• Robust study design is needed, including:
  • Consistent sample collection and processing
  • Need to understand reproducibility between & within labs & within subjects
  • Failure leads to poor data quality, frequently dominated by nuisance factors
• Rigorous validation is also essential
  • Occurs at many levels
  • Avoid overfitting data
• Omics may not do it alone
  • Applications will require combining -omics with other data types
Example: Case-Control Study
• Interest in identifying a peptidomic profile that could predict an adverse event
• Potential use as a personalised medicine predictive marker
• Blood samples taken from subjects at the start of treatment
• Subjects monitored for the adverse event using a rigorous definition
• Subjects entered in cohorts
• Samples processed in batches within cohorts
• Analysed on an LC-MS/MS platform
LC-MS/MS Proteomics
[Workflow diagram] Clinical plasma samples → preparation & digestion → peptides → liquid chromatography (separation by retention time) → mass spectrometry (separation by mass/charge ratio, measurement of ion intensity) → MS/MS fragment spectra → protein identification. [Example mass spectrum of relative abundance against m/z omitted.]
Pre-Processing
[Figure: distribution of average intensities across retention time and mass-charge ratio, from high to low intensity]
• ~5,500,000 RT / m/z / intensity measurements per sample
• Pre-processing steps: alignment of retention times, scaling, binning
• Result: ~25,000 common peaks per sample
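As an illustration of the binning step only, the minimal sketch below (Python with numpy, not the pipeline actually used) aggregates raw retention time / m/z / intensity triplets onto a fixed grid so the same nominal peak can be matched across samples; the bin widths and the summing rule are assumptions.

```python
import numpy as np

def bin_lcms_features(rt, mz, intensity, rt_width=0.5, mz_width=0.1):
    """Aggregate raw (retention time, m/z, intensity) measurements into
    coarse RT x m/z bins so peaks can be matched across samples.
    Bin widths here are illustrative assumptions only."""
    rt_bin = np.floor(np.asarray(rt) / rt_width).astype(int)
    mz_bin = np.floor(np.asarray(mz) / mz_width).astype(int)
    peaks = {}
    for r, m, i in zip(rt_bin, mz_bin, intensity):
        peaks[(r, m)] = peaks.get((r, m), 0.0) + float(i)  # sum intensity within a bin
    return peaks  # dict: (RT bin, m/z bin) -> binned intensity
```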
Proteomic Data: Exploratory Analysis - PCA
• Considerable batch to batch variation
[PCA score plot; points labelled Control, Case and Non-Index Case, coloured by Cohort 1-4]
Proteomic Data: Exploratory Analysis - PCA
• Within all batches with both cases and controls, there is separation of cases and controls
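For readers who want to reproduce this kind of exploratory view, a minimal sketch follows, assuming a samples-by-peaks intensity matrix X and a batch label per sample, and using scikit-learn's PCA; the log transform is a common but assumed choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def pca_batch_plot(X, batch):
    """Plot the first two principal components of the peak intensity matrix,
    coloured by batch, to show batch-to-batch variation."""
    scores = PCA(n_components=2).fit_transform(np.log1p(X))  # log transform is an assumption
    for b in np.unique(batch):
        idx = batch == b
        plt.scatter(scores[idx, 0], scores[idx, 1], label=f"Batch {b}")
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.legend()
    plt.show()
    return scores
```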
Univariate Analyses Within Batches: Histograms Of t-Test p-Values
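A minimal sketch of this within-batch univariate analysis, assuming numpy arrays X (samples x peaks), a boolean case indicator and a batch label per sample; it runs a two-sample t-test per peak within each batch and plots the p-value histograms.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def within_batch_pvalues(X, is_case, batch):
    """Two-sample t-test for every peak, run separately within each batch,
    so case-control effects are not confounded with batch differences."""
    pvals = {}
    for b in np.unique(batch):
        in_b = batch == b
        cases, controls = X[in_b & is_case], X[in_b & ~is_case]
        if len(cases) > 1 and len(controls) > 1:
            _, p = stats.ttest_ind(cases, controls, axis=0)
            pvals[b] = p  # one p-value per peak for this batch
            plt.hist(p, bins=20, alpha=0.5, label=f"Batch {b}")
    plt.xlabel("t-test p-value")
    plt.legend()
    plt.show()
    return pvals
```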
Global Test Of Agreement Between Batches Using A Permutation Test
• Identify peaks where the direction of effect agrees in all 3 batches
• Summarise each peak by its maximum p-value across batches
• Global test of the level expected from multiple testing alone, assessed by permutation
[Figure: observed versus permuted distributions]
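The following is a minimal sketch of one way to implement such a global test, assuming within-batch t-tests: a peak counts as consistent if its effect direction agrees across all batches and its maximum within-batch p-value is below a threshold, and the same count is recomputed on data with case-control labels permuted within batches to give the multiple-testing reference. The threshold and permutation scheme are illustrative assumptions, not the exact procedure used.

```python
import numpy as np
from scipy import stats

def batch_tests(X, is_case, batch):
    """t statistics and p-values for every peak within every batch."""
    t_list, p_list = [], []
    for b in np.unique(batch):
        in_b = batch == b
        t, p = stats.ttest_ind(X[in_b & is_case], X[in_b & ~is_case], axis=0)
        t_list.append(t)
        p_list.append(p)
    return np.array(t_list), np.array(p_list)  # shape (n_batches, n_peaks)

def consistent_peak_count(X, is_case, batch, alpha=0.01):
    """Peaks whose effect direction agrees in every batch and whose
    maximum (worst) within-batch p-value is below alpha."""
    t, p = batch_tests(X, is_case, batch)
    agree = np.all(np.sign(t) == np.sign(t[0]), axis=0)
    return int(np.sum(agree & (p.max(axis=0) < alpha)))

def permutation_reference(X, is_case, batch, n_perm=1000, alpha=0.01, seed=0):
    """Recompute the count with case-control labels shuffled within each
    batch: the level expected from multiple testing alone."""
    rng = np.random.default_rng(seed)
    null_counts = []
    for _ in range(n_perm):
        perm = is_case.copy()
        for b in np.unique(batch):
            idx = np.where(batch == b)[0]
            perm[idx] = perm[rng.permutation(idx)]
        null_counts.append(consistent_peak_count(X, perm, batch, alpha))
    return np.array(null_counts)
```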
Typical Highly Significant Peak
• Within each batch, cases show higher expression than controls
• Not possible to define a global cut-off between cases and controls
[Box plots of intensity by batch for Case, Control and Non-Index Case groups]
Multivariate Analyses
• Identified a consistent effect
• BUT it may be difficult to use as a predictive biomarker in a clinical setting due to batch variation
• Would a combination of markers, a peptidomic profile, work as a predictive biomarker?
• Use Random Forests to generate multivariate predictive models
• Assess predictive power using a nested cross-validation
• Within and between batch prediction
Modelling Process
• Split the data into control-only batches and mixed case-control batches
• Form training and test sets by excluding batches in turn (batch excluded) and by leave-one-out on observations (observation excluded)
• Within each training set:
  • Analyse each peak within each batch
  • Take the maximum p-value for each peak across batches
  • Rank peaks by this p-value
  • Build a model with the top n peaks
• Test the model in the test set
A rough sketch of one leave-batch-out fold is given below.
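This sketch is a hedged illustration of one fold, not the original implementation: it assumes scikit-learn's RandomForestClassifier and the same within-batch t-tests as above, ranks peaks on the training data only by their worst within-batch p-value, fits a forest to the top n peaks, and predicts the held-out batch.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier

def rank_peaks_by_max_p(X, is_case, batch):
    """Rank peaks by their maximum case-control p-value across batches,
    so only consistently significant peaks rise to the top."""
    max_p = np.zeros(X.shape[1])
    for b in np.unique(batch):
        in_b = batch == b
        if is_case[in_b].any() and (~is_case[in_b]).any():
            _, p = stats.ttest_ind(X[in_b & is_case], X[in_b & ~is_case], axis=0)
            max_p = np.maximum(max_p, p)
    return np.argsort(max_p)  # peak indices, most consistent first

def leave_batch_out_fold(X, is_case, batch, held_out, n_peaks=50):
    """One fold: exclude a batch, rank peaks on the remaining data only,
    fit a random forest on the top n peaks, predict the held-out batch."""
    train, test = batch != held_out, batch == held_out
    top = rank_peaks_by_max_p(X[train], is_case[train], batch[train])[:n_peaks]
    model = RandomForestClassifier(n_estimators=500, random_state=0)
    model.fit(X[train][:, top], is_case[train])
    return model.predict_proba(X[test][:, top])[:, 1]  # predicted P(case)
```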
Leave One Out Cross Validation: Proteomic Model Predictions
[Plot of model predictions, grouped by: leave-one-out training set batches - cases; leave-one-out training set batches - controls; other mixed batch - cases; other mixed batch - controls; other batches - controls]
Mask Data By Restricting To High Quality Regions Of Proteomic Space
[Plot of mass-charge ratio against retention time showing the retained region]
• Technically: region of focus for the instrument
• Empirically: lowest residual variability, highest average intensity
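A minimal sketch of the empirical side of this masking, assuming per-peak summaries of mean intensity and residual variability are available; the cut-off values are placeholders, and the instrument's technical region of focus is not encoded here.

```python
import numpy as np

def high_quality_mask(mean_intensity, residual_sd, intensity_cut, sd_cut):
    """Boolean mask keeping peaks with high average intensity and low
    residual variability; cut-off values are illustrative placeholders."""
    return (np.asarray(mean_intensity) >= intensity_cut) & \
           (np.asarray(residual_sd) <= sd_cut)

# Example usage with hypothetical thresholds:
# keep = high_quality_mask(mean_intensity, residual_sd, intensity_cut=1e4, sd_cut=0.5)
# X_masked = X[:, keep]
```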
Analysis Of Unmasked Peaks
• Batch effects still dominate
• Consistent case-control effect
• Can identify peaks separating cases & controls across batches
Cross-Validation Predictions: Unmasked Peaks
[Plot of model predictions, grouped by: leave-one-out same batch - cases; leave-one-out same batch - controls; other mixed batch - cases; other mixed batch - controls; other batches - controls]
• Good predictions within the same batch
• Prediction rate falls when extrapolated to other batches
• Need to prospectively test in another set of patients
How To Combine Other Non-omic Information Into A Biomarker?
• Combining different data types is challenging
• The "bigger" data type will dominate the modelling
  • Greater signal in the data, but it doesn't extrapolate as well
• Exploring options for turning the random part of random forests to our advantage (one illustrative possibility is sketched below)
[Diagram: known clinical prognostic factors alongside proteomic peaks]
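One possible way to "use the random part of random forests", sketched purely as an illustration and not necessarily the option explored in this work: every tree sees all of the known clinical prognostic covariates but only a random subset of proteomic peaks, so the much wider proteomic block cannot swamp the clinical information. The sketch assumes a binary 0/1 outcome and scikit-learn decision trees; the ensemble sizes are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def mixed_feature_forest(X_clin, X_prot, y, n_trees=500, peaks_per_tree=20, seed=0):
    """Ensemble in which each bootstrap tree uses all clinical covariates
    plus a random subset of proteomic peaks (sizes are illustrative)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        peaks = rng.choice(X_prot.shape[1], size=peaks_per_tree, replace=False)
        boot = rng.integers(0, len(y), size=len(y))  # bootstrap sample of observations
        X_tree = np.hstack([X_clin[boot], X_prot[boot][:, peaks]])
        tree = DecisionTreeClassifier(random_state=0).fit(X_tree, y[boot])
        trees.append((tree, peaks))
    return trees

def mixed_forest_predict(trees, X_clin, X_prot):
    """Average the 0/1 votes of the trees to give a predicted probability."""
    votes = [tree.predict(np.hstack([X_clin, X_prot[:, peaks]]))
             for tree, peaks in trees]
    return np.mean(votes, axis=0)
```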
Proteomic Quality Control Consortium?
• MAQC recently reported a reproducibility study for microarrays
  • Wealth of valuable information
  • Mammoth effort
• Could we do the same for proteomics?
  • Less mature technology
  • Greater diversity of platforms
  • Diversity of pre-processing methodologies
  • Issues of identification making large scale comparisons challenging
Conclusions
• Complicated new technologies
• Many challenges: technical, data quality, data analysis, practical
• Essential role for statistics
  • Need to integrate statistical approaches with understanding of the technologies and biology
• Great potential
  • Better treatments for patients
  • Improved use of compounds
  • Greater biological understanding