Statistical, Computational, and Informatics Tools for Biomarker Analysis Methodology Development at the Data Management and Coordinating Center of the Early Detection Research Network
Early Detection Research Network: EDRN ORGANIZATIONAL STRUCTURE
[Organizational chart; recoverable labels: 18 Laboratories, 2 Laboratories, NIST, 8 Centers, CDCP; Chair: Bernard Levin; Chair: David Sidransky]
An “infrastructure” for supporting collaborative research on molecular, genetic, and other biomarkers in human cancer detection and risk assessment.
Early Detection Research Network INFRASTRUCTURE: BIOREPOSITORY
• Specimens with matching controls and epidemiological data
• Infrastructure to provide preneoplastic tissues:
  - Prostate
  - Lung
  - Ovarian
  - Colon
  - Breast
Early Detection Research Network INFRASTRUCTURE: LABORATORY CAPACITY
• Capability in high-throughput molecular and biochemical assays
• Ability to respond to evolving technologies for EDRN needs
• Extensive experience and scale-up ability in proteomics and molecular assays
• Outstanding infrastructure for handling multiple assays and validation requests
Early Detection Research Network INFRASTRUCTURE DATA STORAGE AND MINING • Outstanding track record in biomarker research • Statistical and data mining technology • Statistical and predictive models for multiple biomarkers • Novel statistical methods to interpret high-throughput data
Early Detection Research Network INFRASTRUCTURE: DATA EXCHANGE AND SHARING
• Improving informatics and information flow
• Network web sites
  - public web site
  - secure web site
• Early Detection Research Network Exchange (ERNE)
• Standardization of data reporting: CDEs developed
Early Detection Research Network (EDRN) INFORMATICS AND INFORMATION FLOW
EARLY DETECTION RESEARCH NETWORK COLLABORATION: How To Become an Associate Member
• Contact one of the EDRN Principal Investigators to serve as a sponsor for an application.
• Three types of collaborative opportunities are available:
  - Type A: Novel research ideas complementing ongoing EDRN efforts; one year of funding at $100,000
  - Type B: Sharing of tools, technology, and resources; no time limit
  - Type C: Participation in the EDRN Meetings and Workshop
• For details on how to apply, see http://www.cancer.gov/edrn
DMCC Statisticians • Margaret Pepe, Lead of Methodology Group • Ziding Feng, Principal Investigator • Yinsheng Qu • Mary Lou Thompson • Mark Thornquist • Yutaka Yasui
Biomarker Lab Collaborators at Eastern Virginia Medical School • Bao-Ling Adam • John Semmes • George Wright
Focus of Presentation
• Design: Phase Structure for Biomarker Research
• Analysis: Statistical Methods for Biomarker Discovery from High-Dimensional Data Sets
Design: Phase Structure for Biomarker Research
• The three-phase structure for therapeutic trials is well established
• This structure promotes coherent, thorough, efficient development
• A similar structure needs to be developed for biomarker research
Biomarker Development • Categorize process into 5 phases • Define objectives for each phase • Define ideal study designs, evaluation and criteria for proceeding further • Standardize the process to promote efficiency and rigor
The Details of Study Design • Specific Aims • Subject/Specimen Selection • Outcome measures • Evaluation of Results • Sample Size Calculations • Limitations / Pitfalls
Specific Aims
• Phase 1
  - Identify leads for potentially useful biomarkers
  - Prioritize these leads
• Phase 2
  - Determine the sensitivity and specificity, or the ROC curve, of the clinical biomarker assay in discriminating clinical cancer from controls
Specimen Selection -- Cases
• Phase 1
  - Cancers that are ultimately serious if not treated early, but treatable in early stage
  - Spectrum of sub-types
  - Collected at diagnosis
• Phase 2: same criteria as for phase 1
  - Wide spectrum of cases
  - Clinical specimen at diagnosis
  - From target screening population
Specimen Selection -- Controls
• Phase 1
  - Non-cancer tissue, same organ, same patient
  - Normal tissue, non-cancer patient
  - Benign growth tissue, non-cancer patient
• Phase 2
  - From potential target population for screening
Outcome Measures
• Phase 1
  - True positive and false positive rates (binary result)
  - True positive rate at the threshold yielding an acceptable false positive rate
  - ROC curve
• Phase 2
  - Results of clinical biomarker assay
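The outcome measure "true positive rate at the threshold yielding an acceptable false positive rate" can be illustrated with a minimal sketch. The function name `tpr_at_fpr`, the convention that larger marker values indicate cancer, and the example numbers are assumptions made for illustration only; they are not taken from the EDRN methodology.

```python
import numpy as np

def tpr_at_fpr(cases, controls, max_fpr):
    """Return (threshold, TPR, FPR) for the lowest threshold whose empirical
    FPR does not exceed max_fpr. A subject is called positive when its marker
    value is strictly greater than the threshold (assumed convention)."""
    if max_fpr < 0:
        raise ValueError("max_fpr must be non-negative")
    cases = np.asarray(cases, dtype=float)
    controls = np.asarray(controls, dtype=float)
    # Candidate thresholds are the observed marker values, in ascending order.
    thresholds = np.unique(np.concatenate([cases, controls]))
    for c in thresholds:
        fpr = np.mean(controls > c)
        if fpr <= max_fpr:
            # FPR only decreases as c grows, so the first threshold that
            # satisfies the constraint also gives the highest attainable TPR.
            return float(c), float(np.mean(cases > c)), float(fpr)

# Hypothetical marker values for 5 cases and 10 controls.
threshold, tpr, fpr = tpr_at_fpr([5.5, 8.5, 9.5, 11, 12], range(1, 11), max_fpr=0.2)
print(threshold, tpr, fpr)
```

Repeating the calculation over a grid of acceptable false positive rates traces out the empirical ROC curve.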
Evaluation of Results
• Phase 1
  - Algorithms select and prioritize markers that best distinguish tumor from non-tumor tissue
  - Initial exploratory studies need confirmation with new validation specimens
• Phase 2
  - ROC curves
  - ROC regression to determine whether characteristics of cases and/or characteristics of controls affect the biomarker’s discriminatory capacity
Sample Size
• Phase 1
  - Should be large enough that very promising biomarkers are likely to be selected for phase 2 development
• Phase 2
  - Based on confidence intervals for the TPR or FPR, or confidence intervals for the ROC curve at selected critical points
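As a rough illustration of a phase 2 calculation based on a confidence interval for the TPR or FPR, the sketch below uses the simple normal-approximation (Wald) interval for a binomial proportion. This is an assumed, textbook approach chosen for brevity; it is not necessarily the calculation behind the EDRN look-up tables, and the function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist

def n_for_rate_ci(expected_rate, half_width, conf_level=0.95):
    """Number of subjects needed so that a Wald confidence interval for a
    proportion (TPR among cases, or FPR among controls) has the requested
    half-width, assuming the true rate is close to expected_rate."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return math.ceil(z ** 2 * expected_rate * (1 - expected_rate) / half_width ** 2)

# Cases needed to estimate an anticipated TPR of 0.80 to within +/- 0.10.
n_cases = n_for_rate_ci(0.80, 0.10)
# Controls needed to estimate an anticipated FPR of 0.10 to within +/- 0.05.
n_controls = n_for_rate_ci(0.10, 0.05)
print(n_cases, n_controls)
```

Because the two rates are estimated from different groups and usually call for different precision, the required numbers of cases and controls generally differ.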
Findings: Sample Size Estimation
• For phase 1 microarray experiments, use of ROC curves is more efficient than comparing means
• For phase 2 studies, using equal numbers of cases and controls is often not optimally efficient
• Sample size calculations and look-up tables are now available on the EDRN website
1. Pepe et al. Phases of biomarker development for early detection of cancer. Journal of the National Cancer Institute 93(14):1054–61, 2001.
2. Pepe et al. “Elements of Study Design for Biomarker Development.” In Tumor Markers, Diamandis, Fritsche, Lilja, Chan, and Schwartz, eds. AACC Press, Washington, DC. 2002.
3. Pepe. “Statistical Evaluation of Diagnostic Tests & Biomarkers.” Oxford U. Press, 2003.
Selecting Differentially Expressed Genes from Microarray Experiments
Lead: Margaret Pepe
• Context
  - gene expression arrays for $n_D$ tumor tissues and $n_C$ normal tissues
  - $Y_{ig}$ = logarithm of the relative intensity at gene $g$ for tissue $i$
  - for which genes is $Y_{ig}$ different in some or most cases from the normals?
  - how many tissues, $n_D$ and $n_C$, should be evaluated in these experiments?
  - illustrated with ovarian cancer data
Statistical Measures for Gene Selection
• typically use a two-sample t-test for each gene
• we argue that sensitivity and specificity are more directly relevant for cancer biomarker research
• focus attention on high specificity (or high sensitivity)
• use the partial area under the ROC curve to rank genes, instead of the t-test
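Ranking genes by the partial area under the ROC curve can be sketched as follows. This is a minimal empirical version, assuming over-expression in tumors is the signal of interest and restricting attention to false positive rates below a cut-off `max_fpr` (the high-specificity region). The function names and the step-function form of the empirical ROC curve are illustrative assumptions, not the published implementation.

```python
import numpy as np

def partial_auc(cases, controls, max_fpr):
    """Area under the empirical ROC step function for FPR in [0, max_fpr]."""
    cases = np.asarray(cases, dtype=float)
    controls = np.sort(np.asarray(controls, dtype=float))[::-1]  # descending
    n_c = len(controls)
    # On the FPR interval [(k-1)/n_c, k/n_c), the TPR is the fraction of
    # cases exceeding the k-th largest control value.
    tpr = (cases[None, :] > controls[:, None]).mean(axis=1)
    left = np.arange(n_c) / n_c
    right = np.arange(1, n_c + 1) / n_c
    widths = np.clip(np.minimum(right, max_fpr) - left, 0.0, None)
    return float(np.sum(tpr * widths))

def rank_genes(Y_cases, Y_controls, max_fpr=0.1):
    """Gene indices ordered from largest to smallest partial AUC.
    Y_cases is (n_D, genes); Y_controls is (n_C, genes)."""
    scores = np.array([partial_auc(Y_cases[:, g], Y_controls[:, g], max_fpr)
                       for g in range(Y_cases.shape[1])])
    return np.argsort(-scores), scores
```

With `max_fpr=1.0` the score reduces to the full empirical AUC, which makes the contrast with a high-specificity ranking easy to check on the same data.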
Sample Sizes for Gene Discovery Studies • traditional calculations based on statistical hypothesis testing • These are exploratory studies, need new methods • Propose to base calculations on the probability that a differentially expressed gene will rank high among all genes • Use computer simulation for sample size calculations
With 50 tumor and 50 normal tissues, we can be 83.6% sure that the top 30 genes will rank in the top 100 in the experiment.
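The simulation-based idea, choosing sample sizes from the probability that truly differentially expressed genes rank high, can be sketched as below. Everything in this example is an assumption made for illustration: the Gaussian data model, the effect size, the number of genes, and the use of a Welch t-statistic as the ranking statistic (swapped in for brevity in place of the partial AUC). It does not reproduce the 83.6% figure, which comes from the authors' own simulation set-up.

```python
import numpy as np

def prob_true_genes_in_top(n_cases, n_controls, n_genes=1000, n_true=10,
                           effect=1.0, top_m=50, n_sim=100, seed=0):
    """Monte Carlo estimate of the probability that all n_true truly
    over-expressed genes rank within the top_m genes of the experiment."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        cases = rng.standard_normal((n_cases, n_genes))
        controls = rng.standard_normal((n_controls, n_genes))
        cases[:, :n_true] += effect  # first n_true genes carry a real shift
        se = np.sqrt(cases.var(axis=0, ddof=1) / n_cases
                     + controls.var(axis=0, ddof=1) / n_controls)
        stat = (cases.mean(axis=0) - controls.mean(axis=0)) / se
        top = np.argsort(-stat)[:top_m]
        hits += bool(np.all(np.isin(np.arange(n_true), top)))
    return hits / n_sim

# Compare candidate designs under the same hypothetical effect size.
for n in (5, 20, 50):
    print(n, prob_true_genes_in_top(n, n))
```

The sample size is then taken as the smallest design for which this probability reaches a target chosen in advance.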
Pepe et al. Selecting differentially expressed genes from microarray experiments. Biometrics (in press)
Summary
• The methods we developed for selecting genes and calculating sample sizes are more appropriate for the purpose of diagnosis and early detection
Analysis: Statistical Methods for Biomarker Discovery from High-Dimensional Data Sets
• Method development was motivated by SELDI data from John Semmes/George Wright at Eastern Virginia Medical School
• Data consist of protein intensities at tens of thousands of mass/charge points on each of 297 individuals
• Developed three approaches to biomarker discovery: wavelets, boosted decision trees, and automated peak identification
The EVMS prostate cancer biomarker project • Prostate cancer patients: N=99 early-stage N=98 late-stage • Normal controls N=96 • Serum samples for proteomic analysis by Surface Enhanced Laser Desorption/Ionization (SELDI) • Goal: To discover protein signals that distinguish cancers from normals
An example of SELDI output
[Figure: SELDI spectrum; 48,000 mass/charge points (200K Da)]
The design of the biomarker analysis
• Full sample: Normal N=96, PCa-early N=99, PCa-late N=98
• Training data: 167 PCa (84 early, 83 late) vs. 81 Normal
• Test data (blinded): 30 PCa, 15 Normal
Wavelet Analysis
Lead: Yinsheng Qu
Steps in the wavelet analysis:
• Represent original data plot with a set of wavelets (dimension reduction)
• Determine those wavelets that distinguish between subgroups (information criterion)
• Define discriminating functions based on the distinguishing wavelets (Fisher discrimination)
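The three steps above can be sketched for the two-group case as follows. This is a simplified illustration under stated assumptions: spectra whose length is a power of 2, a full orthonormal Haar transform, a simple separation score standing in for the information criterion (whose exact form is not given here), and a two-group Fisher linear discriminant. All function names are hypothetical.

```python
import numpy as np

def haar_dwt(signal):
    """Full orthonormal Haar transform; signal length must be a power of 2."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("signal length must be a power of 2")
    details = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # smooth part
        diff = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # detail part
        details.append(diff)
        x = avg
    return np.concatenate([x] + details[::-1])

def select_coefficients(W, y, k):
    """Indices of the k coefficients with the largest separation score
    (a stand-in for the information criterion)."""
    a, b = W[y == 1], W[y == 0]
    score = (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0) + 1e-12)
    return np.argsort(-score)[:k]

def fisher_discriminant(Z, y, ridge=1e-6):
    """Two-group Fisher discriminant: direction w and classification cut-off.
    The small ridge term is added only for numerical stability."""
    a, b = Z[y == 1], Z[y == 0]
    pooled = (np.atleast_2d(np.cov(a, rowvar=False)) * (len(a) - 1)
              + np.atleast_2d(np.cov(b, rowvar=False)) * (len(b) - 1)) / (len(Z) - 2)
    w = np.linalg.solve(pooled + ridge * np.eye(Z.shape[1]),
                        a.mean(axis=0) - b.mean(axis=0))
    cutoff = w @ (a.mean(axis=0) + b.mean(axis=0)) / 2.0
    return w, cutoff

def fit_and_classify(spectra, y, k=3):
    """Run the three steps and return predicted labels for the same spectra."""
    W = np.array([haar_dwt(s) for s in spectra])
    keep = select_coefficients(W, y, k)
    w, cutoff = fisher_discriminant(W[:, keep], y)
    return (W[:, keep] @ w > cutoff).astype(int)
```

In practice the selection and discriminant steps would be fitted on training spectra and then applied unchanged to the blinded test spectra; the sketch classifies its own training data only to stay short.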
Three Group Classification: Normal, Cancer, BPH
12,352 mass spectrum data points were reduced to 3,420 Haar wavelet coefficients, of which 17 coefficients distinguish between the three groups. 2 classification functions were generated.

                     Truth
Predicted   Normal   Cancer   BPH
Normal          14        0     0
Cancer           1       27     7
BPH              0        3     8
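The classification table can be summarized with a few lines of arithmetic. The sketch assumes that rows are predicted groups and columns are true groups; that orientation is inferred from the flattened slide and from the column totals (15 normal, 30 cancer), so treat it as an assumption.

```python
import numpy as np

labels = ["Normal", "Cancer", "BPH"]
# Assumed orientation: rows = predicted group, columns = true group.
confusion = np.array([[14, 0, 0],
                      [1, 27, 7],
                      [0, 3, 8]])

true_totals = confusion.sum(axis=0)                 # subjects per true group
correct_by_group = np.diag(confusion) / true_totals # fraction correctly classified
overall_accuracy = np.trace(confusion) / confusion.sum()

for name, total, rate in zip(labels, true_totals, correct_by_group):
    print(f"{name}: {rate:.3f} of {total} correctly classified")
print(f"Overall: {overall_accuracy:.3f}")
```

Under this reading, most of the errors come from BPH subjects classified as cancer, which matches the concern about BPH and prostate cancer labeling raised in the final summary.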
Qu Y et al. Data reduction using discrete wavelet transform in discriminant analysis with very high dimension. Biometrics, in press.
Boosted Decision Tree Method
Lead: Yinsheng Qu/Yutaka Yasui
• This method combines multiple weak learners into a very accurate classifier
• It can be used in cancer detection
• It can also be used to identify tumor markers
• Using this method, we can separate controls, BPH, and PCa without error in the test set
Outline of boosting decision tree
• The combined classifier is a committee whose members are the base classifiers, the decision stumps. It makes decisions by weighted majority vote.
• The base classifiers are constructed on weighted examples: misclassified examples receive larger weights in the next round.
• The 2nd stump’s specialty is to correct the 1st stump’s mistakes, the 3rd stump’s specialty is to correct the 2nd stump’s mistakes, and so on.
• A combined classifier with dozens or even hundreds of decision stumps can be highly accurate.
• The boosting technique is resistant to overfitting.
Classifier 2: A boosted decision stump classifier with 21 peaks (potential markers)
The Boosting procedure
• $y_i \in \{\text{cancer}, \text{normal}\} = \{1, -1\}$; each base classifier $f_m(x_i) \in \{1, -1\}$
• Initial weights ($m = 1$): $w_i = 1$ for $i = 1, \ldots, N$
• Choose the first peak and threshold $c$
• For $m = 1$ to $M$:
  - update $w_i \leftarrow w_i \exp\{a_m I(\text{incorrect})\}$, where $a_m = \ln\{(1 - \mathrm{err}_m)/\mathrm{err}_m\}$ and $\mathrm{err}_m$ is the classification error rate at the current stage
  - normalize the weights so that they sum to $N$
  - choose the next peak and threshold $c$ using the weighted data (the $i$-th subject has weight $w_i$)
• Final classifier: $f(x) = \sum_{m=1}^{M} a_m f_m(x)$; the $i$-th subject is classified as cancer if $f(x_i) > 0$
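The procedure above can be sketched with decision stumps on a matrix of peak intensities. This is a minimal AdaBoost-style illustration, assuming that the error rate is the weighted error (as in standard AdaBoost), that each stump is found by exhaustive search over peaks and observed thresholds, and that a stump carries a sign so it can point in either direction. Function names are hypothetical and this is not the authors' software.

```python
import numpy as np

def stump_predict(X, j, c, s):
    """Stump on peak j: predicts s when X[:, j] > c, otherwise -s."""
    return s * np.where(X[:, j] > c, 1, -1)

def fit_stump(X, y, w):
    """Exhaustive search for the stump with the smallest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for c in np.unique(X[:, j]):
            for s in (1, -1):
                err = np.sum(w * (stump_predict(X, j, c, s) != y)) / np.sum(w)
                if err < best[0]:
                    best = (err, j, c, s)
    return best

def boost(X, y, n_rounds):
    """Return a list of (a_m, peak, threshold, sign) for n_rounds stumps."""
    n = len(y)
    w = np.ones(n)                                # initial weights w_i = 1
    stumps = []
    for _ in range(n_rounds):
        err, j, c, s = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)      # guard against log(0)
        a = np.log((1 - err) / err)               # a_m = ln((1 - err) / err)
        incorrect = stump_predict(X, j, c, s) != y
        w = w * np.exp(a * incorrect)             # up-weight misclassified subjects
        w = w * n / w.sum()                       # normalize weights to sum to N
        stumps.append((a, j, c, s))
    return stumps

def decision_function(X, stumps):
    """f(x) = sum over m of a_m f_m(x); f(x) > 0 is classified as cancer (+1)."""
    return sum(a * stump_predict(X, j, c, s) for a, j, c, s in stumps)

# Toy example: one peak, cancer (+1) at low and high intensity, normal (-1) between.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])
print(np.sign(decision_function(X, boost(X, y, 3))))
```

No single stump can separate the toy data, which is why several rounds are needed before the weighted vote classifies every subject correctly.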
When to stop iteration?
• Minimal margin: the minimum of $y_i f(x_i)$ over all $N$ subjects
• The minimal margin in the training sample measures how well the two classes are separated by the classifier.
• Even after the classifier reaches zero error on the training sample, further iterations that increase the minimal margin can improve prediction in future samples.
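Tracking the minimal margin across boosting rounds can be sketched as below. The inputs are assumed to be the stump predictions and weights $a_m$ saved from a boosting run; the optional normalization by the sum of the $|a_m|$ is an added convenience for comparing rounds and is not part of the definition above. The function name is hypothetical.

```python
import numpy as np

def minimal_margins(y, stump_preds, alphas, normalize=False):
    """Minimal margin min_i y_i f(x_i) after each boosting round.

    y           : labels in {1, -1}, length N
    stump_preds : array (M, N); row m holds f_m(x_i) in {1, -1}
    alphas      : stump weights a_m, length M
    """
    y = np.asarray(y, dtype=float)
    H = np.asarray(stump_preds, dtype=float)
    a = np.asarray(alphas, dtype=float)
    f = np.cumsum(a[:, None] * H, axis=0)      # f(x_i) after rounds 1..M
    margins = (y[None, :] * f).min(axis=1)     # worst-separated subject per round
    if normalize:
        margins = margins / np.cumsum(np.abs(a))
    return margins
```

A negative minimal margin means at least one training subject is still misclassified; once it turns positive the training error is zero, and iteration could continue for as long as the value keeps increasing.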
Qu et al. 2002. Boosted Decision Tree Analysis of SELDI Mass Spectral Serum Profiles Discriminates Prostate Cancer from Non-Cancer Patients. Clinical Chemistry. In press.
Adam et al. 2002. Serum Protein Fingerprinting Coupled with a Pattern Matching Algorithm that Distinguishes Prostate Cancer from Benign Prostate Hyperplasia and Healthy Men. Cancer Research. 62:3609-3614.
Summary • Wavelets approach: Does not require peak identification (black-box classification) • Boosting decision tree: Requires peak identification first. Useful for both classification and protein mass identification
Final Summary
• The methods developed in the past two years are mainly for Phase 1 and 2 studies, reflecting the current needs of EDRN.
• EDRN DMCC statisticians are working on key design and analysis issues in early detection research.
• More work remains to be done (e.g., in classification, consider the possible mislabeling of prostate cancer as BPH; examine gene-by-environment interactions).