On the Importance of Data Cleansing and Pre-processing for Genomic Studies. Raymond Ng Computer Science, UBC (ICapture and BC Cancer Research). My Key Genomic Projects. Better biomarkers in transplantation (Genome Canada)
On the Importance of Data Cleansing and Pre-processing for Genomic Studies Raymond Ng Computer Science, UBC (ICapture and BC Cancer Research)
My Key Genomic Projects • Better biomarkers in transplantation (Genome Canada) • Rational chemotherapy selection for Non-Small Cell lung cancer (Genome Canada) • Nanosilver effects on amphibian wildlife using novel molecular assays (NSERC) • Frog Sentinel species comparative “omics” for the environment (Genome BC)
Overview: Better Biomarkers in Transplantation • Vital organ failure is a leading cause of premature death worldwide • Organ transplantation restores life and health to over 40,000 patients per year • Life after transplantation is not clear sailing: • Immunosuppressive drugs cause infection, cancer, diabetes, heart disease, and kidney failure • Transplant failure and treatment complications consume enormous health care resources
BiT Overall Objectives • Identify effective and widely applicable markers that… • predict rejection or immune accommodation of solid organ transplants • diagnose acute and chronic rejection • forecast the response to therapies that individual transplant recipients receive
BiT Analysis Pipeline • PAXgene Whole Blood • Affymetrix HG U133 Plus 2 Microarrays (54,675 probe sets ~ 38,500 genes) • Normalization and Pre-filtering (~15,000 probe sets) • Univariate feature selection (~200 probe sets) • Classifier building + Pathway analysis • Biomarker Panel
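The pre-filtering and univariate selection stages of the pipeline can be sketched as follows. This is a minimal illustration only: the slides do not say which filter or test statistic BiT uses, so the variance filter, Welch's t-statistic, the function names, the thresholds, and the toy expression values are all assumptions.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def select_probe_sets(expr, labels, min_var=0.01, top_k=2):
    """expr: {probe_id: [value per sample]}; labels: [0 or 1 per sample].

    min_var and top_k are hypothetical settings, not values from the slides.
    """
    # Pre-filter: drop probe sets that barely vary across samples.
    kept = {p: v for p, v in expr.items() if variance(v) >= min_var}
    scores = {}
    for probe, values in kept.items():
        g0 = [x for x, y in zip(values, labels) if y == 0]
        g1 = [x for x, y in zip(values, labels) if y == 1]
        scores[probe] = abs(welch_t(g0, g1))
    # Univariate selection: keep the top_k probe sets by |t|.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data (made up): three samples per class.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "probe_a": [5.0, 5.1, 4.9, 8.0, 8.2, 7.9],  # strongly differential
    "probe_b": [6.0, 6.0, 6.0, 6.0, 6.0, 6.0],  # flat, removed by pre-filter
    "probe_c": [3.0, 4.0, 5.0, 4.0, 3.0, 5.0],  # noisy, not differential
    "probe_d": [2.0, 2.2, 2.1, 3.0, 3.3, 3.1],  # moderately differential
}
selected = select_probe_sets(expr, labels)
print(selected)
```

In a real study the selected probe sets would then feed classifier building, and the selection would need to sit inside any cross-validation loop to avoid optimistic bias.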
Complex Data Types Common Theme: data cleansing and pre-processing
My Studies on Cleansing and Pre-processing • Clinical: “Detecting potential labeling errors in microarrays by data perturbation,” Bioinformatics 2006 (Malossini, Blanzieri) • mRNA: “MDQC: a new quality assessment method for microarrays based on quality control reports,” Bioinformatics 2007 (Cohen-Freue, Hollander et al.) • DNA: “Modelling Recurrent DNA Copy Number Alterations in array CGH Data,” Bioinformatics 2007 (Shah, Murphy, Lam) • Proteomics: “Linking Protein Groups Across Multiple Experiments,” in preparation (Cohen-Freue et al.)
A. Detecting Potential Labeling Errors [MBN06] • Biomedical data can be very noisy: • Laboratory environment could change • Diagnostic decisions are not completely objective • Different “gold-standards” are used for grading • Essential to check for label (e.g., grade of rejection) consistency
A. Our Approach • Propose a leave-one-out perturbed classification matrix: • Flip the label of each training sample in turn and compare the resulting classifier with the classifier trained on the original training set • Look for differences in the sets obtained from Support Vector Machines • E.g., sample A is suspected of being mislabeled if flipping A’s label increases accuracy • Effectiveness shown on 3 real microarray data sets with ground truths
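The label-flipping idea can be sketched in a few lines. This is not the method of [MBN06]: to stay dependency-free it swaps the Support Vector Machine for a nearest-centroid classifier and scores each perturbed label set by leave-one-out accuracy. The function names and the toy data are assumptions made for illustration.

```python
def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (SVM stand-in)."""
    correct = 0
    for i in range(len(X)):
        train = [(x, lab) for j, (x, lab) in enumerate(zip(X, y)) if j != i]
        cents = {}
        for lab in set(l for _, l in train):
            cents[lab] = centroid([x for x, l in train if l == lab])
        pred = min(cents, key=lambda lab: sq_dist(X[i], cents[lab]))
        correct += pred == y[i]
    return correct / len(X)


def suspect_labels(X, y):
    """Indices whose label flip (0 <-> 1) raises leave-one-out accuracy."""
    base = loo_accuracy(X, y)
    suspects = []
    for i in range(len(y)):
        flipped = list(y)
        flipped[i] = 1 - flipped[i]
        if loo_accuracy(X, flipped) > base:
            suspects.append(i)
    return suspects


# Toy data (made up): sample 3 sits in the class-0 cluster but is labeled 1.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
     (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.1, 5.0)]
y = [0, 0, 0, 1, 1, 1, 1, 1]
suspects = suspect_labels(X, y)
print(suspects)
```

Flagged samples are candidates for re-examination by a clinician, not automatic relabeling.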
B. MDQC: Our Approach to Microarray Quality Control [FZN+07] • collapses all values in QC reports into measures to assess the quality of the array, the sample, and the RNA • measures the distance of each array to an “average” array in the study, adjusting for covariances • accounts for interrelations among measures to identify outlying arrays that are not evident from inspection of each measure in isolation
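A distance to an "average" array that adjusts for covariances is a Mahalanobis distance. The sketch below computes it for two QC measures using the ordinary sample mean and covariance. That is a simplification chosen for brevity and is not claimed to match the estimators in [FZN+07]. The function name and QC values are made up.

```python
def mahalanobis_sq_2d(rows):
    """Squared Mahalanobis distance of each (m1, m2) QC row to the mean row."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    # Sample covariance matrix S = [[sxx, sxy], [sxy, syy]].
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    det = sxx * syy - sxy ** 2
    out = []
    for x, y in rows:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(S) * [dx dy]^T, with the 2x2 inverse expanded.
        out.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return out


# Hypothetical QC measures for 7 arrays. The two measures are correlated;
# the last array is within range on each measure but breaks the correlation.
qc = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8),
      (5.0, 5.1), (6.0, 5.9), (2.0, 5.0)]
d2 = mahalanobis_sq_2d(qc)
worst = d2.index(max(d2))
print(worst, [round(v, 2) for v in d2])
```

The last array would pass a per-measure threshold check, yet it has the largest distance, which is the point of the multivariate view.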
[Figure: three panels (Sample Quality, Chip Quality, RNA Quality), each plotted against sample index, with outlying arrays such as 21-4 labeled]
B. MDQC Advantages • Performs a multidimensional analysis and does not require absolute thresholds (which are often arbitrary) • Easy to implement and visualize, and computationally inexpensive (as compared with Affy PLM) • Can suggest potential sources of problems and possible batch effects
Possible Batch Effects [Figure annotations: some of the samples from batch 9; batches 1, 2, 3 and 4]
C. DNA Copy Number Analysis with CGH arrays [SNLM06,07] • Segments of DNA that get duplicated (gains) or deleted (losses) • Chromosomal aberrations are being used to form signatures • Chemotherapy selection for NSCLC • Staging in cancer (e.g., lung and oral)
Computational challenges • Noisy signals • Spatial dependence between adjacent clones • Outliers • Systematic errors • Copy number polymorphisms
C. Two State-of-the-Art Methods MERGELEVELS Base-HMM
C. Our Approach • Use a Hidden Markov Model (HMM) to capture spatial dependency between clones • Use a Gaussian mixture model to model the outliers separately from the inliers • Outliers have no spatial dependence • Use prior knowledge about locations of CNPs to ‘inform’ the model about possible locations of outliers
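A small sketch of the two core ideas follows: an HMM over clones (loss, neutral, gain) decoded with Viterbi, where each emission is a two-component Gaussian mixture whose broad outlier component does not depend on the state. It is not the LSP-HMM. The CNP location prior is omitted, and every parameter value, name, and log-ratio below is an assumption.

```python
from math import exp, log, pi, sqrt


def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))


def viterbi_copy_number(ratios, means=(-0.5, 0.0, 0.5), sigma=0.1,
                        stay=0.95, outlier_prob=0.05, outlier_sigma=1.0):
    """Most likely state path (0=loss, 1=neutral, 2=gain) for log2 ratios."""
    k = len(means)

    def log_emit(x, s):
        # Inlier: tight Gaussian around the state mean.
        inlier = (1 - outlier_prob) * normal_pdf(x, means[s], sigma)
        # Outlier: broad Gaussian shared by all states (no spatial dependence).
        outlier = outlier_prob * normal_pdf(x, 0.0, outlier_sigma)
        return log(inlier + outlier)

    def log_trans(a, b):
        return log(stay) if a == b else log((1 - stay) / (k - 1))

    score = [log(1 / k) + log_emit(ratios[0], s) for s in range(k)]
    back = []
    for x in ratios[1:]:
        prev, score, ptr = score, [], []
        for s in range(k):
            best = max(range(k), key=lambda a: prev[a] + log_trans(a, s))
            ptr.append(best)
            score.append(prev[best] + log_trans(best, s) + log_emit(x, s))
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Made-up clone ratios: neutral region with one outlier (0.9), then a gain.
ratios = [0.02, -0.05, 0.01, 0.9, 0.03, 0.48, 0.52, 0.55, 0.47]
path = viterbi_copy_number(ratios)
print(path)
```

Because the outlier component absorbs the isolated 0.9 value, the decoded path stays neutral through it and switches to gain only where several adjacent clones agree.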
Example results LSP-HMM
D. From Peptides to Proteins [Figure: peptides p1 to p12 mapped to proteins A to H; protein categories shown: Distinct, Indistinguishable, Subset, Shared Peptide]
D. Protein Groups • Which protein is present in the sample? • Form protein groups where proteins in a group are not distinguishable [Figure: peptides p5 to p9 mapped to proteins C, D, E, F]
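One simple reading of "not distinguishable" is that two proteins are matched by exactly the same set of identified peptides. The sketch below groups proteins on that criterion only. It ignores subset and shared-peptide cases, and the peptide-to-protein mapping is hypothetical because the original figure's edges are not recoverable from the transcript.

```python
from collections import defaultdict


def protein_groups(protein_to_peptides):
    """Group proteins that are supported by an identical set of peptides."""
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)
    # Sort for a stable, readable result.
    return sorted(sorted(g) for g in groups.values())


# Hypothetical mapping (not taken from the slide's figure).
evidence = {
    "C": {"p5", "p6"},
    "D": {"p5", "p6"},  # same peptides as C, so C and D cannot be told apart
    "E": {"p7", "p8"},
    "F": {"p9"},
}
groups = protein_groups(evidence)
print(groups)
```

Because the grouping depends on which peptides were identified, the same proteins can fall into different groups in different experiments, which is what makes linking groups across sets hard.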
D. Linking Protein Groups Across Different Sets: a Challenge • Proteins G and H may not be identified in Set 2 [Figure: Set 1 and Set 2, showing peptides p4, p10, p11, p12 and proteins G, H, I; annotations: divide, merge, two protein groups, one protein group]
Conclusions • We propose an algorithm to solve the protein group linking problem (submitted) • We discussed some approaches to cleansing and pre-processing for clinical, mRNA, DNA and proteomics data • In our biomarker studies, dealing with these issues has proven to be significant to the analysis