A pipeline based on multivariate correspondence analysis for cancer genomics with various biological level information sources and data mining techniques. Explore techniques for DNA, RNA, protein, and phenotype analysis.
A pipeline based on multivariate correspondence analysis with supplementary variables for cancer genomics
Christine Steinhoff
Max Planck Institute for Molecular Genetics, Berlin, Germany
Data Sources
Biological level: DNA/Genome, RNA, Protein, Phenotype
Information source and technology examples:
• Literature/database: EST library; physical parameters of DNA, RNA, proteins, etc.; DNA sequence, data mining, literature mining, ...
• In silico: methylation prediction; TFBS prediction; functional annotations (repetitive elements, functional categories, ...); splicing
• Experimental, profiling/characterizing: epigenetics; SNP arrays; arrayCGH; sequencing; expression arrays; ...
• Experimental, interaction: ChIP-chip; protein interaction; MASS of complexes; ...
• Experimental, phenomics: imaging; RNAi techniques; MASS; medical observations
PROBLEMS
• Cat(m, c): discrete categories
• After appropriate normalization: approximately lognormal, symmetric
• Not symmetric, skewed
• Scale and distribution differ!
Procedure: Data input -> Discretization -> Filtering -> Indicator coding -> Multiple Correspondence Analysis
Step 1: Discretization
Patients: covariates, arrayCGH, expression
• Covariates are categorical, e.g. staging, grading, smoking, mutation, ...
Step 1: Discretization
• arrayCGH: package DNAcopy; segmentation and discretization of arrayCGH data
• Expression: probability of expression; fold-change criterion
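A minimal sketch of the discretization idea, assuming log2 ratios are already available per gene. It does not reproduce DNAcopy (an R/Bioconductor package); for arrayCGH it assumes segment means have been computed beforehand. The cut-offs and the category labels are hypothetical choices for illustration, not values from the talk.

```python
from typing import List, Sequence

def discretize(log2_ratios: Sequence[float],
               lower: float = -1.0,   # hypothetical cut-off: two-fold decrease
               upper: float = 1.0,    # hypothetical cut-off: two-fold increase
               labels: Sequence[str] = ("under", "normal", "over")) -> List[str]:
    """Map each log2 ratio to one of three ordered categories."""
    low, mid, high = labels
    states = []
    for x in log2_ratios:
        if x <= lower:
            states.append(low)
        elif x >= upper:
            states.append(high)
        else:
            states.append(mid)
    return states

# Expression: fold-change criterion on log2 ratios (toy values).
expression = discretize([-1.8, 0.2, 1.3])

# arrayCGH: same rule applied to segment means, with assumed tighter cut-offs.
acgh = discretize([-0.6, 0.05, 0.9], lower=-0.3, upper=0.3,
                  labels=("loss", "normal", "gain"))

print(expression)
print(acgh)
```

After this step every data type shares the same scale: a small set of discrete states per gene and patient.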
Step 1: Discretization
Patients: covariates, arrayCGH, expression
Typically: n ~ 23,000 -> reduce number
Step 2: Filtering (optional)
Possibilities:
• Neglect all genes with no change in any patient
• Choose genes with the highest variance across patients
• Select for high correlation between arrayCGH and expression
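The three filtering options above can be sketched as follows. The gene names, the toy values, the number of genes kept and the correlation cut-off of 0.8 are all assumptions for illustration; the talk does not state which thresholds were used.

```python
from statistics import pvariance

def has_change(states, normal="normal"):
    """True if the gene leaves the 'normal' state in at least one patient."""
    return any(s != normal for s in states)

def top_variance(values_by_gene, k):
    """Names of the k genes with the highest variance across patients."""
    ranked = sorted(values_by_gene,
                    key=lambda g: pvariance(values_by_gene[g]),
                    reverse=True)
    return ranked[:k]

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

# Toy data: three hypothetical genes measured in three patients.
acgh_states = {"G1": ["normal", "normal", "normal"],
               "G2": ["gain", "normal", "loss"],
               "G3": ["normal", "gain", "gain"]}
acgh = {"G1": [0.0, 0.0, 0.1], "G2": [0.8, 0.0, -0.7], "G3": [0.0, 0.6, 0.7]}
expr = {"G1": [0.0, 0.1, -0.1], "G2": [1.5, 0.0, -1.4], "G3": [0.2, 0.9, 1.1]}

changed = [g for g in acgh_states if has_change(acgh_states[g])]
most_variable = top_variance(expr, k=2)
correlated = [g for g in expr if pearson(acgh[g], expr[g]) >= 0.8]

print(changed, most_variable, correlated)
```

The filters are independent of each other, so they can be applied alone or chained, depending on how strongly the gene list has to be reduced.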
Step 3: Indicator Matrix - Binary Coding
Original matrix with categories -> indicator matrix with binary coding
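A small sketch of binary (indicator) coding, assuming each object is described by one category per variable. The variable names `acgh` and `expr` and the toy records are hypothetical.

```python
def indicator_matrix(rows, variables):
    """Expand categorical records into a 0/1 matrix.

    Each (variable, category) pair becomes one column; a row gets a 1 in
    exactly one column per variable.
    """
    columns = []
    for var in variables:
        for cat in sorted({row[var] for row in rows}):
            columns.append((var, cat))
    matrix = [[1 if row[var] == cat else 0 for var, cat in columns]
              for row in rows]
    return columns, matrix

# Original matrix with categories (toy example).
records = [
    {"acgh": "gain",   "expr": "over"},
    {"acgh": "normal", "expr": "normal"},
    {"acgh": "loss",   "expr": "normal"},
]

columns, Z = indicator_matrix(records, ["acgh", "expr"])
for col in columns:
    print(col)
for z_row in Z:
    print(z_row)
```

Every row of the indicator matrix sums to the number of variables, which is a quick sanity check after coding.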
From: Multiple Correspondence Analysis and Related Methods
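For orientation, the textbook computation behind multiple correspondence analysis of an indicator matrix can be written as below. This is the standard formulation and may differ in detail from the variant used in the talk; the symbols $Z$ (indicator matrix), $P$, $r$, $c$, $S$, $F$, $G$, $\Phi$ and $h$ are notation introduced here, with $D_r = \mathrm{diag}(r)$ and $D_c = \mathrm{diag}(c)$.

```latex
\begin{aligned}
P &= \frac{Z}{\sum_{i,j} z_{ij}}, \qquad r = P\mathbf{1}, \qquad c = P^{\top}\mathbf{1}, \\
S &= D_r^{-1/2}\,\bigl(P - r c^{\top}\bigr)\,D_c^{-1/2} = U \Sigma V^{\top}, \\
F &= D_r^{-1/2}\, U \Sigma, \qquad G = D_c^{-1/2}\, V \Sigma, \\
g_{\mathrm{sup}} &= h^{\top} \Phi, \qquad \Phi = D_r^{-1/2}\, U .
\end{aligned}
```

$F$ and $G$ hold the principal coordinates of rows and columns. A supplementary variable does not enter the decomposition: its indicator column is divided by its sum to give the profile $h$, which is then projected onto the existing axes through the standard row coordinates $\Phi$.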
EXAMPLE: PUBLISHED DATA
Covariate States' Display
Explore ERBB2 and MYC
• ERBB2 amplified in ACGH; ERBB2 overexpression; ERBB2 normal in ACGH
• ERBB2 underexpression; ERBB2 loss in ACGH
• MYC overexpression; MYC amplification
• MYC normal in ACGH; MYC underexpression
Enrichment of GO Categories
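The slides do not say which test was used for GO enrichment. A common choice is a one-sided hypergeometric test, sketched here under that assumption; the function name and the toy numbers are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the background, K: background genes annotated to the GO term,
    n: genes in the selected set, k: selected genes annotated to the term.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Toy example: 10 background genes, 3 annotated, 4 selected, 2 of them annotated.
p = hypergeom_pvalue(k=2, n=4, K=3, N=10)
print(p)
```

In practice one such test is run per GO category, so the resulting p-values need a multiple-testing correction before they are interpreted.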
Thank you for your attention!
ACKNOWLEDGEMENT
Martin Vingron, Matteo Pardo
Sensor Lab, CNR-INFM; Max Planck Institute for Molecular Genetics