
Robust diagnosis of DLBCL from gene expression data from different laboratories


Presentation Transcript


  1. Robust diagnosis of DLBCL from gene expression data from different laboratories. Gyan Bhanot, IBM Research. DIMACS Workshop, June 22, 2005.

  2. Collaborators: Gabriela Alexe (1), Arnold Levine (2,3), Gustavo Stolovitzky (1). Affiliations: 1 IBM, 2 IAS, 3 UMDNJ.

  3. Overview • Motivation • Pattern-based meta-classifiers • Case study – compare data from two labs for DLBCL vs FL diagnosis

  4. Motivation. Cancer is a genetic/proteomic disease. • Genetic mutations, viruses, and radiation modify pathways to create a survival advantage for the damaged cell. • Gene arrays are a way to study the variation of mRNA levels between diseased and healthy cells. • This allows diagnosis and inference of the pathways that cause disease.

  5. Cancer diagnosis. Input: training (biomedical/proteomic, microarray) data with k ≥ 2 classes (m samples), each sample described by N >> m features. Output: a collection of robust biomarkers and models, and a robust, accurate classifier tested on out-of-sample data.

  6. Strategy of the present paper: 1. Transform the original data to “pattern” space. 2. Find robust sets of biomarkers with significant collective discriminatory power. 3. Apply many machine learning tools to the original and pattern data: ANN, SVM, kNN, weighted voting, classification trees. 4. Validate the results on data from a different lab.

  7. Patterns. (Figure: observed dataset and system response.)

  8. Pattern basics. (Figure: examples of positive and negative patterns; see the sketch below.)
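For readers following along in code, here is a minimal sketch of the pattern idea, assuming a Logical-Analysis-of-Data-style definition in which a pattern is a conjunction of threshold conditions on a few genes: positive patterns are satisfied by many DLBCL samples and ideally no FL samples, and negative patterns play the symmetric role. The gene names (PLK1, CDK2) and cutoffs are hypothetical placeholders, not the patterns actually reported later in the talk.

```python
def satisfies(sample, pattern):
    """sample: dict gene -> normalized expression;
    pattern: list of (gene, op, cutoff) conditions, all of which must hold."""
    for gene, op, cutoff in pattern:
        value = sample[gene]
        if op == ">=" and not value >= cutoff:
            return False
        if op == "<=" and not value <= cutoff:
            return False
    return True

# Hypothetical positive (DLBCL) and negative (FL) patterns on two selected genes
P1 = [("PLK1", ">=", 0.5), ("CDK2", ">=", 0.2)]
N1 = [("PLK1", "<=", 0.0)]

case = {"PLK1": 0.8, "CDK2": 0.4}
print(satisfies(case, P1), satisfies(case, N1))  # True False
```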

  9. Individual classifiers used: SVM, ANN, WV, kNN, CART, LR. Each is trained/calibrated (leave-one-out) on both the raw data and the pattern data.

  10. Application: progression of Follicular Lymphoma (FL) to Diffuse Large B Cell Lymphoma (DLBCL), using gene array data from different laboratories. Shipp et al. (2002) Nature Med. 8(1), 68-74 (Whitehead lab). Stolovitzky G. (2005), in Deisboeck et al., Complex Systems Science in BioMedicine (in press; preprint: http://www.wkap.nl/prod/a/Stolovitzky.pdf) (Dalla-Favera lab). Alexe et al. (2005) Artificial Intelligence in Medicine (in press).

  11. Non-Hodgkin lymphomas. FL: low-grade non-Hodgkin lymphoma; t(14;18) translocation leading to over-expression of anti-apoptotic bcl2; 25-60% of FL cases evolve to DLBCL. DLBCL: high-grade non-Hodgkin lymphoma; < 2 years survival if untreated. Biomarkers of FL transformation to DLBCL: • p53/MDM2 (Moller et al., 1999) • p16 (Pinyol et al., 1998) • p38MAPK (Elenitoba-Johnson et al., 2003) • c-myc (Lossos et al., 2002).

  12. Lymphoma datasets. Data: WI (Shipp et al., 2002), Affy HuGeneFL; CU (Dalla-Favera lab; Stolovitzky, 2005), Affy HG-U95Av2. Samples: WI, 58 DLBCL & 19 FL; CU, 14 DLBCL & 7 FL. Genes: WI, 6817; CU, 12581.

  13. Data preprocessing. • Require 50% P (present) calls; clip values to UL = 16000, LL = 20. • Stratify the WI data 2:1 into train/test; the CU data serve as an independent test set. • Compute the SD of each gene across samples. • Normalize each gene to mean 0, SD 1. • Generate 500 datasets using added noise, k-fold stratified sampling, and jackknife resampling. • Find genes highly correlated with phenotype using a t-test or SNR; keep genes selected in > 450/501 of the datasets. (A sketch of these steps follows below.)
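A hedged sketch of this preprocessing pipeline: only the clipping limits (20 and 16000), per-gene normalization, SNR scoring, and the idea of keeping genes that survive most resampled datasets come from the slide; the data shapes, noise level, selection quantile, and number of replicates are illustrative stand-ins.

```python
import numpy as np

def preprocess(X):
    """X: samples x genes matrix of raw expression values."""
    X = np.clip(X, 20, 16000)                      # lower/upper detection limits
    return (X - X.mean(axis=0)) / X.std(axis=0)    # per-gene mean 0, SD 1

def snr_scores(X, y):
    """Signal-to-noise ratio per gene between the two phenotype classes."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    sd0, sd1 = X[y == 0].std(axis=0), X[y == 1].std(axis=0)
    return np.abs(mu0 - mu1) / (sd0 + sd1 + 1e-12)

rng = np.random.default_rng(0)
X = preprocess(rng.lognormal(5.0, 1.0, size=(77, 200)))   # stand-in for the WI data
y = np.array([1] * 58 + [0] * 19)                          # 58 DLBCL, 19 FL

# Keep genes that rank highly in most noise-perturbed resampled datasets
counts = np.zeros(X.shape[1])
for _ in range(100):                        # 500 replicates on the slide; fewer here
    scores = snr_scores(X + rng.normal(0, 0.1, X.shape), y)
    counts += scores > np.quantile(scores, 0.9)
robust_genes = np.where(counts >= 90)[0]
print(len(robust_genes), "genes kept")
```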

  14. Choosing support sets. • Create good patterns using small subsets of genes; validate by weighted voting with 10-fold cross-validation. • Sort genes by how often they appear in good patterns. • Select top genes so that each sample is covered by at least 10 patterns (see the sketch below).
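One possible reading of the support-set step, as a sketch: rank genes by how often they appear in good patterns, then keep adding top-ranked genes until every sample is covered by at least 10 of the patterns that use only chosen genes. The pattern representation and helper names are hypothetical, and this is an interpretation of the slide rather than the authors' exact procedure.

```python
from collections import Counter

def satisfies(sample, pattern):
    """pattern: list of (gene, op, cutoff); sample: dict gene -> value."""
    return all((sample[g] >= c) if op == ">=" else (sample[g] <= c)
               for g, op, c in pattern)

def choose_support_genes(good_patterns, samples, min_cover=10):
    """Add genes in order of pattern appearances until every sample is covered
    by at least min_cover patterns built only from the chosen genes."""
    freq = Counter(g for p in good_patterns for g, _, _ in p)
    chosen, usable = [], []
    for gene, _ in freq.most_common():
        chosen.append(gene)
        usable = [p for p in good_patterns if all(g in chosen for g, _, _ in p)]
        coverage = [sum(satisfies(s, p) for p in usable) for s in samples]
        if usable and min(coverage) >= min_cover:
            break
    return chosen, usable
```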

  15. The 30 genes that best distinguish FL from DLBCL

  16. Examples of FL and DLBCL patterns (WI training data): each DLBCL case satisfies at least one of the patterns P1 and P2; each FL case satisfies the pattern N1 (and none of the patterns P1 and P2).

  17. Pattern data

  18. Meta-classifier performance

  19. Error distribution: raw and pattern data

  20. Biology Based Methods

  21. FL  DLBCL progression p53 related genes identified by filtering procedure

  22. p53 pattern data

  23. Examples of patterns on p53-responsive genes (WI data): each DLBCL case satisfies one of the patterns P1, P2, P3; each FL case satisfies one of the patterns N1, N2, N3.

  24. p53 combinatorial biomarker. 77% of FL & 21% of DLBCL cases (3.7-fold ratio) have at most one of these genes over-expressed; 79% of DLBCL & 23% of FL cases (3.4-fold ratio) have at least two over-expressed. Each individual gene is over-expressed in about 40-70% of DLBCL & 20-40% of FL cases (specificity 50-60%, sensitivity 60-70%). (See the sketch below.)
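The combinatorial rule stated above is easy to express in code: count how many of the p53-responsive genes are over-expressed in a sample; two or more suggests DLBCL, at most one suggests FL. The gene list and the over-expression cutoff below are illustrative placeholders, not the authors' exact choices.

```python
P53_RESPONSIVE = ["PLK1", "CDK2", "GENE3"]   # hypothetical stand-ins for the selected genes

def combinatorial_call(sample, cutoff=0.0):
    """sample: dict gene -> normalized expression (mean 0, SD 1 per gene)."""
    n_over = sum(sample.get(g, 0.0) > cutoff for g in P53_RESPONSIVE)
    return "DLBCL-like" if n_over >= 2 else "FL-like"

print(combinatorial_call({"PLK1": 0.6, "CDK2": 0.3, "GENE3": -0.1}))  # DLBCL-like
```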

  25. What are these genes? • Plk1 (stpk13): polo-like kinase, serine/threonine protein kinase 13, M-phase specific • cell transformation, neoplastic; drives quiescent cells into mitosis • over-expressed in various human tumors • Takai et al., Oncogene, 2005: plk1 is a potential target for cancer therapy and a new prognostic marker for cancer • Mito et al., Leuk Lymph, 2005: plk1 is a biomarker for DLBCL • Cdk2 (p33): cyclin-dependent kinase: G2/M transition of mitotic cell cycle, interacts with cyclins A, B3, D, E • p53 tumor suppressor gene (Levine 1982)

  26. Conclusions • The pattern-based meta-classifier is robust against noise • Good prediction of FL → DLBCL transformation • Biology-based analysis is also possible and yields a useful biomarker • Biologically motivated sets of genes should be studied to build pathways

  27. Thank you for your attention!

  28. Artificial neural networks

  29. Support vector machines: find a maximum-margin hyperplane in pattern space (Vapnik). (Slide shows the primal (P) and dual (D) formulations.)
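An illustrative use of a maximum-margin (linear) SVM with leave-one-out calibration, as mentioned on the earlier classifier slide. This uses scikit-learn rather than the authors' implementation, and the data are synthetic stand-ins with the WI class sizes.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(77, 30))        # stand-in: 77 samples x 30 selected genes
y = np.array([1] * 58 + [0] * 19)    # DLBCL vs FL labels

svm = SVC(kernel="linear", C=1.0)    # maximum-margin linear separator
acc = cross_val_score(svm, X, y, cv=LeaveOneOut()).mean()
print(f"leave-one-out accuracy: {acc:.2f}")
```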

  30. k-Nearest neighbors • Training data: samples in a normalized feature space • Prediction for a test case: the dominant class of its k nearest neighbors in the Euclidean metric. (Figure: a new case classified against its positive and negative neighbors.)
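A short sketch of the k-nearest-neighbor rule described above (Euclidean metric, majority class among the k closest training samples), using scikit-learn on synthetic data; k = 5 is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X_train = rng.normal(size=(60, 30))            # training samples in normalized feature space
y_train = rng.integers(0, 2, size=60)          # phenotype labels
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.predict(rng.normal(size=(3, 30))))   # dominant class among the 5 nearest neighbors
```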

  31. Weighted voting on pattern data: • each pattern P is a voter • its weight is the fraction of training cases it classifies correctly • for each test case, sum the weights of the triggered positive patterns and of the triggered negative patterns • classify by the higher total weight (sketched below).
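This rule translates directly into code. In the sketch below, each pattern is a simple callable on a gene-to-value dict; the patterns, gene names, and weights are toy values, not the trained ones.

```python
def weighted_vote(case, positive, negative):
    """positive/negative: lists of (pattern_fn, weight);
    weight = fraction of training cases the pattern classified correctly."""
    pos = sum(w for fn, w in positive if fn(case))
    neg = sum(w for fn, w in negative if fn(case))
    return "DLBCL" if pos >= neg else "FL"

# Hypothetical patterns expressed as callables on a gene -> value dict
positive = [(lambda s: s["PLK1"] >= 0.5 and s["CDK2"] >= 0.2, 0.90)]
negative = [(lambda s: s["PLK1"] <= 0.0, 0.85)]
print(weighted_vote({"PLK1": 0.7, "CDK2": 0.3}, positive, negative))   # DLBCL
```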

  32. Logistic regression • Dataset of two phenotypes (e.g., cancer vs. non-cancer) • Transform into logit space: y → ln(p / (1 − p)) • Find a phenotype predictor as a linear combination of the data values in logit space (implemented with Insightful Miner).
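A hedged sketch of logistic regression for the two-phenotype case: the slide cites Insightful Miner, so the scikit-learn call below is a stand-in rather than the original tool, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(77, 30))
y = np.array([1] * 58 + [0] * 19)

lr = LogisticRegression(max_iter=1000).fit(X, y)
p = lr.predict_proba(X)[:, 1]        # P(DLBCL | expression)
logit = np.log(p / (1 - p))          # ln(p / (1 - p)): linear in the features
print(logit[:3])                     # the fitted linear predictor for the first 3 samples
```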

  33. Decision trees / forests • Find rules in the training data: find the root feature that best classifies samples by phenotype; iterate on each branch to find two new features that best split that branch by phenotype; if necessary, prune nodes with weak support • CART = Classification and Regression Trees (Breiman) • Many trees form a forest.
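To make the CART/forest idea concrete, here is a sketch using scikit-learn stand-ins on synthetic data: a single depth-limited tree plays the role of a pruned CART tree, and a random forest plays the role of "many trees"; depths and tree counts are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
X = rng.normal(size=(77, 30))
y = np.array([1] * 58 + [0] * 19)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)            # one pruned CART-style tree
forest = RandomForestClassifier(n_estimators=100).fit(X, y)     # many trees = a forest
print(export_text(tree))                                        # the learned splitting rules
```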
