Transcriptional Diagnosis by Bayesian Network. Hsun-Hsien Chang and Marco F. Ramoni. Children’s Hospital Informatics Program, Harvard-MIT Division of Health Sciences and Technology, Harvard Medical School. March 17, 2009.
Background
• Microarray technology enables profiling the expression of thousands of genes in parallel on a single chip.
• Comparative analysis of gene expression across tissue states extracts signature genes for disease diagnosis.
• Challenge:
  • The number of variables (i.e., genes) is much greater than the number of observations (i.e., biological samples), which leads to overfitting.
• Existing methods:
  • Gene selection: compute statistics (e.g., t-statistics, SNR, PCA) of individual genes and select the highest-ranking genes.
  • Classification model: create a classification function of the selected genes.
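The conventional per-gene ranking step listed under "Existing methods" can be sketched as follows. This is a minimal illustration, not the code of any particular published method: the function name `rank_genes`, the toy expression matrix, and the choice of Welch's t-statistic with a mean-difference-over-summed-standard-deviations SNR are assumptions made for the example.

```python
import numpy as np

def rank_genes(expr, labels, top_k):
    """Rank genes by |t| and also report a signal-to-noise ratio (SNR).

    expr   : (n_genes, n_samples) expression matrix (hypothetical input)
    labels : (n_samples,) array of 0/1 tissue states
    top_k  : cut-off, i.e. how many top-ranked genes to keep
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    a, b = expr[:, labels == 0], expr[:, labels == 1]

    mean_a, mean_b = a.mean(axis=1), b.mean(axis=1)
    sd_a, sd_b = a.std(axis=1, ddof=1), b.std(axis=1, ddof=1)

    # Welch t-statistic per gene (assumes non-zero variance in each group).
    t_stat = (mean_a - mean_b) / np.sqrt(sd_a**2 / a.shape[1] + sd_b**2 / b.shape[1])
    # SNR per gene: difference of means over the sum of standard deviations.
    snr = (mean_a - mean_b) / (sd_a + sd_b)

    # Keep the top_k genes with the largest absolute t-statistic.
    order = np.argsort(-np.abs(t_stat))[:top_k]
    return order, t_stat, snr

# Toy data: 3 genes x 6 samples, values invented for illustration only.
toy_expr = [[1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
            [5.0, 5.1, 4.9, 5.0, 5.2, 4.8],
            [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]]
toy_labels = [0, 0, 0, 1, 1, 1]
top, t_stat, snr = rank_genes(toy_expr, toy_labels, top_k=2)
```

Note that `top_k` is exactly the kind of cut-off threshold that a rank-then-classify pipeline has to choose by hand.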
Proposed Approach
• Issues:
  • The assumption that genes are independent is inadequate.
  • Other genes may be collinearly expressed with the signature.
  • Selection and classification are two non-integrated steps, and a cut-off threshold is needed to select the highest-ranking genes.
• Proposed strategies:
  • Adopt a systems biology approach to infer the functional dependence among genes.
  • Use the dependence network for tissue discrimination.
  • Integrate gene selection and the classification model in a Bayesian network framework.
Data Representation by Bayesian Network
[Figure: expression data table (rows Gene 1 to Gene N, columns Case 1 to Case M, cases grouped into tissue state 1 and tissue state 2) mapped to network nodes Pheno, G1, G2, ..., GN.]
• Bayesian networks are directed acyclic graphs where:
  • Nodes correspond to random variables.
  • Directed arcs encode the conditional probabilities of the target nodes given the source nodes.
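As a reminder of what the arcs encode, the standard Bayesian network factorization can be written for the variables on this slide. The notation $\mathrm{Pa}(\cdot)$ for the set of parents of a node is introduced here for illustration and does not appear in the slides.

```latex
P(\mathrm{Pheno}, G_1, \dots, G_N)
  = P\bigl(\mathrm{Pheno} \mid \mathrm{Pa}(\mathrm{Pheno})\bigr)
    \prod_{i=1}^{N} P\bigl(G_i \mid \mathrm{Pa}(G_i)\bigr)
```

Each factor is the conditional distribution attached to one node, so a node without parents contributes its marginal distribution.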
Gene Selection by Bayes Factor
[Figure: gene selection by Bayes factor, shown on a network with nodes Pheno, G1, G2, ..., Gp, Gq, ..., GN.]
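The slide does not spell out how the Bayes factor is computed, so the following is only one plausible sketch. It assumes expression values have been discretized, uses a Dirichlet (BDeu) prior with an equivalent sample size `ess`, and compares a model in which the gene depends on the phenotype against one in which it does not. The function names and the toy vectors are hypothetical.

```python
from collections import Counter
from math import lgamma

def log_marginal_likelihood(child, parent_configs, n_child_states, ess=1.0):
    """Log marginal likelihood of a discrete child given its parent
    configuration per sample, under a BDeu Dirichlet prior (assumed)."""
    configs = sorted(set(parent_configs))
    q = len(configs)                      # observed parent configurations
    a_j = ess / q                         # prior mass per configuration
    a_jk = ess / (q * n_child_states)     # prior mass per (config, state)
    counts = Counter(zip(parent_configs, child))
    total = 0.0
    for j in configs:
        n_jk = [counts[(j, k)] for k in range(n_child_states)]
        total += lgamma(a_j) - lgamma(a_j + sum(n_jk))
        total += sum(lgamma(a_jk + n) - lgamma(a_jk) for n in n_jk)
    return total

def log_bayes_factor(gene, pheno, n_gene_states=2, ess=1.0):
    """log BF of 'gene depends on phenotype' versus 'gene is independent'."""
    dependent = log_marginal_likelihood(gene, pheno, n_gene_states, ess)
    independent = log_marginal_likelihood(gene, [0] * len(gene), n_gene_states, ess)
    return dependent - independent

# Toy binarized data, invented for illustration only.
pheno = [0, 0, 0, 0, 1, 1, 1, 1]
gene_a = [0, 0, 0, 0, 1, 1, 1, 1]   # tracks the phenotype
gene_b = [0, 1, 0, 1, 0, 1, 0, 1]   # unrelated to the phenotype
lbf_a = log_bayes_factor(gene_a, pheno)
lbf_b = log_bayes_factor(gene_b, pheno)
```

Under this sketch a gene would be kept when its log Bayes factor is positive, which replaces a hand-chosen rank cut-off with a model comparison.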
Collinearity Elimination via Network Learning
[Figure: collinearity elimination, shown on a network with nodes Pheno, G1, G2, Gp, Gq, and GN.]
Sample Classification
[Figure: Bayesian network over Pheno, G1, G2, Gp, Gq, and GN, with genes colored green or blue.]
• The phenotype variable is independent of the blue genes, given the green genes.
• Technically, the green genes form the Markov blanket of the phenotype variable, and they are the signature genes used for phenotype determination.
• Tissue classification:
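The Markov blanket of a node in a directed acyclic graph consists of its parents, its children, and the other parents of its children. A minimal sketch of reading it off a learned network is shown below. The function name `markov_blanket` and the example arcs are hypothetical and are not the network from the slides.

```python
def markov_blanket(parents, target):
    """Return the Markov blanket of `target` in a DAG.

    parents : dict mapping every node to the set of its parent nodes
    target  : node whose blanket is wanted
    """
    # Children are the nodes that list the target among their parents.
    children = {node for node, ps in parents.items() if target in ps}
    # Co-parents ("spouses") share at least one child with the target.
    spouses = {p for child in children for p in parents[child]} - {target}
    return set(parents.get(target, ())) | children | spouses

# Hypothetical arcs: Pheno -> Gp, Pheno -> Gq, G2 -> Gq, Gp -> G1, Gq -> GN.
example_dag = {
    "Pheno": set(),
    "Gp": {"Pheno"},
    "Gq": {"Pheno", "G2"},
    "G1": {"Gp"},
    "G2": set(),
    "GN": {"Gq"},
}
signature = markov_blanket(example_dag, "Pheno")
```

In this toy graph G1 and GN fall outside the blanket, so they carry no additional information about the phenotype once the blanket genes are observed.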
Algorithm Summary
[Flowchart: Gene Selection by Bayes Factor, Collinearity Elimination, Sample Classification; additional labels: Optimize Performance, Optimize Hyperparameters, (sensitivity analysis).]
Discriminate Lung Carcinoma Subtypes
• Adenocarcinoma (AC) and squamous cell carcinoma (SCC) are major subtypes of lung cancer:
  • AC and SCC differ in survival, chance of metastasis, and response to chemotherapy and targeted therapy.
  • Physicians lack confidence in recognizing the subtype correctly when multiple primary carcinomas are present.
• Training:
  • 58 ACs and 53 SCCs.
  • 77 genes selected in the network.
  • 25 signature genes.
Large-Scale Testing on Independent Samples
• 422 samples (232 ACs and 190 SCCs) aggregated from 7 cohorts (including Caucasian, African-American, and Chinese patients).
• Accuracy = 95.2% AUROC.
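Accuracy and AUROC are two different summaries of classifier performance, and a short sketch may help keep them apart. The functions below are a generic illustration, and the labels and scores are invented toy values, not results from the study.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of samples whose thresholded score matches the true label."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

def auroc(y_true, y_score):
    """Area under the ROC curve, computed as the probability that a random
    positive sample scores higher than a random negative one (ties count 0.5)."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
acc = accuracy(y_true, y_score)
auc = auroc(y_true, y_score)
```

Accuracy depends on the chosen threshold, whereas AUROC summarizes the ranking of scores across all thresholds.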
Comparison with Other Popular Methods
• Higher classification accuracy.
• Smaller signature, which helps avoid overfitting.
KRT6 Family Characterizes the Lung Carcinoma Discrimination
• Keratin-6 family genes (KRT6A, KRT6B, KRT6C) are important for distinguishing lung cancer subtypes.
• They account for 95% of the accuracy of the whole 25-gene signature.
• They are located on chromosome 12q12-q13.
• Their discriminative surface is nonlinear and concave.
Verification by Chr12q12-q13 Aberrations
• Investigate DNA copy number changes using comparative genomic hybridization (CGH) arrays.
• 12 ACs and 13 SCCs from Vrije University Medical Center, the Netherlands.
• Treat the average CGH values of genes occupying q12, q13, and q12-q13, respectively, as three features to construct a Naïve Bayes classifier.
• A dumbbell-shaped discriminative surface achieves 80% classification accuracy.
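A three-feature Naïve Bayes classifier of the kind described on this slide can be sketched as below. The slide does not state the per-feature likelihood, so a Gaussian likelihood is assumed here. The class name `GaussianNaiveBayes` and all numeric values are hypothetical toy inputs, not CGH measurements from the study.

```python
import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes with an independent Gaussian per feature and class (assumed)."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.prior_ = np.array([(y == c).mean() for c in self.classes_])
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Small constant keeps the variance strictly positive.
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-9
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Per-class log-likelihood: sum over features of log N(x | mean, var).
        diff = X[:, None, :] - self.mean_[None, :, :]
        log_lik = -0.5 * (np.log(2 * np.pi * self.var_)[None, :, :]
                          + diff**2 / self.var_[None, :, :]).sum(axis=2)
        log_post = np.log(self.prior_)[None, :] + log_lik
        return self.classes_[np.argmax(log_post, axis=1)]

# Toy rows: [mean q12, mean q13, mean q12-q13], values invented for illustration.
X_train = [[0.1, 0.0, 0.05], [0.2, 0.1, 0.15], [0.0, -0.1, -0.05],   # class 0 (AC)
           [0.8, 0.9, 0.85], [1.0, 0.7, 0.85], [0.9, 1.1, 1.00]]     # class 1 (SCC)
y_train = [0, 0, 0, 1, 1, 1]
model = GaussianNaiveBayes().fit(X_train, y_train)
predicted = model.predict([[0.1, 0.05, 0.08], [0.95, 0.9, 0.9]])
```

Because each class has its own per-feature variance, the resulting decision boundary is quadratic rather than linear, which is consistent with a curved discriminative surface.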
Conclusion
• Reverse-engineer regulatory network information for tissue classification.
• Adopt the systems biology approach to infer the gene dependency network.
  • Select genes by Bayes factor.
  • Eliminate collinearity via network learning.
• Integrate gene selection and the classification model in a single Bayesian network framework.
• Demonstrate the promising translational value of the systems biology approach in clinical studies.