190 likes | 208 Views
Analyze influence of network and data size on Bayesian network learning. Compare classifier accuracy. Data generation and structural learning using WEKA software. Analyze leukemia gene data for classification. Preprocess gene data, filter genes using mutual information. Goal to submit results by November 17.
E N D
Artificial Intelligence Project #3 : Diagnosis Using Bayesian Networks October 18, 2005
Goals of the Project • Analysis of the influence of network size and data size on structural learning of Bayesian networks • Six Bayesian networks of various sizes are given. • Generate data examples from each Bayesian network. • Learn Bayesian network structures from the generated data. • Analyze the learned results according to the network size and the data size. • Classification using Bayesian networks • A microarray dataset consisting of two classes of samples is given. • Learn Bayesian network classifiers from the dataset. • Compare the classification accuracy of Bayesian network classifiers with that of other classifiers such as neural networks. (c) 2005 SNU CSE Biointelligence Lab
Given Bayesian Networks • Randomly generated • # of variables: 10, 30, and 45 • All variables are binary • Network file format: *.dsc for MSBNX(http://research.microsoft.com/adapt/MSBNx/) (c) 2005 SNU CSE Biointelligence Lab
Example Bayesian Network Structure I (c) 2005 SNU CSE Biointelligence Lab
Example Bayesian Network Structure II (c) 2005 SNU CSE Biointelligence Lab
*.dsc Files Node name Possible states Parents Child Conditional probability distribution (c) 2005 SNU CSE Biointelligence Lab
Data Generation X1 X2 Sample X1 from P(X1) Sample X2 from P(X2) Sample X3 from P(X3| X1) Sample X4 from P(X4| X1, X2) Sample X5 from P(X5| X3) Sample X6 from P(X6| X4) X3 X4 X5 X6 (c) 2005 SNU CSE Biointelligence Lab
Data Generation Tool • data_generator • Usage: data_generator [network file style] [# of nodes] [# of data samples] [input file] [output file]... (c) 2005 SNU CSE Biointelligence Lab
Structural Learning of Bayesian Networks • Using WEKA software (http://www.cs.waikato.ac.nz/ml/weka/) (c) 2005 SNU CSE Biointelligence Lab
Learning Example Learned network structure The original network structure (c) 2005 SNU CSE Biointelligence Lab
Materials for the First One • Given • Bayesian networks • sf_10.dsc, sf_30.dsc, sf_45.dsc, md_10.dsc, md_30.dsc, md_45.dsc • Data generation tool • data_generator.exe [for Windows], data_generator [for Linux] • Downloadable • MSBNX (http://research.microsoft.com/adapt/MSBNx/) • WEKA (http://www.cs.waikato.ac.nz/ml/weka/) • You should write your own code for comparing Bayesian network structures. (c) 2005 SNU CSE Biointelligence Lab
60 leukemia patients Bone marrow samples Affymetrix GeneChip arrays Gene expression data Study • Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells, MH Cheok et al., Nature Genetics 35, 2003. (c) 2005 SNU CSE Biointelligence Lab
Gene Expression Data • # of data examples • 120 (60: before treatment, 60: after treatment) • # of genes measured • 12600 (Affymetrix HG-U95A array) • Task • Classification between “before treatment” and “after treatment” based on gene expression pattern (c) 2005 SNU CSE Biointelligence Lab
Affymetrix GeneChip Arrays • Use short oligos to detect gene expression level. • Each gene is probed by a set of short oligos. • Each gene expression level is summarized by • Signal: numerical value describing the abundance of mRNA • A/P call: denotes the statistical significance of signal (c) 2005 SNU CSE Biointelligence Lab
Preprocessing • Remove the genes having more than 60 ‘A’ calls • # of genes: 12600 3190 • Discretization of gene expression level • Criterion: median gene expression value of each sample • 0 (low) and 1 (high) (c) 2005 SNU CSE Biointelligence Lab
Gene Filtering • Using mutual information • Estimated probabilities were used. • # of genes: 3190 1000 • Final dataset • # of attributes: 1001 (one for the class) • Class: 0 (after treatment), 1 (before treatment) • # of data examples: 120 (c) 2005 SNU CSE Biointelligence Lab
Final Dataset 1000 120 (c) 2005 SNU CSE Biointelligence Lab
Materials for the Second One • Given • Preprocessed microarray data file: data2.txt • Downloadable • WEKA (http://www.cs.waikato.ac.nz/ml/weka/) (c) 2005 SNU CSE Biointelligence Lab
Due: November 17, 2005 • Analysis of the influence of network size and data size on structural learning of Bayesian networks • Six Bayesian networks of various sizes are given. • Generate data examples from each Bayesian network. • Learn Bayesian network structures from the generated data. • Analyze the learned results according to the network size and the data size. • Classification using Bayesian networks • A microarray dataset consisting of two classes of samples is given. • Learn Bayesian network classifiers from the dataset. • Compare the classification accuracy of Bayesian network classifiers with that of other classifiers such as neural networks. • Submission • Kyu-BaekHwang (301-419) • 1 hard copy. (c) 2005 SNU CSE Biointelligence Lab