Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays
J. Tobler, M. Molla, J. Shavlik (University of Wisconsin-Madison)
M. Molla, E. Nuwaysir, R. Green (NimbleGen Systems Inc.)
Oligonucleotide Microarrays
[Figure: probes synthesized on the chip surface]
• Specific probes synthesized at known spots on the chip's surface
• Probes complementary to the RNA of the genes to be measured
• Typical gene (1 kb+) is MUCH longer than a typical probe (24 bases)
Probes: Good vs. Bad
[Figure: hybridization diagrams of a good probe and a bad probe; blue = probe, red = sample]
Probe-Picking Method Needed
• Hybridization characteristics differ between probes
• A probe set represents a very small subset of the gene
• Accurate measurement of expression requires a good probe set
Related Work
• Use known hybridization characteristics (Lockhart et al. 1996)
• Melting point (Tm) predictions (Kurata and Suyama 1999; Li and Stormo 2001)
• Stable secondary structure (Kurata and Suyama 1999)
Our Approach
• Apply established machine-learning algorithms
• Train on categorized examples
• Test on examples with the category hidden
• Choose features to represent probes
• Categorize probes as good or bad
The Features

Feature Name                                Description
fracA, fracC, fracG, fracT                  The fraction of A, C, G, or T in the 24-mer
fracAA, fracAC, fracAG, fracAT, fracCA,     The fraction of each of the 16 possible dimers in the 24-mer
fracCC, fracCG, fracCT, fracGA, fracGC,
fracGG, fracGT, fracTA, fracTC, fracTG,
fracTT
n1, n2, ..., n24                            The particular nucleotide (A, C, G, or T) at the specified position in the 24-mer
d1, d2, ..., d23                            The particular dimer (AA, AC, ..., TT) at the specified position in the 24-mer
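A minimal sketch (not the authors' code) of how these features could be computed for a single probe; the function name probe_features and the dictionary representation are illustrative assumptions:

```python
from itertools import product

BASES = "ACGT"
DIMERS = ["".join(p) for p in product(BASES, repeat=2)]  # AA, AC, ..., TT

def probe_features(probe):
    """Compute the composition and position features described in the table for one 24-mer."""
    assert len(probe) == 24
    features = {}
    # Single-base composition: fracA, fracC, fracG, fracT
    for b in BASES:
        features["frac" + b] = probe.count(b) / len(probe)
    # Dimer composition: fracAA ... fracTT (23 overlapping dimers per 24-mer)
    dimers = [probe[i:i + 2] for i in range(len(probe) - 1)]
    for d in DIMERS:
        features["frac" + d] = dimers.count(d) / len(dimers)
    # Positional features: n1..n24 (base at each position), d1..d23 (dimer at each position)
    for i, b in enumerate(probe, start=1):
        features[f"n{i}"] = b
    for i, d in enumerate(dimers, start=1):
        features[f"d{i}"] = d
    return features
```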
The Data
• Tilings of 8 genes (from E. coli and B. subtilis)
• Every possible probe (~10,000 probes)
• Genes known to be expressed in the sample

Gene Sequence: GTAGCTAGCATTAGCATGGCCAGTCATG…
Complement:    CATCGATCGTAATCGTACCGGTCAGTAC…
Probe 1:       CATCGATCGTAATCGTACCGGTCA
Probe 2:       ATCGATCGTAATCGTACCGGTCAG
Probe 3:       TCGATCGTAATCGTACCGGTCAGT
…
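A short sketch of the tiling illustrated above: every possible 24-mer probe is a sliding window along the complement of the gene sequence (the function name tile_probes is an illustrative assumption):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def tile_probes(gene_sequence, probe_length=24):
    """Enumerate every possible probe by sliding a window along the gene's complement."""
    comp = gene_sequence.upper().translate(COMPLEMENT)
    return [comp[i:i + probe_length]
            for i in range(len(comp) - probe_length + 1)]

# Reproduces the example on this slide: successive probes shift by one base.
probes = tile_probes("GTAGCTAGCATTAGCATGGCCAGTCATG")
print(probes[0], probes[1], probes[2], sep="\n")
```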
Defining our Categories
• Low intensity = BAD probes (45%)
• Mid intensity = not used in the training set (23%)
• High intensity = GOOD probes (32%)
[Figure: histogram of frequency vs. normalized probe intensity (0 to 1.0), with category boundaries near 0.05 and 0.15]
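A tiny labeling sketch for turning normalized intensities into training categories; the cutoff values 0.05 and 0.15 are read off the histogram's axis and should be treated as assumptions, not the authors' exact thresholds:

```python
def label_probe(normalized_intensity, low_cut=0.05, high_cut=0.15):
    """Map a normalized intensity to a training label; mid-range probes are dropped."""
    if normalized_intensity < low_cut:
        return "bad"       # low intensity
    if normalized_intensity >= high_cut:
        return "good"      # high intensity
    return None            # mid intensity: not used in the training set
```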
The Machine Learning Techniques
• Naïve Bayes (Mitchell 1997)
• Neural Networks (Rumelhart et al. 1995)
• Decision Trees (Quinlan 1996)
• Can interpret the predictions of each learner probabilistically
Naïve Bayes
• Assumes conditional independence between features
• Makes judgments about test-set examples based on conditional-probability estimates made on the training set
Naïve Bayes
For each example in the test set, evaluate the posterior odds of the two categories under the independence assumption:

$\frac{P(\text{good} \mid f_1, \dots, f_n)}{P(\text{bad} \mid f_1, \dots, f_n)} = \frac{P(\text{good}) \prod_i P(f_i \mid \text{good})}{P(\text{bad}) \prod_i P(f_i \mid \text{bad})}$

where each probability is estimated from the training set.
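A minimal sketch of this computation, assuming examples are categorical feature dictionaries (like those from the probe_features sketch above) and labels are "good"/"bad"; the function names and the Laplace smoothing are illustrative assumptions rather than the authors' implementation:

```python
from collections import Counter, defaultdict
from math import log

def train_naive_bayes(examples, labels):
    """Count label frequencies and per-label feature-value frequencies on the training set."""
    priors = Counter(labels)
    cond = defaultdict(Counter)               # (label, feature) -> counts of observed values
    for x, y in zip(examples, labels):
        for f, v in x.items():
            cond[(y, f)][v] += 1
    return priors, cond

def log_odds_good(x, priors, cond, smoothing=1.0):
    """Log of P(good|x)/P(bad|x) under the conditional-independence assumption."""
    score = log(priors["good"]) - log(priors["bad"])
    for f, v in x.items():
        for y, sign in (("good", +1), ("bad", -1)):
            counts = cond[(y, f)]
            # Laplace-smoothed estimate of P(feature f = v | label y)
            p = (counts[v] + smoothing) / (sum(counts.values()) + smoothing * (len(counts) + 1))
            score += sign * log(p)
    return score                              # > 0 means "good" is the more probable category
```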
Neural Network (1-of-n encoding with probe length = 3)
[Figure: feed-forward network; each base position has four input units (A, C, G, T), e.g. A1 C1 G1 T1, A2 C2 G2 T2, A3 C3 G3 T3 for the example probe "CAG"; weighted connections lead to a single Good-or-Bad output; activation flows forward and error is propagated backward during training]
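A small sketch of the 1-of-n encoding and a one-hidden-layer network trained by backpropagation; the hidden-layer size, learning rate, and function names are illustrative assumptions, not the authors' configuration:

```python
import numpy as np

BASES = "ACGT"

def one_hot(probe):
    """1-of-n encoding: four input units (A, C, G, T) per base position."""
    vec = np.zeros(4 * len(probe))
    for i, b in enumerate(probe):
        vec[4 * i + BASES.index(b)] = 1.0
    return vec

def train_network(probes, labels, hidden=10, lr=0.1, epochs=200, seed=0):
    """Tiny one-hidden-layer network trained with gradient descent (backprop)."""
    rng = np.random.default_rng(seed)
    X = np.array([one_hot(p) for p in probes])
    y = np.array(labels, dtype=float)            # 1 = good probe, 0 = bad probe
    W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    W2 = rng.normal(scale=0.1, size=hidden)
    for _ in range(epochs):
        h = np.tanh(X @ W1)                      # forward pass: hidden activations
        p = 1 / (1 + np.exp(-(h @ W2)))          # predicted P(good)
        err = p - y                              # output error
        W2 -= lr * h.T @ err / len(y)
        dh = np.outer(err, W2) * (1 - h ** 2)    # error propagated back to the hidden layer
        W1 -= lr * X.T @ dh / len(y)
    return W1, W2

def predict(probe, W1, W2):
    h = np.tanh(one_hot(probe) @ W1)
    return 1 / (1 + np.exp(-(h @ W2)))           # probability the probe is good
```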
Decision Tree
Automatically builds a tree of rules
[Figure: example tree; internal nodes test features such as fracC, fracG, fracT, fracTC, fracAC, and n14 (splitting on Low/High fractions or on the base A, C, G, or T), and leaves label a probe as Good Probe or Bad Probe]
Decision Tree
The information gain of a feature, F, on a set of examples, S, is:

$\mathrm{Gain}(S, F) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(F)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$

where $S_v$ is the subset of S on which feature F has value v, and $\mathrm{Entropy}(S) = -\sum_c p_c \log_2 p_c$ over the categories c (good, bad).
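A direct sketch of this formula for categorical features; the function names are illustrative:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels (good/bad)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(examples, labels, feature):
    """Gain(S, F) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)."""
    total = len(labels)
    by_value = {}
    for x, y in zip(examples, labels):
        by_value.setdefault(x[feature], []).append(y)   # partition S by the feature's value
    remainder = sum(len(ys) / total * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder
```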
Information Gain per Feature
[Figure: bar charts of normalized information gain for the probe-composition features, the base-position features (by base position), and the dimer-position features (by dimer position)]
Cross-Validation
Leave-one-out testing:
• For each gene (of the 8):
  • Train on all but this gene
  • Test on this gene
  • Record the result
  • Forget what was learned
• Average the results across the 8 test genes
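A sketch of this leave-one-gene-out loop; train_fn and test_fn are illustrative placeholders for fitting a learner and scoring it on the held-out gene:

```python
def leave_one_gene_out(genes, train_fn, test_fn):
    """For each gene: train on the other genes, test on the held-out gene, then forget."""
    results = []
    for i, held_out in enumerate(genes):
        training_genes = genes[:i] + genes[i + 1:]   # all but this gene
        model = train_fn(training_genes)             # learn from scratch each time
        results.append(test_fn(model, held_out))     # test on the held-out gene
    return sum(results) / len(results)               # average across the test genes
```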
Typical Probe-Intensity Prediction Across Short Region
[Figure: actual normalized probe intensity and the Neural Network, Naïve Bayes, and Decision Tree predictions plotted against the starting nucleotide position of each 24-mer probe]
Probe-Picking Results
[Figure: number of probes selected with intensity >= 90th percentile versus number of probes selected, for a Perfect Selector, the Neural Network, Naïve Bayes, the Decision Tree, and a Primer Melting Point baseline]
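One plausible way to produce such a curve, sketched here as an assumption about the evaluation rather than the authors' exact procedure: rank probes by the learner's predicted probability of being good, then for each number of probes selected, count how many actually have intensity at or above the 90th percentile:

```python
import numpy as np

def picking_curve(scores, intensities):
    """For each k, count how many of the k highest-scoring probes have
    intensity >= the 90th percentile of all probe intensities."""
    threshold = np.percentile(intensities, 90)
    order = np.argsort(scores)[::-1]              # best-scoring probes first
    hits = (np.asarray(intensities)[order] >= threshold).astype(int)
    return np.cumsum(hits)                        # value at index k-1: hits among the top k
```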
Current and Future Directions
• Consider more features
  • Folding patterns
  • Melting point
• Feature selection
• Evaluate specificity along with sensitivity (i.e., consider false positives)
• Evaluate probe selection + gene calling
• Try more ML techniques (SVMs, ensembles, …)
Take-Home Message
• Machine learning does a good job on this part of the probe-selection problem
  • Easy to collect a large number of training examples
  • Easily measured features work well
• Intelligent probe selection can increase microarray accuracy and efficiency
Acknowledgements
• NimbleGen Systems, Inc. for providing the intensities from the eight tiled genes measured on their maskless array
• Darryl Roy for helping create the training data
• Grants NIH 2 R44 HG02193-02, NLM 1 R01 LM07050-01, NSF IRI-9502990, NIH 2 P30 CA14520-29, and NIH 5 T32 GM08349