Gene Feature Recognition
Limsoon Wong
For written notes on this lecture, please read Chapters 4 and 7 of The Practical Bioinformatician.
Recognition of Splice Sites
A simple example to start the day
Splice Sites
[Figure: pre-mRNA schematic marking the donor site at the 5' end of the intron and the acceptor site at the 3' end]
Acceptor Site (Human Genome) (image credit: Xu) • If we align all known acceptor sites (with their splice junction aligned), we get the following nucleotide distribution • Acceptor site: the intron ends in CAG or TAG, immediately followed by the coding region (CAG/TAG | coding region)
Donor Site (Human Genome) (image credit: Xu) • If we align all known donor sites (with their splice junction aligned), we get the following nucleotide distribution • Donor site: coding region | GT (the intron begins with GT)
What Positions Have “High” Information Content? • For a weight matrix, the information content of each column is calculated as Σ_{X ∈ {A,C,G,T}} Prob(X) · log(Prob(X)/0.25) • When a column has evenly distributed nucleotides, its information content is at its lowest (zero) • We only need to look at positions having high information content
Information Content Around Donor Sites in the Human Genome (image credit: Xu) • Information content (log base 10): • column –3 = .34·log(.34/.25) + .363·log(.363/.25) + .183·log(.183/.25) + .114·log(.114/.25) = 0.04 • column –1 = .092·log(.092/.25) + .033·log(.033/.25) + .803·log(.803/.25) + .073·log(.073/.25) = 0.30
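To make the arithmetic concrete, here is a minimal Python sketch of the column computation. Base-10 logarithms are assumed, since that is what reproduces the 0.04 and 0.30 values above; the column distributions are read off the donor-site figure.

```python
import math

def column_information(probs, background=0.25):
    """Information content of one weight-matrix column relative to a
    uniform background, using base-10 logs as in the worked example."""
    return sum(p * math.log10(p / background) for p in probs if p > 0)

# Nucleotide distributions of columns -3 and -1 around the donor site
print(f"{column_information([0.34, 0.363, 0.183, 0.114]):.2f}")   # 0.04
print(f"{column_information([0.092, 0.033, 0.803, 0.073]):.2f}")  # 0.30
```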
Weight Matrix Model for Splice Sites (image credit: Xu) • Weight matrix model • build a weight matrix for the donor site, the acceptor site, and the translation start site, respectively • use only the positions of high information content
Splice Site Prediction: A Procedure (image credit: Xu) • Add up the frequencies of the corresponding letters at the corresponding positions: AAGGTAAGT: .34 + .60 + .80 + 1.0 + 1.0 + .52 + .71 + .81 + .46 = 6.24 TGTGTCTCA: .11 + .12 + .03 + 1.0 + 1.0 + .02 + .07 + .05 + .16 = 2.56 • Predict a splice site when the sum exceeds some threshold (see the sketch below)
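A minimal sketch of this procedure in Python. The weight matrix below is partial: only the entries exercised by the two worked examples are taken from the slide (a real donor-site matrix would list all four bases at every position), and any threshold between 2.56 and 6.24 is an illustrative choice.

```python
# Partial donor-site weight matrix; unknown entries default to 0.0.
pwm = [
    {"A": 0.34, "T": 0.11}, {"A": 0.60, "G": 0.12}, {"G": 0.80, "T": 0.03},
    {"G": 1.00}, {"T": 1.00}, {"A": 0.52, "C": 0.02},
    {"A": 0.71, "T": 0.07}, {"G": 0.81, "C": 0.05}, {"T": 0.46, "A": 0.16},
]

def score_site(candidate, pwm):
    """Add up the frequency of the corresponding letter at each position."""
    return sum(col.get(base, 0.0) for col, base in zip(pwm, candidate))

print(score_site("AAGGTAAGT", pwm))  # ~6.24: clears the threshold, predicted donor site
print(score_site("TGTGTCTCA", pwm))  # ~2.56: below the threshold, rejected
```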
Recognition of Translation Initiation Sites
An introduction to the world's simplest TIS recognition system: a simple approach to accuracy and understandability
A Sample cDNA • What makes the second ATG the TIS?
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
• In the annotation track, each "." marks an untranslated base, "i" marks the TIS, and each "E" marks a base of the coding region. The first ATG lies in the untranslated region; the second ATG is the true TIS.
Approach • Training data gathering • Signal generation • k-grams, distance, domain know-how, ... • Signal selection • Entropy, χ², CFS, t-test, domain know-how, ... • Signal integration • SVM, ANN, PCL, CART, C4.5, kNN, ...
Training & Testing Data • Vertebrate dataset of Pedersen & Nielsen [ISMB'97] • 3312 sequences • 13503 ATG sites • 3312 (24.5%) are TIS • 10191 (75.5%) are non-TIS • Used for 3-fold cross-validation experiments
Signal Generation • k-grams (i.e., k consecutive letters) • k = 1, 2, 3, 4, 5, ... • Window size vs. fixed position • Upstream vs. downstream vs. anywhere in the window • In-frame vs. any frame
Signal Generation: An Example
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
• Window = 100 bases
• In-frame, downstream: GCT = 1, TTT = 1, ATG = 1, ...
• Any-frame, downstream: GCT = 3, TTT = 2, ATG = 2, ...
• In-frame, upstream: GCT = 2, TTT = 0, ATG = 0, ...
(A counting sketch follows below.)
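A minimal counting sketch, assuming a toy fragment in place of the full cDNA; the 100-base window and the in-frame/any-frame distinction follow the slide, while the sequence and function names are illustrative.

```python
from collections import Counter

def kgram_counts(window, k=3, in_frame_only=True):
    """Count k-grams in a window; 'in-frame' restricts start positions
    to the codon frame set by the candidate ATG (step of 3)."""
    step = 3 if in_frame_only else 1
    return Counter(window[i:i + k] for i in range(0, len(window) - k + 1, step))

seq = "CCATGGCTGAACACTGACTCC"        # toy fragment with a candidate ATG
atg = seq.find("ATG")
downstream = seq[atg:atg + 100]      # window of up to 100 bases downstream
print(kgram_counts(downstream, in_frame_only=True))   # in-frame, downstream
print(kgram_counts(downstream, in_frame_only=False))  # any-frame, downstream
```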
Too Many Signals • For each value of k, there are 4^k * 3 * 2 k-grams • If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features! • This is too many for most machine learning algorithms
Signal Selection (Basic Idea) • Choose a signal w/ low intra-class distance • Choose a signal w/ high inter-class distance
Signal Selection (e.g., CFS) • Instead of scoring individual signals, how about scoring a group of signals as a whole? • CFS: Correlation-based Feature Selection • A good group contains signals that are highly correlated with the class, yet uncorrelated with each other
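CFS scores a candidate group with Hall's merit heuristic, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the average signal-class correlation and r_ff the average signal-signal correlation across a k-signal group. A minimal sketch, with made-up correlation values:

```python
import math

def cfs_merit(avg_signal_class_corr, avg_signal_signal_corr, k):
    """Merit of a k-signal group: rewards correlation with the class,
    penalises redundancy among the signals themselves."""
    return (k * avg_signal_class_corr) / math.sqrt(
        k + k * (k - 1) * avg_signal_signal_corr)

# Ten signals correlated with the class (0.4) but not each other (0.1)
# beat an equally class-correlated but redundant group (0.5).
print(cfs_merit(0.4, 0.1, 10))  # ~0.92
print(cfs_merit(0.4, 0.5, 10))  # ~0.54
```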
Sample k-grams Selected by CFS • Position –3 (the Kozak consensus) • In-frame upstream ATG (leaky scanning) • In-frame downstream TAA, TAG, TGA (stop codons: none should appear in-frame soon after the true TIS) • In-frame downstream CTG, GAC, GAG, GCC (codon bias?)
Signal Integration • kNN: given a test sample, find the k training samples most similar to it, and let the majority class win • SVM: given training samples from two classes, determine a separating plane that maximises the margin between the classes • Naïve Bayes, ANN, C4.5, ...
Illustration of kNN (k = 8), image credit: Zaki
[Figure: a test sample whose 8-nearest neighbourhood contains 5 samples of one class and 3 of the other, so the majority class wins; a typical "distance" measure is also shown]
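A minimal kNN sketch matching the description above. Euclidean distance stands in for the "typical distance measure" in the figure, and the labelled feature vectors (e.g., counts of selected k-grams) are hypothetical.

```python
import math
from collections import Counter

def knn_predict(test, train, k=8):
    """Majority vote among the k training samples nearest to the test
    sample; train is a list of (feature_vector, label) pairs."""
    neighbours = sorted(train, key=lambda s: math.dist(s[0], test))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [([3, 0, 1], "TIS"), ([0, 2, 4], "non-TIS"), ([2, 1, 1], "TIS"),
         ([1, 3, 5], "non-TIS"), ([3, 1, 0], "TIS"), ([0, 4, 3], "non-TIS")]
print(knn_predict([2, 1, 0], train, k=3))  # "TIS"
```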
Results (3-fold cross-validation)
[Table of classifier accuracies omitted]
* Using top 20 χ²-selected features from amino-acid features
Validation Results (on Chr X and Chr 21)
[Figure comparing our method against ATGpr omitted]
* Our method: top 100 features selected by entropy, trained on the Pedersen & Nielsen dataset
Technique Comparisons
• Pedersen & Nielsen [ISMB'97]: 85% accuracy; neural network; no explicit features
• Zien [Bioinformatics'00]: 88% accuracy; SVM + kernel engineering; no explicit features
• Hatzigeorgiou [Bioinformatics'02]: 94% accuracy (with scanning rule); multiple neural networks; no explicit features
• Our approach: 89% accuracy (94% with scanning rule); explicit feature generation; explicit feature selection; works with any machine learning method without any form of complicated tuning
Recognition of Transcription Start Sites
An introduction to the world's best TSS recognition system: a heavily tuned approach
Structure of Dragon Promoter Finder • Window of –200 to +50 around each candidate position • Model selected based on desired sensitivity
Each model has two submodels based on GC content, measured as (C+G) = (#C + #G) / window size: windows with high (C+G) are handled by the GC-rich submodel, the rest by the GC-poor submodel (see the dispatch sketch below)
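A sketch of the submodel dispatch. The 0.5 cutoff on the (C+G) fraction is an assumed illustration; Dragon's actual boundary between the submodels is a tuned value not given on the slide.

```python
def pick_submodel(window, cutoff=0.5):
    """Compute the (C+G) fraction of the window and route the input
    to the GC-rich or GC-poor submodel; cutoff is illustrative."""
    cg = (window.count("C") + window.count("G")) / len(window)
    return ("GC-rich" if cg >= cutoff else "GC-poor"), cg

print(pick_submodel("GCGCGCATGCGGCCGC"))  # ('GC-rich', 0.875)
```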
Data Analysis Within Submodel • Three sensors: s_p (promoter), s_e (exon), and s_i (intron) • Each sensor is a k-gram (k = 5) positional weight matrix
Promoter, Exon, Intron Sensors • These sensors are positional weight matrices of k-grams, k = 5 (aka pentamers) • Each is calculated as the score s below, using promoter, exon, and intron data respectively: for a window of size w, s = Σ_i f(i, j), summing over the window positions i, where j identifies the pentamer at the i-th position of the input and f(i, j) is the frequency of the j-th pentamer at the i-th position in the training windows
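A sketch of one sensor under the definitions above: sum, over the positions of the input window, the training frequency of the pentamer observed at each position. The tiny frequency table is hypothetical; the real tables are estimated from promoter, exon, and intron training windows respectively.

```python
def sensor_score(window, freq, k=5):
    """Positional weight-matrix sensor: freq[i] maps pentamers to their
    frequency at position i in the training windows."""
    return sum(freq[i].get(window[i:i + k], 0.0)
               for i in range(len(window) - k + 1))

# Hypothetical two-position frequency table for illustration
freq = [{"TATAA": 0.30, "GCGCG": 0.02},
        {"ATAAA": 0.25, "CGCGC": 0.03}]
print(sensor_score("TATAAA", freq))  # 0.30 + 0.25 = 0.55
```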
Data Preprocessing & ANN • Each sensor score s_i is weighted by a tuning parameter w_i and summed: net = Σ_i s_i · w_i • The sum is squashed through tanh(net), where tanh(x) = (e^x – e^(–x)) / (e^x + e^(–x)), and compared against a tuned threshold • The network is a simple feedforward ANN trained by the Bayesian regularisation method
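A sketch of the integration step; the weights and threshold below are illustrative stand-ins for Dragon's tuned values.

```python
import math

def predict_tss(sensor_scores, weights, threshold):
    """Weighted sum of sensor outputs squashed through tanh, where
    tanh(x) = (e^x - e^-x) / (e^x + e^-x); predict a TSS when the
    activation clears the tuned threshold."""
    net = sum(s * w for s, w in zip(sensor_scores, weights))
    return math.tanh(net) >= threshold

# Hypothetical sensor scores (promoter, exon, intron) and weights
print(predict_tss([0.8, 0.1, 0.2], [1.0, -0.5, -0.5], threshold=0.5))  # True
```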
Accuracy Comparisons
[Figure comparing models with and without the (C+G) submodels omitted]
References (TIS Recognition) • A. G. Pedersen, H. Nielsen, “Neural network prediction of translation initiation sites in eukaryotes”, ISMB 5:226--233, 1997 • H. Liu, L. Wong, “Data Mining Tools for Biological Sequences”, Journal of Bioinformatics and Computational Biology, 1(1):139--168, 2003 • A. Zien et al., “Engineering support vector machine kernels that recognize translation initiation sites”, Bioinformatics 16:799--807, 2000 • A. G. Hatzigeorgiou, “Translation initiation start prediction in human cDNAs with high accuracy”, Bioinformatics 18:343--350, 2002
References (TSS Recognition) • V. B. Bajic et al., “Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates”, J. Mol. Graph. & Mod. 21:323--332, 2003 • J. W. Fickett, A. G. Hatzigeorgiou, “Eukaryotic promoter recognition”, Genome Res. 7:861--878, 1997 • A. G. Pedersen et al., “The biology of eukaryotic promoter prediction---a review”, Computers & Chemistry 23:191--207, 1999 • M. Scherf et al., “Highly specific localisation of promoter regions in large genome sequences by PromoterInspector”, JMB 297:599--606, 2000