Splice Site Recognition in DNA Sequences Using K-mer Frequency Based Mapping for Support Vector Machine with Power Series Kernel Dr. Robertas Damaševičius Software Engineering Department, Kaunas University of Technology Studentų 50-415, Kaunas, Lithuania robertas.damasevicius@ktu.lt
What is splicing? • Splicing: modification of genetic information after transcription, in which introns are removed and exons are joined • Splice junctions: boundary points between exons and introns where splicing occurs • Donor: upstream part of intron, conserved dinucleotide GT • Acceptor: downstream part of intron, conserved dinucleotide AG • Pseudo splice-sites Int. Workshop on Intelligent Informatics in Biology and Medicine (IIBM’2008), March 4-7, 2008, Barcelona, Spain
Problem • Splice-junction site recognition • Important for successful gene prediction • Study of genetic diseases • Understanding of genetic mechanisms • Difficulties • Noisy data • Pseudo splice sites • Non-canonical splice sites (intron is not GT...AG) • Alternative splicing • Multitude of consensus sequences • Machine Learning: Support Vector Machine (SVM) • Feature space mapping for SVM • Which frequency-based feature mapping is the best?
Support Vector Machine (SVM) $f(x) = \operatorname{sign}\left(\sum_i \alpha_i y_i K(x_i, x) + b\right)$, where $x_i$ are training data vectors, $x$ are unknown data vectors, $y_i \in \{-1, +1\}$ is the target space, and $K(x_i, x)$ is the kernel function.
What factors influence quality of classification? • Training data • size of dataset, generation of negative examples, imbalanced datasets • Mapping of data into feature space • Orthogonal, single nucleotide, nucleotide grouping, ... • Selection of an optimal kernel function • linear, polynomial, RBF, sigmoid • Kernel parameters • SVM learning parameters • Regularization parameter, Cost factor
SVM feature space • Feature space: multidimensional vector representing data instances • Mapping of data into features: achieving better classification accuracy • Feature space construction: • nucleotide position-dependent • nucleotide position-independent • both nucleotide position-dependent and -independent information • Feature mapping rule: • N – the length of a DNA sequence, M – the length of the feature vector
K-mers • K-mer: a k-base long sequence (k-tuple) of DNA • K-mer feature vector: constructed using a frequency (or probability) of each k-mer in a DNA sequence • Σ – alphabet, N – length of a DNA sequence, k – length of k-mer, n_j – number of occurrences of the j-th k-mer in a DNA sequence
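The formula on this slide did not survive extraction, so the following is only a minimal sketch of a k-mer frequency vector. It assumes overlapping windows and the normalization n_j / (N - k + 1); the function name `kmer_frequencies` and the choice to skip windows containing symbols outside the alphabet are illustrative assumptions, not the author's implementation.

```python
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Return one frequency per possible k-mer over `alphabet`.

    Assumption: frequency = n_j / (N - k + 1), counting overlapping windows.
    """
    total = len(seq) - k + 1  # number of overlapping k-length windows
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    # Enumerate all |alphabet|**k k-mers in a fixed (lexicographic) order.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # ignore windows with unknown symbols
            counts[window] += 1
    return [counts[m] / total for m in kmers]


# Hypothetical toy sequence: windows are AC, CG, GT, TA, AC.
vec = kmer_frequencies("ACGTAC", 2)
```

For this toy input the vector has 16 entries (4^2), and the entry for "AC" is 2/5 because that 2-mer appears in two of the five windows.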
K-mer frequency mapping rules • 4-letter (ACGT): Σ = {A, C, G, T}, ||Σ|| = 4 • Disadvantage: feature space growth ~ 4^k • Nucleotide grouping based: SW, KM & RY • SW: Σ = {S, W}, ||Σ|| = 2 • Strong (C, G) nucleotides – 3 H bonds • Weak (A, T) nucleotides – 2 H bonds • RY: Σ = {R, Y}, ||Σ|| = 2 • A and G – purines (R) • C and T – pyrimidines (Y) • KM: Σ = {K, M}, ||Σ|| = 2 • A and C – amines (M) • G and T – ketones (K)
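The three grouping rules above amount to a character-by-character translation of the sequence into a 2-letter alphabet. A minimal sketch, with `GROUPINGS` and `regroup` as hypothetical names chosen for illustration:

```python
# Translation tables for the three nucleotide grouping rules.
GROUPINGS = {
    "SW": str.maketrans("CGAT", "SSWW"),  # strong (C, G) vs. weak (A, T)
    "RY": str.maketrans("AGCT", "RRYY"),  # purines (A, G) vs. pyrimidines (C, T)
    "KM": str.maketrans("GTAC", "KKMM"),  # ketones (G, T) vs. amines (A, C)
}


def regroup(seq, rule):
    """Rewrite a DNA sequence in the 2-letter alphabet named by `rule`."""
    return seq.upper().translate(GROUPINGS[rule])


# Hypothetical toy sequence.
print(regroup("GATTACA", "SW"))  # SWWWWSW
print(regroup("GATTACA", "RY"))  # RRYYRYR
print(regroup("GATTACA", "KM"))  # KMKKMMM
```

After regrouping, k-mer frequencies can be counted over the 2-letter alphabet in the same way as over ACGT.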
Example: 2-mer frequency mapping
Case study • Dataset: UCI repository, Genbank 64.1 primate data • 3175 sequences, each (-30 bp, +30 bp) with regard to splice site • Three splice site recognition sub-problems: • Exon/Intron (EI) vs. Negative (N) • Intron/Exon (IE) vs. Negative (N) • Exon/Intron (EI) vs. Intron/Exon (IE) • Three datasets: • EI vs. N: 767 EI and 1655 N • IE vs. N: 768 IE and 1655 N • EI vs. IE: 767 EI and 768 IE • Power series kernel • Accuracy evaluation metric: F-measure
Classification results: Exon/Intron vs. Negative
Classification results: Intron/Exon vs. Negative
Classification results: Intron/Exon vs. Exon/Intron
Classification time
Feature vector size Intron/exon splice sites, 2422 sequences
Evaluation of results • Classification accuracy: • Exon/Intron vs. N. – 4-mer ACGT frequency mapping (78.05%) • Intron/Exon vs. N. – 6-mer ACGT frequency mapping (70.75%) • E/I vs. I/E – 6-mer ACGT frequency mapping (90.59%) • 4-mers and 6-mers better than 5-mers • RY always better than SW or KM • Feature space size: • ACGT k-mer: 4^k • SW, RY, KM k-mer: 2^k • Classification speed: • SW/KM/RY k-mer frequency based classification can be ~ 2 times faster than ACGT k-mer classification
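To make the feature space sizes concrete, the short sketch below tabulates 4^k against 2^k for small k. The helper name `feature_space_size` is an illustrative assumption.

```python
def feature_space_size(alphabet_size, k):
    """Number of distinct k-mers, i.e. the length of the frequency vector."""
    return alphabet_size ** k


# Compare the 4-letter alphabet with the 2-letter groupings (SW, RY, KM).
for k in range(1, 7):
    acgt = feature_space_size(4, k)
    grouped = feature_space_size(2, k)
    print(f"k={k}: ACGT={acgt:5d}  SW/RY/KM={grouped:3d}")
```

At k = 6 the ACGT mapping needs 4096 features while a grouped mapping needs 64, which is consistent with the grouped mappings being cheaper to classify.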
Why is RY better than SW or KM? • Acceptor consensus sequence has long runs of pyrimidines (Y)
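The effect can be illustrated on a made-up acceptor-like fragment: a pyrimidine-rich stretch followed by AG. The sequence below is a hypothetical example, not taken from the dataset, and `longest_run` is an illustrative helper.

```python
from itertools import groupby

# Same grouping rules as on the mapping slides.
GROUPINGS = {
    "SW": str.maketrans("CGAT", "SSWW"),
    "RY": str.maketrans("AGCT", "RRYY"),
    "KM": str.maketrans("GTAC", "KKMM"),
}


def longest_run(s):
    """Length of the longest run of identical consecutive symbols."""
    return max(len(list(group)) for _, group in groupby(s))


# Hypothetical fragment: ten pyrimidines (C/T) followed by the AG dinucleotide.
fragment = "TTTCTCTTTCAG"
runs = {rule: longest_run(fragment.translate(table))
        for rule, table in GROUPINGS.items()}
print(runs)  # {'SW': 3, 'RY': 10, 'KM': 3}
```

Only the RY alphabet collapses the pyrimidine stretch into a single long run of Y, so Y-rich k-mers become very frequent in such sequences; SW and KM break the same stretch into short, mixed runs.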
Conclusions • Selection of the appropriate feature mapping rule can greatly influence the DNA sequence classification results • Anomalies in consensus sequences (such as long runs) can be exploited for better classification results when selecting mapping rules • For trade-off between classification accuracy and speed, RY k-mer frequency based mapping can be used instead of 4-letter k-mer frequency • Open research problem: “forbidden” k-mers
Questions?
SVM kernel function optimization • Introduction of additional kernel parameters • Introduction of new kernels • Power series kernel function • Advantage: • more parameters for optimization • better separation of classes in feature space
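The kernel formula on this slide was not preserved, so its exact form is unknown. Purely as an assumption, the sketch below uses a truncated power series in the dot product, K(x, y) = a_1 (x·y) + a_2 (x·y)^2 + ... + a_n (x·y)^n, where the coefficients a_i are the extra parameters available for optimization. The name `power_series_kernel` and this parameterization are hypothetical and may differ from the author's definition.

```python
def power_series_kernel(x, y, coeffs):
    """Assumed form: sum over i >= 1 of coeffs[i-1] * (x . y) ** i.

    With non-negative coefficients this is a sum of polynomial kernels
    and therefore a valid (positive semi-definite) kernel.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return sum(c * dot ** i for i, c in enumerate(coeffs, start=1))


# Hypothetical feature vectors; x . y = 1*3 + 2*4 = 11.
x, y = [1.0, 2.0], [3.0, 4.0]
linear = power_series_kernel(x, y, [1.0])          # 11.0, the linear kernel
two_terms = power_series_kernel(x, y, [1.0, 0.5])  # 11 + 0.5 * 121 = 71.5
```

A single coefficient reduces this to the linear kernel; each additional coefficient adds one tunable parameter.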
SW k-mer frequency mapping rule • SW ({A,T} vs. {C,G}) mapping rule • reflects the difference in the number of hydrogen bonds in the DNA molecule • Strong (C, G) nucleotides – 3 H bonds • Weak (A, T) nucleotides – 2 H bonds • related to physical-chemical properties of DNA • transport of electrons • mechanical waves along the DNA helix
RY k-mer frequency mapping rule • The RY mapping rule ({A, G} vs. {C, T}) • describes how purines (R) and pyrimidines (Y) are distributed along the DNA sequence • A and G – purines (R) • C and T – pyrimidines (Y) • corresponds to the chemical composition bias in the DNA strand
KM k-mer mapping rule • The KM mapping rule ({A,C} vs. {G,T}) • describes how ketones (K) and amines (M) are distributed along the DNA sequence • A and C – amines (M) • G and T – ketones (K)
Classification metric • F-measure • Advantage: • One measure that takes into account both recall and precision: a spectacular score in one does not compensate for a bad score in the other
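The slide's formula was not preserved; assuming the usual balanced F-measure (the harmonic mean of precision and recall), the behaviour described above can be sketched as follows. The function name and the confusion counts are illustrative.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure: 2 * P * R / (P + R), computed from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: perfect precision (1.0) but poor recall (0.1).
lopsided = f_measure(tp=10, fp=0, fn=90)
# Hypothetical counts: moderate precision and recall (both 0.7).
balanced = f_measure(tp=70, fp=30, fn=30)
```

The lopsided classifier scores about 0.18 despite its perfect precision, while the balanced one scores 0.7, which is the property the slide highlights.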