Predicting E. Coli Promoters Using SVM

Diep Mai (dmai@wisc.edu)
Course: CS/ECE 539, Fall 2008
Instructor: Prof. Yu Hen Hu

Purpose
• Build and train an SVM system that predicts whether a given gene sequence is an E. Coli promoter.
• Example: given the gene sequence
  aagcaaagaaatgcttgactctgtagcgggaaggcgtattatgcacaccgccgcgcc
  is it an E. Coli promoter?
• For more theoretical background on E. Coli promoters: http://homepages.cae.wisc.edu/~ece539/data/gene/theory.txt

Dataset
• The data file is obtained from http://homepages.cae.wisc.edu/~ece539/data/gene/data.txt
• Dataset information:
  • Number of instances: 106
  • Attributes:
    • Number of attributes: 57
    • Type: non-numeric nominal values (A, C, G, or T)
  • Classes:
    • Number of classes: 2
    • Type: positive (+1) or negative (-1)

Data preprocessing
• Randomly partition the dataset into TRAINSET and TESTSET
  • Ratio = TESTSET / (TRAINSET + TESTSET)
• Encode the non-numeric attributes
  • A → 0001₂ = 1₁₀
  • C → 0010₂ = 2₁₀
  • G → 0100₂ = 4₁₀
  • T → 1000₂ = 8₁₀
• Scale each feature to [-1, 1] so that features with large values do not dominate features with small values.
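The encoding and scaling steps above can be sketched as follows. This is a minimal illustration, not the author's code: the function names `encode` and `scale` are hypothetical, and scaling against the fixed code range [1, 8] is an assumption (the slides do not say whether the scaling bounds come from the code range or from the values observed in TRAINSET).

```python
# Codes from the slide: A = 0001b = 1, C = 0010b = 2, G = 0100b = 4, T = 1000b = 8.
NUCLEOTIDE_CODE = {"a": 1, "c": 2, "g": 4, "t": 8}

def encode(sequence):
    """Map a nucleotide string to a list of integer codes (case-insensitive)."""
    return [NUCLEOTIDE_CODE[base] for base in sequence.lower()]

def scale(values, lo=1, hi=8):
    """Linearly rescale values from [lo, hi] to [-1, 1].

    Assumption: lo and hi are the smallest and largest codes, not
    per-feature minima and maxima measured on the training set.
    """
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

# Example sequence taken from the Purpose slide.
sequence = "aagcaaagaaatgcttgactctgtagcgggaaggcgtattatgcacaccgccgcgcc"
codes = encode(sequence)      # one integer per attribute
features = scale(codes)       # one value in [-1, 1] per attribute
print(codes[:4])
print(features[:4])
```

With this mapping, A becomes -1 and T becomes +1, while C and G fall in between.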

Approach
• An RBF kernel is used, so “good” C (cost) and G (gamma) parameters must be found.
• Parameter scanning:
  • Set the range of C to [2^-15, 2^5] and the range of G to [2^-15, 2^2]
  • For each pair (C, G), use the leave-one-out method to determine the pairs that yield high accuracy rates
  • Repeat this process a few times; the pair that "often" produces high accuracy rates is more likely to be selected.
• Training/Testing:
  • Use the selected parameters and the whole TRAINSET to train the system.
  • Use the trained system to predict the TRAINSET.
    • Preferred accuracy rate = 100%
  • Use the trained system to predict the TESTSET.
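The parameter scan with repeated voting might look like the sketch below. Several details are assumptions rather than facts from the slides: the exponent step of 1, the `top_k` cutoff that defines a "high" accuracy rate, and the function names. `toy_accuracy` is a hypothetical stand-in for the real cross-validated SVM accuracy, included only so the sketch runs on its own.

```python
import math
import random
from collections import Counter

# Ranges from the slide: C in [2^-15, 2^5], G in [2^-15, 2^2].
# Assumption: the exponent advances in steps of 1.
C_GRID = [2.0 ** e for e in range(-15, 6)]
G_GRID = [2.0 ** e for e in range(-15, 3)]

def scan(evaluate, repetitions=10, top_k=3):
    """Return the (C, G) pair that most often ranks among the top_k pairs.

    evaluate(c, g) must return an accuracy estimate; it is called once
    per pair in every repetition.
    """
    votes = Counter()
    for _ in range(repetitions):
        scores = {(c, g): evaluate(c, g) for c in C_GRID for g in G_GRID}
        ranked = sorted(scores, key=scores.get, reverse=True)
        votes.update(ranked[:top_k])   # one vote per high-accuracy pair
    return votes.most_common(1)[0][0]

_rng = random.Random(0)

def toy_accuracy(c, g):
    """Hypothetical accuracy surface peaking at C = 2^1, G = 2^-4, plus noise."""
    distance = abs(math.log2(c) - 1) + abs(math.log2(g) + 4)
    return 1.0 - 0.02 * distance + _rng.uniform(-0.005, 0.005)

best_c, best_g = scan(toy_accuracy)
print(best_c, best_g)
```

In a real run, `evaluate` would train an RBF-kernel SVM on the folds of TRAINSET and return the cross-validation accuracy for that (C, G) pair.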

Results
• Configuration:
  • Ratio for partitioning the dataset = 1/5
    • Split the dataset into 5 roughly equal sets; one is reserved as TESTSET
  • K-fold = 15 (15 folds in total)
  • Number of repetitions used to select the parameters = 10
• After running the system several times:
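For the configuration above, the partition and fold sizes can be worked out with a short sketch. The helper names `partition` and `k_folds`, the seed, and the rounding of the test-set size are illustrative assumptions; the slides do not state how the split was implemented or whether it was stratified by class.

```python
import random

def partition(n_instances, test_ratio, rng):
    """Shuffle instance indices and split them into (train, test)."""
    indices = list(range(n_instances))
    rng.shuffle(indices)
    n_test = round(n_instances * test_ratio)
    return indices[n_test:], indices[:n_test]

def k_folds(indices, k):
    """Split indices into k folds whose sizes differ by at most one."""
    return [indices[i::k] for i in range(k)]

rng = random.Random(539)                  # arbitrary seed, for repeatability
train, test = partition(106, 1 / 5, rng)  # 106 instances, ratio 1/5
folds = k_folds(train, 15)                # 15 folds over TRAINSET

print(len(train), len(test))
print([len(fold) for fold in folds])
```

Under these assumptions TESTSET holds 21 instances and TRAINSET 85, so each of the 15 folds contains 5 or 6 instances.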

Observation/Conclusion
• SVM:
  • For this dataset the number of attributes is not large, so using an RBF kernel to map the features to a higher-dimensional space seems appropriate.
  • Scanning (C, G) takes a large amount of time. One approach to speed up this process:
    • Split the range into “large” equal intervals
    • Pick the interval that yields high accuracy rates
    • Divide this interval into smaller equal intervals
    • Repeat
• K-fold method:
  • The larger the number of folds, the more time the process requires.
  • For this dataset the number of instances is not large, and large numbers of folds seem to work well.
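The coarse-to-fine speed-up described above could be sketched as follows. This is one possible reading of the slide, not the author's implementation: the search runs over the exponents log2(C) and log2(G), each round zooms into the neighborhood of the best grid point, and `points`, `rounds`, and `toy_accuracy` are hypothetical choices made for illustration.

```python
import math

def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def zoom(current, best, points, bounds):
    """Shrink a range to one grid step on each side of best, within bounds."""
    step = (current[1] - current[0]) / (points - 1)
    return max(bounds[0], best - step), min(bounds[1], best + step)

def coarse_to_fine(evaluate, c_bounds=(-15, 5), g_bounds=(-15, 2),
                   points=5, rounds=3):
    """Search log2(C) and log2(G) by repeatedly refining around the best pair."""
    c_range, g_range = c_bounds, g_bounds
    best = None
    for _ in range(rounds):
        grid = [(c, g) for c in linspace(*c_range, points)
                for g in linspace(*g_range, points)]
        best = max(grid, key=lambda p: evaluate(2.0 ** p[0], 2.0 ** p[1]))
        c_range = zoom(c_range, best[0], points, c_bounds)
        g_range = zoom(g_range, best[1], points, g_bounds)
    return 2.0 ** best[0], 2.0 ** best[1]

def toy_accuracy(c, g):
    """Hypothetical smooth accuracy surface peaking at C = 2^1, G = 2^-4."""
    return 1.0 - 0.01 * ((math.log2(c) - 1) ** 2 + (math.log2(g) + 4) ** 2)

c, g = coarse_to_fine(toy_accuracy)
print(math.log2(c), math.log2(g))
```

With 5 points per axis and 3 rounds, this evaluates 75 pairs, compared with 378 for an exhaustive scan of the same ranges at unit exponent steps. The trade-off is that a coarse first grid can miss a narrow peak.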
