Predicting E. coli Promoters Using SVM • DIEP MAI (dmai@wisc.edu) • Course: CS/ECE 539 – Fall 2008 • Instructor: Prof. Yu Hen Hu
Purpose • Build and train an SVM system to predict E. coli promoters from given gene sequences. • Example: Given the gene sequence aagcaaagaaatgcttgactctgtagcgggaaggcgtattatgcacaccgccgcgcc, is it an E. coli promoter? • For more theoretical background on E. coli promoters: http://homepages.cae.wisc.edu/~ece539/data/gene/theory.txt
Dataset • Data file is obtained from http://homepages.cae.wisc.edu/~ece539/data/gene/data.txt • Dataset information: • Number of instances: 106 • Attributes: • Number of attributes: 57 • Type: Non-numeric nominal values (A, C, G, or T) • Classes: • Number of classes: 2 • Type: Positive (+1) or Negative (-1)
Data preprocessing • Randomly partition the dataset into TRAINSET and TESTSET • Ratio = TESTSET / (TRAINSET + TESTSET) • Encode the non-numeric attributes as integers: • A → 0001 (binary) = 1 (decimal) • C → 0010 (binary) = 2 (decimal) • G → 0100 (binary) = 4 (decimal) • T → 1000 (binary) = 8 (decimal) • Scale each feature to [-1, 1] so that features with large values do not dominate features with small values.
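The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration, not the author's code: the names `encode`, `scale`, and `split` are hypothetical, the scaling assumes fixed bounds of 1 and 8 (the smallest and largest codes) rather than per-feature minima and maxima measured on the data, and the random seed is arbitrary.

```python
import random

# Encoding from the slide: each nucleotide maps to a 4-bit code with one bit set.
NUCLEOTIDE_CODE = {"a": 1, "c": 2, "g": 4, "t": 8}


def encode(sequence):
    """Map a nucleotide string to a list of integer codes (1, 2, 4, or 8)."""
    return [NUCLEOTIDE_CODE[base] for base in sequence.lower()]


def scale(value, lo=1, hi=8):
    """Linearly rescale value from [lo, hi] to [-1, 1] (bounds are an assumption)."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0


def split(samples, ratio, seed=0):
    """Shuffle the samples and return (TRAINSET, TESTSET).

    About `ratio` of the samples go to the test set.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * ratio)
    return shuffled[n_test:], shuffled[:n_test]


# The example sequence from the Purpose slide, turned into 57 scaled features.
sequence = "aagcaaagaaatgcttgactctgtagcgggaaggcgtattatgcacaccgccgcgcc"
features = [scale(code) for code in encode(sequence)]
```

With these fixed bounds, A maps to -1 and T maps to 1, while C and G fall in between.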
Approach • An RBF kernel is used, so “good” C (cost) and G (gamma) parameters need to be found. • Parameter scanning: • Set the range of C to [2^-15, 2^5] and the range of G to [2^-15, 2^2] • For each pair (C, G), use the leave-one-out method to determine the pairs that yield high accuracy rates • This process is repeated a few times; the pair that "often" produces high accuracy rates is more likely to be selected. • Training/Testing: • Train the system on the whole TRAINSET using the selected parameters. • Use the trained system to predict the TRAINSET (preferred accuracy rate = 100%). • Use the trained system to predict the TESTSET.
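A minimal sketch of the parameter scan follows, under several assumptions. The callable `cv_accuracy(C, G)` is hypothetical: it stands for whatever routine trains an RBF-kernel SVM and returns its cross-validation accuracy. The slides do not state the grid spacing, so the sketch steps the exponent by 1. "The pair that often produces high accuracy" is read here as a majority vote over the best pair found in each repetition, which is only one possible interpretation.

```python
from collections import Counter


def log2_grid(lo_exp, hi_exp, step=1):
    """Return [2**lo_exp, ..., 2**hi_exp], stepping the exponent by `step`."""
    return [2.0 ** e for e in range(lo_exp, hi_exp + 1, step)]


def scan(cv_accuracy, c_values, g_values):
    """Return the (C, G) pair with the highest cross-validation accuracy."""
    return max(
        ((c, g) for c in c_values for g in g_values),
        key=lambda pair: cv_accuracy(*pair),
    )


def select(cv_accuracy, c_values, g_values, repetitions=10):
    """Repeat the scan and return the pair that wins most often."""
    wins = Counter(
        scan(cv_accuracy, c_values, g_values) for _ in range(repetitions)
    )
    return wins.most_common(1)[0][0]


c_values = log2_grid(-15, 5)  # C in [2^-15, 2^5]
g_values = log2_grid(-15, 2)  # G in [2^-15, 2^2]
```

Repeating the scan only changes the outcome when `cv_accuracy` involves randomness, for example a random fold assignment.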
Results • Configuration: • Ratio of partitioning dataset = 1/5 • Split the dataset into 5 roughly equal sets; one is reserved as TESTSET • K-fold = 15 (15 folds in total) • Number of repetitions to select parameters = 10 • After running the system several times:
Observation/Conclusion • SVM: • For this dataset, the number of attributes is not large, so using the RBF kernel to map the features to a higher-dimensional space seems appropriate • Scanning (C, G) takes a large amount of time. One approach to speeding up this process: • Split the range into “large” equal intervals • Pick the interval that yields high accuracy rates • Divide this interval into smaller equal intervals • Repeat • K-fold method: • The larger the number of folds, the more time the process requires • For this dataset, the number of instances is not large, so a large number of folds seems to work well.
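The interval-refinement idea for speeding up the (C, G) scan could look roughly like the sketch below. It works on the exponents log2(C) and log2(G), evaluates the midpoint of each sub-interval, and shrinks the range around the best one. The function names, the number of intervals, and the number of rounds are all assumptions, and `toy_score` is a made-up stand-in for a real cross-validation accuracy. A coarse first pass like this can miss a narrow peak, so it is a heuristic rather than a guaranteed optimum.

```python
import math


def midpoints(lo, hi, intervals):
    """Return the midpoints of equal sub-intervals of [lo, hi] and their width."""
    width = (hi - lo) / intervals
    return [lo + (i + 0.5) * width for i in range(intervals)], width


def coarse_to_fine(score, c_exp, g_exp, intervals=4, rounds=4):
    """Narrow the log2(C) and log2(G) ranges toward a high-scoring cell."""
    (c_lo, c_hi), (g_lo, g_hi) = c_exp, g_exp
    for _ in range(rounds):
        c_mids, c_w = midpoints(c_lo, c_hi, intervals)
        g_mids, g_w = midpoints(g_lo, g_hi, intervals)
        # Evaluate every cell midpoint and keep the best-scoring one.
        best_c, best_g = max(
            ((c, g) for c in c_mids for g in g_mids),
            key=lambda p: score(2.0 ** p[0], 2.0 ** p[1]),
        )
        # The winning cell becomes the search range for the next round.
        c_lo, c_hi = best_c - c_w / 2, best_c + c_w / 2
        g_lo, g_hi = best_g - g_w / 2, best_g + g_w / 2
    return 2.0 ** ((c_lo + c_hi) / 2), 2.0 ** ((g_lo + g_hi) / 2)


def toy_score(c, g):
    """Made-up accuracy surface, peaked at C = 2^1 and G = 2^-4."""
    return -((math.log2(c) - 1) ** 2) - (math.log2(g) + 4) ** 2


best_c, best_g = coarse_to_fine(toy_score, (-15, 5), (-15, 2))
```

With 4 intervals per axis and 4 rounds, this evaluates 64 pairs instead of the 378 in a full scan with unit exponent steps over the same ranges.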