
Predicting E. Coli Promoters Using SVM


Presentation Transcript


  1. Predicting E. Coli Promoters Using SVM DIEP MAI (dmai@wisc.edu) Course: CS/ECE 539 – Fall 2008 Instructor: Prof. Yu Hen Hu

  2. Purpose • Build and train an SVM system to predict E. Coli promoters from given gene sequences. • Example: Given the gene sequence aagcaaagaaatgcttgactctgtagcgggaaggcgtattatgcacaccgccgcgcc — is it an E. Coli promoter? • For more theoretical background on E. Coli promoters: http://homepages.cae.wisc.edu/~ece539/data/gene/theory.txt

  3. Dataset • Data file is obtained from http://homepages.cae.wisc.edu/~ece539/data/gene/data.txt (a loading sketch follows below) • Number of instances: 106 • Attributes: 57 per instance; non-numeric nominal values (A, C, G, or T) • Classes: 2; Positive (+1) or Negative (-1)
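A minimal loading sketch in Python, under the assumption (not stated in the slides) that each line of data.txt holds a class label followed by the 57-character sequence, separated by whitespace; the real file layout may differ:

# Hypothetical loader: assumes each line of data.txt looks like
# "+1 aagcaaag..." — adjust the parsing if the real file differs.
def load_dataset(path):
    sequences, labels = [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            label, seq = parts
            labels.append(int(label))
            sequences.append(seq.lower())
    return sequences, labels

sequences, labels = load_dataset("data.txt")
print(len(sequences), "instances with", len(sequences[0]), "attributes each")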

  4. Data preprocessing • Randomly partition the dataset into TRAINSET and TESTSET • Ratio = TESTSET / (TRAINSET + TESTSET) • Encode the non-numeric attributes: • A → 0001 (binary) = 1 (decimal) • C → 0010 (binary) = 2 (decimal) • G → 0100 (binary) = 4 (decimal) • T → 1000 (binary) = 8 (decimal) • Scale each feature to [-1, 1] so that features with large values do not dominate those with small values (both steps are sketched below).
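A small Python sketch of the encoding and scaling steps above; the helper names and the tiny demo sequences are illustrative, not from the original project:

import numpy as np

# Power-of-two nucleotide codes from the slide: A=1, C=2, G=4, T=8.
CODE = {"a": 1, "c": 2, "g": 4, "t": 8}

def encode(sequences):
    # One row per sequence, one numeric feature per nucleotide position.
    return np.array([[CODE[ch] for ch in seq] for seq in sequences], dtype=float)

def scale(X):
    # Linearly rescale each feature column to [-1, 1].
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return 2.0 * (X - lo) / span - 1.0

# Tiny demonstration with two short dummy sequences:
print(scale(encode(["acgt", "ttga"])))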

  5. Approach • The RBF kernel is used → need to find "good" C (cost) and G (gamma) parameters. • Parameter scanning (sketched below): • Set the range of C to [2^-15, 2^5] and G to [2^-15, 2^2] • For each pair (C, G), use the leave-one-out method to determine the pairs that yield high accuracy rates • This process is repeated a few times; a pair that "often" produces high accuracy rates is more likely to be selected. • Training/Testing: • Use the selected parameters and the whole TRAINSET to train the system. • Use the trained system to predict the TRAINSET (preferred accuracy rate = 100%). • Use the trained system to predict the TESTSET.
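A hedged sketch of this scan using scikit-learn's SVC in place of whatever SVM package the original project used; the random arrays merely stand in for the scaled TRAINSET:

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 57))  # stand-in for the scaled TRAINSET
y = rng.choice([-1, 1], size=40)       # stand-in labels

# Scan C over 2^-15..2^5 and gamma over 2^-15..2^2, scoring each pair
# by leave-one-out accuracy, as described on the slide.
best = (0.0, None, None)
for log_c in range(-15, 6):
    for log_g in range(-15, 3):
        clf = SVC(kernel="rbf", C=2.0**log_c, gamma=2.0**log_g)
        acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
        if acc > best[0]:
            best = (acc, 2.0**log_c, 2.0**log_g)

acc, C, G = best
print(f"selected C={C}, gamma={G}, leave-one-out accuracy={acc:.3f}")

# Retrain on the whole TRAINSET with the selected pair; the slide prefers
# the training accuracy here to reach 100%.
final = SVC(kernel="rbf", C=C, gamma=G).fit(X, y)
print("training accuracy:", final.score(X, y))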

  6. Results • Configuration: • Ratio for partitioning the dataset = 1/5 • Split the dataset into 5 roughly equal sets; one is preserved as the TESTSET (a partition sketch follows below) • K-fold = 15 (15 folds in total) • Number of repetitions to select parameters = 10 • After running the system several times:
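A short sketch of the 1/5 hold-out partition described in the configuration; the random arrays again stand in for the 106 encoded instances:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(106, 57))  # stand-in for the encoded instances
y = rng.choice([-1, 1], size=106)

# Shuffle the instances, keep one fifth as TESTSET, train on the rest.
idx = rng.permutation(len(X))
n_test = len(X) // 5                     # ratio = TESTSET / whole = 1/5
X_test, y_test = X[idx[:n_test]], y[idx[:n_test]]
X_train, y_train = X[idx[n_test:]], y[idx[n_test:]]
print(len(X_train), "training instances,", len(X_test), "test instances")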

  7. Observation/Conclusion • SVM: • For this dataset the number of attributes is not large, so using the RBF kernel to map the features into a higher-dimensional space seems appropriate • Scanning (C, G) takes a large amount of time. One approach to speeding this up (sketched below): • Split the range into "large" equal intervals • Pick the interval that yields high accuracy rates • Divide that interval into smaller equal intervals • Repeat • K-fold method: • The larger the number of folds, the more time the process requires • For this dataset the number of instances is not large, so large numbers of folds seem to work well.
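A toy sketch of that coarse-to-fine refinement, shown for a single parameter axis; evaluate is a made-up stand-in for a cross-validated accuracy measurement, peaking near log2(C) = -3:

import numpy as np

def evaluate(log_c):
    # Toy score function standing in for cross-validated accuracy.
    return -abs(log_c + 3.0)

def refine(lo, hi, rounds=4, points=5):
    # Repeatedly scan `points` equal steps and zoom into the best interval.
    for _ in range(rounds):
        grid = np.linspace(lo, hi, points)
        scores = [evaluate(g) for g in grid]
        best = int(np.argmax(scores))
        # Narrow the range to the interval around the best grid point.
        lo = grid[max(best - 1, 0)]
        hi = grid[min(best + 1, points - 1)]
    return (lo + hi) / 2.0

print("refined log2(C) ~", refine(-15, 5))

Each round keeps only the interval surrounding the best grid point, so the number of SVM trainings grows linearly with the number of rounds instead of with the size of the full grid.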
