140 likes | 255 Views
Some Ideas on Final Project. Feature extraction. TGGCCGTACGAGTAACGGACTGGCTGTCTTCTCGT n CCGATACCCCCCACGCGAAACCCTACACATCAAAT p AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT n AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT p GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC n ACGCGGGTGGCGGCATGGTGCTCCGAAAGTGTTGT n
E N D
Feature extraction TGGCCGTACGAGTAACGGACTGGCTGTCTTCTCGT n CCGATACCCCCCACGCGAAACCCTACACATCAAAT p AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT n AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT p GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC n ACGCGGGTGGCGGCATGGTGCTCCGAAAGTGTTGT n CTCATATCCTACGCGGCCCCAACTATTAGCTCATG p TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC n
K-mer based features • Exact k-mers: k = 1 to 10 • Larger k • More accurate representation: accurate classification • More features: overfitting => lower accuracy • Less efficient (both time and memory) • Feature selection may be useful • Smaller k • Less features: less likely to overfit, shorter runing time • Less accurate representation: inaccurate classification • More efficient (both time and memory) • K-mers with degenerate chars (wild cards) • K-mers with mismatches • Two-block kmers: e.g., ACGN{1-3}CGT
PWM-based features • Can represent all forms in the previous slides in some way • Key: how to get them? • Cluster top-ranked k-mers • Use existing tools (e.g., MEME or alignACE) • Find motifs from top positive sequences (for efficiency) • Use MAST (in MEME) or ScanACE (in alignACE) to find matches in all sequences. • HMMs for learning and classification
Submission data format TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC n 0.005 AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT n 0.345 AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT p 0.985 GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC n 0.489 CTCATATCCTACGCGGCCCCAACTATTAGCTCATG p 0.523
Input format for weka @relation experiment @attribute seq string @attribute AAAA numeric @attribute AAAC numeric @attribute AAAG numeric … @attribute TTTT numeric @attribute class {n,p} @data TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC 0 1 1 0 0 …. p AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT 1 0 0 1 1…. n
Output format from weka With option –p 0 –distribution (you could try –p 1 to get the seq printed) inst# actual predicted error distribution 1 2:p 1:n + *1,0 2 2:p 1:n + *0.868,0.132 3 1:n 1:n *1,0 … 12 1:n 1:n *0.996,0.004
More about “–p 1” • With the seq as the first attribute, you’ll get an UnsupportedAttributeTypeException: Cannot handle string attributes • Solution – use an unsupervised attribute filter to remove it: java -cp weka.3.6.6.jar weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.RemoveType -T string" -W weka.classifiers.trees.J48 -t TF_10_data_1.4mer.arff -T TF_10_data_2.4mer.arff -p 1 -distribution -- -C 0.1 • This has the same effect of calling java -cp weka.3.6.6.jar weka.classifiers.trees.J48 -C 0.1-t TF_10_data_1.4mer.arff -T TF_10_data_2.4mer.arff -p 0 -distribution (but without having the seq as the first attribute in the .arff file. And of course you won’t have the seq in your output file.)
About MEME • Input seqs in FASTA format: > seq1 ATGACGTTACGTAATCCGTGATTATCCCGGCGTAC > seq2 CAGAATGGAACTTAAGGGAGAATTGTGCAATATTA > seq3 CCAACCAGGATGTTGTGCAACCGGTTCTTTTTATA > seq4 CTTGGTTGCGTCATGTCCTGGGATGCATTTACTAT > seq5 GAGATCTCCCTCTTATGTTGACATCCGTAAGCCCA > seq6 GATCGTGATGCATTGACTTGCGTAATAGTGTTATG > seq7 GCTGCAGAGATGGGTCATTATGTAATGTTGCCCCC
Running MEME <ROOT>/meme/bin/meme tf_1_top_100p.fasta -dna -minw 6 -maxw 15 -minsites 10 -mod zoops -nmotifs 5 -text > meme.out.txt tf_1_top_100p.fasta: my input was the top 100 positive seqs for TF_1 (you should try something differently: e.g. randomly split the positive seqs into 10 subsets) -dna: input is dna seq -minw 6: minimum motif length = 6 -maxw 6: maximum motif length = 15 -minsites: motif occurs in at least 10 sequences -mod zoops: each seq contains zero or one copy of the motif -nmotifs 5: return top 5 motifs -text: output is in plain text format (as opposed to HTML format)
Output from MEME ******************************************************************************** MOTIF 1 width = 10 sites = 99 llr = 717 E-value = 3.8e-141 ******************************************************************************** -------------------------------------------------------------------------------- Motif 1 Description -------------------------------------------------------------------------------- Simplified A 5::2:12991 pos.-specific C 1:::6:51:3 probability G 4:16:9:::1 matrix T :a924:3::5 bits 2.2 2.0 1.7 * * 1.5 * * ** Information 1.3 * * ** content 1.1 ** ** ** (10.3 bits) 0.9 ** ** ** 0.7 ***** ** 0.4 ********* 0.2 ********** 0.0 ---------- Multilevel ATTGCGCAAT consensus G AT T C sequence A
Running MAST <ROOT>/meme/bin/mast meme.out.txt -d TF_1_allseq.fasta -text –norc –m <k> -ev 100 -mt 0.01 –b [-stdout > mast_out.txt] • -norc: do not search motif from the reverse-complementary strand • -m <k>: find matches to the k-th motif present in meme.out.txt (vary k from 1 to max number of motifs) • Easier to use a scripting language to loop through k and parse the output • -ev 100: use an E-value cutoff = 100. this allows sequences with poor matches to be output. (default is -ev 10) • -mt 0.01: similar to –ev 100. (default is –mt 0.0001) • -b: simplify output • -stdout: redirect the output into a given file instead of the default file (mast.<motiffilename>.<seqfilename>.<paraValues>)
Output from MAST • Only need Section 1 in output SEQUENCE NAME DESCRIPTION E-VALUE LENGTH ------------- ----------- -------- ------ seq22 0.11 35 seq13 0.11 35 seq72 0.3 35 seq19 0.3 35 seq29 0.38 35 … seq446 96 35 seq298 96 35 seq264 96 35 seq251 96 35 • Larger e-values correspond to lower scores (poorer matches). • Seq not in the output has lowest score. • Therefore, need to transform e-values to scores. • E.g.: if seq not in output: score = 0; else score = max_evalue – evalue (max_evalue = 100 was specified by the –ev parameter.) • Or score = 1 / evalue.
AlignACE and ScanACE • AlignACE: AlignACE –i tf_1_top_100p.fasta > tf_1_top_100p.ace • ScanACE: ScanACE -i tf_1_top_100p.ace -z tf_1_data_1.fasta -c 1 -o <result> -c: allows more motif matches to be output Results saved in: <result>_n1.scn, <result>_n2.scn, etc. Higher scores mean better match. No transformation needed.