Applying NLP models to the Biological Domain. Eugen Buehler Lyle Ungar. Overview. “Languages” of Computers and Biology Probability Models for NL and Biology Maximum Entropy Basic ME amino acid model The “Whole Protein Model” Results in a gene prediction model.
Applying NLP models to the Biological Domain Eugen Buehler Lyle Ungar CLUNCH
Overview • “Languages” of Computers and Biology • Probability Models for NL and Biology • Maximum Entropy • Basic ME amino acid model • The “Whole Protein Model” • Results in a gene prediction model
Bits and Bytes: The Alphabet of Computers • Computer electronics are complicated: RAM, processor, etc. • It all comes down to bits (1s and 0s). • Bits can be organized into bytes (8 bits each). • Bytes can represent, among other things, letters (ASCII), which can form sentences.
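
As a small illustration of the bits-to-bytes-to-letters chain on this slide, the sketch below groups a bit string into 8-bit bytes and decodes each byte as an ASCII character. The function names are illustrative and do not come from the talk.

```python
def text_to_bits(text: str) -> str:
    """Encode each character as an 8-bit binary string (ASCII range assumed)."""
    return "".join(format(ord(ch), "08b") for ch in text)


def bits_to_text(bits: str) -> str:
    """Decode a string of '0'/'1' characters, 8 bits per character.

    A trailing partial byte (fewer than 8 bits) is ignored.
    """
    usable = len(bits) - len(bits) % 8
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, usable, 8))


if __name__ == "__main__":
    encoded = text_to_bits("Mary")
    print(encoded)
    print(bits_to_text(encoded))
```

Without knowing where the byte boundaries fall, the same bit string is much harder to read, which is the point of the "Find the words!" slide.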
DNA: Biology’s Alphabet • Biology is complicated. • It comes down to nucleotides (A, C, G, T). • Nucleotides can be grouped into codons (triplets). • Codons represent amino acids; amino acids make up the proteins that genes encode.
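
The codon-to-amino-acid step can be sketched in a few lines. This is a minimal example that assumes the standard genetic code and reads only the forward strand in frame 0; the names `CODON_TABLE` and `translate` are illustrative, not from the talk.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA to one-letter amino acid codes, codon by codon.

    Unknown codons (for example, ones containing 'N') become 'X';
    a trailing partial codon is ignored.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


if __name__ == "__main__":
    print(translate("ATGAAACGCATTAGCACCACC"))
```

The example input is a short stretch that begins with the start codon ATG, so the output begins with M (methionine).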
Find the words! 0101000110010100100100011010100011100101101101101001011101010101000001110101010100010001001110011101001100111001001110010100110010100100010010010001000100100010001001001100010001001100110010011101010100110011001001100101010001000110100100010000100100100010100100100010001101010100010101011100101011100011110001111000110011101001111101000011010000011110100111110010011000111100101111000111010101011001 CLUNCH
Find the genes! AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT GCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGC TGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGT TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT GAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCG CCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGC TGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGT CAGGTGCCCGATGCGAGGTTGTTGAAGTCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCG CTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGCCTGATTAAAAATAC CGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGGGC ATTTCCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGGCATGG CGGCGCGCGTCTTTGCAGCGATGTCACGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGA ATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTC CLUNCH
NL and Biological Modeling • “Mary went to the ____.” • MSGTIPSCPTAL ___
Markov Models
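
The body of this slide did not survive in the transcript. As a reminder of the standard idea, and not necessarily the exact formula shown in the talk, a Markov model of order $k$ approximates the chain-rule factorization of a sequence by conditioning each symbol on only the previous $k$ symbols:

```latex
P(w_1, \dots, w_n)
  = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
  \approx \prod_{i=1}^{n} P(w_i \mid w_{i-k}, \dots, w_{i-1})
```

Here $w_i$ can be a word in a sentence or an amino acid in a protein; the symbol names are chosen for this note.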
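ME, In a Nutshell • Constrain the model to match statistics observed in the training data. • Among all models that satisfy the constraints, choose the one with maximum entropy.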
Constraining features • “is the” occurs with frequency 1/10000. • Define a binary feature that fires when “the” follows “is”. • Require that the model’s expected value of this feature equal 1/10000.
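
The equations on this slide are missing from the transcript. A standard way to write the feature and its constraint, given here as a reconstruction under that assumption and not as the slide's exact notation, uses a history $h$ and a next word $w$:

```latex
f_{\text{is},\text{the}}(h, w) =
\begin{cases}
  1 & \text{if } h \text{ ends in ``is'' and } w = \text{``the''} \\
  0 & \text{otherwise}
\end{cases}
\qquad
\sum_{h, w} p(h, w)\, f_{\text{is},\text{the}}(h, w) = \frac{1}{10000}
```

The left-hand side of the constraint is the expected value of the feature under the model $p$; the right-hand side is the frequency quoted on the slide.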
Exponential Solution • A unique maximum-entropy solution exists, and it has an exponential (log-linear) form.
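
The formula itself is not in the transcript. In the usual conditional form, the solution is $p(w \mid h) = \exp\big(\sum_i \lambda_i f_i(h, w)\big) / Z(h)$, where $Z(h)$ normalizes over the vocabulary. The sketch below evaluates that expression for a toy vocabulary; the feature, the weight, and the function name `me_conditional` are hypothetical, and no training of the weights is shown.

```python
import math
from typing import Callable, Dict, Sequence

Feature = Callable[[Sequence[str], str], float]


def me_conditional(
    history: Sequence[str],
    vocab: Sequence[str],
    features: Sequence[Feature],
    weights: Sequence[float],
) -> Dict[str, float]:
    """Return p(w | history) for every w in vocab under a log-linear model."""
    scores = {
        w: math.exp(sum(lam * f(history, w) for f, lam in zip(features, weights)))
        for w in vocab
    }
    z = sum(scores.values())  # normalizer Z(h)
    return {w: s / z for w, s in scores.items()}


if __name__ == "__main__":
    # Hypothetical bigram feature: fires when "the" follows "is".
    def is_the(h: Sequence[str], w: str) -> float:
        return 1.0 if h and h[-1] == "is" and w == "the" else 0.0

    probs = me_conditional(["it", "is"], ["the", "a", "cat"], [is_the], [math.log(3)])
    print(probs)
```

With no active features every word gets the same score, so the model falls back to the uniform (maximum-entropy) distribution.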
Triggers • Triggers: words that increase the likelihood of other words. • Crop → Harvest • Cuban → Havana • Iran → Hashemi • Hate → Hate
Unigram and Bigram Caches • Caches – frequency tables built from the history. • Is “supercalifragilisticexpialidocious” a common word? • Allow for model adaptation.
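
A unigram cache can be sketched as a frequency table over the words seen so far in the current document. The add-alpha smoothing and the function name below are assumptions made for this example; the talk does not say how cache estimates were smoothed.

```python
from collections import Counter
from typing import Sequence


def unigram_cache_prob(
    history: Sequence[str], word: str, vocab_size: int, alpha: float = 1.0
) -> float:
    """Estimate p(word) from the history alone, with add-alpha smoothing."""
    counts = Counter(history)
    return (counts[word] + alpha) / (len(history) + alpha * vocab_size)


if __name__ == "__main__":
    history = ["a", "b", "a"]
    print(unigram_cache_prob(history, "a", vocab_size=3))
    print(unigram_cache_prob(history, "c", vocab_size=3))
```

A word that is rare in general but has already appeared in the history gets a higher cache estimate, which is how the cache lets the model adapt to the current text.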
Applying ME Models in Computational Biology • Significant improvement for NLP. • Same for biological models? • AA sequences: a simple test case.
Feature Sets • Unigrams and bigrams • Self-triggers: frequency of a specific amino acid. • Class-based self-triggers: frequency of a specific amino acid class. • Unigram cache: amino acid frequency for this protein.
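
The self-trigger and class-based self-trigger statistics can be sketched as simple counts over the part of the protein seen so far. The four-way grouping of amino acids below is a hypothetical example chosen for illustration; the slides do not say which classes were used.

```python
# Hypothetical coarse grouping of the 20 amino acids (an assumption,
# not the classes used in the talk).
AA_CLASSES = {
    "hydrophobic": set("AVLIMFWC"),
    "polar": set("STNQYGP"),
    "positive": set("KRH"),
    "negative": set("DE"),
}


def class_of(aa: str) -> str:
    """Return the name of the class containing this amino acid."""
    for name, members in AA_CLASSES.items():
        if aa in members:
            return name
    return "other"


def self_trigger_count(history: str, aa: str) -> int:
    """How many times this exact amino acid has already occurred."""
    return history.count(aa)


def class_self_trigger_count(history: str, aa: str) -> int:
    """How many earlier residues belong to the same class as this amino acid."""
    target = class_of(aa)
    return sum(1 for residue in history if class_of(residue) == target)


if __name__ == "__main__":
    history = "MSGTIPSCPTAL"
    print(self_trigger_count(history, "S"))
    print(class_self_trigger_count(history, "S"))
```

In a maximum entropy model these counts would typically be binned or thresholded into binary features; that step is left out here.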
Training and testing data • Burset et al. set of 571 proteins. • Homologous proteins eliminated. • Resulting set of 204 proteins split into 2 groups of 102 each.
Results • “Long distance” features help. • Best model gives a 30% reduction in perplexity over the unigram model. • Our model may improve predictions made by Genscan, a eukaryotic gene-finding algorithm.
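
Perplexity is the usual yardstick for this kind of comparison: it is 2 raised to the average negative log2 probability that the model assigns to each symbol of the test data. The sketch below computes it from a list of per-symbol probabilities; the function name is illustrative. A 30% reduction means the best model's perplexity is 0.7 times that of the unigram model.

```python
import math
from typing import Sequence


def perplexity(probs: Sequence[float]) -> float:
    """Perplexity of a model given the probability it assigned to each test symbol."""
    if not probs:
        raise ValueError("need at least one probability")
    avg_log2 = sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** (-avg_log2)


if __name__ == "__main__":
    # A model that spreads probability uniformly over 20 amino acids.
    print(perplexity([1 / 20] * 10))
```

For the uniform model over 20 amino acids the perplexity is 20 (up to floating-point rounding), which is the ceiling any informed model should beat.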
Limitations of this model • Artificial model. • Cannot represent all global features.
The “Whole Sentence” Model
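
The formula for this slide is not in the transcript. In the whole-sentence maximum entropy approach known from the language modeling literature, the model assigns a probability to an entire sentence $s$ (or, here, an entire protein) at once, which allows features that depend on global properties of the sequence. Assuming that is the form the slide showed:

```latex
p(s) = \frac{1}{Z}\, p_0(s)\, \exp\!\Big( \sum_i \lambda_i f_i(s) \Big)
```

Here $p_0$ is a baseline model such as an n-gram model, the $f_i$ are whole-sequence features, the $\lambda_i$ are their weights, and $Z$ is a single global normalizing constant.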
Secondary Structure MAGTVTEAWDVAVFAARRRNDEDDTTRDSLFTYTNSNNTRGPFEGPNYHIAPRWVYNITSVWMIFVVIASIFTNGLVLVATAKFKKLRHPLNWILVNLAIADLGETVIASTISVINQISGYFILGHPMCVLEGYTVSTCGISALWSLAVISWERWVVVCKPFGNVKFDAKLAVAGIVFSWVWSAVWTAPPVFGWSRYWPHGLKTSCGPDVFSGSDDPGVLSYMIVLMITCCFIPLAVILLCYLQVWLAIRAVAAQQKESESTQKAEKEVSRMVVVMIIAYCFCWGPYTVFACFAAANPGYAFHPLAAALPAYFAKSATIYNPIIYVFMNRQFRNCIMQLFGKKVDDGSELSSTSRTEVSSVSNSSVSPA -----HHHHHHHHHHH--------------EEE--------------------EEEE---EEEEEEEEEEEEH--HEEHHHHHHHH------HHHHHHHHH---HHEEEEEEEEEEE---EEE-----EEE----EEEE-EHEHHHHEHHHH-HEEEE---------HHHHHEHEEEEEEEHH-------------H------------E----------EEEEEEEEE------EHHHHHHHHHHHHHHHHHH-H----HHHHHHHHHHHHHEEEEEEEE-------EEEHHH----------HHHHHHHHHH--------EEEEEH-----HHHHHHH---------------EEEEE--------- CLUNCH
“Whole Protein” Results • 19 features evaluated • Two were selected: mean length of alpha helix regions, and maximum length of any structural region • 59% increase in protein likelihood
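
The two selected features can be computed from a predicted secondary-structure string like the one on the previous slide. The sketch below assumes that "H" marks helix, "E" marks strand, "-" marks everything else, and that a "structural region" means an unbroken run of H or of E; those readings and the function names are assumptions, not details given in the talk.

```python
from itertools import groupby
from typing import List, Tuple


def region_lengths(structure: str, labels: str = "HE") -> List[Tuple[str, int]]:
    """Return (label, run length) for each unbroken run of a structural label."""
    return [
        (label, sum(1 for _ in run))
        for label, run in groupby(structure)
        if label in labels
    ]


def mean_helix_length(structure: str) -> float:
    """Mean length of the helix ('H') runs, or 0.0 if there are none."""
    helices = [n for label, n in region_lengths(structure) if label == "H"]
    return sum(helices) / len(helices) if helices else 0.0


def max_region_length(structure: str) -> int:
    """Length of the longest helix or strand run, or 0 if there are none."""
    return max((n for _, n in region_lengths(structure)), default=0)


if __name__ == "__main__":
    s = "--HHHH--EE-HH"
    print(mean_helix_length(s), max_region_length(s))
```

Both values depend on the whole protein at once, which is why they fit a whole-sequence model better than a model that predicts one residue at a time.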
Improved Glimmer Models • Glimmer uses interpolated Markov models (IMMs) to predict genes in bacteria. • Will adding amino acid triggers improve these models, and by how much?
H. pylori Genome • 1,562 coding sequences • Split into: • Training (>500 bp) – 1,154 genes, 1,354,167 bp • Testing (<500 bp) – 408 genes, 129,045 bp
Glimmer Depth
Lateral Gene Transfer • Many genes in bacteria come not from their ancestors but from other bacterial species. • Different bacteria “prefer” to use different codons. • Analogous to plagiarism detection?
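
Codon preference can be made concrete as a table of relative codon frequencies for a gene. The sketch below only computes that table; how to compare a gene's profile against the rest of the genome to flag transferred genes is left open, and the function name is illustrative.

```python
from collections import Counter
from typing import Dict


def codon_usage(gene: str) -> Dict[str, float]:
    """Relative frequency of each codon in a coding sequence read in frame 0.

    A trailing partial codon is ignored.
    """
    gene = gene.upper()
    usable = len(gene) - len(gene) % 3
    counts = Counter(gene[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


if __name__ == "__main__":
    print(codon_usage("ATGAAAAAATAA"))
```

A gene whose profile differs sharply from the host's typical profile is a candidate for having been acquired from another species, much as a passage in an unusual style stands out in a document.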
Model Adaptation • Gene models are trained for every organism. • Lots of unused information. • Analogous to cross-domain application of NLP models.
Thanks • Lyle Ungar • Roni Rosenfeld • NIH Grant
N-Gram Features • Unigram (frequency of individual words) • Bigram (frequency of pairs of words)
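
Both kinds of count can be collected with one helper. This is a minimal sketch with an illustrative function name; it works on a list of words or directly on an amino acid string.

```python
from collections import Counter
from typing import Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count every run of n consecutive tokens; keys are tuples of length n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


if __name__ == "__main__":
    words = "Mary went to the store".split()
    print(ngram_counts(words, 1))
    print(ngram_counts(words, 2))
    print(ngram_counts("MSGTIPSCPTAL", 2))
```

Dividing each count by the total number of n-grams gives the frequencies that serve as constraint targets in a maximum entropy model.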
Trigger feature function
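
The formula for this slide is missing from the transcript. A common definition, assumed here, is a binary feature for a trigger pair A → B that fires when the next word is B and A has occurred anywhere in the history. The helper name `make_trigger_feature` is hypothetical.

```python
from typing import Callable, Sequence


def make_trigger_feature(trigger: str, target: str) -> Callable[[Sequence[str], str], float]:
    """Build f(history, word) = 1 if word == target and trigger is in history, else 0."""

    def feature(history: Sequence[str], word: str) -> float:
        return 1.0 if word == target and trigger in history else 0.0

    return feature


if __name__ == "__main__":
    cuban_havana = make_trigger_feature("Cuban", "Havana")
    print(cuban_havana(["the", "Cuban", "capital", "is"], "Havana"))
    print(cuban_havana(["the", "capital", "is"], "Havana"))
```

A self-trigger is the special case where the trigger and the target are the same symbol, as in the Hate → Hate example or a repeated amino acid.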