Parsing A Bacterial Genome
Mark Craven
Department of Biostatistics & Medical Informatics, University of Wisconsin, U.S.A.
craven@biostat.wisc.edu
www.biostat.wisc.edu/~craven
The Task
Given: a bacterial genome
Do: use computational methods to predict a “parts list” of regulatory elements
Outline
• background on bacterial gene regulation
• background on probabilistic language models
• predicting transcription units using probabilistic language models
• augmenting training with “weakly” labeled examples
• refining the structure of a stochastic context free grammar
Operons in Bacteria
[figure: an operon containing a promoter, several genes, and a terminator, transcribed into a single mRNA]
• operon: sequence of one or more genes transcribed as a unit under some conditions
• promoter: “signal” in DNA indicating where to start transcription
• terminator: “signal” indicating where to stop transcription
The Task Revisited
Given:
• DNA sequence of E. coli genome
• coordinates of known/predicted genes
• known instances of operons, promoters, terminators
Do:
• learn models from known instances
• predict complete catalog of operons, promoters, terminators for the genome
Our Approach: Probabilistic Language Models
• write down a “grammar” for elements of interest (operons, promoters, terminators, etc.) and relations among them
• learn probability parameters from known instances of these elements
• predict new elements by “parsing” uncharacterized DNA sequence
Transformational Grammars
• a transformational grammar characterizes a set of legal strings
• the grammar consists of
  • a set of abstract nonterminal symbols
  • a set of terminal symbols (those that actually appear in strings)
  • a set of productions
A Grammar for Stop Codons
• this grammar can generate the 3 stop codons: taa, tag, tga (one consistent set of productions is shown below)
• with a grammar we can ask questions like
  • what strings are derivable from the grammar?
  • can a particular string be derived from the grammar?
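The productions themselves appear only as a figure on the original slide. One set of productions consistent with the description (it derives exactly taa, tag, and tga) is:

S → t W
W → a X | g a
X → a | g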
A Probabilistic Version of the Grammar
[figure: the stop-codon grammar with a probability attached to each production]
• each production has an associated probability (see the scoring sketch below)
• the probabilities for productions with the same left-hand side sum to 1
• this grammar has a corresponding Markov chain model
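As a concrete illustration, here is a minimal Python sketch that scores a derivation under a probabilistic version of the stop-codon grammar above. The probability values are placeholders: the numbers on the original slide figure cannot be unambiguously matched to individual productions here.

```python
# Probabilistic stop-codon grammar; the probability values are placeholders,
# not the exact numbers from the slide figure.
productions = {
    "S": [(("t", "W"), 1.0)],
    "W": [(("a", "X"), 0.7), (("g", "a"), 0.3)],
    "X": [(("a",), 0.8), (("g",), 0.2)],
}

def rule_prob(lhs, rhs):
    """Look up the probability of the production lhs -> rhs."""
    for candidate, p in productions[lhs]:
        if candidate == rhs:
            return p
    raise KeyError((lhs, rhs))

def derivation_prob(derivation):
    """Multiply the probabilities of the productions used in a derivation."""
    prob = 1.0
    for lhs, rhs in derivation:
        prob *= rule_prob(lhs, rhs)
    return prob

# The derivation of the stop codon "taa": S -> t W, W -> a X, X -> a
print(derivation_prob([("S", ("t", "W")), ("W", ("a", "X")), ("X", ("a",))]))  # 0.56
```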
A Probabilistic Context Free Grammar for Terminators
[figure: the secondary structure of an example terminator, with a prefix, a base-paired stem, a loop, and a u-rich suffix]

START → PREFIX STEM_BOT1 SUFFIX
PREFIX → B B B B B B B B B
STEM_BOT1 → tl STEM_BOT2 tr
STEM_BOT2 → tl* STEM_MID tr* | tl* STEM_TOP2 tr*
STEM_MID → tl* STEM_MID tr* | tl* STEM_TOP2 tr*
STEM_TOP2 → tl* STEM_TOP1 tr*
STEM_TOP1 → tl LOOP tr
LOOP → B B LOOP_MID B B
LOOP_MID → B LOOP_MID | ε
SUFFIX → B B B B B B B B B
B → a | c | g | u

where tl, tr ∈ {a, c, g, u} and tl*, tr* ∈ {a, c, g, u, ε}
Inference with Probabilistic Grammars
• for a given string there may be many parses, but some are more probable than others
• we can do prediction by finding relatively high probability parses
• there are dynamic programming algorithms for finding the most probable parse efficiently (sketched below)
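The dynamic-programming idea can be shown with a minimal Viterbi-style CYK sketch. The toy grammar below is in Chomsky normal form and generates nested a...u pairs; it is purely illustrative and is not the terminator grammar used in this work.

```python
from collections import defaultdict

# Toy SCFG in Chomsky normal form (illustrative only; it generates nested a...u pairs).
unary = {("A", "a"): 1.0, ("U", "u"): 1.0}   # rules of the form A -> terminal
binary = {("S", "A", "X"): 0.5,              # rules of the form A -> B C
          ("S", "A", "U"): 0.5,
          ("X", "S", "U"): 1.0}

def best_parse_prob(seq, start="S"):
    """Probability of the most probable parse of seq rooted at `start` (Viterbi-style CYK)."""
    n = len(seq)
    best = defaultdict(float)                # (i, j, A) -> best probability that A derives seq[i:j]
    for i, sym in enumerate(seq):            # width-1 spans come from the unary rules
        for (A, t), p in unary.items():
            if t == sym:
                best[(i, i + 1, A)] = max(best[(i, i + 1, A)], p)
    for width in range(2, n + 1):            # wider spans combine two smaller spans
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):        # split point
                for (A, B, C), p in binary.items():
                    score = p * best[(i, k, B)] * best[(k, j, C)]
                    best[(i, j, A)] = max(best[(i, j, A)], score)
    return best[(0, n, start)]

print(best_parse_prob("aauu"))   # 0.25 via S -> A X, X -> S U, S -> A U
```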
Learning with Probabilistic Grammars
• in this work, we write down the productions by hand, but learn the probability parameters
• to learn the probability parameters, we align sequences of a given class (e.g. terminators) with the relevant part of the grammar (see the counting sketch below)
• when there is hidden state (i.e. the correct parse is not known), we use Expectation Maximization (EM) algorithms
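When the correct parse of every training sequence is known, parameter estimation reduces to counting; EM is only needed for the hidden-state case. A minimal sketch of the fully observed case, using the toy stop-codon grammar from above rather than the full transcription-unit grammar:

```python
from collections import Counter, defaultdict

# Each training example is the list of productions used in its (known) parse.
parses = [
    ["S -> t W", "W -> a X", "X -> a"],   # taa
    ["S -> t W", "W -> a X", "X -> g"],   # tag
    ["S -> t W", "W -> g a"],             # tga
]

counts = Counter(rule for parse in parses for rule in parse)
lhs_totals = defaultdict(int)
for rule, c in counts.items():
    lhs_totals[rule.split("->")[0].strip()] += c

# Maximum-likelihood estimate: normalize counts over rules sharing a left-hand side.
probs = {rule: c / lhs_totals[rule.split("->")[0].strip()] for rule, c in counts.items()}
print(probs["W -> a X"])   # 2/3
```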
Outline
• background on bacterial gene regulation
• background on probabilistic language models
• predicting transcription units using probabilistic language models [Bockhorst et al., ISMB/Bioinformatics ‘03]
• augmenting training with “weakly” labeled examples
• refining the structure of a stochastic context free grammar
A Model for Transcription Units
[figure: state diagram of the transcription-unit model, divided into an untranscribed region and a transcribed region; states include spacers, the promoter -35 box, promoter internal region, -10 box, post-promoter region, TSS, UTR, ORF and intra-ORF states, last ORF, pre-terminator region, and rho-independent (RIT) and rho-dependent (RDT) terminator prefix, stem-loop, and suffix states; components are modeled with SCFGs, position-specific Markov models, and semi-Markov models]
The Components of the Model
• position-specific Markov models represent fixed-length sequence motifs (sketched below)
• semi-Markov models represent variable-length sequences
• stochastic context free grammars (SCFGs) represent variable-length sequences with long-range dependencies
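A minimal sketch of the first of these components, a position-specific (first-order) Markov model for a fixed-length motif: each position has its own conditional distribution over bases given the preceding base. The tiny tables below are made up for illustration, not trained parameters from this work.

```python
import math

# One conditional distribution table per motif position.
tables = [
    {None: {"a": 0.1, "c": 0.2, "g": 0.2, "t": 0.5}},                    # position 0: no previous base
    {b: {"a": 0.4, "c": 0.1, "g": 0.1, "t": 0.4} for b in "acgt"},       # position 1
    {b: {"a": 0.3, "c": 0.2, "g": 0.3, "t": 0.2} for b in "acgt"},       # position 2
]

def motif_log_prob(window):
    """Log probability of a window whose length equals the motif length."""
    assert len(window) == len(tables)
    total, prev = 0.0, None
    for position, base in enumerate(window):
        total += math.log(tables[position][prev][base])
        prev = base
    return total

print(motif_log_prob("tag"))   # log(0.5 * 0.4 * 0.3)
```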
Gene Expression Data
[figure: microarray data matrix with genes/sequence positions as rows and experimental conditions as columns]
• in addition to DNA sequence data, we also use expression data to make our parses
• microarrays enable the simultaneous measurement of the transcription levels of thousands of genes
Incorporating Expression Data
[figure: a DNA sequence aligned with expression measurements at particular sequence positions]
• our models parse two sequences simultaneously
  • the DNA sequence of the genome
  • a sequence of expression measurements associated with particular sequence positions
• the expression data is useful because it provides information about which subsequences look like they are transcribed together (see the correlation sketch below)
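One way to see why expression data helps (a simplified illustration of the intuition, not the actual parsing model, which emits the expression values alongside the DNA sequence): genes whose expression profiles track each other across conditions look like they are transcribed together.

```python
import numpy as np

# Toy expression profiles (rows = genes, columns = experimental conditions).
expr = np.array([
    [2.1, 5.0, 1.2, 4.4],   # gene A
    [2.0, 4.8, 1.1, 4.6],   # gene B: tracks gene A -> likely co-transcribed
    [4.0, 1.0, 3.9, 1.2],   # gene C: anti-correlated with A and B
])

corr = np.corrcoef(expr)
print(corr[0, 1])   # high (~0.99): evidence A and B are in the same operon
print(corr[0, 2])   # strongly negative: evidence C is transcribed separately
```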
Outline
• background on bacterial gene regulation
• background on probabilistic language models
• predicting transcription units using probabilistic language models
• augmenting training data with “weakly” labeled examples [Bockhorst & Craven, ICML ’02]
• refining the structure of a stochastic context free grammar
Key Idea: Weakly Labeled Examples
• regulatory elements are inter-related
  • promoters precede operons
  • terminators follow operons
  • etc.
• relationships such as these can be exploited to augment training sets with “weakly labeled” examples
Inferring “Weakly” Labeled Examples
[figure: a genome region with genes g1–g5 on both strands]
• if we know that an operon ends at g4, then there must be a terminator shortly downstream
• if we know that an operon begins at g2, then there must be a promoter shortly upstream
• we can exploit relations such as this to augment our training sets (see the sketch below)
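A minimal sketch of how such weakly labeled terminator examples could be generated from known operon boundaries; the 100 bp window length is an arbitrary illustration rather than a value from the papers.

```python
def weak_terminator_examples(genome, operon_end_coords, window=100):
    """For each coordinate where a known operon ends, return the downstream
    window of sequence, which must contain a terminator somewhere inside it."""
    return [genome[end:end + window] for end in operon_end_coords]

# Example usage: genome is a string over a/c/g/t, one known operon ends at position 12345
# weak_examples = weak_terminator_examples(genome, [12345])
```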
Strongly vs. Weakly Labeled Terminator Examples
• strongly labeled terminator: the example sequence is annotated with its sub-class (here rho-independent), the extent of the terminator, and the end of the stem-loop:
  gtccgttccgccactattcactcatgaaaatgagttcagagagccgcaagatttttaattttgcggtttttttgtatttgaattccaccatttctctgttcaatg
• weakly labeled terminator: the same sequence is known to contain a terminator, but the sub-class, extent, and internal structure are not annotated
Training the Terminator Models: Strongly Labeled Examples
[figure: rho-independent examples train the rho-independent terminator model, rho-dependent examples train the rho-dependent terminator model, and negative examples train the negative model]
Training the Terminator Models: Weakly Labeled Examples
[figure: weakly labeled examples train a combined terminator model (the rho-independent and rho-dependent terminator models together), while negative examples train the negative model]
Do Weakly Labeled Terminator Examples Help?
• task: classification of terminators (both sub-classes) in E. coli K-12
• train SCFG terminator model using:
  • S strongly labeled examples and
  • W weakly labeled examples
• evaluate using area under ROC curves (see the sketch below)
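For reference, area under the ROC curve measures how well the model's scores rank true terminators above negatives. A minimal sketch using scikit-learn (the library choice is an assumption for illustration; the original evaluation code is not specified here):

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 0, 1, 0, 0]                 # 1 = known terminator, 0 = negative example
y_score = [0.9, 0.7, 0.4, 0.35, 0.2, 0.1]   # model scores (e.g. log-odds or probabilities)
print(roc_auc_score(y_true, y_score))       # ~0.89 for this toy data
```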
Learning Curves using Weakly Labeled Terminators
[figure: area under ROC curve (0.5–1.0) vs. number of strong positive examples (0–140), with curves for 0, 25, and 250 weak examples]
Are Weakly Labeled Examples Better than Unlabeled Examples?
• train SCFG terminator model using:
  • S strongly labeled examples and
  • U unlabeled examples
• vary S and U to obtain learning curves
Training the Terminator Models: Unlabeled Examples
[figure: unlabeled examples are used with a combined model made up of the rho-independent terminator, rho-dependent terminator, and negative models]
Learning Curves: Weak vs. Unlabeled
[figure: two panels (weakly labeled vs. unlabeled) plotting area under ROC curve (0.6–1.0) against number of strong positive examples (0–120), with curves for 0, 25, and 250 weak or unlabeled examples]
Are Weakly Labeled Terminators from Predicted Operons Useful?
• train operon model with S labeled operons
• predict operons
• generate W weakly labeled terminators from the W most confident predictions (see the sketch below)
• vary S and W
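A minimal sketch of this step, building on the window-extraction sketch above; the data layout (end coordinate, confidence pairs) and the window length are hypothetical.

```python
def weak_terminators_from_predictions(genome, predicted_operons, W, window=100):
    """predicted_operons: list of (end_coordinate, confidence) pairs from the operon model.
    Keep the W most confident predictions and take the window downstream of each."""
    most_confident = sorted(predicted_operons, key=lambda p: p[1], reverse=True)[:W]
    return [genome[end:end + window] for end, confidence in most_confident]
```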
Learning Curves using Weakly Labeled Terminators
[figure: area under ROC curve (0.5–1.0) vs. number of strong positive examples (0–160), with curves for 0, 25, 100, and 200 weak examples generated from predicted operons]
Outline
• background on bacterial gene regulation
• background on probabilistic language models
• predicting transcription units using probabilistic language models
• augmenting training with “weakly” labeled examples
• refining the structure of a stochastic context free grammar [Bockhorst & Craven, IJCAI ’01]
Learning SCFGs
• given the productions of a grammar, can learn the probabilities using the Inside-Outside algorithm
• we have developed an algorithm that can add new nonterminals & productions to a grammar during learning
• basic idea:
  • identify nonterminals that seem to be “overloaded”
  • split these nonterminals into two; allow each to specialize
Refining the Grammar in a SCFG
[figure: a nonterminal used in two different contexts, with production probabilities 0.4, 0.1, 0.1, 0.4 in one context and 0.1, 0.4, 0.4, 0.1 in the other]
• there are various “contexts” in which each grammar nonterminal may be used
• consider two contexts for a given nonterminal
• if the production probabilities for that nonterminal look very different depending on its context, we add a new nonterminal and specialize
Refining the Grammar in a SCFG
[figure: the two context-specific production distributions, labeled P and Q (0.4, 0.1, 0.1, 0.4 and 0.1, 0.4, 0.4, 0.1)]
• we can compare two probability distributions P and Q using Kullback-Leibler divergence (defined and sketched below)
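The divergence in question is the standard Kullback-Leibler divergence, D_KL(P || Q) = Σ_x P(x) log( P(x) / Q(x) ). A minimal sketch of the splitting test, using the distributions shown on the slide; the symmetrized form and the threshold are illustrative assumptions, not the exact criterion from the paper.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.4, 0.1, 0.1, 0.4]   # production probabilities of the nonterminal in context 1
Q = [0.1, 0.4, 0.4, 0.1]   # production probabilities of the nonterminal in context 2

divergence = kl(P, Q) + kl(Q, P)   # symmetrized for illustration
if divergence > 1.0:               # illustrative threshold, not from the paper
    print("split the nonterminal into two context-specific copies")
```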
Learning Terminator SCFGs
• extracted grammar from the literature (~120 productions)
• data set consists of 142 known E. coli terminators, 125 sequences that do not contain terminators
• learn parameters using Inside-Outside algorithm (an EM algorithm)
• consider adding nonterminals guided by three heuristics
  • KL divergence
  • chi-squared
  • random
Conclusions
• summary
  • we have developed an approach to predicting transcription units in bacterial genomes
  • we have predicted a complete set of transcription units for the E. coli genome
• advantages of the probabilistic grammar approach
  • can readily incorporate background knowledge
  • can simultaneously get a coherent set of predictions for a set of related elements
  • can be easily extended to incorporate other genomic elements
• current directions
  • expanding the vocabulary of elements modeled (genes, transcription factor binding sites, etc.)
  • handling overlapping elements
  • making predictions for multiple related genomes
Acknowledgements
• Craven Lab: Joe Bockhorst, Keith Noto
• David Page, Jude Shavlik
• Blattner Lab: Fred Blattner, Jeremy Glasner, Mingzhu Liu, Yu Qiu
• funding from National Science Foundation, National Institutes of Health