PLPTH 890 Introduction to Genomic Bioinformatics
Lecture 16. Biological Sequence Pattern Analysis
Liangjiang (LJ) Wang, ljwang@ksu.edu
March 8, 2005
Outline
• Basic concepts and biological problems.
• Regular expression for:
  • Pattern matching (sequence motifs),
  • Pattern discovery (promoter elements).
• Position Weight Matrix (PWM) for:
  • Pattern matching (TransFac, TESS, etc.),
  • Pattern discovery (MEME, Gibbs sampling).
• Hidden Markov Models (HMMs) for protein domain analysis (next lecture).
Biological Sequence Patterns
• In nucleotide sequences:
  • Transcription start and termination sites,
  • Promoter cis regulatory elements,
  • Intron/exon splice sites,
  • Translation start and stop sites,
  • mRNA cis regulatory elements.
• In protein sequences:
  • Functional motifs such as signal peptides,
  • Conserved protein domains.
Promoter cis Regulatory Elements
• Cells respond to various stimuli by regulating the expression of particular genes.
• Transcription factors regulate gene expression by binding to specific DNA sequence motifs.
• Transcription factor binding sites are often short (5-25 bases) and degenerate DNA motifs.
• Co-regulated genes may have common regulatory motifs in their promoters.
[Figure: MyoD helix-loop-helix (HLH) dimer, with helices H1 and H2 and loop L of each monomer, bound to the DNA site CAACTGAC.]
How to Represent a Sequence Pattern?
• Regular expressions:
  • A pattern is represented by a string of characters such as TATAAAA (the TATA box).
  • Ambiguous characters, wild-cards and gaps are allowed, but there is no position-specific information.
• Position Weight Matrices (PWM):
  • Also called Position-Specific Score Matrix (PSSM).
  • Often an ungapped pattern specified by a table.
• Stochastic models:
  • Hidden Markov Models (HMM), neural nets, etc.
  • Based on probability / machine learning theory.
Pattern Matching vs. Pattern Discovery
• Pattern matching:
  • Scanning a nucleotide or protein sequence for matches to a known pattern.
  • The major consideration is how to improve sensitivity and specificity.
• Pattern discovery:
  • Given a set of sequences, discovering a pattern that is shared by the sequences. The pattern is not known in advance.
  • Uses search or learning approaches.
  • A much harder problem than pattern matching.
Pattern Matching with RegExp
• A Regular Expression (RegExp) can represent:
  • Ambiguous character: e.g., [AG] or R.
  • Wild-card: e.g., X for any amino acid.
  • Gap: e.g., x(i, j) in PROSITE patterns.
• Pattern matching with regular expressions is straightforward, yet often very useful.
• For example, find all the Arabidopsis proteins that contain the following motif:
  [RK][LVI]X{5}[QH][LA]
  (These proteins may be targeted to the peroxisome.)
• Tool: Patmatch at TAIR (http://www.arabidopsis.org/).
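A minimal Python sketch of this kind of motif scan, assuming the protein sequence is already available as a plain string. The wild-card X of the slide is written as "." in Python's regular expression syntax; the function name and the test sequence are invented for illustration and are not part of Patmatch.

```python
import re

# Slide motif [RK][LVI]X{5}[QH][LA]; "X" (any amino acid) becomes "." here.
MOTIF = re.compile(r"[RK][LVI].{5}[QH][LA]")

def find_motif(protein):
    """Return (1-based start, matched text) for each non-overlapping match."""
    return [(m.start() + 1, m.group()) for m in MOTIF.finditer(protein)]

# Hypothetical sequence, made up for this example only.
print(find_motif("MSSRLAAAAAHLGG"))   # [(4, 'RLAAAAAHL')]
```

Note that `finditer` reports non-overlapping matches only; a lookahead pattern would be needed to report overlapping ones.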
Pattern Discovery Using RegExp
• Enumerate all the possible regular expression patterns with ambiguous characters over the alphabet {A, C, G, T, R, Y, S, W, M, K, V, H, D, B, N}, e.g., CWTNC, CRTGTW, YCGGAYRRAWG, ...
• Count the occurrences of all the patterns in the input sequences (word counting).
• Compute the statistical significance of each count based on the background distribution, e.g., as a z-score: $z = (N_{\text{observed}} - E[N]) / \sigma_N$.
• The method works for simple patterns such as short nucleotide motifs, but not for long and/or complex patterns.
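The enumerate, count and score steps can be sketched in Python as follows. This is an illustration under strong simplifying assumptions, not the method of any particular tool: the background is uniform and independent (each base has probability 1/4), window hits are treated as independent trials (a binomial approximation that ignores overlap effects), and only one strand is scanned. All function names are hypothetical.

```python
import itertools
import math
import re

# IUPAC nucleotide codes and the bases each one matches.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "M": "AC", "K": "GT", "V": "ACG",
         "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT"}

def count_matches(pattern, seqs):
    """Count matches of an IUPAC pattern, including overlapping ones."""
    body = "".join("[" + IUPAC[c] + "]" for c in pattern)
    regex = re.compile("(?=" + body + ")")   # lookahead keeps overlaps
    return sum(len(regex.findall(s)) for s in seqs)

def z_score(pattern, seqs):
    """z = (observed - expected) / sd under the assumed uniform background."""
    k = len(pattern)
    p = math.prod(len(IUPAC[c]) / 4 for c in pattern)   # hit prob. per window
    n = sum(max(len(s) - k + 1, 0) for s in seqs)       # number of windows
    sd = math.sqrt(n * p * (1 - p))
    if sd == 0:
        return 0.0                                      # e.g. an all-N pattern
    return (count_matches(pattern, seqs) - n * p) / sd

def top_patterns(seqs, k, n_top=5):
    """Score every length-k IUPAC pattern and return the n_top highest."""
    patterns = ("".join(t) for t in itertools.product(IUPAC, repeat=k))
    scored = [(z_score(pat, seqs), pat) for pat in patterns]
    return sorted(scored, reverse=True)[:n_top]
```

The search space grows as 15^k, which is one way to see why the slide restricts the approach to short, simple motifs.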
Applications to Promoter Analysis
• The RegExp pattern enumeration method has been used to find cis regulatory motifs that are statistically overrepresented in a given promoter sequence dataset:
  • Sinha and Tompa, 2002. Discovery of novel transcription factor binding sites by statistical overrepresentation. NAR, 30:5549-5560.
  • YMF is available at http://wingless.cs.washington.edu/YMF/YMFWeb/YMFInput.pl.
• Complete search: all motifs in the search space are enumerated and tested for statistical overrepresentation.
Problems with RegExp
• Do not specify the relative frequencies of nucleotides at a position.
• Cannot express the relative importance of a position for the pattern.
• Cannot capture a possible relationship between two positions.
Example: the aligned sites AGTCC, AGTCC, AGTAC, AGTAC, AGTGG, AGTGG, AACTT, AAGTT collapse to the regular expression ARBNB.
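How a set of aligned sites collapses into a degenerate consensus can be sketched in Python. The eight sites are the ones on the slide; the helper name `iupac_consensus` is hypothetical. The sketch also counts how many distinct sequences the consensus would accept, which illustrates the loss of frequency information listed above.

```python
import math

# IUPAC nucleotide codes and the bases each one matches.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "M": "AC", "K": "GT", "V": "ACG",
         "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT"}
# Inverse lookup: set of observed bases -> IUPAC code.
CODE = {frozenset(bases): code for code, bases in IUPAC.items()}

SITES = ["AGTCC", "AGTCC", "AGTAC", "AGTAC",
         "AGTGG", "AGTGG", "AACTT", "AAGTT"]

def iupac_consensus(sites):
    """Replace each alignment column by the code for its set of bases."""
    return "".join(CODE[frozenset(column)] for column in zip(*sites))

consensus = iupac_consensus(SITES)
n_accepted = math.prod(len(IUPAC[c]) for c in consensus)
print(consensus, n_accepted, len(set(SITES)))   # ARBNB 72 5
```

The consensus accepts 72 different sequences although only 5 distinct sites were observed, and it treats a rare base at a position the same as a common one.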
PWM Representation of a Motif
• A motif is assumed to have a fixed width, W.
• In the PWM, $p_{nk}$ is the probability (relative frequency) of nucleotide n in column k.
• Background probability: $p_{n0}$ is the probability of n in the background (i.e., outside the motif).
• Equal distribution: $p_{A0} = p_{C0} = p_{G0} = p_{T0} = 1/4$.
Example sites: AGTCC, AGTCC, AGTAC, AGTAC, AGTGG, AGTGG, AACTT, AAGTT.
Have we lost information here?
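As a small illustration, the PWM for the eight example sites can be computed by counting each nucleotide in each column. This is a sketch without pseudocounts; the function name `build_pwm` and the list-of-dicts layout are choices made here, not something the slides prescribe.

```python
from collections import Counter

SITES = ["AGTCC", "AGTCC", "AGTAC", "AGTAC",
         "AGTGG", "AGTGG", "AACTT", "AAGTT"]

def build_pwm(sites, alphabet="ACGT"):
    """Return one {nucleotide: relative frequency} dict per motif column."""
    n = len(sites)
    pwm = []
    for column in zip(*sites):          # iterate over alignment columns
        counts = Counter(column)
        pwm.append({base: counts[base] / n for base in alphabet})
    return pwm

pwm = build_pwm(SITES)
for k, column in enumerate(pwm, start=1):
    print(k, column)
```

Column 1 is all A and column 4 is uniform. The PWM keeps the per-column frequencies that the regular expression discarded, but it still scores each column independently of the others.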
Visualization of PWM Patterns
• The pattern captured by a multiple sequence alignment (MSA) or PWM may be visualized using a sequence logo.
• The Information Content (IC) of the nucleotide PWM at position k is:
  $$IC_k = 2 + \sum_{n \in \{A, C, G, T\}} p_{nk} \log_2 p_{nk}$$
  where $p_{nk}$ is the probability of n at position k, assuming equal background probability (1/4) for A, C, G and T.
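Information Content (IC)
• IC is a measure of a site's tolerance for substitution: high IC, low tolerance.
• With $IC_k = 2 + \sum_n p_{nk} \log_2 p_{nk}$ (equal background of 1/4, taking $0 \log_2 0 = 0$):
  • If $p_{A1} = 1$, $p_{C1} = 0$, $p_{G1} = 0$, $p_{T1} = 0$, then $IC_1 = 2 + 1 \cdot \log_2 1 = 2$ bits.
  • If $p_{A4} = 1/4$, $p_{C4} = 1/4$, $p_{G4} = 1/4$, $p_{T4} = 1/4$, then $IC_4 = 2 + 4 \cdot \tfrac{1}{4} \log_2 \tfrac{1}{4} = 0$ bits.
Example sites: AGTCC, AGTCC, AGTAC, AGTAC, AGTGG, AGTGG, AACTT, AAGTT.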
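The two cases above can be checked with a short Python sketch. It uses the equivalent form $\sum_n p_{nk} \log_2 (p_{nk} / \tfrac{1}{4})$, which equals $2 + \sum_n p_{nk} \log_2 p_{nk}$ when the background is uniform; the function name is hypothetical.

```python
import math

def information_content(column, background=0.25):
    """IC in bits of one PWM column, taking 0 * log(0) as 0."""
    return sum(p * math.log2(p / background)
               for p in column.values() if p > 0)

fully_conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}   # column 1
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}       # column 4

print(information_content(fully_conserved))   # 2.0
print(information_content(uniform))           # 0.0
```

Columns between these extremes, such as column 2 of the example sites (A 0.25, G 0.75), give values between 0 and 2 bits, which is what sets the letter-stack heights in a sequence logo.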
Pattern Matching with PWM
• Given a Position Weight Matrix (PWM) of a pattern, find all the occurrences of the pattern on the input sequence.
• Sliding window analysis: slide a window of width W along the sequence and match each window with the PWM.
• How to score a match? A log-odds score is often used:
  $$S = \sum_{k=1}^{W} \log \frac{p_{c_k k}}{q_{c_k}}$$
  where $c_k$ is the sequence character at position k of the window, $p_{c_k k}$ is the PWM entry for that character at position k, and $q_{c_k}$ is the background probability of $c_k$.
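A sliding-window scan with log-odds scoring might look like the following Python sketch. Several details are assumptions made here rather than taken from the slides: the pseudocount of 0.01 (added so that zero PWM entries do not produce log 0), the uniform background, the base-2 logarithm, the score threshold, and the restriction to sequences containing only A, C, G and T.

```python
import math

SITES = ["AGTCC", "AGTCC", "AGTAC", "AGTAC",
         "AGTGG", "AGTGG", "AACTT", "AAGTT"]
# Relative frequencies per column, computed from the example sites.
PWM = [{b: col.count(b) / len(col) for b in "ACGT"} for col in zip(*SITES)]

def log_odds_matrix(pwm, background=0.25, pseudo=0.01):
    """Convert probabilities to log2 odds; the pseudocount avoids log(0)."""
    return [{b: math.log2((p + pseudo) / (1 + 4 * pseudo) / background)
             for b, p in col.items()} for col in pwm]

def scan(seq, lod, threshold):
    """Return (1-based start, window, score) for windows scoring >= threshold."""
    W = len(lod)
    hits = []
    for j in range(len(seq) - W + 1):
        score = sum(lod[k][seq[j + k]] for k in range(W))
        if score >= threshold:
            hits.append((j + 1, seq[j:j + W], round(score, 2)))
    return hits

lod = log_odds_matrix(PWM)
print(scan("TTAGTCCTT", lod, threshold=5.0))
```

Raising the threshold trades sensitivity for specificity, which is the main tuning decision in pattern matching.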
Resources for Promoter Analysis
• TransFac (http://www.gene-regulation.com/):
  • A database on eukaryotic transcription factors (TF) and their DNA binding sites (PWMs).
  • Provides TF classification and search options.
• TESS (Transcription Element Search System at http://www.cbil.upenn.edu/cgi-bin/tess/tess?RQ=WELCOME):
  • A web tool for predicting TF binding sites.
  • Uses PWMs from TransFac and others.
• SCPD (http://cgsigma.cshl.org/jian/):
  • The promoter database of Saccharomyces cerevisiae.
  • Tools for site prediction and promoter retrieval.
Pattern Discovery Using PWM
• The problem:
  • Given a set of unaligned sequences, discover a PWM pattern shared by the sequences.
  • The pattern locations on the sequences are also unknown in advance.
• Two sets of parameters to estimate (or learn):
  • PWM of a potential pattern.
  • Pattern offset matrix.
• Algorithmic approaches:
  • Expectation Maximization.
  • Gibbs sampling.
[Figure: input sequences with the shared motif marked.]
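Pattern Offset Matrix
• The element $Z_{ij}$ of the pattern offset matrix Z is the probability that the pattern (given in p) starts at position j of sequence i ($X_i$).
• The probability of a sequence $X_i = c_1 c_2 \ldots c_L$ with the pattern starting at j is:
  $$P(X_i \mid Z_{ij} = 1, p) = \underbrace{\prod_{k=1}^{j-1} p_{c_k 0}}_{\text{before motif}} \; \underbrace{\prod_{k=j}^{j+W-1} p_{c_k,\,k-j+1}}_{\text{motif}} \; \underbrace{\prod_{k=j+W}^{L} p_{c_k 0}}_{\text{after motif}}$$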
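The before-motif / motif / after-motif product can be written as a short Python sketch. Turning those probabilities into one row of Z, as done in `offset_row`, additionally assumes exactly one motif occurrence per sequence and a uniform prior over start positions; both function names and the toy PWM are hypothetical.

```python
def seq_prob_given_start(seq, j, pwm, bg):
    """P(X_i | motif starts at 1-based position j).

    pwm[k][c] is the probability of character c in motif column k+1,
    bg[c] is the background probability of c.
    """
    W = len(pwm)
    prob = 1.0
    for pos, c in enumerate(seq, start=1):
        if j <= pos < j + W:
            prob *= pwm[pos - j][c]     # inside the motif
        else:
            prob *= bg[c]               # before or after the motif
    return prob

def offset_row(seq, pwm, bg):
    """One row of Z: normalised probabilities over all valid start positions."""
    W = len(pwm)
    weights = [seq_prob_given_start(seq, j, pwm, bg)
               for j in range(1, len(seq) - W + 2)]
    total = sum(weights)
    if total == 0:
        raise ValueError("no start position has non-zero probability")
    return [w / total for w in weights]

# Toy width-2 motif "AG" with a uniform background (illustrative values).
bg = {b: 0.25 for b in "ACGT"}
toy_pwm = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
           {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}]
print(seq_prob_given_start("TAGT", 2, toy_pwm, bg))   # 0.0625
print(offset_row("TAGT", toy_pwm, bg))                # [0.0, 1.0, 0.0]
```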
Expectation Maximization (EM)
Given: length W, sequence dataset
set initial values for p
do {
    re-estimate Z from p    (E-step)
    re-estimate p from Z    (M-step)
} until (change in p < ε)
return p, Z
[Figure: EM alternates between the E-step (p to Z) and the M-step (Z to p).]
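The E-step / M-step loop can be sketched in pure Python as below. This is an illustration under stated assumptions, not MEME's implementation: exactly one motif occurrence per sequence, a fixed uniform background that is not re-estimated, a random initial p, a pseudocount in the M-step, and sequences containing only A, C, G and T. The function name and all default values are hypothetical.

```python
import random

def em_motif(seqs, W, n_iter=100, eps=1e-6, pseudo=0.1, seed=0):
    """EM for a width-W motif, one occurrence per sequence (sketch)."""
    rng = random.Random(seed)
    bases = "ACGT"
    # Initial values for p: each column is a normalised random vector.
    p = []
    for _ in range(W):
        raw = [rng.random() + 0.5 for _ in bases]
        p.append({b: r / sum(raw) for b, r in zip(bases, raw)})
    Z = []
    for _ in range(n_iter):
        # E-step: Z[i][j] is proportional to the motif/background
        # likelihood ratio of the window starting at j (background 1/4).
        Z = []
        for s in seqs:
            row = []
            for j in range(len(s) - W + 1):
                ratio = 1.0
                for k in range(W):
                    ratio *= p[k][s[j + k]] / 0.25
                row.append(ratio)
            total = sum(row)
            Z.append([r / total for r in row])
        # M-step: expected base counts per column, plus pseudocounts.
        new_p = []
        for k in range(W):
            counts = {b: pseudo for b in bases}
            for s, row in zip(seqs, Z):
                for j, z in enumerate(row):
                    counts[s[j + k]] += z
            total = sum(counts.values())
            new_p.append({b: counts[b] / total for b in bases})
        change = max(abs(new_p[k][b] - p[k][b])
                     for k in range(W) for b in bases)
        p = new_p
        if change < eps:
            break
    return p, Z   # Z comes from the last E-step, i.e. before the final update of p

# Hypothetical toy input, made up for illustration.
p, Z = em_motif(["ACGTAGTCCA", "TTAGTCCGGA", "CAGTCCTTTT"], W=5)
```

Because the result depends on the initial p, running the sketch with several seeds and keeping the highest-likelihood model is a common precaution.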
More about the EM Algorithm
• EM is a heuristic algorithm for discovering PWM motifs shared by a set of sequences.
• EM converges to a local maximum in the likelihood of the data given the model p:
  $$P(X \mid p) = \prod_i \sum_j P(X_i, Z_{ij} = 1 \mid p)$$
• EM usually converges in a small number of iterations.
• EM is sensitive to its starting point (i.e., the initial values in p).
MEME
• MEME (Multiple EM for Motif Elicitation) is widely used for motif discovery.
• MEME is based on the EM algorithm with several extensions.
• MEME is available at http://meme.sdsc.edu/meme/website/meme.html.
• Example: the dataset contains 30 yeast promoters from a co-regulated gene cluster. These genes are mostly involved in respiration, and are co-regulated in various stress conditions.
• What is the TF binding site in the shared motif?
The MEME Algorithm
MEME(dataset, W, NSITES, PASSES) {
    for i = 1 to PASSES {
        for each subsequence in dataset {
            run EM for 1 iteration with starting point derived from this subsequence
        }
        choose the motif model with the highest likelihood
        run EM to convergence from the starting point which generated that model
        print the converged model of the shared motif
        erase appearances of the motif from the dataset
    }
}
MEME Enhancements to the Basic EM Approach
• Trying many starting points by using every distinct subsequence of length W in the dataset.
• Not assuming that there is exactly one motif occurrence in every sequence.
• Allowing multiple motifs to be learned.
Gibbs Sampling
• For motif discovery, Gibbs sampling can be viewed as a stochastic analog of EM:
  • In the EM algorithm, we maintain a distribution $Z_i$ over the possible motif starting positions for each sequence;
  • In the Gibbs sampling approach, we maintain a specific starting position for each sequence, but keep re-sampling the starting positions.
• Gibbs sampling may be less susceptible than EM to getting trapped in local maxima of the likelihood.
A Gibbs Sampling Algorithm
Given: length W, sequence dataset
choose random motif positions for a
do {
    pick a sequence Xi
    estimate p using motif positions in a (all sequences but Xi)    (update step)
    sample a new motif position ai for Xi    (sampling step)
} until (change in p < ε)
return p, a
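A pure-Python sketch of this sampler is given below. It departs from the pseudocode in ways that should be read as assumptions: it runs a fixed number of sweeps instead of testing the change in p, visits the sequences in order rather than picking one at random, uses a uniform background of 1/4 and a pseudocount of 0.5, and expects sequences containing only A, C, G and T. The function name and defaults are hypothetical and are not those of the Gibbs Motif Sampler or AlignACE.

```python
import random

def gibbs_motif(seqs, W, n_sweeps=200, pseudo=0.5, seed=0):
    """Gibbs sampling for a width-W motif, one site per sequence (sketch)."""
    rng = random.Random(seed)
    bases = "ACGT"
    # Choose random (0-based) motif start positions a[i].
    a = [rng.randrange(len(s) - W + 1) for s in seqs]

    def estimate_p(skip):
        """PWM from the current sites of every sequence except index `skip`."""
        p = []
        for k in range(W):
            counts = {b: pseudo for b in bases}
            for i, s in enumerate(seqs):
                if i != skip:
                    counts[s[a[i] + k]] += 1
            total = sum(counts.values())
            p.append({b: c / total for b, c in counts.items()})
        return p

    for _ in range(n_sweeps):
        for i, s in enumerate(seqs):
            p = estimate_p(skip=i)                   # update step
            weights = []
            for j in range(len(s) - W + 1):
                w = 1.0
                for k in range(W):
                    w *= p[k][s[j + k]] / 0.25       # odds vs. background
                weights.append(w)
            # Sampling step: draw a new start in proportion to the weights.
            a[i] = rng.choices(range(len(s) - W + 1), weights=weights)[0]
    return estimate_p(skip=None), a

# Hypothetical toy input, made up for illustration.
seqs = ["ACGTAGTCCA", "TTAGTCCGGA", "CAGTCCTTTT"]
p, a = gibbs_motif(seqs, W=5, n_sweeps=50)
```

Because positions are sampled rather than chosen greedily, the sampler can leave a poor configuration that a deterministic update would stay in.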
Gibbs Motif Sampler and AlignACE
• Gibbs Motif Sampler:
  • Based on the work by Lawrence et al., 1993. Science, 262:208-214.
  • Available at http://bayesweb.wadsworth.org/gibbs/gibbs.html.
• AlignACE:
  • Based on the Gibbs sampling algorithm with several extensions.
  • Available at http://atlas.med.harvard.edu/.
Summary
• For simple sequence patterns, a regular expression is a useful tool.
• For some complex sequence patterns, a position weight matrix (PWM) is preferred.
• Expectation Maximization (EM) and Gibbs sampling are two useful approaches for sequence pattern discovery.
• Next: protein domain analysis using HMMs.
Reading
• (Optional) Lawrence et al., 1993. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208-214.
• Eddy, 2004. What is a hidden Markov model? Nature Biotechnology, 22:1315-1316.
• Eddy, 1998. Multiple alignment and multiple sequence based searches. Trends Guide to Bioinformatics, 15-18.
For This Week's Lab
• Collect a set of promoter sequences (10-500 sequences in FASTA format) from co-regulated or related genes. The promoter sequences should be the 500-1500 nucleotides upstream of the transcription start sites.
• Collect a set of protein sequences (10-50 sequences in FASTA format) from a gene family or superfamily.