Bioinformatics Ayesha M. Khan Spring 2013
What’s in a secondary database? • Within multiple alignments, conserved motifs can be found that reflect shared structural or functional characteristics of the constituent sequences. • Such conserved motifs may be used to build characteristic signatures that aid family and/or functional diagnosis of newly determined sequences. Lec-7
Conservation patterns: functional cues • Conserved positions provide functional cues, e.g. the amino acids consistently found at enzyme active sites, or the nucleotides associated with transcription factor binding sites. • Example: ATP/GTP-binding proteins [figure]
Conservation patterns: functional cues (contd.) • Example: the GAL4 binding sequence [figure]
So what exactly is a pattern? • A pattern describes a motif using a qualitative consensus sequence. • Early patterns were reported as consensus sequences: composite sequences consisting of the most common residue occurring at each position of an alignment. • A later approach stores the pattern as a regular expression. A regular expression is more flexible than a consensus sequence because more than one residue can be allowed at each position. Many patterns can be described as regular expressions.
Patterns • Use a regular expression (reducing the sequence data to a consensus). • Mismatches are not tolerated. • E.g., [GA]-[IMFAT]-H-[LIVF]-H-{S}-x-[GP]-[SDG]-x-[STAGDE] • Each position in the pattern is separated by a hyphen. • x can match any residue. • [ ] indicates an ambiguous position: any of the listed residues is allowed. • { } indicates residues that are not allowed at that position. • ( ) surrounds repeated residues, e.g. A(3) means AAA.
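The pattern syntax above maps almost token-for-token onto standard regular expressions. The following is a minimal sketch in Python; the function name `prosite_to_regex` and the test sequence are illustrative, not from the lecture:

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression.

    [..] -> character class, {..} -> negated class, x -> any residue,
    (n) or (n,m) -> repeat count. Hyphens separate positions.
    """
    regex = ""
    for token in pattern.split("-"):
        # Split off a trailing repeat count such as A(3) or x(2,4)
        m = re.match(r"^(.+?)(?:\((\d+(?:,\d+)?)\))?$", token)
        core, count = m.group(1), m.group(2)
        if core == "x":
            part = "."                       # any residue
        elif core.startswith("["):
            part = core                      # ambiguous position
        elif core.startswith("{"):
            part = "[^" + core[1:-1] + "]"   # forbidden residues
        else:
            part = core                      # literal residue
        if count:
            part += "{" + count + "}"
        regex += part
    return regex

pat = prosite_to_regex("[GA]-[IMFAT]-H-[LIVF]-H-{S}-x-[GP]-[SDG]-x-[STAGDE]")
print(pat)
print(bool(re.search(pat, "GAHLHAAGDAS")))   # a sequence built to match
```

Because mismatches are not tolerated, a single forbidden residue (e.g. an S at the {S} position) makes the whole pattern fail.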
“Rules” • “Rules” are much shorter, more generic patterns that are not associated with specific protein families. • They may denote sugar attachment sites, phosphorylation or hydroxylation sites, etc. • N-glycosylation site: N-{P}-[ST]-{P} • Protein kinase C phosphorylation site: [ST]-x-[RK] • Realistically, short motifs can only provide a guide as to whether a certain type of functional site might exist in a sequence; this must be verified by experiment.
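Scanning a sequence against such rules is a simple regex search. A sketch, assuming the two rules above written as Python regexes (the `RULES` dictionary, `scan_rules` helper, and test sequence are illustrative):

```python
import re

# The lecture's two rules, rewritten as Python regexes
RULES = {
    "N-glycosylation site":     r"N[^P][ST][^P]",
    "PKC phosphorylation site": r"[ST].[RK]",
}

def scan_rules(seq):
    """Return every (rule name, start position, matched fragment) in seq."""
    hits = []
    for name, rx in RULES.items():
        for m in re.finditer(rx, seq):
            hits.append((name, m.start(), m.group()))
    return hits

for hit in scan_rules("MKNVSALSTR"):
    print(hit)
```

As the slide notes, hits like these are only candidate sites; short motifs occur by chance, so each hit must be verified experimentally.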
Consensus sequences • The consensus sequence method is the simplest way to build a model from a multiple sequence alignment. • The consensus sequence is built using the following rules: • Majority wins: the most frequent residue in a column is chosen. • Columns with too much variation are skipped (left ambiguous).
Consensus sequences (contd.) Advantages: • Very fast and easy to implement. Limitations: • The model retains no information about variation within columns. • Very dependent on the training set. • No scoring, only a binary result (YES/NO). When to use it? • Useful for finding highly conserved signatures, for example restriction enzyme sites in DNA.
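The two rules ("majority wins", "skip too much variation") can be sketched in a few lines of Python. The 50% majority cutoff and the use of 'x' for ambiguous columns are assumptions for illustration; the lecture does not fix these values:

```python
from collections import Counter

def consensus(alignment, threshold=0.5):
    """Majority-wins consensus of equal-length aligned sequences.

    The most frequent residue in a column must occur in more than
    `threshold` of the sequences; otherwise the column is considered
    too variable and written as 'x'.
    """
    result = []
    n = len(alignment)
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        result.append(residue if count / n > threshold else "x")
    return "".join(result)

msa = ["GATTCA",
       "GATGCA",
       "GACTCA",
       "AATTCA"]
print(consensus(msa))  # GATTCA
```

Note the limitation from the slide: the output is a single string, so all information about how variable each column was is discarded.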
In cases of extreme sequence divergence: • The following approaches can be used to identify distantly related members of a family of protein (or DNA) sequences: • Position-specific scoring matrix (PSSM) • Profile • Hidden Markov model • These methods provide a statistical framework in which the probability of residues or nucleotides at specific positions is assessed. Thus, information on all the members of the multiple alignment is retained.
Sequence profiles • A sequence profile is a position-specific scoring matrix (PSSM) that gives a quantitative description of a sequence motif. • Unlike deterministic patterns, profiles assign a score to a query sequence and are widely used for database searching. • A simple PSSM has as many columns as there are positions in the alignment, and either 4 rows (one for each DNA nucleotide) or 20 rows (one for each amino acid).
PSSM • M_kj = log2(p_kj / p_j): score for nucleotide j at position k • p_kj: probability of nucleotide j at position k • p_j: “background” probability of nucleotide j
Computing a PSSM • C_kj: count of nucleotides of type j at position k • Z: total number of aligned sequences • p_j: background probability of nucleotide j • p_kj = C_kj / Z: probability of nucleotide j at position k
Computing a PSSM… [worked example shown in the original slides]
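The steps above can be sketched directly from the definitions, assuming the usual log-odds score M_kj = log2(p_kj / p_j). The pseudocount (to avoid log(0) for unseen nucleotides), the example sites, and the uniform background are illustrative additions, not part of the lecture's formula:

```python
import math
from collections import Counter

def pssm(alignment, background=None, pseudocount=1.0):
    """Position-specific scoring matrix from aligned DNA sequences.

    For each position k and nucleotide j:
        p_kj = (C_kj + pseudocount) / (Z + 4 * pseudocount)
        M_kj = log2(p_kj / p_j)
    """
    alphabet = "ACGT"
    background = background or {j: 0.25 for j in alphabet}
    z = len(alignment)
    matrix = []
    for column in zip(*alignment):
        counts = Counter(column)
        row = {}
        for j in alphabet:
            p_kj = (counts[j] + pseudocount) / (z + 4 * pseudocount)
            row[j] = math.log2(p_kj / background[j])
        matrix.append(row)
    return matrix

def score(matrix, seq):
    """Score a candidate site by summing per-position log-odds."""
    return sum(row[nt] for row, nt in zip(matrix, seq))

sites = ["TATAAT", "TATAAA", "TACAAT", "TATATT"]
m = pssm(sites)
print(round(score(m, "TATAAT"), 2))
```

Unlike a consensus sequence, the matrix keeps the per-column variation, so a near-miss site still scores well while an unrelated sequence scores strongly negative.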
PSI-BLAST: Position-Specific Iterated BLAST • Many proteins in a database are too distantly related to a query to be detected using standard BLAST. • In many other cases matches are detected but are so distant that the inference of homology is unclear. • Enter the more sensitive PSI-BLAST.
PSI-BLAST scheme [figure]
PSI-BLAST… • The search process is continued iteratively, typically for about 5 iterations, and at each step a new PSSM is built. • The search can be stopped at any point, typically when few new hits are returned or no new sensible hits are found.
PSI-BLAST errors • Unrelated hits: how can they be avoided? • Perform multi-domain splitting of your query sequence. • Inspect each PSI-BLAST iteration, removing suspicious hits. • Lower the Expect level (E-value) threshold.
Markov model • A Markov chain describes a series of events or states. • There is a certain probability of moving from one state to the next, known as the transition probability. • The probability of the next state depends only on the current state, not on previous states (the Markov property). • In a Markov model, all states are observable.
Hidden Markov model • A Markov model may consist of observable states and unobservable, or “hidden”, states. • The hidden states also affect the outcome of the observed states. • In a sequence alignment, a gap is an unobserved state that influences the probability of the next nucleotide. • In DNA there are four symbols: G, A, T and C (20 in proteins). The probability value associated with each symbol is the emission probability.
Markov model: example [figure showing emission and transition probabilities] • This particular Markov model has a probability of 0.80 × 0.40 × 0.32 ≈ 0.102 of generating the sequence AG. • The model shows that the sequence AT has the highest probability of occurring. • Where do these numbers come from? A Markov model has to be “trained” with examples.
Hidden Markov model… • The frequencies of occurrence of nucleotides in a multiple sequence alignment are used to calculate the emission and transition probabilities of each symbol at each state. • The trained HMM is then used to test how well a new sequence fits the model. • A state can be: a match/mismatch (a mismatch is a low-probability match) (observable), an insertion (hidden), or a deletion (hidden).
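Testing how well a sequence fits a trained HMM means summing over every possible hidden-state path, which the forward algorithm does efficiently. A toy sketch with two hypothetical hidden states ("GC-rich" and "AT-rich"); this simplified model and all its numbers are illustrative, not the lecture's profile HMM with match/insert/delete states:

```python
def forward(obs, states, start, trans, emit):
    """Forward algorithm: P(obs) under an HMM, summing over all
    hidden-state paths instead of enumerating them."""
    # Initialise with start probability times first emission
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        # Each new column sums over all predecessors, then emits o
        f = {s: sum(f[r] * trans[r][s] for r in states) * emit[s][o]
             for s in states}
    return sum(f.values())

# Toy two-state model: hidden states emit DNA symbols with different biases.
states = ["GC", "AT"]
start = {"GC": 0.5, "AT": 0.5}
trans = {"GC": {"GC": 0.9, "AT": 0.1}, "AT": {"GC": 0.1, "AT": 0.9}}
emit = {"GC": {"G": 0.4, "C": 0.4, "A": 0.1, "T": 0.1},
        "AT": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4}}
print(forward("GGCA", states, start, trans, emit))
```

The hidden state is never observed directly, but it shapes the probability of each emitted symbol, which is exactly the property the slide describes.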
Markov models (contd.) Example: a general Markov chain modeling DNA [figure]. Note that any sequence can be traced through the model by passing from one state to the next via transitions. A Markov chain is defined by: • A finite set of states, S1, S2, S3, …, SN • A set of transition probabilities, aij • An initial state probability distribution, πi
Markov chain example: x = {a, b} • We observe the following sequence: abaaababbaa • Transition probabilities and initial state probabilities can be estimated from the observed data [figure].
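The transition probabilities can be estimated from the observed sequence by counting transitions and normalising per source state. A sketch (the function name `estimate_transitions` is illustrative):

```python
from collections import defaultdict

def estimate_transitions(seq):
    """Maximum-likelihood transition probabilities from one observed
    state sequence: count each a->b transition, then normalise the
    counts out of every state."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(row.values()) for b, n in row.items()}
            for a, row in counts.items()}

probs = estimate_transitions("abaaababbaa")
print(probs["a"])  # from 'a': 3 of 6 transitions go to 'b', 3 stay at 'a'
print(probs["b"])  # from 'b': 3 of 4 transitions go to 'a'
```

The initial state probability can be estimated the same way, from the frequency of each symbol at the start of the training sequences.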
Markov models (contd.) Typical questions we can ask with Markov chains are: • What is the probability of being in a particular state at a particular time? (By “time” here we can read position in the query sequence.) • What is the probability of seeing a particular sequence of states? (i.e., the score for a particular query sequence given the model)
Markov chains: positional dependencies • The connectivity, or topology, of a Markov chain can easily be designed to capture dependencies and variable-length motifs.
Markov chains: boundary detection • Given a sequence, we wish to label each symbol according to its class (e.g. transmembrane vs. extracellular/cytosolic regions). • How is this possible?
Markov chains: boundary detection (contd.) • Given a training set of labeled sequences, we can begin by modeling each amino acid as hydrophobic (H) or hydrophilic (L), i.e. reducing the 20 amino acids to two classes. • A peptide sequence can then be represented as a sequence of Hs and Ls, e.g. HHHLLHLHHLHL…
Markov chains: boundary detection (contd.) • A simpler question: is a given sequence a transmembrane sequence? • A Markov chain for recognizing transmembrane sequences [figure] • Question: is the sequence HHLHH a transmembrane protein? • P(HHLHH) = 0.6 × 0.7 × 0.7 × 0.3 × 0.7 × 0.7 ≈ 0.043
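One common way to turn such a probability into a yes/no answer is to compare the transmembrane chain against a background chain and decide by log-odds. A sketch; the transition and start probabilities below are illustrative (the slide's chain diagram is not reproduced here), as is the uniform background model:

```python
import math

def chain_log_prob(seq, start, trans):
    """log P(seq) under a first-order Markov chain."""
    lp = math.log(start[seq[0]])
    for a, b in zip(seq, seq[1:]):
        lp += math.log(trans[a][b])
    return lp

# Two competing chains over H (hydrophobic) / L (hydrophilic) symbols;
# all numbers are illustrative, not the lecture's figures.
tm = ({"H": 0.6, "L": 0.4},
      {"H": {"H": 0.7, "L": 0.3}, "L": {"H": 0.7, "L": 0.3}})
bg = ({"H": 0.5, "L": 0.5},
      {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.5, "L": 0.5}})

def is_transmembrane(seq):
    """Classify by log-odds: does the transmembrane chain explain the
    sequence better than the background chain?"""
    return chain_log_prob(seq, *tm) > chain_log_prob(seq, *bg)

print(is_transmembrane("HHLHH"))
print(is_transmembrane("LLLLL"))
```

With these numbers the H-rich query HHLHH is called transmembrane while an all-hydrophilic sequence is not, which is the kind of decision the boundary-detection question is asking for.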