SEQUENCE ANALYSIS By Jyotika Bhati
Bioinformatics The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology OR Biologists doing “stuff” with computers?
What is a Sequence? • A sequence is an ordered list of objects (or events). • A biological sequence is a single, continuous molecule of nucleic acid or protein. • Sequence analysis in bioinformatics is an automated, computer-based examination of characteristic fragments, e.g. of a DNA strand. • The term "sequence analysis" in biology implies subjecting a DNA or peptide sequence to sequence alignment, sequence database searches, repeated sequence searches, or other bioinformatics methods on a computer.
Nucleotide Sequence Databases • GenBank at NCBI (National Center for Biotechnology Information) • EMBL (European Molecular Biology Laboratory) • DDBJ (DNA DataBank of Japan)
Protein Sequence Database • SWISS-PROT • TrEMBL
Sequence Alignment • The identification of residue-residue correspondences • The basic tool in bioinformatics WHY Sequence Alignment ? • For discovering functional, structural and evolutionary information in biological sequences • Eases further tasks like: • Annotation of new sequences • Modeling of protein structures • Design and analysis of gene expression experiments
Basic Steps in Sequence Alignment • Comparison of sequences to find similarity and dissimilarity in compared sequences • Identification of gene-structures, reading frames, distributions of introns and exons and regulatory elements • Finding and comparing point mutations to get the genetic marker • Revealing the evolutionary and genetic diversity • Function annotation of genes.
The Concept • An alignment is a mutual arrangement of two sequences • Exhibits where two sequences are similar, and where they differ • An ‘optimal’ alignment – most correspondences and the least differences • Sequences that are similar probably have the same function Sequence alignment involves the identification of the correct location of deletions and insertions that have occurred in either of the two lineages since the divergence from a common ancestor.
Terms of sequence comparison • Sequence identity: exactly the same nucleotide or amino acid in the same position • Sequence similarity: substitutions with similar chemical properties • Sequence homology: general term that indicates evolutionary relatedness among sequences; sequences are homologous if they are derived from a common ancestral sequence.
Homology • Homology designates a qualitative relationship of common descent between entities • Two genes are either homologs or not! • It doesn’t make sense to say “two genes are 43% homologous”, just as it doesn’t make sense to say “John is 43% diabetic” • Two genes are orthologs if they originated from a single ancestral gene in the most recent common ancestor of their respective genomes • Two genes are paralogs if they are related by duplication
Things to consider • To find the best alignment one needs to examine all possible alignments • To reflect the quality of the possible alignments one needs to score them • There can be different alignments with the same highest score • Variations in the scoring scheme may change the ranking of alignments
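The last two points can be illustrated with a small sketch. The function name `score_alignment`, the use of "-" for a gap, the toy sequences, and the default scores (+1 match, -1 mismatch, -1 gap) are assumptions made for illustration and do not come from the slides.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Sum the column scores of two equal-length aligned rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap        # a residue paired with a gap
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# Two candidate alignments of the same toy pair ACGTA / AGTAC.
ungapped = ("ACGTA", "AGTAC")
gapped = ("ACGTA-", "A-GTAC")

for gap_penalty in (-1, -4):
    s_ungapped = score_alignment(*ungapped, gap=gap_penalty)
    s_gapped = score_alignment(*gapped, gap=gap_penalty)
    winner = "gapped" if s_gapped > s_ungapped else "ungapped"
    print(f"gap={gap_penalty}: ungapped={s_ungapped}, gapped={s_gapped}, best={winner}")
```

With a mild gap penalty the gapped alignment ranks first; with a harsh one the ungapped alignment overtakes it, even though the sequences are unchanged.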
Manual alignment • When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection. • Advantages: (1) use of a powerful and trainable tool (the brain, well… some brains); (2) ability to integrate additional data. • Disadvantage: the method is subjective and unscalable.
Types of Alignment • Pairwise Alignment: Dot Matrix Method, Dynamic Programming, Word Method • Multiple Alignment: Dynamic Programming, Progressive Methods, Iterative Methods, Motif Finding
Pairwise Sequence Alignment • One pair of elements at a time • Challenge: find the optimum alignment of two sequences with some degree of similarity • Optimality is based on a score • The score reflects the number of paired characters in the two sequences and the number and length of gaps introduced to adjust the sequences so that the maximum number of characters are in alignment
A pairwise alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs: (1) matches = the same nucleotide appears in both sequences; (2) mismatches = different nucleotides are found in the two sequences; (3) gaps = a base in one sequence and a null base in the other. Example (the slide labels one gap, one match and one mismatch column):
GCGGCCCATCAGGTACTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
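As a rough sketch, the three pair types can be counted column by column. This assumes the slide's example reads as the two 24-column rows used below (the spacing in the extracted text is ambiguous), and the helper name `classify_columns` is made up for illustration.

```python
from collections import Counter


def classify_columns(row_a, row_b, gap_char="-"):
    """Count match, mismatch and gap columns of two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    kinds = Counter()
    for a, b in zip(row_a, row_b):
        if a == gap_char or b == gap_char:
            kinds["gap"] += 1
        elif a == b:
            kinds["match"] += 1
        else:
            kinds["mismatch"] += 1
    return kinds


# Assumed reading of the example alignment from the slide.
row_1 = "GCGGCCCATCAGGTACTTGGTG-G"
row_2 = "GCGTTCCATC--CTGGTTGGTGTG"

counts = classify_columns(row_1, row_2)
print(dict(counts))
```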
Dot Matrix Method • Established in 1970 by A.J. Gibbs and G.A. McIntyre • Method for comparing two nucleotide or amino acid sequences • Each sequence forms one axis of the grid • A dot is placed at every intersection where the same letter appears in both sequences • Scanning the graph for a run of dots along a diagonal reveals similarity, i.e. a string of the same characters • Longer sequences can also be compared on a single page by using smaller dots
Dot Matrix Method • the dot matrix method reveals the presence of insertions or deletions • comparing a single sequence to itself can reveal the presence of a repeat of a subsequence • self comparison can reveal several features: – similarity between chromosomes – tandem genes – repeated domains in a protein sequence – regions of low sequence complexity (same characters are often repeated)
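A minimal sketch of the dot matrix idea, assuming the simplest rule (a dot wherever the two letters are identical, with no window or threshold filtering). The names `dot_matrix` and `render` are illustrative only.

```python
def dot_matrix(seq_x, seq_y):
    """Return a grid of booleans: rows follow seq_x, columns follow seq_y."""
    return [[a == b for b in seq_y] for a in seq_x]


def render(seq_x, seq_y):
    """Draw the grid as text, '*' for a dot and '.' for an empty cell."""
    lines = ["  " + " ".join(seq_y)]
    for letter, row in zip(seq_x, dot_matrix(seq_x, seq_y)):
        lines.append(letter + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("ATCG", "TCG"))
# Comparing a sequence with itself exposes repeats as off-diagonal runs.
print(render("ATATAT", "ATATAT"))
```

In the first plot the three dots fall on one diagonal, which is the visual signature of the shared substring TCG.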
Tools generating Dot Matrices • Dotlet (Java based web-application) http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html • Compare & dotplot programmes in GCG Wisconsin Package (Genetics Computer Group [commercial]) • GeneAssist package of ABI/Perkin Elmer • DOTTER (available on dapsas, UNIX X-Windows) • DNA Strider (Macintosh only)
Dot Matrix Methods • When to use: – as a first pass, unless the sequences are already known to be very much alike • Demerits: – does not readily resolve similarity that is interrupted by insertions or deletions – difficult to find the best possible alignment (optimal alignment) – most computer programs do not show an actual alignment
Pairwise alignment: the problem The number of possible pairwise alignments increases explosively with the length of the sequences: two protein sequences of length 100 amino acids can be aligned in approximately 10^60 different ways. The time needed to test all possibilities is of the same order of magnitude as the entire lifetime of the universe.
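The slide does not say how its estimate is derived. One commonly quoted approximation, assumed here, counts the ways of interleaving two sequences of length n as the binomial coefficient C(2n, n); other counting conventions give different (larger) numbers. The sketch below only shows how quickly that quantity grows.

```python
import math


def approx_alignment_count(n):
    """Binomial coefficient C(2n, n), used here as a rough count of the
    global alignments of two sequences of length n (an assumed convention)."""
    return math.comb(2 * n, n)


for n in (5, 10, 50, 100):
    count = approx_alignment_count(n)
    # len(str(count)) - 1 is the power of ten of the leading digit.
    print(f"n={n}: about 10^{len(str(count)) - 1}")
```

Under this convention two sequences of length 100 give a count on the order of 10^58 to 10^59, comparable in scale to the slide's estimate and far beyond exhaustive enumeration.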
Global versus local alignments • Global alignment: aligns the full length of both sequences (the Needleman-Wunsch algorithm). • Local alignment: finds the best partial alignment of two sequences (the Smith-Waterman algorithm). [Figure: Seq 1 and Seq 2 shown under a global alignment and under a local alignment]
Global Sequence Alignment • The Needleman–Wunsch algorithm performs a global alignment • An example of dynamic programming • First application of dynamic programming to biological sequence comparison • Suitable when the two sequences are of similar length, with a significant degree of similarity throughout • Aim: The best alignment over the entire length of two sequences
Steps in NW Algorithm • Initialization • Scoring • Trace back (Alignment) Consider two DNA sequences to be globally aligned: ATCG (X = 4, length of sequence 1) and TCG (Y = 3, length of sequence 2)
Why Gap Penalties? • The optimal alignment of two similar sequences is usually that which • maximizes the number of matches and • minimizes the number of gaps. • There is a tradeoff between these two • - adding gaps reduces mismatches • Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences. • Penalizing gaps forces alignments to have relatively few gaps.
Initialization Step • Create a matrix with X + 1 rows and Y + 1 columns • The first row and the first column of the score matrix are filled with successive multiples of the gap penalty (0, g, 2g, ...)
Scoring • The score of any cell C(i, j) is the maximum of: scorediag = C(i-1, j-1) + S(i, j) scoreup = C(i-1, j) + g scoreleft = C(i, j-1) + g where S(i, j) is the substitution score for letters i and j, and g is the gap penalty
Scoring …. • Example: The calculation for the cell C(2, 2), with a mismatch score of -1 and a gap penalty of -1: scorediag = C(i-1, j-1) + S(i, j) = 0 + -1 = -1; scoreup = C(i-1, j) + g = -1 + -1 = -2; scoreleft = C(i, j-1) + g = -1 + -1 = -2; so C(2, 2) = max(-1, -2, -2) = -1
Scoring …. • Final Scoring Matrix Note: In a global alignment the last cell always holds the score of the optimal alignment, here 2
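The initialization and scoring steps can be sketched as follows. The scores +1 (match), -1 (mismatch) and -1 (gap) are inferred from the worked example rather than stated on the slides, and `nw_matrix` is an illustrative name. Python lists are 0-based, so the slide's cell C(2, 2) is `matrix[1][1]` here.

```python
def nw_matrix(seq_1, seq_2, match=1, mismatch=-1, gap=-1):
    """Fill a Needleman-Wunsch score matrix (rows: seq_1, columns: seq_2)."""
    rows, cols = len(seq_1) + 1, len(seq_2) + 1
    matrix = [[0] * cols for _ in range(rows)]

    # Initialization: first column and first row hold multiples of the gap penalty.
    for i in range(1, rows):
        matrix[i][0] = i * gap
    for j in range(1, cols):
        matrix[0][j] = j * gap

    # Scoring: each cell is the best of the diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_1[i - 1] == seq_2[j - 1] else mismatch
            matrix[i][j] = max(
                matrix[i - 1][j - 1] + sub,  # diagonal: pair the two letters
                matrix[i - 1][j] + gap,      # up: letter of seq_1 against a gap
                matrix[i][j - 1] + gap,      # left: letter of seq_2 against a gap
            )
    return matrix


matrix = nw_matrix("ATCG", "TCG")
for row in matrix:
    print(row)
print("optimal global score:", matrix[-1][-1])
```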
Trace back • The trace back step determines the actual alignment(s) that result in the maximum score • There are likely to be multiple maximal alignments • Trace back starts from the last cell, i.e. position X, Y in the matrix • Gives alignment in reverse order
Trace back …. • There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left • Trace back takes the current cell and looks to the neighbor cells that could be direct predecessors. This means it looks to the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1). The algorithm for trace back chooses as the next cell in the sequence one of the possible predecessors
Trace back …. • The only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of Seq 1: G | Seq 2: G
Trace back …. • Final Trace back Best Alignment:
A T C G
  | | |
_ T C G
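A compact sketch of the whole procedure, including trace back, is given below. It assumes the same scores as the worked example (+1 match, -1 mismatch, -1 gap), puts sequence 1 on the rows, and breaks ties by preferring the diagonal move, then up, then left; that tie-breaking order is an arbitrary choice, since the slides allow any predecessor to be picked. The function name `needleman_wunsch` is illustrative.

```python
def needleman_wunsch(seq_1, seq_2, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_1, aligned_2) for one optimal global alignment."""
    rows, cols = len(seq_1) + 1, len(seq_2) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if seq_1[i - 1] == seq_2[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the last cell to the top-left corner.
    out_1, out_2 = [], []
    i, j = len(seq_1), len(seq_2)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_1.append(seq_1[i - 1])   # diagonal: match or mismatch
            out_2.append(seq_2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_1.append(seq_1[i - 1])   # up: gap in the column sequence
            out_2.append("-")
            i -= 1
        else:
            out_1.append("-")            # left: gap in the row sequence
            out_2.append(seq_2[j - 1])
            j -= 1

    # The alignment was collected in reverse order, so flip it.
    return score[-1][-1], "".join(reversed(out_1)), "".join(reversed(out_2))


best_score, top, bottom = needleman_wunsch("ATCG", "TCG")
print(best_score)
print(top)
print(bottom)
```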
Local Sequence Alignment • The Smith-Waterman algorithm performs a local alignment on two sequences • It is an example of dynamic programming • Useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context • Aim: The best alignment over the conserved domain of two sequences
Differences between the Needleman-Wunsch and Smith-Waterman Algorithms In Smith-Waterman: • In the initialization stage, the first row and first column are all filled with 0s • While filling the matrix, any score that would be negative is replaced by 0 • The traceback starts at the cell with the highest score and works back until a cell with a score of 0 is reached.
Three steps in Smith-Waterman Algorithm • Initialization • Scoring • Trace back (Alignment) Consider two DNA sequences to be locally aligned: ATCG (X = 4, length of sequence 1) and TCG (Y = 3, length of sequence 2)
Initialization Step • Create a matrix with X +1 Rows and Y +1 Columns • The 1st row and the 1st column of the score matrix are filled with 0s
Scoring • The score of any cell C(i, j) is the maximum of: scorediag = C(i-1, j-1) + S(i, j), scoreup = C(i-1, j) + g, scoreleft = C(i, j-1) + g, and 0 (here S(i, j) is the substitution score for letters i and j, and g is the gap penalty)
Scoring …. • Example: The calculation for the cell C(2, 2): scorediag = C(i-1, j-1) + S(i, j) = 0 + -1 = -1; scoreup = C(i-1, j) + g = 0 + -1 = -1; scoreleft = C(i, j-1) + g = 0 + -1 = -1; all three are negative, so C(2, 2) = 0
Scoring …. • Final Scoring Matrix Note: It is not mandatory that the last cell has the maximum alignment score!
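The note can be checked with a short sketch of the local scoring step. The scores (+1 match, -1 mismatch, -1 gap) are inferred from the worked example, and both `sw_matrix` and the second test sequence TCGAA are assumptions added for illustration.

```python
def sw_matrix(seq_1, seq_2, match=1, mismatch=-1, gap=-1):
    """Fill a Smith-Waterman matrix; return (matrix, best_score, best_cell)."""
    rows, cols = len(seq_1) + 1, len(seq_2) + 1
    matrix = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_1[i - 1] == seq_2[j - 1] else mismatch
            matrix[i][j] = max(
                0,                           # never let a score go negative
                matrix[i - 1][j - 1] + sub,
                matrix[i - 1][j] + gap,
                matrix[i][j - 1] + gap,
            )
            if matrix[i][j] > best:          # remember the first highest cell
                best, best_cell = matrix[i][j], (i, j)
    return matrix, best, best_cell


# Slide example: here the maximum happens to sit in the last cell.
_, best_a, cell_a = sw_matrix("ATCG", "TCG")
print(best_a, cell_a)

# Hypothetical example with trailing unmatched letters: the maximum is elsewhere.
matrix_b, best_b, cell_b = sw_matrix("TCGAA", "TCG")
print(best_b, cell_b, "last cell:", matrix_b[-1][-1])
```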
Trace back • The trace back step determines the actual alignment(s) that result in the maximum score • There are likely to be multiple maximal alignments • Trace back starts from the cell with maximum value in the matrix • Gives alignment in reverse order
Trace back …. • There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left • Trace back takes the current cell and looks to the neighbor cells that could be direct predecessors. This means it looks to the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1). The algorithm for trace back chooses as the next cell in the sequence one of the possible predecessors. This continues till cell with value 0 is reached.
Trace back …. • The only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of Seq 1: G | Seq 2: G
Trace back …. • Final Trace back Best Alignment:
T C G
| | |
T C G
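The local trace back can be sketched end to end as below, under the same assumed scores (+1 match, -1 mismatch, -1 gap). Ties are resolved in the order diagonal, up, left, and the first highest-scoring cell is used as the starting point; both choices are arbitrary, and `smith_waterman` is an illustrative name.

```python
def smith_waterman(seq_1, seq_2, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_1, aligned_2) for one best local alignment."""
    rows, cols = len(seq_1) + 1, len(seq_2) + 1
    score = [[0] * cols for _ in range(rows)]
    best, start = 0, (0, 0)

    def sub(i, j):
        return match if seq_1[i - 1] == seq_2[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, start = score[i][j], (i, j)

    # Trace back from the highest cell until a cell with score 0 is reached.
    out_1, out_2 = [], []
    i, j = start
    while score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_1.append(seq_1[i - 1])
            out_2.append(seq_2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_1.append(seq_1[i - 1])
            out_2.append("-")
            i -= 1
        else:
            out_1.append("-")
            out_2.append(seq_2[j - 1])
            j -= 1

    return best, "".join(reversed(out_1)), "".join(reversed(out_2))


print(smith_waterman("ATCG", "TCG"))
```

Unlike the global case, the leading A of the first sequence is simply left out of the reported alignment instead of being paired with a gap.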
The true alignment between two sequences is the one that accurately reflects the evolutionary relationships between the sequences. • Since the true alignment is unknown, in practice we look for the optimal alignment, which is the one in which the numbers of mismatches and gaps are minimized according to certain criteria. • Unfortunately, reducing the number of mismatches results in an increase in the number of gaps, and vice versa.
FASTA 1) Derived from logic of the dot plot • compute best diagonals from all frames of alignment 2) Word method looks for exact matches between words in query and test sequence • hash tables (fast computer technique) • DNA words are usually 6 bases • protein words are 1 or 2 amino acids • only searches for diagonals in region of word matches = faster searching FastA searches can be done on the WWW FastA server at EBI: http://www2.ebi.ac.uk/fasta3/
Makes Longest Diagonal 3) after all diagonals found, tries to join diagonals by adding gaps 4) computes alignments in regions of best diagonals
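The word-lookup stage can be sketched roughly as follows. This is not the FASTA program's actual implementation: it only shows the idea of hashing fixed-length words and tallying exact word matches per diagonal. The names `word_index` and `diagonal_hits`, the word length used in the demo, and the toy sequences are all assumptions.

```python
from collections import Counter, defaultdict


def word_index(seq, k):
    """Hash table mapping every k-letter word of seq to its start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, subject, k):
    """Count exact word matches per diagonal (subject offset minus query offset)."""
    index = word_index(subject, k)
    hits = Counter()
    for q_pos in range(len(query) - k + 1):
        for s_pos in index.get(query[q_pos:q_pos + k], ()):
            hits[s_pos - q_pos] += 1
    return hits


hits = diagonal_hits("TCG", "ATCG", k=2)
print(hits.most_common(1))
```

Diagonals that collect many word hits are the candidate regions that the later steps try to join and align in full.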
FASTA Format • simple format used by almost all programs • >header line with a [return] at end • Sequence (no specific requirements for line length, characters, etc)
>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
GAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTC
CATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC
CCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT
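Reading this format takes only a few lines. The sketch below assumes the plain convention described above (a ">" header line followed by one or more sequence lines); the function name `parse_fasta` and the toy records are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 toy example
ATCG
TCG
>seq2
GGCC
"""

records = parse_fasta(example)
for name, sequence in records:
    print(name, len(sequence), sequence)
```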
BLAST Searches GenBank [BLAST = Basic Local Alignment Search Tool] The NCBI BLAST web server lets you compare your query sequence to various sections of GenBank: • nr = non-redundant (main sections) • month = new sequences from the past few weeks • ESTs • human, Drosophila, yeast, or E. coli genomes • proteins (by automatic translation) • This is a VERY fast and powerful computer.