Learn about sequence alignment, edit distance, alignment measurements, and scoring functions in bioinformatics. Discover how to find optimal alignments and assess alignment quality.
Alignment, Part I Vasileios Hatzivassiloglou University of Texas at Dallas
Other databases • NCBI BLAST • Basic Local Alignment Search Tool • Multiple programs for sequence searching and comparisons • Gene Expression Omnibus (GEO) • maintained by NCBI • contains output of gene expression experiments
Links • GenBank (http://www.ncbi.nlm.nih.gov/GenBank/) • ExPASy (http://www.expasy.org/) • SwissProt (http://www.expasy.org/sprot/) • GO (http://www.geneontology.org/) • PubMed (http://www.ncbi.nlm.nih.gov/sites/entrez) • MeSH browser (http://www.nlm.nih.gov/mesh/MBrowser.html) • NCBI Blast (http://blast.ncbi.nlm.nih.gov/Blast.cgi) • NCBI GEO (http://www.ncbi.nlm.nih.gov/projects/geo/) • Human Protein Atlas (http://www.proteinatlas.org/)
Assignment • Search the above databases for information on a gene/protein of your choice • Briefly report your findings (90 seconds) next Tuesday, September 30 • Examples: interleukin-N (e.g., 3), elastase, thrombin, creatine kinase, myosin-N (e.g., 2)
Sequences • Sequences of symbols central to bioinformatics • DNA • RNA • proteins • Fixed alphabet (size 4 for DNA/RNA, 20 for proteins)
Sequence similarity • Important for many biological problems • Examples • Similar primary structure in proteins implies similar form and function • Similar sequences in genes / proteins imply homologues across organisms • Similar short sequences lead to motif finding • Similarities between gene regions can be used for phylogenetic classification
How to measure similarity • Given two sequences S and T, we look into ways to derive T from S using elementary operations • Substitution (change a letter) • Deletion • Insertion • Process is reversible (S→T and T→S) • Many ways, some obviously more efficient
Edit distance • Each elementary operation is assigned a cost • Overall cost is the sum of the costs for each operation taken (linear model) • The edit distance between two strings is the minimum total cost among all possible sequences of operations that transform S into T
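As a rough illustration of this definition, the sketch below computes an edit distance with the standard dynamic-programming recurrence, a technique these slides have not introduced at this point. The function name `edit_distance` and the unit costs (1 per substitution, insertion, or deletion, 0 for leaving a letter unchanged) are assumptions made for the example; the slides leave the actual costs open.

```python
def edit_distance(s, t, sub_cost=1, indel_cost=1):
    """Minimum total cost of turning s into t (assumed unit costs by default)."""
    # prev[j] holds the distance between the current prefix of s and t[:j].
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel_cost]  # delete all i letters of s[:i] to reach ""
        for j in range(1, len(t) + 1):
            change = 0 if s[i - 1] == t[j - 1] else sub_cost
            curr.append(min(
                prev[j - 1] + change,      # keep or substitute a letter
                prev[j] + indel_cost,      # delete s[i-1]
                curr[j - 1] + indel_cost,  # insert t[j-1]
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3 with unit costs
```

Because the cost model is linear, the cheapest way to reach each pair of prefixes can be built from the cheapest ways to reach shorter prefixes, which is what the three-way `min` expresses.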
Alignment • An equivalent way of measuring edit distance is to align the two sequences • An alignment extends the sequences S and T into S′ and T′ using the same alphabet plus “-” (the space character), and matches S′[i] with T′[i]
Definitions • A string is a finite sequence of characters from a finite alphabet Σ • The length of a string S, denoted |S|, is the number of characters it contains (can be 0) • S[i] is the i-th character of S • A subsequence of a string S is the string formed by omitting a number of characters from S (order of characters does not change)
Defining alignment formally • An alignment is the mapping of two strings S and T from alphabet Σ into strings S′ and T′ where • The alphabet of S′ and T′ is Σ plus “-” • S is a subsequence of S′. All characters in S′ not in this subsequence must be “-”. • T is a subsequence of T′. All characters in T′ not in this subsequence must be “-”. • |S′| = |T′| • There is no i for which S′[i] = T′[i] = “-”
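The conditions in this definition can be checked mechanically. The following is a minimal sketch, assuming that "-" is not itself a letter of Σ; the name `is_valid_alignment` is hypothetical and chosen only for this example.

```python
GAP = "-"  # the space character added to the alphabet


def is_valid_alignment(s, t, s_aligned, t_aligned):
    """Check the formal conditions for (s_aligned, t_aligned) aligning s and t."""
    # |S'| = |T'|
    if len(s_aligned) != len(t_aligned):
        return False
    # Removing the gaps must give back S and T exactly: this is the
    # "subsequence, and everything else is a gap" condition.
    if s_aligned.replace(GAP, "") != s or t_aligned.replace(GAP, "") != t:
        return False
    # No position may hold a gap in both strings.
    return all(not (a == GAP and b == GAP)
               for a, b in zip(s_aligned, t_aligned))


s = "GCGCATGGATTGAGCGA"
t = "TGCGCCATTGATGACCA"
print(is_valid_alignment(s, t, "-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # True
```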
Example alignment Sequences: • GCGCATGGATTGAGCGA • TGCGCCATTGATGACCA A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Alignment operations
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements: • Perfect matches • Mismatches • Insertions & deletions (indels)
Alignments are not unique For example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Measuring alignment quality • For each position i in the alignment, calculate the scoring function σ(S′[i], T′[i]) • The scoring function depends only on the symbols S′[i] and T′[i], not on position • A very simple scoring function might be • σ(x, x) = +1 for x a letter • σ(x, y) = -2 for x, y different letters • σ(x, -) = σ(-, x) = -1 for indel
Overall alignment score • Defined as the sum of the applicable values of the scoring function • As with our definition of edit distance, this is a linear model
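A minimal sketch of this linear model, using the very simple scoring function given earlier (+1 for a match, -2 for a mismatch, -1 for an indel). The function names `sigma` and `alignment_score` are assumptions made for this illustration.

```python
GAP = "-"


def sigma(x, y):
    """The simple scoring function: +1 match, -2 mismatch, -1 indel."""
    if x == GAP and y == GAP:
        raise ValueError("an alignment may not pair two gaps")
    if x == GAP or y == GAP:
        return -1
    return 1 if x == y else -2


def alignment_score(s_aligned, t_aligned, score=sigma):
    """Sum the scoring function over every position of the alignment."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned strings must have equal length")
    return sum(score(x, y) for x, y in zip(s_aligned, t_aligned))


print(alignment_score("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 5
```

Passing the scoring function as a parameter keeps the summation separate from the choice of σ, so a different scoring function can be swapped in without touching `alignment_score`.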
Scoring functions • Usually based on how similar the two symbols are • Derived from confusion probabilities • In biology, chemically similar amino acids have lower penalties for substitution • In speech recognition, “p” → “b” costs less than “p” → “r” • Cost of indels depends on application
Comparing alignments
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
4 indels, 13 matches, 2 mismatches; score: +5
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
12 indels, 5 matches, 6 mismatches; score: -19
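The tallies above can be reproduced with a short sketch that classifies each column and then applies the simple scoring function (+1 per match, -2 per mismatch, -1 per indel). The helper names `classify_columns` and `score_from_counts` are hypothetical.

```python
from collections import Counter


def classify_columns(s_aligned, t_aligned):
    """Count indels, matches, and mismatches column by column."""
    counts = Counter()
    for x, y in zip(s_aligned, t_aligned):
        if x == "-" or y == "-":
            counts["indel"] += 1
        elif x == y:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


def score_from_counts(counts):
    """Apply the simple scoring function to the column counts."""
    return counts["match"] - 2 * counts["mismatch"] - counts["indel"]


first = classify_columns("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
second = classify_columns("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--")
print(dict(first), score_from_counts(first))    # 4 indels, 13 matches, 2 mismatches: 5
print(dict(second), score_from_counts(second))  # 12 indels, 5 matches, 6 mismatches: -19
```

For the first alignment the arithmetic is 13 - 2·2 - 4 = +5, and for the second it is 5 - 2·6 - 12 = -19.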
Optimal alignment • An alignment which maximizes the overall alignment score is called optimal • Often, there is more than one optimal alignment for two strings • depends on sophistication of scoring function • The optimal alignment score can be used as a similarity value
Finding the optimal alignment • Simple algorithm: Construct all possible alignments, score them, and pick the best • How many alignments are there for two strings of lengths n and m?
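One way to explore this question numerically is sketched below. It assumes the formal definition of alignment given earlier: every column pairs two letters, or a letter with a gap, and never two gaps. Alignments that differ only in the order of adjacent gaps (such as A-/-B and -A/B-) are counted as distinct. The function name `count_alignments` is hypothetical.

```python
def count_alignments(n, m):
    """Number of alignments of a length-n string with a length-m string."""
    # table[i][j] = number of alignments of a length-i prefix with a
    # length-j prefix. With an empty prefix on one side there is exactly
    # one alignment: every letter of the other string faces a gap.
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = (
                table[i - 1][j - 1]  # last column pairs two letters
                + table[i - 1][j]    # last column: letter of S over a gap
                + table[i][j - 1]    # last column: gap over a letter of T
            )
    return table[n][m]


for k in (1, 2, 3, 5, 10, 17):
    print(k, count_alignments(k, k))
```

The count depends only on the two lengths, not on the letters, and it grows quickly enough that constructing every alignment soon becomes impractical.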