Alignment, Part I Vasileios Hatzivassiloglou University of Texas at Dallas
Other databases • NCBI BLAST • Basic Local Alignment Search Tool • Multiple programs for sequence searching and comparisons • Gene Expression Omnibus (GEO) • maintained by NCBI • contains output of gene expression experiments
Links • GenBank (http://www.ncbi.nlm.nih.gov/GenBank/) • ExPASy (http://www.expasy.org/) • SwissProt (http://www.expasy.org/sprot/) • GO (http://www.geneontology.org/) • PubMed (http://www.ncbi.nlm.nih.gov/sites/entrez) • MeSH browser (http://www.nlm.nih.gov/mesh/MBrowser.html) • NCBI Blast (http://blast.ncbi.nlm.nih.gov/Blast.cgi) • NCBI GEO (http://www.ncbi.nlm.nih.gov/projects/geo/) • Human Protein Atlas (http://www.proteinatlas.org/)
Assignment • Search the above databases for information on a gene/protein of your choice • Briefly report your findings (90 seconds) next Tuesday, September 30 • Examples: interleukin-N (e.g., 3), elastase, thrombin, creatine kinase, myosin-N (e.g., 2)
Sequences • Sequences of symbols central to bioinformatics • DNA • RNA • proteins • Fixed alphabet (size 4 for DNA/RNA, 20 for proteins)
Sequence similarity • Important for many biological problems • Examples • Similar primary structure in proteins often implies similar form and function • Similar sequences in genes / proteins suggest homologues across organisms • Recurring similar short sequences are the basis of motif finding • Similarities between gene regions can be used for phylogenetic classification
How to measure similarity • Given two sequences S and T, we look into ways to derive T from S using elementary operations • Substitution (change a letter) • Deletion (remove a letter) • Insertion (add a letter) • Process is reversible (S→T and T→S) • There are many ways to derive T from S, some clearly cheaper than others
Edit distance • Each elementary operation is assigned a cost • Overall cost is the sum of the costs for each operation taken (linear model) • The edit distance between two strings is the minimum total cost among all possible sequences of operations that transform S into T
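The definition above can be tried out on small strings. The sketch below is one possible way to compute the minimum, using a memoized recursion over prefixes; the slides so far only define the quantity and do not prescribe how to compute it. The function name `edit_distance` and the default unit costs (`sub_cost`, `indel_cost`) are assumptions made for illustration.

```python
from functools import lru_cache


def edit_distance(s: str, t: str, sub_cost: int = 1, indel_cost: int = 1) -> int:
    """Minimum total cost of transforming s into t.

    The default costs (1 per substitution, insertion, or deletion) are
    illustrative assumptions, not values taken from the slides.
    """

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j): cheapest way to turn the first i letters of s
        # into the first j letters of t.
        if i == 0:
            return j * indel_cost  # insert the remaining j letters of t
        if j == 0:
            return i * indel_cost  # delete the remaining i letters of s
        same = s[i - 1] == t[j - 1]
        keep_or_substitute = d(i - 1, j - 1) + (0 if same else sub_cost)
        delete = d(i - 1, j) + indel_cost
        insert = d(i, j - 1) + indel_cost
        return min(keep_or_substitute, delete, insert)

    return d(len(s), len(t))


print(edit_distance("kitten", "sitting"))  # 3 with unit costs
```

Because every operation is reversible at the same cost here, `edit_distance(s, t)` and `edit_distance(t, s)` agree. The plain recursion is only meant for short strings.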
Alignment • An equivalent way to measure edit distance is to align the two sequences • An alignment extends the sequences S and T into S′ and T′ using the same alphabet plus “-” (the space character), and matches S′[i] with T′[i] at every position i
Definitions • A string is a finite sequence of characters from a finite alphabet Σ • The length of a string S, denoted |S|, is the number of characters it contains (can be 0) • S[i] is the i-th character of S • A subsequence of a string S is the string formed by omitting a number of characters from S (order of characters does not change)
Defining alignment formally • An alignment is the mapping of two strings S and T over alphabet Σ into strings S′ and T′ where • The alphabet of S′ and T′ is Σ plus “-” • S is a subsequence of S′. All characters in S′ not in this subsequence must be “-”. • T is a subsequence of T′. All characters in T′ not in this subsequence must be “-”. • |S′| = |T′| • There is no i for which S′[i] = T′[i] = “-”
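The conditions in the formal definition translate directly into a checker. The following is a minimal sketch, assuming that “-” is not itself a letter of Σ; the function name `is_valid_alignment` is hypothetical and chosen only for this example.

```python
GAP = "-"  # the space character; assumed not to be a letter of the alphabet


def is_valid_alignment(s: str, t: str, s_prime: str, t_prime: str) -> bool:
    """Check the conditions of the formal definition of an alignment."""
    # |S'| = |T'|
    if len(s_prime) != len(t_prime):
        return False
    # S is a subsequence of S' and every other character of S' is "-",
    # which is the same as saying that deleting the gaps from S' yields S.
    if s_prime.replace(GAP, "") != s:
        return False
    # Same condition for T and T'.
    if t_prime.replace(GAP, "") != t:
        return False
    # No position may hold "-" in both strings.
    return all(not (a == GAP and b == GAP) for a, b in zip(s_prime, t_prime))


print(is_valid_alignment("GCGC", "GCC", "GCGC", "GC-C"))   # True
print(is_valid_alignment("GCGC", "GCC", "GCGC-", "GC-C-"))  # False: shared gap
```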
Example alignment Sequences: • GCGCATGGATTGAGCGA • TGCGCCATTGATGACCA A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Alignment operations
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements: • Perfect matches • Mismatches • Insertions & deletions (indels)
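To make the three elements concrete, the sketch below walks through the columns of the example alignment and labels each one. The function name `classify_columns` and the label strings are assumptions made for this illustration.

```python
from collections import Counter


def classify_columns(s_prime: str, t_prime: str) -> Counter:
    """Count matches, mismatches, and indels, column by column."""
    counts = Counter()
    for a, b in zip(s_prime, t_prime):
        if a == "-" or b == "-":
            counts["indel"] += 1      # a letter aligned with a space
        elif a == b:
            counts["match"] += 1      # identical letters
        else:
            counts["mismatch"] += 1   # two different letters
    return counts


counts = classify_columns("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
print(counts["match"], counts["mismatch"], counts["indel"])  # 13 2 4
```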
Alignments are not unique For example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Measuring alignment quality • For each position i in the alignment, calculate the scoring function σ(S′[i], T′[i]) • The scoring function depends only on the symbols S′[i] and T′[i], not on position • A very simple scoring function might be • σ(x, x) = +1 for x a letter • σ(x, y) = -2 for x, y different letters • σ(x, -) = σ(-, x) = -1 for an indel
Overall alignment score • Defined as the sum of the applicable values of the scoring function • As with our definition of edit distance, this is a linear model
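As a small illustration of the linear model, the sketch below sums the simple scoring function σ (+1 match, -2 mismatch, -1 indel) over all columns of an alignment. The names `sigma` and `alignment_score` are assumptions chosen for this example; any other scoring function with the same two-symbol signature could be passed in instead.

```python
def sigma(x: str, y: str) -> int:
    """The very simple scoring function: +1 match, -2 mismatch, -1 indel."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -2


def alignment_score(s_prime: str, t_prime: str, score=sigma) -> int:
    """Overall score: the sum of the scoring function over all positions."""
    if len(s_prime) != len(t_prime):
        raise ValueError("aligned strings must have equal length")
    return sum(score(a, b) for a, b in zip(s_prime, t_prime))


print(alignment_score("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 5
```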
Scoring functions • Usually based on how similar the two symbols are • Derived from confusion probabilities • In biology, chemically similar amino-acids have lower penalties for substitution • In speech recognition, “p”→ “b” costs less than “p”→ “r” • Cost of indels depends on application
Comparing alignments
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
4 indels, 13 matches, 2 mismatches; score: 13(+1) + 2(-2) + 4(-1) = +5
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
12 indels, 5 matches, 6 mismatches; score: 5(+1) + 6(-2) + 12(-1) = -19
Optimal alignment • An alignment which maximizes the overall alignment score is called optimal • Often, there is more than one optimal alignment for two strings • how many depends on the sophistication of the scoring function • The optimal alignment score can be used as a similarity value
Finding the optimal alignment • Simple algorithm: Construct all possible alignments, score them, and pick the best • How many alignments are there for two strings of length n and m?
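The simple algorithm can be sketched directly for very short strings. Every alignment starts with one of three kinds of column (letter over letter, letter over space, space over letter), which gives both a recursive enumeration and a recurrence for the number of alignments. The names `count_alignments`, `all_alignments`, and `brute_force_optimal` are assumptions for this illustration, and the scoring function is the simple σ (+1, -2, -1) used earlier.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n: int, m: int) -> int:
    """Number of alignments of two strings of lengths n and m."""
    if n == 0 or m == 0:
        return 1  # only option: pair every remaining letter with a space
    # First column is letter/letter, letter/space, or space/letter.
    return (count_alignments(n - 1, m - 1)
            + count_alignments(n - 1, m)
            + count_alignments(n, m - 1))


def all_alignments(s: str, t: str):
    """Yield every alignment (s_prime, t_prime) of s and t."""
    if not s and not t:
        yield "", ""
        return
    if s and t:
        for a, b in all_alignments(s[1:], t[1:]):
            yield s[0] + a, t[0] + b
    if s:
        for a, b in all_alignments(s[1:], t):
            yield s[0] + a, "-" + b
    if t:
        for a, b in all_alignments(s, t[1:]):
            yield "-" + a, t[0] + b


def sigma(x: str, y: str) -> int:
    """Simple scoring function: +1 match, -2 mismatch, -1 indel."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -2


def brute_force_optimal(s: str, t: str):
    """Construct all alignments, score them, and pick the best."""
    def total(pair):
        return sum(sigma(a, b) for a, b in zip(*pair))
    best = max(all_alignments(s, t), key=total)
    return best, total(best)


print(count_alignments(2, 2))              # 13
print(brute_force_optimal("GCGC", "GCC"))
```

The count grows exponentially with the string lengths, so this exhaustive approach is only usable on toy inputs; for the two 17-letter example sequences it would already mean scoring more than a billion alignments.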