Introduction to Sequence Alignment in Bioinformatics

Alignment, Part I Vasileios Hatzivassiloglou University of Texas at Dallas

Other databases • NCBI BLAST • Basic Local Alignment Search Tool • Multiple programs for sequence searching and comparisons • Gene Expression Omnibus (GEO) • maintained by NCBI • contains output of gene expression experiments

Links • GenBank (http://www.ncbi.nlm.nih.gov/GenBank/) • ExPASy (http://www.expasy.org/) • SwissProt (http://www.expasy.org/sprot/) • GO (http://www.geneontology.org/) • PubMed (http://www.ncbi.nlm.nih.gov/sites/entrez) • MeSH browser (http://www.nlm.nih.gov/mesh/MBrowser.html) • NCBI Blast (http://blast.ncbi.nlm.nih.gov/Blast.cgi) • NCBI GEO (http://www.ncbi.nlm.nih.gov/projects/geo/) • Human Protein Atlas (http://www.proteinatlas.org/)

Assignment • Search the above databases for information on a gene/protein of your choice • Briefly report your findings (90 seconds) next Tuesday, September 30 • Examples: interleukin-N (e.g., 3), elastase, thrombin, creatine kinase, myosin-N (e.g., 2)

Sequences • Sequences of symbols central to bioinformatics • DNA • RNA • proteins • Fixed alphabet (size 4 for DNA/RNA, 20 for proteins)

Sequence similarity • Important for many biological problems • Examples • Similar primary structure in proteins often implies similar form and function • Similar sequences in genes / proteins suggest homologues across organisms • Similar short sequences lead to motif finding • Similarities between gene regions can be used for phylogenetic classification

How to measure similarity • Given two sequences S and T, we look into ways to derive T from S using elementary operations • Substitution (change a letter) • Deletion • Insertion • Process is reversible (S→T and T→S) • Many ways, some obviously more efficient

Edit distance • Each elementary operation is assigned a cost • Overall cost is the sum of the costs for each operation taken (linear model) • The edit distance between two strings is the minimum total cost among all possible sequences of operations that transform S into T
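The definition above can be made concrete with a short sketch. It assumes a unit cost for every substitution, insertion, and deletion (the slides do not fix the costs) and uses the standard dynamic-programming recurrence for edit distance, which these slides have not introduced; the function name and parameters are illustrative only.

```python
def edit_distance(s, t, sub_cost=1, indel_cost=1):
    """Minimum total cost of substitutions, insertions and deletions turning s into t.

    The default unit costs are an assumption made for this sketch.
    """
    # prev[j] = distance between the first i-1 letters of s and the first j letters of t.
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel_cost]  # turning s[:i] into the empty string takes i deletions
        for j in range(1, len(t) + 1):
            same = s[i - 1] == t[j - 1]
            curr.append(min(
                prev[j - 1] + (0 if same else sub_cost),  # keep or substitute a letter
                prev[j] + indel_cost,                     # delete s[i-1]
                curr[j - 1] + indel_cost,                 # insert t[j-1]
            ))
        prev = curr
    return prev[len(t)]


print(edit_distance("kitten", "sitting"))
```

Because the cost is a plain sum over operations, swapping the roles of S and T gives the same value whenever insertions and deletions cost the same, which matches the reversibility noted earlier.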

Alignment • An equivalent way of measuring edit distance is to align the two sequences • An alignment extends the sequences S and T into S′ and T′ using the same alphabet plus “-” (the space, or gap, character), and matches S′[i] with T′[i]

Definitions • A string is a finite sequence of characters from a finite alphabet Σ • The length of a string S, denoted |S|, is the number of characters it contains (can be 0) • S[i] is the i-th character of S • A subsequence of a string S is the string formed by omitting a number of characters from S (order of characters does not change)

Defining alignment formally • An alignment is the mapping of two strings S and T from alphabet Σ into strings S′ and T′ where • The alphabet of S′ and T′ is Σ plus “-” • S is a subsequence of S′. All characters in S′ not in this subsequence must be “-”. • T is a subsequence of T′. All characters in T′ not in this subsequence must be “-”. • |S′| = |T′| • There is no i for which S′[i] = T′[i] = “-”
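The conditions above translate almost directly into a checker. This is a minimal sketch: the function names are hypothetical, and it assumes that “-” is not itself a letter of Σ, so that removing the gaps from S′ must give back exactly S.

```python
GAP = "-"


def strips_to(original, extended):
    """True if `extended` is `original` with only gap characters inserted."""
    return extended.replace(GAP, "") == original


def is_valid_alignment(s, t, s_ext, t_ext):
    """Check the conditions of the formal definition for S' = s_ext, T' = t_ext."""
    same_length = len(s_ext) == len(t_ext)
    # No column may pair a gap with a gap.
    no_double_gap = all(not (a == GAP and b == GAP) for a, b in zip(s_ext, t_ext))
    return same_length and strips_to(s, s_ext) and strips_to(t, t_ext) and no_double_gap


# The sequences and alignment used as the example in these slides.
S = "GCGCATGGATTGAGCGA"
T = "TGCGCCATTGATGACCA"
print(is_valid_alignment(S, T, "-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))
```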

Example alignment
Sequences:
• GCGCATGGATTGAGCGA
• TGCGCCATTGATGACCA
A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

Alignment operations
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements: • Perfect matches • Mismatches • Insertions & deletions (indels)

Alignments are not unique
For example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--

Measuring alignment quality • For each position i in the alignment, calculate the scoring function σ(S′[i], T′[i]) • The scoring function depends only on the symbols S′[i] and T′[i], not on position • A very simple scoring function might be • σ(x, x) = +1 for x a letter • σ(x, y) = -2 for x, y different letters • σ(x, -) = σ(-, x) = -1 for indel

Overall alignment score • Defined as the sum of the applicable values of the scoring function • As with our definition of edit distance, this is a linear model
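Written out as a formula, the sum reads as follows. The name "score" is an assumption for illustration; σ, S′ and T′ are as defined earlier.

```latex
\mathrm{score}(S', T') \;=\; \sum_{i=1}^{|S'|} \sigma\bigl(S'[i],\, T'[i]\bigr)
```

As a small illustrative case (not from the slides), take S′ = AC-T and T′ = AGGT with the simple scoring function given earlier: the columns are a match, a mismatch, an indel, and a match, so the score is (+1) + (-2) + (-1) + (+1) = -1.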

Scoring functions • Usually based on how similar the two symbols are • Derived from confusion probabilities • In biology, chemically similar amino acids have lower penalties for substitution • In speech recognition, “p”→ “b” costs less than “p”→ “r” • Cost of indels depends on application

Comparing alignments
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
4 indels, 13 matches, 2 mismatches; score: +5
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
12 indels, 5 matches, 6 mismatches; score: -19
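The counts and scores above can be reproduced with a few lines of code. This sketch uses the simple scoring function from the earlier slide (+1 match, -2 mismatch, -1 indel); the function names are hypothetical, and both strings are assumed to have equal length, as the definition of an alignment requires.

```python
from collections import Counter

GAP = "-"


def sigma(x, y, match=1, mismatch=-2, indel=-1):
    """Simple scoring function: depends only on the two symbols in a column."""
    if x == GAP or y == GAP:
        return indel
    return match if x == y else mismatch


def column_counts(s_ext, t_ext):
    """Count the indel, match and mismatch columns of an alignment."""
    counts = Counter()
    for x, y in zip(s_ext, t_ext):
        if x == GAP or y == GAP:
            counts["indel"] += 1
        elif x == y:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


def alignment_score(s_ext, t_ext):
    """Overall score: the sum of sigma over all columns (linear model)."""
    return sum(sigma(x, y) for x, y in zip(s_ext, t_ext))


first = ("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
second = ("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--")
for s_ext, t_ext in (first, second):
    print(dict(column_counts(s_ext, t_ext)), alignment_score(s_ext, t_ext))
```

With these weights the first alignment scores 13 - 4 - 4 = +5 and the second 5 - 12 - 12 = -19, so the first is the better of the two.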

Optimal alignment • An alignment which maximizes the overall alignment score is called optimal • Often, there is more than one optimal alignment for two strings • How many there are depends on the sophistication of the scoring function • The optimal alignment score can be used as a similarity value

Finding the optimal alignment • Simple algorithm: Construct all possible alignments, score them, and pick the best • How many alignments are there for two strings of length n and m?
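The simple algorithm can be sketched directly, together with one way to count the alignments. Under the formal definition given earlier, the last column of any alignment pairs the two final letters, or pairs the final letter of S with a gap, or pairs a gap with the final letter of T; that observation gives the recurrence used below. The function names are hypothetical, the scoring weights are those of the simple scoring function, and two alignments are counted as different whenever their columns differ.

```python
from functools import lru_cache

GAP = "-"


@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Number of alignments of two strings of lengths n and m."""
    if n == 0 or m == 0:
        return 1  # only option: pair every remaining letter with a gap
    return (count_alignments(n - 1, m - 1)   # last column: letter / letter
            + count_alignments(n - 1, m)     # last column: letter / gap
            + count_alignments(n, m - 1))    # last column: gap / letter


def all_alignments(s, t):
    """Yield every alignment (s_ext, t_ext) of s and t, choosing the first column each time."""
    if not s and not t:
        yield "", ""
        return
    if s and t:
        for a, b in all_alignments(s[1:], t[1:]):
            yield s[0] + a, t[0] + b
    if s:
        for a, b in all_alignments(s[1:], t):
            yield s[0] + a, GAP + b
    if t:
        for a, b in all_alignments(s, t[1:]):
            yield GAP + a, t[0] + b


def alignment_score(s_ext, t_ext, match=1, mismatch=-2, indel=-1):
    """Sum of column scores under the simple scoring function."""
    return sum(indel if GAP in (x, y) else (match if x == y else mismatch)
               for x, y in zip(s_ext, t_ext))


def best_alignment(s, t):
    """Brute force: score every alignment and keep one with the highest score."""
    return max(all_alignments(s, t), key=lambda pair: alignment_score(*pair))


print(count_alignments(2, 2), best_alignment("GAT", "GT"))
```

The count grows very quickly with n and m, so the brute-force search is only usable on very short strings.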
