Sequence Alignments • Chi-Cheng Lin, Ph.D., Associate Professor, Department of Computer Science, Winona State University – Rochester Center, clin@winona.edu
Sequence Alignments • Cornerstone of bioinformatics • What is a sequence? • Nucleotide sequence • Amino acid sequence • Pairwise and multiple sequence alignments • What alignments can help with • Determine the function of a newly discovered gene sequence • Determine evolutionary relationships among genes, proteins, and species • Predict the structure and function of a protein
Why Align Sequences? • The draft human genome is available • Automated gene finding is possible • Gene: AGTACGTATCGTATAGCGTAA • What does it do? • One approach: Is there a similar gene in another species? • Align sequences with known genes • Find the gene with the “best” match
Visualization of Sequence Alignment • Dot Plot • One of the simplest and oldest methods for sequence alignment • Visualization of regions of similarity • Assign one sequence to the horizontal axis • Assign the other to the vertical axis • Place a dot in every cell where the two residues match • Diagonal lines mean adjacent regions of identity
A Simple Example • Construct a simple dot plot for the sequences TAGTCGATG and TGGTCATC • The alignment is
TAGTCGATG
TGGTC-ATC
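The dot plot in this example can be sketched in a few lines of Python. This is only an illustration: it assumes the two sequences on the slide are TAGTCGATG (horizontal axis) and TGGTCATC (vertical axis), and the function name `dot_plot` and the use of `*` as the dot character are choices made here, not part of the original slides.

```python
def dot_plot(seq_x: str, seq_y: str) -> list[str]:
    """Return one text row per residue of seq_y; '*' marks a match with seq_x."""
    return [
        "".join("*" if x == y else " " for x in seq_x)
        for y in seq_y
    ]


horizontal = "TAGTCGATG"  # assumed first sequence from the slide
vertical = "TGGTCATC"     # assumed second sequence from the slide

rows = dot_plot(horizontal, vertical)

# Print the horizontal sequence as a header, then one labelled row per residue.
print("  " + horizontal)
for residue, row in zip(vertical, rows):
    print(residue + " " + row)
```

Runs of `*` along a diagonal in the printed grid correspond to stretches where the two sequences are identical.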
Genes Accumulate Mutations over Time • Mistakes in gene replication or repair • Deletions, duplications • Insertions, inversions • Translocations • Point mutations • Environmental factors • Radiation • Oxidation
Deletions • Codon deletion: ACG ATA GCG TAT GTA TAG CCG… • Effect depends on the protein, position, etc. • Almost always deleterious • Sometimes lethal • Frame shift mutation: ACG ATA GCG TAT GTA TAG CCG… becomes ACG ATA GCG ATG TAT AGC CG?… • Almost always lethal
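The difference between the two kinds of deletion can be seen by re-reading the sequence in triplets after each change. The sketch below is illustrative only: the helper `codons` is a name introduced here, and the choice of which codon to delete (TAT) is an assumption, since the slide text does not say which one was removed. The single-base deletion reproduces the frame-shifted reading shown on the slide.

```python
def codons(seq: str) -> list[str]:
    """Split a sequence into consecutive triplets; the last piece may be shorter."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


original = "ACGATAGCGTATGTATAGCCG"

# Assumed example: delete one whole codon (TAT at indices 9-11).
# The reading frame of everything downstream is preserved.
codon_deleted = original[:9] + original[12:]

# Delete a single base (the T at index 9).
# Every downstream codon is now read in a shifted frame.
frame_shifted = original[:9] + original[10:]

print(" ".join(codons(original)))
print(" ".join(codons(codon_deleted)))
print(" ".join(codons(frame_shifted)))
```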
Indels • When comparing two genes, it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless ancestry is known:
ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT
The Genetic Code • Substitutions are mutations accepted by natural selection • Synonymous: CGC → CGA (both code for Arg) • Non-synonymous: GAU → GAA (Asp → Glu)
Point Mutation Example: Sickle-cell Disease • Wild-type hemoglobin • DNA 3’----CTT----5’ • mRNA 5’----GAA----3’ • Normal hemoglobin ------[Glu]------ • Mutant hemoglobin • DNA 3’----CAT----5’ • mRNA 5’----GUA----3’ • Mutant hemoglobin ------[Val]------
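The step from DNA template to mRNA to amino acid in the sickle-cell example can be traced with a small sketch. It assumes the DNA triplets on the slide are template-strand bases written 3’ to 5’, so the mRNA is simply their complement. The names `transcribe` and `PARTIAL_CODON_TABLE` are introduced here, and the table holds only the two codons this example needs, not the full genetic code.

```python
# Complement of each template-strand DNA base in the resulting mRNA.
COMPLEMENT_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Deliberately partial codon table: only the two codons used in this example.
PARTIAL_CODON_TABLE = {"GAA": "Glu", "GUA": "Val"}


def transcribe(template_3_to_5: str) -> str:
    """Return the mRNA (5'->3') for a DNA template strand written 3'->5'."""
    return "".join(COMPLEMENT_TO_RNA[base] for base in template_3_to_5)


for label, template in [("wild-type", "CTT"), ("mutant", "CAT")]:
    mrna = transcribe(template)
    print(label, template, "->", mrna, "->", PARTIAL_CODON_TABLE[mrna])
```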
image credit: U.S. Department of Energy Human Genome Program, http://www.ornl.gov/hgmis.
Comparing Two Sequences • Point mutations, easy:
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
• Indels are difficult, must align sequences:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT
• Aligned:
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
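The "easy" case can be made concrete: when two sequences have the same length and differ only by point mutations, comparing them column by column is enough. The sketch below uses the two 27-base sequences from the point-mutation example. The function name `mismatch_positions` and the 1-based position numbering are choices made for this illustration.

```python
def mismatch_positions(a: str, b: str) -> list[int]:
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length; align them first")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


seq_a = "ACGTCTGATACGCCGTATAGTCTATCT"
seq_b = "ACGTCTGATTCGCCCTATCGTCTATCT"

print(mismatch_positions(seq_a, seq_b))
```

The same function raises an error for the indel case, where the sequences have different lengths. That is exactly the situation in which gaps have to be introduced before a column-by-column comparison makes sense.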
Scoring a Sequence Alignment • Example • Match score: +1 • Mismatch score: 0 • Gap penalty: –1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
• Matches: 18 × (+1) • Mismatches: 2 × 0 • Gaps: 7 × (–1) • Score = 18 + 0 + (–7) = +11 • Various scoring schemes exist.
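The scoring scheme in this example translates directly into a short function. This is a minimal sketch under the slide's parameters (match +1, mismatch 0, gap –1, each gap column charged separately). The name `score_alignment` and its keyword arguments are introduced here for illustration.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Score two already-aligned rows column by column ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same number of columns")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # gap column
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))
```

Changing the keyword arguments lets the same alignment be re-scored under a different scheme, for example with a negative mismatch score.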
How can we find an optimal alignment? • Finding the alignment is computationally hard:
ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG-CATCGTC--T-ATCT
• There are about 888,000 possible ways to align the two sequences given above. • Algorithms using a technique called “dynamic programming” are used; they are outside the scope of this workshop.
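Two points on this slide can be illustrated with a short sketch. First, a figure of roughly 888,000 is what one gets by counting the ways to place 7 gap characters among 27 columns. That this is how the slide's number was derived is an assumption, although the values agree. Second, dynamic programming avoids enumerating those possibilities. The function below is a Needleman-Wunsch-style recurrence that returns only the best global score, using the scoring scheme from the previous slide. It is a sketch for orientation, not the algorithm as taught in the workshop, and the function name is introduced here.

```python
import math


def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Best global alignment score, computed row by row with dynamic programming."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


# Ways to choose which 7 of the 27 columns hold a gap in the shorter sequence.
print(math.comb(27, 7))

long_seq = "ACGTCTGATACGCCGTATAGTCTATCT"
short_seq = "CTGATTCGCATCGTCTATCT"
print(global_alignment_score(long_seq, short_seq))
```

The table has (27 + 1) × (20 + 1) cells, so the work grows with the product of the sequence lengths instead of with the number of possible gap placements.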
Global and Local Alignments • Global alignments – score the entire alignment • Local alignment – find the best matching subsequence • Why local sequence alignment? • Global alignment is useful only if the sequences to be aligned are very similar • Subsequence comparison between a DNA sequence and a genome • Identify • Conserved regions • Protein function domains
Example • Compare the two sequences: TTGACACCCTCCCAATT and ACCCCAGGCTTTACACAG • Global alignment (does it look good?)
TTGACACCCTCC-CAATT
    ||  ||   ||
ACCCCAGGCTTTACACAG
• Local alignment (does it look good?)
---------TTGACACCCTCCCAATT
         || ||||
ACCCCAGGCTTTACACAG--------
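The idea of finding the best matching subsequence can be sketched with a Smith-Waterman-style recurrence, in which any cell may restart at zero so that poorly matching flanks are ignored. The scoring values used here (match +1, mismatch –1, gap –1) are assumptions for this illustration. The slides do not specify a scheme for local alignment, and a negative mismatch score is used so that the local score does not simply grow with length. The function name is also introduced here.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score; cells are floored at 0 so alignments can restart."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)  # the best local score can end anywhere
        prev = curr
    return best


seq1 = "TTGACACCCTCCCAATT"
seq2 = "ACCCCAGGCTTTACACAG"
print(local_alignment_score(seq1, seq2))
```

Unlike the global case, the result is the maximum over the whole table, not the value in the last cell, which is what lets the aligned region sit anywhere inside either sequence.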
Where do we get sequences to work with? • Biological databases • NCBI Entrez (http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi?term=) • Wet labs • Simulations • Other people’s results • On-line education resources • BEDROCK (http://www.bioquest.org/bedrock/) • BLAST results