Sequence alignment is central to bioinformatics!
• Phylogenetic trees and molecular evolution
• Identifying genes in a genome
• Predicting function of unknown genes
• Predicting protein structure
• Assembling genome sequences
What does it mean to “align” DNA sequences?
GCCTACGACCTCCAGAC
GCGTTGG--CTCCAGAC
[Figure: globin gene family tree drawn against evolutionary time. Labels: ancestral globin gene; leghemoglobin (legumes); myoglobin (muscle); α-globin (α1, α2); β-globin family: ε (embryonic), γ (fetal), β (adult)]
Terms for related sequences
• Homologous sequences – share the same evolutionary origin
• Similar sequences – resemble each other but need not share a common origin
• Orthologs – homologous genes in different species (separated by speciation)
• Paralogs – homologous genes within a species (arising by duplication)
[Figure labels: paralogs, orthologs]
What can we learn from this pairwise alignment?
species 1  GCCTACGACCTCCAGAC
species 2  GCGTTGG--CTCCAGAC
How did the OYOP mutation detection algorithm work? Did you feel that it wasn’t fully satisfactory?
What about inserting any combination of any number of gaps at any position until the best score is obtained?
GCCTAC  GCCTAC  GCCTAC  GCCTAC
GCT---  G-CT--  G--CT-  G---CT

GCCTAC  GCCTAC  GCCTAC  GCCTAC
G-C-T-  G-C--T  -G-C-T  --G-CT
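The brute-force idea on this slide can be tried directly on the toy example. The sketch below is only an illustration under simplifying assumptions: gaps go into the shorter sequence only, and an alignment is scored by counting identical columns. The function names are made up for this example.

```python
from itertools import combinations

def gapped_placements(short, length):
    """Yield every way to pad `short` with '-' up to `length`, keeping letter order."""
    for positions in combinations(range(length), len(short)):
        row = ["-"] * length
        for pos, letter in zip(positions, short):
            row[pos] = letter
        yield "".join(row)

def count_matches(row1, row2):
    """Count columns where both rows carry the same letter (gaps never match)."""
    return sum(a == b and a != "-" for a, b in zip(row1, row2))

ref = "GCCTAC"
placements = list(gapped_placements("GCT", len(ref)))
best = max(placements, key=lambda row: count_matches(ref, row))

print(len(placements), "candidate alignments")
print(ref)
print(best, "matches:", count_matches(ref, best))
```

Even this tiny case has 20 candidates (6 choose 3). The count grows combinatorially with sequence length, which is why exhaustive gap insertion is not practical for real sequences.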
Dot-matrix alignment: a simple algorithm
• “Sliding window” of fixed length
• Dot when window 1 matches window 2
[Figure labels: window 1, window 2]
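A minimal sketch of the sliding-window comparison, assuming a dot is placed only when the two windows are exactly identical (real dot-plot tools often allow a mismatch threshold instead). The function names and the default window length are illustrative choices, not taken from the slides.

```python
def dot_matrix(seq1, seq2, window=3):
    """Return a grid of booleans: True where the windows starting at (i, j) are identical."""
    rows = len(seq1) - window + 1
    cols = len(seq2) - window + 1
    return [
        [seq1[i:i + window] == seq2[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]

def render(matrix):
    """Draw the grid as text, '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if hit else "." for hit in row) for row in matrix)

print(render(dot_matrix("GCCTAC", "GCCTAC", window=3)))
```

Comparing a sequence with itself gives an unbroken diagonal. Insertions and deletions show up as shifts of the diagonal, and repeats as extra parallel diagonals.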
Dot-matrix alignment
• Uses:
  • See similar or different regions
  • Look for repeats, insertions, deletions
• Drawbacks:
  • Noisy unless sequences are very similar
  • Does not show how the sequences align
  • Does not produce a score
Scoring an alignment
species 1  GCCTACGACCTCC    GCCTACGACCTCC
species 2  GCGTTGG--CTCC    GCGTT-GGC-TCC
• Considerations:
  • Measure percent identity
  • Scoring: match, mismatch, gap
  • Two different alignments may give same score
Is a gap worse than a mismatch? Why? Is a longer gap worse?
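The two alignments on this slide can be scored column by column. The sketch below assumes illustrative values (match +1, mismatch 0, gap -1) that do not come from the slides, and it computes percent identity over the full alignment length, which is only one of several conventions.

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap=-1):
    """Sum a per-column score over two equal-length aligned rows."""
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

def percent_identity(row1, row2):
    """Identical columns as a percentage of alignment length (gap columns included)."""
    identical = sum(a == b and a != "-" for a, b in zip(row1, row2))
    return 100.0 * identical / len(row1)

species1 = "GCCTACGACCTCC"
for species2 in ("GCGTTGG--CTCC", "GCGTT-GGC-TCC"):
    print(species2, score_alignment(species1, species2),
          round(percent_identity(species1, species2), 1))
```

Both alignments contain 8 matches, 3 mismatches and 2 gap columns, so under this per-column scheme they tie at a score of 6. Telling them apart takes a scheme that treats one long gap differently from two short ones.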
Needleman-Wunsch algorithm
• Compares two sequences
• Global alignment
• Considers matches, mismatches and gaps
• Provides an optimal alignment and score for the chosen scoring scheme
  • Not necessarily the biologically “correct” alignment
• Efficient: uses dynamic programming
  • Break a problem into manageable sub-problems
  • Assemble sub-problems to solve the original problem
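As a rough illustration of the dynamic-programming idea, here is a compact global alignment in the Needleman-Wunsch style with a linear gap penalty. The parameter values (match +1, mismatch 0, gap -1) are assumptions for the example, and when several alignments tie, the traceback simply returns one of them.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_row1, aligned_row2) for one optimal global alignment."""
    n, m = len(seq1), len(seq2)
    # score[i][j] = best score for aligning seq1[:i] with seq2[:j] (the sub-problem)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two letters
                              score[i - 1][j] + gap,      # gap in seq2
                              score[i][j - 1] + gap)      # gap in seq1

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row1.append(seq1[i - 1]); row2.append(seq2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row1.append(seq1[i - 1]); row2.append("-"); i -= 1
        else:
            row1.append("-"); row2.append(seq2[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row1)), "".join(reversed(row2))

total, top, bottom = needleman_wunsch("GCCTACGACCTCC", "GCGTTGGCTCC")
print(total)
print(top)
print(bottom)
```

The table has (n+1) × (m+1) cells and each is filled in constant time, so the cost grows with the product of the sequence lengths rather than with the number of possible gap placements.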
Needleman-Wunsch parameters
• Match score (“match bonus”)
• Mismatch score – often zero, but a mismatch can be penalized
• Gap penalty (linear or affine)
• Various scoring methods can be used with the basic algorithm
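To make the linear versus affine distinction concrete, the sketch below scores an existing alignment with an affine gap penalty, under the assumption that the first column of a gap costs `gap_open` and each further column of the same gap costs `gap_extend`. The values -2 and -1 are illustrative, and other conventions for splitting the open and extend costs exist.

```python
def affine_score(row1, row2, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score two aligned rows, charging less for extending a gap than for opening one."""
    score = 0
    prev_gap_row = None  # which row held the gap in the previous column, if any
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gap_row = 1 if a == "-" else 2
            score += gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score

species1 = "GCCTACGACCTCC"
print(affine_score(species1, "GCGTTGG--CTCC"))  # one gap of length 2
print(affine_score(species1, "GCGTT-GGC-TCC"))  # two gaps of length 1
```

With these assumed values the single two-column gap scores 5 and the two separate gaps score 4, so an affine penalty separates alignments that a linear penalty scores identically.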
Semi-global (“glocal”) alignments
• Align a gene with a genome
• Align a domain with a protein
• Align start of one sequence with end of another
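One common way to get the gene-in-genome and domain-in-protein behaviour is to reuse the global dynamic-programming recurrence but stop charging for the unaligned ends of the longer sequence. The sketch below makes that assumption: the whole `query` must be aligned, while any prefix or suffix of `target` is skipped for free. It returns only the score, and the parameter values are illustrative.

```python
def semi_global_score(query, target, match=1, mismatch=0, gap=-1):
    """Best score for aligning all of `query` against any stretch of `target`."""
    n, m = len(query), len(target)
    prev = [0] * (m + 1)  # row 0 is all zeros: skipping a target prefix is free
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m  # query letters aligned to nothing still cost
        for j in range(1, m + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,  # align two letters
                          prev[j] + gap,      # gap in target
                          curr[j - 1] + gap)  # gap in query
        prev = curr
    return max(prev)  # best cell in the last row: skipping a target suffix is free

print(semi_global_score("CTCC", "GCGTTGGCTCCAGAC"))
```

Aligning the start of one sequence with the end of another (the overlap case) needs a different choice of which ends are free, which this sketch does not cover.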
Local alignments
• Partial match between sequences:
  • Allow for introns
  • Find shared domains within distinct proteins
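The slides do not name a method for local alignment. The standard dynamic-programming approach is the Smith-Waterman algorithm, sketched here as an illustration. It differs from the global recurrence in two ways: no cell may drop below zero, and the traceback starts at the highest-scoring cell and stops at a zero. The parameter values (match +2, mismatch -1, gap -2) are assumptions, and a negative mismatch score is needed for the local behaviour to appear.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_row1, aligned_row2) for one best local alignment."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(0,                          # start a fresh local alignment
                              score[i - 1][j - 1] + sub,  # align two letters
                              score[i - 1][j] + gap,      # gap in seq2
                              score[i][j - 1] + gap)      # gap in seq1
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the running score reaches zero.
    row1, row2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row1.append(seq1[i - 1]); row2.append(seq2[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row1.append(seq1[i - 1]); row2.append("-"); i -= 1
        else:
            row1.append("-"); row2.append(seq2[j - 1]); j -= 1
    return best, "".join(reversed(row1)), "".join(reversed(row2))

# Hypothetical sequences: unrelated flanks around a shared core.
print(smith_waterman("AAAACTCCAGACTTT", "GGGCTCCAGACGG"))
```

In this made-up example only the shared core `CTCCAGAC` is reported and the unrelated flanking letters are ignored, which is the behaviour wanted when looking for a shared domain inside otherwise distinct sequences.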