590 likes | 710 Views
Algorithms for Alignment of Genomic Sequences. Michael Brudno Department of Computer Science Stanford University PGA Workshop 07/16/2004. Conservation Implies Function. Gene. Exon. CNS: Other Conserved. Edit Distance Model (1).
E N D
Algorithms for Alignment of Genomic Sequences Michael Brudno Department of Computer Science Stanford University PGA Workshop 07/16/2004
Conservation Implies Function Gene Exon CNS: Other Conserved
Edit Distance Model (1) Weighted sum of insertions, deletions & mutations to transform one string into another AGGCACA--CA AGGCACACA | |||| || or| || || A--CACATTCA ACACATTCA
Edit Distance Model (2) Given: x, y Define: F(i,j) = Score of best alignment of x1…xi to y1…yj Recurrence: F(i,j) = max ( F(i-1,j) – GAP_PENALTY, F(i,j-1) – GAP_PENALTY, F(i-1,j-1) + SCORE(xi, yj))
Edit Distance Model (3) AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA • F(i,j) = Score of best alignment ending at i,j • Time O( n2 ) for two seqs, O( nk ) for k seqs F(i-1,j-1) F(i,j-1) AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC F(i,j) F(i-1,j)
Overview • Local Alignment (CHAOS) • Multiple Global Alignment (LAGAN) - Whole Genome Alignment • Glocal Alignment (Shuffle-LAGAN) • Biological Story
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC Local Alignment • F(i,j) = max (F(i,j), 0) • Return all paths with a position i,j where F(i,j) > C • Time O( n2 ) for two seqs, O( nk ) for k seqs
Heuristic Local Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC BLAST FASTA
CHAOS: CHAins Of Seeds • Find short matching words (seeds) • Chain them • Rescore chain
CHAOS: Chaining the Seeds seq1 seed • Find seeds at current location in seq1 seq2 location in seq1
CHAOS: Chaining the Seeds seq1 distance cutoff seed • Find seeds at current location in seq1 seq2 location in seq1
CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 seq2 location in seq1
CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box seq2 Search box location in seq1
CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box • Do a range query: seeds are indexed by their diagonal seq2 Search box location in seq1 Range of search
CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box • Do a range query: seeds are indexed by their diagonal. • Pick a previous seed that maximizes the score of chain seq2 Search box location in seq1 Range of search
CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box • Do a range query: seeds are indexed by their diagonal. • Pick a previous seed that maximizes the score of chain seq2 Search box location in seq1 Range of search Time O(n log n), where n is number of seeds.
CHAOS Scoring • Initial score = # matching bp - gaps • Rapid rescoring: extend all seeds to find optimal location for gaps
Overview • Local Alignment (CHAOS) • Multiple Global Alignment (LAGAN) - Whole Genome Alignment • Glocal Alignment (Shuffle-LAGAN) • Biological Story
z x y Global Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
LAGAN: 1. FIND Local Alignments • Find Local Alignments • Chain Local Alignments • Restricted DP
LAGAN: 2. CHAIN Local Alignments • Find Local Alignments • Chain Local Alignments • Restricted DP
LAGAN: 3. Restricted DP • Find Local Alignments • Chain Local Alignments • Restricted DP
MLAGAN: 1. Progressive Alignment Human Baboon Given N sequences, phylogenetic tree Align pairwise, in order of the tree (LAGAN) Mouse Rat
MLAGAN: 2. Multi-anchoring To anchor the (X/Y), and (Z) alignments: X Z Y Z X/Y Z
Cystic Fibrosis (CFTR), 12 species • Human sequence length: 1.8 Mb • Total genomic sequence: 13 Mb Chicken Zebrafish Cow Pig Chimp Human Dog Rat Fugufish Cat Baboon Mouse
CFTR (cont’d ) % Exons Aligned TIME (sec) MAX MEMORY (Mb) LAGAN Mammals 99.7% 550 90 Chicken & Fishes 96% 862 90 MLAGAN Mammals 99.8% 4547 670 Chicken & Fishes 98%
Automatic computational system for comparative analysis of pairs of genomes http://pipeline.lbl.gov Alignments (all pair combinations): Human Genome (Golden Path Assembly) Mouse assemblies: Arachne, Phusion (2001) MGSC v3 (2002) Rat assemblies: January 2003, February 2003 ---------------------------------------------------------- D. Melanogaster vs D. Pseudoobscura February 2003
Tandem Local/Global Approach • Finding a likely mapping for a contig (BLAT)
Progressive Alignment Scheme Human, Mouse and Rat genomes Pairwise M/R mapping no yes Aligned M&R fragments Unaligned M&R sequences Mapping aligned fragments by union of M&R local BLAT hits on the human genome Map to Human Genome yes no no yes M/H and R/H pairwise alignment Unassigned M&R DNA fragments H/M/R MLAGAN alignment M/R pairwise alignment
Computational Time 23 dual 2.2GHz Intel Xeon node PC cluster. Pair-wise rat/mouse – 4 hours Pair-wise rat/human and mouse/human – 2 hours Multiple human/mouse/rat – 9 hours Total wall time: ~ 15 hours
Overview • Local Alignment (CHAOS) • Multiple Global Alignment (LAGAN) - Whole Genome Alignment • Glocal Alignment (Shuffle-LAGAN) • Biological Story
Evolution at the DNA level Deletion Mutation …ACGGTGCAGTTACCA… SEQUENCE EDITS …AC----CAGTCCACCA… REARRANGEMENTS Inversion Translocation Duplication
Local & Global Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC Global Local AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
Glocal Alignment Problem Find least cost transformation of one sequence into another using new operations AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA • Sequence edits • Inversions • Translocations • Duplications • Combinations of above AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
Shuffle-LAGAN A glocal aligner for long DNA sequences
S-LAGAN: Find Local Alignments • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts
S-LAGAN: Build Homology Map • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts
Building the Homology Map b a d c Chain (using Eppstein Galil); each alignment gets a score which is MAX over 4 possible chains. Penalties are affine (event and distance components) • Penalties: • regular • translocation c) inversion d) inverted translocation
S-LAGAN: Build Homology Map • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts
S-LAGAN: Global Alignment • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts
S-LAGAN Results (CFTR) Local Glocal
S-LAGAN Results (CFTR) Hum/Mus Hum/Rat
S-LAGAN results (HOX) • 12 paralogous genes • Conserved order in mammals
S-LAGAN results (HOX) • 12 paralogous genes • Conserved order in mammals
S-LAGAN Results (Chr 20) • Human Chr 20 v. homologous Mouse Chr 2. • 270 Segments of conserved synteny • 70 Inversions
S-LAGAN Results (Whole Genome) • Used Berkeley Genome Pipeline • % Human genome aligned with mouse sequence • Evaluation criteria from Waterston, et al (Nature 2002)
Rearrangements in Human v. Mouse • Preliminary conclusions: • Rearrangements come in all sizes • Duplications worse conserved than other rearranged regions • Simple inversions tend to be most common and most conserved