1 / 59

Algorithms for Alignment of Genomic Sequences

Algorithms for Alignment of Genomic Sequences. Michael Brudno Department of Computer Science Stanford University PGA Workshop 07/16/2004. Conservation Implies Function. Gene. Exon. CNS: Other Conserved. Edit Distance Model (1).

garan
Download Presentation

Algorithms for Alignment of Genomic Sequences

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Algorithms for Alignment of Genomic Sequences Michael Brudno Department of Computer Science Stanford University PGA Workshop 07/16/2004

  2. Conservation Implies Function Gene Exon CNS: Other Conserved

  3. Edit Distance Model (1) Weighted sum of insertions, deletions & mutations to transform one string into another AGGCACA--CA AGGCACACA | |||| || or| || || A--CACATTCA ACACATTCA

  4. Edit Distance Model (2) Given: x, y Define: F(i,j) = Score of best alignment of x1…xi to y1…yj Recurrence: F(i,j) = max ( F(i-1,j) – GAP_PENALTY, F(i,j-1) – GAP_PENALTY, F(i-1,j-1) + SCORE(xi, yj))

  5. Edit Distance Model (3) AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA • F(i,j) = Score of best alignment ending at i,j • Time O( n2 ) for two seqs, O( nk ) for k seqs F(i-1,j-1) F(i,j-1) AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC F(i,j) F(i-1,j)

  6. Overview • Local Alignment (CHAOS) • Multiple Global Alignment (LAGAN) - Whole Genome Alignment • Glocal Alignment (Shuffle-LAGAN) • Biological Story

  7. AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC Local Alignment • F(i,j) = max (F(i,j), 0) • Return all paths with a position i,j where F(i,j) > C • Time O( n2 ) for two seqs, O( nk ) for k seqs

  8. Heuristic Local Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC BLAST FASTA

  9. CHAOS: CHAins Of Seeds • Find short matching words (seeds) • Chain them • Rescore chain

  10. CHAOS: Chaining the Seeds seq1 seed • Find seeds at current location in seq1 seq2 location in seq1

  11. CHAOS: Chaining the Seeds seq1 distance cutoff seed • Find seeds at current location in seq1 seq2 location in seq1

  12. CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 seq2 location in seq1

  13. CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box seq2 Search box location in seq1

  14. CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box • Do a range query: seeds are indexed by their diagonal seq2 Search box location in seq1 Range of search

  15. CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box • Do a range query: seeds are indexed by their diagonal. • Pick a previous seed that maximizes the score of chain seq2 Search box location in seq1 Range of search

  16. CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box • Do a range query: seeds are indexed by their diagonal. • Pick a previous seed that maximizes the score of chain seq2 Search box location in seq1 Range of search Time O(n log n), where n is number of seeds.

  17. CHAOS Scoring • Initial score = # matching bp - gaps • Rapid rescoring: extend all seeds to find optimal location for gaps

  18. Overview • Local Alignment (CHAOS) • Multiple Global Alignment (LAGAN) - Whole Genome Alignment • Glocal Alignment (Shuffle-LAGAN) • Biological Story

  19. z x y Global Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

  20. LAGAN: 1. FIND Local Alignments • Find Local Alignments • Chain Local Alignments • Restricted DP

  21. LAGAN: 2. CHAIN Local Alignments • Find Local Alignments • Chain Local Alignments • Restricted DP

  22. LAGAN: 3. Restricted DP • Find Local Alignments • Chain Local Alignments • Restricted DP

  23. MLAGAN: 1. Progressive Alignment Human Baboon Given N sequences, phylogenetic tree Align pairwise, in order of the tree (LAGAN) Mouse Rat

  24. MLAGAN: 2. Multi-anchoring To anchor the (X/Y), and (Z) alignments: X Z Y Z X/Y Z

  25. Cystic Fibrosis (CFTR), 12 species • Human sequence length: 1.8 Mb • Total genomic sequence: 13 Mb Chicken Zebrafish Cow Pig Chimp Human Dog Rat Fugufish Cat Baboon Mouse

  26. CFTR (cont’d ) % Exons Aligned TIME (sec) MAX MEMORY (Mb) LAGAN Mammals 99.7% 550 90 Chicken & Fishes 96% 862 90 MLAGAN Mammals 99.8% 4547 670 Chicken & Fishes 98%

  27. Automatic computational system for comparative analysis of pairs of genomes http://pipeline.lbl.gov Alignments (all pair combinations): Human Genome (Golden Path Assembly) Mouse assemblies: Arachne, Phusion (2001) MGSC v3 (2002) Rat assemblies: January 2003, February 2003 ---------------------------------------------------------- D. Melanogaster vs D. Pseudoobscura February 2003

  28. Tandem Local/Global Approach • Finding a likely mapping for a contig (BLAT)

  29. Progressive Alignment Scheme Human, Mouse and Rat genomes Pairwise M/R mapping no yes Aligned M&R fragments Unaligned M&R sequences Mapping aligned fragments by union of M&R local BLAT hits on the human genome Map to Human Genome yes no no yes M/H and R/H pairwise alignment Unassigned M&R DNA fragments H/M/R MLAGAN alignment M/R pairwise alignment

  30. Computational Time 23 dual 2.2GHz Intel Xeon node PC cluster. Pair-wise rat/mouse – 4 hours Pair-wise rat/human and mouse/human – 2 hours Multiple human/mouse/rat – 9 hours Total wall time: ~ 15 hours

  31. Distribution of Large Indels

  32. Evolution Over a Chromosome

  33. Overview • Local Alignment (CHAOS) • Multiple Global Alignment (LAGAN) - Whole Genome Alignment • Glocal Alignment (Shuffle-LAGAN) • Biological Story

  34. Evolution at the DNA level Deletion Mutation …ACGGTGCAGTTACCA… SEQUENCE EDITS …AC----CAGTCCACCA… REARRANGEMENTS Inversion Translocation Duplication

  35. Local & Global Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC Global Local AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

  36. Glocal Alignment Problem Find least cost transformation of one sequence into another using new operations AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA • Sequence edits • Inversions • Translocations • Duplications • Combinations of above AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

  37. Shuffle-LAGAN A glocal aligner for long DNA sequences

  38. S-LAGAN: Find Local Alignments • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts

  39. S-LAGAN: Build Homology Map • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts

  40. Building the Homology Map b a d c Chain (using Eppstein Galil); each alignment gets a score which is MAX over 4 possible chains. Penalties are affine (event and distance components) • Penalties: • regular • translocation c) inversion d) inverted translocation

  41. S-LAGAN: Build Homology Map • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts

  42. S-LAGAN: Global Alignment • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts

  43. S-LAGAN Results (CFTR) Local Glocal

  44. S-LAGAN Results (CFTR) Hum/Mus Hum/Rat

  45. S-LAGAN Results (IGF cluster)

  46. S-LAGAN results (HOX) • 12 paralogous genes • Conserved order in mammals

  47. S-LAGAN results (HOX) • 12 paralogous genes • Conserved order in mammals

  48. S-LAGAN Results (Chr 20) • Human Chr 20 v. homologous Mouse Chr 2. • 270 Segments of conserved synteny • 70 Inversions

  49. S-LAGAN Results (Whole Genome) • Used Berkeley Genome Pipeline • % Human genome aligned with mouse sequence • Evaluation criteria from Waterston, et al (Nature 2002)

  50. Rearrangements in Human v. Mouse • Preliminary conclusions: • Rearrangements come in all sizes • Duplications worse conserved than other rearranged regions • Simple inversions tend to be most common and most conserved

More Related