1 / 68

Genome Sequencing and Assembly: Bioinformatics Overview

Explore the principles and methods of sequencing genomes, assembling DNA fragments, predicting structures, and analyzing biological data. Learn about gene detection, evolutionary processes, and genomic characteristics. Discover applications and challenges in bioinformatics.

kennethball
Download Presentation

Genome Sequencing and Assembly: Bioinformatics Overview

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. VL Algorithmische BioInformatik (19710) WS2015/2016Woche 7 - Mittwoch Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie Universität Berlin Including slides by Serafim Batzoglou

  2. Vorlesungsthemen Part 5: Secondary Structures (4) 11. Obtaining Secondary Structure from Sequence 12. Predicting Secondary Structures Part 6: Tertiary Structures (4) 13. Modeling Protein Structure 14. Analyzing Structure-Function Relationships Part 7: Cells and Organisms (8) 15. Proteome and Gene Expression Analysis 16. Clustering Methods and Statistics 17. Systems Biology Part 1: Background Basics (4) 1. The Nucleic Acid World 2. Protein Structure 3. Dealing with Databases Part 2: Sequence Alignments (3) 4. Producing and Analyzing Sequence Alignments 5. Pairwise Sequence Alignment and Database Searching 6. Patterns, Profiles, and Multiple Alignments Part 3: Evolutionary Processes (3) 7. Recovering Evolutionary History 8. Building Phylogenetic Trees Part 4: Genome Characteristics (4) 9. Revealing Genome Features 10. Gene Detection and Genome Annotation 2 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  3. Eugene Myers • Was Professor at University of California, Berkeley • Now at MPI Dresden • His BLAST paper (1990) was cited more that 41.000 times (among the most highly cited papers ever) • Inventor of suffix array data structure “If you take a sequence and just run a gene prediction program on it, the programs don’t usually do very well. But if you take human and mouse sequence, and compare them against each other — looking for similar regions — you get better predictions. And the more genomes we have, the better it will get.” 3 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  4. Two (main) Paradigms of Biology Molecular Paradigm Evolution Paradigm 4 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  5. Comparison of 1196 orthologous genes • Sequence identity between genes in human/mouse • exons: 84.6% • protein: 85.4% • introns: 35% • 5’ UTRs: 67% • 3’ UTRs: 69% • 27 proteins were 100% identical. Makalowski et al., 1996 5 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  6. Full Genomes are available 6 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  7. Full Genomes are available http://genomesonline.org/ 7 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  8. Sequencing Growth Cost of one human genome • 2004: $30,000,000 • 2008: $100,000 • 2010: $10,000 • 2013: $2,000 (today) 8 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  9. Sequencing Applications 100 million species Each individualhas different DNA http://www.1000genomes.org/ Within individual, some cells have different DNA(e.g. cancer) http://cancergenome.nih.gov/ 9 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  10. Typical Bioinformatics Flow Chart 1a. Sequencing 6. Gene & Protein expression data 1b. Analysis of nucleic acid seq. 7. Drug screening 2. Analysis of protein seq. (e.g. gene finding) 3. Molecular structure prediction Ab initio drug design OR Drug compound screening in database of molecules 4. Molecular interaction 8. Genetic variability 5. Metabolic and regulatory networks 10 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  11. Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 So, if a new species/individual is available there are two tasks: • Generategenomicsdata • SequencingandAssembly • Comparetoavailabledata (genomes) • Improvegenefinder • Evolution of organisms? • Why do certain bacteria cause diseases while others do not? • Identification and prioritization of drug targets 11

  12. Genome Assembly 12 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  13. Steps to Assemble a Genome Some Terminology read a 500-900 long word that comes out of sequencer mate pair a pair of reads from two ends of the same insert fragment contig a contiguous sequence formed by several overlapping reads with no gaps supercontig an ordered and oriented set (scaffold) of contigs, usually by mate pairs consensus sequence derived from the sequene multiple alignment of reads in a contig ..ACGATTACAATAGGTT.. 13 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  14. Shotgun Sequencing genomic segment cut many times at random (Shotgun) Get one or two reads from each segment 14 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  15. Reconstructing the Sequence (Fragment Assembly) reads Cover region with high redundancy Overlap & extend reads to reconstruct the original genomic region 15 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  16. Definition of Coverage Length of genomic segment: G Number of reads: N Length of each read: L Definition:Coverage C = N * L / G How much coverage is enough? Lander-Waterman model: Prob[ not covered bp ] = e-C Assuming uniform distribution of reads, C=10 results in 1 gapped region /1,000,000 nucleotides C 16 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  17. Repeats Bacterial genomes: 5% Mammals: 50% Repeat types: • Low-Complexity DNA (e.g. ATATATATACATA…) • Microsatellite repeats (a1…ak)N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG) • Transposons • SINE(Short Interspersed Nuclear Elements) e.g., ALU: ~300-long, 106 copies • LINE(Long Interspersed Nuclear Elements) ~4000-long, 200,000 copies • LTRretroposons(Long Terminal Repeats (~700 bp) at each end) cousins of HIV • Gene Families genes duplicate & then diverge (paralogs) • Recent duplications ~100,000-long, very similar copies 17 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  18. AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT Sequencing and Fragment Assembly 3x109 nucleotides 50% of human DNA is composed of repeats Error! Glued together two distant regions 18 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  19. What can we do about repeats? Two main approaches: • Cluster the reads • Link the reads 19 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  20. What can we do about repeats? Two main approaches: • Cluster the reads • Link the reads 20 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  21. What can we do about repeats? Two main approaches: • Cluster the reads • Link the reads 21 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  22. AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT A R B D R C Sequencing and Fragment Assembly 3x109 nucleotides ARB, CRD or ARD, CRB ? 22 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  23. AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT Sequencing and Fragment Assembly 3x109 nucleotides 23 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  24. Strategies for whole-genome sequencing • Hierarchical – Clone-by-clone • Break genome into many long pieces • Map each long piece onto the genome • Sequence each piece with shotgun Example: Yeast, Worm, Human, Rat • Online version of (1) – Walking • Break genome into many long pieces • Start sequencing each piece with shotgun • Construct map as you go Example: Rice genome • Whole genome shotgun One large shotgun pass on the whole genome Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Dog 24 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  25. cut many times at random Whole Genome Shotgun Sequencing genome forward-reverse paired reads known dist 25 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  26. Fragment Assembly(in whole-genome shotgun sequencing) 26 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  27. Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm 27 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  28. Steps to Assemble a Genome 1. Find overlapping reads 2. Merge some “good” pairs of reads into longer contigs 3. Link contigs to form supercontigs 4. Derive consensus sequence ..ACGATTACAATAGGTT.. 28 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  29. Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  30. Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  31. Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  32. Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  33. Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 K-mer based overlap

  34. Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 Sorting k-mers

  35. 1. Find Overlapping Reads (read, pos., word, orient.) aaactgcag aactgcagt actgcagta … gtacggatc tacggatct gggcccaaa ggcccaaac gcccaaact … actgcagta ctgcagtac gtacggatc tacggatct acggatcta … ctactacac tactacaca (word, read, orient., pos.) aaactgcag aactgcagt acggatcta actgcagta actgcagta cccaaactg cggatctac ctactacac ctgcagtac ctgcagtac gcccaaact ggcccaaac gggcccaaa gtacggatc gtacggatc tacggatct tacggatct tactacaca aaactgcagtacggatct aaactgcag aactgcagt … gtacggatct tacggatct gggcccaaactgcagtac gggcccaaa ggcccaaac … actgcagta ctgcagtac gtacggatctactacaca gtacggatc tacggatct … ctactacac tactacaca 35 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  36. T GA TACA | || || TAGA TAGT 1. Find Overlapping Reads • Find pairs of reads sharing a k-mer, k ~ 24 • Extend to full alignment – throw away if not >98% similar TAGATTACACAGATTAC ||||||||||||||||| TAGATTACACAGATTAC • Caveat: repeats • A k-mer that occurs N times, causes O(N2) read/read comparisons • ALU k-mers could cause up to 1,000,0002 comparisons • Solution: • Discard all k-mers that occur “too often” • Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available 36 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  37. 1. Find Overlapping Reads Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA 37 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  38. 1. Find Overlapping Reads • Correct errors using multiple alignment TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTATTGA TAG-TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTACTGA TAG-TTACACAGATTATTGA insert A correlated errors— probably caused by repeats  disentangle overlaps replace T with C TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA In practice, error correction removes up to 98% of the errors TAG-TTACACAGATTATTGA TAG-TTACACAGATTATTGA 38 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  39. 2. Merge Reads into Contigs • Overlap graph: • Nodes: reads r1…..rn • Edges: overlaps (ri, rj, shift, orientation, score) Reads that come from two regions of the genome (blue and red) that contain the same repeat Note: of course, we don’t know the “color” of these nodes 39 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  40. repeat region 2. Merge Reads into Contigs We want to merge reads up to potential repeat boundaries Unique Contig Overcollapsed Contig 40 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  41. 2. Merge Reads into Contigs • Remove transitively inferable overlaps • If read r overlaps to the right reads r1, r2, and r1 overlaps r2, then (r, r2) can be inferred by (r, r1) and (r1, r2) r r1 r2 r3 41 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  42. 2. Merge Reads into Contigs 42 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  43. Overlap graph after forming contigs Unitigs: Gene Myers, 95 43 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  44. Repeats, errors, and contig lengths • Repeats shorter than read length are easily resolved • Read that spans across a repeat disambiguates order of flanking regions • Repeats with more base pair diffs than sequencing error rate are OK • We throw overlaps between two reads in different copies of the repeat • To make the genome appear less repetitive, try to: • Increase read length • Decrease sequencing error rate Role of error correction: Discards up to 98% of single-letter sequencing errors decreases error rate  decreases effective repeat content  increases contig length 44 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  45. 3. Link Contigs into Supercontigs Normal density Too dense  Overcollapsed Inconsistent links  Overcollapsed? 45 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  46. 3. Link Contigs into Supercontigs Find all links between unique contigs Connect contigs incrementally, if  2 forward-reverse links supercontig (aka scaffold) 46 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  47. 3. Link Contigs into Supercontigs • Fill gaps in supercontigs with paths of repeat contigs • Complex algorithmic step • Exponential number of paths • Forward-reverse links 47 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  48. De Brujin Graph formulation • Given sequence x1…xN, k-mer length k, Graph of 4k vertices, Edges between words with (k-1)-long overlap 48 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  49. 4. Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting (Alternative: take maximum-quality letter) 49 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

  50. Genome Comparison 50 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

More Related