Variation Detections and De novo Assemblies from Next-gen Data

Zemin Ning, The Wellcome Trust Sanger Institute. Outline of the talk: projects before bioinformatics; bioinformatics projects involved; variation detection (SNP, indel, CNVs etc.); Fuzzypath, short read assembly.

Presentation Transcript


  1. Variation Detections and De novo Assemblies from Next-gen Data. Zemin Ning, The Wellcome Trust Sanger Institute

  2. Outline of the Talk: • Projects before Bioinformatics • Bioinformatics Projects Involved • Variation Detection • SNP, Indel, CNVs etc • Fuzzypath – short read assembly • Extremely GC Biased Genomes

  3. Powder Simulation

  4. Hair Dynamics: Genetics and Human Hair Structure (figure panels: East Asian, Caucasian, African)

  5. Informatics Projects Involved
     • SSAHA (Sequence Search and Alignment by the Hashing Algorithm)
       • ssaha2: alignment tool for Solexa, 454 and ABI capillary reads
       • ssahaSNP: SNP/indel detection, mainly for ABI capillary reads
       • ssahaEST: EST or cDNA alignment
       • ssaha_SV: structural variation (CNVs) detection
       • ssaha_pileup: SNP/indel detection from next-gen data
     • Phusion
       • Development and maintenance of the pipeline
       • Production of WGS assemblies: Mouse, Zebrafish, Human (Venter genome), C. briggsae, Rice, Schisto, Sea Lamprey, Gorilla, Malaria and many bacterial genomes
     • TraceSearch
       • Public sequence search facility for all the traces
     • Fuzzypath
       • Short read assembler

  6. Read mapping by hashing and dynamic programming (diagram labels: database of subject sequences; FASTQ file with query sequences; alignment by banded Smith-Waterman)
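
The hashing step named on this slide can be illustrated with a small k-mer index. This is a minimal sketch, not ssaha2's actual implementation: it indexes every k-mer of the subject, and the function names, the example sequences and the choice of k = 4 are all assumptions made for illustration. The banded Smith-Waterman step that would refine each candidate is left out.

```python
from collections import Counter, defaultdict


def build_kmer_index(subject: str, k: int) -> dict:
    """Map every k-mer of the subject to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def candidate_offsets(index: dict, query: str, k: int) -> list:
    """Look up each query k-mer and vote for the implied alignment offset.

    A seed at subject position s and query position q votes for offset s - q.
    Offsets with many votes are candidates for a later banded alignment.
    """
    votes = Counter()
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], ()):
            votes[spos - qpos] += 1
    return votes.most_common()


# Hypothetical example data (not from the talk).
subject = "TTGACCGTAAGCTAGGCTAACGGT"
query = "AAGCTAGGCT"
index = build_kmer_index(subject, 4)
hits = candidate_offsets(index, query, 4)
best_offset, n_seeds = hits[0]
print(best_offset, n_seeds)
```

In this toy case all seven query k-mers agree on one offset, while a repeated k-mer contributes a single stray vote elsewhere; the vote count is what separates the real placement from chance seed matches.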

  7. Pipeline of ssaha_pileup (diagram labels: sequencing reads; reference fasta; alignment with ssaha2; ssaha_pairs for PE reads; ssaha_clean for SE reads; uniquely placed cigar read file; ssaha_cigar; ssaha_pileup; pileup/cons; ssaha_indel; SNP file; indel file)

  8. Mapping Score in ssaha2 (figure labels: Read, Reference; values 27, 14, 25, 30, 21, 29)
     • The read mapping score is used to assess how repetitive the read is in the genome. In the cigar file, the tag cigar::50 means that Smap = 50 is the mapping score.
     • R: read length; Smax: maximum alignment score (Smith-Waterman) of the hits on the genome; Smax2: second-best alignment score of the hits on the genome.
     • Example: a read of 30 bases has a few hits on the genome. Best hit: exact match with Smax = 30. Second-best hit: one base mismatch with Smax2 = 29. The mapping score for this read is Smap = 10.

  9. SNP Confidence Score in ssaha2
     The SNP score is calculated as the sum of weighted read mapping scores, combined with base quality. For Solexa reads:
     • Smap: read mapping score, from 0 (repeat) to 50 (unique)
     • Fq: base quality factor, Fq = 1 if Q >= 30, Fq = 0.5 if Q < 30
     • N: read coverage (number of reads) at the location
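
The exact scoring formula is not preserved in this transcript, so the following is only one plausible reading of the description: each of the N reads covering the site contributes its mapping score Smap multiplied by the base quality factor Fq, and the contributions are summed. The function names and the example reads are hypothetical, and any normalisation the original formula may apply is not modelled.

```python
def base_quality_factor(q: int) -> float:
    """Fq as stated on the slide: 1 if Q >= 30, otherwise 0.5."""
    return 1.0 if q >= 30 else 0.5


def snp_score(reads: list) -> float:
    """Sum of Smap * Fq over the reads covering a site.

    ASSUMPTION: a plain weighted sum; the talk's actual formula is not
    available in the transcript and may differ.
    """
    return sum(smap * base_quality_factor(q) for smap, q in reads)


# Hypothetical reads at one site, as (Smap, base quality Q) pairs.
reads = [(50, 35), (50, 20), (10, 30), (0, 40)]
score = snp_score(reads)
print(score)
```

Under this reading a read placed in a repeat (Smap = 0) adds nothing to the score no matter how good its base quality is, which matches the stated purpose of the mapping score.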

  10. Getting Personal with J. Craig Venter and James Watson

  11. Datasets
      • Venter: ABI capillary reads
        • Celera: 19,397,599 (55% in pairs)
        • JCVI: 12,541,352 (98% in pairs)
        • Total: 31,938,951 (72% in pairs)
      • Watson: 454 GS FLX reads
        • Baylor & Roche: 74,198,831
        • Single-end reads with length 150-280 bp
      • Chromosome X Illumina reads
        • 140 million paired Solexa reads at ~45x

  12. SNP Results from Three Individuals

      Individual / SNP class              Count        % dbSNP
      Venter SNP Calling (Capillary)
        Homozygous SNPs                   1 347 806    97.1%
        Heterozygous SNPs                 1 857 167    90.9%
        Total SNPs                        3 204 973    93.5%
      Watson SNP Calling (454)
        Homozygous SNPs                   1 298 309    93.0%
        Heterozygous SNPs                 1 767 951    63.9%
        Total SNPs                        3 066 260    76.3%
      X Chromosome SNPs (Solexa)
        Homozygous SNPs                   27 708       92.8%
        Heterozygous SNPs                 63 197       81.8%
        Total SNPs                        90 905       85.1%

  13. Detection of Structural Variations (figure panels: insertion, deletion and VNTR, each drawn as sample reads against the reference sequence)

  14. Structural Variations against NCBI36

                               Deletion    VNTRs     Insertion
      Total number:            2507        3775      1037
      Maximum length (bp):     50000       4759
      Minimum length (bp):     20          20
      Average length (bp):     815         216
      Affected bases:          2043653     817930

                               Deletion    VNTR      Insertion
      Total number:            1389        553       396
      Maximum length (bp):     71832       9589
      Minimum length (bp):     20          20
      Average length (bp):     1252        270
      Affected bases:          1740162     149421

  15. Deletion – Size Distribution

  16. VNTRs – Size Distribution

  17. Indel Detection: P. falciparum 3D7 Simulations
      Simulated Solexa reads:
      • Number of reads: 25,647,985
      • Genome size: 23.0 Mbp
      • Read length: 36
      • Read coverage: 40x
      • Number of uniquely placed PE reads: 24,303,362
      • Percentage of placed PE reads: 94.5%
      • Number of uniquely placed SE reads: 23,229,651
      • Percentage of placed SE reads: 90.6%
      Detection results:
      • Number of deletions: 5,816
      • Number of detected deletions: 5,668 (97.5%)
      • Number of false positives: 135 (2.3%)
      • Number of insertions: 5,816
      • Number of detected insertions: 5,458 (93.8%)
      • Number of false positives: 15 (0.26%)

  18. Availability ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/ http://www.sanger.ac.uk/Software/analysis/SSAHA2 More information: ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/ssaha_pileup-readme

  19. FuzzyPath and Assemblies from Mixed Solexa/454 Datasets to Extremely GC Biased Genomes

  20. Sequence Reconstruction - Hamiltonian path approach
      S = (ATGCAGGTCC)
      ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
      Graph vertices as drawn: ATG, AGG, TGC, TCC, GTC, GGT, GCA, CAG
      • Vertices: k-tuples from the spectrum shown in red (8);
      • Edges: overlapping k-tuples (7);
      • Path: visiting all vertices corresponding to the sequence.
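
The Hamiltonian path formulation can be tried on the slide's own sequence. This is a minimal sketch under stated assumptions: the function names are made up for illustration, the spectrum is assumed to contain each k-tuple once, and the search is a plain depth-first backtracking that is only practical for toy inputs.

```python
def spectrum(seq: str, k: int) -> list:
    """All k-tuples of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamiltonian_reconstruct(kmers: list):
    """Rebuild a sequence by visiting every k-tuple (vertex) exactly once.

    An edge a -> b exists when the last k-1 bases of a equal the first
    k-1 bases of b. Returns the spelled sequence, or None if no path exists.
    """
    kmers = sorted(set(kmers))
    succ = {a: [b for b in kmers if a != b and a[1:] == b[:-1]] for a in kmers}

    def extend(path, seen):
        if len(path) == len(kmers):
            return path
        for nxt in succ[path[-1]]:
            if nxt not in seen:
                found = extend(path + [nxt], seen | {nxt})
                if found:
                    return found
        return None

    for start in kmers:
        path = extend([start], {start})
        if path:
            return path[0] + "".join(kmer[-1] for kmer in path[1:])
    return None


kmers = spectrum("ATGCAGGTCC", 3)
rebuilt = hamiltonian_reconstruct(kmers)
print(len(kmers), rebuilt)
```

The backtracking is what makes this approach costly in general: finding a Hamiltonian path is NP-complete, which is the usual motivation for the Euler path formulation on the next slide.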

  21. Sequence Reconstruction - Euler path approach
      Graph vertices as drawn: CG, GT, GC, AT, TG, CA, GG
      ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
      Sequences with this spectrum: ATGCGTGGCA, ATGGCGTGCA
      • Vertices: correspond to (k-1)-tuples (7);
      • Edges: correspond to k-tuples from the spectrum (8);
      • Path: visiting all EDGES corresponding to the sequence.
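
The two sequences on this slide share one spectrum, which a small enumeration makes concrete. The sketch below is an illustration only: it lists every Euler path by exhaustive backtracking rather than using a linear-time method such as Hierholzer's algorithm, and the function name is hypothetical.

```python
from collections import Counter


def euler_reconstructions(kmers: list) -> list:
    """Enumerate every sequence that uses each k-tuple (edge) exactly once.

    Vertices are (k-1)-tuples; each k-tuple is an edge from its prefix
    to its suffix. Exhaustive search, suitable for toy inputs only.
    """
    out_deg = Counter(kmer[:-1] for kmer in kmers)
    in_deg = Counter(kmer[1:] for kmer in kmers)
    nodes = set(out_deg) | set(in_deg)
    # An Euler path starts at the vertex with one more outgoing than incoming
    # edge; if the graph is balanced, any vertex may start a circuit.
    starts = sorted(n for n in nodes if out_deg[n] - in_deg[n] == 1) or sorted(nodes)

    results = set()

    def walk(node, used, seq):
        if len(used) == len(kmers):
            results.add(seq)
            return
        for i, kmer in enumerate(kmers):
            if i not in used and kmer[:-1] == node:
                walk(kmer[1:], used | {i}, seq + kmer[-1])

    for start in starts:
        walk(start, frozenset(), start)
    return sorted(results)


edges = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
sequences = euler_reconstructions(edges)
print(sequences)
```

Both outputs are valid reconstructions, so the spectrum alone cannot decide between them. Extra information such as read pairs or longer reads is needed to resolve the branch at TG and GC.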

  22. Assembly Strategy (diagram labels: forward-reverse paired reads, 30-70 bp each, known distance ~500 bp; Solexa reads assembler to extend long reads of 1-2 Kb; capillary reads assembler Phrap/Phusion; genome/chromosome)

  23. Kmer Extension & Repeat Junctions (figure labels: A = A1 + A2; B = B1 + B2)

  24. Handling of Single Base Variations (figure labels: A, B1, B2; B1 = B2; S = A + B1)

  25. Fuzzy Kmers: Number of Mismatches between Two Kmers

      Kmer_1          A  C  G  T  A  A  C  T  A  A  C  A  G  T  T
                      00 01 10 11 00 00 01 11 00 00 01 00 10 11 11
      Kmer_2          A  C  G  T  A  A  C  T  C  A  C  A  G  T  T
                      00 01 10 11 00 00 01 11 01 00 01 00 10 11 11
      Kmer_1^Kmer_2   A  C  G  T  A  A  C  T     A  C  A  G  T  T
                      00 00 00 00 00 00 00 00 01 00 00 00 00 00 00
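
The XOR comparison on this slide can be written out directly. The sketch uses the slide's 2-bit encoding (A = 00, C = 01, G = 10, T = 11); the function names are hypothetical, and this is an illustration of the bit trick rather than Fuzzypath's own code.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(kmer: str) -> int:
    """Pack a kmer into an integer, two bits per base, first base highest."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def mismatches(kmer_a: str, kmer_b: str) -> int:
    """Count positions where two equal-length kmers differ, via XOR."""
    if len(kmer_a) != len(kmer_b):
        raise ValueError("kmers must have the same length")
    diff = encode(kmer_a) ^ encode(kmer_b)
    # A base differs if either bit of its 2-bit pair is set. OR the high bit
    # of each pair onto the low bit, then keep only the low bits.
    low_bits = int("01" * len(kmer_a), 2)
    collapsed = (diff | (diff >> 1)) & low_bits
    return bin(collapsed).count("1")


kmer_1 = "ACGTAACTAACAGTT"
kmer_2 = "ACGTAACTCACAGTT"
print(mismatches(kmer_1, kmer_2))
```

Counting set bits in the raw XOR would overcount substitutions such as A to T, which flip both bits of a pair; collapsing each pair first gives one count per mismatching base.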

  26. Kmer Extension & Repeat Junctions
      Figure: pileup of other reads (454, Sanger, etc.) at a repeat junction; labels A1, A2, Consensus.
      Means to handle repeats:
      - Base quality
      - Read pair
      - Fuzzy kmers
      - Closely related reference
      - 454 or Sanger reads

  27. Pileup of Solexa and 454 Reads

  28. S. suis P1/7 Solexa/454 Assembly
      Solexa reads:
      • Number of reads: 3,084,185
      • Finished genome size: 2,007,491 bp
      • Read length: 39 and 36 bp
      • Estimated read coverage: ~55X
      • Number of 454 reads: 100,000
      • Read coverage of 454: 10X
      Assembly features (contig stats):
      • Total number of contigs: 73
      • Total bases of contigs: 1,999,817 bp
      • N50 contig size: 62,508
      • Largest contig: 162,190
      • Average contig size: 27,394
      • Contig coverage over the genome: ~99%
      • Contig extension errors: 2
      • Mis-assembly errors: 3
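
The contig statistics quoted on this and the following slides can be computed from a list of contig lengths. The sketch below uses the common definition of N50 (the length of the contig at which the running total, taken from the largest contig down, first reaches half of the assembled bases); the talk does not state which definition it used, and the contig lengths here are invented for illustration.

```python
def n50(lengths: list) -> int:
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


def contig_stats(lengths: list) -> dict:
    """Summary figures of the kind listed on the assembly slides."""
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "n50": n50(lengths),
        "largest": max(lengths),
        "average": sum(lengths) // len(lengths),
    }


# Hypothetical contig lengths, not data from the talk.
stats = contig_stats([100, 80, 60, 40, 20])
print(stats)
```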

  29. Salmonella Senftenberg Solexa Assembly from Paired-End Reads
      Solexa reads:
      • Number of reads: 6,000,000
      • Finished genome size: ~4.8 Mbp
      • Read length: 2x37 bp
      • Estimated read coverage: ~92.5X
      • Insert size: 170/50-300 bp
      Assembly features (contig stats):

                                     Solexa       454
      Total number of contigs:       75           390
      Total bases of contigs:        4.80 Mbp     4.77 Mbp
      N50 contig size:               139,353      25,702
      Largest contig:                395,600      62,040
      Average contig size:           63,969       12,224
      Contig coverage on genome:     ~99.8%       99.4%
      Contig extension errors:       0
      Mis-assembly errors:           0            4

  30. E. coli strain 042 Assembly
      Solexa reads:
      • Number of reads: 7,055,348
      • Finished genome size: 5.35 Mbp
      • Read length: 2x36 bp
      • Estimated read coverage: ~95X
      • Insert size: 170/50-300 bp
      Assembly features (contig stats):
      • Total number of contigs: 168
      • Total bases of contigs: 5.19 Mbp
      • N50 contig size: 85,886
      • Largest contig: 337,768
      • Average contig size: 30,886
      • Contig coverage over the genome: ~99%
      • Contig extension errors: 1
      • Mis-assembly errors: 2

  31. Salmonella delhi5 Solexa Assembly Guided by a Close Reference
      Solexa reads:
      • Number of reads: 6,346,317
      • Finished genome size: 4.7 Mbp
      • Read length: 33 bp
      • Estimated read coverage: ~40X
      • Shredded reference of SpA: 10X
      Assembly features (contig stats):
      • Total number of contigs: 66
      • Total bases of contigs: 4,615,704 bp
      • N50 contig size: 168,793
      • Largest contig: 401,700
      • Average contig size: 69,934
      • Contig coverage over the genome: ~98%
      • Contig extension errors: 0
      • Mis-assembly errors: 2

  32. The Malaria Genome Project

  33. Datasets with Various GC Content (GC values shown: 68.0%, 50.5%, 19.0%, 68.0%, 19.0%, 50.8%, 19.0%, 19.0%, 19.0%, 19.0%)

  34. Malaria 3D7 Assemblies

      Solexa reads:                  2x36 bp      2x76 bp
      Number of reads:               14.0m        9.77m
      Finished genome size:          23 Mbp       23 Mbp
      Estimated read coverage:       43x          64x
      Insert size:                   170 bp       170 bp

      Assembly features:
      Total number of contigs:       26,926       22,839
      Total bases of contigs:        19.2 Mbp     21.1 Mbp
      N50 contig size:               1456         1621
      Largest contig:                9106         9825
      Average contig size:           706          923
      Contig coverage on genome:     ~83.5%       91.7%
      Contig extension errors:       ?            ?
      Mis-assembly errors:           ?            ?
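
The estimated coverage figures on this slide follow from the usual relation coverage = sequenced bases / genome size. The short check below assumes that "Number of reads" counts read pairs, since that is the reading under which the arithmetic agrees with the quoted 43x and 64x; the function name is hypothetical.

```python
def read_coverage(n_pairs: float, read_length: int, genome_size: float) -> float:
    """Average depth for paired reads: two reads per pair, each read_length long.

    ASSUMPTION: n_pairs counts pairs, not individual reads.
    """
    return n_pairs * 2 * read_length / genome_size


cov_36 = read_coverage(14.0e6, 36, 23e6)   # 2x36 bp dataset
cov_76 = read_coverage(9.77e6, 76, 23e6)   # 2x76 bp dataset
print(round(cov_36, 1), round(cov_76, 1))
```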

  35. Acknowledgements: • Jim Mullikin • Tony Cox – Illumina, UK • Tony Cox – Sanger Institute • Adam Spargo • Yong Gu • Ben Blackburne • Hannes Ponstingl • Daniel Turner • Michael Quail • Jane Rogers • Richard Durbin
