Variation Detections and De novo Assemblies from Next-gen Data

Zemin Ning, The Wellcome Trust Sanger Institute

Outline of the Talk: • Projects before Bioinformatics • Bioinformatics Projects Involved • Variation Detection • SNP, Indel, CNVs etc • Fuzzypath – short read assembly • Extremely GC Biased Genomes

Powder Simulation

Hair Dynamics: Genetics and Human Hair Structure (figure panels: East Asian, Caucasian, African)

Informatics Projects Involved
• SSAHA (Sequence Search and Alignment by the Hashing Algorithm)
  - ssaha2: alignment tool for Solexa, 454 and ABI capillary reads
  - ssahaSNP: SNP/indel detection, mainly for ABI capillary reads
  - ssahaEST: EST or cDNA alignment
  - ssaha_SV: structural variation (CNV) detection
  - ssaha_pileup: SNP/indel detection from next-gen data
• Phusion
  - Development and maintenance of the pipeline
  - Production of WGS assemblies: mouse, zebrafish, human (Venter genome), C. briggsae, rice, Schisto, sea lamprey, gorilla, malaria and many bacterial genomes
• TraceSearch
  - Public sequence search facility for all the traces
• Fuzzypath
  - Short read assembler

Read mapping by hashing and dynamic programming (figure): query sequences from a FASTQ file are looked up in a hashed database of subject sequences, and the candidate hits are aligned with banded Smith-Waterman.
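The hashing step on this slide can be illustrated with a small sketch. It assumes the SSAHA-style scheme of indexing non-overlapping k-mers of the subject and looking up every overlapping k-mer of the query; the function names, the toy sequences and k = 4 are hypothetical, and the banded Smith-Waterman step that would follow on the best diagonal is not shown.

```python
from collections import defaultdict

def build_index(subject: str, k: int) -> dict:
    """Hash table mapping each sampled k-mer to its start positions in the subject."""
    index = defaultdict(list)
    # Assumption: the subject is sampled with step k (non-overlapping k-mers).
    for pos in range(0, len(subject) - k + 1, k):
        index[subject[pos:pos + k]].append(pos)
    return index

def candidate_diagonals(query: str, index: dict, k: int) -> dict:
    """Count k-mer hits per diagonal (subject position minus query position)."""
    hits = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], ()):
            hits[spos - qpos] += 1
    return dict(hits)

# Toy data: the query is an exact substring of the subject, starting at offset 8.
subject = "TTGACCGTAGGCTAACGTTAGCCATGCAAGT"
query = subject[8:24]
index = build_index(subject, k=4)
diagonals = candidate_diagonals(query, index, k=4)
# The diagonal with the most hits is where a banded alignment would be run.
best_offset = max(diagonals, key=diagonals.get)
```

Hits that share a diagonal are consistent with one gap-free placement of the read, which is why a narrow band around the best diagonal is enough for the dynamic-programming step.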

Pipeline of ssaha_pileup (figure): sequencing reads and a reference FASTA are aligned with ssaha2; ssaha_pairs (PE) or ssaha_clean (SE) produces the uniquely placed cigar read file; ssaha_cigar, ssaha_pileup/cons and ssaha_indel then produce the SNP file and the indel file.

Mapping Score in ssaha2
• The read mapping score measures how repetitive a read's placement is in the genome. In a cigar file entry such as cigar::50, Smap = 50 is the mapping score.
• R: read length; Smax: maximum alignment score (Smith-Waterman) of the hits on the genome; Smax2: second-best alignment score of the hits on the genome.
• Example: a read of 30 bases has a few hits on the genome. Best hit: exact match, Smax = 30. Second-best hit: one base mismatch, Smax2 = 29. The mapping score for this read is Smap = 10.

SNP Confidence Score in ssaha2
The SNP score is calculated as the sum of weighted read mapping scores, combined with base quality. For Solexa reads:
• Smap: read mapping score, from 0 (repeat) to 50 (unique)
• Fq: base quality factor; Fq = 1 if Q >= 30, Fq = 0.5 if Q < 30
• N: read coverage (number of reads) at the location
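The exact formula on this slide is not preserved in the transcript. As one possible reading of "sum of weighted read mapping scores, combined with base quality", the sketch below sums Smap multiplied by Fq over the N reads covering a site. The product rule, the function names and the example reads are assumptions made for illustration, not the tool's confirmed formula.

```python
def base_quality_factor(q: int) -> float:
    """Fq as defined on the slide: 1 for Q >= 30, otherwise 0.5."""
    return 1.0 if q >= 30 else 0.5

def snp_score(reads) -> float:
    """Assumed combination rule: sum of Smap * Fq over the covering reads.

    reads: iterable of (smap, q) pairs, one per read at the site, where
    smap is the mapping score (0 to 50) and q is the base quality.
    """
    return sum(smap * base_quality_factor(q) for smap, q in reads)

# Hypothetical site covered by N = 3 reads.
reads = [(50, 35), (50, 20), (10, 40)]
score = snp_score(reads)  # 50*1 + 50*0.5 + 10*1
```

Under this reading, a uniquely placed read with a high-quality base contributes most, while reads in repeats (low Smap) or with low-quality bases are down-weighted.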

Getting Personal with J. Craig Venter and James Watson

Datasets
• Venter: ABI capillary reads
  - Celera: 19,397,599 reads, 55% in pairs
  - JCVI: 12,541,352 reads, 98% in pairs
  - Total: 31,938,951 reads, 72% in pairs
• Watson: 454 GS FLX reads
  - Baylor & Roche: 74,198,831 reads
  - Single-end reads with length 150-280 bp
• Chromosome X: Illumina reads
  - 140 million paired Solexa reads at ~45x

SNP Results from Three Individuals

Individual / SNP class            Count        % dbSNP
Venter SNP calling (capillary)
  Homozygous SNPs                 1,347,806    97.1%
  Heterozygous SNPs               1,857,167    90.9%
  Total SNPs                      3,204,973    93.5%
Watson SNP calling (454)
  Homozygous SNPs                 1,298,309    93.0%
  Heterozygous SNPs               1,767,951    63.9%
  Total SNPs                      3,066,260    76.3%
X chromosome SNPs (Solexa)
  Homozygous SNPs                 27,708       92.8%
  Heterozygous SNPs               63,197       81.8%
  Total SNPs                      90,905       85.1%

Detection of Structural Variations (figure): sample reads aligned against the reference sequence, with panels illustrating an insertion, a deletion and a VNTR.

Structural Variations against NCBI36

                        Deletion     VNTRs      Insertion
Total number:           2,507        3,775      1,037
Maximum length (bp):    50,000       4,759      -
Minimum length (bp):    20           20         -
Average length (bp):    815          216        -
Affected bases:         2,043,653    817,930    -

                        Deletion     VNTRs      Insertion
Total number:           1,389        553        396
Maximum length (bp):    71,832       9,589      -
Minimum length (bp):    20           20         -
Average length (bp):    1,252        270        -
Affected bases:         1,740,162    149,421    -

Deletion – Size Distribution

VNTRs – Size Distribution

Indel Detection: P. falciparum 3D7 Simulations
Simulated Solexa reads:
• Number of reads: 25,647,985
• Genome size: 23.0 Mbp
• Read length: 36
• Read coverage: 40x
• Number of uniquely placed PE reads: 24,303,362
• Percentage of placed PE reads: 94.5%
• Number of uniquely placed SE reads: 23,229,651
• Percentage of placed SE reads: 90.6%
Detection results:
• Number of deletions: 5,816
• Number of detected deletions: 5,668 (97.5%)
• Number of false positives: 135 (2.3%)
• Number of insertions: 5,816
• Number of detected insertions: 5,458 (93.8%)
• Number of false positives: 15 (0.26%)

Availability ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/ http://www.sanger.ac.uk/Software/analysis/SSAHA2 More information: ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/ssaha_pileup-readme

FuzzyPath and Assemblies from Mixed Solexa/454 Datasets to Extremely GC Biased Genomes

Sequence Reconstruction - Hamiltonian path approach
S = (ATGCAGGTCC)
ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
• Vertices: k-tuples from the spectrum, shown in red in the figure (8): ATG, AGG, TGC, TCC, GTC, GGT, GCA, CAG;
• Edges: overlapping k-tuples (7);
• Path: visiting all vertices corresponding to the sequence.

Sequence Reconstruction - Euler path approach
ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
Two sequences share this spectrum: ATGCGTGGCA and ATGGCGTGCA
• Vertices: correspond to (k-1)-tuples (7): CG, GT, GC, AT, TG, CA, GG;
• Edges: correspond to k-tuples from the spectrum (8);
• Path: visiting all EDGES corresponding to the sequence.
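The Euler path approach can be sketched in a few lines. This is a minimal illustration using Hierholzer's algorithm on the 3-tuple spectrum from the slide; the function name is hypothetical, and the sketch assumes the spectrum admits an Eulerian path. Because two sequences share this spectrum, it returns one valid reconstruction, not necessarily the original sequence.

```python
from collections import defaultdict

def euler_reconstruct(kmers):
    """Rebuild a sequence whose k-mer spectrum equals `kmers` (assumes one exists)."""
    graph = defaultdict(list)   # (k-1)-tuple -> list of successor (k-1)-tuples
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)   # each k-tuple is one edge
        outdeg[prefix] += 1
        indeg[suffix] += 1
    nodes = sorted(set(indeg) | set(outdeg))
    # Start where out-degree exceeds in-degree by one, else at any vertex.
    start = next((v for v in nodes if outdeg[v] - indeg[v] == 1), nodes[0])
    # Hierholzer's algorithm: walk unused edges, backtrack when stuck.
    stack, path = [start], []
    while stack:
        vertex = stack[-1]
        if graph[vertex]:
            stack.append(graph[vertex].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(v[-1] for v in path[1:])

spectrum = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
sequence = euler_reconstruct(spectrum)
rebuilt = sorted(sequence[i:i + 3] for i in range(len(sequence) - 2))
```

The ambiguity between the two reconstructions comes from the vertex TG, which is visited twice; it is the kind of repeat that read pairs or longer reads are later used to resolve.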

Assembly Strategy (figure): forward-reverse paired reads (30-70 bp each, known distance ~500 bp) go to a Solexa reads assembler that extends them into long reads of 1-2 kb; these long reads go to a capillary reads assembler (Phrap/Phusion) to produce the genome/chromosome.

Kmer Extension & Repeat Junctions (figure): A = A1 + A2; B = B1 + B2
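As a rough illustration of k-mer extension stopping at a junction, the sketch below extends a seed one base at a time using a k-mer count table and halts when more than one next k-mer is supported. It is a generic sketch, not FuzzyPath's actual procedure; the function names, the `min_count` threshold and the toy reads are assumptions.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every overlapping k-mer in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def extend_right(seed, counts, min_count=1, max_steps=1000):
    """Extend the seed while exactly one next k-mer is supported."""
    k = len(seed)
    contig = seed
    seen = {seed}
    for _ in range(max_steps):
        tail = contig[-(k - 1):]
        supported = [b for b in "ACGT" if counts[tail + b] >= min_count]
        if len(supported) != 1:
            break  # dead end (0 candidates) or junction (2 or more)
        kmer = tail + supported[0]
        if kmer in seen:
            break  # guard against looping through a cycle
        seen.add(kmer)
        contig += supported[0]
    return contig

# Two toy reads share a prefix and then diverge, creating a junction.
reads = ["ACGTTGCAGGA", "ACGTTGCATTC"]
counts = count_kmers(reads, k=4)
contig = extend_right("ACGT", counts)
```

At the junction the k-mer counts of the two branches add up to the count of the shared path, which is one way to read the slide's A = A1 + A2.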

Handling of Single Base Variations (figure): paths A + B1 and A + B2; B1 = B2; S = A + B1

Fuzzy Kmers: Number of Mismatches between Two Kmers
Kmer_1         A  C  G  T  A  A  C  T  A  A  C  A  G  T  T
               00 01 10 11 00 00 01 11 00 00 01 00 10 11 11
Kmer_2         A  C  G  T  A  A  C  T  C  A  C  A  G  T  T
               00 01 10 11 00 00 01 11 01 00 01 00 10 11 11
Kmer_1^Kmer_2  A  C  G  T  A  A  C  T     A  C  A  G  T  T
               00 00 00 00 00 00 00 00 01 00 00 00 00 00 00
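The XOR comparison above can be reproduced with a short sketch. It uses the 2-bit encoding shown on the slide (A = 00, C = 01, G = 10, T = 11); the function names are hypothetical, and counting the non-zero 2-bit fields with a mask is one way to turn the XOR result into a mismatch count.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode(kmer: str) -> int:
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def mismatches(kmer_a: str, kmer_b: str) -> int:
    """Number of positions at which two equal-length k-mers differ."""
    if len(kmer_a) != len(kmer_b):
        raise ValueError("k-mers must have the same length")
    diff = encode(kmer_a) ^ encode(kmer_b)
    # Fold each 2-bit field onto its low bit, then count the set low bits.
    low_bits = int("01" * len(kmer_a), 2)
    return bin((diff | (diff >> 1)) & low_bits).count("1")

kmer_1 = "ACGTAACTAACAGTT"
kmer_2 = "ACGTAACTCACAGTT"
n_mismatch = mismatches(kmer_1, kmer_2)
```

Two k-mers can then be treated as the same "fuzzy" k-mer when the mismatch count stays at or below a chosen threshold.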

Kmer Extension & Repeat Junctions (figure): pileup of other reads (454, Sanger etc.) at a repeat junction, with branches A1 and A2 and a consensus.
Means to handle repeats:
- Base quality
- Read pair
- Fuzzy kmers
- Closely related reference
- 454 or Sanger reads

Pileup of Solexa and 454 Reads

S. suis P1/7 Solexa/454 Assembly
Solexa reads:
• Number of reads: 3,084,185
• Finished genome size: 2,007,491 bp
• Read length: 39 and 36 bp
• Estimated read coverage: ~55X
• Number of 454 reads: 100,000
• Read coverage of 454: 10X
Assembly features (contig stats):
• Total number of contigs: 73
• Total bases of contigs: 1,999,817 bp
• N50 contig size: 62,508
• Largest contig: 162,190
• Average contig size: 27,394
• Contig coverage over the genome: ~99%
• Contig extension errors: 2
• Mis-assembly errors: 3
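The N50 contig size reported on these assembly slides follows the standard definition: the length of the contig at which the running total, taken from the largest contig downwards, first reaches half of the total assembled bases. A minimal sketch with made-up contig lengths (the real contig lengths are not given in the slides):

```python
def n50(lengths):
    """Contig length at which the sorted cumulative sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

# Hypothetical contig lengths, for illustration only.
example_contigs = [80, 70, 50, 40, 30, 20]
example_n50 = n50(example_contigs)
```

Unlike the average contig size, N50 is weighted towards long contigs, so a few large contigs can give a high N50 even when many small ones remain.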

Salmonella senftenberg Solexa Assembly from Paired-End Reads
Solexa reads:
• Number of reads: 6,000,000
• Finished genome size: ~4.8 Mbp
• Read length: 2x37 bp
• Estimated read coverage: ~92.5X
• Insert size: 170/50-300 bp
Assembly features (contig stats):

                              Solexa      454
Total number of contigs:      75          390
Total bases of contigs:       4.80 Mbp    4.77 Mbp
N50 contig size:              139,353     25,702
Largest contig:               395,600     62,040
Average contig size:          63,969      12,224
Contig coverage on genome:    ~99.8%      99.4%
Contig extension errors:      0
Mis-assembly errors:          0           4

E. coli strain 042 Assembly
Solexa reads:
• Number of reads: 7,055,348
• Finished genome size: 5.35 Mbp
• Read length: 2x36 bp
• Estimated read coverage: ~95X
• Insert size: 170/50-300 bp
Assembly features (contig stats):
• Total number of contigs: 168
• Total bases of contigs: 5.19 Mbp
• N50 contig size: 85,886
• Largest contig: 337,768
• Average contig size: 30,886
• Contig coverage over the genome: ~99%
• Contig extension errors: 1
• Mis-assembly errors: 2
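The estimated read coverage on the paired-end slides can be checked with simple arithmetic. The sketch below assumes that "Number of reads" on these slides counts read pairs, which is the reading under which the reported figures are reproduced; the function name is hypothetical.

```python
def paired_coverage(num_pairs: int, read_length: int, genome_size: float) -> float:
    """Fold coverage from read pairs: pairs * 2 reads * read length / genome size."""
    return num_pairs * 2 * read_length / genome_size

# Values taken from the E. coli 042 and Salmonella paired-end slides.
ecoli_cov = paired_coverage(7_055_348, 36, 5_350_000)
salmonella_cov = paired_coverage(6_000_000, 37, 4_800_000)
```

Both results agree with the slide values of ~95X and ~92.5X, which supports the read-pair interpretation for these two datasets.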

Salmonella delhi5 Solexa Assembly Guided by a Close Reference
Solexa reads:
• Number of reads: 6,346,317
• Finished genome size: 4.7 Mbp
• Read length: 33 bp
• Estimated read coverage: ~40X
• Shredded reference of SpA: 10X
Assembly features (contig stats):
• Total number of contigs: 66
• Total bases of contigs: 4,615,704 bp
• N50 contig size: 168,793
• Largest contig: 401,700
• Average contig size: 69,934
• Contig coverage over the genome: ~98%
• Contig extension errors: 0
• Mis-assembly errors: 2

The Malaria Genome Project

Datasets with Various GC Content (table; only the GC column is preserved): 68.0%, 50.5%, 19.0%, 68.0%, 19.0%, 50.8%, 19.0%, 19.0%, 19.0%, 19.0%

Malaria 3D7 Assemblies

Solexa reads:                 2x36 bp     2x76 bp
Number of reads:              14.0m       9.77m
Finished genome size:         23 Mbp      23 Mbp
Estimated read coverage:      43x         64x
Insert size:                  170 bp      170 bp

Assembly features:
Total number of contigs:      26,926      22,839
Total bases of contigs:       19.2 Mbp    21.1 Mbp
N50 contig size:              1,456       1,621
Largest contig:               9,106       9,825
Average contig size:          706         923
Contig coverage on genome:    ~83.5%      91.7%
Contig extension errors:      ?           ?
Mis-assembly errors:          ?           ?

Acknowledgements:
• Jim Mullikin
• Tony Cox (Illumina, UK)
• Tony Cox (Sanger Institute)
• Adam Spargao
• Yong Gu
• Ben Blackburne
• Hannes Ponstingl
• Daniel Turner
• Michael Quail
• Jane Rogers
• Richard Durbin
