Developing Bioinformatics Tools for Genome Analysis

Developing Bioinformatics Tools for Genome Analysis Zemin Ning The Wellcome Trust Sanger Institute

Outline of the Talk: • My academic background • Informatics Projects • Sequence alignment • Detecting breakpoints of CNVs • Genome assembly - algorithm • Genome assembly – Tasmanian Devil • 3rd generation sequencing – Oxford Nanopore

Powder Simulation

Hair Dynamics Genetics and Human Hair Structure EAST ASIAN CAUCASIAN AFRICAN

BionformaticsProjects Involved • Sequence Alignment • Ssaha/Ssaha2/SMALT - Alignment tool for Illumina, 454, ABI capillary reads • ssahaSNP – SNP/indel detection, mainly for ABI capillary reads • ssahaEST – EST or cDNA alignment • ssaha_pileup – SNP/indel detection from next-gen data • Genome Assembly • Phusion/Phusion2: Clustering based genome assembly pipeline • Fuzzypath: DeBruijngraph based assembler • iCAS: - a pipeline for Illumina clone assembly • Production of WGS assemblies: • Mouse, Zebrafish, Human (Venter genome), C. Briggsae, Rice, Schisto, Sea Lamprey, Gorilla, Tasmanian Devil, Bamboo, Malaria and many bacterial genomes • Pindel • Detecting indels/SVs from short reads • TraceSeach • Public sequence search facility for all the traces

SMALT Algorithm

ATGGCGTGCAGTCCATGTTCGGATCA ATGGCGTGCAGT TGGCGTGCAGTC GGCGTGCAGTCC GCGTGCAGTCCA CGTGCAGTCCAT Overlap hashing W = N-k+1 (k = 12) ATGGCGTGCAGTCCATGTTCGGATCATTACGTAAGC ATGGGCAGATGT CCATGTTCGGAT CATTACGTAAGC Non-overlap Hashing W = N/k Non-overlap Hashing v Overlap Hashing

Sequence S: (s1s2, …, si, …, sm) i =1,2, …, m K-tuple: (sisi+1...si+k-1) “A” =00; “C” = 01; “G” = 10; “T” = 11 Sequence Representation Using two binary digits for each base, we may have the following representations: For any of the m/k no-overlapping k-tuples in the sequence, an integer may be used to represent the k-tuple in a unique way where bi = 0 or 1, depending on the value of the sequence base and Emax is the maximum value of the possible E values.

E k-tuple Ni Indices and Offsets 0 AA 1 2, 19 1 AC 3 1, 9 2, 5 2, 11 2 AG 2 1, 15 2, 35 3 AT 2 2, 13 3, 3 4 CA 7 2, 3 2, 9 2, 21 2, 27 2, 33 3, 21 3, 23 5 CC 4 1, 21 2, 31 3, 5 3, 7 6 CG 1 1, 5 7 CT 6 1, 23 2, 39 2, 43 3, 13 3, 15 3, 17 8 GA 4 1, 3 1, 17 2, 15 2, 25 9 GC 0 10 GG 5 1, 25 1, 31 2, 17 2, 29 3, 1 11 GT 6 1, 1 1, 27 1, 29 2, 1 2, 37 3, 19 12 TA 1 3, 25 13 TC 6 1, 7 1, 11 1, 19 2, 23 2, 41 3, 11 14 TG 3 1, 13 2, 7 3, 9 15 TT Hash Table: A 2-tuple hashing table of S1, S2 and S3 S1=(GTGACGTCACTCTGAGGATCCCCTGGGTGTGG) S2=(GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT) S3=(GGATCCCCTGTCCTCTCTGTCACATA)

Burrows-Wheeler vs Hashing seed (28 bp) seed (28 bp) seed (28 bp) lo-half hi-half BOWTIE/TOPHAT 3' 5' depth-1st by default, breadth-1st slower no indels seed (32 bp, optional) BWA 5' 3' breadth first, upper bound on edit distance, e.g. max 5 mismatches in 100bp read. Can deal with indels. SMALT/SSAHA2 5' 3' Exact matching k-segment (1 kmer) required. Partial alignments (indels, splice junctions)

Burrows-Wheeler vs Hashing • Strengths and weaknesses (trends) • Burrows-Wheeler, e.g. bwa, bowtie • Fast, esp. (multiple) exact matches • High sensitivity at repetitive regions • less robust at high genomic variation • Hashing (overlapping k-mer words, e.g SMALT/SSAHA2, Stampy, SNAP) • Slower (more memory hungry) • Less sensitivity at repetitive regions • tolerate high genomic variation • partial alignments (junction reads) easier • Flexible (multiple sequencing platforms)

Performance Assessmenton simulated reads human genome 105 read pairs 2 x 100 bp (insert 500) 20% of variations indels (max. 10)

Performance of mappers(genome re-sequencing) Simulated for human genome: 4x106 x 100 bp single reads 1% variation of which 20% indels 14 bp maximum indel length indel length

Sensitivity Assessment ~ 2% genomic variation Reads: M. spretus whole genome shotgun 2 x 108 bp, insert 250 bp Reference: M. musculus NCBI build 37 • independently mapped reads • Count discordant pairs as erroneous mappings

Pindel – A Pattern Growth Approach to DetectStructural Variations ATGCAGC ATCAAGTATGCTTAGC Minimum unique substring: ATG Maximum unique substring: ATGC

String matching by pattern growth ATGCAGC ATCAAGTATGCTTAGC 04 January 2020 16

String matching by pattern growth ATGCAGC ATCAAGTATGCTTAGC Minimum unique substring: ATG Maximum unique substring: ATGC 04 January 2020 19

Deletion with Base Level Breakpoint Precision ATGCAGC ATCAAGTATGCTTAGC Minimum unique substring: ATG Maximum unique substring: ATGC

Indels from LowCov data of 1000g on chr1 Deletion: 88,425 Insertion: 69,710

Assembly Method Sequencing reads: 1. Overlap graph 2. de Bruijn graph 3. String graph

Phusion2 Assembly Pipeline Assembly Data Process Solexa Reads Supercontig Long Insert Reads PRono SOAPdenovo Contigs Reads Group 2x75 or 2x100 Base Correction Fermi ABySS Phrap

ATGGCGTGCAGTCCATGTTCGGATCA ATGGCGTGCAGT TGGCGTGCAGTC GGCGTGCAGTCC GCGTGCAGTCCA CGTGCAGTCCAT ATGGCGTGCAGTCCATGTTCGGATCA ATGGGCAGATGT TGGCCAGTTGTT GGCGAGTCGTTC GCGTGTCCTTCG Kmer Word Hashing Contiguous Base Hash K = 12 Gap-Hash 4x3

Useful Region Real Data Curve Poisson Curve Word use distribution for the mouse sequence data at ~7.5 fold

Making kmers and establishing kmer profile (K=31-63) from the reads Clustering Strategy Reduce cluster size Read Clustering Assembly Contigs Assembly Scaffolds

Relational Matrix R11 R12 R13 ... R1j … R1n R11 R12 R13 ... R1j … R1n R21 R22 R23 ... R2j … R2n R21 R22 R23 ... R2j … R2n R31 R32 R33 ... R3j … R3n R31 R32 R33 ... R3j … R3n ... … Rij … … ... … Rij … … R1 = Rn1 Rn2 Rn3 ... Rnj … Rnn Rn1 Rn2 Rn3 ... Rnj … Rnn R2 = R1. E1 = … R50 = R49. E49

Relation Matrix: R(i,j) – Implementation 1 2 3 4 5 6 … j … 500 1 2 3 4 Number of shared kmer words (< 63) 5 . . . Read index R(i,j) N

Relation Matrix: R(i,j) – number of kmer words shared between read i and read j 1 2 3 4 5 6 … j … N 41 0 0 0 0 1 2 41 37 0 0 0 3 0 37 0 22 0 4 0 0 0 0 27 Group 2: (4,6) 5 0 0 22 0 0 6 0 0 0 27 0 i R(i,j) Group 1: (1,2,3,5) N

Outline of Genome Assemblies

Tasmanian tiger Tasmanian devil Australian Tasmanian

Tasmanian devil Tasmanian devil Wallaby Opossum

Tasmanian devil facial tumour disease (DFTD) • Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils • Transmitted by biting • Commonly metastasises • First observed in 1996 • Primarily affects adults >1yr • Death in 4 – 6 months

DFTD samples for sequencing Area still DFTD free DFTD originated here c.1996 Narawntapu 2007 Mt William 2007 or 2008 Upper Natone 2007 Strain 1, tetraploid Strain 2 Reedy Marsh 2007 Strain 3 “Evolved” Unknown strain Coles Bay Mangalore 2007 Forestier 2007

Devil Genomes Sequenced Tumour 2 (53T) Narawntapu 2007 Mt William Upper Natone 2007 Reedy Marsh 2007 Tumour 1 (87T) Coles Bay Mangalore 2007 Salem - A female Tasmanian Devil lived Taronga Zoo in Sydney. Forestier 2007

Sequencing T. Devil on Illumina: Strategy Tumour or normal genomic DNA Fragments of defined size 0.5, 2, 5, 7, 8, 10 kb Sequencing 2x100bp reads short insert 2x50bp mate pairs Sequencing performed at Illumina

Devil – Opossum Homology Map Based on Hybridisation Results of Devil Paints onto Opossum Chromosomes Opossum Devil 1 4 2a 3a 6 1 2 3 4 5 2b 5 3b X 6 7 8 X Opossum chromosome images were taken from Duke et a. 2007, Chromosome Res 15:361-370

Scaffolds Assigned to Chromosomes using Flow-sorting Data Chr Size (Mb) Scaffolds_assigned Bases_assigned (Mb) Chr1 571 6729 684 Chr2 610 8381 740 Chr3 556 7197 641 Chr4 450 4817 487 Chr5 341 3188 300 Chr6 277 2844 263 ChrX 122 2378 86.6 Unassigned 440 1.23

Variant calling : catalogue of variants in all 4 genomes *Data source: Illumina. Variants removed within 500bp of a contig end, Q(indel) < 30 and Q(GT) < 5.

Oxford Nanopore The Data, the Alignment and the Assembly

1D and 2D Base Calling The 1D vs 2D barcoding refers to whether the complementary strand is used to improve basecalled data. Basically – it gives two shots at examining the same loci. The advantage being that the complementary strand will have a different kmer profile.

Read Length Distribution – 1D and 2D Yeast S288C ONT reads sequenced – 1D & 2D 1D 2D Number of bases: 275Mb 373Mb Number of reads: 47735 66106 N50: 8335 6645 Average length: 5765 5651 Maximum length: 143706 32826

Mapping score: 4 Match identity: 53.4% Mapping score: 19 Match identity: 75.2%

ONT Read Length ONT Assisted Scaffolding – Fake Mate Pairs Step length Segment length Insert Size

ONT Assisted Scaffolding http://sourceforge.net/projects/phusion2/files/smis/ Mate pair data is used to scaffold contigs. Contigs, and pairs of contigs connected by pairs, define a bi-directional graph: Using expected insert size, a estimate of the gap size can be given for each contig.

Spinner – walks through a loop These techniques alone produces useful results. Further stages will be used to resolve repeats pairs that “jump over” repeats, and graph flow concepts.

Scaffold N50 858Kb ; Contig N50 330Kb Scaffold one piece Scaffold 2-3 pieces

SV between S288C and W303

Developing Bioinformatics Tools for Genome Analysis