Assembling Genome

Timothee Cezard, EBI NGS workshop, 16/10/2012

Assembly Algorithms
• Goal: find the shortest common superstring of a set of reads.
• This is an NP-hard problem, so we need to use an approximation algorithm.
• Main algorithms used:
  • Overlap-Layout-Consensus
  • De Bruijn graphs
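One common approximation for the shortest common superstring is a greedy merge: repeatedly join the two reads with the longest suffix-prefix overlap. The sketch below is illustrative only; the function names `overlap_len` and `greedy_superstring` are made up for this example, it assumes error-free reads and a non-empty input, and it is not how any particular assembler is implemented.

```python
def overlap_len(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(reads):
    """Greedy approximation: merge the best-overlapping pair until one string is left."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_len(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = reads[i] + reads[j][n:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0]


print(greedy_superstring(["ATTCACGTAG", "CGTAGTGGCAT"]))  # ATTCACGTAGTGGCAT
```

The greedy choice is fast to describe but gives no guarantee of the shortest result, which is why it is only an approximation.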

Overlap-layout-consensus Step 1: Find Overlapping Reads
Needs an efficient alignment algorithm
Doesn’t scale well when the number of reads is high
Use seed-based alignment with extension
TACATAGATTACACAGATTACTGA
||  ||||||||||||||||||
TAGTTAGATTACACAGATTACTAGA
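The seed-and-extend idea can be sketched for the simplest case, an exact suffix-prefix overlap. This is a minimal illustration, not a real aligner: it tolerates no mismatches, and the function name `overlap` and the `min_length` parameter are assumptions introduced here.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a matching a prefix of b exactly.

    Returns 0 if no overlap of at least min_length exists.
    """
    start = 0
    while True:
        # Seed: look for the first min_length bases of b somewhere in a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # Extend: check that everything from the seed to the end of a matches b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1  # try the next seed hit


print(overlap("ATTCACGTAG", "CGTAGTGGCAT"))  # 5
```

Real overlappers also allow mismatches and indels during the extension, which this sketch leaves out.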

Overlap-layout-consensus Step 2: Construct overlap graph
• A graph is constructed:
  • Nodes are reads
  • Edges represent overlapping reads
Overlap graph example: ATTCACGTAG → CGTAGTGGCAT (the suffix CGTAG of the first read is the prefix of the second)
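A small sketch of the graph construction, assuming exact suffix-prefix overlaps and an all-vs-all comparison (which is exactly the part that scales poorly). The names `overlap` and `overlap_graph` and the `min_length` threshold are illustrative choices, not taken from any assembler.

```python
from itertools import permutations


def overlap(a, b, min_length=3):
    """Longest exact suffix(a)/prefix(b) overlap of at least min_length, else 0."""
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_length=3):
    """Map each directed edge (read_a, read_b) to its overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):  # every ordered pair of distinct reads
        n = overlap(a, b, min_length)
        if n > 0:
            graph[(a, b)] = n
    return graph


print(overlap_graph(["ATTCACGTAG", "CGTAGTGGCAT"]))
```

Edges are directed: the read whose suffix overlaps comes first.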

Overlap-layout-consensus Step 3: Find Contigs
Try to find a Hamiltonian path:
• a path in the graph that contains each node exactly once
• computationally expensive
Example reads: ATTCACGTAG, CGTAGTGGCAT
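To see why this step is expensive, here is a brute-force sketch that tries every ordering of the reads. The third read `GGCATTCAA` is hypothetical, added only so the path has more than one edge; the function names and the edge dictionary layout are also assumptions made for this example.

```python
from itertools import permutations


def hamiltonian_path(nodes, edges):
    """Return an ordering that visits every node once along existing edges, or None.

    Brute force: tries up to n! orderings, which is why this does not scale.
    """
    for order in permutations(nodes):
        if all((a, b) in edges for a, b in zip(order, order[1:])):
            return list(order)
    return None


def merge(path, edges):
    """Spell the contig by appending the non-overlapping tail of each read."""
    contig = path[0]
    for a, b in zip(path, path[1:]):
        contig += b[edges[(a, b)]:]
    return contig


reads = ["CGTAGTGGCAT", "GGCATTCAA", "ATTCACGTAG"]  # GGCATTCAA is a made-up read
edges = {  # (left read, right read) -> overlap length
    ("ATTCACGTAG", "CGTAGTGGCAT"): 5,
    ("CGTAGTGGCAT", "GGCATTCAA"): 5,
}
path = hamiltonian_path(reads, edges)
print(path)
print(merge(path, edges))  # ATTCACGTAGTGGCATTCAA
```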

Overlap-layout-consensus
• This approach is used in Celera (CABOG), Newbler, Mira, SGA…
• It is mostly used with Sanger or 454 data.
• Can’t assemble repeats longer than the read length
• Could come back if reads get longer.

De Bruijn Graphs example
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, …”
Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman & Hall
Velvet example courtesy of J. Leipzig 2010

De Bruijn Graphs example
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…
Generate random ‘reads’. How do we assemble?
fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof … etc. to tens of millions of reads
Traditional all-vs-all comparisons of datasets this size require immense computational resources.
De Bruijn solution: construct a graph efficiently

De Bruijn Graphs Step 1: “Kmerize” the data
Reads and their kmers (k=3):
theageofwi: the hea eag age geo eof ofw fwi
sthebestof: sth the heb ebe bes est sto tof
astheageof: ast sth the hea eag age geo eof
worstoftim: wor ors rst sto tof oft fti tim
imesitwast: ime mes esi sit itw twa was ast
… etc. for all reads in the dataset
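Kmerizing is a sliding window of width k over each read. A minimal sketch, with `kmerize` as an illustrative name:

```python
def kmerize(read, k):
    """Return every substring of length k, in order of position in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


print(kmerize("theageofwi", 3))
# ['the', 'hea', 'eag', 'age', 'geo', 'eof', 'ofw', 'fwi']
```

A read of length L yields L - k + 1 kmers, so the 10-letter reads above give 8 kmers each at k=3.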

De Bruijn Graphs Step 2: Build the graph
Look for k-1 overlaps, given by the reads.
[Figure: kmers linked by their k-1 overlaps, e.g. ast → sth → the → hea → eag → age → geo → eof → ofw → fwi, with a branch the → heb → ebe → bes → est → sto → tof]
… etc. for all ‘kmers’ in the dataset
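Because consecutive kmers in a read always overlap by k-1 letters, the edges can be read straight off the reads with no all-vs-all comparison. The sketch below follows the slide's convention of kmers as nodes; the name `de_bruijn` and the dictionary-of-sets representation are assumptions for this example.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each kmer to the set of kmers that follow it in at least one read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)  # left and right share k-1 letters
    return graph


graph = de_bruijn(["theageofwi", "sthebestof", "astheageof"], 3)
print(sorted(graph["the"]))  # ['hea', 'heb']
```

The kmer `the` has two successors because it occurs in two different contexts; this is where the graph branches.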

De Bruijn Graphs Step 3: Simplify the graph

De Bruijn Graphs Step 4: Create contigs
No single solution! Break the graph to give the final assembly
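Simplifying and breaking the graph can be sketched together: chains of kmers with no branching are collapsed into one sequence, and every branch point ends a contig. This is a simplified illustration under several assumptions (no error removal, no coverage information, cycles made only of unbranched nodes are ignored); the function name `contigs` is made up here.

```python
from collections import defaultdict


def contigs(reads, k):
    """Collapse unbranched paths of the kmer graph into contigs."""
    succ, pred, nodes = defaultdict(set), defaultdict(set), set()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)

    def unambiguous(a, b):
        # The edge a -> b can be merged only if a has one way out and b one way in.
        return len(succ[a]) == 1 and len(pred[b]) == 1

    result = []
    for node in sorted(nodes):
        if len(pred[node]) == 1 and unambiguous(next(iter(pred[node])), node):
            continue  # interior of a chain; it is reached from the chain's start
        contig, cur = node, node
        while len(succ[cur]) == 1:
            nxt = next(iter(succ[cur]))
            if not unambiguous(cur, nxt):
                break
            contig += nxt[-1]  # each step adds one new letter
            cur = nxt
        result.append(contig)
    return result


print(contigs(["itwasthebest", "itwastheworst"], 3))
# ['hebest', 'heworst', 'itwasthe']
```

The shared prefix is assembled once, and the graph is broken at `the`, where the two reads diverge.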

De Bruijn example
The final assembly (k=3): foolishness itwasthe st wor wisdom times age incredulity epoch be of belief
Repeat with a longer “kmer” length.
A better assembly (k=10): itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…
Why not always use the longest ‘k’ possible? Sequencing errors (read sthebestof sequenced as sthebentof):
k=3: sth the heb ebe ben ent nto tof (kmers mostly unaffected)
k=10: sthebentof (100% wrong kmer)
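The trade-off can be checked by counting how many kmers of an erroneous read are absent from the true read. A small sketch using the slide's example; `wrong_fraction` is an illustrative helper name.

```python
def kmerize(read, k):
    """Return every substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def wrong_fraction(true_read, observed_read, k):
    """Fraction of the observed read's kmers that do not occur in the true read."""
    true_kmers = set(kmerize(true_read, k))
    observed = kmerize(observed_read, k)
    wrong = [kmer for kmer in observed if kmer not in true_kmers]
    return len(wrong) / len(observed)


print(wrong_fraction("sthebestof", "sthebentof", 3))   # 0.375 (ben, ent, nto)
print(wrong_fraction("sthebestof", "sthebentof", 10))  # 1.0
```

A single error corrupts up to k kmers, so the larger k is, the larger the share of a read's kmers that one error destroys.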

Strengths and problems of the De Bruijn approach
• Strengths:
  • No need to calculate the overlaps
  • Size of the final graph is a function of the genome size
  • Repeats are collapsed
• Problems:
  • Can only resolve repeats up to k long
  • Lose connectivity when creating the contigs

Resolve repeats through scaffolding
Contigs from assembly
Align reads from short-insert or long-insert libraries
Join contigs using evidence from paired-end data
Scaffold

De Bruijn assemblers
• Velvet: http://www.ebi.ac.uk/~zerbino/velvet/
• ABySS: http://www.bcgsc.ca/platform/bioinfo/software/abyss
• SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html
• ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
• IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/

What makes an assembly good?
• High coverage: 50 to 100X
• Different but precise insert-size libraries
• Few or no sequencing errors
• Avoid large numbers of variants
• Try different assemblers
• Need a machine with a lot of memory (from 16 GB to 1 TB)

What makes your assembly better?
• Error correction: correct the reads before assembly
  • http://bib.oxfordjournals.org/content/early/2012/04/06/bib.bbs015.full
  • SOAPdenovo
  • Reptile: http://aluru-sun.ece.iastate.edu/doku.php?id=reptile
  • SGA: https://github.com/jts/sga
• Joining overlapping reads:
  • COPE: ftp://ftp.genomics.org.cn/pub/cope/
  • FLASH: http://genomics.jhu.edu/software/FLASH/index.shtml

What makes your assembly better?
• Gap filling: IMAGE (Tsai et al., Genome Biology 2010)

Assembly validation
• N50 is the most commonly used metric:
  • Weighted median such that 50% of your assembly is contained in contigs of length >= N50
• CEGMA: Core Eukaryotic Genes Mapping Approach
  • Looks in your assembly for genes that should be there
  • Usually the best assemblies have the best CEGMA scores
  • http://korflab.ucdavis.edu/datasets/cegma/
• There is no magic tool
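The N50 definition translates directly into a few lines: sort the contigs from longest to shortest and walk down until half of the total assembly length is covered. A minimal sketch, with `n50` as an illustrative function name and made-up contig lengths:

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


# Total is 54; the three longest contigs (10 + 9 + 8 = 27) reach half of it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

N50 says nothing about correctness: wrongly joined contigs raise it, which is one reason to check it alongside a gene-content measure such as CEGMA.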
