Ancient DNA Research: Hybridization Capture and High-Throughput Sequencing

Hybridization capture, high-throughput sequencing and its implications for ancient DNA research
Michael Hofreiter

Is science becoming infantilized?
"Our young people are undisciplined and sleazy. They do not listen to their parents anymore. The end of the world is near."
Ur, Chaldea, 2,000 BC

Is ancient DNA research infantile?
It’s a zebra (Higuchi et al. 1984, Nature)

However... Watson and Crick 1953 was also a short Nature paper

Simple stories are not always bad
There are lies for children and lies for adults (Terry Pratchett)

Some more reflections
• Not all investigations deserve equal respect.
• Observations alone do not always make sense.
• What do we really learn from genomic data?
• How to win a Nobel prize? I don’t know.

From no data to drowning in data

The latest fancy piece of kit
• ~200 Gb total sequence
• ~1 billion individual reads

The latest increase in sequencing throughput

Mammoth

Palaeo-Eskimos

Neanderthals

Data first?

So what have we learned from ancient genomes?
• Mammoth genome draft: Hm...
• Saqqaq genome: migrated from Arctic north-east Asia 5,500 B.P.
• Neanderthal genome draft: diverged from modern humans ~0.4 mya; maybe gene flow into the modern human gene pool; genetic regions were selected on the human lineage

And what did they cost?
• Mammoth genome draft: ~$800,000
• Saqqaq genome: $500,000
• Neanderthal genome draft: $6.4 million

The disadvantages
• Shotgun sequencing
• Made for a maximum of 8 samples
• Costs: $20,000 per run

Another problem: percentage of endogenous DNA
• Neanderthal: 4.0%
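The consequence of a low endogenous fraction can be put into numbers. The sketch below is only a rough illustration: it assumes that the 4.0% figure applies uniformly to all reads of a single run, and it combines that figure with the run cost ($20,000) and read count (~1 billion) quoted on other slides, which may not refer to the same instrument. The function name `endogenous_yield` is made up for this example.

```python
def endogenous_yield(total_reads, endogenous_fraction, run_cost):
    """Return (endogenous reads, cost per million endogenous reads).

    Assumption: every read is equally likely to be endogenous, so the
    endogenous fraction simply scales the read count.
    """
    endogenous_reads = total_reads * endogenous_fraction
    cost_per_million = run_cost / (endogenous_reads / 1e6)
    return endogenous_reads, cost_per_million


# Figures taken from the slides; pairing them in one run is an assumption.
reads, cost = endogenous_yield(
    total_reads=1_000_000_000,   # ~1 billion individual reads
    endogenous_fraction=0.04,    # 4.0% endogenous DNA
    run_cost=20_000,             # $20,000 per run
)
print(f"{reads:,.0f} endogenous reads, ${cost:,.2f} per million endogenous reads")
```

Under these assumptions only about 40 million of a billion shotgun reads come from the target organism, so each endogenous read costs 25 times more than a raw read.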

Are more data better data?

Figure: NJ tree of 123 sequences (250 bp control region) with outgroups. Labelled groups: “spelaeus”, “eremus”, “ladinicus”, “rossicus”, “ingressus”, “kudarensis”.

Figure: condensed NJ tree (50% bootstrap cutoff) of 123 sequences (250 bp D-loop) with outgroups. Labelled groups: “spelaeus”, “eremus”, “ladinicus”, “rossicus”, “ingressus”, “kudarensis”. Highlighted value: 51!

Different PCR types

Figure: combined NJ, ML and Bayesian tree based on 9,632 bp of 2 published and 31 additional cave bear specimens. Labelled clades: Ursus spelaeus, Ursus ingressus, Ursus kudarensis; Ursus arctos (EU497665) is also included. Node support is given as three values per node (for example 100-100-1.0). Specimens come from caves in Germany, Spain, France, Austria, Slovenia, Romania, Slovakia, Russia and Armenia.

Results of DMPS (direct multiplex sequencing)
• Between 13.0 and 16.5 kb of replicated sequence for each of the 31 individuals
• ~1.0 Mb of targeted aDNA sequence data

Requirements for PCR
Figure: PCR target flanked by primer F and primer R.
• Primer F: min 20 bp
• PCR target: min 30 bp
• Primer R: min 20 bp
• Min molecule length: 70 bp
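The 70 bp minimum is the sum of the two primer sites and the target between them. A minimal sketch of that arithmetic follows; it assumes the slide's three minimum values map onto forward primer, target and reverse primer in that order (20, 30 and 20 bp), and the function names are invented for illustration.

```python
MIN_PRIMER_F = 20  # bp, forward primer binding site (value from the slide)
MIN_TARGET = 30    # bp, informative sequence between the primers (from the slide)
MIN_PRIMER_R = 20  # bp, reverse primer binding site (value from the slide)


def min_template_length(primer_f=MIN_PRIMER_F, target=MIN_TARGET, primer_r=MIN_PRIMER_R):
    """Shortest intact molecule that can carry both primer sites and the target."""
    return primer_f + target + primer_r


def amplifiable(fragment_length, required=None):
    """True if a fragment is long enough to serve as a PCR template."""
    if required is None:
        required = min_template_length()
    return fragment_length >= required


# Fragment lengths marked on the fragment-length slide: 30, 50 and 70 bp.
for length in (30, 50, 70):
    print(f"{length} bp fragment amplifiable: {amplifiable(length)}")
```

Of the three fragment lengths marked on the following slide, only the 70 bp fragment meets the requirement, which is why PCR misses most of the short molecules in an ancient DNA extract.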

Fragment length in ancient DNA
Figure: frequency plotted against fragment length in bp (axis marks at 30, 50 and 70 bp).
• ½ fragment size = 2 to 100x the number of molecules
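The 2 to 100x range can be reproduced with a simple model. The sketch below assumes, purely for illustration, that the number of fragments of at least a given length falls off exponentially with length; the slide does not state which distribution was used, and the decay rates tried here are hypothetical values chosen to span the quoted range.

```python
import math


def fold_increase(min_length, decay_per_bp):
    """Fold increase in usable molecules when the required length is halved.

    Assumption: the number of fragments with length >= L is proportional
    to exp(-decay_per_bp * L). The ratio of counts at L/2 and at L is then
    exp(decay_per_bp * L / 2).
    """
    return math.exp(decay_per_bp * min_length / 2)


# Halving the 70 bp PCR requirement to 35 bp under three hypothetical decay rates.
for decay in (0.02, 0.05, 0.13):
    ratio = fold_increase(70, decay)
    print(f"decay {decay}/bp: {ratio:.1f}x more molecules at >=35 bp than at >=70 bp")
```

Under this assumed model the gain depends strongly on how degraded the sample is: mildly fragmented extracts gain about twofold, while heavily fragmented extracts gain close to a hundredfold.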

DNA hybridization capture

DNA hybridization capture
• ~5 Mb targeted per array
• 7 arrays, whole exome
• ~98% of exons retrieved
• 300,000 primer pairs for aDNA
• 6,000 LR-PCRs for modern DNA
Figure labels: probes, glass slide.

Ancient DNA capture Science 2010

Figure (shown again): NJ tree of 123 sequences (250 bp control region) with outgroups. Labelled groups: “spelaeus”, “eremus”, “ladinicus”, “rossicus”, “ingressus”, “kudarensis”.

The costs
• Capture array, up to 1 million features: £350 each
• SureSelect, 10 rxns, 200 kb – 6.6 Mb: £6,638
• SureSelect, 100 rxns, 200 kb: £30,777
• SureSelect, 1,000 rxns, 200 kb: £107,719
=> Home-made solutions
=> Multiplexing
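The case for multiplexing follows from the per-reaction price. The sketch below divides the kit prices quoted on the slide by the number of reactions, and then by a pool size; the pool size of 8 barcoded samples per reaction is a hypothetical value chosen for illustration, not a figure from the talk, and the function name is invented.

```python
# Kit prices quoted on the slide, in GBP: reactions per kit -> kit price.
SURESELECT_KITS = {10: 6_638, 100: 30_777, 1_000: 107_719}


def cost_per_sample(kit_price, reactions, samples_per_reaction=1):
    """Price per sample when several barcoded samples share one capture reaction."""
    return kit_price / (reactions * samples_per_reaction)


HYPOTHETICAL_POOL_SIZE = 8  # assumption: 8 barcoded samples pooled per reaction

for reactions, price in SURESELECT_KITS.items():
    single = cost_per_sample(price, reactions)
    pooled = cost_per_sample(price, reactions, HYPOTHETICAL_POOL_SIZE)
    print(f"{reactions:>5} rxns: £{single:,.2f} per sample alone, "
          f"£{pooled:,.2f} per sample when pooled")
```

Even the largest kit costs more than £100 per reaction when each sample is captured alone, so pooling barcoded samples is the main lever for bringing the per-sample price down.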

Barcoding

So... how does it work?

Sometimes well

Long range versus capture

And sometimes not so well

Jumping artefacts
Figure labels: Clade 1, Clade 2, Clade 3.

Possible capture methodologies

Methodology                         Results             Problems
SureSelect                          no experience yet   high costs
Array capture                       mammoth mtDNA       jumping artefacts
PEC                                 mammoth nuDNA       limited sensitivity; high costs (Dynal beads)
In solution: 454, biotin adaptors   Castor mtDNA        length limited
In solution: 454, biotin UTP        Castor mtDNA        length limited
In solution: Illumina, biotin UTP   Castor mtDNA        length limited; jumping artefacts

Capture advantages
• High sequence yield per sample aliquot
• Time and work efficient
• Higher sensitivity than PCR

Capture disadvantages
• High costs
• Sometimes low on-target ratio
• Problems with multiplexing
• Generally jumping artefacts

Summary for capture
• In the long term there is little alternative if large amounts of data are required
• Some methods also have better sensitivity than PCR
• Multiplexing problems, especially for low-complexity data, need resolving
• Currently not suitable for routine applications
• Methodological development is required

Some final thoughts
• How should blank controls be done? And how many?
• What does contamination mean when you have 20 million sequence reads?
• How shall we replicate the data?
• Is independent replication possible? And is it necessary?

Thanks
• Many people
• Adrian Briggs, Harvard Medical School
• Kevin Campbell, University of Manitoba
• Research Group Molecular Ecology
• Sequencing group in Leipzig
• MPG, DFG and Volkswagen Foundation for money
• University of York
• For your attention
