
Assembling the sheep genome via KAREN

John McEwan (AgResearch Invermay) on behalf of the International Sheep Genomics Consortium.


Presentation Transcript


  1. Assembling the sheep genome via KAREN John McEwan (AgResearch Invermay) on behalf of the International Sheep Genomics Consortium

  2. Why? • We want to improve genetic gain in sheep • Can use whole genome selection • Need a high density SNP chip • Need a genome sequence and SNPs • Need to sequence and assemble sheep • Use new sequencing technology • Job too big and expensive for one group • International Consortia developed

  3. Whole genome selection • Major scientific advance • Genome sequencing & SNP chips = “genome wide selection” • As accurate as progeny testing, but can be done at birth • Suitable for sex-limited traits, difficult-to-measure traits, or traits measured late in life • Dairy cattle: increases genetic gain 50-100% while decreasing progeny testing costs • Application in sheep is still being explored but has great advantages • Numerous other species and uses for sequence

  4. What is a SNP and what is a SNP chip? • SNP = single nucleotide polymorphism • SNP chip = tests 60,000 to 1,000,000 SNPs • WGS (whole genome selection) works by using the chip SNPs to predict the status of other variants nearby, including variants that affect production traits

  5. ISGC – division of labor • 6 sites • AgR hosts core database • tasks divided to best utilise skills and resources • historically, data was exchanged on DVDs • versioning • KAREN for transfer of data • make available to world

  6. Roche 454 FLX Skim Sequencing Strategy

  7. dB Summary

  8. Data • ~90 “runs” on 454 • Per run • Sequence ~130Mb • Processed data ~800Mb (quality ….) • Raw data 33Mb x 412 images = 13.6Gb • Total 1224 +72 = 1296Gb • Actually more as used another technology • This gives a false impression…..
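
The totals on this slide can be re-derived from the per-run figures. The sketch below is only a check of that arithmetic; it assumes 1 Gb = 1000 Mb and exactly 90 runs (the slide says "~90"), and the constant names are made up for illustration.

```python
# Re-derive the slide's storage totals from its per-run figures.
# Assumptions (not stated on the slide): 1 Gb = 1000 Mb, exactly 90 runs.
RUNS = 90
RAW_MB_PER_IMAGE = 33
IMAGES_PER_RUN = 412
PROCESSED_MB_PER_RUN = 800

raw_gb_per_run = RAW_MB_PER_IMAGE * IMAGES_PER_RUN / 1000   # about 13.6 Gb
raw_total_gb = RUNS * raw_gb_per_run                        # about 1224 Gb
processed_total_gb = RUNS * PROCESSED_MB_PER_RUN / 1000     # 72 Gb
total_gb = raw_total_gb + processed_total_gb                # about 1296 Gb

print(f"raw per run: {raw_gb_per_run:.1f} Gb")
print(f"raw total: {raw_total_gb:.0f} Gb, processed total: {processed_total_gb:.0f} Gb")
print(f"grand total: {total_gb:.0f} Gb")
```

The raw images dominate: the processed output adds only 72 Gb to roughly 1224 Gb of raw data.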

  9. Repeat masking • Created own repeat database • Almost all entries are slight variants of existing repeats • Only masks ~2% more bases (40% total) • Speeds mapping to bovine genome • Takes about 4-5 days on ~120 CPUs • File size ~10Gbp… • Versioning important…

  10. BLAST results • Map to cattle • 46% uniquely • 4.4% unambiguously • Issues: options, time taken, size of output • Weeks of processing time…..

  11. Newbler assembly, e.g. 1Mbp region • Contigs plus singletons: number 2365; number of bases 1,032,839; average size 437 bp; coverage 51.6% unadjusted, 52.7% adjusted • This could only be done in one location. Fast (several days). However, alternatives needed to be explored… • Results needed to be transferred (~3Gbp)

  12. Meld Process

  13. MELD • Contigs • Ordered • Orientated • Use BT4+ • Contigs ~480bp

  14. OA_ver.1.0 coverage Assembly 3.158 Gbp with 1.242 Gbp non N

  15. Copy number variants • QC process • Need sanity checks that assembly is correct • CNVs • Regions >1000bp present a variable number of times in the genome • Often duplicated by unequal recombination • Can confuse SNP detection and are a very frequent source of assembly errors • Detection • Use “adjusted” depth of ovine 454 reads mapped to BT4 genome • At each base pair, count depth for each animal and average • Done using a 50kb window with 1kb increments • Results • Average depth per animal ~0.45X • 1-3 CNV regions detected per chromosome • Appear to be true CNVs
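
The windowed depth scan described on this slide can be sketched as follows. This is a minimal illustration, not the consortium's pipeline: the function names, the `fold` threshold, and the plain (unadjusted) per-base depth input are all assumptions, and the slide's depth "adjustment" is not reproduced.

```python
def windowed_mean_depth(depth, window=50_000, step=1_000):
    """Mean read depth in sliding windows over a per-base depth list.

    Returns a list of (window_start, mean_depth) tuples. The defaults mirror
    the slide's 50kb window and 1kb increment.
    """
    # Prefix sums make each window total an O(1) subtraction.
    prefix = [0]
    for d in depth:
        prefix.append(prefix[-1] + d)
    results = []
    for start in range(0, len(depth) - window + 1, step):
        total = prefix[start + window] - prefix[start]
        results.append((start, total / window))
    return results


def flag_candidate_windows(windows, expected_depth, fold=2.0):
    """Return starts of windows whose mean depth is >= fold * expected_depth.

    The fold threshold is a hypothetical choice; the slide does not give one.
    """
    return [start for start, mean in windows if mean >= fold * expected_depth]


# Toy example with a tiny window so the behaviour is easy to follow.
toy_depth = [1] * 10 + [3] * 10 + [1] * 10
toy_windows = windowed_mean_depth(toy_depth, window=5, step=5)
toy_hits = flag_candidate_windows(toy_windows, expected_depth=1.0)
print(toy_hits)
```

With real data the expected depth would be the genome-wide average (about 0.45X per animal on this slide), and adjacent flagged windows would be merged into candidate CNV regions.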

  16. Chromosome 1: as an example

  17. Example putative CNV: BTA1:149Mbp

  18. BTA1:149Mbp Gbrowse view

  19. SNP detection • SNP Detection Criteria • Stacking: reads collapsed where identical (same animal, plate, bp) • Depth: >3 (35% of sequence) and <9 reads deep • MAF: minor allele present in at least 2 reads • SNP Class: • A: 2 or more animals present for both alleles • B: 2 or more animals present for at least 1 allele • C: both alleles seen in one animal • SNP quality: a read is discarded if it has: • other variants within 10bp either side • homopolymeric runs (n>4) within 5bp • indels within 10bp
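
One way to read the depth, minor-allele, and class criteria above is sketched below. It is an interpretation for illustration only: `classify_snp` and its input format are hypothetical, reads are assumed to be already collapsed ("stacked") and quality-filtered, and class C is taken to mean that neither allele was seen in two or more animals.

```python
from collections import defaultdict


def classify_snp(reads, min_depth=4, max_depth=8, min_minor_reads=2):
    """Classify one candidate biallelic site as 'A', 'B', 'C', or None.

    reads: list of (animal_id, allele) tuples, one per collapsed read.
    Depth must be >3 and <9 (i.e. 4 to 8 reads), and the minor allele must
    appear in at least two reads, following the slide's criteria.
    """
    depth = len(reads)
    if not (min_depth <= depth <= max_depth):
        return None

    read_counts = defaultdict(int)   # allele -> number of reads
    animals = defaultdict(set)       # allele -> animals showing it
    for animal, allele in reads:
        read_counts[allele] += 1
        animals[allele].add(animal)

    if len(read_counts) != 2:        # not a simple two-allele site
        return None
    if min(read_counts.values()) < min_minor_reads:
        return None

    # Count alleles that were seen in two or more animals.
    multi_animal_alleles = sum(1 for seen in animals.values() if len(seen) >= 2)
    return {2: "A", 1: "B", 0: "C"}[multi_animal_alleles]


class_a = classify_snp([("s1", "A"), ("s2", "A"), ("s1", "G"), ("s3", "G")])
class_b = classify_snp([("s1", "A"), ("s2", "A"), ("s1", "G"), ("s1", "G")])
class_c = classify_snp([("s1", "A"), ("s1", "A"), ("s1", "G"), ("s1", "G")])
too_shallow = classify_snp([("s1", "A"), ("s2", "A"), ("s1", "G")])
print(class_a, class_b, class_c, too_shallow)
```

The per-read quality rules (nearby variants, homopolymer runs, indels) would be applied before this step and are not modelled here.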

  20. How biased is the sampling?

  21. Interim SNP Results • 4 unique reads needed to make a call • A = both alleles seen in 2 animals • B = 1 allele seen in 2 animals • C = both alleles seen in one animal • 2 = Infinium 2 SNPs: 1 probe of 50bp, no G/C or A/T SNPs • 1 = Infinium 1 SNPs: 2 probes of 50bp • ~69% pass design (0.8 threshold) • ~200K SNPs or ~3 SNPs/50kb • As expected, but rather low • ~5 SNPs/50kb would be better
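
The "~3 SNPs/50kb" figure is consistent with spreading ~200K SNPs over the assembly length. The sketch below assumes the 3.158 Gbp OA_ver.1.0 assembly length from slide 14 as the denominator, which the slide itself does not state; the variable names are illustrative.

```python
# Rough SNP density check.
# Assumption: genome length taken as the 3.158 Gbp assembly from slide 14.
GENOME_BP = 3_158_000_000
WINDOW_BP = 50_000
SNP_COUNT = 200_000

n_windows = GENOME_BP / WINDOW_BP            # number of 50kb windows
snps_per_window = SNP_COUNT / n_windows      # about 3.2 SNPs per 50kb
snps_needed_for_5 = 5 * n_windows            # count needed for 5 per 50kb

print(f"{snps_per_window:.2f} SNPs per 50kb")
print(f"{snps_needed_for_5:,.0f} SNPs needed for 5 per 50kb")
```

Under that assumption, reaching ~5 SNPs/50kb would take roughly 316K usable SNPs.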

  22. Information resources • https://isgcdata.agresearch.nz • www.sheephapmap.org • BLAST & sequence download available • Up to date information on ISGC project aims and progress

  23. Genome Annotation • Visualize sequence and annotation • Widely used • Concept of “tracks” • Each track has significant processing requirements • Distribute tasks • Versioning again important • Significant data transfers • Can have more than 50 tracks

  24. SNP validation and selection • Validation • Selected 112 Class A 454 SNPs • Assay with Sequenom • Aim is >85% validation rate in IMF (end of July) • Achieved 81% • Select 60K SNPs for chip • Spacing algorithms used based on quality (est MAF, adjacent sequence) and position • Multiple runs • Target date Aug 22nd
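
The slide does not describe the spacing algorithms themselves. As a rough illustration of the idea of choosing SNPs by quality and position, the sketch below splits a chromosome into equal bins and keeps the best-scoring candidate in each. The function name, the single numeric `score`, and the binning approach are all assumptions, not the consortium's method.

```python
def select_spaced_snps(candidates, chrom_length, n_target):
    """Pick at most one SNP per equal-sized bin, preferring higher scores.

    candidates: list of (position, score) tuples, where score is a
    hypothetical quality value (for example derived from estimated MAF and
    adjacent sequence). Returns the chosen (position, score) tuples in
    position order. Bins with no candidates are left empty.
    """
    bin_size = chrom_length / n_target
    best = {}
    for pos, score in candidates:
        # Clamp so a SNP at the very end of the chromosome falls in the last bin.
        b = min(int(pos / bin_size), n_target - 1)
        if b not in best or score > best[b][1]:
            best[b] = (pos, score)
    return [best[b] for b in sorted(best)]


toy_candidates = [(10, 0.5), (20, 0.9), (150, 0.4), (290, 0.7), (295, 0.6)]
toy_selection = select_spaced_snps(toy_candidates, chrom_length=300, n_target=3)
print(toy_selection)
```

A real selection would likely run several passes, as the slide's "multiple runs" suggests, for example to fill empty bins from neighbouring regions.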

  25. Future • 60K chip Aug 22nd final date for SNPs • Available December 2008 • Assembly • 2nd Assembly ~Aug 2008 • BT4+sens blast+CAP3 assembler: expect 20% more sequence • 3rd Assembly ~Dec 2008 • + all_vs_all: expect 10% more seq • 4th Assembly ~June 2009 • As above but include ~4.5Gbp more seq inc paired end reads • Application 10X coverage with ~200bp paired end reads June 2009 • For each assembly • annotate with Gbrowse • detect SNPs

  26. Lessons learned • 8th year of international consortia (~4th) • Data volumes increasing rapidly (1000X) • Initial data transfer is not the major issue • Storage, transfer, annotation is ongoing • Processing, synchronisation, sharing resources • Generate more than 10X volume • Small numerous 0.1-5Gb transfers • Needs reliable transparent high volume data transfer • Still issues with firewalls • Currently using phone for humans… why (location)?

  27. Acknowledgements AgResearch NZ Baylor HGSC CSIRO Genesis Faraday John McEwan Richard Gibbs Brian Dalrymple Chris Warkup Gemma Payne George Weinstock James Kijas Nessa O’Sullivan Donna M. Muzny Ross Tellam Tracey van Stijn Michael E. Holder Wes Barris Theresa Wilson Lynne Nazareth Sean McWilliam Rudi Brauning Rebecca L. Thorton Abhirami Ratnakumar Alan McCulloch Christie Kovar David Townley Russell Smithies Benoit Auvray Roslin Institute sheepGENOMICS UNE/sheepGENOMICS Steve Bishop Terry Longhurst (MLA) Hutton Oddy Rob Forage University of Otago University of Sydney USDA Jo Stanton Frank Nicholas Tim Smith Chrissie Curt van Tassell Mark • Funding: Genesis Faraday, University of Sydney ISL Grant, and Ovita NZ

  28. Thanks KAREN and team
