1 / 13

The 1000 Genomes Project

The 1000 Genomes Project. Current 1April 2009. Overview. Making personal genomics possible A short history of the effects of disruptive sequencing technology From 1 to 1001: The goals and science of the 1000 Genomes Project Bioinformatics infrastructure requirements (hint: they are large)

idalia
Download Presentation

The 1000 Genomes Project

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The 1000 Genomes Project • Current 1April 2009

  2. Overview • Making personal genomics possible • A short history of the effects of disruptive sequencing technology • From 1 to 1001: • The goals and science of the 1000 Genomes Project • Bioinformatics infrastructure requirements (hint: they are large) • Project progress: Where are we today? • Collecting information and making it accessible to researchers • The 1000 Genomes DCC • EBI, European Read Archive, Ensembl and the public domain

  3. Evolution of Large-scale Genome Analysis • 2000: The year of the human genome • Working drafts completed • All data freely released (eventually) • Project took about 10 years and cost about $3 billion • 2007: The year of the personal genome • Craig Venter; James Watson • 2008: 1000 Genomes project launched • First Solexa genomes published Yanhuang Project

  4. 2009 • Major genome centers produce the same number of base pairs every 6 hours at a cost approaching $50,000 • Dense genotyping assays are more than two orders of magnitude less expensive compared to the HapMap project • Making the 1000 Genomes Project and massive case-control studies of human variation possible • 1000 Genomes project is an international collaboration funded by the Wellcome Trust, the National Human Genome Research Institute, the Chinese Academy of Science, the German Science Federation and several corporate partners • Details at www.1000genomes.org

  5. 1000 Genomes Project: Primary goals • Overall: Create a deep catalogue of human variation to provide a better baseline to underpin human genetics • Discover shared variation (shared = not private to individual) and characterise by allele frequency • Aim for effectively all (not just a lot of) common variation • Structural variants as well as SNPs • Accessible because the project will used paired-end sequencing reads • Deeper discovery in gene regions, down to 0.5% to 0.1% MAF • Place variants in their haplotype context • Call and phase the variants on the sampled individuals • This will support tagging by genotyping and imputation based approaches • Synergistic with large-scale genotyping projects and whole genome association studies

  6. 1000 Genomes Project Details • Three pilots during 2008 with analysis currently ongoing • Pilot 1: 3x60 samples at 2x (6Gb) per person: • European CEU, African YRI, East Asian CHB/JPT • Pilot 2: CEU and YRI trios at 20x • Pilot 3: 1000 genes in 1000 people • Multiple platforms/protocols • Develop and evaluate methods for data collection and analysis • Two year main project • Population components now be finalised • All data has appropriate consent for full and unrestricted public release • This is a tremendous amount of raw data (between 500 terabytes and 1 petabyte for the main project)

  7. Bioinformatics Requirements for 1000 Genomes • Basic Requirements • File formats • SRF: New technology sequencing does not produce the same type of raw data as Sanger-style sequencing • SAM/BAM: Alignment formats must be efficient if one is mapping half a trillion reads • Initial analysis tools • Most short read aligners incorporate the quality scores into the mapping • Advanced Requirements • Genome likelihood format for representing an individual genome with appropriate uncertainty • Advanced analysis tools (mostly under development) • SNP calling that are trio aware and population based • Assembly & Search

  8. 1000 Genomes data production • Over 150 terabytes of data deposited so far • Approximately 6,000,000,000,000 bp (6 terabases) have already been submitted • Perspective: During September and October the 1000 Genomes project produced the equivalent of EMBL/GenBank every week • Approaching 2000x total coverage • Raw and processed data is freely available • ftp://ftp.1000genomes.ebi.ac.uk • ftp://ftp-trace.ncbi.nih.gov/1000genomes/ • Data also available from the EBI’s European Read Archive and the NCBI’s Short Read Archive • Millions of new SNPs submitted to dbSNP

  9. 1000 Genomes Data Availability • Three data releases • Late December (trio SNPs) • Early February (first set of low coverage SNPs) • Late February (trio alignments) • 1000 Genomes Browser • Based on Ensembl • Visualization and portal for project data • Complement to 1000genomes.org website • Deep coverage trios plus other individuals will be available from main Ensembl browser this spring

  10. 1000 Genomes Browser

  11. Data Transfer Infrastructure • Data transfer requirements are enormous • FTP does not work well for terabytes of data • “Old fashioned” solutions • Copy the data onto a hard drive and mail the hard drive around the world • (Significant personnel costs) • Infrastructure solutions • Create/buy dedicated lines for point to point transfer or direct connection to faster points on the backbone • Expensive to do collaborative analysis, but will probably be part of the solution • Advanced technology solutions • Asperasoft • Uses udp to transfer files to avoid tcp • Can quickly saturate connections

  12. Conclusions • The 1000 Genomes Project will construct a human polymorphism reference resource • True baseline information for human genetics • New sequencing technologies make this possible, but high accuracy requires high coverage • A picture of population variation appears achievable with low coverage sequencing of many individuals • Good down to 1% • Poor for private/rare variants • Human disease studies will be major beneficiaries • The bioinformatics challenges are large and nearly everywhere in the project

  13. Acknowledgements • Vertebrate Genomics Team • Mario Caccamo, Ilkka Lappalainen, Jonathan Hinton, Vasudev Kumanduri (European Genotype Archive) • Zam Iqbal, Laura Clarke (1000 Genomes DCC); Fiona Cunningham, Yuan Chen, Will McLaren(Ensembl Variation) • Nathan Johnson, Steven Wilder, Damian Keefe (Ensembl Functional Genomics) • Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Albert Vilella, Leo Gordon, Benoit Ballester (Ensembl Comparative Genomics) • NCBI portion of the 1000 Genomes DCC • Steve Sherry, Martin Shumway, Justin Paschall, Hoda Khouri • Ensembl Systems, Web and Genebuild Teams • EMBL Sequence Archive / European Trace Archive Team • EBI’s The Protein and Nucleotide Database Group • EBI Systems Group • 1000 Genomes Project • Funding: Wellcome Trust, EMBL

More Related