The Data Tsunami in Biomedical Research

The Data Tsunami in Biomedical Research. Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University June 5th, 2013. Next-generation sequencing (NGS). Stein, Genome Biol. 2010. Falling cost of sequencing.

Presentation Transcript


  1. The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University June 5th, 2013

  2. Next-generation sequencing (NGS) Stein, Genome Biol. 2010

  3. Falling cost of sequencing DeWitt, Nat. Biotechnol. 2012

  4. Sequencing human genomes • 2001: The Human Genome, ~3 billion $ • 2011: 1000 Genomes Project, ~10 000 $ • 2013 (?): Your Genome, 100 - 1000 $

  5. Outline • Overview of Next-Generation Sequencing (NGS) • Applications • Challenges • Solutions

  6. Sequencing Revolution • Sanger sequencing: 100s of reactions… 10000s of base pairs… • Next-Generation sequencing: Millions of reactions! Billions of base pairs! http://www.brusselsgenetics.be Metzker, Nat. Rev. Genet. 2010

  7. High-throughput Sequencing • 2009: 36 bp X 20M X 8 lanes = 6 Gbases • 2013: 2 X 150 bp X 250M X 8 lanes = 600 Gbases • 200 Human Genomes in 1 run!!!
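
The yield figures on this slide follow from multiplying read length, reads per lane and lane count. The following is a minimal sketch of that arithmetic in Python. The function name and the 3 Gbase genome size (the approximate human genome length) are assumptions made for illustration, and the result counts genome equivalents at 1X coverage, not analysis-ready genomes.

```python
# Approximate human genome length in bases (assumption for illustration).
HUMAN_GENOME_BASES = 3_000_000_000


def run_yield_bases(read_length_bp, reads_per_lane, lanes, reads_per_fragment=1):
    """Total bases produced by one sequencing run."""
    return read_length_bp * reads_per_fragment * reads_per_lane * lanes


# 2009: single-end 36 bp reads, 20M reads per lane, 8 lanes.
yield_2009 = run_yield_bases(36, 20_000_000, 8)
# 2013: paired-end 2 x 150 bp reads, 250M read pairs per lane, 8 lanes.
yield_2013 = run_yield_bases(150, 250_000_000, 8, reads_per_fragment=2)

genome_equivalents = yield_2013 // HUMAN_GENOME_BASES

print(f"2009: {yield_2009 / 1e9:.2f} Gbases")
print(f"2013: {yield_2013 / 1e9:.0f} Gbases")
print(f"Genome equivalents at 1X in 2013: {genome_equivalents}")
```

The 2009 product is 5.76 Gbases, which the slide rounds to 6 Gbases.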

  8. NGS Technology Comparison

  9. Genome Canada • > $915M investment and > $900M in co-funding • 100s of large-scale genomics projects • 5 Innovation centers

  10. Outline • Overview of Next-Generation Sequencing (NGS) • Applications • Challenges • Solutions

  11. Applications (I) • De novo sequencing • From the human genome… To all model organisms… To all relevant organisms (e.g. extreme genomes)… To “all” organisms?

  12. Human Genome • 3 billion DNA base pairs (bp) • Two human genomes are ~99.9% identical • There are ~3M bp differences between you and me • Some of these differences explain variation in: • Disease susceptibility • Drug metabolism • … www.dnacenter.com
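
The ~3M figure follows directly from the two numbers above it. A small sketch of that calculation, using exact fractions to avoid floating-point rounding; the variable names are illustrative only.

```python
from fractions import Fraction

GENOME_SIZE_BP = 3_000_000_000      # ~3 billion base pairs, from the slide
IDENTITY = Fraction(999, 1000)      # ~99.9% identical, from the slide

# Bases expected to differ between two genomes under these round numbers.
differing_bp = int(GENOME_SIZE_BP * (1 - IDENTITY))

print(f"Expected differing positions: {differing_bp:,}")
```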

  13. Applications (II) • Genome re-sequencing • Genetic disorders • Cancer genome sequencing • Map genomic structural variations across individuals • Genealogy and migration • Agricultural crops • … The Cancer Genome Atlas 1000 Genomes Project

  14. Exome sequencing for Mendelian disease “… about one-half to one-third (~3,000) of all known or suspected Mendelian disorders (for example, cystic fibrosis and sickle cell anaemia) have been discovered. However, there is a substantial gap in our knowledge about the genes that cause many rare Mendelian phenotypes.” “Accordingly, we can realistically look towards a future in which the genetic basis of all Mendelian traits is known, …”

  15. Exome sequencing

  16. Cancer genome sequencing Can obtain a full catalogue of mutations

  17. Michael Stromberg, bioinformatics.ca

  18. Mutations in paediatric glioblastoma Jabado, Pfister and Majewski

  19. Mutations in paediatric glioblastoma • Sequenced the exomes of 48 paediatric GBM samples, found: • Somatic mutations in the H3.3-ATRX-DAXX chromatin remodelling pathway in 44% of tumours • Recurrent mutations in H3F3A, which encodes the replication-independent histone 3 variant H3.3, in 31% of tumours

  20. Applications (III) • Quantitative biology of complex systems • New high-throughput technologies in functional genomics: ChIP-Seq, RNA-Seq, ChIA-PET, RIP-Seq, … • From single-gene measurements, to thousands of probes on arrays, to profiles covering all 3B bases of the genome • Important systems: Stem cells, Cancer, Infectious diseases…

  21. Outline • Overview of Next-Generation Sequencing (NGS) • Applications • Challenges • Solutions

  22. High-throughput Sequencing • 2009: 36 bp X 20M X 8 lanes = 6 Gbases • 2013: 2 X 150 bp X 250M X 8 lanes = 600 Gbases • 200 Human Genomes in 1 run!!!

  23. Big Data 2013 2 X 10 TBytes 1 TBytes Intensity files Reads + qualities 70 TBytes Image files

  24. Big Data From: Alexandre Montpetit Subject: news from Illumina Date: 4 June, 2013 2:15:16 PM EDT To: Guillaume Bourque From Mark Van Oene (Illumina VP of sales): over the next year we should expect 2x more reads in half the time (and reads 2x longer). Does that cause a problem? Alex 2013 2 X 10 TBytes 1 TBytes Intensity files Reads + qualities 240 TBytes 12 TBytes 25 TB of raw data / month 300 TB of raw data / year

  25. Large NGS project Cancer project with whole genome data: 500 tumors vs 500 matched-normal • Tumors: 500 X 3 lanes = 500 X 250 GB = 125 TB raw • Matched-normal: 500 X 3 lanes = 500 X 250 GB = 125 TB raw
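
The storage estimate for such a project can be reproduced from the per-sample figure on the slide. A minimal sketch, assuming decimal units (1 TB = 1000 GB) and treating the combined total across both arms as a derived number that the slide itself does not state:

```python
SAMPLES_PER_ARM = 500   # 500 tumors and 500 matched-normal samples
GB_PER_SAMPLE = 250     # 3 lanes of whole genome data per sample, from the slide
GB_PER_TB = 1000        # decimal units (assumption)

arms = ("tumor", "matched-normal")

# Raw storage needed for each arm of the study, in TB.
raw_tb_per_arm = {arm: SAMPLES_PER_ARM * GB_PER_SAMPLE / GB_PER_TB for arm in arms}
total_raw_tb = sum(raw_tb_per_arm.values())

for arm, tb in raw_tb_per_arm.items():
    print(f"{arm}: {tb:.0f} TB raw")
print(f"project total: {total_raw_tb:.0f} TB raw")
```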

  26. DNA bases sequenced at the Innovation Center 72 trillion DNA bases! Or 800 genomes at 30X 12 HiSeqs
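
The conversion from total bases to genomes at 30X coverage can be checked as follows. This is a sketch only; the 3 Gbase genome size is the approximate human genome length and is an assumption here.

```python
TOTAL_BASES = 72_000_000_000_000   # 72 trillion bases, from the slide
GENOME_SIZE_BP = 3_000_000_000     # approximate human genome length (assumption)
COVERAGE = 30                      # 30X: each base is read 30 times on average

# Bases needed for one genome at the target coverage.
bases_per_genome = GENOME_SIZE_BP * COVERAGE
genomes_at_30x = TOTAL_BASES // bases_per_genome

print(f"Bases per 30X genome: {bases_per_genome:,}")
print(f"Genomes at 30X: {genomes_at_30x}")
```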

  27. adventure.nationalgeographic.com

  28. Biomedical research is built on data integration Your data

  29. Biomedical research is built on data integration 100X Your data

  30. Challenges • NGS instruments generate TBs of data • NGS instruments are getting faster and cheaper, and will increasingly be found in small research labs and hospitals • Data sharing and integration are critical in biomedical research • Sequencing data is sensitive, private and identifiable

  31. Outline • Overview of Next-Generation Sequencing (NGS) • Applications • Challenges • Solutions

  32. Nanuq software • Has tracked data and meta-data for more than: • 2.6 million sample aliquots, • 20,500 reagents, • 17,000 plates, • 140,000 tubes, • Multiple platforms, technologies and workflows (sequencing, genotyping, microarray, etc.) • 3,900 external users

  33. Standardized analysis pipelines • RNA-Seq: Analysis report • ChIP-Seq: Analysis report • Methylation: Analysis report • …

  34. Data center at the Innovation Center > 1200 cores > 2 PB disk > 5 PB tape

  35. Need more! • UdeS Mammouth – 39168 cores • McGill Guillimin – 16000 cores

  36. Data processing issues • We have many different projects, all needing space and processing. • We want to use the Compute Canada clusters for scalability but also to facilitate data distribution (we have >800 users). • This brings uniformity problems: • Different setups (hardware and software) • Different configurations • Etc.

  37. Our strategy • We wrote analysis pipelines to be easily configurable across clusters. • Same code, one ini file to customize (we already have templates for 3 cluster sites) • We install Linux modules readable by all on all these clusters so we know exactly what is available everywhere • We also deploy common genomes across sites.
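
The "same code, one ini file" idea can be sketched with Python's standard configparser. This is not the pipeline code described in the talk: the section names, keys, module names, paths and the `run_align` command are all hypothetical, chosen only to show how one function can serve several sites when everything site-specific lives in the ini file.

```python
import configparser

# Hypothetical per-site ini files: every name and path below is illustrative.
SITE_A_INI = """
[DEFAULT]
genome_dir = /hypothetical/site_a/genomes

[align]
module = hypothetical/aligner-a
threads = 12
"""

SITE_B_INI = """
[DEFAULT]
genome_dir = /hypothetical/site_b/shared/genomes

[align]
module = hypothetical/aligner-b
threads = 8
"""


def load_site_config(ini_text):
    """Parse one site's ini text into a ConfigParser object."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser


def build_step_command(config, step, sample):
    """Build the shell command for one pipeline step; identical code for every site."""
    section = config[step]
    return (
        f"module load {section['module']} && "
        f"run_{step} --threads {section.getint('threads')} "
        f"--genome {section['genome_dir']} {sample}"
    )


command_a = build_step_command(load_site_config(SITE_A_INI), "align", "sample1")
command_b = build_step_command(load_site_config(SITE_B_INI), "align", "sample1")
print(command_a)
print(command_b)
```

Keys placed under `[DEFAULT]` are visible from every section, so shared settings such as the genome directory are written once per site.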

  38. Usage on Compute Canada

  39. Canadian Epigenetics, Environment and Health Research Consortium (CEEHRC) $1.5M (2012-2017)

  40. PORTal for the Analysis of Genetics and Genomics Experiments (PORTAGGE)

  41. Conclusions • NGS offers a variety of technologies and numerous exciting applications • Many areas of NGS data analysis are still under active development (e.g. RNA-Seq) • A major challenge is to ensure sufficient compute and storage capacity so that more advanced analyses are not limited • Need to work together to avoid duplication of effort in installing tools, but also to develop efficient ways to use HPC in biomedical research

  42. Acknowledgements IT team Terrance Mcquilkin Marc-André Labonté Genevieve Dancausse Andras Frankel Alexandru Guja Analysis team Louis Letourneau Mathieu Bourgey Maxime Caron Gary Lévesque Robert Eveleigh Francois Lefebvre Johanna Sandoval Pascale Marquis Development team Nathalie Émond David Bujold Francois Cantin Catherine Côté Burak Demirtas Daniel Guertin Louis Dumond Joseph Francois Korbuly Marc Michaud Thuong Ngo EDCC team David Morais (UdeS) Carol Gauthier (UdeS) Bryan Caron (McGill) Alain Veilleux (UdeS) ME Rousseau (McGill) guil.bourque@mcgill.ca

  43. Questions?
