The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University June 5th, 2013
Next-generation sequencing (NGS) Stein, Genome Biol. 2010
Falling cost of sequencing DeWitt, Nat. Biotechnol. 2012
Sequencing human genomes
• 2001: The Human Genome, ~ 3 Billion $
• 2011: 1000 Genomes Project, ~ 10 000 $
• 2013 (?): Your Genome, 100 - 1000 $
Outline • Overview of Next-Generation Sequencing (NGS) • Applications • Challenges • Solutions
Sequencing Revolution
• Sanger sequencing: 100s of reactions… 10000s of base pairs…
• Next-Generation sequencing: Millions of reactions! Billions of base pairs!
Sources: http://www.brusselsgenetics.be; Metzker, Nat. Rev. Genet. 2010
High-throughput Sequencing
• 2009: 36 bp X 20M reads X 8 lanes ≈ 6 Gbases
• 2013: 2 X 150 bp X 250M reads X 8 lanes = 600 Gbases
200 Human Genomes in 1 run!!!
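The per-run figures on this slide follow from multiplying read length, reads per lane and lane count. The sketch below reproduces that arithmetic; the function name and the 3 Gbase genome size constant are illustrative assumptions, and the "200 genomes" figure corresponds to 1X coverage of a 3 Gbase genome.

```python
# Minimal sketch of the run-yield arithmetic on the slide.
# run_yield_gbases and HUMAN_GENOME_GBASES are illustrative names, not from the talk.

HUMAN_GENOME_GBASES = 3.0  # ~3 billion base pairs, as stated later in the deck


def run_yield_gbases(read_length_bp, reads_per_lane, lanes, paired=False):
    """Return the bases produced by one run, in Gbases (1e9 bases)."""
    ends = 2 if paired else 1  # paired-end runs read each fragment from both ends
    return ends * read_length_bp * reads_per_lane * lanes / 1e9


yield_2009 = run_yield_gbases(36, 20_000_000, 8)                 # single-end 36 bp
yield_2013 = run_yield_gbases(150, 250_000_000, 8, paired=True)  # 2 X 150 bp

# Genome equivalents at 1X coverage (one pass over each base on average).
genomes_per_run = yield_2013 / HUMAN_GENOME_GBASES

print(f"2009: {yield_2009:.2f} Gbases per run")
print(f"2013: {yield_2013:.0f} Gbases per run")
print(f"2013: {genomes_per_run:.0f} human genomes per run at 1X")
```

At the 30X coverage used elsewhere in the deck, the same 600 Gbases would cover far fewer genomes, so the coverage assumption matters when reading the headline number.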
Genome Canada • > $915M investment and > $900M in co-funding • 100s Large-scale genomics projects • 5 Innovation centers
Outline • Overview of Next-Generation Sequencing (NGS) • Applications • Challenges • Solutions
Applications (I) • De novo sequencing • From the human genome… To all model organisms… To all relevant organisms (e.g. extreme genomes)… To “all” organisms?
Human Genome • 3 Billion DNA base pairs (bp) • Two human genomes are ~99.9% identical • There are about 3M bp differences between you and me • Some of these differences explain variation in: • Disease susceptibility • Drug metabolism • … www.dnacenter.com
Applications (II) • Genome re-sequencing • Genetic disorders • Cancer genome sequencing • Map genomic structural variations across individuals • Genealogy and migration • Agricultural crops • … The Cancer Genome Atlas 1000 Genomes Project
Exome sequencing for Mendelian disease “… about one-half to one-third (~3,000) of all known or suspected Mendelian disorders (for example, cystic fibrosis and sickle cell anaemia) have been discovered. However, there is a substantial gap in our knowledge about the genes that cause many rare Mendelian phenotypes.” “Accordingly, we can realistically look towards a future in which the genetic basis of all Mendelian traits is known, …”
Cancer genome sequencing Can obtain a full catalogue of mutations
Mutations in paediatric glioblastoma Jabado, Pfister and Majewski
Mutations in paediatric glioblastoma • Sequenced the exomes of 48 paediatric GBM samples, found: • Somatic mutations in the H3.3-ATRX-DAXX chromatin remodelling pathway in 44% of tumours • Recurrent mutations in H3F3A, which encodes the replication-independent histone 3 variant H3.3, in 31% of tumours
Applications (III) • Quantitative biology of complex systems • New high-throughput technologies in functional genomics: ChIP-Seq, RNA-Seq, ChIA-PET, RIP-Seq, … • From single-gene measurements, to thousands of probes on arrays, to profiles covering all 3B bases of the genome • Important systems: Stem cells, Cancer, Infectious diseases…
Outline • Overview of Next-Generation Sequencing (NGS) • Applications • Challenges • Solutions
High-throughput Sequencing
• 2009: 36 bp X 20M reads X 8 lanes ≈ 6 Gbases
• 2013: 2 X 150 bp X 250M reads X 8 lanes = 600 Gbases
200 Human Genomes in 1 run!!!
Big Data (2013)
• Image files: 70 TBytes
• Intensity files: 2 X 10 TBytes
• Reads + qualities: 1 TByte
Big Data
From: Alexandre Montpetit
Subject: news from Illumina
Date: 4 June, 2013 2:15:16 PM EDT
To: Guillaume Bourque
From Mark Van Oene (Illumina VP of sales): over the next year we should expect 2x more reads in half the time (and reads 2x longer). Does that cause a problem? Alex
2013: 2 X 10 TBytes intensity files, 1 TByte reads + qualities; 240 TBytes; 12 TBytes
25 TB of raw data / month, 300 TB of raw data / year
Large NGS project
Cancer project with whole genome data: 500 tumors vs 500 matched-normal
• 500 tumors: 500 X 3 lanes = 500 X 250 GB = 125 TB raw
• 500 matched-normal: 500 X 3 lanes = 500 X 250 GB = 125 TB raw
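The storage estimate for this project is a product of sample count and raw data per sample. A minimal sketch of that calculation follows, assuming decimal units (1 TB = 1000 GB); the helper name is hypothetical.

```python
# Sketch of the raw-storage estimate for the tumor / matched-normal project.
# project_raw_tb is an illustrative helper; 1 TB = 1000 GB is an assumption.

GB_PER_TB = 1000


def project_raw_tb(samples, gb_per_sample):
    """Raw data volume in TB for a cohort of identically sequenced samples."""
    return samples * gb_per_sample / GB_PER_TB


tumor_tb = project_raw_tb(500, 250)   # 500 tumors, 3 lanes each, 250 GB per sample
normal_tb = project_raw_tb(500, 250)  # 500 matched normals, same design
total_tb = tumor_tb + normal_tb

print(f"Tumors:  {tumor_tb:.0f} TB raw")
print(f"Normals: {normal_tb:.0f} TB raw")
print(f"Total:   {total_tb:.0f} TB raw")
```

The total counts raw data only; aligned files and analysis outputs would come on top of it.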
DNA bases sequenced at the Innovation Center: 72 trillion DNA bases, or 800 genomes at 30X, on 12 HiSeqs
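The "800 genomes at 30X" equivalence can be checked by dividing the total bases by the bases needed per genome. The sketch below assumes a 3 billion bp genome, as stated earlier in the deck; the function name is illustrative.

```python
# Sketch: convert a total base count into genome equivalents at a given coverage.
# genome_equivalents is an illustrative name; the 3 Gbp genome size is from the deck.

def genome_equivalents(total_bases, genome_size_bp=3_000_000_000, coverage=30):
    """Number of genomes that total_bases would cover at the given mean coverage."""
    return total_bases / (genome_size_bp * coverage)


total_bases = 72_000_000_000_000  # 72 trillion bases
genomes_at_30x = genome_equivalents(total_bases)

print(f"{genomes_at_30x:.0f} genomes at 30X")
```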
Biomedical research is built on data integration 100X Your data
Challenges • NGS instruments generate TBs of data • NGS instruments are getting faster, cheaper and will increasingly be found in small research labs and hospitals • Data sharing and integration is critical in biomedical research • Sequencing data represents sensitive private data and is identifiable
Outline • Overview of Next-Generation Sequencing (NGS) • Applications • Challenges • Solutions
Nanuq software
Has tracked data and meta-data for more than:
• 2.6 million sample aliquots
• 20,500 reagents
• 17,000 plates
• 140,000 tubes
• Multiple platforms, technologies and workflows (sequencing, genotyping, microarray, etc.)
• 3,900 external users
Standardized analysis pipelines … RNA-Seq Analysis report ChIP-Seq Analysis report Methylation Analysis report … … … …
Data center at the Innovation Center > 1200 cores > 2 PB disk > 5 PB tape
Need more!
• UdeS Mammouth: 39168 cores
• McGill Guillimin: 16000 cores
Data processing issues • We have many different projects, all needing storage space and processing. • We want to use the Compute Canada clusters for scalability but also to facilitate data distribution (we have >800 users). • This brings uniformity problems: • Different hardware and software setups • Different configurations • Etc.
Our strategy • We wrote analysis pipelines that are easily configurable across clusters. • Same code, one ini file to customize (we already have templates for 3 cluster sites) • We install Linux modules, readable by all, on all these clusters so we know exactly what is available everywhere • We also deploy common genomes across sites.
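The "same code, one ini file" idea can be sketched with Python's standard `configparser`. This is not the Innovation Center's pipeline code: the section names, keys, paths, module name and the `run_align` command are all hypothetical, chosen only to show a shared template being overridden by a per-cluster file.

```python
import configparser

# Hypothetical base template shared by every site.
BASE_INI = """
[DEFAULT]
genome_dir = /path/to/genomes
cores_per_job = 8

[align]
module = hypothetical-aligner/1.0
"""

# Hypothetical per-cluster overrides: only the values that differ on this site.
CLUSTER_INI = """
[DEFAULT]
genome_dir = /cluster_b/shared/genomes
cores_per_job = 12
"""


def load_config(*ini_texts):
    """Parse ini texts in order; later texts override earlier values."""
    config = configparser.ConfigParser()
    for text in ini_texts:
        config.read_string(text)
    return config


def build_command(config, step):
    """Build a shell command for one pipeline step from the merged settings."""
    section = config[step]
    return (
        f"module load {section['module']} && "
        f"run_{step} --threads {section['cores_per_job']} "
        f"--genomes {section['genome_dir']}"
    )


config = load_config(BASE_INI, CLUSTER_INI)
command = build_command(config, "align")
print(command)
```

Keeping site differences in a small override file means the pipeline code itself does not change between clusters, which is the property the slide describes.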
Canadian Epigenetics, Environment and Health Research Consortium (CEEHRC) $1.5M (2012-2017)
PORTal for the Analysis of Genetics and Genomics Experiments (PORTAGGE)
Conclusions • NGS offers a variety of technologies and numerous exciting applications • Many areas of NGS data analysis are still under active development (e.g. RNA-Seq) • A major challenge is to ensure sufficient compute and storage capacity so that more advanced analyses are not limited • Need to work together to avoid duplication of effort in installing tools but also to develop efficient ways to use HPC in biomedical research
Acknowledgements
IT team: Terrance Mcquilkin Marc-André Labonté Genevieve Dancausse Andras Frankel Alexandru Guja
Analysis team: Louis Letourneau Mathieu Bourgey Maxime Caron Gary Lévesque Robert Eveleigh Francois Lefebvre Johanna Sandoval Pascale Marquis
Development team: Nathalie Émond David Bujold Francois Cantin Catherine Côté Burak Demirtas Daniel Guertin Louis Dumond Joseph Francois Korbuly Marc Michaud Thuong Ngo
EDCC team: David Morais (UdeS) Carol Gauthier (UdeS) Bryan Caron (McGill) Alain Veilleux (UdeS) ME Rousseau (McGill)
guil.bourque@mcgill.ca