
Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology. Marta Milo Biomedical Science. Roy Chaudhuri Molecular Biology and Biotechnology. Eran Elhaik Animal and Plant Sciences. James Bradford Oncology.


Presentation Transcript


  1. Big Data Challenges in Biology and the Sheffield Bioinformatics Hub. Dr. Roy Chaudhuri, Department of Molecular Biology and Biotechnology

  2. Marta Milo (Biomedical Science); Roy Chaudhuri (Molecular Biology and Biotechnology); Eran Elhaik (Animal and Plant Sciences); James Bradford (Oncology); Ian Sudbery (Molecular Biology and Biotechnology, from December); Winston Hide (SITraN, from August)

  3. What is Big Data? Small data: 1 Big data: 1

  4. Biological Big Data – Imaging Data

  5. Biological Big Data – Sequence Data

  6. Sanger Dideoxy Sequencing

  7. Dye-terminator Sequencing Read lengths ~800bp

  8. phiX174 genome - 1977

  9. E. coli K-12 genome - 1997. Ordered sequencing approach. Escherichia coli K-12: 4.6 million base pairs

  10. Shotgun Sequencing

  11. Human genome ~3 billion base pairs

  12. 119 volumes, 4.75pt Courier

  13. 2007: Next-generation sequencing, a.k.a. massively parallel sequencing

  14. Read lengths 50-300bp (initially 37bp)

  15. De novo Genome Assembly [Figure labels: Genomic DNA; Contig, Gap, Contig]

  16. De Bruijn graph assembly. “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness”. Break the text up into fixed-length chunks called k-mers
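
The k-mer idea on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names `kmers` and `de_bruijn` are made up here, the k-mers are taken from one continuous piece of text rather than from many short reads, and there is no error handling or reverse-complement logic as a real assembler would need.

```python
from collections import defaultdict


def kmers(text, k):
    """Return every overlapping substring of length k, in order."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def de_bruijn(text, k):
    """Build a de Bruijn graph: each node is a (k-1)-mer, each k-mer is an edge.

    The result maps a (k-1)-mer prefix to the list of (k-1)-mer suffixes
    that follow it somewhere in the text.
    """
    graph = defaultdict(list)
    for kmer in kmers(text, k):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    quote = "it was the best of times it was the worst of times"
    print(kmers("it was the", 4))
    for node, successors in de_bruijn(quote, 4).items():
        print(repr(node), "->", successors)
```

Nodes whose successor list holds more than one distinct entry are the points where the text repeats and then diverges; these are the places an assembler has to resolve.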

  17. As read lengths increase, the de Bruijn graph becomes simpler. • Resolving bubbles is one of the key functions of assembly software. The process uses additional information such as coverage levels and paired reads. • If a bubble cannot be resolved, it results in a break in the assembly. • Memory is the limiting factor: de novo assembly of large and complex genomes can require >1 TB of memory
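
A toy illustration of why the graph gets simpler: the sketch below counts nodes with more than one distinct successor for a short and a longer k. It assumes that k stands in for read length (longer reads allow a larger k), and the helper name `branching_nodes` is hypothetical. It does not detect or resolve real bubbles, which would also need coverage and paired-read information.

```python
def branching_nodes(text, k):
    """Return the (k-1)-mers that are followed by more than one distinct (k-1)-mer."""
    successors = {}
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        successors.setdefault(kmer[:-1], set()).add(kmer[1:])
    return sorted(node for node, nxt in successors.items() if len(nxt) > 1)


if __name__ == "__main__":
    text = "ABCABD"  # the repeat "AB" is followed by "C" once and by "D" once
    print(branching_nodes(text, 2))  # short k: node 'B' has two successors
    print(branching_nodes(text, 4))  # longer k spans the repeat: no branching
```

With k = 2 the repeat "AB" collapses onto a shared node and the path forks; with k = 4 each k-mer reaches past the repeat, so every node has a single successor.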

  18. Resequencing: mapping to a reference genome

  19. Variant detection • Efficient Burrows-Wheeler transformed genome indexes • Memory is less of an issue than de novo assembly • Embarrassingly parallelisable task – number of cores important • Deep coverage required – issues with storage and disk I/O
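
To make the Burrows-Wheeler index on this slide more concrete, here is a minimal sketch of the transform and of the backward search that counts exact matches of a read in a reference. It is a teaching version under stated assumptions: the function names are invented for this example, the transform is built by sorting all rotations (far too slow and memory-hungry for a real genome), the text must not contain `$`, and production read mappers add sampled occurrence tables, suffix array sampling and mismatch handling that are not shown.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    s = text + "$"  # sentinel that sorts before every letter
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first_row[c] = number of characters in the text that sort before c
    first_row = {}
    total = 0
    for char, n in sorted(Counter(bwt_str).items()):
        first_row[char] = total
        total += n

    def occ(char, i):
        # Occurrences of char in bwt_str[:i]; real indexes precompute this.
        return bwt_str[:i].count(char)

    lo, hi = 0, len(bwt_str)  # current range of matching sorted rotations
    for char in reversed(pattern):
        if char not in first_row:
            return 0
        lo = first_row[char] + occ(char, lo)
        hi = first_row[char] + occ(char, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("banana")
    print(index)                            # annb$aa
    print(count_occurrences(index, "ana"))  # 2
```

The search touches the pattern one character at a time and never scans the reference itself, which is why memory and per-read cost stay modest compared with de novo assembly.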

  20. Transcriptome sequencing • RNA sequencing to understand gene expression • Requires splice-aware mapping to reference genome • It can be challenging to resolve alternative transcripts

  21. De novo transcriptome assembly • De novo transcriptome assembly is a complex problem • Many reads could belong to multiple transcripts • Transcripts present at different levels, so use coverage to distinguish overlapping transcripts

  22. Metagenome assembly of complex populations • Sargasso Sea • Soil metagenomes • Human microbiota, e.g. gut, skin, oral cavity: “the second human genome”, linked with non-infectious conditions such as obesity and cancer

  23. Single Molecule Real Time (SMRT) Sequencing Read lengths up to 30kb

  24. MinION and GridION: 50kb reads “easily obtained”. Promise of direct DNA, RNA and protein sequencing, and detection of epigenetic factors such as methylation

  25. The future • Genome sequencing technologies are developing at a rate that exceeds Moore’s Law • The limiting factor is our ability to analyse the data (this is known as “the Bioinformatics Gap”) • This may be as bad as it gets: improved read lengths and sequence quality may mean that less coverage is required for variant calling, and de novo assembly may become trivial or unnecessary • In the long run, it may be simpler to store the DNA and resequence it than to store the data • But there is no shortage of DNA to sequence, and there will be a need for real-time analysis software as sequencing becomes routine and ubiquitous • Increased emphasis on understanding genome function rather than structure
