1 / 25

Big Data Why it matters

Big Data Why it matters. Patrice KOEHL Department of Computer Science Genome Center UC Davis. The three I’s of Big Data. Big Data is: Ill-defined (what is it?) Immediate (we need to do something about it now) Intimidating (what if we don’t). (loosely adapted from Forbes).

quasar
Download Presentation

Big Data Why it matters

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Big DataWhy it matters Patrice KOEHL Department of Computer Science Genome Center UC Davis

  2. The three I’s of Big Data Big Data is: • Ill-defined (what is it?) • Immediate (we need to do something about it now) • Intimidating (what if we don’t) (loosely adapted from Forbes)

  3. Big Data: Volume Byte Kilobyte Megabyte Gigabyte Terabyte Petabyte Exabyte ZettabyteYottabyte KB MB GB TB PB EB ZB YB 1000 bytes 1000 KB 1000 MB 1000 GB 1000 TB 1000 PB 1000 ZB 1000YB

  4. Big Data: Volume One page of text One song 6 million books 55 storeys of DVD One movie Data up to 2003 Data in 2011 NSA data center 5 MB 30KB 5 GB 1 TB 1 PB 1.8 ZB 1 YB 5 EB Byte Kilobyte Megabyte Gigabyte Terabyte Petabyte Exabyte ZettabyteYottabyte KB MB GB TB PB EB ZB YB 1000 bytes 1000 KB 1000 MB 1000 GB 1000 TB 1000 PB 1000 ZB 1000YB

  5. Big Data: Volume One page of text One song 6 million books 55 storeys of DVD One movie Data up to 2003 Data in 2011 NSA data center 5 MB 30KB 5 GB 1 TB 1 PB 1.8 ZB 1 YB 5 EB Byte Kilobyte Megabyte Gigabyte Terabyte Petabyte Exabyte ZettabyteYottabyte KB MB GB TB PB EB ZB YB 1000 bytes 1000 KB 1000 MB 1000 GB 1000 TB 1000 PB 1000 ZB 1000YB 1s 20 mins 11 days 30 years 300 30 million 30 billion …. centuries years years

  6. Big Data: Volume, Velocity One minute in the digital world (Intel, 2013) 204 million 640 TB IP data transferred 50 GB e-mails sent of data generated at the Large Hadron Collider 3+ million 30 hours 1.3 million 6 million searches launched users connected videos uploaded videos viewed

  7. Big Data: Volume, Velocity, Variety Numbers text Images sound

  8. Big Data: Challenges • Volume and Velocity • Variety • Structured, Unstructured…. • Images, Sound, Numbers, Tables,… • Security • Reliability, Integrity, Validity

  9. Big Data: Challenges Large N: “Any dataset that is collected by a scientist whose data collection skills are far superior to her analysis skills” Computing issues: • Data transfer • Scalability of algorithms • Memory limitations • Distributed computing

  10. Big Data: Challenges Vizualization issues: The black screen problem (Matloff, 2013)

  11. Big Data: Challenges Rule of thumb: N/P > 5….what if it does not hold anymore? Large P, “small” N: • Curse of dimensionality (all data points seem equidistant) • Non linearity • Dimension reduction

  12. Big Data: Challenges and Opportunities Fourth Paradigm: data driven science Holistic approaches to major research efforts New paradigms in computing Digital Humanities Basic Translational Data Knowledge Societal Benefit

  13. Big Data: Enabling Dreams Understanding the physics of “Dark Energy” How the brain works: from neurons to cognition A holistic view of natural ecosystems Understanding climate changes From genotype to phenotype Precision medicine Big Humanities ….

  14. Big Data Dreams: Genomics

  15. Big Data Dreams: Genomics

  16. Genomics: Sequencing costs Cost per Human Genome Cost per Mbase http://www.genome.gov

  17. Genomics: Game changing technologies IlluminaHiSeq 2000 Capable of 600 Gb per run -> 1,000+ Gb 55 Gb/day 6 billion paired-end reads <$4,000 per human/plant genome <$200 per transcriptome Multiplex 384 pathogen isolates/lane • $10 (+ $50 library construction)/isolate Challenges: Library preparation & data analysis Gary Schroth (Illumina): “A single lab with one HiSeq is able to generate more sequences than was in GenBank in 2009, every four days”.

  18. Genomics @ UC Davis Massively parallel DNA sequencing 2 Illumina Genome Analyzers 1 IlluminaHiseq 2000, 2 Miseq 1 Roche 454 Junior 1 Pacific Biosystems RS GoldenGate SNP genotyping iScan, BeadArray & BeadExpress

  19. Cancer Genomics: Molecular Diagnostics

  20. Genomics: actual costs “A single lab with one HiSeq is able to generate more sequences than was in GenBank in 2009, every four days.” Gary Schroth (Illumina)

  21. Genomics: actual costs • Assembling 22GB conifer genome: • Data: • 16 billion pair reads (100 bases) • Processing: • 10 days for error correction • 11 days for assembling “super-reads” • 60 days to build contigs/scaffold • 8 days to fill in gaps “A single lab with one HiSeq is able to generate more sequences than was in GenBank in 2009, every four days.” Gary Schroth (Illumina) http://www.homolog.us/blogs/2013/05/11/ steven-salzberg-at-bog13-assembling-22gb-conifer-genome/

  22. Social Consequences of Commodity Sequencing • The danger of misuse predict sensitivities to various industrial or environmental agents discrimination by employers? • The impact of information that is likely to be incomplete an indication of a 25 percent increase in the risk of cancer? • Reversal of knowledge paradigm • Are the "products" of the Human Genome Project to be patented and commercialized? Myriad genetics and BRCA1/2 • How to educate about genetic research and its implications?

  23. Social Consequences of Commodity Sequencing

  24. Social Consequences of Commodity Sequencing

  25. How to Approach Big Data

More Related