Data-Intensive Computing Symposium: Report Out

Phillip B. Gibbons, Intel Research Pittsburgh

Data-Intensive Computing Symposium
• Held 3/26/08 at Yahoo! in Sunnyvale, CA
• Sponsored by:
  • Yahoo! Research
  • Computing Community Consortium (CCC), which supports the computing research community in creating compelling research visions and the mechanisms to realize these visions (http://www.cra.org/ccc/)
• ~100 invited attendees, ~12 invited talks
• Slides and video to be posted on the CCC web site
• Blog: http://dita.ncsa.uiuc.edu/xllora (thanks!)

Randy Bryant (CMU) Data-Intensive Scalable Computing
• Local speaker; I’ll skip in the interest of time
• DISC has been renamed

ChengXiang Zhai (UIUC) Text Information Management

ChengXiang Zhai (UIUC) Proposal 1: Maximum Personalization

Dan Reed (Microsoft) Clouds and ManyCore: The Revolution
• Big Data: Should focus more on the user experience
• How to manage resources
• Cloud computing can help organically orchestrate resources on demand
• Initiative to bring academics, business, and users together under the big data problem (PCAST NITRD review)

Jill Mesirov (Broad Institute) Computational Paradigms for Genomic Medicine
• Broad has 4.8K processors, 1.4 PB of storage on site
• Big Data Problem: Mining genome expression arrays
• Rows: patients; columns: genes; values: expression values
• Example: classify leukemias based on expression arrays
• Solved by grad student over the weekend using web sources
• Challenge: Computation/Analysis/Provenance infrastructure needed
• Developed GenePattern 3.1: Software infrastructure for interoperable informatics
• Usable by biologists
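The expression-array layout on this slide (one row per patient, one column per gene) can be made concrete with a minimal sketch. It assumes a nearest-centroid classifier, which is a stand-in chosen for brevity and not the method described in the talk or anything from GenePattern. The class labels, function names, and expression values below are all hypothetical toy data.

```python
def fit_centroids(matrix, labels):
    """Average the expression rows of each class into one centroid per class.

    matrix: list of rows, one per patient; each row holds one value per gene.
    labels: class label for each patient row (hypothetical labels here).
    """
    sums, counts = {}, {}
    for row, label in zip(matrix, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for gene_index, value in enumerate(row):
            acc[gene_index] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Made-up toy matrix: 4 patients (rows) x 2 genes (columns).
toy_matrix = [[5.0, 1.0], [6.0, 0.5], [1.0, 4.0], [0.5, 5.0]]
toy_labels = ["type_A", "type_A", "type_B", "type_B"]
toy_centroids = fit_centroids(toy_matrix, toy_labels)
```

Real expression arrays have thousands of gene columns, so practical pipelines usually select informative genes before classifying; the sketch skips that step.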

Garth Gibson (CMU) Simplicity and Complexity in Data Systems at Scale
• Petascale Data Storage Institute
• Understanding disk failures, cfdr.usenix.org
• Another local speaker, so I’ll skip in the interest of time

Jeff Dean (Google) Handling Large Datasets at Google

Jeff Dean (Google) GFS Usage

Jon Kleinberg (Cornell) Large-Scale Social Network Data
Diffusion in Social Networks
• Why is chain letter diffusion so deep & narrow?
• Iraq war authorization protest chain letter diffusion (18K nodes)

Marc Najork (Microsoft Research) Mining the Web Graph
• Query-dependent link-based ranking algorithms (HITS, SALSA)
• Scalable Hyperlink Store: used internally within MSR, for web graphs
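For readers unfamiliar with HITS, here is a minimal sketch of the textbook hub/authority iteration on a small edge list. It is a generic illustration, not the implementation discussed in the talk and not the Scalable Hyperlink Store API. The function name `hits`, the iteration count, and the toy graph are assumptions. In query-dependent use, the edge list would be the link neighborhood of a query's result set.

```python
from math import sqrt


def hits(edges, iterations=50):
    """Run the HITS iteration on a directed graph given as (source, target) pairs.

    Returns two dicts, (hub, authority), each L2-normalized.
    """
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    authority = {node: 1.0 for node in nodes}

    def normalized(scores):
        norm = sqrt(sum(value * value for value in scores.values())) or 1.0
        return {node: value / norm for node, value in scores.items()}

    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        authority = {node: 0.0 for node in nodes}
        for source, target in edges:
            authority[target] += hub[source]
        authority = normalized(authority)

        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            hub[source] += authority[target]
        hub = normalized(hub)

    return hub, authority


# Hypothetical toy graph: "a" and "b" both link to "c"; "a" also links to "d".
toy_hub, toy_authority = hits([("a", "c"), ("b", "c"), ("a", "d")])
```

In this toy graph "c" ends up with the highest authority score because two pages link to it, and "a" ends up as the strongest hub because it links to both authorities.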

Joe Hellerstein (UC Berkeley) “What” Goes Around
• Industrial revolution of data: sensors, logs, cameras
• Hardware revolution: datacenters/virtualization, many-core
• Industrial revolution in software? Declarative languages in some domains
Why “What”:
• Rapid prototyping
• Pocket-size code bases
• Independent from the runtime
• Ease of analysis and security
• Allow optimization and adaptability

Joe Hellerstein (UC Berkeley)
• Sensor networks, mobile networks, modular robotics, computer games, program analysis
• Distributed inference (junction trees and loopy belief propagation), graphs upon graphs
• Evita Raced: Overlog metacompiler (the compiler is written declaratively)
  • Matches Datalog optimizations (dynamic programming), cycle tests
• Datalog with known extensions and tweaks
• Centrality of rendezvous and graphs
• Challenges:
  • Performance beyond number of messages (e.g., memory hierarchy), availability, real programs, not Turing complete
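To give a feel for the declarative, Datalog-style "what" programs these slides refer to, the sketch below evaluates the classic two-rule reachability program bottom-up until a fixpoint. It is plain Datalog semantics written in Python for illustration, not Overlog syntax and not how Evita Raced or any Berkeley runtime evaluates rules. The function name `reachable` and the toy link facts are assumptions.

```python
def reachable(links):
    """Naive bottom-up evaluation of the Datalog program

        path(X, Y) :- link(X, Y).
        path(X, Z) :- link(X, Y), path(Y, Z).

    links: set of (source, target) facts. Returns the set of path facts.
    """
    path = set(links)  # first rule: every link is a path
    while True:
        # Second rule: join link(X, Y) with path(Y, Z) on the shared variable Y.
        derived = {
            (x, z)
            for (x, y) in links
            for (y2, z) in path
            if y == y2
        }
        new_facts = derived - path
        if not new_facts:  # fixpoint reached: no rule derives anything new
            return path
        path |= new_facts


# Hypothetical toy facts: a chain of three links.
toy_paths = reachable({("a", "b"), ("b", "c"), ("c", "d")})
```

The two rules state what a path is and leave evaluation order to the runtime; the loop above is only one possible strategy. Because the set of possible facts is finite, evaluation always terminates, which is the flip side of the "not Turing complete" point on the slide.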

Raghu Ramakrishnan (Yahoo! Res.) Sherpa: Cloud Computing of the Third Kind

Alex Szalay (Johns Hopkins) Scientific Applications of Large Databases

Phillip Gibbons (Intel Research) Data-Rich Computing: Where It’s At
I know where it’s at, man!
• Important, interesting, exciting research area
• Cluster approach: computing is co-located where the storage is at
• Memory hierarchy issues: where the (intermediate) data are at, over the course of the computation
• Pervasive multimedia sensing: processing & querying must be pushed out of the data center to where the sensors are at
Focus of this talk:

Hierarchy-Savvy Parallel Algorithm Design (HI-SPADE) project
Goal: Support a hierarchy-savvy model of computation for parallel algorithm design
• Hierarchy-savvy:
  • Hide what can be hid
  • Expose what must be exposed
  • Sweet-spot between ignorant and fully aware
• Support:
  • Develop the compilers, runtime systems, architectural features, etc. to realize the model
  • Important component: fine-grain threading

IrisNet’s Two-Tier Architecture
Two components:
• SAs: sensor feed processing
• OAs: distributed database
[Figure labels: User, Query, Web Server for the url, OAs (each with an XML database), SAs (each running senselets), Sensornet, Sensors]

Jeannette Wing (CMU/NSF) NSF Plans for Supporting Data-Intensive Computing
Google/IBM Data Center
• ~2000 processors, large Hadoop cluster
• Allocated in units of rack-weeks
• NSF will review proposals for use: Cluster Exploratory (CluE)
• Running Xen; won’t open up performance monitoring
• Goal: show applicability outside of computer science
Academic-Industry-Government partnership

Randy Bryant (CMU) Big Data Computing Study Group
• Collection of ~20 people (looking for volunteers)
• Goals:
  • Fostering educational activities
  • Advocacy
  • Building community
• CCC’s Big Data Computing Study Group seeks to foster collaborations between industry, academia, and the U.S. government to advance the state of the art in the development and application of large-scale computing systems for making intelligent use of the massive amounts of data being generated in science, commerce, and society
