Data-Intensive Computing Symposium: Report Out. Phillip B. Gibbons Intel Research Pittsburgh. Data-Intensive Computing Symposium. Held 3/26/08 @Yahoo! in Sunnyvale, CA Sponsored by: Yahoo! Research
Data-Intensive Computing Symposium: Report Out
Phillip B. Gibbons, Intel Research Pittsburgh
Data-Intensive Computing Symposium
• Held 3/26/08 @Yahoo! in Sunnyvale, CA
• Sponsored by:
  • Yahoo! Research
  • Computing Community Consortium: supports the computing research community in creating compelling research visions and the mechanisms to realize these visions (http://www.cra.org/ccc/)
• ~100 invited attendees, ~12 invited talks
• Slides and video to be posted on CCC web site
• Blog: http://dita.ncsa.uiuc.edu/xllora (thanks!)
Randy Bryant (CMU) Data-Intensive Scalable Computing
• Local speaker; I’ll skip in the interest of time
• DISC has been renamed
ChengXiang Zhai (UIUC) Proposal 1: Maximum Personalization
Dan Reed (Microsoft) Clouds and ManyCore: The Revolution
• Big Data: Should focus more on the user experience
• How to manage resources
• Cloud computing can help organically orchestrate resources on demand
• Initiative to bring academics, business, and users together under the big data problem (PCAST NITRD review)
Jill Mesirov (Broad Institute) Computational Paradigms for Genomic Medicine
• Broad has 4.8K processors, 1.4 PB of storage on site
• Big Data Problem: Mining genome expression arrays
• Rows: patients; columns: genes; values: expression values
• Example: classify leukemias based on expression arrays
• Solved by grad student over the weekend using web sources
• Challenge: Computation/Analysis/Provenance infrastructure needed
• Developed GenePattern 3.1: Software infrastructure for interoperable informatics
• Usable by biologists
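To make the expression-array layout concrete (rows are patients, columns are genes), here is a minimal sketch of classifying a new patient by nearest class centroid. The nearest-centroid rule, the toy numbers, and the class names `type_A` and `type_B` are all assumptions made for illustration; the slide does not say which method the talk used.

```python
def centroid(rows):
    """Mean expression profile (one value per gene) over a list of patient rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def squared_distance(u, v):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train(matrix, labels):
    """Group patient rows by class label and compute one centroid per class."""
    by_class = {}
    for row, label in zip(matrix, labels):
        by_class.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in by_class.items()}


def classify(centroids, row):
    """Return the label whose centroid is closest to the given patient row."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], row))


# Toy expression matrix (invented values): 4 patients x 3 genes.
matrix = [
    [9.1, 1.2, 5.0],
    [8.7, 0.9, 5.2],
    [1.1, 7.9, 4.8],
    [0.8, 8.3, 5.1],
]
labels = ["type_A", "type_A", "type_B", "type_B"]  # hypothetical class names

centroids = train(matrix, labels)
prediction = classify(centroids, [8.9, 1.0, 4.9])
print(prediction)
```

Real expression arrays have thousands of gene columns, so in practice a gene-selection step would normally come before any classifier like this one.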
Garth Gibson (CMU) Simplicity and Complexity in Data Systems at Scale
• Petascale Data Storage Institute
• Understanding disk failures, cfdr.usenix.org
• Another local speaker, so I’ll skip in the interest of time
Jeff Dean (Google) GFS Usage
Jon Kleinberg (Cornell) Large-Scale Social Network Data
• Diffusion in Social Networks
• Why is chain letter diffusion so deep & narrow?
• Iraq war authorization protest chain letter diffusion (18K nodes)
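"Deep & narrow" can be read as two measurements on the diffusion tree: the longest chain from the root, and the largest number of nodes at any one level. A minimal sketch, assuming the tree is given as a hypothetical `parent` mapping from each recipient to the person who forwarded the letter (the toy tree is invented, not the 18K-node data):

```python
from collections import Counter


def depth_and_width(parent):
    """Return (depth, max width) of a tree given as a child -> parent mapping.

    Assumes the mapping is acyclic and has a single root.
    """
    def level(node):
        # Count hops from the node up to the root.
        hops = 0
        while node in parent:
            node = parent[node]
            hops += 1
        return hops

    nodes = set(parent) | set(parent.values())
    per_level = Counter(level(n) for n in nodes)
    return max(per_level), max(per_level.values())


# Toy tree: a -> b -> c -> {d, e}
parent = {"b": "a", "c": "b", "d": "c", "e": "c"}
depth, width = depth_and_width(parent)
print(depth, width)
```

A tree is deep and narrow in this sense when `depth` is large relative to `width`.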
Marc Najork (Microsoft Research) Mining the Web Graph
• Query-dependent link-based ranking algorithms (HITS, SALSA)
• Scalable Hyperlink Store: used internally within MSR, for web graphs
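For readers unfamiliar with HITS, the sketch below shows its core hub/authority iteration on a tiny hand-made graph. It is an illustration only: it skips the query-dependent step of building the neighborhood graph around the query results, it does not cover SALSA, and it has nothing to do with how the Scalable Hyperlink Store is implemented. The function names and the toy edges are assumptions.

```python
from math import sqrt


def _normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all scores are zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: v / norm for n, v in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a graph given as (src, dst) links."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to the node.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages the node links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return hub, auth


# Toy web graph: pages a and b link to c; page a also links to d.
edges = [("a", "c"), ("b", "c"), ("a", "d")]
hub, auth = hits(edges)
print(sorted(auth, key=auth.get, reverse=True))
```

In this toy graph, `c` ends up as the strongest authority because both linking pages point to it, and `a` is the strongest hub because it links to both authorities.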
Joe Hellerstein (UC Berkeley) “What” Goes Around
• Industrial revolution of data: sensors, logs, cameras
• Hardware revolution: datacenters/virtualization, many-core
• Industrial revolution in software? Declarative languages in some domains
Why “What”:
• Rapid prototyping
• Pocket-size code bases
• Independent from the runtime
• Ease of analysis and security
• Allow optimization and adaptability
Joe Hellerstein (UC Berkeley)
• Sensor networks, mobile networks, modular robotics, computer games, program analysis
• Distributed inference (junction trees and loopy belief propagation), graphs upon graphs
• Evita Raced: Overlog metacompiler (compiler is written declaratively)
  • Matches Datalog optimizations (dynamic programming), cycle tests
• Datalog with known extensions and tweaks
• Centrality of rendezvous & graphs
• Challenges:
  • Performance beyond number of messages (e.g., memory hierarchy), availability, real programs, not Turing complete
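As a flavor of the declarative "what, not how" style, the classic Datalog reachability program is two rules, shown in the comments below. The Python around them is a minimal sketch of naive bottom-up evaluation, written for illustration only: it is plain Datalog rather than Overlog syntax, and it is not how Evita Raced or any real Datalog engine evaluates programs. The names `link`, `path`, and `transitive_closure` are assumptions.

```python
def transitive_closure(links):
    """Naive bottom-up evaluation of the two rules:

        path(X, Y) :- link(X, Y).
        path(X, Z) :- link(X, Y), path(Y, Z).

    The rules are applied repeatedly until no new facts are derived.
    """
    path = set(links)  # first rule: every link is a path
    while True:
        # Second rule: join link(X, Y) with path(Y, Z) on the shared variable Y.
        derived = {(x, z) for (x, y) in links for (y2, z) in path if y == y2}
        if derived <= path:  # fixpoint reached: nothing new was derived
            return path
        path |= derived


links = {("a", "b"), ("b", "c"), ("c", "d")}
paths = transitive_closure(links)
print(sorted(paths))
```

The rules say only what a path is; the evaluation order, the join strategy, and where the data lives are left to the runtime, which is the independence the "Why What" list refers to.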
Raghu Ramakrishnan (Yahoo! Research) Sherpa: Cloud Computing of the Third Kind
Alex Szalay (Johns Hopkins) Scientific Applications of Large Databases
Phillip Gibbons (Intel Research) Data-Rich Computing: Where It’s At
I know where it’s at, man!
• Important, interesting, exciting research area
• Cluster approach: computing is co-located where the storage is at
• Memory hierarchy issues: where the (intermediate) data are at, over the course of the computation
• Pervasive multimedia sensing: processing & querying must be pushed out of the data center to where the sensors are at
Focus of this talk:
Hierarchy-Savvy Parallel Algorithm Design (HI-SPADE) project
Goal: Support a hierarchy-savvy model of computation for parallel algorithm design
• Hierarchy-savvy:
  • Hide what can be hid
  • Expose what must be exposed
  • Sweet-spot between ignorant and fully aware
• Support:
  • Develop the compilers, runtime systems, architectural features, etc. to realize the model
  • Important component: fine-grain threading
IrisNet’s Two-Tier Architecture
• Two components:
  • SAs: sensor feed processing
  • OAs: distributed database
[Figure: architecture diagram showing the user, the query, the web server for the URL, the OAs (each with an XML database), the SAs running senselets, and the sensors of the sensornet.]
Jeannette Wing (CMU/NSF) NSF Plans for Supporting Data-Intensive Computing
Google/IBM Data Center
• ~2000 processors, large Hadoop cluster
• Allocate in units of rack-weeks
• NSF will review proposals for use: Cluster Exploratory (CluE)
• Running Xen; won’t open up performance monitoring
• Goal: Show applicable outside of computer science
Academic-Industry-Government partnership
Randy Bryant (CMU) Big Data Computing Study Group
• Collection of ~20 people (looking for volunteers)
• Goals:
  • Fostering educational activities
  • Advocacy
  • Building community
• CCC’s Big Data Computing Study Group seeks to foster collaborations between industry, academia, and the U.S. government to advance the state of the art in the development and application of large-scale computing systems for making intelligent use of the massive amounts of data being generated in science, commerce, and society