
System Support for Data-Intensive Applications


Presentation Transcript


  1. System Support for Data-Intensive Applications Katherine Yelick U.C. Berkeley, EECS

  2. The “Post PC” Generation Two technologies will likely dominate: 1) Mobile Consumer Electronic Devices • e.g., PDA, Cell phone, wearable computers, with cameras, recorders, sensors • make the computing “invisible” through reliability and simple interfaces 2) Infrastructure to Support such Devices • e.g., successor to Big Fat Web Servers, Database Servers • make these “utilities” with reliability and new economic models

  3. Open Research Issues • Human-computer interaction • uniformity across devices • Distributed computing • coordination across independent devices • Power • low power designs and renewable power sources • Information retrieval • finding useful information amidst a flood of data • Scalability • Scaling devices down • Scaling services up • Reliability and maintainability

  4. The problem space: big data • Big demand for enormous amounts of data • today: enterprise and internet applications • online applications: e-commerce, mail, web, archives • enterprise decision-support, data mining databases • future: richer data and more of it • computational & storage back-ends for mobile devices • more multimedia content • more use of historical data to provide better services • Two key application domains: • storage: public, private, and institutional data • search: building static indexes, dynamic discovery

  5. Reliability/Performance Trade-off • Techniques for reliability: • High level languages with strong types • avoid memory leaks, wild pointers, etc. • C vs. Java • Redundant storage, computation, etc. • adds storage and bandwidth overhead • Techniques for performance: • Optimize for a specific machine • e.g., cache or memory hierarchy • Minimize redundancy • These two goals work against each other
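
A generic illustration of the safety half of this trade-off (not from the talk): in C, an out-of-bounds store is a silent wild-pointer bug, while Java checks the access at run time, at some cost in performance.

  public class BoundsCheck {
      public static void main(String[] args) {
          int[] a = new int[4];
          // In C, a[4] = 42 would silently scribble past the end of the
          // array (a wild-pointer-style bug); in Java the access is
          // bounds-checked and surfaces immediately as an exception.
          try {
              a[4] = 42;
          } catch (ArrayIndexOutOfBoundsException e) {
              System.out.println("caught: " + e);   // error caught, not silent
          }
          // The check costs time -- one reason safety trades off against
          // performance, as this slide notes.
      }
  }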

  6. Specific Projects • ISTORE • A reliable, scalable, maintainable storage system • Data-intensive applications for “backend” servers • Modeling the real world • Storing and finding information • Titanium • A high level language (Java) with high performance • A domain-specific language and optimizing compiler • Sparsity • Optimization using partial program input

  7. ISTORE: Reliable Storage System • 80-node x86-based cluster, 1.4 TB storage • cluster nodes are plug-and-play, intelligent, network-attached storage “bricks” • a single field-replaceable unit to simplify maintenance • each node is a full x86 PC w/ 256 MB DRAM, 18 GB disk • 2-node system running now; full system in next quarter • (Figure, intelligent disk “brick”: half-height disk canister; portable PC CPU: Pentium II/266 + DRAM; redundant NICs (4 100 Mb/s links); diagnostic processor) • (Figure, ISTORE chassis: 80 nodes, 8 per tray; 2 levels of switches, 20 100 Mbit/s and 2 1 Gbit/s; environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...)

  8. A glimpse into the future? • System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk • ISTORE HW in 5-7 years: • building block: 2006 MicroDrive integrated with IRAM • 9GB disk, 50 MB/sec from disk • connected via crossbar switch • 10,000 nodes fit into one rack! • O(10,000) scale is our ultimate design point

  9. Specific Projects • ISTORE • A reliable, scalable, maintainable storage system • Data-intensive applications for “backend” servers • Modeling the real world • Storing and finding information • Titanium • A high level language (Java) with high performance • A domain-specific language and optimizing compiler • Sparsity • Optimization using partial program input

  10. Heart Modeling • A computer simulation of a human heart • Used to design artificial heart valves • Simulations run for days on a C90 supercomputer • Done by Peskin and MacQueen at NYU • Modern machines are faster but harder to use • working with NYU • using Titanium • Shown here: close-up of aortic valve during ejection • Images from the Pittsburgh Supercomputer Center

  11. Simulation of a Beating Heart • Shown here: • Aortic valve (yellow); Mitral valve (purple) • Mitral valve closes when the left ventricle pumps • Future: virtual surgery?

  12. Earthquake Simulation • Earthquake modeling • Used for retrofitting buildings, emergency preparedness, construction policies • Done by Bielak (CMU); also by Fenves (Berkeley) • Problems: grid (graph) generation; using images

  13. Earthquake Simulation • Movie shows a simulated aftershock following the 1994 Northridge earthquake in California • Future: sensors everywhere, tied to a central system

  14. Pollution Standards • Simulation of ozone levels • Done by Russell (CMU) and McRae (MIT) • Used to influence automobile emissions policy • Los Angeles Basin shown at 8am (left) and 2pm (right); the “cloud” shows areas where ozone levels are above the federal ambient air quality standard (0.12 parts per million)

  15. Information Retrieval • Finding useful information amidst huge data sets • I/O-intensive application • Today’s example: web search engines • ~10 million documents in a typical matrix • Web storage increasing 2x every 5 months • One class of techniques is based on sparse matrices • (Figure: a sparse matrix of # documents ~= 10 M rows by # keywords ~= 100 K columns; the matrix is compressed; “random” memory access; roughly one cache miss per 2 flops; runs at 1-5% of machine peak) • Problem: can you make this run faster, without writing hand-optimized, non-portable code?
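
For concreteness, here is a minimal plain-Java sketch (not from the talk, and not what Sparsity generates) of the kind of sparse matrix-vector product a search engine runs: documents are rows, keywords are columns, and scoring a query is y = A*x over a compressed sparse row (CSR) matrix. The indirect loads into the query vector are the “random” memory accesses noted above.

  public class DocScore {
      // Score all documents against a keyword query, with the
      // documents-by-keywords matrix stored in CSR format.
      static double[] spmv(int nDocs, int[] rowPtr, int[] colIdx,
                           double[] vals, double[] query) {
          double[] score = new double[nDocs];
          for (int i = 0; i < nDocs; i++) {
              double sum = 0.0;
              // Only the nonzero keyword weights of document i are visited;
              // query[colIdx[k]] is the indirect, cache-unfriendly access.
              for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
                  sum += vals[k] * query[colIdx[k]];
              }
              score[i] = sum;
          }
          return score;
      }
  }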

  16. Image-Based Retrieval • Digital library problem: • retrieval on images • content-based • Computer vision problem • uses sparse matrix • Future: search in medical image databases; diagnosis; epidemiological studies

  17. Object Based Image Description

  18. Specific Projects • ISTORE • A reliable, scalable, maintainable storage system • Data-intensive applications for “backend” servers • Modeling the real world • Storing and finding information • Titanium • A high level language (Java) with high performance • A domain-specific language and optimizing compiler • Sparsity • Optimization using partial program input

  19. Titanium Goals • Help programmers write reliable software • Retain safety properties of Java • Extend to parallel programming constructs • Performance • Sequential code comparable to C/C++/Fortran • Parallel performance comparable to MPI • Portability • How? • Domain-specific language and compiler • No JVM • Optimizing compiler • Explicit parallelism and other language constructs for high performance

  20. Titanium Overview: Sequential • Object-oriented language based on Java with: • Immutable classes • user-definable non-reference types for performance • Unordered loops • compiler is free to run iterations in any order • useful for cache optimizations and others • Operator overloading • by demand from our user community • Multidimensional arrays • points and index sets as first-class values • specific to an application domain: scientific computing with block-structured grids
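
A hedged, Titanium-flavored sketch of the immutable-class, unordered-loop, and multidimensional-array bullets above; the exact syntax is approximate and only illustrative, not code from the talk.

  immutable class Complex {
      // An immutable class: a user-defined, non-reference value type, so
      // arrays of Complex store the values inline, with no per-element
      // object headers or pointer chasing.
      public double re, im;
      public Complex(double r, double i) { re = r; im = i; }
  }

  class Scale {
      // A true multidimensional (2-D) Titanium array; foreach is the
      // unordered loop, so the compiler may visit the points of the
      // domain in any order, enabling cache optimizations.
      public static void scale(double [2d] a, double factor) {
          foreach (p in a.domain()) {
              a[p] = factor * a[p];
          }
      }
  }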

  21. Titanium Overview: Parallel Extensions of Java for scalable parallelism: • Scalable parallelism • SPMD model with global address space • Global communication library • E.g., broadcast, exchange (all-to-all) • Used to build data structures in the global address space • Parallel Optimizations • Pointer operations • Communication (underway) • Bulk asynchronous I/O • speed with safety

  22. Implementation • Strategy • Compile Titanium into C • Communicate through shared memory on SMPs • Lightweight communication for distributed memory • Titanium currently runs on: • Uniprocessors • SMPs with Posix or Solaris threads • Berkeley NOW, SP2 (distributed memory) • Tera MTA (multithreaded, hierarchical) • Cray T3E (global address space) • SP3 (cluster of SMPs, e.g., Blue Horizon at SDSC)

  23. Sequential Performance (results from ’98; new IR and optimization framework almost complete)

  Ultrasparc:     C/C++/Fortran   Java arrays   Titanium arrays   Overhead
  DAXPY                1.4s           6.8s            1.5s            7%
  3D multigrid          12s             --             22s           83%
  2D multigrid         5.4s             --            6.2s           15%
  EM3D                 0.7s           1.8s            1.0s           42%

  Pentium II:     C/C++/Fortran   Java arrays   Titanium arrays   Overhead
  DAXPY                1.8s             --            2.3s           27%
  3D multigrid        23.0s             --           20.0s          -13%
  2D multigrid         7.3s             --            5.5s          -25%
  EM3D                 1.0s             --            1.6s           60%

  24. SPMD Execution Model • Java programs can be run as Titanium, but the result will be that all processors do all the work • E.g., parallel hello world:

  class HelloWorld {
      public static void main (String [] argv) {
          System.out.println("Hello from proc " + Ti.thisProc());
      }
  }

  • Any non-trivial program will have communication and synchronization

  25. SPMD Execution Model • A common style is compute/communicate • E.g., in each timestep of a particle simulation with gravitational attraction:

  read all particles and compute forces on mine
  Ti.barrier();
  write to my particles using new forces
  Ti.barrier();

  • This basic model is used in the large-scale parallel simulations described earlier
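
A hedged Titanium-style sketch of such a timestep, using the barrier, array, and exchange-style constructs from these slides plus the unordered foreach loop sketched earlier; the Particle and ParticleSet classes and their methods are hypothetical, invented for illustration.

  class ParticleSet {              // hypothetical container for one proc's particles
      public Particle [1d] parts;
  }

  class NBodyStep {
      // 'mine' holds the particles owned by this processor; 'all' has one
      // entry per processor and was built earlier with exchange (slide 28),
      // so every processor can read everyone's pieces.
      static void timestep(Particle [1d] mine, ParticleSet [1d] single all) {
          // Phase 1: read all particles and compute forces on mine.
          for (int p = 0; p < Ti.numProcs(); p++) {
              foreach (i in mine.domain()) {
                  mine[i].addForcesFrom(all[p].parts);   // hypothetical method
              }
          }
          Ti.barrier();   // everyone finishes reading before anyone writes

          // Phase 2: write to my particles using the new forces.
          foreach (i in mine.domain()) {
              mine[i].advance();                         // hypothetical method
          }
          Ti.barrier();   // writes complete before the next timestep reads
      }
  }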

  26. SPMD Model • All processors start together and execute the same code, but not in lock-step • Basic control done using • Ti.numProcs(): total number of processors • Ti.thisProc(): number of the executing processor • Sometimes they do something independent:

  if (Ti.thisProc() == 0) { ... do setup ... }
  System.out.println("Hello from " + Ti.thisProc());
  double [1d] a = new double [Ti.numProcs()];

  27. Barriers and Single • A common source of bugs is barriers or other global operations (barrier, broadcast, reduction, exchange) inside branches or loops • A “single” method is one called by all procs: public single static void allStep(...) • A “single” variable has the same value on all procs: int single timestep = 0; • The compiler uses “single” type annotations to ensure there are no synchronization bugs with barriers
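
As an illustration of the kind of bug the single analysis catches, here is a hedged Titanium-style sketch (syntax approximated from the constructs on these slides; 'residual' and the broadcast guard are invented for illustration): a barrier guarded by a non-single condition can deadlock, because processors may disagree about the branch; broadcasting the decision makes it single-valued.

  class SingleExample {
      static void step(double residual) {
          // BUG (rejected by the single analysis): 'residual' is computed
          // locally, so processors can disagree, and some would skip the
          // barrier while others wait forever:
          //   if (residual < 1e-6) { Ti.barrier(); }

          // OK: the broadcast result is single-valued, so every processor
          // takes the same branch and either all reach the barrier or none do.
          boolean single done = broadcast (residual < 1e-6) from 0;
          if (done) {
              Ti.barrier();
          }
      }
  }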

  28. Explicit Communication: Exchange • To create shared data structures • each processor builds its own piece • pieces are exchanged (for objects, just exchange pointers) • Exchange primitive in Titanium:

  int [1d] single allData;
  allData = new int [0:Ti.numProcs()-1];
  allData.exchange(Ti.thisProc()*2);

  • E.g., on 4 procs, each will have a copy of allData: 0 2 4 6

  29. Exchange on Objects • More interesting example:

  class Boxed {
      public Boxed (int j) { val = j; }
      public int val;
  }

  Object [1d] single allData;
  allData = new Object [0:Ti.numProcs()-1];
  allData.exchange(new Boxed(Ti.thisProc()));

  30. Use of Global / Local • As seen, references (pointers) may be remote • easy to port shared-memory programs • Global pointers are more expensive than local • True even when data is on the same processor • Use local declarations in critical sections • Costs of global: • space (processor number + memory address) • dereference time (check to see if local) • May declare references as local

  31. Global Address Space • Processes allocate locally • References can be passed to other processes:

  class C { int val; ... }
  C gv;          // global pointer
  C local lv;    // local pointer
  if (thisProc() == 0) {
      lv = new C();
  }
  gv = broadcast lv from 0;
  gv.val = ...;
  ... = gv.val;

  • (Figure: process 0 allocates the object and refers to it through its local pointer lv; after the broadcast, process 0 and all other processes can reach it through the global pointer gv)

  32. Local Pointer Analysis • Compiler can infer many uses of local • “Local Qualification Inference” (Liblit’s work) • Data structures must be well partitioned

  33. Bulk Asynchronous I/O Performance External sort benchmark on NOW • raf: random access file (Java) • ds: unbuffered stream (Java) • dsb: buffered stream (Java) • bulkraf: bulk random access (Titanium) • bulkds: bulk sequential (Titanium) • async: asynchronous (Titanium)

  34. Performance Heterogeneity • System performance limited by the weakest link • Performance heterogeneity is the norm • disks: inner vs. outer track (50%), fragmentation • processors: load (1.5-5x) • Virtual Streams: dynamically off-load I/O work from slower disks to faster ones

  35. Parallel performance on an SMP • Speedup on Ultrasparc SMP (shared memory multiprocessor) • EM3D performance linear • simple kernel • AMR largely limited by • problem size • 2 levels, with top one serial

  36. Parallel Performance on a NOW • MLC for Finite-Differences by Balls and Colella • Poisson equation with infinite boundaries • arise in astrophysics, some biological systems, etc. • Method is scalable • Low communication • Performance on • SP2 (shown) and t3e • scaled speedups • nearly ideal (flat) • Currently 2D and non-adaptive

  37. Performance on CLUMPs • Clusters of SMPs (CLUMPs) have two levels of communication • Blue Horizon at SDSC has 144 nodes, each with 8 processors • the 8th processor cannot be used effectively

  38. Cluster of SMPs • Communication within a node is shared-memory • Communication between nodes uses LAPI • for large messages, a separate thread is created by LAPI • interferes with computation performance

  39. Optimizing Parallel Programs • Would like the compiler to introduce asynchronous communication, which is a form of possible reordering • Hardware also reorders • out-of-order execution • write buffering with read bypass • non-FIFO write buffers • Software already reorders too • register allocation • any code motion • System provides enforcement primitives • e.g., volatile: not well-defined at the language level • tend to be heavyweight, unpredictable • Can the compiler hide all this?

  40. Semantics: Sequential Consistency • When compiling sequential programs, reordering

  x = expr1;                    y = expr2;
  y = expr2;         into       x = expr1;

  is valid if y is not in expr1 and x is not in expr2 (roughly) • When compiling parallel code, this is not a sufficient test:

  Initially flag = data = 0

  Proc A                        Proc B
  data = 1;                     while (flag != 1);
  flag = 1;                     ... = ...data...;
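
A modern-Java illustration of why this matters (not from the talk): the sketch below runs the Proc A / Proc B pattern in two threads. Under today's Java memory model, declaring flag volatile forbids the compiler and hardware reorderings discussed above; without it, Proc B may legally observe flag == 1 while still reading data == 0.

  public class FlagData {
      static int data = 0;
      static volatile int flag = 0;   // without 'volatile', the writes and
                                      // reads may be reordered and 0 printed

      public static void main(String[] args) throws InterruptedException {
          Thread procA = new Thread(() -> {
              data = 1;                   // write data
              flag = 1;                   // then write flag
          });
          Thread procB = new Thread(() -> {
              while (flag != 1) { }       // spin: read flag until it is set
              System.out.println(data);   // read data; prints 1 when flag is volatile
          });
          procB.start();
          procA.start();
          procA.join();
          procB.join();
      }
  }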

  41. Cycle Detection: Dependence Analog • Processors define a “program order” on accesses from the same thread: P is the union of these total orders • The memory system defines an “access order” on accesses to the same variable: A is the access order (read/write and write/write pairs) • A violation of sequential consistency is a cycle in P ∪ A • Intuition: time cannot flow backwards • (Figure: for the flag/data example, the cycle is write data → write flag → read flag → read data → back to write data)

  42. Cycle Detection • Generalizes to arbitrary numbers of variables and processors • Cycles may be arbitrarily long, but it is sufficient to consider only cycles with 1 or 2 consecutive stops per processor [Shasha & Snir] • (Figure: accesses write x, write y, read y, read y, write x across processors forming such a cycle)

  43. Static Analysis for Cycle Detection • Approximate P by the control flow graph • Approximate A by undirected “dependence” edges • Let the “delay set” D be all edges from P that are part of a minimal cycle • The execution order of D edges must be preserved; other P edges may be reordered (modulo the usual rules about serial code) • Synchronization analysis is also critical [Krishnamurthy] • (Figure: two threads -- write z, read x, write y on one and read x, read y, write z on the other -- with the delay edges marked)
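
The delay-set idea can be made concrete with a small, self-contained Java sketch (not the Titanium compiler's analysis, and ignoring the minimal-cycle refinement of slide 42): directed program-order (P) edges and undirected conflict (A) edges form a mixed graph, and a P edge must be preserved exactly when its head can reach its tail through that graph, i.e., when the edge lies on a cycle in P ∪ A. The access names come from the flag/data example on slide 40.

  import java.util.*;

  public class DelaySet {
      // Adjacency lists over the mixed graph: P edges are directed,
      // A (conflict) edges are added in both directions.
      static Map<String, List<String>> succ = new HashMap<>();

      static void edge(String a, String b) {
          succ.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
      }
      static void pEdge(String a, String b) { edge(a, b); }
      static void aEdge(String a, String b) { edge(a, b); edge(b, a); }

      // True if there is a path from 'from' to 'to' in P ∪ A.
      static boolean reaches(String from, String to) {
          Deque<String> stack = new ArrayDeque<>(List.of(from));
          Set<String> seen = new HashSet<>();
          while (!stack.isEmpty()) {
              String n = stack.pop();
              if (n.equals(to)) return true;
              if (seen.add(n)) stack.addAll(succ.getOrDefault(n, List.of()));
          }
          return false;
      }

      public static void main(String[] args) {
          // Program order (P): one chain per processor (slide 40's example).
          String[][] p = { {"A:write data", "A:write flag"},
                           {"B:read flag",  "B:read data"} };
          for (String[] e : p) pEdge(e[0], e[1]);
          // Conflicts (A): accesses to the same variable on different procs.
          aEdge("A:write flag", "B:read flag");
          aEdge("A:write data", "B:read data");

          // A P edge is a delay edge if its head reaches its tail through
          // the mixed graph, i.e., it lies on a cycle in P ∪ A.
          for (String[] e : p) {
              boolean delay = reaches(e[1], e[0]);
              System.out.println(e[0] + " -> " + e[1] + " : "
                                 + (delay ? "must preserve order" : "may reorder"));
          }
      }
  }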

  44. Automatic Communication Optimization • Implemented in a subset of C with limited pointers • Experiments on the NOW; 3 synchronization styles • Future: pointer analysis and optimizations • (Chart: execution time, normalized)

  45. Specific Projects • ISTORE • A reliable, scalable, maintainable storage system • Data-intensive applications for “backend” servers • Modeling the real world • Storing and finding information • Titanium • A high level language (Java) with high performance • A domain-specific language and optimizing compiler • Sparsity • Optimization using partial program input

  46. Sparsity: Sparse Matrix Optimizer • Several data mining and web search algorithms use sparse matrix-vector multiplication • used for documents, images, video, etc. • irregular, indirect memory patterns perform poorly on memory hierarchies • Performance improvements are possible, but depend on: • sparsity structure, e.g., keywords within documents • machine parameters, without analytical models • Good news: • the operation is repeated many times on a similar matrix • Sparsity: an automatic code generator based on matrix structure and machine
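
The register-blocking idea behind Sparsity can be sketched as follows (illustrative plain Java, not Sparsity's generated code): store the matrix in r x c block-CSR and unroll the small dense block multiply so the partial sums stay in registers; Sparsity's job is to pick r x c and generate such code automatically for the given matrix and machine. The sketch fixes r = c = 2.

  public class BlockedSpMV {
      // y += A*x where A is stored in 2x2 block-CSR (BCSR): each stored
      // block holds 4 values in row-major order, and blockCol holds the
      // first column index of each block.
      static void spmv2x2(int nBlockRows, int[] blockRowPtr, int[] blockCol,
                          double[] vals, double[] x, double[] y) {
          for (int bi = 0; bi < nBlockRows; bi++) {
              double y0 = 0.0, y1 = 0.0;   // accumulators stay in registers
              for (int k = blockRowPtr[bi]; k < blockRowPtr[bi + 1]; k++) {
                  int j = blockCol[k];
                  double x0 = x[j], x1 = x[j + 1];
                  int v = 4 * k;
                  y0 += vals[v]     * x0 + vals[v + 1] * x1;
                  y1 += vals[v + 2] * x0 + vals[v + 3] * x1;
              }
              y[2 * bi]     += y0;
              y[2 * bi + 1] += y1;
          }
      }
  }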

  47. Sparsity: Sparse Matrix Optimizer

  48. Summary • Future • small devices + larger servers • reliability increasingly important • Reliability techniques include • hardware: redundancy, monitoring • software: better languages, many others • Performance trades off against safety in languages • use of domain-specific features (e.g., Titanium)

  49. Backup Slides

  50. The Big Motivators for Programming Systems Research • Ease of Programming • Hardware costs -> 0 • Software costs -> infinity • Correctness • Increasing reliance on software increases cost of software errors (medical, financial, etc.) • Performance • Increasing machine complexity • New languages and applications • Enabling Java; network packet filters
