SimMillennium and Beyond From “Computer Systems, Computational Science and Engineering in the Large” to “petabyte stores”. David Culler, NSF Site Visit March 5, 2003. SimMillennium Project Goals.
SimMillennium and Beyond: From “Computer Systems, Computational Science and Engineering in the Large” to “petabyte stores”. David Culler, NSF Site Visit, March 5, 2003
SimMillennium Project Goals
• Vision: To work, think, and study in a computationally rich environment with deep information stores and powerful services
• Enable major advances in Computational Science and Engineering
  • Simulation, Modeling, and Information Processing becoming ubiquitous
• Explore novel design techniques for large, complex systems
  • Fundamental Computer Science problems ahead are problems of scale
  • Organized in concert with Univ. structure => computational economy
• Develop fundamentally better ways of assimilating and interacting with large volumes of information and with each other
• Explore emerging technologies
  • networking, OS, devices
Millennium
Research Infrastructure We Built
• Cluster of Clusters (CLUMPS) distributed over multiple departments
  • gigabit ethernet within and between
  • Myrinet high-speed interconnect
• Vineyard Cluster System Architecture
  • Rootstock remote cluster installation tools
  • Ganglia remote cluster monitoring
  • GEXEC remote execution, GM (Myricom) messaging, MPI
  • PCP – parallel file tools
  • collection of port daemons, tools to make it all hang together
• Gigabit to desktop, immersadesk, ...
Cluster Counts
• Millennium Central Cluster
  • 99 Dell 2300/6400/6450 Xeon Dual/Quad: 336 processors
  • Total: 238 GB memory, 2 TB disk
  • Myrinet 2000 + 1000Mb fiber ethernet
• Millennium Campus Clusters (Astro, Math, CE, EE, Physics, Bio)
  • 176 proc, 34 GB mem, 1.2 TB local disk
• total: 512 proc, 292 GB mem, 3.2 TB scratch
• NPACI ROCKS Cluster
  • 8 proc, 2 GB mem, 36 GB
• OceanStore/ROC cluster
• PlanetLab Cluster
  • 6 proc, 1.32 GHz, 3 GB mem, 180 GB
• CITRIS Cluster 1: 3/2002 deployment (Intel Donation)
  • 4 Dell Precision 730 Itanium Duals: 8 processors
  • Total: 8 GB memory, 128 GB disk
  • Myrinet 2000 + 1000Mb copper ethernet (SimMil)
• CITRIS Cluster 2: deployment (Intel Donation)
  • ~128 Dell McKinley class Duals: 256 processors
  • 16x2 installed
  • Total: ~512 GB memory, ~8 TB disk
  • Myrinet 2000 + 1000Mb copper ethernet (SimMil)
• Many phasing out
  • NOW, Ninja, Dig Lab. ...
Cluster Top Users 2/2003
http://ganglia.millennium.berkeley.edu
• ~800 users total on central cluster
• 84 major users for 2/2003: average 62% total CPU utilization
• ROC – middle tier storage layer testing/performance (bling, ach, fox@stanford)
• Computer Vision Group – image recognition, boundary detection and segmentation, data mining (aberg, lwalk, dmartin, ryanw, xren): “2 hours on cluster vs. 2 weeks on local resources”
• Computational Biology Lab – large-scale biological sequence database searches in parallel (brenner@compbio)
• Tempest – TCAD tools for Next Generation Lithography (yunfei)
• Internet services – performance characteristics of multithreaded servers (jrvb, jcondit)
• Sensor Networks – power reduction (vwen)
• Economic modeling – (stanton@haas)
• Machine learning – information retrieval, text processing (blei)
• Analyzing trends in BGP routing tables (sagarwal, mccaesar)
• Graphics – optical simulation and high quality rendering (adamb, csh)
• Digital Library Project – image retrieval by image content (loretta)
• Bottleneck Analysis of Fine-grain Parallelism – (bfields)
• SPUR – Earthquake simulation (jspark@ce)
• Titanium – compiler and runtime system design for high performance parallel programming languages (bonachea)
• AMANDA – neutrino detection from polar ice core samples (amanda)
Impact
• Numerous groups doing research they could not have done without it
  • Malik: photorealistic rendering, physics simulation, ...
  • Yelick: Titanium, Heart Modeling, ...
  • Wilensky: Digital Library, image segmentation
  • Brewer, Culler: Ninja Internet Service Arch...
  • Price: AMANDA, ...
  • Kubiatowicz: OceanStore; Katz: Sahara; Hellerstein: PIER
• First eScience Portals
  • Tempest, EUV lithography, Sugar MEMS simulation services
  • safe.millennium.berkeley.edu on Sept 11
    • built w/i hours, scaled to million hits per day
• CS267 – core of MS of computation science X
• Cluster tools widely adopted
  • NPACI ROCKS
  • Ganglia the most downloaded cluster tool, in all the distributions, OSCAR, open source development team
Computational Economy
• Developed economic-based resource allocation
  • decentralized design
  • interactive and batch
• Advanced the state of the art
  • controlled experiments with priced and unpriced clusters
  • analysis of utility gain relative to traditional resource allocation algorithms
• Picked up in several other areas
  • index – pricing internet bandwidth
  • iceberg – pricing in telco/internet merge
  • core to internet design for planetary scale services
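The slide does not say which pricing mechanism the Millennium clusters used. As a minimal sketch of what economic-based allocation can look like, the block below splits capacity in proportion to what each user bids. The function name `proportional_share`, the user names, and the bid values are all hypothetical.

```python
def proportional_share(bids, capacity):
    """Split `capacity` among users in proportion to their bids.

    Illustrative only: proportional share is one common market-style
    scheme, assumed here because the slide names no mechanism.
    """
    total = sum(bids.values())
    if total <= 0:
        # Nobody is bidding, so nobody is allocated anything.
        return {user: 0.0 for user in bids}
    return {user: capacity * bid / total for user, bid in bids.items()}


# Hypothetical bids (credits per hour) for a pool of 100 CPUs.
bids = {"alice": 3.0, "bob": 1.0}
shares = proportional_share(bids, capacity=100.0)
```

Because each share depends only on the user's own bid and the bid total, the rule can be evaluated independently per node, which is one way to read "decentralized design".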
Emergence of Planetary-Scale Services
• In past year Millennium became THE simulation engine for P2P
  • OceanStore, I^3, Sahara, BGP alternatives, PIER
• Ganglia was the technical enabler for PlanetLab
  • > 100 machines at > 50 sites in > 8 countries
  • THE testbed for internet-scale systems research
Fundamental Bottleneck: Storage
• Current storage hierarchy
  • based on NPACI reference
  • 3 TB local /scratch and /net/MMxx/scratch, 4-day deletion
  • 0.5 TB global NFS /work, 9-day deletion
    • inadequate BW and capacity
  • ~4 TB /home and /project
    • uniform naming through automount
    • doesn’t scale to cluster access
• => augment capacity, BW, and metadata BW
• we’ve been tracking cluster storage options since xFS on NOW and Tertiary Disk in 1995
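The 4-day and 9-day deletion windows above imply a periodic sweep of the scratch areas. The sketch below shows one way such a sweep could pick its candidates. It assumes age is measured by modification time, which the slide does not state, and `expired_files` and `RETENTION_DAYS` are hypothetical names. It only lists files and deletes nothing.

```python
import os
import time

SECONDS_PER_DAY = 24 * 60 * 60

# Retention windows from the slide; the mapping itself is hypothetical.
RETENTION_DAYS = {"/scratch": 4, "/work": 9}


def expired_files(root, max_age_days, now=None):
    """Return sorted paths under `root` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are judged by their own age.
                if os.lstat(path).st_mtime < cutoff:
                    stale.append(path)
            except FileNotFoundError:
                # The file vanished mid-scan, which is normal on busy scratch.
                continue
    return sorted(stale)
```

A sweep would call `expired_files(path, days)` for each entry in `RETENTION_DAYS` and hand the result to whatever removal step the site trusts.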
Another Cluster – a storage cluster
[Diagram labels: Millennium Clusters, CITRIS Clusters, Scalable GigE Core, Massive Storage Clusters, Myrinet SAN]
• Designed for higher reliability
• Avoid competition from on-going computation
• Local disks heavily used as scratch
Initial Cluster Design with 3.5TB Distributed File Store
[Diagram components: Campus Core; Foundry 8000 and Foundry 1500 switches; 2 Frontend Nodes; 128 Dual Itanium 2 Compute Nodes (1 TFlop, 1.6TB memory); Myrinet 2000; 2 MetaServers; 4 Storage Controllers; 3.5TB Fibre Channel Storage. Link types: Gigabit Ethernet, Myrinet, Fibre Channel]
Initial 3.5 TB Cluster Data Store
[Diagram: 2 Meta Servers and 4 Storage Controllers, each Storage Controller with 864GB. Legend: 36GB 15K rpm disk, fibre channel, gbit ethernet, myrinet]
• BlueARC si8300 with 24 36GB 15K rpm disks and growth room
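The figures on this slide can be cross-checked with a little arithmetic. The sketch assumes the 24 disks are per storage controller, which matches the 864GB label on each one, and uses decimal units (1 TB = 1000 GB).

```python
# Figures from the slide; reading "24 disks" as per controller is an assumption.
DISKS_PER_CONTROLLER = 24
DISK_GB = 36
CONTROLLERS = 4

per_controller_gb = DISKS_PER_CONTROLLER * DISK_GB  # matches the 864GB label
total_gb = CONTROLLERS * per_controller_gb
total_tb = total_gb / 1000  # decimal terabytes

print(f"{per_controller_gb} GB per controller, {total_tb:.2f} TB in total")
```

Under that reading, four controllers give 3456 GB, which rounds to the 3.5 TB in the slide title.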
Lustre: A High-Performance, Scalable, Distributed File System for Clusters and Shared-Data Environments
• Progress since xFS
  • TruCluster, GPFS, pvfs, ...
  • need “production quality”
  • NAS is finally here
• History: CMU, Seagate, Los Alamos, Sandia, TriLabs
• Distributed filesystem replacing NFS
• Object-based file storage
  • object, like an inode, represents a file
• Open source development managed by Cluster File Systems, Inc.
• Gaining wide acceptance for production high-performance computing
  • PNNL and LLNL
  • Los Alamos and Sandia Labs
  • HP support as part of linux cluster effort
  • Intel Enterprise Architecture Lab
Lustre: Key Advantages
• Open protocols, standards: Portals API, XML, LDAP
• Runs on commodity PC hardware + 3rd party OST
  • such as BlueArc
• Uses commodity filesystems on OSTs
  • such as ext3, JFS, ReiserFS, and XFS
• Scalable and efficient design splits
  • (qty 2) Metadata servers: storing file system metadata
  • (up to 100) Object storage targets: storing files
  • To support up to 2000+ clients
• Flexible model for adding new storage to existing Lustre file system
• Metadata server failover
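The split between metadata servers and object storage targets can be illustrated with a toy in-memory model. This is not Lustre's API or its real on-disk layout: the class `ToyObjectStore`, the round-robin placement, and the stripe size are assumptions chosen only to show metadata (which objects make up a file) held separately from the data (the objects themselves).

```python
class ToyObjectStore:
    """Toy model: one metadata map plus several object storage targets."""

    def __init__(self, num_osts, stripe_size):
        self.stripe_size = stripe_size
        # Each target maps object id -> bytes.
        self.osts = [dict() for _ in range(num_osts)]
        # Metadata side: file name -> ordered list of (target index, object id).
        self.metadata = {}
        self._next_id = 0

    def delete(self, name):
        """Drop a file's objects from the targets, then its metadata entry."""
        for ost, oid in self.metadata.pop(name, []):
            del self.osts[ost][oid]

    def write(self, name, data):
        """Cut `data` into stripes and place them round-robin on the targets."""
        self.delete(name)  # overwriting replaces any earlier layout
        layout = []
        for offset in range(0, len(data), self.stripe_size):
            ost = (offset // self.stripe_size) % len(self.osts)
            oid = self._next_id
            self._next_id += 1
            self.osts[ost][oid] = data[offset:offset + self.stripe_size]
            layout.append((ost, oid))
        self.metadata[name] = layout

    def read(self, name):
        """Look up the layout once, then fetch each object from its target."""
        return b"".join(self.osts[ost][oid] for ost, oid in self.metadata[name])


store = ToyObjectStore(num_osts=4, stripe_size=3)
store.write("a.dat", b"0123456789")
```

In this model a client consults the metadata map once per file and then talks to the targets directly, which is why adding targets adds bandwidth without adding load on the metadata side.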
Lustre: Functionality
[Diagram: three components and the traffic between them]
• Clients and Meta Servers (Meta Data Servers): directory metadata and concurrency
• Clients and Storage Controllers (Object Storage Targets): system and parallel file I/O, file locking
• Meta Servers and Storage Controllers: recovery, file status, file creation
Growth Plan
• based on conservative 50% per year density
• expect roughly double
Year   Capacity   SS   MS
y03    3.5 TB     4    2
y04    8 TB       6    3
y05    14 TB      8    3
y06    23 TB      8    3
y07    35 TB      8    3
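The planned capacities can be compared with the 50% per year density assumption. The sketch below pairs the capacities with the years in ascending order and reads "SS" as the number of storage servers. Both are assumptions about a flattened figure.

```python
# (capacity in TB, SS count) per year; pairing and the meaning of SS are assumed.
plan = {2003: (3.5, 4), 2004: (8, 6), 2005: (14, 8), 2006: (23, 8), 2007: (35, 8)}
years = sorted(plan)

# Year-over-year growth of total capacity and of capacity per SS.
total_growth = {y: plan[y][0] / plan[y - 1][0] for y in years[1:]}
per_ss_tb = {y: plan[y][0] / plan[y][1] for y in years}
per_ss_growth = {y: per_ss_tb[y] / per_ss_tb[y - 1] for y in years[1:]}

overall = plan[2007][0] / plan[2003][0]  # growth of the whole store
density_only = 1.5 ** 4                  # four years at 50% per year

for y in years[1:]:
    print(y, f"total x{total_growth[y]:.2f}", f"per-SS x{per_ss_growth[y]:.2f}")
```

Under these assumptions the total grows faster than density alone would give, and the per-SS growth stays close to 1.5x per year. The difference comes from the additional SS units in the first two years.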
Example Projects
• Cluster monitoring trace
  • ¼ TB per year for 300 nodes
• ROC failure data
  • ¼ TB per year, much higher if we get industrial feeds
• Digital Library
  • Video
    • 100 GB/hour uncompressed
• Vision
  • 100 GB per experiment
• PlanetLab
  • internet-wide instrumentation and logging
We will look back and say, “we are doing research today that we could not have done without this”
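To give the data rates above some scale, the sketch below converts them into store capacity. It uses decimal units (1 TB = 1000 GB) and the 3.5 TB and 35 TB capacities quoted elsewhere in the deck. The helper name `hours_of_video` is hypothetical.

```python
GB_PER_TB = 1000  # decimal units, an assumption


def hours_of_video(capacity_tb, gb_per_hour=100):
    """Hours of uncompressed video that fit in `capacity_tb` terabytes."""
    return capacity_tb * GB_PER_TB / gb_per_hour


initial_hours = hours_of_video(3.5)  # the initial 3.5 TB store
final_hours = hours_of_video(35)     # the largest capacity in the growth plan

# Monitoring trace: a quarter terabyte per year spread over 300 nodes.
trace_gb_per_node_year = 0.25 * GB_PER_TB / 300

# Vision: experiments of 100 GB each that fit in the initial store.
vision_experiments = int(3.5 * GB_PER_TB // 100)
```

Even the initial store holds only a few dozen hours of uncompressed video or a few dozen vision experiments, which is the argument for planning growth from the start.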
End of the Tape Era
[Figure: log $/GB versus year for disk and tape, with 2001 marked]
Emergence of the Sensor Net Era
• 100s of research groups and companies using the Berkeley Mote / TinyOS platform
  • dozens of projects on campus
• billions of networked devices connected to the physical world, constantly streaming data
• => start building the storage and processing infrastructure for this new class of system today!
Environment Monitoring Experience
[Diagram labels: Sensor Node, Sensor Patch, Patch Network, Gateway, Transit Network, Basestation, Base-Remote Link, Internet, Data Service, Client Data Browsing and Processing]
• Canonical “patch” net architecture
• live & historical readings: www.greatduckisland.net
• 43 nodes, 7/13-11/18
  • above and below ground
  • light, temperature, relative humidity, and occupancy data, at 1 minute resolution
• >1 million measurements
  • Best nodes ~90,000
• 3 major maintenance events
• node design and packaging in harsh environment
  • -20 – 100 degrees, rain, wind
• power mgmt and interplay with sensors and environment
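The deployment figures above allow a rough yield estimate. The sketch assumes one measurement per node per minute for the whole 7/13 to 11/18 window, which the slide does not state explicitly, so the ratios are only indicative. The year is a placeholder, because the slide gives none and the day count is the same in any year.

```python
from datetime import date

YEAR = 2002  # placeholder; the slide gives only month and day

start = date(YEAR, 7, 13)
end = date(YEAR, 11, 18)
days = (end - start).days

# Upper bound if a node reported every minute for the whole window.
possible_per_node = days * 24 * 60
possible_fleet = 43 * possible_per_node

best_node_yield = 90_000 / possible_per_node  # "best nodes ~90,000"
fleet_yield_floor = 1_000_000 / possible_fleet  # ">1 million measurements"

print(f"{days} days, {possible_per_node} possible readings per node")
print(f"best node ~{best_node_yield:.0%}, fleet at least {fleet_yield_floor:.0%}")
```

On this reading even the best nodes delivered about half of the theoretical maximum, which fits the slide's emphasis on node lifetime, packaging, and power management.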
Sample Results
[Figure panels: Node Lifetime and Utility; Effective Communication Phase; Packet Loss Correlation]