This course provides an overview of distributed systems, grid computing, and cloud computing. It discusses the performance evolution of computer components and the need to invest in networks and data storage. The course also covers parallel systems and cluster computing.
An Introduction to Large Scale Computing
Lionel Brunie
National Institute of Applied Sciences (INSA)
LIRIS Laboratory / DRIM Team – UMR CNRS 5205
Lyon, France
http://liris.cnrs.fr/lionel.brunie
“A Brain is a Lot of Data!” (Mark Ellisman, UCSD)
And comparisons must be made among many brains!
Data Intensive Processing
• High energy & nuclear physics
• Simulation
• Earth observation, climate modeling
• Geophysics, earthquake modeling
• Fluids, aerodynamic design
• Pollutant dispersal scenarios
• Astronomy – digital sky surveys: modern telescopes produce over 10 Petabytes per year (up to 30 TB per day)!
• Molecular genomics
• Chemistry and biochemistry
• Financial applications
• Medical images
• …
Performance Evolution of Computer Components: the Table of the Three Laws
• Note: a special case of Wright's Law (1936) about the cost of airplanes?
• Moore's Law (processing) (1965)
  • the number of transistors on integrated circuits doubles every two years
  • corollary: computing power doubles every 18 months
  • seems to become less true…
• Kryder's Law (storage) (2005)
  • storage capacity doubles every 12 months
• Butters' Law (optical fiber)
  • the amount of data coming out of an optical fiber doubles every 9 months
• In 3 years (36 months): network ×16, storage ×8, processing ×4
• In 9 years (108 months): network ×4096, storage ×512, processing ×64
Moore's Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan. 2001) by Cleo Vilett; source: Vinod Khosla, Kleiner Perkins Caufield & Byers.
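The ×16 / ×8 / ×4 multipliers follow directly from the doubling periods. A minimal sketch of the arithmetic, in Python (doubling periods as quoted above):

    # Compound growth implied by the three laws (doubling periods from the slide).
    DOUBLING_MONTHS = {
        "processing (Moore)": 18,   # computing power doubles every 18 months
        "storage (Kryder)": 12,     # storage capacity doubles every 12 months
        "network (Butters)": 9,     # fiber throughput doubles every 9 months
    }

    def growth_factor(doubling_months: int, horizon_months: int) -> int:
        # Multiplicative factor after horizon_months of repeated doubling.
        return 2 ** (horizon_months // doubling_months)

    for horizon in (36, 108):  # 3 years and 9 years, as above
        print(horizon, {k: growth_factor(v, horizon) for k, v in DOUBLING_MONTHS.items()})
    # 36 months  -> processing x4, storage x8, network x16
    # 108 months -> processing x64, storage x512, network x4096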
Conclusion: Invest in networks (and data storage)!
Alternative Conclusion: future systems will be distributed (and clouds should play a role)
Contents of the Course
• A Quick Overview of Distributed System Models
• An Introduction to Grid Computing
• An Introduction to Cloud Computing
A Quick Overview of Distributed Systems
Hansel and Gretel are Lost in the Forest of Definitions
• Distributed system
• Parallel system
• Cluster computing
• Meta-computing
• Grid computing
• Peer-to-peer computing
• Global computing
• Internet computing
• Network computing
• Cloud computing
Distributed System
• N autonomous computers (sites): n administrators, n data/control flows
• an interconnection network
• User view: one single (virtual) system
• “A distributed system is a collection of independent computers that appear to the users of the system as a single computer” (Distributed Operating Systems, A. Tanenbaum, Prentice Hall, 1994)
• “Traditional” programmer view: client-server (see the sketch below)
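A minimal sketch of that client-server view, assuming a plain TCP echo service (the host name and port are illustrative):

    import socket

    def echo_server(host: str = "localhost", port: int = 5000) -> None:
        # Server side: accept one connection and echo the request back.
        with socket.create_server((host, port)) as srv:
            conn, _ = srv.accept()
            with conn:
                conn.sendall(conn.recv(1024))

    def client(host: str = "localhost", port: int = 5000) -> bytes:
        # Client side: sees one endpoint, however distributed the back end is.
        with socket.create_connection((host, port)) as s:
            s.sendall(b"hello")
            return s.recv(1024)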
Parallel System
• 1 computer, n nodes: one administrator, one scheduler, one power source
• memory: shared or distributed, depending on the architecture
• Programmer view: one single machine executing parallel codes. Various programming models (message passing, distributed shared memory, data parallelism…)
Examples of Parallel Systems
[diagram: a CC-NUMA architecture (CPUs and memories linked by an interconnection network) and a Shared-Nothing architecture (nodes, each with its own CPU, memory and devices, linked by a network)]
Cluster Computing
• Use of PCs interconnected by a (high-performance) network as a cheap parallel machine
• Two main approaches:
  • dedicated network (based on a high-performance interconnect: InfiniBand, Fibre Channel, Gigabit Ethernet…)
  • non-dedicated network (based on a (good) LAN)
High Performance Computing (HPC): Performance Evolution
• 1993 (prehistoric times!)
  • n°1: 59.7 GFlops
  • n°500: 0.4 GFlops
  • Sum = 1.17 TFlops
• 2004 (yesterday)
  • n°1: 70 TFlops (×1118)
  • n°500: 850 GFlops (×2125)
  • Sum = 1127 TFlops
High Performance Computing (HPC): Where are we Today?
• www.top500.org (#1 since June 2013)
  • n°1: 33.9 PFlops (Tianhe-2) (×567202)
  • n°500: 153 TFlops (×382500)
  • Sum = 309 PFlops
• See the historical panorama course
NEC Earth Simulator (1st in 2004; 30th in 2007)
• A MIMD with distributed memory
• Single-stage crossbar: 2700 km of cables
• 700 TB disk space; 1.6 PB mass storage
• Area: 4 tennis courts, 3 floors
2014 – The Tianhe-2 (Milky Way-2)
• Ranked 1st in the Top500 list of the most “powerful” (computing-intensive) computers (since June 2013)
• Ranked 6th in the Graph500 list of the most “powerful” (data-intensive processing) computers (June 2013)
• Ranked 32nd in the Green500 list of the most energy-efficient computers (June 2013), and 57th in December 2014
• China (National University of Defense Technology)
2014 – The Tianhe-2 (Milky Way-2)
• Rmax = 33862 TFlops (i.e., 33.9 PFlops) – Rpeak = 54902 TFlops (computing efficiency: 61.7%)
• 3,120,000 cores – Memory: 1.375 PB – Disk: 12.4 PB – fat-tree-based interconnection network
• 16,000 compute nodes; 1 node = 2 Intel Ivy Bridge Xeons (12 cores each) + 3 Xeon Phi co-processors (57 cores each) + 88 GB of memory shared by the Ivy Bridge processors + 8 GB of memory shared by the Xeon Phi chips
• Power: 17.8 MW (1.9 TFlops/kW, i.e., 1.9 GFlops/W… only!)
• “One hour of Tianhe-2 operation is equivalent to 1.3 billion people operating calculators for one thousand years” (best-news.us – assertion not checked)
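A quick sanity check of the efficiency figures above (a sketch; the inputs are the slide's own numbers):

    # Cross-checking the Tianhe-2 figures quoted above.
    rmax_tflops = 33_862        # sustained LINPACK performance (Rmax)
    rpeak_tflops = 54_902       # theoretical peak performance (Rpeak)
    power_mw = 17.8             # total power draw

    print(f"computing efficiency: {rmax_tflops / rpeak_tflops:.1%}")   # ~61.7%

    gflops = rmax_tflops * 1_000          # TFlops -> GFlops
    watts = power_mw * 1_000_000          # MW -> W
    print(f"energy efficiency: {gflops / watts:.2f} GFlops/W")         # ~1.90 GFlops/W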
Supercomputing… A Quick Look at the Web
• Top500.org
  • performance development: exponential progression! (×10 in 3 years)
  • clusters, clusters (86%)!
  • 51% in industry
  • max power efficiency: 5.3 GFlops/W
  • #500: 153 TFlops! – Total: 309 PFlops
  • Top500 poster
• Graph500.org
  • BlueGene++
• Green500.org and GreenGraph500
  • list
  • max: 5.3 GFlops/W
  • #1 Green500 = #168 Top500 (317–594 TFlops)
  • #1 Top500 = #57 Green500 (2 GFlops/W)
Network Computing
• From LAN (cluster) computing to WAN computing
• Set of machines distributed over a MAN/WAN that are used to execute loosely coupled parallel codes
• Depending on the infrastructure (software and hardware), network computing comes in several flavors: Internet computing, P2P, grid computing, etc.
Meta Computing (early 1990s)
• Definitions become fuzzy…
• A meta computer = a set of (widely) distributed (high-performance) processing resources that can be associated for processing a not-so-loosely-coupled parallel code
• A meta computer = a parallel virtual machine over a distributed system
[diagram: a visualization station, clusters of PCs on SANs/LANs, and a supercomputer, interconnected over a WAN]
Internet Computing
• Use of (idle) computers interconnected by the Internet for processing high-throughput applications
• Ex: SETI@home
  • 5M+ users since launch
  • 2013/10: 1.4M users, 3.5M computers; 135k active users, 190k active computers
  • 625 TFlops (average 505 TFlops)!
  • 233 “countries”
  • 2M years of CPU time since 1999
  • BOINC infrastructure (Décrypthon, RSA-155…)
  • much less active than it used to be (halved since 2011)
• Programmer view: a single master, n servants (see the sketch below)
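A minimal sketch of that “single master, n servants” model, with local processes standing in for volunteer machines (the analyze workload is an invented stand-in, not SETI@home's actual code; real BOINC clients pull work units over HTTP):

    from multiprocessing import Pool

    def analyze(work_unit: int) -> int:
        # Stand-in for a compute-heavy task, e.g. scanning one chunk of radio data.
        return sum(i * i for i in range(work_unit))

    if __name__ == "__main__":
        work_units = range(1, 1001)           # the master's queue of independent chunks
        with Pool(processes=4) as servants:   # n servants, here 4 local processes
            results = servants.map(analyze, work_units)
        print(f"master collected {len(results)} results")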
Internet Computing: Statistics
http://boincstats.com (Oct. 2013)
Global Computing
• Internet computing on a pool of sites
• Meta computing with loosely coupled codes
• Grid computing with poor communication facilities
• Ex: Condor (invented in the 1980s)
Peer to Peer Computing
• A site is both client and server: a “servent”
• Dynamic servent discovery by “contamination”
• 2 approaches:
  • centralized management: Napster, Kazaa, eDonkey…
  • distributed management: Gnutella, KAD, Freenet, BitTorrent…
• Applications: file sharing, video delivery, collaborative computing
Grid Computing (1)
“Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations” (I. Foster)
Grid Computing (2)
• Information grid
  • wide access to distributed data (the Web)
• Data grid
  • management and processing of very large distributed data sets – data-intensive computing
• Computing grid
  • a meta computer
An Introduction to Grid Computing
Parallelism vs Grids: a Historical Look
• Grids date back “only” to 1996
• Parallelism is older! (first classification in 1972)
• Motivations:
  • need for more computing power (weather forecasting, atomic simulation, genomics…)
  • need for more storage capacity (Petabytes and more)
  • in a word: improve performance! Three ways…
    • Work harder → use faster hardware
    • Work smarter → optimize algorithms
    • Get help → use more computers!
Parallelism vs Grids: CERN's Opinion
The answer is “money”… In 1999, the “LHC Computing Grid” was merely a concept on the drawing board for a computing system to store, process and analyse data produced by the Large Hadron Collider at CERN. However, when work began on the design of the computing system for LHC data analysis, it rapidly became clear that the required computing power was far beyond the funding capacity available at CERN. On the other hand, most of the laboratories and universities collaborating on the LHC had access to national or regional computing facilities. The obvious question was: could these facilities be somehow integrated to provide a single LHC computing service? The rapid evolution of wide-area networking – increasing capacity and bandwidth coupled with falling costs – made it look possible. From there, the path to the LHC Computing Grid was set.
Parallelism vs Grids: CERN's Opinion – Additional Benefits
• Multiple copies of data can be kept at different sites, ensuring access for all scientists involved, independent of geographical location
• Allows optimum use of spare capacity across multiple computer centres, making it more efficient
• Having computer centres in multiple time zones eases round-the-clock monitoring and the availability of expert support
• No single points of failure
• The cost of maintenance and upgrades is distributed, since individual institutes fund local computing resources and retain responsibility for them, while still contributing to the global goal
• Independently managed resources have encouraged novel approaches to computing and analysis
• So-called “brain drain”, where researchers are forced to leave their country to access resources, is reduced when resources are available from their desktop
• The system can easily be reconfigured to face new challenges, making it able to evolve dynamically throughout the life of the LHC, growing in capacity to meet the rising demands as more data is collected each year
• Provides considerable flexibility in deciding how and where to provide future computing resources
• Allows the community to take advantage of new technologies that may appear and that offer improved usability, cost effectiveness or energy efficiency
Applications of Grid Computing
• Distributed supercomputing
• High-throughput computing
• On-demand (real-time) computing
• Data-intensive computing
• Collaborative computing
An Introduction to Grid Computing: Grid Characteristics
Starting Point
• Real need for very high performance infrastructures
• Basic idea: share distributed computing resources
• “The sharing that the GRID is concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering” (I. Foster)
Is Using Many Computers Always Efficient? What about Performance? Ideally it Grows Linearly
• Speed-up:
  • if TS is the best time to process a problem sequentially, then the parallel processing time with P processors should ideally be TP = TS / P
  • speedup = TS / TP
  • the speedup is limited by Amdahl's law: any parallel program has a purely sequential part F and a parallelizable part T//, so TS = F + T//
  • thus the speedup is bounded: S = (F + T//) / (F + T// / P) < P, and S tends to (F + T//) / F as P grows (illustrated below)
• Scale-up:
  • if TPS is the time to solve a problem of size S with P processors, then TPS should also be the time to process a problem of size n·S with n·P processors
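A small numerical sketch of Amdahl's bound as stated above (F and T// in arbitrary time units; the 5%-sequential example is illustrative):

    # Amdahl's law: f = sequential time (F), t_par = parallelizable time (T//),
    # p = number of processors.

    def amdahl_speedup(f: float, t_par: float, p: int) -> float:
        # S = (F + T//) / (F + T// / P)
        return (f + t_par) / (f + t_par / p)

    # Illustrative job that is 5% sequential: F = 5, T// = 95.
    for p in (10, 100, 1000):
        print(f"P = {p:4d}: speedup = {amdahl_speedup(5, 95, p):5.2f}")
    # Prints ~6.90, ~16.81, ~19.63: the speedup saturates near
    # (F + T//) / F = 20 however many processors are added.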
An Example Virtual Organization: CERN's Large Hadron Collider
Worldwide LHC Computing Grid (WLCG)
• 8000 physicists, 170 sites, 34 countries
• 15 PB of data per year; 100,000 CPUs
LCG System Architecture
• A 4-layer computing model: Tier-0 (1), Tier-1 (11), Tier-2 (160), “Tier-3” (end users)
• Tier-0: CERN: accelerator
  • data acquisition and reconstruction
  • data distribution to Tier-1 (~online)
• Tier-1
  • 24x7 access and availability
  • quasi-online data acquisition
  • data service on the grid
  • “heavy” analysis of the data
  • ~10 countries
• Tier-2
  • simulation
  • final user, analysis of the data (batch and interactive modes)
  • ~40 countries
• Tier-3
  • final user, scientific analysis
• LHC
  • 40 million collisions per second
  • ~100 interesting collisions per second after filtering
  • 1–10 MB of data per collision
  • acquisition rate: 0.1 to 1 GB/s
  • 10^10 collisions recorded every year
  • ~10 PB/year (these rates are cross-checked in the sketch below)
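A quick sanity check of those rates (a sketch; all numbers are the ones quoted on the slide):

    # Cross-checking the LHC acquisition figures quoted above.
    interesting_per_sec = 100            # collisions kept after filtering
    mb_per_collision = (1, 10)           # 1 to 10 MB of data per collision

    low_gbps = interesting_per_sec * mb_per_collision[0] / 1000    # MB/s -> GB/s
    high_gbps = interesting_per_sec * mb_per_collision[1] / 1000
    print(f"acquisition rate: {low_gbps:.1f} to {high_gbps:.1f} GB/s")  # 0.1 to 1 GB/s

    collisions_per_year = 1e10
    pb_per_year = collisions_per_year * mb_per_collision[0] / 1e9       # MB -> PB
    print(f"yearly volume: ~{pb_per_year:.0f} PB/year")                  # ~10 PB/year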
LCG System Architecture (Cont'd)
• Tier-0: trigger and data acquisition system
• Tier-0 ↔ Tier-1: 10 Gbps links over an Optical Private Network (to almost all sites)
• Tier-1 ↔ Tier-2: general-purpose/academic/research networks
From F. Malek – LCG France
Computational Grid
• “Hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities” (I. Foster)
• Performance criteria:
  • security
  • reliability
  • computing power
  • latency
  • throughput
  • scalability
  • services
Grid Characteristics
• Large scale
• Heterogeneity
• Multiple administration domains
• Autonomy… and coordination
• Dynamicity
• Flexibility
• Extensibility / Scalability
• Security
Basic Services
• Authentication / Authorization / Traceability
• Activity control (monitoring)
• Resource discovery
• Resource brokering
• Scheduling
• Job submission, data access/migration and execution
• Accounting
An Introduction to Grid Computing: Grid Architecture and Components
Levels of Cooperation in a Computing Grid
• End system (computer, disk, sensor…)
  • multithreading, local I/O
• Cluster
  • synchronous communications, DSM, parallel I/O
  • parallel processing
• Intranet / Organization
  • heterogeneity, distributed administration, distributed FS and databases
  • load balancing
  • access control
• Internet / Grid
  • global supervision
  • brokers, negotiation, cooperation…
Layered Architecture of a Grid (by analogy with the Internet protocol architecture)
• Application
• Collective: “coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services
• Resource: “sharing single resources”: negotiating access, controlling use
• Connectivity: “talking to things”: communication (Internet protocols) & security
• Fabric: “controlling things locally”: access to, and control of, resources
(Internet analogy: Application ↔ Application, Connectivity ↔ Transport/Internet, Fabric ↔ Link)
From I. Foster
Resources
• Description
• Advertising
• Cataloging
• Matching
• Claiming
• Reserving
• Checkpointing
Resource Management
• Services and protocols depend on the infrastructure
• Some parameters:
  • stability of the infrastructure (same set of resources or not)
  • freshness of the resource availability information
  • reservation facilities
  • multiple-resource or single-resource brokering
• Example of a request: “I need from 10 to 100 CEs, each with at least 512 MB RAM and a computing power of 150 MFlops” (see the sketch below)
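A hypothetical sketch of how a broker might evaluate that request; the ComputeElement record and the matching logic are invented for illustration (real brokers use description languages such as Condor ClassAds or Globus RSL):

    from dataclasses import dataclass

    @dataclass
    class ComputeElement:       # hypothetical record advertised by each CE
        name: str
        ram_mb: int
        mflops: float

    def match(pool: list, n_min: int = 10, n_max: int = 100,
              ram_mb: int = 512, mflops: float = 150.0) -> list:
        # Keep CEs meeting both constraints; fail if fewer than n_min qualify.
        hits = [ce for ce in pool if ce.ram_mb >= ram_mb and ce.mflops >= mflops]
        if len(hits) < n_min:
            raise RuntimeError(f"only {len(hits)} CEs match, need at least {n_min}")
        return hits[:n_max]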
Resource Management and Scheduling (1)
• Levels of scheduling:
  • job scheduling (global level; perf: throughput)
  • resource scheduling (perf: fairness, utilization)
  • application scheduling (perf: response time, speedup, produced data…)
• Mapping/scheduling process:
  • resource discovery and selection
  • assignment of tasks to computing resources
  • data distribution
  • task scheduling on the computing resources
  • (communication scheduling)
Resource Management and Scheduling (2)
• Individual performances are not necessarily consistent with the global (system) performance!
• Grid problems:
  • predictions are not definitive: dynamicity!
  • heterogeneous platforms
  • checkpointing and migration
An Example of a Resource Management System (Globus)
[diagram] An application submits a request expressed in RSL (Resource Specification Language) to a broker; the broker specializes the RSL, using queries to an Information Service; a co-allocator decomposes it into simple “ground” RSL requests, each sent to a GRAM that fronts a local resource manager (LSF, Condor, NQE…).
NQE: Network Queuing Environment (batch management; developed by Cray Research)
LSF: Load Sharing Facility (task scheduling and load balancing; developed by Platform Computing)
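To make the flow concrete, a sketch of a simple ground RSL request, wrapped in Python for illustration; the attribute names (executable, count, maxMemory) follow classic GRAM RSL, but the values and the submit_to_gram helper are hypothetical:

    # Hypothetical illustration: a ground RSL request as a string. Attribute
    # names follow classic Globus GRAM RSL; the values are invented.
    ground_rsl = "&(executable=/bin/hostname)(count=10)(maxMemory=512)"

    def submit_to_gram(rsl: str) -> None:
        # Hypothetical helper: a real deployment would hand this string to a
        # GRAM gatekeeper, which forwards the job to LSF/Condor/NQE.
        print(f"submitting: {rsl}")

    submit_to_gram(ground_rsl)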
Resource Information (1)
• What is to be stored?
  • virtual organizations, people, computing resources, software packages, communication resources, event producers, devices…
  • what about data???
• A key issue in such dynamic environments
• Characteristics:
  • dynamicity
  • complex relationships
  • frequent updates
  • complex queries