Distributed Computing and Analysis
Lamberto Luminari
Italo-Hellenic School of Physics 2004
Martignano, May 20, 2004
Outline • Introduction • General remarks • Distributed computing • Principles • Projects • Computing facilities: testbeds and production infrastructures • Database Systems • Principles • Distributed analysis • Requirements and issues
General remarks • Schematic approach • For the sake of clarity, differences among possible alternatives are stressed: in reality, solutions are often a mix or a compromise • Only the main features of relevant items are described: no claim of exhaustiveness • HEP (LHC) oriented presentation • Examples are mainly taken from the HEP world • Projects with HEP community involvement are preferred • Options chosen by the LHC experiments are highlighted
Distributed computing • What is it: • processing of data and objects across a network of connected systems; • a hardware and software infrastructure that provides pervasive (and inexpensive) access to computational capabilities. • A long story: • mainframes became more and more expensive; • cluster technology matured; • RISC machines became very powerful. • What makes it appealing now: • CPU power! • Storage capacity!! • Network bandwidth!!! • ... but distributed computing is not a choice: rather, it is a necessity or an opportunity.
Network performance [figure slide]
Advantages of distributed computing • Scalability and flexibility: • in principle, distributed computing systems are infinitely scalable: simply add more units and get more computing power. Moreover, you can add or remove specific resources and adapt the system to your needs. • Efficiency: • private resources are usually poorly used: pooling them greatly increases their exploitation. • Reliability: • the failure of a single component has little effect on the overall performance. • Load balancing and averaging: • distributing tasks according to the availability of resources optimizes the behavior of the whole system and minimizes the execution time; • load peaks arising from different user communities rarely sum up, so the use of resources is averaged (and optimized) over long periods.
Disadvantages of distributed computing • Difficult integration and coordination: • many heterogeneous computing systems have to be integrated; • data sets are split over different storage systems; • many users have to cooperate and share resources. • Unpredictability: • the quantity of available resources may fluctuate widely; • computing units may become unavailable or unreachable, suddenly and for long periods, making the completion time of the tasks running there unpredictable. • Security problems: • distributed systems are prone to intrusion.
Applications and distributed computing • Suitable: • high compute-to-data ratio; • batch processes; • loosely coupled tasks; • statistical evaluations dependent on random trials (see the sketch below); • data mining through distributed file systems or databases. • Unsuitable: • real-time applications; • interactive processes; • strongly coupled tasks; • sequential algorithms.
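To make the "suitable" criteria concrete, here is a minimal sketch (my illustration, not from the original slides) of an embarrassingly parallel Monte Carlo estimate of pi: each worker runs independent random trials, and only small counts travel between processes, so the compute-to-data ratio is high and the tasks are loosely coupled.

```python
# Minimal sketch: embarrassingly parallel Monte Carlo estimate of pi.
# Each worker runs independent random trials; only integer counts are
# exchanged, so communication cost is negligible compared to compute.
import random
from multiprocessing import Pool

def count_hits(n_trials: int) -> int:
    """Count random points falling inside the unit quarter-circle."""
    rng = random.Random()  # independent random stream per worker
    hits = 0
    for _ in range(n_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    n_workers, trials_each = 4, 1_000_000
    with Pool(n_workers) as pool:
        hits = pool.map(count_hits, [trials_each] * n_workers)
    total = n_workers * trials_each
    print("pi ~", 4.0 * sum(hits) / total)
```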
Distributed computing architectures • Peer-to-peer: • flat organization of components, with similar functionalities, talking to each other; • suitable for: • independent tasks or tasks with little inter-task communication; • access to sparse data organized in a non-hierarchical way. • Client-server: • components with different functionalities and roles: • a processing unit (client) provided with a lightweight agent able to perform simple operations: detect the system status and notify it to the server, ask (or wait) for tasks, accept and send data, execute processes according to priorities or in spare cycles, ...; • a dedicated unit (server) provided with complex software able to: take or send computing requests, monitor the status of the jobs sent to the clients, receive the results and assemble them, possibly in a database. It also takes care of security and access policy, and stores statistics and accounting data. • suitable for: • complex architectures and tasks (a pull-style sketch follows below).
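A toy sketch of the pull-style client-server pattern just described (my illustration, using threads and in-process queues as stand-ins for networked machines): lightweight client agents ask the server for tasks, run them, and report results back for the server to assemble.

```python
# Toy pull-model client-server: the "server" holds a task queue,
# "clients" pull tasks when idle and push results back.
import queue
import threading

tasks = queue.Queue()      # held by the server
results = queue.Queue()    # results assembled by the server

def client(worker_id: int) -> None:
    """Lightweight agent: ask for tasks until the server has none left."""
    while True:
        try:
            n = tasks.get_nowait()          # "ask (or wait) for tasks"
        except queue.Empty:
            return                          # no work left: the agent exits
        results.put((worker_id, n, n * n))  # stand-in for real processing

for n in range(10):                         # server enqueues the work
    tasks.put(n)
workers = [threading.Thread(target=client, args=(i,)) for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
while not results.empty():                  # server assembles the results
    print(results.get())
```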
Multi-tier computing systems • Components with different levels of service, arranged in tiers: • computing centers (multi-processors, PC farms, data storage systems); • clusters of dedicated machines; • individual, general-use PCs. • Tiers differ in: • the amount of CPU power installed and data stored; • the quality and schedule of user support; • the level of reliability and security.
Distributed computing models • Clusters: • groups of homogeneous, tightly coupled components, sharing file systems and peripheral devices (e.g., Beowulf); • Pools of desktop PCs: • loosely interconnected private machines (e.g., Condor); • Grids: • heterogeneous systems of (mainly dedicated) resources (e.g., LCG).
Comparison of computing models [table slide]
Condor • Condor is a specialized workload management system for compute-intensive jobs. It provides: • a job queueing mechanism; • a scheduling policy; • a priority scheme; • resource monitoring; • resource management. • Users submit their serial or parallel jobs to Condor, which places them into a queue, chooses when and where to run them based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion (a sample submit file is sketched below). • Unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations: Condor is able to transparently produce a checkpoint and migrate a job to a different machine. • Condor does not require a shared file system across machines: if no shared file system is available, Condor can transfer the job's data files on behalf of the user, or it may be able to transparently redirect all the job's I/O requests back to the submit machine.
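As an illustration, a hypothetical Condor submit description file in the standard syntax (the executable and file names are invented); it would be handed to Condor with condor_submit:

```
# Hypothetical submit description file (invented names)
universe   = vanilla
executable = analyze_events
arguments  = run.$(Process).dat
output     = analyze.$(Process).out
error      = analyze.$(Process).err
log        = analyze.log
# No shared file system needed: Condor transfers the files itself
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# Submit 10 instances; $(Process) runs from 0 to 9
queue 10
```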
[Figure slide: resources, data, network]
Distributed computing environment • DCE standards: • A distributed computing network may include many different systems. The Distributed Computing Environment (DCE), formulated by The Open Group, formalizes the technologies needed to make the components communicate with each other, such as remote procedure calls and middleware. DCE runs on all major computing platforms and is designed to support distributed applications in heterogeneous hardware and software environments. • DCE provides a complete infrastructure, with services, interfaces, protocols and encoding rules for: • authentication and security (Kerberos, public key certificates); • object interoperability across different platforms (CORBA: Common Object Request Broker Architecture); • directories (with global names and cell names) for distributed resources; • time services (including synchronization); • distributed file systems; • Remote Procedure Call (illustrated below); • Internet/Intranet communications.
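DCE's own RPC is based on an interface definition language and C bindings; as a language-neutral stand-in for the concept, this sketch uses Python's built-in XML-RPC modules (the port and the add procedure are invented for illustration):

```python
# Remote Procedure Call in miniature: the call looks local, but the
# arguments are marshalled, sent over the network, executed remotely,
# and the result is sent back.
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def add(a: int, b: int) -> int:
    """The remote procedure: runs inside the server process."""
    return a + b

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

proxy = ServerProxy("http://localhost:8000/")
print(proxy.add(2, 3))  # -> 5, computed on the "remote" side
server.shutdown()
```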
Grid computing specifications • The Global Grid Forum (GGF) is the primary organization defining specifications for Grid computing. It is a forum for information exchange and collaboration among people who are: • doing Grid research, • designing and building Grid software, • deploying Grids, • using Grids, spanning technology areas such as scheduling, data handling and security. • The Globus Toolkit (developed at Argonne National Laboratory and the University of Southern California) is an implementation of these standards, and has become a de facto standard for grid middleware because of some attractive features: • an object-oriented approach, which allows developers of specific applications to take just what meets their needs, to introduce tools one at a time and to make programs increasingly "Grid-enabled"; • the toolkit software is open source, which allows developers to freely make and add improvements.
Globus Toolkit • Practically all major Grid projects are built on protocols and services provided by the Globus Toolkit, a software "work in progress" developed by the Globus Alliance, which involves primarily Ian Foster's team at Argonne National Laboratory and Carl Kesselman's team at the University of Southern California in Los Angeles.
Globus Toolkit • The toolkit provides a set of software tools to implement the basic services and capabilities required to construct a computational Grid, such as security, resource location, resource management and communications: • GRAM (Globus Resource Allocation Manager), to convert a request for resources into commands that local computers can understand; • GSI (Grid Security Infrastructure), to provide authentication of the user and work out that person's access rights; • MDS (Monitoring and Discovery Service), to collect information about resources (processing capacity, bandwidth, type of storage, etc.); • GRIS (Grid Resource Information Service), to query resources for their current configuration, capabilities and status; • GIIS (Grid Index Information Service), to coordinate arbitrary GRIS services; • GridFTP, to provide a high-performance, secure and robust data transfer mechanism; • the Replica Catalog, which allows other Globus tools to look up where on the Grid replicas of a given dataset can be found; • the Replica Management system, which ties together the Replica Catalog and GridFTP technologies, allowing applications to create and manage replicas of large datasets. A few typical command-line interactions are sketched below.
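For a flavour of how these services look to a user, here are a few classic Globus Toolkit 2 command-line calls (host names and paths are invented for illustration):

```
$ grid-proxy-init
# GSI: create a short-lived proxy credential from the user certificate

$ globus-job-run ce.example.org /bin/hostname
# GRAM: run a simple job on a remote resource

$ globus-url-copy gsiftp://se.example.org/data/run01.dat file:///tmp/run01.dat
# GridFTP: secure, high-performance file transfer
```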
OGSA: the future? [figure slide]
Grid projects … and many others! [figure slide]
Grid projects • UK – GRIPP • Netherlands – DutchGrid • Germany – UNICORE, Grid project • France – Grid funding approved • Italy – INFN Grid • Eire – Grid project • Switzerland – Network/Grid project • Hungary – DemoGrid • Norway, Sweden – NorduGrid • … • NASA Information Power Grid • DOE Science Grid • NSF National Virtual Observatory • NSF GriPhyN • DOE Particle Physics Data Grid • NSF TeraGrid • DOE ASCI Grid • DOE Earth Systems Grid • DARPA CoABS Grid • NEESGrid • DOH BIRN • NSF iVDGL • Grid2003 • … • DataGrid (CERN, ...) • EuroGrid (Unicore) • DataTag (CERN, …) • Astrophysical Virtual Observatory • GRIP (Globus/Unicore) • GRIA (industrial applications) • GridLab (Cactus Toolkit) • CrossGrid (infrastructure components) • EGSO (solar physics) • EGEE • …
Middleware projects relevant for HEP • EDG • European Data Grid (EU project) • EGEE • Enabling Grids for E-science in Europe (EU project) • Grid2003 • joint project of the U.S. Grid projects iVDGL, GriPhyN and PPDG, and the U.S. participants in the LHC experiments ATLAS and CMS.
LCG hierarchical information service [figure slide]
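The figure is not reproduced here. For flavour: the LCG information service of the time (GRIS/GIIS, later aggregated by a BDII) was LDAP-based and could be queried directly. A hypothetical query (host name invented; the attribute names follow the Glue schema):

```
$ ldapsearch -x -H ldap://bdii.example.org:2170 \
      -b "mds-vo-name=local,o=grid" \
      "(objectClass=GlueCE)" GlueCEUniqueID GlueCEStateFreeCPUs
```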
Replica management [figure slide]
Job submission steps (1) [figure slide]
Job submission steps (2) [figure slide]
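The two figures are not reproduced here. In the EDG/LCG middleware of the time, a job was described in the ClassAd-based Job Description Language (JDL) and handed to the Resource Broker. A hypothetical example (the script and file names are invented):

```
Executable    = "/bin/sh";
Arguments     = "analyze.sh";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {"analyze.sh"};
OutputSandbox = {"std.out", "std.err"};
```

Submission, monitoring and output retrieval then went through the command-line tools edg-job-submit, edg-job-status and edg-job-get-output.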
Portals • Why a portal? • It can be accessed from everywhere and by "everything" (desktop, laptop, PDA, phone). • It can keep the same user interface independently of the underlying middleware. • It must be redundantly "secure" at all levels: • secure for web transactions, • secure for user credentials, • secure for user authentication, • secure at the VO level. • All available grid services must be incorporated in a logical way, just "one mouse click away". • Its layout must be easily understandable and user-friendly.
Computing facilities (1) • Computing facilities (testbeds or production infrastructures) are made up of one or more nodes. Each node (computer center or cluster of resources) contains a certain number of components, which may play different roles. Some are site-specific: • Computing Element: receives job requests and delivers them to the Worker Nodes, which perform the real work. The Computing Element provides an interface to the local batch queuing systems and can manage one or more Worker Nodes. • Worker Node: the machine that actually processes data, typically managed via a local batch system. A Worker Node can also be installed on the same machine as the Computing Element. • Storage Element: provides storage space to the facility. The Storage Element may control large disk arrays, mass storage systems and the like; however, the SE interface hides the differences between these systems, allowing uniform user access. • User Interface: the machine that allows users to access the facility. This is typically the machine the end user logs into to submit jobs to the grid and to retrieve their output.
Computing facilities (2) • Some other roles are shared by groups of users or by the whole grid: • Resource Broker: receives users' requests and queries the Information Index to find suitable resources. • Information Index: resides on the same machine as the Resource Broker and keeps information about the available resources. • Replica Manager: coordinates file replication from one Storage Element to another. Useful for data redundancy, but also to move data closer to the machines which will perform the computation. • Replica Catalog: can reside on the same machine as the Replica Manager and keeps information about file replicas. A logical file can be associated with one or more physical files which are replicas of the same data; thus a logical file name can refer to one or more physical file names (see the sketch below).
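A toy sketch of the logical-to-physical mapping just described (the catalog contents and site names are invented, and a real replica catalog is a service, not an in-memory dictionary):

```python
# One logical file name (LFN) maps to several physical replicas (PFNs).
replica_catalog = {
    "lfn:dc1.002000.evgen.0001.root": [
        "sfn://se01.cern.ch/data/dc1.002000.evgen.0001.root",
        "sfn://se.mi.infn.it/atlas/dc1.002000.evgen.0001.root",
    ],
}

def best_replica(lfn: str, preferred_domain: str) -> str:
    """Pick a replica 'close' to the computation (here: by domain name)."""
    replicas = replica_catalog[lfn]
    for pfn in replicas:
        if preferred_domain in pfn:
            return pfn
    return replicas[0]  # fall back to any available replica

print(best_replica("lfn:dc1.002000.evgen.0001.root", "infn.it"))
```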
Computing facilities relevant for HEP • EDG • Testbed • LCG • Production infrastructure • EGEE • Production infrastructure • Grid3 • Production infrastructure operated jointly by the U.S. Grid projects iVDGL, GriPhyN and PPDG, and the U.S. participants in the LHC experiments ATLAS and CMS.
LCG hybrid architecture: multi-tier hierarchy + Grids [figure slide]
EGEE Timeline • May 2003: proposal submitted • July 2003: proposal accepted • April 2004: project start
Grid3 infrastructure [figure slide]
Virtual Organizations (user communities) [figure slide, credit: I. Foster]
Multiple VOs on one Grid (shared resources and services) [figure slide]
One VO across multiple Grids: the ATLAS Production System [figure slide]