Distributed Computing and Analysis
Lamberto Luminari
Italo-Hellenic School of Physics 2004
Martignano, May 20, 2004
Outline • Introduction • General remarks • Distributed computing • Principles • Projects • Computing facilities: testbeds and production infrastructures • Database Systems • Principles • Distributed analysis • Requirements and issues
General remarks • Schematic approach • For the sake of clarity, differences among possible alternatives are stressed: in reality, solutions are often a mix or a compromise • Only the main features of relevant items are described: no claim of exhaustiveness • HEP (LHC) oriented presentation • Examples are mainly taken from the HEP world • Projects with HEP community involvement are preferred • Options chosen by the LHC experiments are highlighted
Distributed computing • What is it: • processing of data and objects across a network of connected systems; • a hardware and software infrastructure that provides pervasive (and inexpensive) access to computational capabilities. • A long story: • mainframes became more and more expensive; • cluster technology matured; • RISC machines became very powerful. • What makes it appealing now: • CPU power! • Storage capacity!! • Network bandwidth!!! • ... but distributed computing is not a choice: rather, it is a necessity or an opportunity.
Network performance [figure slide]
Advantages of distributed computing • Scalability and flexibility: • in principle, distributed computing systems are infinitely scalable: simply add more units and get more computing power. Moreover, you can add or remove specific resources and adapt the system to your needs. • Efficiency: • private resources are usually poorly used: pooling them greatly increases their exploitation. • Reliability: • the failure of a single component has little effect on the overall performance. • Load balancing and averaging: • distributing tasks according to the availability of resources optimizes the behavior of the whole system and minimizes the execution time; • load peaks arising from different user communities rarely sum up, so the use of resources is averaged (and optimized) over long periods.
Disadvantages of distributed computing • Difficult integration and coordination: • many heterogeneous computing systems have to be integrated; • data sets are split over different storage systems; • many users have to cooperate and share resources. • Unpredictability: • the quantity of available resources may fluctuate widely; • computing units may become unavailable or unreachable, suddenly and for long periods, making the completion time of the tasks running there unpredictable. • Security problems: • distributed systems are prone to intrusion.
Applications and distributed computing • Suitable: • high compute-to-data ratio; • batch processes; • loosely coupled tasks; • statistical evaluations dependent on random trials (see the sketch below); • data mining through distributed file systems or databases. • Unsuitable: • real-time applications; • interactive processes; • strongly coupled tasks; • sequential algorithms.
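To make the "suitable" criteria concrete, here is a minimal sketch (my illustration, not from the original slides) of an embarrassingly parallel Monte Carlo estimate of pi: each worker runs independent random trials, and only small counts travel between processes, so the compute-to-data ratio is high and the tasks are loosely coupled.

```python
# Minimal sketch: embarrassingly parallel Monte Carlo estimate of pi.
# Each worker runs independent random trials; only integer counts are
# exchanged, so communication cost is negligible compared to compute.
import random
from multiprocessing import Pool

def count_hits(n_trials: int) -> int:
    """Count random points falling inside the unit quarter-circle."""
    rng = random.Random()  # independent random stream per worker
    hits = 0
    for _ in range(n_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    n_workers, trials_each = 4, 1_000_000
    with Pool(n_workers) as pool:
        hits = pool.map(count_hits, [trials_each] * n_workers)
    total = n_workers * trials_each
    print("pi ~", 4.0 * sum(hits) / total)
```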
Distributed computing architectures • Peer-to-peer: • flat organization of components, with similar functionalities, talking to each other; • suitable for: • independent tasks or tasks with little inter-task communication; • access to sparse data organized in a non-hierarchical way. • Client-server: • components with different functionalities and roles: • a processing unit (client) provided with a lightweight agent able to perform simple operations: detect the system status and notify it to the server, ask (or wait) for tasks, accept and send data, execute processes according to priorities or in spare cycles, ...; • a dedicated unit (server) provided with complex software able to: take or send computing requests, monitor the status of the jobs sent to the clients, receive the results and assemble them, possibly in a database. It also takes care of security and access policy, and stores statistics and accounting data. • suitable for: • complex architectures and tasks (a pull-style sketch follows below).
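A toy sketch of the pull-style client-server pattern just described (my illustration, using threads and in-process queues as stand-ins for networked machines): lightweight client agents ask the server for tasks, run them, and report results back for the server to assemble.

```python
# Toy pull-model client-server: the "server" holds a task queue,
# "clients" pull tasks when idle and push results back.
import queue
import threading

tasks = queue.Queue()      # held by the server
results = queue.Queue()    # results assembled by the server

def client(worker_id: int) -> None:
    """Lightweight agent: ask for tasks until the server has none left."""
    while True:
        try:
            n = tasks.get_nowait()          # "ask (or wait) for tasks"
        except queue.Empty:
            return                          # no work left: the agent exits
        results.put((worker_id, n, n * n))  # stand-in for real processing

for n in range(10):                         # server enqueues the work
    tasks.put(n)
workers = [threading.Thread(target=client, args=(i,)) for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
while not results.empty():                  # server assembles the results
    print(results.get())
```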
Multi-tier computing systems • Components with different levels of service, arranged in tiers: • computing centers (multi-processors, PC farms, data storage systems); • clusters of dedicated machines; • individual, general-use PCs. • Tiers differ in: • the amount of CPU power installed and data stored; • the quality and schedule of user support; • the level of reliability and security.
Distributed computing models • Clusters: • groups of homogeneous, tightly coupled components, sharing file systems and peripheral devices (e.g., Beowulf); • Pools of desktop PCs: • loosely interconnected private machines (e.g., Condor); • Grids: • heterogeneous systems of (mainly dedicated) resources (e.g., LCG).
Comparison of computing models [table slide]
Condor • Condor is a specialized workload management system for compute-intensive jobs. It provides: • a job queueing mechanism; • a scheduling policy; • a priority scheme; • resource monitoring; • resource management. • Users submit their serial or parallel jobs to Condor, which places them into a queue, chooses when and where to run them based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion (a sample submit file is sketched below). • Unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations: Condor is able to transparently produce a checkpoint and migrate a job to a different machine. • Condor does not require a shared file system across machines: if no shared file system is available, Condor can transfer the job's data files on behalf of the user, or it may be able to transparently redirect all the job's I/O requests back to the submit machine.
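As an illustration, a hypothetical Condor submit description file in the standard syntax (the executable and file names are invented); it would be handed to Condor with condor_submit:

```
# Hypothetical submit description file (invented names)
universe   = vanilla
executable = analyze_events
arguments  = run.$(Process).dat
output     = analyze.$(Process).out
error      = analyze.$(Process).err
log        = analyze.log
# No shared file system needed: Condor transfers the files itself
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# Submit 10 instances; $(Process) runs from 0 to 9
queue 10
```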
[Figure slide: resources, data, network]
Distributed computing environment • DCE standards: • A distributed computing network may include many different systems. The Distributed Computing Environment (DCE), formulated by The Open Group, formalizes the technologies needed to make the components communicate with each other, such as remote procedure calls and middleware. DCE runs on all major computing platforms and is designed to support distributed applications in heterogeneous hardware and software environments. • DCE provides a complete infrastructure, with services, interfaces, protocols and encoding rules for: • authentication and security (Kerberos, public key certificates); • object interoperability across different platforms (CORBA: Common Object Request Broker Architecture); • directories (with global names and cell names) for distributed resources; • time services (including synchronization); • distributed file systems; • Remote Procedure Call (illustrated below); • Internet/Intranet communications.
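DCE's own RPC is based on an interface definition language and C bindings; as a language-neutral stand-in for the concept, this sketch uses Python's built-in XML-RPC modules (the port and the add procedure are invented for illustration):

```python
# Remote Procedure Call in miniature: the call looks local, but the
# arguments are marshalled, sent over the network, executed remotely,
# and the result is sent back.
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def add(a: int, b: int) -> int:
    """The remote procedure: runs inside the server process."""
    return a + b

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

proxy = ServerProxy("http://localhost:8000/")
print(proxy.add(2, 3))  # -> 5, computed on the "remote" side
server.shutdown()
```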
Grid computing specifications • The Global Grid Forum (GGF) is the primary organization defining specifications for Grid computing. It is a forum for information exchange and collaboration among people who are: • doing Grid research, • designing and building Grid software, • deploying Grids, • using Grids, spanning technology areas such as scheduling, data handling and security. • The Globus Toolkit (developed at Argonne National Laboratory and the University of Southern California) is an implementation of these standards, and has become a de facto standard for grid middleware because of some attractive features: • an object-oriented approach, which allows developers of specific applications to take just what meets their needs, to introduce tools one at a time and to make programs increasingly "Grid-enabled"; • the toolkit software is open source, which allows developers to freely make and add improvements.
Globus Toolkit • Practically all major Grid projects are built on protocols and services provided by the Globus Toolkit, a software "work in progress" developed by the Globus Alliance, which involves primarily Ian Foster's team at Argonne National Laboratory and Carl Kesselman's team at the University of Southern California in Los Angeles.
Globus Toolkit • The toolkit provides a set of software tools to implement the basic services and capabilities required to construct a computational Grid, such as security, resource location, resource management and communications: • GRAM (Globus Resource Allocation Manager), to convert a request for resources into commands that local computers can understand; • GSI (Grid Security Infrastructure), to provide authentication of the user and work out that person's access rights; • MDS (Monitoring and Discovery Service), to collect information about resources (processing capacity, bandwidth, type of storage, etc.); • GRIS (Grid Resource Information Service), to query resources for their current configuration, capabilities and status; • GIIS (Grid Index Information Service), to coordinate arbitrary GRIS services; • GridFTP, to provide a high-performance, secure and robust data transfer mechanism; • the Replica Catalog, which allows other Globus tools to look up where on the Grid replicas of a given dataset can be found; • the Replica Management system, which ties together the Replica Catalog and GridFTP technologies, allowing applications to create and manage replicas of large datasets. A few typical command-line interactions are sketched below.
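For a flavour of how these services look to a user, here are a few classic Globus Toolkit 2 command-line calls (host names and paths are invented for illustration):

```
$ grid-proxy-init
# GSI: create a short-lived proxy credential from the user certificate

$ globus-job-run ce.example.org /bin/hostname
# GRAM: run a simple job on a remote resource

$ globus-url-copy gsiftp://se.example.org/data/run01.dat file:///tmp/run01.dat
# GridFTP: secure, high-performance file transfer
```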
OGSA: the future? [figure slide]
Grid projects … and many others! [figure slide]
Grid projects • UK – GRIPP • Netherlands – DutchGrid • Germany – UNICORE, Grid project • France – Grid funding approved • Italy – INFN Grid • Eire – Grid project • Switzerland – Network/Grid project • Hungary – DemoGrid • Norway, Sweden – NorduGrid • … • NASA Information Power Grid • DOE Science Grid • NSF National Virtual Observatory • NSF GriPhyN • DOE Particle Physics Data Grid • NSF TeraGrid • DOE ASCI Grid • DOE Earth Systems Grid • DARPA CoABS Grid • NEESGrid • DOH BIRN • NSF iVDGL • Grid2003 • … • DataGrid (CERN, ...) • EuroGrid (Unicore) • DataTag (CERN, …) • Astrophysical Virtual Observatory • GRIP (Globus/Unicore) • GRIA (industrial applications) • GridLab (Cactus Toolkit) • CrossGrid (infrastructure components) • EGSO (solar physics) • EGEE • …
Middleware projects relevant for HEP • EDG • European Data Grid (EU project) • EGEE • Enabling Grids for E-science in Europe (EU project) • Grid2003 • joint project of the U.S. Grid projects iVDGL, GriPhyN and PPDG, and the U.S. participants in the LHC experiments ATLAS and CMS.
LCG hierarchical information service [figure slide]
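The figure is not reproduced here. For flavour: the LCG information service of the time (GRIS/GIIS, later aggregated by a BDII) was LDAP-based and could be queried directly. A hypothetical query (host name invented; the attribute names follow the Glue schema):

```
$ ldapsearch -x -H ldap://bdii.example.org:2170 \
      -b "mds-vo-name=local,o=grid" \
      "(objectClass=GlueCE)" GlueCEUniqueID GlueCEStateFreeCPUs
```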
Replica management [figure slide]
Job submission steps (1) [figure slide]
Job submission steps (2) [figure slide]
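The two figures are not reproduced here. In the EDG/LCG middleware of the time, a job was described in the ClassAd-based Job Description Language (JDL) and handed to the Resource Broker. A hypothetical example (the script and file names are invented):

```
Executable    = "/bin/sh";
Arguments     = "analyze.sh";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {"analyze.sh"};
OutputSandbox = {"std.out", "std.err"};
```

Submission, monitoring and output retrieval then went through the command-line tools edg-job-submit, edg-job-status and edg-job-get-output.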
Portals • Why a portal? • It can be accessed from everywhere and by "everything" (desktop, laptop, PDA, phone). • It can keep the same user interface independently of the underlying middleware. • It must be redundantly "secure" at all levels: • secure for web transactions, • secure for user credentials, • secure for user authentication, • secure at the VO level. • All available grid services must be incorporated in a logical way, just "one mouse click away". • Its layout must be easily understandable and user-friendly.
Computing facilities (1) • Computing facilities (testbeds or production infrastructures) are made up of one or more nodes. Each node (computer center or cluster of resources) contains a certain number of components, which may play different roles. Some are site-specific: • Computing Element: receives job requests and delivers them to the Worker Nodes, which perform the real work. The Computing Element provides an interface to the local batch queuing systems and can manage one or more Worker Nodes. • Worker Node: the machine that actually processes data, typically managed via a local batch system. A Worker Node can also be installed on the same machine as the Computing Element. • Storage Element: provides storage space to the facility. The Storage Element may control large disk arrays, mass storage systems and the like; however, the SE interface hides the differences between these systems, allowing uniform user access. • User Interface: the machine that allows users to access the facility. This is typically the machine the end user logs into to submit jobs to the grid and to retrieve their output.
Computing facilities (2) • Some other roles are shared by groups of users or by the whole grid: • Resource Broker: receives users' requests and queries the Information Index to find suitable resources. • Information Index: resides on the same machine as the Resource Broker and keeps information about the available resources. • Replica Manager: coordinates file replication from one Storage Element to another. Useful for data redundancy, but also to move data closer to the machines which will perform the computation. • Replica Catalog: can reside on the same machine as the Replica Manager and keeps information about file replicas. A logical file can be associated with one or more physical files which are replicas of the same data; thus a logical file name can refer to one or more physical file names (see the sketch below).
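A toy sketch of the logical-to-physical mapping just described (the catalog contents and site names are invented, and a real replica catalog is a service, not an in-memory dictionary):

```python
# One logical file name (LFN) maps to several physical replicas (PFNs).
replica_catalog = {
    "lfn:dc1.002000.evgen.0001.root": [
        "sfn://se01.cern.ch/data/dc1.002000.evgen.0001.root",
        "sfn://se.mi.infn.it/atlas/dc1.002000.evgen.0001.root",
    ],
}

def best_replica(lfn: str, preferred_domain: str) -> str:
    """Pick a replica 'close' to the computation (here: by domain name)."""
    replicas = replica_catalog[lfn]
    for pfn in replicas:
        if preferred_domain in pfn:
            return pfn
    return replicas[0]  # fall back to any available replica

print(best_replica("lfn:dc1.002000.evgen.0001.root", "infn.it"))
```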
Computing facilities relevant for HEP • EDG • Testbed • LCG • Production infrastructure • EGEE • Production infrastructure • Grid3 • Production infrastructure operated jointly by the U.S. Grid projects iVDGL, GriPhyN and PPDG, and the U.S. participants in the LHC experiments ATLAS and CMS.
LCG hybrid architecture: multi-tier hierarchy + Grids [figure slide]
EGEE Timeline • May 2003: proposal submitted • July 2003: proposal accepted • April 2004: project start
Grid3 infrastructure [figure slide]
Virtual Organizations (user communities) [figure slide, credit: I. Foster]
Multiple VOs on one Grid (shared resources and services) [figure slide]
One VO across multiple Grids: the ATLAS Production System [figure slide]