Distributed Systems: Message Passing, Clusters, and Implementation of Clusters in Representative Operating Systems
CS-550 (M. Soneru): Distributed Systems – Message Passing, Clusters, and Comparative OSs [Sta'01]
Distributed message passing
• Communication and synchronization mechanisms in distributed systems
  • Distributed message passing
  • Remote procedure call
• An implementation approach for message passing
  • Use the services of a message-passing module
  • Service is requested in the form of primitives and parameters
Distributed message passing (cont.)
• Send primitive
  • Parameters
    • Destination process identifier
    • The message contents
  • Operation
    • The sending process invokes the 'Send' primitive with (destination, message contents)
    • The message-passing module constructs a data unit from the destination and the contents
    • The data unit is sent to the destination machine using a communication facility (e.g., TCP/IP)
    • The data unit is received on the destination machine and routed by the communication facility to its message-passing module
    • The message-passing module stores the message in the buffer for the destination process
• Receive primitive
  • Operation
    • The destination process assigns a buffer area for messages and issues the 'Receive' primitive to the message-passing module
    • Alternatively, the message-passing module signals the destination process with a 'Receive' signal and places the message in a shared buffer
• A minimal sketch of these two primitives follows this slide
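The sketch below illustrates the Send/Receive flow described above. It is not a specific system's API: the names (MessagePassingModule, send, receive) are assumed for illustration, the "communication facility" is plain TCP, and incoming messages are stored in per-process buffers.

```python
# Minimal sketch of a message-passing module (illustrative, not production code).
# Assumed names: MessagePassingModule, send, receive -- not from the original slides.
import socket, struct, pickle, threading, queue

class MessagePassingModule:
    def __init__(self, host="0.0.0.0", port=5000):
        self._buffers = {}                       # per-process message buffers
        self._lock = threading.Lock()
        self._srv = socket.socket()
        self._srv.bind((host, port))
        self._srv.listen()
        threading.Thread(target=self._accept_loop, daemon=True).start()

    def send(self, destination, message):
        """'Send' primitive: destination = (host, port, pid), message = contents."""
        host, port, pid = destination
        data = pickle.dumps((pid, message))      # data unit: destination pid + contents
        with socket.create_connection((host, port)) as s:
            s.sendall(struct.pack("!I", len(data)) + data)

    def receive(self, pid, block=True):
        """'Receive' primitive: return the next message buffered for process `pid`."""
        return self._buffer_for(pid).get(block=block)

    def _buffer_for(self, pid):
        with self._lock:
            return self._buffers.setdefault(pid, queue.Queue())

    def _accept_loop(self):
        while True:
            conn, _ = self._srv.accept()
            with conn:
                size = struct.unpack("!I", self._recv_exact(conn, 4))[0]
                pid, message = pickle.loads(self._recv_exact(conn, size))
                self._buffer_for(pid).put(message)   # store in destination buffer

    @staticmethod
    def _recv_exact(conn, n):
        buf = b""
        while len(buf) < n:
            chunk = conn.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed connection")
            buf += chunk
        return buf
```

A process would call receive(my_pid) to take the next message from its buffer; with block=True the call suspends until a message has been stored, which foreshadows the blocking semantics discussed two slides later.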
Distributed message passing (cont.)
• Design issues:
  • Reliability vs. unreliability
  • Blocking vs. non-blocking
• Reliability vs. unreliability
  • Reliable message passing
    • Guarantees delivery if at all possible
    • Uses a reliable transport protocol
    • Performs error checking, acknowledgment, retransmission, and reordering of messages delivered out of sequence
    • An acknowledgment informs the sending process that delivery either succeeded or failed (e.g., network failure)
  • Unreliable message passing
    • The message-passing facility sends the message without reporting success or failure
    • The facility has a simple design and low overhead
    • Applications may use 'Request' and 'Reply' messages to confirm delivery themselves
  • A sketch of acknowledgment and retransmission follows this slide
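A minimal sketch of the reliability mechanisms listed above (sequence numbers, acknowledgment, retransmission). The channel, function names, and loss model are invented for illustration; a real facility would typically rely on a reliable transport protocol such as TCP rather than implementing this by hand.

```python
# Illustrative stop-and-wait reliability: sequence numbers, acknowledgments, and
# retransmission over a simulated lossy channel (all names here are assumed).
import random, queue

class UnreliableChannel:
    """Toy channel that randomly drops messages."""
    def __init__(self, loss=0.3):
        self.loss, self.q = loss, queue.Queue()
    def send(self, msg):
        if random.random() >= self.loss:
            self.q.put(msg)
    def recv(self):
        try:
            return self.q.get_nowait()
        except queue.Empty:
            return None

def receiver_step(data_ch, ack_ch, delivered, expected):
    """Deliver in-order messages exactly once and acknowledge every arrival."""
    msg = data_ch.recv()
    if msg is not None:
        seq, payload = msg
        ack_ch.send(seq)                     # acknowledge duplicates as well
        if seq == expected:                  # discard duplicates / out-of-order data
            delivered.append(payload)
            expected += 1
    return expected

def demo():
    data_ch, ack_ch = UnreliableChannel(), UnreliableChannel()
    delivered, expected = [], 0
    for seq, payload in enumerate(["a", "b", "c"]):
        acked = False
        for _ in range(50):                  # retransmit until acknowledged
            data_ch.send((seq, payload))
            expected = receiver_step(data_ch, ack_ch, delivered, expected)
            while (ack := ack_ch.recv()) is not None:
                acked = acked or ack == seq
            if acked:
                break
        if not acked:                        # reliable service reports the failure
            print("delivery failed for", seq)
    print("delivered:", delivered)

demo()
```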
Distributed message passing (cont.)
• Blocking vs. non-blocking
  • Blocking or synchronous primitives
    • A blocking 'Send' does not return control to the sending process (the process is suspended) until
      • The message has been transmitted (unreliable service), or
      • The message has been sent and an acknowledgment received (reliable service)
    • A blocking 'Receive' does not return control to the receiving process until
      • The message has been placed in the allocated buffer
Distributed message passing (cont.)
• Blocking vs. non-blocking (cont.)
  • Non-blocking or asynchronous primitives
    • 'Send' primitive does not suspend the process
      • Control returns to the process as soon as the message has been queued for transmission or a copy has been made
      • After the message has been transmitted, or copied to a safe place for later transmission, the sending process is interrupted to be informed that the message buffer is available again
    • 'Receive' primitive does not suspend the process
      • The process is sent an interrupt upon message arrival, or it can poll periodically for messages
    • Advantages/disadvantages
      • Efficient use of the message-passing mechanism
      • Difficult to test and debug: time-dependent sequences can lead to obscure bugs
  • A sketch contrasting the two styles follows this slide
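A small sketch of the difference between blocking and non-blocking primitives, with in-process queues standing in for the message-passing module's buffers; the function names are illustrative, not a real API.

```python
# Blocking vs. non-blocking primitives, sketched with in-process queues standing in
# for the message-passing module's buffers (names are illustrative, not a real API).
import queue

outgoing = queue.Queue()   # messages queued for later transmission by the module
incoming = queue.Queue()   # messages the module has delivered for this process

def send_nonblocking(destination, message):
    """Returns as soon as a copy of the message is queued for transmission."""
    outgoing.put((destination, message))

def receive_blocking():
    """Suspends the caller until a message has been placed in its buffer."""
    return incoming.get(block=True)

def receive_nonblocking():
    """Returns immediately; the caller may poll periodically for messages."""
    try:
        return incoming.get_nowait()
    except queue.Empty:
        return None        # nothing has arrived yet; try again later
```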
Remote procedure calls
• Provide access to remote services through simple procedure call/return semantics, similar to those used for local services
• Advantages
  • The procedure call is a widely used and well-understood abstraction
  • Remote interfaces can be specified and clearly documented as a set of named operations with designated types
  • The interface is standardized: the communication code for an application can be generated automatically
  • Client/server modules can be easily ported between different OSs and target systems
• Example of a procedure call in the calling program
  CALL P (X, Y)
  where
    P = procedure name
    X = passed arguments
    Y = returned values
Remote procedure calls (cont.)
• Dummy or stub procedure on the local machine
  • Included in the caller's address space or dynamically linked at call time
  • Creates a message identifying the remote procedure and including the parameters
  • Sends the message to the remote system and waits for the reply
  • When the reply arrives, returns to the calling program, providing the returned values
• Dummy or stub procedure on the remote machine
  • Upon receiving the message, generates a local CALL P (X, Y)
  • Returns the reply
• A minimal sketch of the two stubs follows this slide
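The sketch below mirrors the two stubs just described. The transport is reduced to a direct function call so the example is self-contained; in a real system the marshalled request and reply would travel over the network, and the names (client_stub, server_stub, transport) are invented for illustration.

```python
# Minimal sketch of RPC stubs (hypothetical names; the transport is a local function
# call so the example runs on one machine).
import pickle

def P(x):
    """The remote procedure: CALL P(X, Y) -- here Y is returned rather than passed."""
    return [v * 2 for v in x]

REMOTE_PROCEDURES = {"P": P}

def client_stub(proc_name, args):
    request = pickle.dumps((proc_name, args))     # marshal procedure id + parameters
    reply = transport(request)                    # send the message, wait for the reply
    return pickle.loads(reply)                    # unmarshal the returned values

def server_stub(request):
    proc_name, args = pickle.loads(request)       # unmarshal the incoming message
    result = REMOTE_PROCEDURES[proc_name](args)   # generate a local CALL P(X, Y)
    return pickle.dumps(result)                   # marshal the returned values

def transport(request):
    """Stand-in for the communication facility between the two machines."""
    return server_stub(request)

print(client_stub("P", [1, 2, 3]))                # the caller sees plain call/return
```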
Remote procedure calls (cont.)
• Design issues
  • Parameter passing
    • Call by value (parameters passed as values)
      • Parameters are copied into the message and sent to the remote system
      • Easy to implement for RPCs
    • Call by reference (pointers to a location that contains the value)
      • More difficult to implement for RPCs
  • Parameter and result representation
    • No problem if the calling and called programs use the same language and run on the same type of OS and machine
    • If there are differences, the RPC mechanism must provide the conversion: a standardized format for common objects (e.g., integers, characters); see the sketch after this slide
  • Client/server binding
    • A binding is established once the two applications have made a logical connection and are ready to exchange commands and data
    • Non-persistent binding: the logical connection is established at the time of the RPC and torn down after the values are returned
    • Persistent binding: the connection set up for an RPC remains up after the call returns
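A sketch of one possible standardized external representation: big-endian ("network byte order") integers and length-prefixed UTF-8 strings. The format and function names are assumptions chosen for illustration, not an existing standard encoding such as XDR or CORBA CDR.

```python
# Sketch of a standardized external representation for parameters, so that caller and
# callee need not share a machine word size or character encoding (format is assumed).
import struct

def marshal(values):
    out = b""
    for v in values:
        if isinstance(v, int):
            out += b"i" + struct.pack("!q", v)                 # 8-byte big-endian int
        else:
            data = v.encode("utf-8")
            out += b"s" + struct.pack("!I", len(data)) + data  # length-prefixed string
    return out

def unmarshal(buf):
    values, i = [], 0
    while i < len(buf):
        tag, i = buf[i:i+1], i + 1
        if tag == b"i":
            values.append(struct.unpack("!q", buf[i:i+8])[0]); i += 8
        else:
            (n,), i = struct.unpack("!I", buf[i:i+4]), i + 4
            values.append(buf[i:i+n].decode("utf-8")); i += n
    return values

assert unmarshal(marshal([42, "hello", -7])) == [42, "hello", -7]
```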
Remote procedure calls (cont.)
• Design issues (cont.)
  • Synchronous vs. asynchronous
    • Synchronous RPC
      • The calling process waits for the returned values
      • Traditional; functions like a subroutine call
      • Easy to understand and test, but leads to lower performance
    • Asynchronous RPC
      • The calling process is not blocked
      • Methods for synchronizing the client and the server
        • Higher-layer applications in both client and server initiate the exchange and then verify that all actions have been completed
        • The client uses a series of asynchronous RPCs followed by a synchronous RPC (see the sketch after this slide)
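A sketch of the second synchronization method above: a client issues several asynchronous RPCs, then a final synchronous RPC to confirm that all actions completed. Here rpc_call is a stand-in for a real stub, and the "commit" call is a hypothetical example.

```python
# Asynchronous RPCs followed by a synchronous RPC (rpc_call and "commit" are assumed).
import time
from concurrent.futures import ThreadPoolExecutor

def rpc_call(proc, args):
    time.sleep(0.1)                  # stand-in for network + remote execution time
    return f"{proc}{args} done"

with ThreadPoolExecutor() as pool:
    # Asynchronous phase: the client is not blocked while these calls are in flight.
    futures = [pool.submit(rpc_call, "update", [i]) for i in range(5)]
    results = [f.result() for f in futures]   # collect replies when convenient
    # Synchronous phase: one final blocking RPC verifies the server finished everything.
    print(rpc_call("commit", []))
```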
Remote procedure calls (cont.)
• Design issues (cont.)
  • Object-oriented mechanisms
    • Operation
      • The client sends a request to an object request broker
      • The broker acts as a directory of all remote services on the network; it calls the appropriate remote object and passes the data
      • The remote object services the request and replies to the broker, which returns the response to the client
    • Competing approaches:
      • Common Object Request Broker Architecture (CORBA) from the Object Management Group, backed by IBM, Apple, and Sun
      • Component Object Model (COM), the basis for Object Linking and Embedding (OLE), from Microsoft
Clusters
• Cluster: a group of interconnected computers (nodes) working together as a unified computing resource and creating the illusion of being one machine
• Advantages of clusters:
  • Absolute scalability
    • Clusters can consist of hundreds of machines, each of which may itself be a multiprocessor
  • Incremental scalability
    • A cluster can grow in small increments with minimal service disruption
  • High availability
    • Fault-tolerant operation in software
  • High price/performance ratio
    • Off-the-shelf building blocks
Clusters (cont.)
• Cluster configurations
  • Passive standby
    • The active system processes the entire load; the standby takes over if the primary fails
    • The active system sends 'heartbeat' messages to the standby to indicate continued operation (see the sketch after this slide)
    • High cost – no task sharing
    • Easy to implement
  • Active secondary
    • The secondary server is also used for processing tasks
    • Reduced cost due to task sharing
    • Increased complexity
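A minimal sketch of the passive-standby heartbeat described above, using UDP datagrams. The address, port, interval, and take_over action are all assumptions for illustration; a real cluster product implements this with far more care (redundant links, fencing, etc.).

```python
# Passive-standby heartbeat sketch (address, port, and thresholds are assumed).
import socket, time

HEARTBEAT_ADDR = ("192.0.2.10", 9999)   # example standby address (documentation range)
INTERVAL, MISSES_ALLOWED = 1.0, 3

def active_node():
    """Primary: periodically announce 'I am alive' to the standby."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        s.sendto(b"heartbeat", HEARTBEAT_ADDR)
        time.sleep(INTERVAL)

def standby_node():
    """Standby: if several heartbeats in a row are missed, take over the workload."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", HEARTBEAT_ADDR[1]))
    s.settimeout(INTERVAL)
    missed = 0
    while missed < MISSES_ALLOWED:
        try:
            s.recv(64)
            missed = 0                    # primary is still operating
        except socket.timeout:
            missed += 1
    take_over()                           # primary presumed failed

def take_over():
    print("standby becoming active: restarting services, acquiring resources")
```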
Clusters (cont.)
• Cluster configurations (cont.)
  • Separate servers
    • Each server has its own disks; no disks are shared
    • Data is copied between servers periodically
    • A scheduler assigns client requests to servers to balance the load
    • High availability
    • High server and network overhead due to the data copying
  • Shared disks, non-shared volumes (shared nothing)
    • Common disks are partitioned into volumes, each volume owned by only one computer
    • On a computer failure, the cluster is reconfigured to assign its volumes to the remaining computers
  • Shared disks, shared volumes
    • Each computer has access to all volumes on all disks
    • A locking mechanism ensures that data is accessed by only one computer at a time
Clusters (cont.)
• OS design issues
  • Failure management
    • Highly available clusters
      • High probability that all resources will be in service
      • In case of failure, queries in progress are lost
      • If retried, a lost query will be serviced by another computer in the cluster
    • Fault-tolerant clusters
      • Redundant shared disks and fault-tolerant operation
      • Fail-over: switching an application from a failed system to an alternative system
      • Fail-back: restoring applications and data resources to the original system after it recovers
  • Load balancing
    • Load must be balanced among the available computers
    • When a new computer is added to the cluster, the load needs to be rebalanced to include it (a scheduling sketch follows this slide)
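A sketch combining the two ideas above: requests go to the least-loaded node, a failed node is dropped and the query is retried elsewhere, and a newly added node simply starts receiving work. The Scheduler class and handle function are invented for illustration.

```python
# Least-loaded dispatch with simple fail-over retry (all names are illustrative).
import random

def handle(node, request):
    """Stand-in for shipping the request to `node`; may raise ConnectionError."""
    if random.random() < 0.1:
        raise ConnectionError(f"{node} unreachable")
    return f"{request} served by {node}"

class Scheduler:
    def __init__(self, nodes):
        self.load = {n: 0 for n in nodes}        # outstanding requests per node

    def add_node(self, node):
        self.load[node] = 0                      # new computer joins; future requests
                                                 # naturally rebalance toward it

    def dispatch(self, request):
        for node in sorted(self.load, key=self.load.get):    # least-loaded first
            self.load[node] += 1
            try:
                return handle(node, request)                  # run the query there
            except ConnectionError:
                del self.load[node]              # node failed: query in progress lost,
                                                 # retry on another computer
            finally:
                if node in self.load:
                    self.load[node] -= 1
        raise RuntimeError("no nodes available")

sched = Scheduler(["node0", "node1"])
sched.add_node("node2")
print(sched.dispatch("SELECT ..."))
```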
Clusters (cont.)
• OS design issues (cont.)
  • Parallelizing computation: executing software from a single application in parallel
    • Parallelizing compiler
      • The compiler determines, at compile time, which parts of the application can run in parallel
      • The parallel parts are assigned to different computers in the cluster
    • Parallelized application
      • The application is designed to run on the cluster and uses message passing for communication
      • The most powerful approach for exploiting clusters
    • Parametric computing
      • Useful for programs that must be executed a large number of times, each time with a different set of parameters (e.g., a simulation model); see the sketch after this slide
      • Parametric processing tools are needed to organize, run, and manage the jobs
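A small sketch of parametric computing: the same program executed once per parameter set. For simplicity the jobs run in a local process pool; a cluster tool would instead farm the independent jobs out to different nodes. The simulate function is a stand-in for a real simulation model.

```python
# Parametric computing sketch: one independent job per parameter set.
from itertools import product
from multiprocessing import Pool

def simulate(params):
    """Stand-in simulation model: compound growth over a number of steps."""
    rate, steps = params
    value = 1.0
    for _ in range(steps):
        value *= (1 + rate)
    return params, value

if __name__ == "__main__":
    parameter_sets = list(product([0.01, 0.02, 0.05], [10, 100, 1000]))
    with Pool() as pool:                          # locally: processes; on a cluster:
        for params, result in pool.map(simulate, parameter_sets):   # jobs on nodes
            print(params, result)
```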
Clusters (cont.)
• Cluster computer architecture
  • All computers are interconnected by a high-speed LAN or switch
  • Each computer is capable of operating independently
  • A middleware layer of software runs on each computer to implement the cluster functionality
    • Provides a unified system image to the user, called a single-system image
    • Is responsible for load balancing and high availability
• Middleware services and functions
  • Single entry point: a user logs into the cluster, not onto a specific computer
  • Single file hierarchy: the user sees a single file hierarchy under one root directory
  • Single control point: a default workstation is used for cluster management and control
  • Single virtual networking: there is a single virtual network connecting the cluster computers, even if it consists of multiple interconnected networks
Clusters (cont.)
• Middleware services and functions (cont.)
  • Single memory space: a distributed shared memory is used to share variables
  • Single job-management system: the cluster has a job scheduler; jobs are submitted to the cluster, not to individual computers
  • Single user interface: a common graphical interface is used by all users, regardless of the workstation from which they enter the cluster
  • Single I/O space: any node can access any I/O device
  • Single process space: a process on any node can create or communicate with any other process in the cluster
  • Check-pointing: process states and intermediate results are saved periodically, permitting rollback recovery after failures (see the sketch after this slide)
  • Process migration: processes can move within the cluster to provide load balancing
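A minimal sketch of check-pointing as described above: the job periodically saves its state and intermediate results, and on restart it resumes from the last saved point instead of from the beginning. The file name, interval, and workload are assumptions for illustration.

```python
# Check-pointing sketch: save state periodically, resume from it after a failure.
import os, pickle

CHECKPOINT = "job.ckpt"        # assumed checkpoint file
CHECKPOINT_EVERY = 1000        # assumed checkpoint interval (in steps)

def run_job(total_steps=10_000):
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            step, partial = pickle.load(f)
    else:
        step, partial = 0, 0.0

    while step < total_steps:
        partial += 1.0 / (step + 1)              # stand-in for real work
        step += 1
        if step % CHECKPOINT_EVERY == 0:         # save state and intermediate results
            with open(CHECKPOINT, "wb") as f:
                pickle.dump((step, partial), f)
    return partial

print(run_job())
```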
Clusters (cont.)
• Clusters compared with SMPs
  • SMPs
    • Easier to manage and configure than clusters
    • Much closer to the original uniprocessor model; the major difference from the uniprocessor is the scheduler function
    • Use less physical space and require less energy than a comparable cluster
    • SMP products are well established and stable
  • Clusters
    • Far superior to SMPs in terms of absolute and incremental scalability
    • Far superior in terms of availability
    • Likely to dominate the high-performance server market
Windows 2000 Cluster Server
• The configuration is a shared-nothing cluster: each volume and every other resource is owned by a single system at a time (initially code-named Wolfpack)
• Main concepts
  • Cluster Service:
    • The software on each node responsible for cluster-specific activities
  • Resource:
    • An item managed by the cluster service
    • Resources are objects representing either physical hardware devices (e.g., disk drives, network cards) or logical items (e.g., disk volumes, IP addresses, applications, databases)
    • Resources are implemented as dynamically linked libraries (DLLs) and managed by a resource monitor
  • Online: a resource is online at a node if it provides a service at that node
  • Group:
    • A collection of resources managed as a single entity
    • Consists of all elements needed to run a specific application and to allow client systems to connect to the service provided by that application
    • Operations can be performed on the entire group (e.g., transfer to another node)
Windows 2000 Cluster Server (cont.)
• [Figure slide: block diagram of the Windows 2000 Cluster Server components]
Windows 2000 Cluster Server (cont.)
• The W2K Cluster Server components and their relationships in a single node of a cluster
  • Node manager
    • Responsible for maintaining this node's membership in the cluster
    • Sends periodic heartbeat messages to the node managers of the other nodes in the same cluster
    • If it detects the loss of heartbeat messages from another node
      • It broadcasts a message to the entire cluster
      • All members exchange messages to verify their view of the current cluster membership
      • If a node manager does not reply, that node is removed from the cluster and its active groups are transferred to one or more of the other nodes
  • Configuration database manager
    • Responsible for the cluster configuration database
    • The database holds information about all cluster resources, groups, and node ownership of groups
    • Database managers on all nodes communicate with each other to maintain a consistent view of the configuration information in the cluster
    • The integrity of the database is maintained by using fault-resistant software for all changes to the cluster configuration
Windows 2000 Cluster Server (cont.)
• The W2K Cluster Server components and their relationships in a single node of a cluster (cont.)
  • Resource manager / fail-over manager
    • Responsible for the management of resource groups
    • Initiates actions such as startup, reset, and fail-over
    • In case of fail-over, the fail-over managers on the active nodes negotiate the redistribution of resource groups from the failed node to the remaining active nodes
    • When the failed node has recovered, the fail-over managers may decide to move some groups back
  • Event processor
    • Connects all the components of the cluster service
    • Handles common operations
    • Controls cluster service initialization
  • Communications manager
    • Provides the facilities for message exchange with the other nodes in the cluster
  • Global update manager
    • Provides an update service for the other components
Sun Cluster
• Solaris UNIX has been extended to create the Sun Cluster distributed operating system
• It appears to users and applications as a single computer running the Solaris OS
• Components:
  • Object and communications support
  • Process management
  • Networking
  • Global distributed file system
Sun Cluster (cont.)
• Object and communications support
  • Object oriented: uses the CORBA object model to define objects and the remote procedure call (RPC) mechanism
• Global process management
  • The location of a process is transparent to the user
  • Each process has a unique identifier within the cluster
  • Process migration is possible: a process can move from node to node for load balancing and fail-over (caveat: all threads of a single process must be on the same node)
• Networking
  • Strategy:
    • A packet filter is used to route packets to the proper node
    • The cluster appears externally as a single server with a single IP address
  • Operation
    • Incoming packets are received on the node that has the external network adapter, filtered, and delivered to the correct target node over the cluster interconnect for protocol processing
    • For outgoing packets, the originating node performs the protocol processing and transfers the packet over the cluster interconnect to the node that has the external physical network connection
  • An illustrative sketch of per-connection routing follows this slide
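The following sketch illustrates the general idea of a packet filter on the externally connected node choosing a target node per connection, so the cluster looks like one server with one IP address. It is not Sun Cluster's actual implementation; the node list, hashing scheme, and packet representation are all assumptions.

```python
# Illustrative per-connection packet routing behind a single cluster IP address
# (not Sun Cluster's actual mechanism; all names and the hash scheme are assumed).
import hashlib

NODES = ["node0", "node1", "node2", "node3"]     # assumed cluster membership

def target_node(src_ip, src_port):
    """Pick a node deterministically so every packet of a connection goes to the same node."""
    key = f"{src_ip}:{src_port}".encode()
    return NODES[int(hashlib.sha256(key).hexdigest(), 16) % len(NODES)]

def filter_and_forward(packet):
    node = target_node(packet["src_ip"], packet["src_port"])
    # In a real system the packet would now cross the cluster interconnect so that
    # protocol processing happens on `node`.
    return node, packet

print(filter_and_forward({"src_ip": "10.0.0.7", "src_port": 40501, "data": b"GET /"}))
```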
Sun Cluster (cont.)
• Global file system
  • Like standard Solaris, Sun Cluster is based on the concepts of the virtual node (vnode) and the virtual file system (vfs)
  • Standard Solaris
    • Vnode
      • The vnode structure provides a general-purpose interface to all types of file systems
      • A vnode provides a mapping to an object in any file system type (by contrast, an inode in UNIX maps only to UNIX files)
      • The vnode interface accepts general-purpose file manipulation commands (e.g., read, write) and translates them into actions appropriate for the respective file system
    • Vfs
      • Vfs structures describe entire file systems
      • The vfs interface accepts general-purpose commands that operate on entire file systems and translates them into actions appropriate for the particular file system
Sun Cluster (cont.)
• Global file system (cont.)
  • Global file access
    • The global file system provides a uniform interface to files distributed over the cluster
    • Processes on all nodes use the same pathname to locate a file and can open any file
  • Implementation
    • A proxy file system is built on top of the existing Solaris file system at the vnode interface
    • Vfs/vnode operations are converted by the proxy layer into object invocations
    • The invoked object may reside on any node in the cluster; it performs a local vnode/vfs operation on the underlying file system
    • Caching is used for file contents, directory information, and file attributes
Beowulf and Linux clusters
• Beowulf
  • The Beowulf project
    • Initiated under the NASA High Performance Computing and Communications (HPCC) project
    • Goal: expand the capabilities of clustered PCs for performing important computational tasks
    • Widely implemented; the most important new cluster technology available
  • Beowulf features
    • Use of off-the-shelf components, no custom components, available from many vendors
    • Dedicated processors
    • Dedicated private network (LAN, WAN, or an inter-networked combination)
    • Scalable I/O
    • Free software base and distributed-computing tools
    • Return of the design and improvements to the community
Beowulf and Linux clusters (cont.)
• [Figure slide: generic Beowulf cluster configuration]
Beowulf and Linux clusters (cont.)
• Most Beowulf implementations use a cluster of Linux workstations or PCs
• A representative Linux implementation of Beowulf contains
  • A number of workstations (not necessarily the same platform), all running Linux
  • Secondary storage at each workstation that can be made available for distributed access (e.g., distributed file sharing)
  • Linux nodes interconnected by an off-the-shelf network (e.g., an Ethernet switch or an interconnected set of Ethernet switches)
• Beowulf software
  • Open-source Beowulf software
  • Beowulf tools and utilities
  • A Linux kernel modified to allow the individual nodes to participate in a number of global namespaces
Beowulf and Linux clusters (cont.)
• Examples of Beowulf system software
  • Beowulf distributed process space (BPROC)
    • Allows a process to span multiple nodes in a cluster environment
    • Provides a mechanism for starting a process on another node without logging into that node
    • Makes all remote processes visible in the process table of the cluster's front-end node
  • Beowulf Ethernet channel bonding
    • A mechanism that joins multiple networks into a single logical network with higher bandwidth
    • Distributes packets over the available device transmit queues
    • Provides load balancing over multiple Ethernets connected to Linux workstations
  • PVMSYNC
    • Provides a synchronization mechanism and shared data objects within a cluster
  • EnFusion
    • A set of tools for parametric computing, i.e., execution of a program as a large number of jobs, each with a different set of parameters