2.3 Design Principles of Computer Clusters

2.3.1 Single-System Image Features
• A single-system image (SSI) means the illusion of a single system, single control, symmetry, and transparency.
• Single system: users view the entire cluster as one system that has multiple processors.
• Single control: logically, a user or system user utilizes services from one place with a single interface.
• Symmetry: all cluster services and functionalities are symmetric to all nodes and all users, except those protected by access rights.
• Location-transparent: the user is not aware of the whereabouts of the physical device that eventually provides a service.
• Cluster nodes
  • home node
  • local node
  • remote nodes
• The illusion of an SSI can be obtained at several layers: application software layer, hardware or kernel layer, middleware layer.
Ch. 2-2 Computer Clusters

Single Entry Point
• The single entry point enables users to log in to a cluster as one virtual host.
• The system transparently distributes the user's login and connection requests to different physical hosts to balance the load.
• Realizing a Single Entry Point in a Cluster of Computers, Fig. 2.13
• Single File Hierarchy
  • From the viewpoint of any process, files can reside in three types of locations in a cluster, as shown in Fig. 2.14.
  • Stable storage must satisfy two requirements: it is persistent and fault-tolerant.
  • Stable storage (global files) could be implemented as one centralized, large RAID disk, but it could also be distributed using the local disks of cluster nodes.
• Single I/O Space over Distributed RAID for I/O-Centric Clusters, Fig. 2.16

RAID
2.3.2 High Availability through Redundancy
• When designing robust, highly available systems, three terms are often used together: reliability, availability, and serviceability (RAS).
  • Reliability: measures how long a system can operate without a failure.
  • Availability: the percentage of time that a system is available to the user.
  • Serviceability: how easy it is to service the system (maintenance, repair, upgrades).

Availability and Failure Rate
• Availability = MTTF / (MTTF + MTTR)
  • MTTF (mean time to failure)
  • MTTR (mean time to repair)
• Planned vs. Unplanned Failures
• Transient vs. Permanent Failures
• Partial vs. Total Failures
• Single Point of Failure in an SMP and in Clusters of Computers, Fig. 2.19
• Redundancy Techniques
  • Table 2.5 Availability of Computer System Types
  • Isolated Redundancy
    • When a component (the primary component) fails, the service it provided is taken over by another component (the backup component).
    • The primary and the backup components should be isolated from each other.
    • Benefits
      • The component is not a single point of failure.
      • A failed component can be repaired while the rest of the system is still working.
      • The primary and the backup components can test and debug each other.
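
The availability formula on this slide, Availability = MTTF / (MTTF + MTTR), can be checked numerically. The sketch below is only an illustration: the function names and the sample figures (1,000 hours between failures, 10 hours to repair) are assumptions, not values from the slides or from Table 2.5.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: a non-leap year


def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR), as a fraction between 0 and 1."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(mttf_hours, mttr_hours):
    """Expected hours per year during which the system is unavailable."""
    return (1.0 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical system: fails every 1,000 hours on average, takes 10 hours to repair.
    a = availability(1000, 10)
    print(f"availability = {a:.4f}")  # about 0.9901
    print(f"downtime     = {downtime_hours_per_year(1000, 10):.1f} hours/year")
```

Because MTTR sits only in the denominator, shortening repair time raises availability just as lengthening MTTF does, which is one reason the isolated-redundancy benefit of repairing a failed component while the rest of the system keeps working matters.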

N-Version Programming to Enhance Software Reliability
• The software is implemented by N isolated teams who may not even know the others exist.
• Different teams are asked to implement the software using different algorithms, programming languages, environment tools, and even platforms.
• In a fault-tolerant system, the N versions all run simultaneously and their results are constantly compared. If the results differ, the system is notified that a fault has occurred.
2.3.3 Fault-Tolerant Cluster Configurations
• Three ascending levels of availability
  • Hot standby server clusters
  • Active-takeover clusters
  • Failover cluster
• Failover must provide several functions: failure diagnosis, failure notification, and failure recovery.
• Recovery Scheme
  • Backward recovery
    • Checkpoint
    • Rollback
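
The slide only names checkpoint and rollback as the two parts of backward recovery. As a rough illustration of the idea, the sketch below saves a copy of a job's state at a checkpoint and restores that copy after a simulated failure, so only the work done since the last checkpoint has to be repeated. The class `CheckpointedJob` and its in-memory storage are hypothetical; a real cluster would write checkpoints to stable storage.

```python
import copy


class CheckpointedJob:
    """Toy job whose state can be saved (checkpoint) and restored (rollback)."""

    def __init__(self):
        self.state = {"done": 0, "total": 0}
        self._saved = copy.deepcopy(self.state)  # initial checkpoint

    def step(self):
        """Do one unit of work: add the next integer to the running total."""
        self.state["done"] += 1
        self.state["total"] += self.state["done"]

    def checkpoint(self):
        """Save a copy of the current state (in memory here, by assumption)."""
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        """Discard work done since the last checkpoint and restore the saved state."""
        self.state = copy.deepcopy(self._saved)


if __name__ == "__main__":
    job = CheckpointedJob()
    for _ in range(3):
        job.step()
    job.checkpoint()          # state saved after 3 steps
    for _ in range(2):
        job.step()            # 2 more steps, then a simulated failure
    job.rollback()            # back to the state after step 3
    while job.state["done"] < 6:
        job.step()            # redo the lost steps and finish
    print(job.state)
```

The trade-off the sketch hints at is checkpoint frequency: frequent checkpoints cost time during normal operation but shorten the amount of work lost on rollback.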

2.4 Cluster Job and Resource Management
2.4.1 Cluster Job Scheduling Methods
• Cluster jobs may be scheduled to run at a specific time (calendar scheduling) or when a particular event happens (event scheduling).
• Table 2.6 Job Scheduling Issues and Schemes for Cluster Nodes
• Space Sharing
  • Multiple jobs can run on disjoint partitions of nodes simultaneously.
  • At most one process is assigned to a node at a time.
  • Job Scheduling by Tiling over Cluster Nodes, Fig. 2.22
• Time Sharing
  • Independent scheduling (local scheduling)
  • Gang scheduling
    • The gang scheduling scheme schedules all processes of a parallel job together.
    • When one process is active, all processes are active.
  • Competition with foreign jobs
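
Gang scheduling as described above can be sketched as a table of time slots in which every process of a parallel job is placed in the same slot, so that when one process is active, all of them are. The function `gang_schedule`, its first-fit packing rule, and the job names are assumptions made for illustration; the slides do not prescribe a packing policy.

```python
def gang_schedule(jobs, num_nodes):
    """Place each job's processes into a single time slot (first fit).

    jobs: dict mapping job name -> number of processes.
    Returns a list of time slots; each slot maps node index -> job name.
    """
    slots = []
    for name, nprocs in jobs.items():
        if nprocs > num_nodes:
            raise ValueError(f"job {name} needs more nodes than the cluster has")
        for slot in slots:
            free = [n for n in range(num_nodes) if n not in slot]
            if len(free) >= nprocs:
                break  # the whole gang fits in this existing slot
        else:
            slot = {}  # no slot had room: open a new time slot
            slots.append(slot)
            free = list(range(num_nodes))
        for node in free[:nprocs]:
            slot[node] = name  # one process per node within a slot
    return slots


if __name__ == "__main__":
    # Hypothetical workload on a 4-node cluster.
    for t, slot in enumerate(gang_schedule({"A": 4, "B": 2, "C": 2}, 4)):
        print(f"slot {t}:", [slot.get(n, "-") for n in range(4)])
```

In this example job A fills the first time slot, while B and C share the second; the scheduler would then cycle through the slots, switching all nodes at once.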

2.4.2 Cluster Job Management Systems
• A Job Management System (JMS) should have three parts:
  • user server
  • job scheduler
  • resource manager: resource allocation and monitoring, scheduling policy enforcement, and accounting information collection
• JMS Administration
• Cluster Job Types
• Characteristics of a Cluster Workload
  • Workload characteristics based on experience with the NAS benchmark; see p. 108
• Migration Schemes
  • Node availability
  • Migration overhead
  • Recruitment threshold
    • The recruitment threshold is the amount of time a workstation stays unused before the cluster considers it an idle node.
2.4.3 Load Sharing Facility for Cluster Computing

