Explore the primary characteristics, motivations, and challenges of building distributed systems through real-world examples like banking, retail, and air-traffic control. Understand the necessity of transparency, scalability, and reliability in distributed applications.
Outline • Definitions • Challenges • Examples to Illustrate Challenges • Goals in Application Development • Summary
Definition of a Distributed System • A distributed system: • Multiple connected CPUs working together • A collection of independent computers that appears to its users as a single coherent system • Examples: parallel machines, networked machines • Lamport’s definition: A distributed system is one in which I cannot get something done because a machine I've never heard of is down.
Primary Characteristics of a Distributed System • Multiple computers • Concurrent execution • Independent operation and failures • Communications • Ability to communicate • No tight synchronization • Relatively easy to expand or scale • Transparency
Example: A Typical Portion of the Internet (Coulouris) • [Figure showing intranets, ISPs, a backbone, and a satellite link; legend: desktop computer, server, network link]
Example: Portable and Handheld Devices in a Distributed System (Coulouris)
Motivation for Building Distributed Systems • Economics • Share resources • Relatively easy to expand or scale • Speed: A distributed system may have more total computing power than a mainframe. • Cost • Personalized environments • Location independence • People and information are distributed • Expandability • Availability and Reliability • If a machine crashes, the system as a whole can survive.
Distributed Application Examples • Banking, stock markets, stock brokerages • Health care, hospital automation • Control of power plants, electric grid • Telecommunications infrastructure • Electronic commerce and electronic cash on the Web (very important emerging area) • Corporate “information” base: a company’s memory of decisions, technologies, strategy • Military command, control, intelligence systems • Retail • Air-traffic control • GAUL
Examples in More Detail • Air-Traffic Control • This is not an Internet application. • In many countries, airspace is divided into areas which in turn may be divided into sectors. • Each area is managed by a control center. • Control systems communicate with tower control and other control systems (to allow a plane to cross boundaries). • The planes and air-traffic controls are “distributed”. A single centralized system is not feasible.
Examples in More Detail • World Wide Web • Shared Resources: Documents • Unique identification using URLs • Users interested in the documents are distributed. • The documents are also distributed. • Banking • Clients may access their accounts from ATMs. • Multiple clients may attempt to access their accounts simultaneously. • Multiple copies of account information allow quicker access.
Examples in More Detail • Retail • Stores are located near their customer base. • Point of Sale (POS) terminals are used for customer interactions while mobile units are used for inventory control. • These units talk to a local processor which in turn may communicate with remote processors. • Gaul • What is being shared includes disk space, e-mail server, web server, software
Challenges • Heterogeneity • Networks • Hardware • Operating systems • Programming languages
Challenges • Failure Handling • Partial failures • Can non-failed components continue operation? • Can the failed components easily recover? • Detecting failures • Recovery • Replication • We will now examine the challenges in the context of two applications
Illustrative Example: Banking • [Figure: bank branches connected by a network] • A money withdrawal of 100 euros is requested in Paris.
Illustrative Example: Banking • [Figure: bank branches connected by a network] • The money is given right away.
Illustrative Example: Banking • [Figure: bank branches connected by a network] • Later, the ATM contacts the bank.
Illustrative Example: Banking • Accounts are replicated • Why replication? • Performance; A single server does not scale very well • Reliability; What if the single server went down?
Illustrative Example: Banking • [Figure: bank branches connected by a network] • The university contacts another bank branch to deposit a salary.
Illustrative Example: Banking • [Figure: bank branches connected by a network] • A bank branch applies interest to an account.
Illustrative Example: Banking • Hmm. If the operations on the account are not done in the same order then the accounts will have different amounts. • Replicas of an account should be consistent. • What’s the big deal? The ATM transaction goes first, followed by salary deposit which is followed by the interest operation. • How do you actually know which operation occurred first?
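To make the ordering problem concrete, the following minimal sketch applies the three operations from the slides in every possible order. The starting balance, the salary amount, and the 1% interest rate are illustrative assumptions, and the function names are hypothetical.

```python
from itertools import permutations

# Hypothetical operations on an account balance (all amounts are assumed).
def withdraw_100(balance):
    return balance - 100

def deposit_salary(balance):
    return balance + 2000

def apply_interest(balance):
    return balance * 1.01  # assumed 1% interest

def apply_all(balance, operations):
    """Apply the operations to the balance in the given order."""
    for operation in operations:
        balance = operation(balance)
    return balance

start = 1000
operations = [withdraw_100, deposit_salary, apply_interest]

# Final balance for every possible ordering of the three operations.
results = {
    tuple(op.__name__ for op in order): round(apply_all(start, order), 2)
    for order in permutations(operations)
}

for order, balance in results.items():
    print(" -> ".join(order), "=", balance)
```

Because interest is multiplicative and the other two operations are additive, replicas that apply the same operations in different orders end up with different balances.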
Illustrative Example: Banking • Use clocks? • There is no global clock; we must rely on local clocks. • It is very difficult to synchronize local physical clocks. • Network latency is a factor. • Let’s say that the ATM operation occurs at 10:00 AM, the salary deposit occurs at 10:01 AM and the interest payout occurs at 10:02 AM. • Network latency may mean that the ATM operation arrives at the bank branches at 10:05 AM, which may be after the other operations have arrived at the bank branches. • It’s actually worse. The ATM operation may arrive at one bank branch after an interest calculation but arrive before an interest calculation at another branch. • How does a bank branch know to wait for the ATM operation?
Illustrative Example: Banking • Replication is a headache. Don’t replicate. • Could do that but it overloads a server and causes poor performance. • A bank does not want to limit the number of its users as the result of slowness. • An e-commerce site does not want to lose customers as the result of a slow system. • What if the server goes down? • Wait … You may have replication and one of the servers goes down. • Operations at the other branches continue. • What if the server comes back up? Isn’t it going to have different contents?
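The latency scenario can be sketched with a small simulation. The send times follow the slide (minutes after 10:00 AM); the per-branch delays and the branch names are hypothetical values chosen only to show the effect.

```python
# Send times in minutes after 10:00 AM, as in the scenario above.
sent_at = {"atm_withdrawal": 0, "salary_deposit": 1, "interest": 2}

# Hypothetical one-way network delays (in minutes) to each branch.
latency = {
    "branch_A": {"atm_withdrawal": 5, "salary_deposit": 1, "interest": 1},
    "branch_B": {"atm_withdrawal": 1, "salary_deposit": 3, "interest": 0.5},
}

def arrival_order(branch):
    """Return the operations sorted by the time they reach the branch."""
    arrivals = {op: sent_at[op] + latency[branch][op] for op in sent_at}
    return sorted(arrivals, key=arrivals.get)

for branch in latency:
    print(branch, arrival_order(branch))
```

With these assumed delays, the ATM withdrawal reaches branch_A after the interest operation but reaches branch_B before it, so neither branch can infer the real order from arrival order alone.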
Illustrative Example: Banking • Can’t rely on clocks and we want to replicate? Then what? • We will study algorithms that provide the notion of “logical” clocks. • The concept of logical clocks will be the basis of several algorithms that provide consistency across replicas in a transparent fashion. • Transparent: Should users have to know that the system is replicated, e.g., should the ATM user have to know that their account is replicated in order to use the system?
Illustrative Example: Banking • Bank mergers: • Different (heterogeneous) systems • How do we integrate them? • How open should systems be? • Can the system be extended and re-implemented? • Are interfaces published? • Is there a uniform mechanism to access resources? • How do we ensure that updates to an account are valid?
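One well-known form of logical clock is the Lamport clock. The slides do not say which algorithm the course covers, so the following is only a minimal sketch under that assumption; the class and method names are hypothetical.

```python
class LamportClock:
    """A counter that orders events without relying on physical time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Advance the clock and return the timestamp to attach to a message."""
        return self.tick()

    def receive(self, message_time):
        """Move past both the local time and the timestamp on the message."""
        self.time = max(self.time, message_time) + 1
        return self.time


atm = LamportClock()
branch = LamportClock()

atm.tick()                        # local event at the ATM
stamp = atm.send()                # the ATM sends the withdrawal to the branch
branch.tick()                     # unrelated local event at the branch
received = branch.receive(stamp)  # the branch receives the withdrawal

print(stamp, received)
```

The receive rule guarantees that the receipt of a message always carries a larger timestamp than its sending, which is the property that ordering algorithms can build on.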
Illustrative Example: Game • Some games require that the game state (or part of the game state) is held by each player. • We would like to make sure that the game state is consistent, e.g.: • Three users (U1, U2, U3) participate in a first-person shooter. • As viewed from U1: U1 pushes a button that disarms all opponents. • As viewed from U2: Just before U1 pushes the button, U2 shoots U1. • What does U3 see? • Ordering of events (even if they appear to happen concurrently) is required. • Ensuring every user views events in the same order is commonly termed identical ordering or total ordering.
Illustrative Example: Game • Consistency is important but so is speed. • Does a game have the same consistency requirements as a banking application? • Turns out the answer is no. • We will study different types of consistency and the algorithms and system support that provide the different types of consistency.
Illustrative Example: Game • A trivial attempt at satisfying ordering is to use TCP to ensure FIFO delivery and to have a central server through which all messages must pass. • The central server, together with TCP, ensures all nodes receive the same messages in the same order. • What about node failure? • TCP is slow; why not use UDP? • UDP is faster, but it doesn’t ensure FIFO ordering.
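The central-server idea can be sketched without any networking. In this hedged example the server stamps each message with a sequence number and each player delivers messages strictly in sequence order. The explicit sequence numbers are an assumption added here (the slide relies on TCP for FIFO delivery), and the class names are hypothetical. The sketch does not handle lost messages or a failed server.

```python
class Sequencer:
    """Central server that assigns one global sequence number per message."""

    def __init__(self):
        self.next_seq = 0

    def stamp(self, message):
        seq = self.next_seq
        self.next_seq += 1
        return seq, message


class Player:
    """Buffers out-of-order messages and delivers them in sequence order."""

    def __init__(self):
        self.expected = 0
        self.pending = {}
        self.delivered = []

    def receive(self, seq, message):
        self.pending[seq] = message
        # Deliver every message that is now next in line.
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1


server = Sequencer()
events = ["U2 shoots U1", "U1 disarms all opponents", "U3 moves"]
stamped = [server.stamp(event) for event in events]

u1, u3 = Player(), Player()
for seq, message in stamped:            # arrives in order at U1
    u1.receive(seq, message)
for seq, message in reversed(stamped):  # arrives reordered at U3
    u3.receive(seq, message)

print(u1.delivered == u3.delivered)
```

Both players deliver the events in the order the server chose, even though the messages reached U3 in reverse.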
Goals of Application Development • Connectivity • Transparency • Reliability • Consistency • Security • Openness • Scalability
Connectivity • It should be easy for users to access remote resources and to share them with other users in a controlled fashion. • Resources that can be shared include printers, storage facilities, data, files, web pages, etc. • Why? It is economical. • Connecting users and resources makes collaboration and the exchange of information easier. • Just look at e-mail.
Transparency • A distributed system that is able to present itself to users and applications as if it were only a single computer system is said to be transparent. • Very difficult to make distributed systems completely transparent. • You may not want to, since transparency often comes at the cost of performance.
Transparency in a Distributed System • [Table: different forms of transparency in a distributed system]
Degree of Transparency • The goal of full transparency is not always desirable. • Users may be located on different continents; distribution is apparent and not something you want to hide. • Completely hiding failures of networks and nodes is (theoretically and practically) impossible: • You cannot distinguish a slow computer from a failing one. • You can never be sure that a server actually performed an operation before a crash. • Full transparency costs performance, for example: • Keeping Web caches exactly up-to-date with the master copy • Immediately flushing write operations to disk for fault tolerance.
Openness • An open distributed system allows for interaction with services from other open systems, irrespective of the underlying environment. • Systems should conform to well-defined interfaces. • Systems should support portability of applications. • Systems should easily interoperate. Interoperability is characterized by the extent to which two implementations of systems or components from different manufacturers can co-exist and work together. • Example: In computer networks there are rules that govern the format, contents and meaning of messages sent and received.
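The claim that a slow computer cannot be told apart from a failed one can be illustrated with a timeout-based check. This is only a sketch: the heartbeat mechanism, the function name, and all the times are assumptions, not something described in the slides.

```python
def suspected(heartbeat_arrivals, now, timeout):
    """Return True if no heartbeat arrived within `timeout` seconds of `now`."""
    seen = [t for t in heartbeat_arrivals if t <= now]
    last = max(seen) if seen else float("-inf")
    return now - last > timeout


# Hypothetical heartbeat arrival times (in seconds) at the observer.
crashed_server = [0.0]     # crashed right after t = 0
slow_server = [0.0, 12.0]  # still alive, but its next heartbeat is delayed

timeout = 5.0
print(suspected(crashed_server, 10.0, timeout))  # no heartbeat since t = 0
print(suspected(slow_server, 10.0, timeout))     # same observation at t = 10
print(suspected(slow_server, 13.0, timeout))     # the late heartbeat has arrived
```

At t = 10 the observer has seen exactly the same evidence for both servers, so it suspects both. Only the late heartbeat reveals that the second server was merely slow.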
Scalability • There are three dimensions to scalability: • The number of users and processes (size scalability) • The maximum distance between nodes (geographical scalability) • The number of administrative domains (administrative scalability)
Techniques for Scaling • Partition data and computations across multiple machines • Move computations to clients (Java applets) • Decentralized naming services (DNS) • Decentralized information systems (WWW) • Make copies of data available at different machines • Replicated file servers (for fault tolerance) • Replicated databases • Mirrored web sites • Allow client processes to access local copies • Web caches (browser/Web proxy) • File caching (at server and client)
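Partitioning data across machines is often done by hashing a key. The sketch below is one simple way to do it, assuming string keys and a fixed number of machines; the function name and the account identifiers are hypothetical.

```python
import hashlib


def partition_for(key, num_machines):
    """Map a key to a machine index in the range [0, num_machines)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


accounts = ["account-1", "account-2", "account-3", "account-4"]
placement = {account: partition_for(account, 3) for account in accounts}
print(placement)
```

Every client that uses the same function and the same machine count finds the same machine for a given key, without consulting a central directory. A known limitation of this simple scheme is that changing the number of machines moves most keys.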
Scaling – The problem • Applying scaling techniques is easy, except for the following: • Having multiple copies (cached or replicated) leads to inconsistencies – modifying one copy makes that copy different from the rest. • Always keeping copies consistent requires global synchronization. • Global synchronization is expensive with respect to performance. • We have learned to tolerate some inconsistencies.
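The inconsistency problem can be seen in a few lines. This is a minimal sketch of a client-side cache in front of a master copy; the class, the key, and the values are hypothetical.

```python
class Cache:
    """A local copy that is filled on first read and never refreshed on its own."""

    def __init__(self, master):
        self.master = master
        self.local = {}

    def read(self, key):
        if key not in self.local:
            self.local[key] = self.master[key]  # fetch from the master copy
        return self.local[key]

    def invalidate(self, key):
        """Drop the local copy so the next read fetches the master copy again."""
        self.local.pop(key, None)


master = {"balance": 100}
cache = Cache(master)

first_read = cache.read("balance")  # the cache now holds a copy
master["balance"] = 50              # the master copy is modified
stale_read = cache.read("balance")  # the cache still returns its old copy
cache.invalidate("balance")         # synchronization step
fresh_read = cache.read("balance")

print(first_read, stale_read, fresh_read)
```

Keeping every copy consistent would require an invalidation (or update) to reach every cache on every write, which is the global synchronization cost the slide refers to.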
Summary • Distributed systems consist of autonomous computers that work together. • When properly designed, distributed systems can scale well with respect to the size of the underlying network. • There are many challenges, many of which will be addressed in the course.