Introduction • CS 188: Distributed Systems • January 6, 2015
Description of Class • Topics covered and structure of class • Grading • Reading materials • Student assignments • Office hours • Web page
Topics to Be Covered • Distributed systems • Basic principles • Distributed systems algorithms • Important case studies • Focusing on distributed storage systems • For specificity
Specific Topics • Concepts and basic architectures • Concurrent processes and synchronization methods • IPC • Distributed processes • Distributed file systems • Models of distributed computation
Specific Topics, Cont. • Synchronization and election • Distributed agreement • Replicated data management • Checkpointing and recovery • Distributed system security • Cloud computing
Prerequisites for Class • Assumes understanding of material in • CS 111 (Operating Systems) • CS 118 (Computer Networks) • Without having taken these classes, you will be at a serious disadvantage • Thursday’s class will briefly review relevant material from these courses
Structure of Class • A bit different • Some lectures • Some in-class discussion/design sections • Grading based on tests and group projects
The Basic Class Plan • I will lecture on some topic in distributed systems on Tuesdays • There will be a taped lecture you should watch before Thursdays • Thursday classes will be interactive • Discussing how to make use of the ideas discussed in earlier lectures
Taped Lectures • In PowerPoint • Made available from the class web page • The intention is that you view each lecture before the Thursday class • They will start in week 2
The Core Example • Distributed storage • How can we make effective use of data and storage space on multiple machines? • One example of an important distributed system problem • That touches on many of the key challenges in distributed systems
Investigating Distributed Storage • We’ll look at different design alternatives • Some embodied in well-known code • Others not necessarily implemented at all • We’ll discuss how we should go about designing such a system
How Will We Start? • We’ll start simple • I want to access data stored on another machine • How do I get to the data? • What problems am I likely to face? • Then we’ll add interesting complications
Some Complications • What if multiple parties use writeable data? • What if machines/networks are of limited power? • What if we keep multiple copies of the data? • What if the scale of our system is very large? • What if there are failures? • Temporary • Permanent • What if we don’t trust everyone in the system equally?
The Format of the Discussions • Not lectures by me • In-class discussions • Which, hopefully, everyone will participate in • Not graded • But if you want to learn about distributed systems, be there and join in
Grading • Midterm - 20% • Project - 40% • Final - 40%
Reading Materials • No textbook • Online readings will be assigned on the course web page
Office Hours • Tuesday/Thursday 1-2 PM • Held in 3532F Boelter Hall • Other times available by prior arrangement • I’m usually around
Class Web Page http://lever.cs.ucla.edu/classes/188_winter15 • Slides for classes will be posted there • By 5 PM the previous afternoon • Readings will be posted there • With links to papers • Also links to other interesting info
Class Projects • Groups of 4-5 students • Implementation of software relevant to distributed systems • Must be demonstrated • Accompanied by a short (10 page) report • Topic to be chosen by end of week 3 • Must submit 1 paragraph description
More on the Projects • Largely handled by our TA • Turker Garip • He will present a set of possible projects • Groups will choose from among those projects • Groups meet regularly with Turker • Details at first recitation section
Tests • Midterm and final • Format will be determined and discussed later
Introduction • What is a distributed system? • Basic issues in distributed systems • Basic architectures for distributed systems
What Is a Distributed System? • A system with more than one active machine • Typically connected by a network • Typically cooperating • Either on one specific task • Or to give the illusion of a bigger, more powerful machine
Some Examples • An office’s local area network • A client/server system • A peer file sharing service • A factory’s industrial control system • A cloud computing environment • Specialized services like DNS
Problems With Distributed Systems • Computations involving multiple machines inherently more difficult • They make resource control harder • They make coordination harder • They make security harder
Basic Issues in Distributed Systems • Transparency • Naming issues • Consistency issues • Failure and recovery issues • Heterogeneity • Security
Transparency • One goal of most distributed systems is to hide the distribution • Make the system look like one computer • Make the network and multiple CPUs transparent • Elusive, difficult, not always as desirable as it seems
Goals of Transparency • Hide where processes execute • Hide where data is stored • Hide where IPC goes to/comes from • Hide effects of failures
Access Transparency • Uniform method of access to local and remote resources • User doesn’t have to worry about where his resources are located • Implies that system must try to make very different operations look the same
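The idea of access transparency can be sketched in code: the caller asks for a resource by name and gets back something with a uniform `read()` interface, never seeing whether the data is local or remote. This is a minimal illustrative sketch, not any real system's API; the class names, the `location_table` argument, and the `open_file` helper are all hypothetical.

```python
class LocalFile:
    """Reads from the local file system."""
    def __init__(self, path):
        self.path = path

    def read(self):
        with open(self.path, "rb") as f:
            return f.read()


class RemoteFile:
    """Stand-in for a file fetched over the network (details hypothetical)."""
    def __init__(self, host, path):
        self.host, self.path = host, path

    def read(self):
        # A real system would issue a network request to self.host here.
        raise NotImplementedError("network fetch not implemented in this sketch")


def open_file(name, location_table):
    """Return a file object; the caller never sees whether it is local or remote."""
    host = location_table.get(name)
    if host is None or host == "localhost":
        return LocalFile(name)
    return RemoteFile(host, name)
```

The point is that the very different local and remote operations are hidden behind one interface, which is exactly what makes transparency expensive: the system, not the user, absorbs the difference.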
Location Transparency • Sometimes called name transparency • Users don’t know where resources are located • Users don’t worry about locations of objects • Users don’t worry when objects move • But system worries a lot
Failure Transparency • System looks the same even when components fail • User is insulated from effects of failures • System works like crazy to pretend five machines are the same as six
Why Is Transparency Important? • Distributed systems are hard • People find even single machines too complicated • The system must handle all unpleasant details that it can
Why Is Transparency Hard? • Transparency implies that the system worries about all the nasty issues • Different local/remote overheads • Hiding/handling failures • Translating user-level names to physical locations • The nasty issues are hard
So What? • Aren’t software systems supposed to handle the nasty details? • Yes, but . . . • If the state of the technology isn’t capable of handling them, transparency can be expensive and constraining • We aren’t smart enough to provide full transparency yet
Naming • One of the key recurring problems in distributed systems • How do you name both local and remote resources? • How do you resolve the names to physical locations?
Naming Local and Remote Resources • Does the resource have the same name locally and remotely? • If not, hard to work with remote resources • If so, requires keeping distributed data consistent
Resolving Names • Standard operating systems can resolve names for local resources • File system names, process names, etc. • All required information is local • How does the system map the name of a remote resource to its remote location? • Keep all required information local? • Or find it remotely?
A Simple Example in Distributed Naming • Node 1 creates File X • Node 2 tries to read File X • How does Node 2 even know File X exists?
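One common answer to the question above is a directory service: when a node creates a file, it registers the name with a service that other nodes can query. The sketch below is a toy, centralized version under assumed names (`DirectoryService`, `register`, `resolve`); real systems use DNS-like hierarchies or distributed directories.

```python
class DirectoryService:
    """Toy name service: maps resource names to the node that stores them."""
    def __init__(self):
        self._locations = {}

    def register(self, name, node):
        # Called when a node creates a resource, so other nodes can find it.
        self._locations[name] = node

    def resolve(self, name):
        # Return the node holding the resource, or None if it is unknown.
        return self._locations.get(name)


directory = DirectoryService()
directory.register("File X", "node1")
directory.resolve("File X")   # Node 2 can now locate File X on node1
```

Of course, centralizing the directory just moves the problem: now everyone must know where the directory is, and it becomes a single point of failure, which is why real designs distribute or replicate it.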
Problems in Naming • Naming remote resources • Consistency issues • Scaling issues • Name conflicts • Most of these are related to general problems of keeping distributed data consistent
Consistency • Many distributed systems support distributed computations • User computations running at more than one node • Even a pure data storage system raises consistency issues for writes, creates, and deletes • Unlike multi-process jobs on one node, these processes don’t share memory • Each other’s state is reachable only over a network • How do you ensure they’re synchronized?
Typical Problems in Consistency • Consistency in name spaces • Detecting file creations/deletions • Consistency in saved data • If caches/replication used, how does update to one copy change others? • Consistency in system state • Can I even reach agreement on what nodes are working?
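One standard way to reason about the "update one copy, change the others" problem is to tag each replica with a version number. The sketch below shows a naive write-all scheme under assumed names (`Replica`, `write`, `read`); it is illustrative only, not a real replication protocol, and it ignores failures and concurrent writers entirely.

```python
class Replica:
    """One copy of a data item, tagged with a version number."""
    def __init__(self):
        self.value = None
        self.version = 0


def write(replicas, new_value):
    """Update every replica with a fresh version: a naive 'write-all' scheme."""
    new_version = max(r.version for r in replicas) + 1
    for r in replicas:
        r.value, r.version = new_value, new_version


def read(replicas):
    """Return the value held by the replica with the highest version number."""
    return max(replicas, key=lambda r: r.version).value
```

The hard part, which this sketch dodges, is what happens when a write reaches only some replicas; later lectures on replicated data management and distributed agreement address exactly that.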
Failure and Recovery • In single machine system, failure typically halts entire system • In distributed system, one failed machine doesn’t halt the system • But what if the failed machine was performing part of a distributed computation?
Heterogeneity • What if not all the machines in the distributed system are the same? • Different processors, different CPU speeds, different configurations, etc. • Even seemingly homogeneous things are actually heterogeneous • Causes great problems
Security Challenges for Distributed Systems • Machines are doing things for remote machines • Do you know who you’re talking to? • Do you understand what he’s asking you to do? • Can you limit your risk? • Can you protect the distributed service that spans several machines?
Distributed System Architectures • Workstation-server model • Peer workstation model • Cloud model • Parallel computer model
Workstation-Server Model • Some machines are dedicated for user client use • Some machines are specially designated servers • Servers have special abilities and responsibilities
Characteristics of Workstation-Server Model • User workstations often lightly utilized • Waste of resources • Fast response when needed • Servers may be temporarily or permanently overloaded • Failure of important servers can have serious consequences
Server Systems and Load • Many services are very popular • One machine can’t handle all the load • Typically, divide the load among several machines • But that becomes complicated if clients need to worry about it • Transparent load balancing usually required
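Transparent load balancing can be as simple as one component that hands each request to the next server in turn, so clients see a single service name and never learn how many machines sit behind it. This is a minimal round-robin sketch with hypothetical names; real balancers also track server health and current load.

```python
import itertools


class LoadBalancer:
    """Spread requests across several servers behind a single service name."""
    def __init__(self, servers):
        # cycle() yields the server list over and over: round-robin dispatch.
        self._next = itertools.cycle(servers)

    def pick(self):
        """Choose the server that should handle the next request."""
        return next(self._next)


lb = LoadBalancer(["server-a", "server-b", "server-c"])
picks = [lb.pick() for _ in range(6)]
# Each server receives an equal share of the six requests.
```

Round-robin is the simplest policy; the transparency goal is met because clients only ever talk to the balancer, not to the individual machines.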
Peer Model • System is made up of individual users’ machines • Each servicing a particular user • Machines may act as servers for each other • But most machines not formally servers
Characteristics of Peer Model • Matches what most people want • My machine interoperates seamlessly with everyone else’s • Scaling challenges • Especially beyond LAN scale • Some peer services popular • NFS • Peer file sharing systems