Introduction to Distributed System & PageRank Algorithm http://net.pku.edu.cn/~course/cs402/2009/ 彭波 (Peng Bo) pb@net.pku.edu.cn School of Electronics Engineering and Computer Science, Peking University 7/7/2009
Outline • Homework review • Distributed system fundamentals • MapReduce implementation of the PageRank algorithm • Course project
Review of Lecture 1 • SaaS • PaaS • Utility Computing • “Data Center is a Computer” • Parallelism everywhere • Massive • Scalable • Reliable • Resource Management • Data Management • Programming Model & Tools
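Divide and Conquer • Partition: the “Work” is split into units w1, w2, w3 • Each unit is handed to a “worker”, which produces a partial result r1, r2, r3 • Combine: the partial results are merged into the “Result”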
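The partition / worker / combine picture can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names (`partition`, `worker`, `combine`, `divide_and_conquer`) are hypothetical, the "work" is assumed to be a list of numbers to sum, and a thread pool stands in for the workers.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(work, n):
    """Split the work into at most n roughly equal units (w1 .. wn)."""
    size = max(1, (len(work) + n - 1) // n)  # ceiling division, never zero
    return [work[i:i + size] for i in range(0, len(work), size)]

def worker(unit):
    """Each worker turns one work unit into a partial result (r_i)."""
    return sum(unit)

def combine(partials):
    """Merge the partial results into the final result."""
    return sum(partials)

def divide_and_conquer(work, n_workers=3):
    units = partition(work, n_workers)
    # The pool hands each unit to a worker thread; map() keeps input order.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, units))
    return combine(partials)

result = divide_and_conquer(list(range(1, 101)))  # sum of 1..100
```

Summation is used here because it splits cleanly: no unit needs anything from another unit, so the workers never have to communicate.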
What’s MapReduce • Parallel/Distributed Computing Programming Model • [Figure: data-flow diagram with stages labeled Input, split, shuffle, output]
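As a rough illustration of the programming model, here is a single-process word count written in the split, map, shuffle, reduce shape. It is a sketch under assumptions, not the Hadoop API: everything runs in memory, and the names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (key, value) pair for every word in one input split."""
    for word in split.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single output record."""
    return key, sum(values)

def run_job(splits):
    mapped = [pair for s in splits for pair in map_phase(s)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, vs) for k, vs in grouped.items())

# Three input splits; in a real cluster each could go to a different machine.
counts = run_job(["the quick brown fox", "the lazy dog", "The fox"])
```

Because each split is mapped independently and each key is reduced independently, both phases could be spread across many workers; only the shuffle needs data to move between them.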
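CodeLab1 • 1. At first Eclipse could not run the program, mainly because the local jar files had not been added. • 2. In the 0.13.0 virtual machine the input directory cannot be found, there is no /user directory either, and uploading fails. Some students said the /user directory appears after a restart; I have not verified this. I added the input directory from the command line, which creates the /user directory automatically. In the later 0.17.0 virtual machine I have already put the Input directory in place, so the problem does not occur. • 3. Unfamiliarity with Java. Many people reported difficulty, especially when modifying code for the 0.13.0 virtual machine; it seems most students have little Java background.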
HW1 Exercises • Have you ever encountered a Heisenbug? How did you isolate and fix it? • For the different failure types listed above, consider what makes each one difficult for a programmer trying to guard against it. What kinds of processing can be added to a program to deal with these failures? • Explain why each of the 8 fallacies is actually a fallacy. • Contrast TCP and UDP. Under what circumstances would you choose one over the other?
Exercises • What's the difference between caching and data replication? • What are stubs in an RPC implementation? • What are some of the error conditions we need to guard against in a distributed environment that we do not need to worry about in a local programming environment? • Why are pointers (references) not usually passed as parameters to a Remote Procedure Call?
Exercises • Here is an interesting problem called partial connectivity that can occur in a distributed environment. Let's say A and B are systems that need to talk to each other. C is a master that also talks to A and B individually. The communications between A and B fail. C can tell that A and B are both healthy. C tells A to send something to B and waits for this to occur. C has no way of knowing that A cannot talk to B, and thus waits and waits and waits. What diagnostics can you add in your code to deal with this situation?
Exercises • This is the Byzantine Generals problem: Two generals are on hills either side of a valley. They each have an army of 1000 soldiers. In the woods in the valley is an enemy army of 1500 men. If each general attacks alone, his army will lose. If they attack together, they will win. They wish to send messengers through the valley to coordinate when to attack. However, the messengers may get lost or caught in the woods (or brainwashed into delivering different messages). How can they devise a scheme by which they either attack with high probability, or not at all?
Introduction to Distributed System Parallelization & Synchronization
Parallelization Idea • If the work can be cleanly split into n units, parallelizing it is very “easy”.
Parallelization Pitfalls But this model is too simple! • Assigning work: How do we assign work units to worker threads? • Mismatch between workers and work units: What if we have more work units than threads? • Aggregating results: How do we aggregate the results at the end? • Detecting completion: How do we know all the workers have finished? • Subtasks that are not independent: What if the work cannot be divided into completely separate tasks? What is the common theme of all of these problems?
Parallelization Pitfalls (2) • The key to all of these problems: multiple threads must communicate with one another, or access a shared resource. • Golden rule: Any memory that can be used by multiple threads must have an associated synchronization system!
Process synchronization refers to the coordination of simultaneous threads or processes to complete a task in order to get correct runtime order and avoid unexpected race conditions.
What is Wrong With This?
Thread 1:
    void foo() {
        x++;
        y = x;
    }
Thread 2:
    void bar() {
        y++;
        x++;
    }
If the initial state is y = 0, x = 6, what happens after these threads finish running?
Multithreaded = Unpredictability • When we run a multithreaded program, we don’t know what order threads run in, nor do we know when they will interrupt one another. • Many things that look like “one step” operations actually take several steps under the hood:
Thread 1:
    void foo() {
        eax = mem[x];
        inc eax;
        mem[x] = eax;
        ebx = mem[x];
        mem[y] = ebx;
    }
Thread 2:
    void bar() {
        eax = mem[y];
        inc eax;
        mem[y] = eax;
        eax = mem[x];
        inc eax;
        mem[x] = eax;
    }
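To see how many different results these low-level steps allow, the sketch below enumerates every interleaving of the two step sequences and records the final values of x and y. It is written in Python rather than the slide's pseudo-assembly, and it rests on two assumptions: each pseudo-instruction executes atomically, and each thread has its own private registers. The names `FOO`, `BAR`, `step`, and `final_states` are hypothetical.

```python
from itertools import combinations

# The two step sequences, transcribed from the pseudo-assembly.
FOO = [("load", "eax", "x"), ("inc", "eax"), ("store", "x", "eax"),
       ("load", "ebx", "x"), ("store", "y", "ebx")]
BAR = [("load", "eax", "y"), ("inc", "eax"), ("store", "y", "eax"),
       ("load", "eax", "x"), ("inc", "eax"), ("store", "x", "eax")]

def step(op, mem, regs):
    """Execute one pseudo-instruction against shared memory and private registers."""
    if op[0] == "load":
        regs[op[1]] = mem[op[2]]
    elif op[0] == "inc":
        regs[op[1]] += 1
    else:  # "store"
        mem[op[1]] = regs[op[2]]

def final_states(x=6, y=0):
    """Return the set of (x, y) values reachable over all interleavings."""
    outcomes = set()
    total = len(FOO) + len(BAR)
    # Choosing which time slots belong to foo() fixes one interleaving.
    for foo_slots in combinations(range(total), len(FOO)):
        mem = {"x": x, "y": y}
        regs = ({}, {})   # registers of thread 1 and thread 2
        pcs = [0, 0]      # next instruction of each thread
        for slot in range(total):
            t = 0 if slot in foo_slots else 1
            program = FOO if t == 0 else BAR
            step(program[pcs[t]], mem, regs[t])
            pcs[t] += 1
        outcomes.add((mem["x"], mem["y"]))
    return outcomes

outcomes = final_states()
```

Starting from x = 6, y = 0, the set contains (8, 8), which is what either serial order produces, but also states such as (7, 7) and (8, 1), where one thread's update was overwritten by the other.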
Multithreaded = Unpredictability This applies to more than just integers: • Pulling work units from a queue • Reporting work back to master unit • Telling another thread that it can begin the “next phase” of processing … All require synchronization!
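A minimal sketch of the first two items, assuming Python's thread-safe `queue.Queue` as the shared structure: workers pull units from one queue and report results to another, and `join()` tells the master that every worker has finished. The function name `run_workers` and the squaring "work" are hypothetical.

```python
import queue
import threading

def run_workers(work_units, n_threads=3):
    tasks = queue.Queue()     # work units waiting to be pulled
    results = queue.Queue()   # results reported back to the master
    for unit in work_units:
        tasks.put(unit)

    def worker():
        while True:
            try:
                unit = tasks.get_nowait()  # Queue does its own locking
            except queue.Empty:
                return                     # no work left: this worker is done
            results.put(unit * unit)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # the master waits here until all workers have finished

    collected = []
    while not results.empty():
        collected.append(results.get())
    return sorted(collected)  # arrival order is unpredictable, so sort

squares = run_workers(range(10))
```

There are more work units than threads here, which is fine: each thread keeps pulling until the queue is empty. All units are enqueued before any worker starts, so an empty queue really does mean the work is finished.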
Synchronization Primitives • Lock • Semaphore • Barriers
Thread 1:
    void foo() {
        sem.lock();
        x++;
        y = x;
        sem.unlock();
    }
Thread 2:
    void bar() {
        sem.lock();
        y++;
        x++;
        sem.unlock();
    }
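The locked version of foo() and bar() can be tried out directly. The sketch below is a Python rendering, with `threading.Lock` standing in for the slide's `sem` object; the helper `run_once` is hypothetical and exists only to reset the state and run both threads.

```python
import threading

x, y = 6, 0
lock = threading.Lock()  # stands in for `sem` on the slide

def foo():
    global x, y
    with lock:   # acquire on entry, release on exit
        x += 1
        y = x

def bar():
    global x, y
    with lock:
        y += 1
        x += 1

def run_once():
    """Reset the shared state, run both threads, and return the final (x, y)."""
    global x, y
    x, y = 6, 0
    threads = [threading.Thread(target=foo), threading.Thread(target=bar)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x, y

final = run_once()
```

With the lock held around each body, only the two serial orders remain. For this particular pair of functions both orders end at x = 8, y = 8, so the result no longer depends on scheduling. In general a lock only guarantees that the bodies do not interleave, not that every order gives the same answer.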
Too Much Synchronization? Deadlock
Thread A:
    semaphore1.lock();
    semaphore2.lock();
    /* use data guarded by semaphores */
    semaphore1.unlock();
    semaphore2.unlock();
Thread B:
    semaphore2.lock();
    semaphore1.lock();
    /* use data guarded by semaphores */
    semaphore1.unlock();
    semaphore2.unlock();
(Image: RPI CSCI.4210 Operating Systems notes)
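The deadlock arises because Thread A and Thread B take the two locks in opposite orders, so each can end up holding one lock while waiting for the other. One common remedy, sketched here in Python as an illustration rather than as the slide's own fix, is to make every thread acquire the locks in the same fixed order. The names `lock1`, `lock2`, `use_both`, and `run` are hypothetical.

```python
import threading

lock1 = threading.Lock()
lock2 = threading.Lock()
log = []

def use_both(name):
    # Every thread takes lock1 first and lock2 second, so no thread can
    # hold lock2 while waiting for lock1.
    with lock1:
        with lock2:
            log.append(name)  # use data guarded by both locks

def run():
    threads = [threading.Thread(target=use_both, args=(n,)) for n in ("A", "B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)     # a deadlocked thread would still be alive
    return all(not t.is_alive() for t in threads)

finished = run()
```

The timeout on `join()` is only a diagnostic: with a consistent lock order both threads always finish, whereas the opposite-order version could leave both threads blocked forever.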
The Moral: Be Careful! • Synchronization is hard • Need to consider all possible shared state • Must keep locks organized and use them consistently and correctly • Knowing there are bugs may be tricky; fixing them can be even worse! • Keeping shared state to a minimum reduces total system complexity
Introduction to Distributed System Fundamentals of Networking
TCP ROUTER IP FTP UDP HTTP Gateway Protocol PORT SOCKET SWITCH Firewall
What makes this work? • Underneath the socket layer are several more protocols • Most important are TCP and IP (which are used hand-in-hand so often, they’re often spoken of as one protocol: TCP/IP)
Why is This Necessary? • Not actually tube-like “underneath the hood” • Unlike phone system (circuit switched), the packet switched Internet uses many routes at once
Networking Issues • If a party to a socket disconnects, how much data did they receive? • Did they crash? Or did a machine in the middle? • Can someone in the middle intercept/modify our data? • Traffic congestion makes switch/router topology important for efficient throughput
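For the first question, one common approach is to let the application track delivery itself: prefix each message with its length and have the receiver acknowledge complete messages. The sketch below illustrates this with Python's `socket.socketpair()`, which gives two connected stream sockets inside one process. The framing format (a 4-byte big-endian length) and the helper names `send_msg`, `recv_exact`, and `recv_msg` are assumptions made for the example, not part of any standard protocol.

```python
import socket
import struct

def send_msg(sock, payload):
    """Send one message as a 4-byte big-endian length followed by the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer disconnects first."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:  # empty read means the other side closed the socket
            raise ConnectionError("peer closed before the full message arrived")
        buf += chunk
    return buf

def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)

a, b = socket.socketpair()

# Normal case: the receiver gets the whole message and acknowledges it.
send_msg(a, b"hello")
received = recv_msg(b)
send_msg(b, b"ACK")
ack = recv_msg(a)

# Failure case: the sender announces 10 bytes but disconnects after 4.
a.sendall(struct.pack("!I", 10) + b"part")
a.close()
try:
    recv_msg(b)
    truncated_detected = False
except ConnectionError:
    truncated_detected = True
b.close()
```

The sender only treats a message as delivered once the acknowledgement comes back, and the receiver can tell a truncated message from a complete one. Neither side can tell from this alone whether the peer crashed or a machine in the middle failed.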
Introduction to Distributed System Distributed Systems
Outline • Models of computation • Connecting distributed modules • Failure & reliability
Models of Computation: Flynn’s Taxonomy
                       Single Instruction (SI)    Multiple Instruction (MI)
Single Data (SD)       SISD                       MISD
Multiple Data (MD)     SIMD                       MIMD
SISD • [Figure: one processor applies a single instruction stream to a single stream of data items D]
SIMD • [Figure: one processor applies a single instruction stream to several data streams D0, D1, … Dn at once]
MIMD • [Figure: two processors, each with its own instruction stream and its own stream of data items D]
[Figure: a single processor exchanging instructions and data with one memory (instructions and data), plus an interface to the external world]
[Figure: four processors, each with its own interface to the external world, all exchanging instructions and data with one shared memory]
[Figure: four processors, each with its own memory (instructions and data) and interface to the external world, connected through a network]
[Figure: four memories (instructions and data), each shared by two processors, with the interfaces to the external world connected through a network]
Outline • Models of computation • Connecting distributed modules • Failure & reliability
System Organization • Having one big memory would make it a huge bottleneck • Eliminates all of the parallelism
Interconnect Networks • Bottleneck in the CTA (Candidate Type Architecture) is transferring values from one local memory to another • Interconnect network design very important; several options are available • Design constraint: How to minimize interconnect network usage?
A Brief History… 1985-95 • “Massively parallel architectures” start rising in prominence • Message Passing Interface (MPI) and other libraries developed • Bandwidth was a big problem • For external interconnect networks in particular
A Brief History… 1995-Today • Cluster/grid architecture increasingly dominant • Special node machines eschewed in favor of COTS technologies • Web-wide cluster software • Companies like Google take this to the extreme (10,000 node clusters)
More About Interconnects • Several types of interconnect possible • Bus • Crossbar • Torus • Tree