
Introduction to Distributed System & PageRank Algorithm



Presentation Transcript


  1. Introduction to Distributed System & PageRank Algorithm http://net.pku.edu.cn/~course/cs402/2009/ Peng Bo (彭波) pb@net.pku.edu.cn School of Electronics Engineering and Computer Science, Peking University 7/7/2009

  2. Outline • Homework review • Fundamentals of distributed systems • MapReduce implementation of the PageRank algorithm • Course project

  3. Review of Lecture 1 • SaaS, PaaS, Utility Computing • "Data Center is a Computer" • Parallelism everywhere • Massive, Scalable, Reliable • Resource Management • Data Management • Programming Model & Tools

  4. Divide and Conquer [diagram: the "Work" is partitioned into units w1, w2, w3, each handled by a "worker" producing partial results r1, r2, r3, which are combined into the final "Result"]
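A minimal Java sketch of this partition / work / combine pattern (the class name, input data, and worker count are illustrative assumptions, not course code): the input is split into slices, each slice is summed by a worker thread, and the partial results are combined at the end.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Sketch of divide and conquer: partition -> parallel work -> combine.
public class DivideAndConquerSum {
    public static void main(String[] args) throws Exception {
        int[] work = new int[90];
        for (int i = 0; i < work.length; i++) work[i] = i;

        int workers = 3;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        int chunk = (work.length + workers - 1) / workers;

        // Partition: each worker gets one slice (w1, w2, w3) of the input.
        List<Future<Long>> partials = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int from = w * chunk;
            final int to = Math.min(from + chunk, work.length);
            partials.add(pool.submit(() -> {
                long r = 0;
                for (int i = from; i < to; i++) r += work[i];
                return r;            // partial result (r1, r2, r3)
            }));
        }

        // Combine: merge the partial results into the final answer.
        long result = 0;
        for (Future<Long> p : partials) result += p.get();
        pool.shutdown();
        System.out.println("result = " + result);
    }
}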

  5. What's MapReduce • A parallel/distributed computing programming model [diagram: the input is split, processed by mappers, shuffled, and reduced to the output]
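To make the input split / map / shuffle / reduce flow concrete, here is a hedged single-process simulation of word count in plain Java. It is not the Hadoop API used in the labs; the class and method names are assumptions chosen only to mirror the model.

import java.util.*;

// Single-process word-count simulation of the MapReduce data flow:
// map each input split to (word, 1) pairs, "shuffle" by grouping on the key,
// then reduce each group by summing its counts.
public class WordCountSimulation {
    // map: one input record -> a list of (key, value) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // reduce: one key and all of its values -> the final count
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        String[] splits = {"the data center is a computer", "the computer is parallel"};

        // shuffle: group the intermediate values by key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String split : splits) {
            for (Map.Entry<String, Integer> kv : map(split)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }

        // reduce phase: emit (word, total) for each key
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
    }
}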

  6. CodeLab1 • 1. Eclipse would not run at first, mainly because the local jar files had not been added. • 2. In the 0.13.0 VM the input directory could not be found, there was no /user directory, and uploads failed. Some students said the /user directory appeared after a restart, but I have not verified this. I created the input directory from the command line, which creates /user automatically. In the later 0.17.0 VM I have already put the input directory in place, so this problem no longer occurs. • 3. Unfamiliarity with Java. Many students reported difficulty, especially when adapting code for the 0.13.0 VM; it seems most have little Java background.

  7. HW1 Exercises • Have you ever encountered a Heisenbug? How did you isolate and fix it? • For the different failure types listed above, consider what makes each one difficult for a programmer trying to guard against it. What kinds of processing can be added to a program to deal with these failures? • Explain why each of the 8 fallacies is actually a fallacy. • Contrast TCP and UDP. Under what circumstances would you choose one over the other?

  8. Exercises • What's the difference between caching and data replication? • What are stubs in an RPC implementation? • What are some of the error conditions we need to guard against in a distributed environment that we do not need to worry about in a local programming environment? • Why are pointers (references) not usually passed as parameters to a Remote Procedure Call?

  9. Exercises • Here is an interesting problem called partial connectivity that can occur in a distributed environment. Let's say A and B are systems that need to talk to each other. C is a master that also talks to A and B individually. The communications between A and B fail. C can tell that A and B are both healthy. C tells A to send something to B and waits for this to occur. C has no way of knowing that A cannot talk to B, and thus waits and waits and waits. What diagnostics can you add in your code to deal with this situation?

  10. Exercises • This is the Byzantine Generals problem: Two generals are on hills either side of a valley. They each have an army of 1000 soldiers. In the woods in the valley is an enemy army of 1500 men. If each general attacks alone, his army will lose. If they attack together, they will win. They wish to send messengers through the valley to coordinate when to attack. However, the messengers may get lost or caught in the woods (or brainwashed into delivering different messages). How can they devise a scheme by which they either attack with high probability, or not at all?

  11. Introduction to Distributed System Parallelization & Synchronization

  12. Parallelization Idea • If the work can be cleanly split into n units, parallelization is very "easy".

  13. Parallelization Idea (2)

  14. Parallelization Idea (3)

  15. Parallelization Idea (4)

  16. Parallelization Pitfalls But this model is too simple! • How do we assign work units to worker threads? • What if we have more work units than threads? • How do we aggregate the results at the end? • How do we know all the workers have finished? • What if the work cannot be divided into completely separate tasks? (A small thread-pool sketch of the first four issues follows.) What is the common theme of all of these problems?
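A hedged Java sketch of one common answer to the first four pitfalls (the class name and the squaring "work unit" are illustrative assumptions): a fixed thread pool hands out many work units to a few workers, and the master both aggregates the results and learns when everything has finished.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// More work units than threads: 100 tasks, 4 worker threads.
public class WorkQueueSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        List<Callable<Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            final int unit = i;
            tasks.add(() -> unit * unit);   // a tiny, independent work unit
        }

        // invokeAll blocks until every work unit has finished,
        // which answers "how do we know all the workers have finished?"
        int total = 0;
        for (Future<Integer> f : pool.invokeAll(tasks)) {
            total += f.get();               // aggregate the results at the end
        }
        pool.shutdown();
        System.out.println("total = " + total);
    }
}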

  17. Parallelization Pitfalls (2) • The crux of all these problems is that multiple threads must communicate with one another, or access a shared resource. • Golden rule: Any memory that can be used by multiple threads must have an associated synchronization system!

  18. Process synchronization refers to the coordination of simultaneous threads or processes to complete a task, ensuring a correct runtime order and avoiding unexpected race conditions.

  19. And if you thought I was joking…

  20. What is Wrong With This?
Thread 1:
void foo() {
  x++;
  y = x;
}
Thread 2:
void bar() {
  y++;
  x++;
}
If the initial state is y = 0, x = 6, what happens after these threads finish running?

  21. Multithreaded = Unpredictability • When we run a multithreaded program, we don’t know what order threads run in, nor do we know when they will interrupt one another. • Many things that look like “one step” operations actually take several steps under the hood:
Thread 1:
void foo() {
  eax = mem[x];
  inc eax;
  mem[x] = eax;
  ebx = mem[x];
  mem[y] = ebx;
}
Thread 2:
void bar() {
  eax = mem[y];
  inc eax;
  mem[y] = eax;
  eax = mem[x];
  inc eax;
  mem[x] = eax;
}
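A small Java demonstration of this unpredictability (the class name and iteration count are illustrative assumptions): two threads increment a shared counter without synchronization, and because "x++" is really load / increment / store, updates are often lost and the final value is usually below the expected total.

// Demonstration of a lost-update race on a shared counter.
public class RaceDemo {
    static int x = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable increment = () -> {
            for (int i = 0; i < 100_000; i++) {
                x++;   // really: load x, add 1, store x -- not atomic
            }
        };
        Thread t1 = new Thread(increment);
        Thread t2 = new Thread(increment);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("x = " + x + " (expected 200000)");
    }
}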

  22. Multithreaded = Unpredictability This applies to more than just integers: • Pulling work units from a queue • Reporting work back to master unit • Telling another thread that it can begin the “next phase” of processing … All require synchronization!

  23. Synchronization Primitives • Lock • Semaphore • Barriers
Thread 1:
void foo() {
  sem.lock();
  x++;
  y = x;
  sem.unlock();
}
Thread 2:
void bar() {
  sem.lock();
  y++;
  x++;
  sem.unlock();
}
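One way to express the slide's sem.lock()/sem.unlock() in Java is with a ReentrantLock (a binary semaphore would also work); this hedged sketch guards the shared variables so the two updates cannot interleave. The class name and initial values follow the earlier slide but are otherwise assumptions.

import java.util.concurrent.locks.ReentrantLock;

// The slide's two threads, with the shared variables guarded by one lock.
public class GuardedCounters {
    static int x = 6, y = 0;
    static final ReentrantLock sem = new ReentrantLock();

    static void foo() {            // Thread 1
        sem.lock();
        try {
            x++;
            y = x;
        } finally {
            sem.unlock();
        }
    }

    static void bar() {            // Thread 2
        sem.lock();
        try {
            y++;
            x++;
        } finally {
            sem.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(GuardedCounters::foo);
        Thread t2 = new Thread(GuardedCounters::bar);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("x = " + x + ", y = " + y);
    }
}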

  24. Too Much Synchronization? Deadlock
Thread A:
semaphore1.lock();
semaphore2.lock();
/* use data guarded by semaphores */
semaphore1.unlock();
semaphore2.unlock();
Thread B:
semaphore2.lock();
semaphore1.lock();
/* use data guarded by semaphores */
semaphore1.unlock();
semaphore2.unlock();
(Image: RPI CSCI.4210 Operating Systems notes)
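A hedged sketch of the standard remedy (class and lock names are illustrative, not from the course code): both threads acquire the locks in the same global order, semaphore1 before semaphore2, so the circular wait in the slide's example cannot occur.

import java.util.concurrent.locks.ReentrantLock;

// Deadlock avoidance by a consistent lock-acquisition order.
public class LockOrdering {
    static final ReentrantLock semaphore1 = new ReentrantLock();
    static final ReentrantLock semaphore2 = new ReentrantLock();

    static void useGuardedData(String who) {
        semaphore1.lock();          // always acquired first
        semaphore2.lock();          // always acquired second
        try {
            System.out.println(who + " is using the shared data");
        } finally {
            semaphore2.unlock();
            semaphore1.unlock();
        }
    }

    public static void main(String[] args) {
        new Thread(() -> useGuardedData("Thread A")).start();
        new Thread(() -> useGuardedData("Thread B")).start();
    }
}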

  25. And if you thought I was joking…

  26. The Moral: Be Careful! • Synchronization is hard • Need to consider all possible shared state • Must keep locks organized and use them consistently and correctly • Knowing there are bugs may be tricky; fixing them can be even worse! • Keeping shared state to a minimum reduces total system complexity

  27. Introduction to Distributed System Fundamentals of Networking

  28. Networking terms: TCP, IP, UDP, HTTP, FTP, protocol, port, socket, router, switch, gateway, firewall

  29. What makes this work? • Underneath the socket layer are several more protocols • Most important are TCP and IP (which are used hand-in-hand so often, they’re often spoken of as one protocol: TCP/IP)
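As a hedged illustration of the socket layer sitting on top of TCP/IP (class name, port number, and message are assumptions for the example), here is a minimal Java echo exchange: a server accepts one connection and echoes a line back to the client.

import java.io.*;
import java.net.*;

// Minimal TCP example: the socket API hides the TCP/IP machinery underneath.
public class EchoExample {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(9000);   // port 9000 is arbitrary

        // Server side: accept one connection and echo a single line.
        Thread serverThread = new Thread(() -> {
            try (Socket client = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()));
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                out.println("echo: " + in.readLine());
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        serverThread.start();

        // Client side: connect, send one line, print the reply.
        try (Socket socket = new Socket("localhost", 9000);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println("hello over TCP");
            System.out.println(in.readLine());
        }

        serverThread.join();
        server.close();
    }
}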

  30. Why is This Necessary? • Not actually tube-like “underneath the hood” • Unlike phone system (circuit switched), the packet switched Internet uses many routes at once

  31. Networking Issues • If a party to a socket disconnects, how much data did they receive? • Did they crash? Or did a machine in the middle? • Can someone in the middle intercept/modify our data? • Traffic congestion makes switch/router topology important for efficient throughput

  32. Introduction to Distributed System Distributed Systems

  33. Outline • Models of computation • Connecting distributed modules • Failure & reliability

  34. Models of Computation: Flynn's Taxonomy • Instruction streams: Single (SI) or Multiple (MI) • Data streams: Single (SD) or Multiple (MD)

  35. SISD [diagram: one processor, a single instruction stream operating on a single stream of data items]

  36. SIMD [diagram: one processor, a single instruction stream applied in lockstep to many data streams D0 … Dn]

  37. MIMD [diagram: multiple processors, each with its own instruction stream and its own data stream]

  38. [diagram: a single machine: a processor, memory holding instructions and data, and an interface to the external world]

  39. [diagram: a shared-memory multiprocessor: several processors share one memory of instructions and data, each with its own interface to the external world]

  40. [diagram: two such machines, each with its own memory of instructions and data, connected by a network]

  41. [diagram: multiple multiprocessor nodes, each with its own memory of instructions and data, connected by a network]

  42. Outline • Models of computation • Connecting distributed modules • Failure & reliability

  43. System Organization • Having one big memory would make it a huge bottleneck • Eliminates all of the parallelism

  44. CTA (Candidate Type Architecture): Memory is Distributed

  45. Interconnect Networks • Bottleneck in the CTA is transferring values from one local memory to another • Interconnect network design very important; several options are available • Design constraint: How to minimize interconnect network usage?

  46. A Brief History… 1985-95 • “Massively parallel architectures” start rising in prominence • Message Passing Interface (MPI) and other libraries developed • Bandwidth was a big problem • For external interconnect networks in particular

  47. A Brief History… 1995-Today • Cluster/grid architecture increasingly dominant • Special node machines eschewed in favor of COTS technologies • Web-wide cluster software • Companies like Google take this to the extreme (10,000 node clusters)

  48. More About Interconnects • Several types of interconnect possible • Bus • Crossbar • Torus • Tree
