
The Manycore Shift: Making Parallel Computing Mainstream


Presentation Transcript


  1. The Manycore Shift: Making Parallel Computing Mainstream Wishful thinking? Bart J.F. De Smet bartde@microsoft.com http://blogs.bartdesmet.net/bart Software Development Engineer Microsoft Corporation Session Code: DTL206

  2. Agenda • The concurrency landscape • Language headaches • .NET 4.0 facilities • Task Parallel Library • PLINQ • Coordination Data Structures • Asynchronous programming • Incubation projects • Summary

  3. Let’s sell processors Moore’s law • The number of transistors incorporated in a chip will approximately double every 24 months. Gordon Moore – Intel – 1965

  4. Let’s sell even more processors Moore’s law today • It can't continue forever. The nature of exponentials is that you push them out and eventually disaster happens. Gordon Moore – Intel – 2005

  5. Problem statement • Shared mutable state • Needs synchronization primitives • Locks are problematic • Risk for contention • Poor discoverability (SyncRoot anyone?) • Not composable • Difficult to get right (deadlocks, etc.) • Coarse-grained concurrency • Threads well-suited for large units of work • Expensive context switching • Asynchronous programming

  6. Microsoft Parallel Computing Initiative Constructing Parallel Applications Executing fine-grain Parallel Applications VB C# F# Coordinating system resources/services

  7. Agenda • The concurrency landscape • Language headaches • .NET 4.0 facilities • Task Parallel Library • PLINQ • Coordination Data Structures • Asynchronous programming • Incubation projects • Summary

  8. Languages: two extremes • Mutable state: Fortran heritage (C, C++, C#, VB) • No mutable state: fundamentalist functional programming, LISP heritage (Haskell, ML) • F# sits between the two

  9. Mutability
     • Mutable by default (C# et al.): synchronization required
         int x = 5;
         // Share out x
         x++;
     • Immutable by default (F# et al.): no locking required
         let x = 5
         // Share out x
         // Can’t mutate x
     • Explicit opt-in to mutation (F#)
         let mutable x = 5
         // Share out x
         x <- x + 1

  10. Side-effects will kill you
      • Elimination of common sub-expressions?
      • Runtime out of control
      • Can’t optimize code
      • Types don’t reveal side-effects
      • Haskell concept of IO monad
      • Did you know? LINQ is a monad!
      Two expressions a compiler cannot safely treat as interchangeable:
          let now = DateTime.Now in (now, now)
          (DateTime.Now, DateTime.Now)
      The signature does not reveal the side-effect:
          static DateTime Now { get; }
      Source: www.cse.chalmers.se

  11. Languages: two roadmaps? Haskell • Making C# better • Add safety nets? • Immutability • Purity constructs • Linear types • Software Transactional Memory • Kamikaze-style of concurrency • Simplify common patterns • Making Haskell mainstream • Just right? Too academic? • Not a smooth upgrade path? Nirvana C#

  12. Agenda • The concurrency landscape • Language headaches • .NET 4.0 facilities • Task Parallel Library • PLINQ • Coordination Data Structures • Asynchronous programming • Incubation projects • Summary

  13. Parallel Extensions Architecture
      • Compilers (C#, VB, C++, F#, other .NET compilers) produce IL for the .NET program
      • .NET program: declarative queries (PLINQ) and parallel algorithms (TPL or CDS)
      • PLINQ execution engine
        • Query analysis
        • Data partitioning: chunk, range, hash, striped, repartitioning
        • Operator types: map, scan, build, search, reduction
        • Merging: async (pipeline), synch, order preserving, sorting, ForAll
      • Task Parallel Library (TPL): task APIs, task parallelism, futures, scheduling
      • Coordination Data Structures: thread-safe collections, synchronization types, coordination types
      • OS scheduling primitives (also UMS in Windows 7 and up)
      • Processors: Proc 1 … Proc p

  14. Task Parallel Library – Tasks • System.Threading.Tasks • Task • Parent-child relationships • Explicit grouping • Waiting and cancelation • Task<T> • Tasks that produce values • Also known as futures
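
The bullets on this slide can be made concrete with a small sketch. The C# below is an illustration only, not code from the talk: the class and method names (TaskSketch, SumOfSquares, RunParentWithChildren, IsCanceledWhenTokenCanceled) are hypothetical, while the Task, Task<T>, TaskCreationOptions and CancellationTokenSource calls are the ones shipped in System.Threading.Tasks and System.Threading.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TaskSketch
{
    // Task<T> as a future: the computation is scheduled right away and
    // Result blocks the caller until the value is available.
    public static int SumOfSquares(int n)
    {
        Task<int> future = Task.Factory.StartNew(() =>
        {
            int sum = 0;
            for (int i = 1; i <= n; i++) sum += i * i;
            return sum;
        });
        return future.Result;
    }

    // Parent-child relationship: Wait() on the parent also waits for
    // every child created with AttachedToParent.
    public static int RunParentWithChildren(int childCount)
    {
        int finished = 0;
        Task parent = Task.Factory.StartNew(() =>
        {
            for (int c = 0; c < childCount; c++)
            {
                Task.Factory.StartNew(
                    () => { Interlocked.Increment(ref finished); },
                    TaskCreationOptions.AttachedToParent);
            }
        });
        parent.Wait();
        return finished;
    }

    // Cancellation: a task started with an already canceled token never runs.
    public static bool IsCanceledWhenTokenCanceled()
    {
        var cts = new CancellationTokenSource();
        cts.Cancel();
        Task task = Task.Factory.StartNew(() => { }, cts.Token);
        try { task.Wait(); }
        catch (AggregateException) { /* wraps a TaskCanceledException */ }
        return task.IsCanceled;
    }

    public static void Main()
    {
        Console.WriteLine(SumOfSquares(10));
        Console.WriteLine(RunParentWithChildren(3));
        Console.WriteLine(IsCanceledWhenTokenCanceled());
    }
}
```

The parent is started with Task.Factory.StartNew rather than Task.Run in this sketch because Task.Run creates tasks that refuse attached children.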

  15. Work Stealing • Internally, the runtime uses • Work stealing techniques • Lock-free concurrent task queues • Work stealing has provably • Good locality • Work distribution properties (Diagram: numbered tasks queued across processors p1, p2, p3)

  16. Example code to parallelize
        void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
        {
            for (int i = 0; i < size; i++)
            {
                for (int j = 0; j < size; j++)
                {
                    result[i, j] = 0;
                    for (int k = 0; k < size; k++)
                    {
                        result[i, j] += m1[i, k] * m2[k, j];
                    }
                }
            }
        }

  17. Solution today
      Callouts: Knowledge of Synchronization Primitives • Static Work Distribution • High Overhead • Error Prone • Tricks • Lack of Thread Reuse • Heavy Synchronization
        int N = size;
        int P = 2 * Environment.ProcessorCount;
        int Chunk = N / P;                                    // size of a work chunk
        ManualResetEvent signal = new ManualResetEvent(false);
        int counter = P;                                      // counter limits kernel transitions
        for (int c = 0; c < P; c++)                           // for each chunk
        {
            ThreadPool.QueueUserWorkItem(o =>
            {
                int lc = (int)o;
                for (int i = lc * Chunk;                      // process one chunk
                     i < (lc + 1 == P ? N : (lc + 1) * Chunk); // respect upper bound
                     i++)
                {
                    // original loop body
                    for (int j = 0; j < size; j++)
                    {
                        result[i, j] = 0;
                        for (int k = 0; k < size; k++)
                        {
                            result[i, j] += m1[i, k] * m2[k, j];
                        }
                    }
                }
                if (Interlocked.Decrement(ref counter) == 0)  // efficient interlocked ops
                {
                    signal.Set();                             // and kernel transition only when done
                }
            }, c);
        }
        signal.WaitOne();

  18. Solution with Parallel Extensions (structured parallelism)
        void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
        {
            Parallel.For(0, size, i =>
            {
                for (int j = 0; j < size; j++)
                {
                    result[i, j] = 0;
                    for (int k = 0; k < size; k++)
                    {
                        result[i, j] += m1[i, k] * m2[k, j];
                    }
                }
            });
        }

  19. Task Parallel Library – Loops
      • Common source of work in programs
      • System.Threading.Tasks.Parallel class
      • Parallelism when iterations are independent
        • Body doesn’t depend on mutable state (why immutability gains attention)
        • E.g. static variables, writing to local variables used in subsequent iterations
      • Synchronous
        • All iterations finish, regularly or exceptionally
      Sequential loops:
          for (int i = 0; i < n; i++) work(i);
          …
          foreach (T e in data) work(e);
      Parallel equivalents:
          Parallel.For(0, n, i => work(i));
          …
          Parallel.ForEach(data, e => work(e));

  20. demo Task Parallel Library Bart J.F. De Smet Software Development Engineer Microsoft Corporation

  21. Amdahl’s law by example Theoretical maximum speedup determined by amount of sequential (non-parallelizable) code

  22. Performance Tips • Compute intensive and/or large data sets • Work done should be at least 1,000s of cycles • Do not be gratuitous in task creation • Lightweight, but still requires object allocation, etc. • Parallelize only outer loops where possible • Unless N is insufficiently large to offer enough parallelism • Prefer isolation & immutability over synchronization • Synchronization == !Scalable • Try to avoid shared data • Have realistic expectations • Amdahl’s Law • Speedup will be fundamentally limited by the amount of sequential computation • Gustafson’s Law • But what if you add more data, thus increasing the parallelizable percentage of the application?
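
The two laws named on this slide can be written out as follows. This is the usual textbook formulation rather than anything shown in the deck; p (the parallelizable fraction of the work) and N (the number of processors) are symbols introduced here, and the numbers are a made-up example.

```latex
% Amdahl: fixed problem size, p = parallelizable fraction, N = processors
\[
  S_{\text{Amdahl}}(N) = \frac{1}{(1 - p) + \frac{p}{N}},
  \qquad
  \lim_{N \to \infty} S_{\text{Amdahl}}(N) = \frac{1}{1 - p}
\]

% Example with p = 0.9 and N = 8
\[
  S_{\text{Amdahl}}(8) = \frac{1}{0.1 + 0.9/8} = \frac{1}{0.2125} \approx 4.71,
  \qquad
  \text{upper bound } \frac{1}{0.1} = 10
\]

% Gustafson: problem size grows with N, p measured on the scaled workload
\[
  S_{\text{Gustafson}}(N) = (1 - p) + pN,
  \qquad
  S_{\text{Gustafson}}(8) = 0.1 + 0.9 \cdot 8 = 7.3
\]
```

With 10% sequential code, eight processors give well under a 5x speedup on a fixed workload, and no processor count gets past 10x; adding more data is what moves the picture toward Gustafson's more optimistic figure.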

  23. Parallel LINQ (PLINQ)
      • Enable LINQ developers to leverage parallel hardware
      • Fully supports all .NET Standard Query Operators
      • Abstracts away the hard work of using parallelism
      • Partitions and merges data intelligently (classic data parallelism)
      • Minimal impact to existing LINQ programming model
      • AsParallel extension method
      • Optional preservation of input ordering (AsOrdered)
      • Query syntax enables runtime to auto-parallelize
      • Automatic way to generate more Tasks, like Parallel
      • Graph analysis determines how to do it
      • Very little synchronization internally: highly efficient
      Example (the slide highlights the added .AsParallel() call):
          var q = from p in people.AsParallel()
                  where p.Name == queryInfo.Name &&
                        p.State == queryInfo.State &&
                        p.Year >= yearStart &&
                        p.Year <= yearEnd
                  orderby p.Year ascending
                  select p;
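
The query on this slide is a fragment, so here is a self-contained sketch that can be compiled and run. The Person type, the sample data and the MatchingYears helper are assumptions made for illustration; only AsParallel and the standard query operators come from the framework.

```csharp
using System;
using System.Linq;

public static class PlinqSketch
{
    // Hypothetical record type standing in for the slide's "people" elements.
    public sealed class Person
    {
        public string Name;
        public string State;
        public int Year;
    }

    // Same shape as the slide's query: filter in parallel, then order by year.
    public static int[] MatchingYears(Person[] people, string name, string state,
                                      int yearStart, int yearEnd)
    {
        var q = from p in people.AsParallel()
                where p.Name == name && p.State == state &&
                      p.Year >= yearStart && p.Year <= yearEnd
                orderby p.Year ascending
                select p.Year;
        return q.ToArray();
    }

    public static void Main()
    {
        var people = new[]
        {
            new Person { Name = "Ann", State = "WA", Year = 2005 },
            new Person { Name = "Ann", State = "WA", Year = 1999 },
            new Person { Name = "Bob", State = "WA", Year = 2001 },
            new Person { Name = "Ann", State = "OR", Year = 2003 },
            new Person { Name = "Ann", State = "WA", Year = 2012 },
        };
        Console.WriteLine(string.Join(", ", MatchingYears(people, "Ann", "WA", 1995, 2010)));
    }
}
```

For an input this small PLINQ will not be faster than plain LINQ; the point is only that the query text is unchanged apart from the AsParallel call.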

  24. demo PLINQ Bart J.F. De Smet Software Development Engineer Microsoft Corporation

  25. Coordination Data Structures • New synchronization primitives (System.Threading) • Barrier • Multi-phased algorithm • Tasks signal and wait for phases • CountdownEvent • Has an initial counter value • Gets signaled when count reaches zero • LazyInitializer • Lazy initialization routines • Reference type variable gets initialized lazily • SemaphoreSlim • Slim brother to Semaphore (goes kernel mode) • SpinLock, SpinWait • Loop-based wait (“spinning”) • Avoids context switch or kernel mode transition
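
Two of the primitives listed here, CountdownEvent and Barrier, can be sketched briefly. The code is an illustration under assumptions, not the demo from the talk: CoordinationSketch, RunWorkers and RunPhases are hypothetical names, and the worker bodies are placeholders.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CoordinationSketch
{
    // CountdownEvent: starts at 'workers' and becomes signaled when the count hits zero.
    // It replaces the hand-rolled counter plus ManualResetEvent pattern.
    public static int RunWorkers(int workers)
    {
        int done = 0;
        var tasks = new Task[workers];
        using (var countdown = new CountdownEvent(workers))
        {
            for (int w = 0; w < workers; w++)
            {
                tasks[w] = Task.Factory.StartNew(() =>
                {
                    Interlocked.Increment(ref done);   // placeholder for real work
                    countdown.Signal();
                });
            }
            countdown.Wait();      // returns once every worker has signaled
            Task.WaitAll(tasks);   // make sure no worker is still inside Signal() at Dispose
        }
        return done;
    }

    // Barrier: every participant must arrive before any of them enters the next phase.
    // The post-phase action runs once per phase, while all participants are held.
    public static int RunPhases(int participants, int phases)
    {
        int phasesCompleted = 0;
        var tasks = new Task[participants];
        using (var barrier = new Barrier(participants, b => phasesCompleted++))
        {
            for (int t = 0; t < participants; t++)
            {
                tasks[t] = Task.Factory.StartNew(() =>
                {
                    for (int p = 0; p < phases; p++) barrier.SignalAndWait();
                }, TaskCreationOptions.LongRunning);
            }
            Task.WaitAll(tasks);
        }
        return phasesCompleted;
    }

    public static void Main()
    {
        Console.WriteLine(RunWorkers(4));
        Console.WriteLine(RunPhases(3, 5));
    }
}
```

The barrier participants are started as LongRunning tasks in this sketch so that each gets its own thread; participants that block inside SignalAndWait would otherwise tie up thread pool threads while the remaining ones wait to be scheduled.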

  26. Coordination Data Structures • Concurrent collections (System.Collections.Concurrent) • BlockingCollection<T> • Producer/consumer scenarios • Blocks when no data is available (consumer) • Blocks when no space is available (producer) • ConcurrentBag<T> • ConcurrentDictionary<TKey, TValue> • ConcurrentQueue<T>, ConcurrentStack<T> • Thread-safe and scalable collections • As lock-free as possible • Partitioner<T> • Facilities to partition data in chunks • E.g. PLINQ partitioning problems
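
The producer/consumer behaviour described for BlockingCollection<T> can be sketched as follows. This is a minimal example under assumptions: ProducerConsumerSketch and SumThroughQueue are hypothetical names, and the payload is just a run of integers.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ProducerConsumerSketch
{
    // Pushes 1..itemCount through a bounded queue and returns the sum seen by the consumer.
    public static int SumThroughQueue(int itemCount, int capacity)
    {
        using (var queue = new BlockingCollection<int>(capacity))
        {
            Task producer = Task.Factory.StartNew(() =>
            {
                for (int i = 1; i <= itemCount; i++)
                    queue.Add(i);              // blocks while the collection is full
                queue.CompleteAdding();        // lets the consumer's loop terminate
            });

            Task<int> consumer = Task.Factory.StartNew(() =>
            {
                int sum = 0;
                foreach (int item in queue.GetConsumingEnumerable())
                    sum += item;               // blocks while the collection is empty
                return sum;
            });

            producer.Wait();
            return consumer.Result;
        }
    }

    public static void Main()
    {
        Console.WriteLine(SumThroughQueue(100, 4));   // 1 + 2 + ... + 100 = 5050
    }
}
```

The bounded capacity is what makes the producer block; constructing the collection without a capacity would leave only the consumer-side blocking.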

  27. demo Coordination Data Structures Bart J.F. De Smet Software Development Engineer Microsoft Corporation

  28. Asynchronous workflows in F# • Language feature unique to F# • Based on theory of monads • But much more exhaustive compared to LINQ… • Overloadable meaning for specific keywords • Continuation passing style • Not: 'a -> 'b • But: 'a -> ('b -> unit) -> unit (the function argument takes the computation result) • In C# style: Action<T, Action<R>> • Core concept: async { /* code */ } • Syntactic sugar for keywords inside block • E.g. let!, do!, use!
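
The "In C# style: Action<T, Action<R>>" bullet can be illustrated with a tiny sketch of continuation passing style. All names here (CpsSketch, Square, SquareCps, AddOneCps, Compose) are made up for the example, and the continuations run synchronously; a real asynchronous operation would invoke its continuation later, when the result arrives.

```csharp
using System;

public static class CpsSketch
{
    // Direct style, 'a -> 'b: the result is returned to the caller.
    public static int Square(int x) { return x * x; }

    // Continuation passing style, 'a -> ('b -> unit) -> unit:
    // nothing is returned; the result is handed to the continuation k.
    public static void SquareCps(int x, Action<int> k) { k(x * x); }
    public static void AddOneCps(int x, Action<int> k) { k(x + 1); }

    // Chaining two CPS steps means nesting continuations, which is the
    // nesting that let! and do! hide inside an async { } block.
    public static int Compose(int input)
    {
        int result = 0;
        SquareCps(input, squared =>
            AddOneCps(squared, plusOne => result = plusOne));
        return result;
    }

    public static void Main()
    {
        Action<int, Action<int>> step = SquareCps;   // the Action<T, Action<R>> shape
        step(3, value => Console.WriteLine(value));
        Console.WriteLine(Compose(4));
    }
}
```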

  29. Asynchronous workflows in F#
        let processAsync i =
            async { use stream = File.OpenRead(sprintf "Image%d.tmp" i)
                    let! pixels = stream.AsyncRead(numPixels)
                    let pixels' = transform pixels i
                    use out = File.OpenWrite(sprintf "Image%d.done" i)
                    do! out.AsyncWrite(pixels') }

        let processAsyncDemo =
            printfn "async demo..."
            let tasks = [ for i in 1 .. numImages -> processAsync i ]
            Async.RunSynchronously (Async.Parallel tasks) |> ignore   // run tasks in parallel
            printfn "Done!"
      Callout on let! (the rest of the block becomes the continuation of the read):
          stream.Read(numPixels, pixels ->
              let pixels' = transform pixels i
              use out = File.OpenWrite(sprintf "Image%d.done" i)
              do! out.AsyncWrite(pixels')
          )

  30. demo Asynchronous workflows in F# Bart J.F. De Smet Software Development Engineer Microsoft Corporation

  31. Reactive Fx • First-class events in .NET • Dualism of IEnumerable<T> interface • IObservable<T> • Pull versus push • Pull (active): IEnumerable<T> and foreach • Push (passive): raise events and event handlers • Events based on functions • Composition at its best • Definition of operators: LINQ to Events • Realization of the continuation monad

  32. IObservable<T> and IObserver<T>
        // Dual of IEnumerable<out T>
        public interface IObservable<out T>                  // co-variance
        {
            IDisposable Subscribe(IObserver<T> observer);    // IDisposable: way to unsubscribe
        }

        // Dual of IEnumerator<out T>
        public interface IObserver<in T>                     // contra-variance
        {
            // IEnumerator<T>.MoveNext return value
            void OnCompleted();                              // signaling the last event
            // IEnumerator<T>.MoveNext exceptional return
            void OnError(Exception error);
            // IEnumerator<T>.Current property
            void OnNext(T value);
        }
      Callout on OnCompleted/OnError: virtually two return types
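
To show the push model end to end, here is a minimal hand-written observable and observer. It is a sketch under assumptions: RangeObservable, ListObserver, NoopSubscription and Run are hypothetical names, the source pushes everything synchronously inside Subscribe, and none of the Reactive Fx operators from the talk are used; only the IObservable<T> and IObserver<T> interfaces from the System namespace are real.

```csharp
using System;
using System.Collections.Generic;

public static class ObservableSketch
{
    // Pushes a fixed range of integers to whoever subscribes, then completes.
    sealed class RangeObservable : IObservable<int>
    {
        readonly int _start, _count;
        public RangeObservable(int start, int count) { _start = start; _count = count; }

        public IDisposable Subscribe(IObserver<int> observer)
        {
            for (int i = 0; i < _count; i++)
                observer.OnNext(_start + i);   // push: the source drives the consumer
            observer.OnCompleted();            // signals the last event
            return new NoopSubscription();     // nothing left to unsubscribe from
        }
    }

    sealed class NoopSubscription : IDisposable
    {
        public void Dispose() { }
    }

    // Records every notification it receives.
    sealed class ListObserver : IObserver<int>
    {
        public readonly List<string> Log = new List<string>();
        public void OnNext(int value) { Log.Add("next:" + value); }
        public void OnError(Exception error) { Log.Add("error:" + error.Message); }
        public void OnCompleted() { Log.Add("completed"); }
    }

    public static List<string> Run(int start, int count)
    {
        var observer = new ListObserver();
        using (new RangeObservable(start, count).Subscribe(observer)) { }
        return observer.Log;
    }

    public static void Main()
    {
        foreach (string entry in Run(1, 3))
            Console.WriteLine(entry);
    }
}
```

Compared with foreach over an IEnumerable<int>, the loop has moved from the consumer into the producer, which is the pull versus push distinction drawn on the previous slide.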

  33. demo Visit channel9.msdn.com for info ReactiveFx Bart J.F. De Smet Software Development Engineer Microsoft Corporation

  34. Agenda • The concurrency landscape • Language headaches • .NET 4.0 facilities • Task Parallel Library • PLINQ • Coordination Data Structures • Asynchronous programming • Incubation projects • Summary

  35. DevLabs project (previously “Maestro”) • Coordination between components • “Disciplined sharing” • Actor model • Agents communicate via messages • Channels to exchange data via ports • Language features (based on C#) • Declarative data pipelines and protocols • Side-effect-free functions • Asynchronous methods • Isolated methods • Also suitable in distributed setting

  36. Channels for message exchange
        agent Program : channel Microsoft.Axum.Application
        {
            public Program()
            {
                string[] args = receive(PrimaryChannel::CommandLine);
                PrimaryChannel::ExitCode <-- 0;
            }
        }

  37. Agents and channels
        channel Adder
        {
            input int Num1;
            input int Num2;
            output int Sum;
        }

        agent AdderAgent : channel Adder
        {
            public AdderAgent()
            {
                // Send / receive primitives
                int result = receive(PrimaryChannel::Num1) + receive(PrimaryChannel::Num2);
                PrimaryChannel::Sum <-- result;
            }
        }

  38. Protocols
        channel Adder
        {
            input int Num1;
            input int Num2;
            output int Sum;

            // State transition diagram
            Start:   { Num1 -> GotNum1; }
            GotNum1: { Num2 -> GotNum2; }
            GotNum2: { Sum -> End; }
        }

  39. Use of pipelines
        agent MainAgent : channel Microsoft.Axum.Application
        {
            // Mathematical function
            function int Fibonacci(int n)
            {
                if (n <= 1) return n;
                return Fibonacci(n - 1) + Fibonacci(n - 2);
            }

            int c = 10;

            void ProcessResult(int n)
            {
                Console.WriteLine(n);
                if (--c == 0)
                    PrimaryChannel::ExitCode <-- 0;
            }

            public MainAgent()
            {
                var nums = new OrderedInteractionPoint<int>();
                // Description of data flow
                nums ==> Fibonacci ==> ProcessResult;
                for (int i = 0; i < c; i++)
                    nums <-- 42 - i;
            }
        }

  40. Domains
        // Unit of sharing between agents
        domain Chatroom
        {
            private string m_Topic;
            private int m_UserCount;

            reader agent User : channel UserCommunication
            {
                // ...
            }

            writer agent Administrator : channel AdminCommunication
            {
                // ...
            }
        }

  41. demo Axum in a nutshell Bart J.F. De Smet Software Development Engineer Microsoft Corporation

  42. Another DevLabs project • Cutting edge, released 7/28 • Specialized fork from .NET 4.0 Beta 1 • CLR modifications required • First-class transactions on memory • As an alternative to locking • “Optimistic” concurrency methodology • Make modifications • Rollback changes on conflict • Core concept: atomic { /* code */ }

  43. Transactional memory
      • Problems with locks:
        • Potential for deadlocks…
        • …and more ugliness
        • Granularity matters a lot
        • Don’t compose well
      Subtle difference between the two:
          atomic { m_x++; m_y--; throw new MyException(); }
          lock (GlobalStmLock) { m_x++; m_y--; throw new MyException(); }

  44. Bank account sample
        public static void Transfer(BankAccount from, BankAccount backup, BankAccount to, int amount)
        {
            Atomic.Do(() =>
            {
                // Be optimistic, credit the beneficiary first
                to.ModifyBalance(amount);
                // Find the appropriate funds in source accounts
                try
                {
                    from.ModifyBalance(-amount);
                }
                catch (OverdraftException)
                {
                    backup.ModifyBalance(-amount);
                }
            });
        }

  45. The hard truth about STM • Great features • ACID • Optimistic concurrency • Transparent rollback and re-execute • System.Transactions (LTM) and DTC support • Implementation • Instrumentation of shared state access • JIT compiler modification • No hardware support currently • Result: • 2x to 7x serial slowdown (in alpha prototype) • But improved parallel scalability

  46. demo Visit msdn.microsoft.com/devlabs STM.NET Bart J.F. De Smet Software Development Engineer Microsoft Corporation

  47. DryadLINQ • Dryad • Infrastructure for cluster computation • Concept of job • DryadLINQ • LINQ over Dryad • Decomposition of query • Distribution over computation nodes • Roughly similar to PLINQ • A la “map-reduce” • Declarative approach works

  48. DryadLINQ = LINQ + Dryad
        Collection<T> collection;
        bool IsLegal(Key k);
        string Hash(Key);
        var results = from c in collection
                      where IsLegal(c.key)
                      select new { Hash(c.key), c.value };
      Diagram labels: vertex code, query plan (Dryad job), data collection, C# vertices, results

  49. demo Visit research.microsoft.com/dryad DryadLINQ Bart J.F. De Smet Software Development Engineer Microsoft Corporation
