The Manycore Shift: Making Parallel Computing Mainstream Wishful thinking? Bart J.F. De Smet bartde@microsoft.com http://blogs.bartdesmet.net/bart Software Development Engineer Microsoft Corporation Session Code: DTL206
Agenda • The concurrency landscape • Language headaches • .NET 4.0 facilities • Task Parallel Library • PLINQ • Coordination Data Structures • Asynchronous programming • Incubation projects • Summary
Let’s sell processors Moore’s law • The number of transistors incorporated in a chip will approximately double every 24 months. Gordon Moore – Intel – 1965
Let’s sell even more processors Moore’s law today • It can't continue forever. The nature of exponentials is that you push them out and eventually disaster happens. Gordon Moore – Intel – 2005
Problem statement • Shared mutable state • Needs synchronization primitives • Locks are problematic • Risk for contention • Poor discoverability (SyncRoot anyone?) • Not composable • Difficult to get right (deadlocks, etc.) • Coarse-grained concurrency • Threads well-suited for large units of work • Expensive context switching • Asynchronous programming
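To make the shared-mutable-state problem concrete, here is a minimal sketch (the SharedState and Increment names are illustrative): counter++ compiles to a read, an increment, and a write, so two threads can interleave and lose updates unless every access is guarded by a lock.

```csharp
using System.Threading;

class SharedState
{
    private int counter = 0;                     // shared mutable state
    private readonly object gate = new object();

    public void Increment()
    {
        // counter++ alone is a read-modify-write and is not atomic;
        // without the lock, concurrent calls can lose updates.
        lock (gate)
        {
            counter++;
        }
    }
}
```

Locks like this are exactly what the slide criticizes: they are hard to discover, easy to get wrong, and do not compose across components.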
Microsoft Parallel Computing Initiative • Constructing parallel applications (VB, C#, F#) • Executing fine-grain parallel applications • Coordinating system resources/services
Agenda • The concurrency landscape • Language headaches • .NET 4.0 facilities • Task Parallel Library • PLINQ • Coordination Data Structures • Asynchronous programming • Incubation projects • Summary
Languages: two extremes • Mutable state – Fortran heritage (C, C++, C#, VB) • No mutable state – LISP heritage (Haskell, ML): fundamentalist functional programming • F# sits in between
Mutability
• Mutable by default (C# et al) – synchronization required:

    int x = 5; // Share out x
    x++;

• Immutable by default (F# et al) – no locking required; mutation is explicit opt-in:

    let x = 5         // Share out x
    // Can’t mutate x
    let mutable x = 5 // Share out x
    x <- x + 1
Side-effects will kill you
• Elimination of common sub-expressions?
  • The compiler may not rewrite (DateTime.Now, DateTime.Now) as let now = DateTime.Now in (now, now): static DateTime Now { get; } has a side-effect (it reads the clock), so the two reads can yield different values
• Runtime out of control
  • Can’t optimize code
• Types don’t reveal side-effects
  • Haskell concept of IO monad
• Did you know? LINQ is a monad!
Source: www.cse.chalmers.se
Languages: two roadmaps?
• Making C# better
  • Add safety nets? Immutability, purity constructs, linear types
  • Software Transactional Memory
  • Kamikaze-style of concurrency
  • Simplify common patterns
• Making Haskell mainstream
  • Just right? Too academic?
  • Not a smooth upgrade path?
• Nirvana somewhere in between?
Agenda • The concurrency landscape • Language headaches • .NET 4.0 facilities • Task Parallel Library • PLINQ • Coordination Data Structures • Asynchronous programming • Incubation projects • Summary
Parallel Extensions Architecture
• .NET program → C#/VB/C++/F#/other .NET compiler → IL → TPL or CDS
• PLINQ Execution Engine (declarative queries, query analysis)
  • Data partitioning: chunk, range, hash, striped, repartitioning
  • Operator types: map, scan, build, search, reduction
  • Merging: async (pipeline), synch, order preserving, sorting, ForAll
• Task Parallel Library (TPL): task APIs, task parallelism, futures, scheduling
• Coordination Data Structures: thread-safe collections, synchronization types, coordination types
• OS scheduling primitives (also UMS in Windows 7 and up), processors 1..p
Task Parallel Library – Tasks • System.Threading.Tasks • Task • Parent-child relationships • Explicit grouping • Waiting and cancelation • Task<T> • Tasks that produce values • Also known as futures
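A minimal sketch of the Task and Task&lt;T&gt; APIs from System.Threading.Tasks, using the .NET 4.0-era factory methods (the surrounding class and values are illustrative):

```csharp
using System;
using System.Threading.Tasks;

class TaskDemo
{
    static void Main()
    {
        // Task: a unit of work scheduled by the TPL.
        Task t = Task.Factory.StartNew(() => Console.WriteLine("working..."));

        // Task<T>: a task that produces a value, also known as a future.
        Task<int> future = Task.Factory.StartNew(() => 6 * 7);

        // Reading Result blocks until the task has completed.
        Console.WriteLine(future.Result);

        t.Wait();
    }
}
```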
Work Stealing
• Internally, the runtime uses
  • Work stealing techniques
  • Lock-free concurrent task queues
• Work stealing has provably
  • Good locality
  • Work distribution properties
Example code to parallelize

void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
{
    for (int i = 0; i < size; i++)
    {
        for (int j = 0; j < size; j++)
        {
            result[i, j] = 0;
            for (int k = 0; k < size; k++)
            {
                result[i, j] += m1[i, k] * m2[k, j];
            }
        }
    }
}
Solution today

int N = size;
int P = 2 * Environment.ProcessorCount;
int Chunk = N / P;                    // size of a work chunk
ManualResetEvent signal = new ManualResetEvent(false);
int counter = P;                      // counter limits kernel transitions
for (int c = 0; c < P; c++)           // for each chunk
{
    ThreadPool.QueueUserWorkItem(o =>
    {
        int lc = (int)o;
        for (int i = lc * Chunk;                       // process one chunk
             i < (lc + 1 == P ? N : (lc + 1) * Chunk); // respect upper bound
             i++)
        {
            // original loop body
            for (int j = 0; j < size; j++)
            {
                result[i, j] = 0;
                for (int k = 0; k < size; k++)
                {
                    result[i, j] += m1[i, k] * m2[k, j];
                }
            }
        }
        if (Interlocked.Decrement(ref counter) == 0)   // efficient interlocked ops
        {
            signal.Set();                              // kernel transition only when done
        }
    }, c);
}
signal.WaitOne();

Drawbacks: knowledge of synchronization primitives, static work distribution, high overhead, error-prone tricks, lack of thread reuse, heavy synchronization
Solution with Parallel Extensions

void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
{
    Parallel.For(0, size, i =>
    {
        for (int j = 0; j < size; j++)
        {
            result[i, j] = 0;
            for (int k = 0; k < size; k++)
            {
                result[i, j] += m1[i, k] * m2[k, j];
            }
        }
    });
}

Structured parallelism
Task Parallel Library – Loops Why immutability gains attention
• Common source of work in programs
• System.Threading.Parallel class
• Parallelism when iterations are independent
  • Body doesn’t depend on mutable state
  • E.g. static variables, writing to local variables used in subsequent iterations
• Synchronous
  • All iterations finish, regularly or exceptionally

    for (int i = 0; i < n; i++) work(i);   →  Parallel.For(0, n, i => work(i));
    foreach (T e in data) work(e);         →  Parallel.ForEach(data, e => work(e));
demo Task Parallel Library Bart J.F. De Smet Software Development Engineer Microsoft Corporation
Amdahl’s law by example • Theoretical maximum speedup is determined by the amount of sequential (non-parallelizable) code
Performance Tips • Compute intensive and/or large data sets • Work done should be at least 1,000s of cycles • Do not be gratuitous in task creation • Lightweight, but still requires object allocation, etc. • Parallelize only outer loops where possible • Unless N is insufficiently large to offer enough parallelism • Prefer isolation & immutability over synchronization • Synchronization == !Scalable • Try to avoid shared data • Have realistic expectations • Amdahl’s Law • Speedup will be fundamentally limited by the amount of sequential computation • Gustafson’s Law • But what if you add more data, thus increasing the parallelizable percentage of the application?
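Amdahl’s law can be stated in one line: with parallelizable fraction p of the work and n processors, the speedup is 1 / ((1 − p) + p / n). A small sketch (class and method names are illustrative):

```csharp
using System;

class Amdahl
{
    // Theoretical speedup for parallelizable fraction p on n processors.
    static double Speedup(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    static void Main()
    {
        // Even with 90% parallelizable code, 8 cores give well under 8x:
        Console.WriteLine(Speedup(0.9, 8));        // ~4.7
        // And no processor count can beat the limit 1 / (1 - p) = 10x:
        Console.WriteLine(Speedup(0.9, 1000000));  // ~10
    }
}
```

Gustafson’s observation is the flip side: growing the data set tends to grow p, so the achievable speedup improves with problem size.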
Parallel LINQ (PLINQ)
• Enable LINQ developers to leverage parallel hardware
  • Fully supports all .NET Standard Query Operators
  • Abstracts away the hard work of using parallelism
  • Partitions and merges data intelligently (classic data parallelism)
• Minimal impact to existing LINQ programming model
  • AsParallel extension method
  • Optional preservation of input ordering (AsOrdered)
• Query syntax enables runtime to auto-parallelize
  • Automatic way to generate more Tasks, like Parallel
  • Graph analysis determines how to do it
  • Very little synchronization internally: highly efficient

    var q = from p in people.AsParallel()
            where p.Name == queryInfo.Name && p.State == queryInfo.State &&
                  p.Year >= yearStart && p.Year <= yearEnd
            orderby p.Year ascending
            select p;
demo PLINQ Bart J.F. De Smet Software Development Engineer Microsoft Corporation
Coordination Data Structures • New synchronization primitives (System.Threading) • Barrier • Multi-phased algorithm • Tasks signal and wait for phases • CountdownEvent • Has an initial counter value • Gets signaled when count reaches zero • LazyInitializer • Lazy initialization routines • Reference type variable gets initialized lazily • SemaphoreSlim • Slim brother to Semaphore (goes kernel mode) • SpinLock, SpinWait • Loop-based wait (“spinning”) • Avoids context switch or kernel mode transition
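A minimal sketch of CountdownEvent replacing the hand-rolled Interlocked counter and ManualResetEvent from the earlier matrix example (the worker body is illustrative):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class CountdownDemo
{
    static void Main()
    {
        const int workers = 3;
        using (CountdownEvent done = new CountdownEvent(workers))
        {
            for (int i = 0; i < workers; i++)
            {
                Task.Factory.StartNew(() =>
                {
                    // ... do a chunk of work ...
                    done.Signal();   // decrement the count
                });
            }
            done.Wait();             // unblocks when the count reaches zero
        }
        Console.WriteLine("all workers finished");
    }
}
```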
Coordination Data Structures • Concurrent collections (System.Collections.Concurrent) • BlockingCollection<T> • Producer/consumer scenarios • Blocks when no data is available (consumer) • Blocks when no space is available (producer) • ConcurrentBag<T> • ConcurrentDictionary<TKey, TElement> • ConcurrentQueue<T>, ConcurrentStack<T> • Thread-safe and scalable collections • As lock-free as possible • Partitioner<T> • Facilities to partition data in chunks • E.g. PLINQ partitioning problems
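A minimal producer/consumer sketch with BlockingCollection&lt;T&gt; (the bound of 10 and the item count are illustrative): Add blocks when the collection is full, GetConsumingEnumerable blocks until data arrives, and CompleteAdding lets the consumer loop end cleanly.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerConsumer
{
    static void Main()
    {
        var queue = new BlockingCollection<int>(boundedCapacity: 10);

        var producer = Task.Factory.StartNew(() =>
        {
            for (int i = 0; i < 5; i++)
                queue.Add(i);        // blocks when the bound is reached
            queue.CompleteAdding();  // signal "no more items"
        });

        // Blocks while empty; terminates after CompleteAdding.
        foreach (int item in queue.GetConsumingEnumerable())
            Console.WriteLine(item);

        producer.Wait();
    }
}
```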
demo Coordination Data Structures Bart J.F. De Smet Software Development Engineer Microsoft Corporation
Asynchronous workflows in F#
• Language feature unique to F#
  • Based on theory of monads
  • But much more exhaustive compared to LINQ…
  • Overloadable meaning for specific keywords
• Continuation passing style
  • Not: 'a -> 'b
  • But: 'a -> ('b -> unit) -> unit (the function takes the computation result)
  • In C# style: Action<T, Action<R>>
• Core concept: async { /* code */ }
  • Syntactic sugar for keywords inside block
  • E.g. let!, do!, use!
Asynchronous workflows in F#

let processAsync i = async {
    use stream = File.OpenRead(sprintf "Image%d.tmp" i)
    let! pixels = stream.AsyncRead(numPixels)
    let pixels' = transform pixels i
    use out = File.OpenWrite(sprintf "Image%d.done" i)
    do! out.AsyncWrite(pixels') }

let processAsyncDemo =
    printfn "async demo..."
    let tasks = [ for i in 1 .. numImages -> processAsync i ]
    // Run tasks in parallel
    Async.RunSynchronously (Async.Parallel tasks) |> ignore
    printfn "Done!"

// Behind the scenes, let! is sugar for continuation passing style, roughly:
// stream.Read(numPixels, fun pixels ->
//     let pixels' = transform pixels i
//     use out = File.OpenWrite(sprintf "Image%d.done" i)
//     out.AsyncWrite(pixels'))
demo Asynchronous workflows in F# Bart J.F. De Smet Software Development Engineer Microsoft Corporation
Reactive Fx • First-class events in .NET • Dualism of IEnumerable<T> interface • IObservable<T> • Pull versus push • Pull (active): IEnumerable<T> and foreach • Push (passive): raise events and event handlers • Events based on functions • Composition at its best • Definition of operators: LINQ to Events • Realization of the continuation monad
IObservable<T> and IObserver<T>

// Dual of IEnumerable<out T>; co-variance in T
public interface IObservable<out T>
{
    IDisposable Subscribe(IObserver<T> observer);   // IDisposable: way to unsubscribe
}

// Dual of IEnumerator<out T>; contra-variance in T
public interface IObserver<in T>
{
    void OnCompleted();             // IEnumerator<T>.MoveNext return value (signals the last event)
    void OnError(Exception error);  // IEnumerator<T>.MoveNext exceptional return
    void OnNext(T value);           // IEnumerator<T>.Current property
}
// Virtually two return types: OnCompleted and OnError
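A minimal IObserver&lt;T&gt; to make the push model concrete (PrintObserver is an illustrative name):

```csharp
using System;

class PrintObserver : IObserver<int>
{
    public void OnNext(int value)    { Console.WriteLine("Got " + value); }        // data pushed in
    public void OnError(Exception e) { Console.WriteLine("Error: " + e.Message); } // exceptional end
    public void OnCompleted()        { Console.WriteLine("Done"); }                // last event
}
```

An IObservable&lt;int&gt; source would accept it via Subscribe(new PrintObserver()) and then push values through OnNext, ending with exactly one OnCompleted or OnError: the mirror image of a foreach loop pulling from IEnumerable&lt;int&gt;.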
demo Visit channel9.msdn.com for info ReactiveFx Bart J.F. De Smet Software Development Engineer Microsoft Corporation
Agenda • The concurrency landscape • Language headaches • .NET 4.0 facilities • Task Parallel Library • PLINQ • Coordination Data Structures • Asynchronous programming • Incubation projects • Summary
DevLabs project (previously “Maestro”) • Coordination between components • “Disciplined sharing” • Actor model • Agents communicate via messages • Channels to exchange data via ports • Language features (based on C#) • Declarative data pipelines and protocols • Side-effect-free functions • Asynchronous methods • Isolated methods • Also suitable in distributed setting
Channels for message exchange

agent Program : channel Microsoft.Axum.Application
{
    public Program()
    {
        string[] args = receive(PrimaryChannel::CommandLine);
        PrimaryChannel::ExitCode <-- 0;
    }
}
Agents and channels

channel Adder
{
    input int Num1;
    input int Num2;
    output int Sum;
}

agent AdderAgent : channel Adder
{
    public AdderAgent()
    {
        // Send / receive primitives
        int result = receive(PrimaryChannel::Num1) +
                     receive(PrimaryChannel::Num2);
        PrimaryChannel::Sum <-- result;
    }
}
Protocols

channel Adder
{
    input int Num1;
    input int Num2;
    output int Sum;

    // State transition diagram
    Start:   { Num1 -> GotNum1; }
    GotNum1: { Num2 -> GotNum2; }
    GotNum2: { Sum -> End; }
}
Use of pipelines

agent MainAgent : channel Microsoft.Axum.Application
{
    // Mathematical (side-effect-free) function
    function int Fibonacci(int n)
    {
        if (n <= 1) return n;
        return Fibonacci(n - 1) + Fibonacci(n - 2);
    }

    int c = 10;

    void ProcessResult(int n)
    {
        Console.WriteLine(n);
        if (--c == 0) PrimaryChannel::ExitCode <-- 0;
    }

    public MainAgent()
    {
        var nums = new OrderedInteractionPoint<int>();
        // Description of data flow
        nums ==> Fibonacci ==> ProcessResult;
        for (int i = 0; i < c; i++)
            nums <-- 42 - i;
    }
}
Domains

// Unit of sharing between agents
domain Chatroom
{
    private string m_Topic;
    private int m_UserCount;

    reader agent User : channel UserCommunication
    {
        // ...
    }

    writer agent Administrator : channel AdminCommunication
    {
        // ...
    }
}
demo Axum in a nutshell Bart J.F. De Smet Software Development Engineer Microsoft Corporation
Another DevLabs project • Cutting edge, released 7/28 • Specialized fork from .NET 4.0 Beta 1 • CLR modifications required • First-class transactions on memory • As an alternative to locking • “Optimistic” concurrency methodology • Make modifications • Rollback changes on conflict • Core concept: atomic { /* code */ }
Transactional memory
• Subtle difference:

    atomic {
        m_x++;
        m_y--;
        throw new MyException();
    }

  versus

    lock (GlobalStmLock) {
        m_x++;
        m_y--;
        throw new MyException();
    }

• Problems with locks:
  • Potential for deadlocks… and more ugliness
  • Granularity matters a lot
  • Don’t compose well
Bank account sample

public static void Transfer(BankAccount from, BankAccount backup,
                            BankAccount to, int amount)
{
    Atomic.Do(() =>
    {
        // Be optimistic, credit the beneficiary first
        to.ModifyBalance(amount);

        // Find the appropriate funds in source accounts
        try
        {
            from.ModifyBalance(-amount);
        }
        catch (OverdraftException)
        {
            backup.ModifyBalance(-amount);
        }
    });
}
The hard truth about STM • Great features • ACID • Optimistic concurrency • Transparent rollback and re-execute • System.Transactions (LTM) and DTC support • Implementation • Instrumentation of shared state access • JIT compiler modification • No hardware support currently • Result: • 2x to 7x serial slowdown (in alpha prototype) • But improved parallel scalability
demo Visit msdn.microsoft.com/devlabs STM.NET Bart J.F. De Smet Software Development Engineer Microsoft Corporation
DryadLINQ • Dryad • Infrastructure for cluster computation • Concept of job • DryadLINQ • LINQ over Dryad • Decomposition of query • Distribution over computation nodes • Roughly similar to PLINQ • A la “map-reduce” • Declarative approach works
DryadLINQ = LINQ + Dryad

// C# code
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { Hash(c.key), c.value };

Data collection → query plan (Dryad job) → vertex code → results
demo Visit research.microsoft.com/dryad DryadLINQ Bart J.F. De Smet Software Development Engineer Microsoft Corporation