Getting the most out of Parallel Extensions for .NET Dr. Mike Liddell Senior Developer Microsoft (mikelid@microsoft.com)
Agenda • Why parallelism, why now? • Parallelism with today’s technologies • Parallel Extensions to the .NET Framework • PLINQ • Task Parallel Library • Coordination Data Structures • Demos
Hardware Paradigm Shift
[Chart: power density (W/cm²) of Intel processors — 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® — from ~1970 to ~2010, climbing past a hot plate toward a nuclear reactor, a rocket nozzle, and ultimately the sun’s surface.]
Today’s architecture: heat is becoming an unmanageable problem! To grow, to keep up, we must embrace parallel computing.
[Chart: projected many-core peak parallel GOPs (16 → 32,768) versus single-threaded performance growing ~10% per year, 2004–2015 — an ~80x parallelism opportunity.]
“… we see a very significant shift in what architectures will look like in the future … fundamentally the way we've begun to look at doing that is to move from instruction level concurrency to … multiple cores per die. But we're going to continue to go beyond there. And that just won't be in our server lines in the future; this will permeate every architecture that we build. All will have massively multicore implementations.”
— Pat Gelsinger, Chief Technology Officer, Senior Vice President, Intel Corporation (Intel Developer Forum, Spring 2004)
It's An Industry Thing • OpenMP • Intel TBB • Java libraries • OpenCL • CUDA • MPI • Erlang • Cilk • (many others)
demo • Raytracer
What's the Problem? • Multithreaded programming is “hard” today • Robust solutions are built only by specialists • Parallel patterns are not prevalent, well known, or easy to implement • Many potential correctness & performance issues • Races, deadlocks, livelocks, lock convoys, cache-coherency overheads, missed notifications, non-serializable updates, priority inversion, false sharing, sub-linear scaling, and so on… • Important features are often skimped on • The last delta of perf, ensuring no missed exceptions, composable cancellation, dynamic partitioning, efficient and custom scheduling • Businesses have little desire to “go deep” • Developers should focus on business value, not concurrency hassles and common concerns
Example: Matrix Multiplication

void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
{
    for (int i = 0; i < size; i++)
    {
        for (int j = 0; j < size; j++)
        {
            result[i, j] = 0;
            for (int k = 0; k < size; k++)
            {
                result[i, j] += m1[i, k] * m2[k, j];
            }
        }
    }
}
Manual Parallel Solution — Static Work Distribution

int N = size;
int P = 2 * Environment.ProcessorCount;
int Chunk = N / P;                                  // potential scalability bottleneck
ManualResetEvent signal = new ManualResetEvent(false);
int counter = P;                                    // manual synchronization
for (int c = 0; c < P; c++)
{
    ThreadPool.QueueUserWorkItem(o =>
    {
        int lc = (int)o;
        // error-prone index arithmetic
        for (int i = lc * Chunk; i < (lc + 1 == P ? N : (lc + 1) * Chunk); i++)
        {
            // original loop body
            for (int j = 0; j < size; j++)
            {
                result[i, j] = 0;
                for (int k = 0; k < size; k++)
                {
                    result[i, j] += m1[i, k] * m2[k, j];
                }
            }
        }
        if (Interlocked.Decrement(ref counter) == 0)  // manual locking
        {
            signal.Set();
        }
    }, c);
}
signal.WaitOne();

Slide callouts: potential scalability bottleneck; error-prone chunking; manual locking; manual synchronization.
Parallel Solution

void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
{
    Parallel.For(0, size, i =>
    {
        for (int j = 0; j < size; j++)
        {
            result[i, j] = 0;
            for (int k = 0; k < size; k++)
            {
                result[i, j] += m1[i, k] * m2[k, j];
            }
        }
    });
}

Demo!
Parallel Extensions to the .NET Framework • What is it? • Additional APIs shipping in .NET BCL (mscorlib, System, System.Core) • With corresponding enhancements to the CLR & ThreadPool • Provides primitives, task parallelism and data parallelism • Coordination/synchronization constructs (Coordination Data Structures) • Imperative data and task parallelism (Task Parallel Library) • Declarative data parallelism (PLINQ) • Common exception handling model • Common and rich cancellation model • Why do we need it? • Supports parallelism in any .NET language • Delivers reduced concept count and complexity, better time to solution • Begins to move parallelism capabilities from concurrency experts to domain experts
Parallel Extensions Architecture • User Code / Applications • PLINQ Execution Engine • Data Partitioning (Chunk, Range, Stripe, Custom) • Operators (Map, Filter, Sort, Search, Reduction) • Merging (Pipeline, Synchronous, Order-preserving) • Task Parallel Library • Structured Task Parallelism • Coordination Data Structures • Thread-safe Collections • Coordination Types • Cancellation Types • Pre-existing Primitives • ThreadPool • Monitor, Events, Threads
Task Parallel Library 1st-class debugger support! • System.Threading.Tasks • Task • Parent-child relationships • Structured waiting and cancellation • Continuations on success, failure, cancellation • Implements IAsyncResult to compose with the Asynchronous Programming Model (APM) • Task<T> • A task that has a value on completion • Asynchronous execution with blocking on task.Value • Combines the ideas of futures and promises • TaskScheduler • We ship a scheduler that makes full use of the (vastly) improved ThreadPool • Custom task schedulers can be written for specific needs • Parallel • Convenience APIs: Parallel.For(), Parallel.ForEach() • Automatic, scalable & dynamic partitioning
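The Task<T> and continuation ideas above can be sketched with the released .NET 4 APIs. Note the CTP described in this deck exposed the result as task.Value; the shipped API is Task<T>.Result. This is a minimal sketch, not the deck's original demo code.

```csharp
using System;
using System.Threading.Tasks;

class TaskDemo
{
    static void Main()
    {
        // Start a Task<int> that computes a value asynchronously.
        Task<int> sum = Task<int>.Factory.StartNew(() =>
        {
            int total = 0;
            for (int i = 1; i <= 100; i++) total += i;
            return total;
        });

        // Attach a continuation that runs when the antecedent completes.
        Task<string> report = sum.ContinueWith(t => "sum = " + t.Result);

        // Reading .Result blocks until the value is available (the
        // future/promise behaviour the slide describes).
        Console.WriteLine(report.Result);
    }
}
```

The parent-child and structured-waiting features mentioned on the slide work the same way: a task created inside another task can be attached to it, and waiting on the parent waits for the whole tree.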
Task Parallel Library: Loops • Loops are a common source of work • Can be parallelized when iterations are independent • Body doesn’t depend on mutable state • e.g. static vars, or writing to local vars used in subsequent iterations

Sequential:
for (int i = 0; i < n; i++) work(i);
…
foreach (T e in data) work(e);

Parallel:
Parallel.For(0, n, i => work(i));
…
Parallel.ForEach(data, e => work(e));
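A concrete runnable version of the pattern above — each iteration writes only to its own array slot, so there is no shared mutable state and the loop is safe to parallelize. This is an illustrative sketch, not code from the deck.

```csharp
using System;
using System.Threading.Tasks;

class ParallelLoopDemo
{
    static void Main()
    {
        int n = 1000;
        double[] squares = new double[n];

        // Independent iterations: slot i is touched only by iteration i,
        // so no locking is needed and the partitioner can split freely.
        Parallel.For(0, n, i => squares[i] = (double)i * i);

        Console.WriteLine(squares[n - 1]); // 998001
    }
}
```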
Task Parallel Library • Supports early exit via a Break API • Parallel.For, Parallel.ForEach for loops • Parallel.Invoke for easy creation of simple tasks • Synchronous (blocking) APIs, but with cancellation support

Parallel.Invoke(
    () => StatementA(),
    () => StatementB(),
    () => StatementC());

Parallel.For(…, cancellationToken);
Parallel LINQ (PLINQ) • Enables LINQ developers to leverage parallel hardware • Supports all of the .NET Standard Query Operators • Plus a few other extension methods specific to PLINQ • Abstracts away parallelism details • Partitions and merges data intelligently (“classic” data parallelism) • Works for any IEnumerable<T> • e.g. data.AsParallel().Select(..).Where(..); • e.g. array.AsParallel().WithCancellation(ct)…
Writing a PLINQ Query • Different ways to write PLINQ queries • Comprehensions: syntax extensions to C# and Visual Basic • Normal APIs (two flavours) • As extension methods on IParallelEnumerable<T> • Direct use of ParallelEnumerable

var q = from x in Y.AsParallel()
        where p(x)
        orderby x.f1
        select x.f2;

var q = Y.AsParallel()
         .Where(x => p(x))
         .OrderBy(x => x.f1)
         .Select(x => x.f2);

var q = ParallelEnumerable.Select(
            ParallelEnumerable.OrderBy(
                ParallelEnumerable.Where(Y.AsParallel(), x => p(x)),
                x => x.f1),
            x => x.f2);
PLINQ Partitioning and Merging • Input to a single operator is partitioned into p disjoint subsets • Operators are replicated across the partitions • A merge marshals data back to the consumer thread • Each partition executes in (almost) complete isolation

foreach (int i in D.AsParallel()
                   .Where(x => p(x))
                   .Select(x => x * x * x)
                   .OrderBy(x => -x)) { … }

[Diagram: the source D is partitioned; each of n PLINQ tasks runs where p(x), select x³, and a local sort over its own partition; a merge recombines the results for the consuming foreach.]
Coordination Data Structures • Used throughout PLINQ and TPL • Assist with key concurrency patterns • Thread-safe collections • ConcurrentStack<T> • ConcurrentQueue<T> • … • Work exchange • BlockingCollection<T> • … • Phased Operation • CountdownEvent • … • Locks and Signaling • ManualResetEventSlim • SemaphoreSlim • SpinLock … • Initialization • LazyInit<T> … • Cancellation • CancellationTokenSource • CancellationToken • OperationCanceledException
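The "work exchange" role of BlockingCollection<T> listed above can be sketched as a bounded producer/consumer pipeline. This is an illustrative example under the shipped .NET 4 API, not code from the deck.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerConsumerDemo
{
    static void Main()
    {
        // Bounded capacity: Add blocks when the buffer holds 10 items,
        // throttling a fast producer to the consumer's pace.
        var queue = new BlockingCollection<int>(10);

        Task producer = Task.Factory.StartNew(() =>
        {
            for (int i = 0; i < 100; i++) queue.Add(i);
            queue.CompleteAdding();   // signals "no more items"
        });

        // GetConsumingEnumerable blocks until items arrive and the loop
        // ends cleanly once CompleteAdding has been called.
        long total = 0;
        foreach (int item in queue.GetConsumingEnumerable())
            total += item;

        producer.Wait();
        Console.WriteLine(total);
    }
}
```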
Common Cancellation • A CancellationTokenSource (CTS) is a source of cancellation requests • A CancellationToken (CT) is a notifier of a cancellation request • Linking tokens allows combining of cancellation requesters • Slow code should poll roughly every 1ms • Blocking calls should observe a token

Work co-ordinator: creates a CTS; starts work; cancels the CTS if required.
Workers: get, share, and copy tokens; routinely poll a token, which observes the CTS; may attach callbacks to a token.
[Diagram: each CTS hands out tokens (CT1, CT2, …) to workers; a linked source (CTS12) combines the requests of several sources.]
Common Cancellation (cont.) • All blocking calls allow a CancellationToken to be supplied.

var results = data.AsParallel()
                  .WithCancellation(token)
                  .Select(x => f(x))
                  .ToArray();

• User code can observe the cancellation token and cooperatively enact cancellation:

var results = data.AsParallel()
                  .WithCancellation(token)
                  .Select(x =>
                  {
                      if (token.IsCancellationRequested)
                          throw new OperationCanceledException(token);
                      return f(x);
                  })
                  .ToArray();
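An end-to-end sketch of the pattern above, using the shipped .NET 4 API: one thread requests cancellation on the CancellationTokenSource while a PLINQ query observing the token is running, and the consumer catches OperationCanceledException. The timing constants are arbitrary choices for illustration.

```csharp
using System;
using System.Linq;
using System.Threading;

class CancellationDemo
{
    static void Main()
    {
        var cts = new CancellationTokenSource();

        // Co-ordinator: request cancellation after a short delay.
        ThreadPool.QueueUserWorkItem(_ =>
        {
            Thread.Sleep(50);
            cts.Cancel();
        });

        bool cancelled = false;
        try
        {
            // A deliberately long-running query that observes the token.
            var results = Enumerable.Range(0, int.MaxValue)
                .AsParallel()
                .WithCancellation(cts.Token)
                .Select(x => { Thread.Sleep(1); return x; })
                .ToArray();
        }
        catch (OperationCanceledException)
        {
            cancelled = true;   // PLINQ surfaces the cancellation here
        }

        Console.WriteLine("query cancelled: " + cancelled);
    }
}
```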
Extension Points in TPL & PLINQ • Partitioning strategies for Parallel & PLINQ • Extend via Partitioner<T>, OrderablePartitioner<T> — e.g. partitioners for heterogeneous data • Task scheduling • Extend via TaskScheduler — e.g. a GUI-thread scheduler, a throttled scheduler • BlockingCollection • Extend via IProducerConsumerCollection<T> — e.g. a blocking priority queue
Debugging Parallel Apps in VS2010 • Two new debugger tool windows • “Parallel Tasks” • “Parallel Stacks”
Parallel Tasks
[Screenshot callouts: task identifier, status, thread assignment, location (tooltip shows info on waiting/deadlocked status), task entry point, parent ID, flagging, the current task, frozen task threads, and column/item context menus.]
Parallel Stacks
[Screenshot callouts: current frame, active frame of the current thread, active frames of other threads, method and header tooltips, zoom control, bird’s-eye view, context menu; blue highlighting marks the path of the current thread.]
Summary • The ManyCore Shift is happening • Parallelism in your code is inevitable • Invest in a platform that enables parallelism …like the Parallel Extensions for .NET
Further Info and News • MSDN Concurrency Developer Center: http://msdn.microsoft.com/concurrency • Getting the bits: June 2008 CTP — http://msdn.microsoft.com/concurrency • Microsoft Visual Studio 2010 — beta coming soon: http://www.microsoft.com/visualstudio/en-us/products/2010/default.mspx • Blogs • Parallel Extensions Team: http://blogs.msdn.com/pfxteam • Joe Duffy: http://www.bluebytesoftware.com • Daniel Moth: http://www.danielmoth.com/Blog/
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Parallel Technologies from Microsoft • Local computing • CDS • TPL • PLINQ • Concurrency Runtime in Robotics Studio • PPL (native) • OpenMP (native) • Distributed computing • WCF • MPI, MPI.NET
Types • Key common types: AggregateException, OperationCanceledException, TaskCanceledException; CancellationTokenSource, CancellationToken; Partitioner<T> • Key TPL types: Task, Task<T>; TaskFactory, TaskFactory<T>; TaskScheduler • Key PLINQ types: extension methods IEnumerable.AsParallel(), IEnumerable<T>.AsParallel(); ParallelQuery, ParallelQuery<T>, OrderableParallelQuery<T> • Key CDS types: Lazy<T>, LazyVariable<T>, LazyInitializer; CountdownEvent, ManualResetEventSlim, SemaphoreSlim; BlockingCollection, ConcurrentDictionary, ConcurrentQueue
Performance Tips • Early community technology preview • Keep in mind that performance will improve significantly • Best suited to compute-intensive work and/or large data sets • Work done should be at least 1,000s of cycles • Measure, and combine/optimize as necessary • Do not be gratuitous in task creation • Lightweight, but still requires object allocation, etc. • Parallelize only outer loops where possible • Unless N is insufficiently large to offer enough parallelism — consider parallelizing only the inner loop, or both, at that point • Prefer isolation and immutability over synchronization • Synchronization == !Scalable • Try to avoid shared data • Have realistic expectations • Amdahl’s Law: speedup is fundamentally limited by the amount of sequential computation • Gustafson’s Law: but what if you add more data, thus increasing the parallelizable percentage of the application?
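Amdahl's Law from the last bullet can be made concrete: with sequential fraction s and p processors, speedup = 1 / (s + (1 - s) / p). A short worked example (the 10% figure is just an illustration):

```csharp
using System;

class AmdahlDemo
{
    // Amdahl's Law: speedup with sequential fraction s on p processors.
    static double Speedup(double s, int p)
    {
        return 1.0 / (s + (1.0 - s) / p);
    }

    static void Main()
    {
        // With 10% sequential work, 4 cores give only ~3.08x,
        // and even 1000 cores cannot exceed the 1/s = 10x ceiling.
        Console.WriteLine(Speedup(0.10, 4));     // ~3.08
        Console.WriteLine(Speedup(0.10, 1000));  // ~9.91
    }
}
```

Gustafson's observation is the flip side: growing the data set shrinks the effective sequential fraction s, raising that ceiling.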
Parallelism Blockers

int[] values = new int[] { 0, 1, 2 };
var q = from x in values.AsParallel() select x * 2;
int[] scaled = q.ToArray(); // == { 0, 2, 4 } ??

• Ordering not guaranteed
• Exceptions — surfaced as an AggregateException:

object[] data = new object[] { "foo", null, null };
var q = from x in data.AsParallel() select x.ToString();

• Thread affinity:

controls.AsParallel().ForAll(c => c.Size = ...); // problem

• Operations with sub-linear speedup, or even speedup < 1.0:

IEnumerable<int> input = …;
var doubled = from x in input.AsParallel() select x * 2;

• Side effects and mutability are serious issues • Most queries do not use side effects, but… • Race condition if elements are not unique:

var q = from x in data.AsParallel() select x.f++;
PLINQ Partitioning, cont. • Types of partitioning • Chunk • Works with any IEnumerable<T> • Single enumerator shared; chunks handed out on demand • Range • Works only with IList<T> • Input divided into contiguous regions, one per partition • Stride • Works only with IList<T> • Elements handed out round-robin to each partition • Hash • Works with any IEnumerable<T> • Elements assigned to a partition based on their hash code • Repartitioning sometimes necessary
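Range partitioning can be seen directly with the shipped Partitioner.Create(int, int) overload: each worker receives a contiguous [from, to) block rather than taking elements one at a time, so per-element synchronization disappears. A sketch (the sum is just a stand-in workload):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class RangePartitionDemo
{
    static void Main()
    {
        long total = 0;
        object gate = new object();

        // Partitioner.Create(0, 1000) yields Tuple<int,int> ranges.
        // Each partition sums its own contiguous block locally...
        Parallel.ForEach(Partitioner.Create(0, 1000), range =>
        {
            long local = 0;
            for (int i = range.Item1; i < range.Item2; i++)
                local += i;

            // ...and synchronizes once per partition, not per element.
            lock (gate) total += local;
        });

        Console.WriteLine(total);
    }
}
```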
PLINQ Merging • Pipelined: separate consumer thread • Default for GetEnumerator() • And hence foreach loops • Access to data as it's available • But more synchronization overhead • Stop-and-go: consumer helps • Sorts, ToArray, ToList, GetEnumerator(false), etc. • Minimizes context switches • But higher latency and more memory • Inverted: no merging needed • ForAll extension method • Most efficient by far • But not always applicable
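The "inverted" case above can be sketched with ForAll: each partition is consumed on its own worker thread, so no merge step runs at all — which is why results land in a thread-safe collection rather than an ordered sequence. An illustrative example:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;

class ForAllDemo
{
    static void Main()
    {
        var results = new ConcurrentBag<int>();

        // ForAll runs the action in parallel on each partition's thread;
        // unlike ToArray/foreach, nothing is marshalled back for merging,
        // so ordering is not preserved.
        ParallelEnumerable.Range(0, 100)
            .Where(x => x % 2 == 0)
            .ForAll(x => results.Add(x));

        Console.WriteLine(results.Count); // 50 even numbers, in no particular order
    }
}
```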
Example: “Baby Names”

IEnumerable<BabyInfo> babyRecords = GetBabyRecords();
var results = new List<BabyInfo>();
foreach (var babyRecord in babyRecords)
{
    if (babyRecord.Name == queryName &&
        babyRecord.State == queryState &&
        babyRecord.Year >= yearStart &&
        babyRecord.Year <= yearEnd)
    {
        results.Add(babyRecord);
    }
}
results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));
Manual Parallel Solution

IEnumerable<BabyInfo> babies = …;
var results = new List<BabyInfo>();
int partitionsCount = Environment.ProcessorCount * 2;
int remainingCount = partitionsCount;
var enumerator = babies.GetEnumerator();
try
{
    using (ManualResetEvent done = new ManualResetEvent(false))
    {
        for (int i = 0; i < partitionsCount; i++)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                var partialResults = new List<BabyInfo>();
                while (true)
                {
                    BabyInfo baby;
                    lock (enumerator)                          // inefficient locking
                    {
                        if (!enumerator.MoveNext()) break;
                        baby = enumerator.Current;
                    }
                    if (baby.Name == queryName && baby.State == queryState &&
                        baby.Year >= yearStart && baby.Year <= yearEnd)
                    {
                        partialResults.Add(baby);
                    }
                }
                lock (results) results.AddRange(partialResults);   // manual aggregation
                if (Interlocked.Decrement(ref remainingCount) == 0)
                    done.Set();
            });
        }
        done.WaitOne();                                        // heavy synchronization
        results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));  // non-parallel sort
    }
}
finally
{
    if (enumerator is IDisposable) ((IDisposable)enumerator).Dispose();
}

Slide callouts: synchronization knowledge required; inefficient locking; loss of foreach simplicity; manual aggregation tricks; lack of thread reuse; heavy synchronization; non-parallel sort.
LINQ Solution — just add .AsParallel()

var results = from baby in babyRecords.AsParallel()
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart &&
                    baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;

(or in a different syntax…)

var results = babyRecords.AsParallel()
                         .Where(b => b.Name == queryName &&
                                     b.State == queryState &&
                                     b.Year >= yearStart &&
                                     b.Year <= yearEnd)
                         .OrderBy(b => b.Year)
                         .Select(b => b);
ThreadPool Task (Work) Stealing
[Diagram: the program thread queues tasks (Task 1 … Task 6) into the global ThreadPool queue; each worker thread (1 … p) also maintains its own local task queue, and workers steal tasks from other queues when their own runs dry.]