Rethinking Parallel Execution Guri Sohi (along with Matthew Allen, Srinath Sridharan, Gagan Gupta) University of Wisconsin-Madison
Outline • From sequential to multicore • Reminiscing: Instruction Level Parallelism (ILP) • Canonical parallel processing and execution • Rethinking canonical parallel execution • Dynamic Serialization • Consequences of Dynamic Serialization • Wrap up
Microprocessor Generations • Generation 1: Serial • Generation 2: Pipelined • Generation 3: Instruction-level Parallel (ILP) • Generation 4: Multiple processing cores
Microprocessor Generations • Gen 1: Sequential (1970s) • Gen 2: Pipelined (1980s) • Gen 3: ILP (1990s) • Gen 4: Multicore (2000s)
From One Generation to Next • Significant debate and research • New solutions proposed • Old solutions adapt in interesting ways to become viable or even better than new solutions • Solutions that involve changes “under the hood” end up winning over others
From One Generation to Next • From Sequential to Pipelined • RISC (MIPS, Sun SPARC, Motorola 88k, IBM PowerPC) vs. CISC (Intel x86) • CISC architectures learned and employed RISC innovations • From Pipelined to Instruction-Level Parallel • Statically scheduled VLIW/EPIC • Dynamically scheduled superscalar
From One Generation to Next • From ILP to Multicore • Parallelism based upon canonical parallel execution model • Overcome constraints to canonical parallelization • Thread-level speculation (TLS) • Transactional memory (TM)
Reminiscing about ILP • Late 1980s to mid 1990s • Search for a "post-RISC" architecture • More accurately, a new instruction processing model • Desire to do more than one instruction per cycle: exploit ILP • Majority school of thought: VLIW/EPIC • Minority: out-of-order (OOO) superscalar
VLIW/EPIC School • Parallel execution requires a parallel ISA • Parallel execution determined statically (by the compiler) • Parallel execution expressed in the static program • Take program/algorithm parallelism and mold it to a given execution schedule
VLIW/EPIC School • Creating effective parallel representations (statically) introduces several problems • Predication • Statically scheduling loads • Exception handling • Recovery code • Lots of research addressing these problems • Intel and HP pushed it as their future (Itanium)
OOO Superscalar • Create dynamic parallel execution from a sequential static representation • Dynamic dependence information is accurate • Execution schedule is flexible • None of the problems associated with trying to create a parallel representation statically • Natural growth path with no demands on software
Lessons from ILP Generation • Significant consequences of trying to statically detect and express parallelism • Techniques that make "under the hood" changes are the winners • Even though they may have some drawbacks/overheads
The Multicore Generation • How to achieve parallel execution on multiple processors? • Solution critical to the long-term health of the computer and information technology industry • And thus the economy and society as we know it
The Multicore Generation • How to achieve parallel execution on multiple processors? • Over four decades of conventional wisdom in parallel processing • Mostly in the scientific application/HPC arena • Use this as the basis: Parallel Execution Requires a Parallel Representation
Canonical Parallel Execution Model A: Analyze program to identify independence in program • independent portions executed in parallel B: Create static representation of independence • synchronization to satisfy independence assumptions C: Dynamic parallel execution unwinds as per static representation • potential consequences due to static assumptions
Canonical Parallel Execution Model • Like VLIW/EPIC, the canonical model creates a variety of problems that have led to a vast body of research • identifying independence • creating the static representation • dynamic unwinding
Identifying Independence • Static program analysis • Over four decades of work • Hard to identify statically • Inherently dynamic properties • Must be conservative statically • Need to identify dependence in order to identify independence
Creating Static Representation • Parallel representation for guaranteed independent work • Insert synchronization for potential dependences • Conservative synchronization moves parallel execution towards sequential execution
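To make that cost concrete, here is a minimal sketch of conservative synchronization in the canonical model (an illustration under assumed names such as Account and worker, not code from the talk): because static analysis cannot prove that two threads touch different accounts, every update must take a lock, even on runs where the accounts never actually overlap.

#include <mutex>
#include <thread>
#include <vector>

// Hypothetical account type; names are illustrative.
struct Account {
    std::mutex m;
    long balance = 0;
    void deposit(long amt) {
        // Static analysis cannot rule out that two threads hit the same
        // account, so every update pays for the lock, conservatively.
        std::lock_guard<std::mutex> g(m);
        balance += amt;
    }
};

void worker(std::vector<Account>& accts, std::vector<int> ids, long amt) {
    for (int id : ids) accts[id].deposit(amt);
}

int main() {
    std::vector<Account> accts(4);
    // These two threads happen to touch disjoint accounts, but the locking
    // stays: which ids collide is a property of the input, unknown statically.
    std::thread t1(worker, std::ref(accts), std::vector<int>{0, 1}, 10L);
    std::thread t2(worker, std::ref(accts), std::vector<int>{2, 3}, 10L);
    t1.join(); t2.join();
}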
Dynamic Unwinding • Non-determinism • Changes to program state may not be repeatable • Race conditions • Several startup companies exist to deal with this problem
Conventional Wisdom Parallel Execution Requires a Parallel Representation Consequences: • Must create parallel representation • For correct execution, must statically identify: • Independence for parallel representation • Dependence for synchronization • Source of enormous difficulty and complexity • Generally functions of input to program • Inherently dynamic properties
Current Approaches • Stick with canonical model and try to overcome limitations • Thread Level Speculation (TLS) and Transactional Memory (TM) • Techniques to allow programmer to program sequentially but automatically generate parallel representation • Techniques to handle non-determinism and race conditions
TLS and TM • Overcome a major constraint to creating a static parallel representation • Likely in several upcoming microprocessors • Our work in the mid 1990s will be a key enabler • Already in Sun MAJC, NEC Merlot, Sun Rock
Static Program Representation • Can we get parallel execution without a parallel representation? Yes • Can dynamic parallelization extract parallelism that is inaccessible to static methods? Yes
Serialization Sets: What? • Sequential program representation and dynamic parallel execution • No static representation of independence • No locks and no explicit synchronization • "Under the hood" runtime system dynamically determines and orders dependent computations • Independence, and thus parallelism, falls out as a side effect • Comparable or better performance than conventional parallel models
How? Big Picture • Write program in a well-structured object-oriented style • Method operates on data of its associated object (version 1) • Identify parts of program for potential parallel execution • Make suitable annotations as needed • Dynamically determine the data object touched by the selected code • Identify dependence • Program thread assigns selected code to bins
How? Big Picture • Serialize computations to the same object • Enforce dependence • Assign them to the same bin; a delegate thread executes computations in the same bin sequentially • Do not look for/represent independence • Falls out as an effect of enforcing dependence • Computations in different bins execute in parallel • Updates to given state occur in the same order as in the sequential program • Determinism • No races • If the sequential execution is correct, the parallel execution is correct (same input)
Big Picture [Diagram: the program thread delegates binned computations to Delegate Threads 0, 1, and 2]
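To make the picture concrete, here is a minimal sketch of the binning idea (an illustration, not the Prometheus implementation; DelegatePool and delegate are assumed names): the program thread maps each computation's target object to a FIFO bin, and one delegate thread drains each bin in order, so same-object computations serialize while different bins run in parallel.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One FIFO bin per delegate thread; computations on the same object always
// hash to the same bin. Two distinct objects landing in one bin serialize
// unnecessarily, but the result is still correct.
class DelegatePool {
    struct Bin {
        std::mutex m;
        std::condition_variable cv;
        std::queue<std::function<void()>> q;
        bool done = false;
    };
    std::vector<Bin> bins;
    std::vector<std::thread> workers;
public:
    explicit DelegatePool(size_t n) : bins(n) {
        for (size_t i = 0; i < n; i++)
            workers.emplace_back([this, i] {
                Bin& b = bins[i];
                for (;;) {
                    std::unique_lock<std::mutex> lk(b.m);
                    b.cv.wait(lk, [&] { return !b.q.empty() || b.done; });
                    if (b.q.empty()) return;         // done and fully drained
                    auto task = std::move(b.q.front());
                    b.q.pop();
                    lk.unlock();
                    task();                          // same-bin tasks run in program order
                }
            });
    }
    // Bin chosen by object identity: same object, same bin, serialized.
    void delegate(const void* obj, std::function<void()> task) {
        Bin& b = bins[std::hash<const void*>{}(obj) % bins.size()];
        { std::lock_guard<std::mutex> g(b.m); b.q.push(std::move(task)); }
        b.cv.notify_one();
    }
    ~DelegatePool() {
        for (auto& b : bins) { std::lock_guard<std::mutex> g(b.m); b.done = true; b.cv.notify_one(); }
        for (auto& t : workers) t.join();
    }
};

// Usage sketch: all operations on one account serialize; operations on
// different accounts may run in parallel, with no locks in user code.
//   pool.delegate(acct, [acct] { acct->deposit(100); });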
Serialization Sets: How? • Sequential program with annotations • Identify potentially independent methods • Associate a serializer with each object to express dependence • Serializer groups dependent method invocations into a serialization set • Runtime executes each set in order to honor dependences • Independent method invocations land in different sets • Runtime opportunistically parallelizes their execution
Example: Debit/Credit Transactions

trans_t* trans;
while ((trans = get_trans()) != NULL) {      // # of transactions?
    account_t* account = trans->account;     // points to?
    if (trans->type == DEPOSIT)
        account->deposit(trans->amount);
    else if (trans->type == WITHDRAW)
        account->withdraw(trans->amount);
}                                            // loop-carried dependence? Several static unknowns!
Multithreading Strategy
• Read all transactions into an array
• Divide chunks of the array among multiple threads
• Each thread is oblivious to what accounts the other threads may access, so the deposit/withdraw methods must lock the account to ensure mutual exclusion

// each thread processes its own chunk of the transaction array
// (chunk_begin, chunk_end, and trans_array are illustrative names)
for (int i = chunk_begin; i < chunk_end; i++) {
    trans_t* trans = trans_array[i];
    account_t* account = trans->account;
    if (trans->type == DEPOSIT)
        account->deposit(trans->amount);     // locks the account internally
    else if (trans->type == WITHDRAW)
        account->withdraw(trans->amount);    // locks the account internally
}
Example with Serialization Sets

private <account_t> private_account_t;               // declare wrapped account type

begin_nest();                                        // initiate nesting level
trans_t* trans;
while ((trans = get_trans()) != NULL) {
    private_account_t* account = trans->account;
    if (trans->type == DEPOSIT)
        account->delegate(deposit, trans->amount);   // delegate indicates a potentially-independent operation
    else if (trans->type == WITHDRAW)
        account->delegate(withdraw, trans->amount);
}
end_nest();                                          // end nesting level, implicit barrier

At execution, delegate: • creates a method invocation structure • gets the serializer pointer from the base class • enqueues the invocation in the serialization set
[Diagram: the program context delegates each method invocation into a serialization set keyed by account: deposits of $300 and $2000 and withdrawals of $20 and $50 on account 100 enqueue in SS #100; two $1000 withdrawals on account 200 enqueue in SS #200; a $5000 deposit and a $350 withdrawal on account 300 enqueue in SS #300]
[Diagram: delegate threads drain the serialization sets, each in program order: e.g., Delegate 0 executes SS #100's invocations while Delegate 1 executes SS #200's, so different sets proceed in parallel] Race-free, determinate execution without synchronization!
Prometheus: C++ Library for SS • Template library • Compile-time instantiation of SS data structures • Metaprogramming for static type checking • Runtime orchestrates parallel execution • Portable • x86, x86_64, SPARC V9 • Linux, Solaris
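The slides show a private<T> wrapper and a delegate(method, args...) call; the following is a hypothetical sketch of how such a wrapper could be built with templates and compile-time type checking (private_wrapper and enqueue are invented names, and the real Prometheus library certainly differs):

#include <functional>
#include <tuple>
#include <utility>

template <typename T>
class private_wrapper {
    T obj;
    // Placeholder: a real runtime would enqueue into this object's
    // serialization set; here the call simply runs immediately.
    void enqueue(std::function<void()> call) { call(); }
public:
    template <typename Ret, typename... Params, typename... Args>
    void delegate(Ret (T::*method)(Params...), Args&&... args) {
        // Package the invocation now; a delegate thread would run it later,
        // after all earlier invocations on this same object.
        enqueue([this, method,
                 tup = std::make_tuple(std::forward<Args>(args)...)]() mutable {
            std::apply([&](auto&&... a) {
                (obj.*method)(std::forward<decltype(a)>(a)...);
            }, std::move(tup));
        });
    }
};

// Usage sketch with a hypothetical account type: mismatched argument types
// fail to compile, which is the kind of static checking the slide mentions.
struct account_t { long balance = 0; void deposit(long amt) { balance += amt; } };
int main() {
    private_wrapper<account_t> acct;
    acct.delegate(&account_t::deposit, 100L);
}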
Prometheus Runtime • Version 1.0 • Dynamically extracts parallelism • Statically scheduled • No nested parallelism • Version 2.0 • Dynamically extracts parallelism • Dynamically scheduled • Work-stealing scheduler • Supports nested parallelism
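A minimal sketch of the work-stealing idea behind version 2.0 (illustrative only, not the Prometheus scheduler; WorkQueue and worker_loop are assumed names): each worker pushes and pops at the back of its own deque, idle workers steal from the front of a victim's deque, and nested parallelism falls out because a running task can push new work onto its worker's deque.

#include <atomic>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

struct WorkQueue {
    std::mutex m;                                   // a real runtime would use a
    std::deque<std::function<void()>> d;            // lock-free Chase-Lev deque

    void push(std::function<void()> t) {            // owner: push at the back
        std::lock_guard<std::mutex> g(m);
        d.push_back(std::move(t));
    }
    std::optional<std::function<void()>> pop() {    // owner: LIFO pop (locality)
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return std::nullopt;
        auto t = std::move(d.back()); d.pop_back(); return t;
    }
    std::optional<std::function<void()>> steal() {  // thief: FIFO steal (older, larger tasks)
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return std::nullopt;
        auto t = std::move(d.front()); d.pop_front(); return t;
    }
};

// Each worker runs its own queue dry, then scans the others for work.
void worker_loop(size_t self, std::vector<WorkQueue>& queues,
                 std::atomic<bool>& stop) {
    while (!stop.load()) {
        if (auto t = queues[self].pop()) { (*t)(); continue; }
        for (size_t v = 0; v < queues.size(); v++)
            if (v != self)
                if (auto t = queues[v].steal()) { (*t)(); break; }
    }
}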
Network Packet Classification

packet_t* packet;
classify_t* classifier;
vector<int> ruleCount(num_rules);
vector<packet_queue_t> packet_queues;
int packetCount = 0;

for (int i = 0; i < packet_queues.size(); i++) {
    while ((packet = packet_queues[i].get_pkt()) != NULL) {
        int ruleID = classifier->softClassify(packet);
        ruleCount[ruleID]++;
        packetCount++;
    }
}
Example with Serialization Sets

private <classify_t> private_classify_t;             // declare wrapped classifier type
vector<private_classify_t> classifiers;              // one private classifier per queue
int packetCount = 0;
vector<int> ruleCount(numRules, 0);
int size = packet_queues.size();

begin_nest();
for (int i = 0; i < size; i++) {
    classifiers[i].delegate(&classify_t::softClassify, packet_queues[i]);
}
end_nest();                                          // implicit barrier

for (int i = 0; i < size; i++) {
    ruleCount += classifiers[i].getRuleCount();      // element-wise accumulation of per-classifier counts
    packetCount += classifiers[i].getPacketCount();
}
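A note on the design, as the example suggests: making each classifier private to its queue's serialization set removes the only shared state, the rule and packet counters, from the parallel phase, so the queues classify fully in parallel; the counts are then combined in a short sequential reduction after the implicit barrier at end_nest().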
Network Intrusion Detection • Very common networking application • Most common program used: Snort • Open source version (like Linux) • But also commercial versions (Sourcefire) • Basic structure of computation also found in many other deep packet inspection applications • E.g., packet de-duplication (Riverbed)
Other Applications • Benchmarks • Lonestar, NU-MineBench, PARSEC, Phoenix • Conventional Parallelization • pthreads, OpenMP • Prometheus versions • Port program to sequential C++ program • Idiomatic C++: OO, inheritance, STL • Parallelize with serialization sets
Statically Scheduled Results • 4-socket AMD Barcelona (4-way multicore) = 16 total cores [speedup charts]
Summary • Sequential program with annotations • No explicit synchronization, no locks • Programmers focus on keeping computation private to object state • Consistent with OO programming practices • Dependence-based model • Determinate, race-free parallel execution • Do as well or better than incumbents but without their negatives • Can do things that are very hard for incumbents