The eXplicit MultiThreading (XMT) Easy-To-Program Parallel Computer
Uzi Vishkin
www.umiacs.umd.edu/users/vishkin/XMT
Students: just remember to take ENEE459P: Parallel Algorithms, Fall '10. What is a parallel algorithm? Why should I care?
Taste of a Parallel Algorithm
Example: Exchange Problem. 2 bins, A and B. Exchange the contents of A and B. E.g., A=2, B=5 → A=5, B=2.
Algorithm (serial or parallel): X:=A; A:=B; B:=X. 3 ops. 3 steps. Space 1.
Array Exchange Problem. 2n bins: A[1..n], B[1..n]. Exchange A(i) and B(i) for i=1..n.
Serial algorithm:
For i=1 to n do /* serial exchange through the eye of a needle */
  X:=A(i); A(i):=B(i); B(i):=X
3n ops. 3n steps. Space 1.
Parallel algorithm:
For i=1 to n pardo /* 2-bin exchange in parallel */
  X(i):=A(i); A(i):=B(i); B(i):=X(i)
3n ops. 3 steps. Space n.
Discussion
• Parallelism tends to require some extra space.
• The parallel algorithm is clearly faster than the serial one.
• Which is "simpler" and "more natural": serial or parallel? A small sample of people said serial, but only if they majored in CS.
• Eye of a needle: a metaphor for the von Neumann mental and operational bottleneck. It reflects the extreme scarcity of hardware then; less acute now.
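To make the contrast concrete, here is a minimal sketch of both versions in C and in XMTC, the C extension introduced later in this talk (spawn/$ syntax per the XMT tutorial; treat details as approximate):

  /* Serial array exchange: 3n ops, 3n steps, space 1. */
  void exchange_serial(int *A, int *B, int n) {
      for (int i = 0; i < n; i++) {
          int x = A[i];    /* one 2-bin exchange at a time: the eye of the needle */
          A[i] = B[i];
          B[i] = x;
      }
  }

  /* Parallel array exchange in XMTC: 3n ops, 3 steps.
     spawn(low, high) launches one virtual thread per value of $, the
     thread ID; each thread's local x plays the role of X(i), so the
     extra space is n temporaries, one per thread. */
  spawn(0, n - 1) {
      int x = A[$];
      A[$] = B[$];
      B[$] = x;
  }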
Commodity computer systems (Intel Platform 2015, March '05)
Chapter 1, 1946-2003: Serial. 5 KHz → 4 GHz.
Chapter 2, 2004-: Parallel. #"cores": ~d^(y-2003) for year y.
Apple 2004: 1 core. 2013: >100 cores. Windows 7 scales to 256 cores… how to use the other 255? Did I mention ENEE459P?
BIG NEWS: Clock frequency growth is flat. If you want your program to run significantly faster, you're going to have to parallelize it. Parallelism: the only game in town.
#Transistors/chip, 1980-2011: 29K → 30B! Programmer's IQ? Flat.
40 years of parallel computing, and the world is yet to see a successful general-purpose parallel computer: easy to program & good speedups.
[Figure: Historic SPECint 2000 performance by year, from published SPECint data. Is performance at a plateau?]
Students: make yourselves ready for the job market. Serial computing: <1% of computing power. Will serial computing be taught for … history majors?
Welcome to the 2010 Impasse
All vendors are committed to multi-cores. Yet their architecture, and how to program them for single-program completion time, are not clear.
The software spiral (HW improvements → SW improvements → HW improvements) was the growth engine for IT (A. Grove, Intel). Alas, now broken! SW vendors avoid investment in long-term SW development since they may bet on the wrong horse. The impasse is bad for business.
Parallel programming education: does a CS&E degree mean being trained for a 50-year career dominated by parallelism by programming yesterday's serial computers? ENEE459P teaches: (i) the common denominator, and (ii) the main approaches.
Serial Abstraction & A Parallel Counterpart
[Figure: serial execution, based on the serial abstraction (Time = Work), vs. parallel execution, based on the parallel abstraction (Time << Work, where Work = total #ops): "What could I do in parallel at each step, assuming unlimited hardware?"]
• The rudimentary abstraction that made serial computing simple: any single instruction available for execution in a serial program executes immediately. It abstracts away the different execution times of different operations (e.g., the memory hierarchy), is used by programmers to conceptualize serial computing, and is supported by hardware and compilers. The program provides the instruction to be executed next (inductively).
• The rudimentary abstraction for making parallel computing simple: indefinitely many instructions that are available for concurrent execution execute immediately, dubbed Immediate Concurrent Execution (ICE). Step-by-step (inductive) explication of the instructions available next for concurrent execution. The number of processors is not even mentioned. Falls back on the serial abstraction if there is 1 instruction per step.
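As a quick illustration of Time << Work under ICE (a standard example, not from the slides): summing n numbers by a balanced tree costs n-1 additions of work but only log2(n) time, since each step halves the number of live partial sums. A minimal XMTC-style sketch, assuming n is a power of two:

  /* Balanced-tree summation: Work = n-1 additions, Time = log2(n) rounds.
     Each round executes every currently-available addition concurrently,
     exactly as ICE prescribes. The total ends up in A[0]. */
  for (int stride = 1; stride < n; stride *= 2) {
      spawn(0, n / (2 * stride) - 1) {
          A[2 * stride * $] += A[2 * stride * $ + stride];
      }
  }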
Explicit Multi-Threading (XMT)
1979-: THEORY. Figure out how to think algorithmically in parallel. Outcome in a nutshell: the above abstraction.
1997-: XMT@UMD. Derive specs for the architecture; design and build.
UV, Using Simple Abstraction to Guide the Reinvention of Computing for Parallelism, http://www.umiacs.umd.edu/users/vishkin/XMT/cacm2010.pdf, to appear in CACM.
Not just talking
PRAM-On-Chip HW prototypes:
• 64-core, 75 MHz FPGA of the XMT (Explicit Multi-Threaded) architecture [SPAA'98..CF'08]
• 128-core interconnection network, IBM 90nm: 9mm × 5mm, 400 MHz [HotI'07]
• FPGA design → ASIC, IBM 90nm: 10mm × 10mm, 150 MHz
Algorithms: PRAM parallel algorithmic theory. "Natural selection". A latent, though not widespread, knowledge base. "Work-depth". SV82 conjectured: the rest (the full PRAM algorithm) is just a matter of skill. Lots of evidence that "work-depth" works; used as the framework in the main PRAM algorithms texts: JaJa92, KKT01.
Programming & workflow: rudimentary yet stable compiler. The architecture scales to 1000+ cores on-chip.
Participants
Grad students: Aydin Balkan (PhD), George Caragea, James Edwards, David Ellison, Mike Horak (MS), Fuat Keceli, Beliz Saybasili, Alex Tzannes, Xingzhi Wen (PhD)
Industry design experts (pro bono).
• Rajeev Barua, compiler. Co-advisor of 2 CS grad students. 2008 NSF grant.
• Gang Qu, VLSI and power. Co-advisor.
• Steve Nowick, Columbia U., asynchronous computing. Co-advisor. 2008 NSF team grant.
• Ron Tzur, Purdue U., K12 education. Co-advisor. 2008 NSF seed funding. K12: Montgomery Blair Magnet HS, MD; Thomas Jefferson HS, VA; Baltimore (inner city) Ingenuity Project Middle School 2009 Summer Camp; Montgomery County Public Schools.
• Marc Olano, UMBC, computer graphics. Co-advisor.
• Tali Moreshet, Swarthmore College, power. Co-advisor.
• Marty Peckerar, microelectronics
• Igor Smolyaninov, electro-optics
• Funding: NSF, NSA (2008 deployed XMT computer), NIH
• Industry partner: Intel
Started from core CS. Built the HW + compiler foundation. Ready for ~10 timely CS PhD theses, ~2 in Education, and ~10 in ECE.
More on ENEE459P, Fall 2010
• Parallel algorithmic thinking (PAT) based on first principles. More challenging to self-study.
• Mainstream computing → parallelism: chaotic. Hence, pluralism is valuable.
• ENEE459P: jointly taught by 2 instructors via video conferencing with U. Illinois. CS@Illinois: top 5. Parallel@Illinois: #1.
• A joint course on a timely topic: an extremely rare opportunity. More than "2 for the price of one": 2 courses, each with 1 instructor, would lack the interaction.
• Advanced by Google, Intel, and Microsoft, the introduction of parallelism into the curriculum dominated the recent flagship Computer Science Education Conference. Several speakers, including a keynote by the Director of Education at Intel, reported that:
(1) in job interviews, employers now expect an intelligent discussion of parallelism; and
(2) international competition recognizes this: 85% of the people who have been trained in parallel programming are outside the U.S.
[Figure: Membership in the Intel Academic Community, which implements parallel computing into the CS curriculum: 85% outside the USA. Source: M. Wrinn, Intel]
The Pain of Parallel Programming
• Parallel programming is currently too difficult. To many users, programming existing parallel computers is "as intimidating and time consuming as programming in assembly language" [NSF Blue-Ribbon Panel on Cyberinfrastructure]. AMD/Intel: "Need a PhD in CS to program today's multicores".
• The real problem: parallel architectures were built using the following "methodology": build first, figure out how to program later. [J. Hennessy: "Many of the early ideas were motivated by observations of what was easy to implement in the hardware rather than what was easy to use".]
2nd Example of a PRAM-like Algorithm
Input: (i) all world airports; (ii) for each, all its non-stop flights.
Find: the smallest number of flights from DCA to every other airport.
Basic (actually parallel) algorithm:
Step i: For all airports requiring i-1 flights
  For all their outgoing flights
    Mark (concurrently!) all "yet unvisited" airports as requiring i flights (note the nesting)
Serial: forces an eye-of-the-needle queue; one must prove the result is still the same as the parallel version's. O(T) time; T = total # of flights.
Parallel: parallel data structures. Inherent serialization: S. Gain relative to serial: (first cut) ~T/S! Decisive also relative to coarse-grained parallelism.
Notes: (i) "Concurrently", as in natural BFS, is the only change to the serial algorithm. (ii) No "decomposition"/"partition". (iii) Speed-up w.r.t. a GPU of the same silicon area on a highly parallel input: 5.4X! But a SMALL CONFIGURATION on a 20-way parallel input: 109X w.r.t. the same GPU.
Mental effort of PRAM-like programming: (1) sometimes easier than serial; (2) considerably easier than for any parallel computer currently sold. Understanding falls within the common denominator of the other approaches.
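A hedged sketch of one round (step i) of this algorithm in XMTC, using the Spawn and Prefix-Sum commands described later in the talk. The adjacency arrays (vertex/edge), the frontier arrays, and the gatekeeper array are illustrative names, not code from the talk, and the ps/psm syntax follows the XMT tutorial, so treat details as approximate:

  /* Expand one BFS level: every airport in frontier[0..fsize-1] needs
     i-1 flights; mark its yet-unvisited neighbors as needing i flights.
     psm() is XMTC's atomic prefix-sum to memory: only the first thread
     to reach airport w gets e == 0, so w enters the next frontier once.
     ps() then hands that thread a unique slot in next[]. */
  psBaseReg nsize;
  nsize = 0;
  spawn(0, fsize - 1) {
      int v = frontier[$];
      for (int j = vertex[v]; j < vertex[v + 1]; j++) { /* v's outgoing flights */
          int w = edge[j];
          int e = 1;
          psm(e, &gatekeeper[w]);
          if (e == 0) {            /* first visitor to w */
              level[w] = i;
              int slot = 1;
              ps(slot, nsize);
              next[slot] = w;
          }
      }
  }
  /* After the implicit join, next[0..nsize-1] is the new frontier. */

For brevity the loop over a single airport's flights is serialized within its thread; the slide's nested "For all their outgoing flights" would itself be spawned in a fully nested-parallel XMT program.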
Back to the education crisis
The CTO of NVidia and the official leader of multi-cores at Intel: teach parallelism as early as you can. Reason: we don't only under-teach; we mis-teach, since students acquire bad habits. The current situation is unacceptable; a sort of malpractice.
Some possibilities:
• Teach it as a major elective.
• Teach all CS&E undergrads.
• Teach CS&E freshmen and invite all Engineering, Math, and Science students; this sends the message "CS&E is where the action is".
Need
A general-purpose parallel computer framework ["successor to the Pentium for the multi-core era"] that:
(i) is easy to program;
(ii) gives good performance with any amount of parallelism provided by the algorithm; namely, up- and down-scalability, including backwards compatibility on serial code;
(iii) supports application programming (VHDL/Verilog, OpenGL, MATLAB) and performance programming; and
(iv) fits current chip technology and scales with it (in particular: strong speed-ups for single-task completion time).
Main point of the talk: PRAM-On-Chip@UMD is addressing (i)-(iv).
The PRAM Rollercoaster Ride
Late 1970s: theory work began.
UP: Won the battle of ideas on parallel algorithmic thinking. No silver or bronze! The model of choice in all theory/algorithms communities. 1988-90: big chapters in standard algorithms textbooks.
DOWN: FCRC'93: "PRAM is not feasible". ['93+: despair, with no good alternative! Where did vendors expect good-enough alternatives to come from in 2008?]
The device that changed it all: the # of on-chip transistors.
UP: Highlights: the eXplicit Multi-Threaded (XMT) FPGA-prototype computer (not a simulator), SPAA'07, CF'08; 90nm ASIC tape-outs: interconnection network, HotI'07, and XMT.
How come? A crash "course" on parallel computing: how much processors-to-memories bandwidth?
• Enough: ideal programming model (PRAM).
• Limited: programming difficulties.
How does it work
"Work-depth" algorithms methodology (source: SV82): state all the ops you can do in parallel; repeat. Minimize: total #operations, #rounds. The rest is skill.
• Program in single-program multiple-data (SPMD) style. Short (not OS) threads. Independence of order semantics (IOS). XMTC: C plus 3 commands: Spawn+Join, Prefix-Sum. Unique: first parallelism, then decomposition.
• Programming methodology: algorithms → effective programs. Extend the SV82 work-depth framework from PRAM to XMTC. Or: established APIs (VHDL/Verilog, OpenGL, MATLAB); a "win-win proposition".
• Compiler: minimize the length of the sequence of round-trips to memory; take advantage of architecture enhancements (e.g., prefetch). [Ideally, given an XMTC program, the compiler provides the decomposition: "teach the compiler".]
• Architecture: dynamically load-balance concurrent threads over processors. "The OS of the language". (Prefix-sum to registers and to memory.)
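To show what the XMTC commands look like in practice, here is the classic array-compaction example, written following the XMT tutorial (a sketch; syntax details approximate). IOS lets the nonzeros land in B in whatever order the threads happen to reach the prefix-sum:

  /* Array compaction in XMTC: copy the nonzeros of A[0..n-1] into B.
     ps(e, count) atomically adds e to the psBaseReg count and returns
     count's old value in e, giving each thread a unique slot in B. */
  psBaseReg count;
  count = 0;
  spawn(0, n - 1) {
      int e = 1;
      if (A[$] != 0) {
          ps(e, count);
          B[e] = A[$];
      }
  }
  /* After the join, count holds the number of nonzeros. */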
PERFORMANCE PROGRAMMING & ITS PRODUCTIVITY
[Workflow diagram, four numbered paths. XMT path: basic algorithm (sometimes informal) → add parallel data structures (for the PRAM-like algorithm) → parallel program (XMT-C) → XMT computer (or simulator); low overheads! Serial path: add data structures (for the serial algorithm) → serial program (C) → standard computer. Culler-Singh parallel-programming path: decomposition → assignment → orchestration → mapping → parallel computer.]
Claims (numbers refer to the diagram's paths):
• 4 is easier than 2
• problems with 3
• 4 is competitive with 1: cost-effectiveness; natural
APPLICATION PROGRAMMING & ITS PRODUCTIVITY
[Workflow diagram. Application programmer's interfaces (APIs: OpenGL, VHDL/Verilog, MATLAB) → compiler → serial program (C) on a standard computer, or parallel program (XMT-C) on the XMT architecture (simulator). Automatic? Yes / maybe / yes. Culler-Singh parallel-programming path: decomposition → assignment → orchestration → mapping → parallel computer.]
Naming Contest for the New Computer
• "Paraleap", chosen out of ~6000 submissions.
A single (hard-working) person (X. Wen) completed the synthesizable Verilog description AND the new FPGA-based XMT computer in slightly more than two years, with no prior design experience. This attests to the basic simplicity of the XMT architecture: faster time to market, lower implementation cost.
Experience with High School Students, Fall '07
Gave a 1-day parallel algorithms tutorial to 12 HS students. Some (two 10th graders) managed 8 programming assignments, including 5 of the 6 in the grad course. The only help: 1 office hour/week by an undergrad TA. No school credit; part of a computer club after an 8-period day.
May-June '08: 23 HS students, taught by a self-taught HS teacher, Alexandria, VA.
Spring '08: course for non-major freshmen (UMD Honors): how will programmers have to think by the time you graduate?
Spring '08: course for seniors.
NEW: Software release
Allows you to use your own computer for programming in an XMT environment and experimenting with it, including:
• A cycle-accurate simulator of the XMT machine
• A compiler from XMTC to that machine
Also provided: extensive material for teaching or self-studying parallelism, including:
• Tutorial + manual for XMTC (150 pages)
• Class notes on parallel algorithms (100 pages)
• Video recording of the 9/15/07 HS tutorial (300 minutes)
Next major objective: an industry-grade chip and a production-quality compiler. Requires 10X in funding.