Breaking Down the Memory Wall for Future Scalable Computing Platforms Wen-mei Hwu Sanders-AMD Endowed Chair Professor with John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li, Hillery C. Hunter, Ronald D. Barnes, Shane Ryoo, Sain-Zee Ueng, James W. Player, Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd, Dan R. Burke, Nacho Navarro, Steven S. Lumetta University of Illinois at Urbana-Champaign
[Charts: normalized frequency vs. normalized leakage (Isb) for 130nm parts, showing a ~30% spread in frequency and a ~5X spread in leakage; copper interconnect RC delay of a 1mm wire vs. clock period (ps) across the 350nm through 65nm process nodes.]
Trends in hardware
• High variability
  • Increasing speed and power variability of transistors
  • Limited frequency increase
  • Reliability / verification challenges
• Large interconnect delay
  • Increasing interconnect delay and shrinking clock domains
  • Limited size of individual computing engines
Data: Shekhar Borkar, Intel
Trends in architecture
• Transistors are free… until connected or used
• Continued scaling of the traditional processor core is no longer economically viable
  • 2-3X effective area yields only ~1.6X performance [PollackMICRO32]
  • Verification, power, transistor variability
• Only obvious scaling route: “multi-everything”
  • Multi-thread, multi-core, multi-memory, multi-?
• Conventional wisdom: distributed parallelism is easy to design
  • But what about software?
  • If you build a better mousetrap…
A “multi-everything” processor of the future
• Distributed, less complex components
  • Variability, power density, and verification become easier to address
• Who bears the SW mapping burden?
  • General-purpose software changes are prohibitively expensive (cf. SIMD, IA-64)
• Advanced compiler features: “Deep Analysis”
• New programming models / frameworks
• Interactive compilers
General-purpose processor component(s)
• The system director
• Performs traditionally-programmed tasks
  • Software migration starts here
• Likely multiple GPPs
• Less complex processor cores
Computational efficiency through customization
• Goal: offload most processing to more specialized, more efficient units
• Application Processors (APPs)
  • Specialized instruction sets, memory organizations, and access facilities
• Programmable Accelerators (ACCs)
  • Think ASIC with knobs
  • Highly-specialized pipelines
  • Approximate ASIC design points
• Higher performance/watt than general purpose for target applications
Memory efficiency through diversity
• The traditional monolithic memory model is a major power / performance sink
• Need a partnership between the general-purpose memory hierarchy and software-managed memories
• Local memories will reduce unnecessary memory traffic and power consumption
• Bulk data transfers are scheduled by the Memory Transfer Module
• Software will gradually adopt a decentralized model for power and bandwidth
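The bulk-transfer pattern the slide describes can be sketched in C. The `mtm_pull`/`mtm_push` interface below is hypothetical, standing in for whatever the Memory Transfer Module would actually expose; here the "transfer" is just a `memcpy` into a software-managed buffer.

```c
#include <stddef.h>
#include <string.h>

/* Sketch, assuming a hypothetical MTM interface: stage a block of data
 * into a small software-managed local memory, compute on it, and write
 * the result back in bulk, instead of issuing per-element traffic to
 * the shared memory hierarchy. */

#define LOCAL_WORDS 64
static int local_mem[LOCAL_WORDS];   /* software-managed local memory */

static void mtm_pull(const int *src, size_t n)   /* one bulk read */
{
    memcpy(local_mem, src, n * sizeof *src);
}

static void mtm_push(int *dst, size_t n)         /* one bulk write */
{
    memcpy(dst, local_mem, n * sizeof *dst);
}

/* Process one tile (n <= LOCAL_WORDS) entirely out of local memory. */
void scale_tile(int *data, size_t n, int k)
{
    mtm_pull(data, n);
    for (size_t i = 0; i < n; i++)
        local_mem[i] *= k;           /* every access hits local memory */
    mtm_push(data, n);
}
```

The point is the traffic shape: two scheduled bulk transfers bracket the kernel, so the compute loop itself never touches the shared hierarchy.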
Tolerating communication & adding macropipelining
• Bulk communication overhead is often substantial for traditional accelerators
• The shared memory / snooping communication approach limits available bandwidth
• Compilation tools will have to seamlessly connect processors and accelerators
• Accelerators will be able to operate on bulk-transferred, buffered data… or on streamed data
[Block diagrams: Intel IXP2400 and IXP1200 network processors (XScale or StrongARM core plus arrays of microengines, scratchpad SRAM, hash engine, CSRs, PCI, SPI4 / CSIX and RFIFO/TFIFO interfaces, QDR SRAM and RDRAM channels); Philips Nexperia (Viper) with MIPS core, VLIW media processor (MSP), MPEG video, ARM, microengines, and access control.]
Embedded systems are already trying out this paradigm: the Intel IXP1200 and IXP2400 network processors and the Philips Nexperia (Viper).
Decentralizing parallelism in a JPEG decoder
[Figure: conceptual dataflow view of two JPEG decoding steps]
• Convert a typical media-processing application to the decentralized model
• Arrays used to implement streams
• Multiple loci of computation with various models of parallelism
• Memory access bandwidth is a bottleneck without private data
Data privatization and local memory
[Figure: conceptual dataflow view of two JPEG decoding steps]
• Accelerate color conversion first (execute in ACC or APP)
• Main processor sends inputs, receives outputs
• Large tables make it inefficient to send data from the main processor
• Tables need to reside in the accelerator for efficient access
• Tables are initialized once during program execution and never modified again
• Accurate pointer analysis is necessary to determine this
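The init-once table pattern is easy to see in C. The sketch below follows the usual JPEG YCbCr math (R = Y + 1.402·(Cr − 128)) in the style of libjpeg's table-driven color conversion, but the names and constants are illustrative, not the talk's actual code. A pointer analysis that proves `cr_r_tab` is written only in `build_tables` can privatize it into accelerator-local memory.

```c
/* Sketch: a constant lookup table for one channel of YCbCr-to-RGB
 * conversion.  The table is filled once at startup and only read
 * afterwards, so it can live inside the accelerator instead of being
 * shipped with every block of pixels. */

#define MAXJSAMPLE 255
static int cr_r_tab[MAXJSAMPLE + 1];   /* init-once, read-only thereafter */

void build_tables(void)                /* runs once during startup */
{
    for (int i = 0; i <= MAXJSAMPLE; i++)
        cr_r_tab[i] = (int)(1.402 * (i - 128)
                            + ((i - 128) >= 0 ? 0.5 : -0.5)); /* round */
}

static int clamp(int v)
{
    return v < 0 ? 0 : v > MAXJSAMPLE ? MAXJSAMPLE : v;
}

/* Per-pixel conversion is now a table lookup: no multiplies. */
int ycc_to_r(int y, int cr)
{
    return clamp(y + cr_r_tab[cr]);
}
```

Proving "initialized once, never modified again" is exactly the job the slide assigns to accurate pointer analysis: without it, the compiler must assume any store through a pointer might touch the table.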
Increasing parallelism
• Heavyweight loop nests communicate through an intermediate array
• Direct streaming of the data is possible and supports higher parallelism (macropipelining)
• The Convert() and Upsample() loops can be chained
• Accurate interprocedural dataflow analysis is necessary
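The two structures can be contrasted in a few lines of C. `upsample2x` and `convert` below are stand-in kernels, not the JPEG routines; the point is that the chained form streams each value straight from one stage to the next, so the intermediate array never exists.

```c
/* Sketch: staged vs. chained execution of an Upsample() -> Convert()
 * pair.  The staged version materializes the whole intermediate array;
 * the chained (macropipelined) version eliminates it. */

#define N 4

static int upsample2x(int v) { return v; }          /* repeat each sample */
static int convert(int v)    { return v * 2 + 1; }  /* placeholder math  */

void staged(const int in[N], int out[2 * N])
{
    int mid[2 * N];                                 /* intermediate array */
    for (int i = 0; i < 2 * N; i++)
        mid[i] = upsample2x(in[i / 2]);
    for (int i = 0; i < 2 * N; i++)
        out[i] = convert(mid[i]);
}

void chained(const int in[N], int out[2 * N])
{
    for (int i = 0; i < 2 * N; i++)                 /* one pass, no buffer */
        out[i] = convert(upsample2x(in[i / 2]));
}
```

The transformation is only legal if interprocedural dataflow analysis proves the second loop consumes exactly what the first produces, in a compatible order, which is the analysis burden the slide names.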
How the next-generation compiler will do it (1)
[Figure: heavyweight loops highlighted in the source]
• To-do list:
  • Identify acceleration opportunities
  • Localize memory
  • Stream data and overlap computation
• Acceleration opportunities:
  • Heavyweight loops identified for acceleration
  • However, they are isolated in separate functions called through pointers
How the next-generation compiler will do it (2)
[Figure: initialization code and large constant lookup tables identified]
• To-do list:
  • Identify acceleration opportunities
  • Localize memory
  • Stream data and overlap computation
• Localize memory:
  • Pointer analysis identifies localizable memory objects
  • Private tables inside the accelerator are initialized once, saving most traffic
How the next-generation compiler will do it (3)
[Figure: constant table privatized; input and output access patterns summarized]
• To-do list:
  • Identify acceleration opportunities
  • Localize memory
  • Stream data and overlap computation
• Streaming and computation overlap:
  • Memory dataflow summarizes array/pointer access patterns
  • Opportunities for streaming are automatically identified
  • Unnecessary memory operations are replaced with streaming
How the next-generation compiler will do it (4)
• To-do list:
  • Identify acceleration opportunities
  • Localize memory
  • Stream data and overlap computation
  • Achieve macropipelining of parallelizable accelerators
• Upsampling and color conversion can stream to each other
• Optimizations can have a substantial effect on both efficiency and performance
Memory dataflow in the pointer world
[Figure: an array of constant pointers to row arrays; the row arrays never overlap]
• Arrays are not true 3D arrays (unlike in Fortran)
• Actual implementation: an array of pointers to arrays of samples
• New type of dataflow problem: understanding the semantics of memory structures instead of true arrays
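The structure the slide describes is the standard C idiom for image buffers, sketched below. To treat `img[r][c]` like a Fortran-style array, the compiler's memory dataflow must prove that the row pointers are never reassigned and that the rows never alias.

```c
#include <stdlib.h>

/* Sketch: a "2D array" built as an array of row pointers.  Each row is
 * a separately allocated, non-overlapping object, so img[r1] and
 * img[r2] never alias for r1 != r2; that is exactly the fact a
 * pointer/memory-dataflow analysis has to establish, since the C type
 * system does not say so.  The caller owns and frees the storage. */
int **alloc_image(int rows, int cols)
{
    int **img = malloc(rows * sizeof *img);    /* array of row pointers */
    for (int r = 0; r < rows; r++)
        img[r] = calloc(cols, sizeof **img);   /* distinct row objects  */
    return img;
}
```

In Fortran the non-overlap of rows is a property of the language; in C it is a property of this particular allocation pattern, which is why the slide calls it a new type of dataflow problem.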
Compiler vs. hardware memory walls
• Hardware memory wall
  • The prohibitive implementation cost of a memory system that keeps up with processor speed within the power budget
• Compiler memory wall
  • The use of memory as a generic pool obstructs the compiler's view of the true program and data structures
• The decentralized and diversified memory approach is key to breaking the hardware memory wall
• Breaking the compiler memory wall will be increasingly important in breaking the hardware memory wall
Pointer analysis: sensitivity, stability and safety [PASTE2004]
[Figure: the blurred call graph produced by a less accurate analysis vs. the accurate call graph recovered by the improved algorithms]
• Improved efficiency increases the scope over which unique, heap-allocated objects can be discovered
• Improved analysis algorithms provide more accurate call graphs, instead of a blurred view, for use by program transformation tools
Pointer analysis: sensitivity, stability and safety
• Analysis is abstract execution
  • Simplifying abstractions buy analysis stability, but produce “unrealizable dataflow” results
• Many components of accuracy
  • It is typical to cut some corners to enable a “key” component for particular applications
  • Making the components usefully compatible is a major contribution
  • No a priori corner-cutting means better results across a broad code base
• Safety in “unsafe” languages
  • C poses major challenges
  • The efficiency challenge is increased in safe algorithms
How do sensitivity, stability and safety coexist?
• Our two-pronged approach to sensitive, stable, safe pointer analysis:
  • Summarization: only relevant details are forwarded to a higher level
  • Containment: the algorithm can cut its losses locally (like a bulkhead) to avoid a global explosion in problem size
• Example: summarization-based context sensitivity…
Context sensitivity: naïve inlining
[Figure: callee inlined at each call site]
• Excess statements are unnecessary and costly
• Retention of side effects still leads to spurious results
Context sensitivity: summarization-based
[Figure: a compact summary of the jade example is used; assignments are BLOCKed to prevent contamination]
• The summary accounts for all side effects
• Now, only the correct result is derived
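A minimal C example of why context sensitivity matters, hedged as an illustration rather than the talk's actual benchmark: `id` is called from two sites with unrelated pointers. A context-insensitive analysis merges the two calls and concludes the result may point to either `x` or `y` (an "unrealizable" flow); a summary-based context-sensitive analysis keeps the calls separate, matching what actually happens at run time.

```c
/* The callee to be summarized: its effect is simply "returns its
 * argument", which a compact summary can state once and apply per
 * call site. */
static int *id(int *p) { return p; }

int distinct_results(void)
{
    int x = 1, y = 2;
    int *px = id(&x);        /* context 1: result points only to x */
    int *py = id(&y);        /* context 2: result points only to y */
    return *px * 10 + *py;   /* the two contexts never mix */
}
```

Naïve inlining would get the same precision by duplicating the body of `id` at every call site; the summary achieves it while visiting the callee once, which is what contains problem growth on large programs.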
Analyzing large, complex programs [SAS2004]
[Chart: problem size vs. contexts analyzed, log scale from 10^4 to 10^12]
• Originally, problem size exploded toward 10^12 as more contexts were encountered
• The new algorithm contains problem size with each additional context
• The result is an efficient analysis process without loss of accuracy
The outlook in software
• Software is changing too, though more gradually
• Applications driving development are rich in parallelism
  • Physical world: medicine, weather
  • Video, games: signal & media processing
• Source code availability
  • Open Source continues to grow
  • Microsoft's Phoenix Compiler Project
• New programming models
  • Enhanced developer productivity & enhanced parallelism
Beyond the traditional language environment
• Domain-specific, higher-level modeling languages
  • More intuitive than C for inherently parallel problems
  • Implementation details abstracted away from developers: increased productivity, increased portability
• Still an important role for the compiler in this domain
  • Little visibility “through” the model for low-level optimization by developers
  • Communication and memory optimization will be critical in next-generation systems
  • The model can provide structured semantics for the compiler, beyond what can be derived from analysis of low-level code
• As new system models are developed, compilers, modeling languages, and developers will take on new, interactive roles
Domain-specific modeling and optimization
• The programming model provides the compiler with information that cannot be extracted by analysis alone
• The compiler breaks the limitations imposed by the model, allowing for efficient, high-performance binaries
Concluding thoughts
• Reaching the true potential of multi-everything hardware
  • Scalability requires distributed parallelism and memory models
  • Requires new compilation tools to break the compiler memory wall
• A broad suite of analyses is necessary
  • Advanced pointer analysis
  • Memory dataflow analysis
  • New interactions of classical analyses
• This is not just reinventing HPF
  • New distributed parallelism paradigms
  • New applications bring new challenges!
• As the field develops, new domain-specific programming models will also benefit from advanced compilation technology