Climate Machine Update
David Donofrio
RAMP Retreat, 8/20/2008
Agenda
• Project Overview
• Tensilica Architecture and Design Flow
• Tensilica Tools Demo
• Why we need RAMP
• Current Progress
• Next Steps
A New Approach to HPC
• Current HPC design approach:
  • Leverage commodity processors from Intel, AMD, etc.
  • Once the machine is built, optimize problems to run on it
• Power wall prevents scaling to exaflop performance
• Power is the new design point
Olukotun and Sutter
Moore's Law is still in effect, but the number of processors doubles every 18 months rather than the clock rate
A New Approach to HPC
• Our approach:
  • Identify the application, then tailor the machine using semi-custom design
  • Optimize the CPU architecture and further extend it with a semi-custom ISA
  • Leverage auto-tuning to access architecture-specific optimizations
• Even if each simple core is 1/4 as computationally efficient as a complex core, you can fit hundreds on a single die and be 100x more power efficient
• Learn from the embedded market, where Flops / Watt and rapid design cycles are crucial
• Start with building blocks from embedded designs rather than a full-custom ASIC
• Preserve the ability to run general-purpose C code
• Application target: 1 km scale climate model
Tailor machine architecture to the application to reduce waste
Climate Model Resource Requirements
• DOE has identified high-resolution climate modeling as a leading justification for exascale computing
• Must express 20M-way parallelism
• Requires performance of 200 Pflops peak
• Simulation must run 1000x faster than real time
• Amenable to massively concurrent architectures composed of power-efficient embedded cores
• Actively working with the climate science community to enable new icosahedral model
NASA; Randall / CSU
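The requirements above imply some simple arithmetic, sketched here in Python as an illustration only. The inputs are the slide's figures; the even split of peak flops across all threads and the 365-day year are simplifying assumptions, and the function names are hypothetical.

```python
# Figures quoted on the slide.
PEAK_FLOPS = 200 * 10**15      # 200 Pflops peak
PARALLELISM = 20 * 10**6       # 20M-way parallelism
SPEEDUP_VS_REAL_TIME = 1000    # simulation runs 1000x faster than real time


def flops_per_thread(peak_flops, parallelism):
    """Average peak rate per thread, assuming an even split (an assumption)."""
    return peak_flops / parallelism


def wall_clock_days(simulated_years, speedup, days_per_year=365.0):
    """Wall-clock days needed to simulate the given number of years."""
    return simulated_years * days_per_year / speedup


per_thread = flops_per_thread(PEAK_FLOPS, PARALLELISM)
century_days = wall_clock_days(100, SPEEDUP_VS_REAL_TIME)

print(f"{per_thread / 1e9:.1f} Gflops per thread")            # 10.0 Gflops per thread
print(f"{century_days:.1f} days per simulated century")       # 36.5 days per simulated century
```

Under these assumptions each of the 20M threads needs about 10 Gflops of peak rate, and a simulated century takes a little over a month of wall-clock time.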
Tensilica Processor Design Flow
• Complete solution: hardware, software and verification
• Fully customizable
• Required base ISA ensures general-purpose applications can still run
• Processor configuration is submitted to Tensilica's servers, where synthesis is performed
• Returned design can be spun for ASIC or FPGA
• Bit file available for Avnet boards
• Building-block approach drastically reduces design cycle time compared to full-custom design
Tensilica Inc.
Tensilica Architecture Features
• Verilog-like TIE language allows for custom ISA extensions
• Functional and performance verification built in
• Auto-generated compiler intrinsics
• 64-bit IEEE-DP floating point coded up in TIE and available
• Custom VLIW support
• Inter-processor communication easily enabled through:
  • TIE Ports
  • TIE Queues
  • Access to direct HW support for inter-processor communication
  • TIE Lookups
  • Allows interface to external ROMs or other RTL blocks
Tensilica Architecture Overview
Tensilica Inc.
Tensilica Performance Debug
• Processor viewed as a black box
• State can be compressed (via HW) and pushed out the JTAG port
• Intended for program replay
• Xtensa trace port gives real-time visibility into internal pipeline state with unprecedented detail:
  • Cache hit / miss with virtual address
  • Branch taken / not taken
  • Call / return
  • Resource dependency
  • Etc…
• Opportunity for hundreds of performance counters to be made available
Tensilica Inc.
Why we need RAMP
• Fast, accurate emulation enables:
  • Dual nested loop of HW / SW co-design
  • Preliminary work using the Stanford SM simulator shows significant improvement in power efficiency using automated HW / SW co-tuning
• RAMP is critical to accelerate:
  • Rapid prototyping and analysis of Tensilica architectural options
  • Inter-processor communication architecture exploration
  • Running the FULL climate code, providing a more complete performance picture
• Cycle-accurate simulator currently runs at ~100 kHz vs. 50 MHz on Virtex-5 (V5)
• Extensive HW performance counter data enables an emulation environment with similar resolution but much greater speed
Tensilica-provided emulation environment kick-starts this effort
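To put the simulator vs. FPGA clock rates in perspective, here is a rough Python sketch. The two clock rates are the slide's figures; the workload size EXAMPLE_CYCLES is a made-up number chosen only for illustration, and the comparison assumes one emulated cycle per FPGA clock.

```python
SIM_HZ = 100e3    # ~100 kHz cycle-accurate software simulator (from the slide)
FPGA_HZ = 50e6    # 50 MHz on Virtex-5 (from the slide)

# Hypothetical workload size, not a measured figure for the climate code.
EXAMPLE_CYCLES = 1e12


def speedup(fast_hz, slow_hz):
    """How many times faster the fast platform clocks than the slow one."""
    return fast_hz / slow_hz


def runtime_seconds(cycles, hz):
    """Time to execute a given number of target cycles at a given rate."""
    return cycles / hz


fpga_speedup = speedup(FPGA_HZ, SIM_HZ)
sim_days = runtime_seconds(EXAMPLE_CYCLES, SIM_HZ) / 86400
fpga_hours = runtime_seconds(EXAMPLE_CYCLES, FPGA_HZ) / 3600

print(f"FPGA emulation is {fpga_speedup:.0f}x faster than the simulator")
print(f"{EXAMPLE_CYCLES:.0e} cycles: {sim_days:.1f} days simulated, "
      f"{fpga_hours:.1f} hours on FPGA")
```

With the slide's rates the gap is 500x, which is the difference between a run that takes months in simulation and one that finishes in hours on the FPGA for a workload of this hypothetical size.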
Current Status
• ML505 used for initial design exploration
• Basic Xtensa processor + JTAG and memory controller is ~50% of a Virtex-5 50T
  • Runs at 50 MHz
  • ASIC in 65G process runs at 650 MHz
• On-chip debug working
  • Can load / run programs using main memory synthesized from BRAM
• DRAM interface coded, currently being debugged
• RTL license recently obtained; full simulation environment (in ModelSim) being brought up
Next Steps…
• Transition to BEE3 from ML505
• Bring up XTOS environment on a single Xtensa processor on BEE3
• Run a single column of climate code on a single processor
• Demo at SC’08 in November
• Continue HW / SW co-tuning optimization
• Begin multi-processor emulation
  • Emulation of a single socket, 32 cores, using networked BEE3s
  • Running the full 2-million-line climate model
The Need for Exascale Computing
• DOE has identified high-resolution climate modeling as a leading justification for exascale computing
• 1 km resolution targeted for an accurate cloud-resolving model
• Difficult to scale existing systems:
  • HPC design using commodity processors estimated to draw 179 MW
  • BlueGene design estimated to draw 20 MW
• By leveraging embedded cores and a more application-specific design, a power envelope of 3-5 MW is projected
Icosahedral (Randall / CSU)
LBNL will seek an external vendor to build the machine if our approach is proven valid; LBNL is not entering the commercial HPC market.
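The power estimates above can be compared directly. The Python sketch below only divides the slide's figures; it assumes all three estimates refer to the same 1 km climate workload, and the function name is hypothetical.

```python
# Estimated power draw from the slide, in megawatts.
COMMODITY_MW = 179.0        # commodity-processor HPC design
BLUEGENE_MW = 20.0          # BlueGene design
EMBEDDED_MW = (3.0, 5.0)    # projected envelope for the embedded-core design


def reduction_range(baseline_mw, target_range_mw):
    """Return (smallest, largest) power reduction factor vs. a baseline."""
    low_mw, high_mw = target_range_mw
    return baseline_mw / high_mw, baseline_mw / low_mw


for name, mw in (("commodity", COMMODITY_MW), ("BlueGene", BLUEGENE_MW)):
    worst, best = reduction_range(mw, EMBEDDED_MW)
    print(f"{name}: {worst:.1f}x to {best:.1f}x less power")
# commodity: 35.8x to 59.7x less power
# BlueGene: 4.0x to 6.7x less power
```

On these estimates the projected design draws roughly 36x to 60x less power than the commodity approach and 4x to 7x less than BlueGene.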