Architecture Options and Challenges for Embedded Systems in HPEC 2001

Options for embedded systems. Constraints, challenges, and approaches • HPEC 2001, Lincoln Laboratory, 25 September 2001 • Gordon Bell, Bay Area Research Center, Microsoft Corporation

More architecture options: Applications, COTS (clusters, computers… chips), Custom Chips…

The architecture challenge: “One person’s system is another’s component.” (Alan Perlis) • Kurzweil: predicted hardware will be compiled and be as easy to change as software by 2010 • COTS: streaming, Beowulf, and www relevance? • Architecture Hierarchy: • Application • Scalable components forming the system • Design and test • Chips: the raw materials • Scalability: fewest, replicable components • Modularity: finding reusable components

The architecture levels & options • The apps • Data-types: “signals”, “packets”, video, voice, RF, etc. • Environment: parallelism, power, power, power, speed, … cost • The material: clock, transistors… • Performance… it’s about parallelism • Program & programming environment • Network e.g. WWW and Grid • Clusters • Storage, cluster, and network interconnect • Multiprocessors • Processor and special processing • Multi-threading and multiple processors per chip • Instruction Level Parallelism vs Vector processors

Sony Playstation export limits: a problem X-Box would like to have, … but have solved.

Will the PC prevail for the next decade as a/the dominant platform? … or 2nd to smart, mobile devices? • Moore’s Law: increases performance; Bell’s Corollary reduces prices for new classes • PC server clusters aka Beowulf with low cost OS kill proprietary switches, smPs, and DSMs • Home entertainment & control … • Very large disks (1TB by 2005) to “store everything” • Screens to enhance use • Mobile devices, etc. dominate WWW >2003! • Voice and video become the important apps! • C = Commercial; C’ = Consumer

Where’s the action? Problems? • Constraints from the application: Speech, video, mobility, RF, GPS, security… • Moore’s Law, networking, interconnects • Scalability and high performance processing • Building them: Clusters vs DSM • Structure: where’s the processing, memory, and switches (disk and ip/tcp processing) • Micros: getting the most from the nodes • Not ISAs: change can delay the Moore’s Law effect … and wipe out software investment! Please, please, just interpret my object code! • System (on a chip) alternatives… apps drivers • Data-types (e.g. video, voice, RF) performance, portability/power, and cost

COTS: Anything at the system structure level to use? • How are the system components e.g. computers, etc. going to be interconnected? • What are the components? Linux • What is the programming model? • Is a plane, CCC, tank, fleet, ship, etc. an Internet? • Beowulfs… the next COTS • What happened to Ada? Visual Basic? Java?

Computing SNAP built entirely from PCs [Figure: diagram labels. Legacy mainframe & minicomputer servers & terminals; centralized & departmental uni- & mP servers (UNIX & NT); centralized & departmental servers built from PCs; person servers (PCs); scalable computers built from PCs; portables; mobile nets; wide & local area networks for terminals, PCs, workstations, & servers; wide-area global network; TC = TV + PC in the home (CATV or ATM or satellite).] A space, time (bandwidth), & generation scalable environment

Five Scalabilities • Size scalable: designed from a few components, with no bottlenecks • Generation scaling: no rewrite/recompile or user effort to run across generations of an architecture • Reliability scaling: choose any level • Geographic scaling: compute anywhere (e.g. multiple sites or in situ workstation sites) • Problem x machine scalability: ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer • Problem x machine space => run time: problem scale, machine scale (#p), and run time together imply speedup and efficiency
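
The problem x machine relation above can be made concrete with the standard definitions of speedup, T(1)/T(p), and efficiency, speedup/p. The sketch below is illustrative only: the function names and the sample run times are assumptions, not measurements from the talk.

```python
def speedup(t1: float, tp: float) -> float:
    """Speedup of a p-processor run over the 1-processor run."""
    if t1 <= 0 or tp <= 0:
        raise ValueError("run times must be positive")
    return t1 / tp


def efficiency(t1: float, tp: float, p: int) -> float:
    """Fraction of ideal (linear) speedup achieved on p processors."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return speedup(t1, tp) / p


# Hypothetical run times (seconds) for one fixed problem size,
# keyed by machine scale #p.
runs = {1: 100.0, 4: 30.0, 16: 10.0}
for p, tp in runs.items():
    s = speedup(runs[1], tp)
    e = efficiency(runs[1], tp, p)
    print(f"p={p:>2}  speedup={s:.2f}  efficiency={e:.3f}")
```

Repeating the measurement at several problem sizes fills in the "problem x machine space" the slide refers to: a program is scalable on a machine when efficiency stays high as both grow together.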

Why I gave up on large smPs & DSMs • Economics: Perf/Cost is lower…unless a commodity • Economics: Longer design time & life. Complex. => Poorer tech tracking & end of life performance. • Economics: Higher, uncompetitive costs for processor & switching. Sole sourcing of the complete system. • DSMs … NUMA! Latency matters. Compiler, run-time, O/S locate the programs anyway. • Aren’t scalable. Reliability requires clusters. Start there. • They aren’t needed for most apps… hence, a small market unless one can find a way to lock in a user base. Important as in the case of IBM Token Rings vs Ethernet.

What is the basic structure of these scalable systems? • Overall • Disk connection, especially wrt fiber channel • SAN, especially with fast WANs & LANs

SNAP Architecture

ISTORE Hardware Vision • System-on-a-chip enables computer, memory, without significantly increasing size of disk • 5-7 year target: • MicroDrive: 1.7” x 1.4” x 0.2”; 2006: ? • 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek • 2006: 9 GB, 50 MB/s ? (1.6X/yr capacity, 1.4X/yr BW) • Integrated IRAM processor • 2x height • Connected via crossbar switch • growing like Moore’s law • 16 Mbytes; 1.6 Gflops; 6.4 Gops • 10,000+ nodes in one rack! • 100/board = 1 TB; 0.16 Tflops
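
The 2006 figures on this slide follow from compounding the 1999 MicroDrive numbers at the quoted yearly rates. A minimal sketch of that arithmetic; the helper name `project` is hypothetical, and the board totals assume one 9 GB drive and one 1.6 Gflops IRAM processor per node, which the slide implies but does not state outright.

```python
def project(value: float, rate_per_year: float, years: int) -> float:
    """Compound a starting value at a fixed yearly growth factor."""
    return value * rate_per_year ** years


YEARS = 2006 - 1999  # 7 years between the two MicroDrive data points

capacity_mb = project(340, 1.6, YEARS)   # 340 MB growing 1.6x/yr
bandwidth_mbs = project(5, 1.4, YEARS)   # 5 MB/s growing 1.4x/yr

# Board-level totals under the one-drive, one-processor-per-node assumption.
NODES_PER_BOARD = 100
board_tb = NODES_PER_BOARD * 9 / 1000        # 9 GB per node
board_tflops = NODES_PER_BOARD * 1.6 / 1000  # 1.6 Gflops per node

print(f"2006 capacity  ~ {capacity_mb / 1000:.1f} GB")
print(f"2006 bandwidth ~ {bandwidth_mbs:.1f} MB/s")
print(f"board: {board_tb:.1f} TB, {board_tflops:.2f} Tflops")
```

The projections land close to the slide's rounded 9 GB and 50 MB/s, and 100 nodes give the quoted 0.16 Tflops and roughly 1 TB per board.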

The Disk Farm? or a System On a Card? • The 500GB disc card (14"): an array of discs • Can be used as 100 discs, 1 striped disc, 50 FT discs, etc. • LOTS of accesses/second and bandwidth • A few disks are replaced by 10s of Gbytes of RAM and a processor to run apps!

The Promise of SAN/VIA/Infiniband http://www.ViArch.org/ • Yesterday: • 10 MBps (100 Mbps Ethernet) • ~20 MBps tcp/ip saturates 2 cpus • round-trip latency ~250 µs • Now: • Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet,… • Fast user-level communication • tcp/ip ~100 MBps at 10% cpu • round-trip latency is 15 µs • 1.6 Gbps demoed on a WAN
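
To see what the latency and bandwidth figures mean for applications, a simple cost model helps: one exchange costs a round trip plus the time to stream the payload. The sketch below plugs in the slide's "yesterday" and "now" numbers; the model itself, the `Link` class, and the message sizes are illustrative assumptions rather than anything measured in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    name: str
    round_trip_s: float    # round-trip latency in seconds
    bandwidth_bps: float   # sustained bandwidth in bytes/second

    def exchange_time(self, nbytes: int) -> float:
        """Simple model: one round trip plus time to stream the payload."""
        return self.round_trip_s + nbytes / self.bandwidth_bps

    def half_performance_bytes(self) -> float:
        """Payload size at which latency equals streaming time."""
        return self.round_trip_s * self.bandwidth_bps


yesterday = Link("100 Mbps Ethernet + tcp/ip", 250e-6, 10e6)
now = Link("SAN/VIA-class link", 15e-6, 100e6)

for size in (64, 1500, 1_000_000):
    ratio = yesterday.exchange_time(size) / now.exchange_time(size)
    print(f"{size:>9} bytes: new link is {ratio:.1f}x faster")
```

Under this model small messages gain mostly from the latency drop and large transfers gain only the 10x bandwidth factor, which is why fast user-level communication matters for fine-grained cluster traffic.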

Top500 taxonomy… everything is a cluster aka multicomputer • Clusters are the ONLY scalable structure • Cluster: n, inter-connected computer nodes operating as one system. Nodes: uni- or SMP. Processor types: scalar or vector. • MPP = miscellaneous, not massive (>1000), SIMD or something we couldn’t name • Cluster types. Implied message passing. • Constellations = clusters of >=16 P, SMP • Commodity clusters of uni or <=4 Ps, SMP • DSM: NUMA (and COMA) SMPs and constellations • DMA clusters (direct memory access) vs msg. pass • Uni- and SMP vector clusters: Vector Clusters and Vector Constellations

Courtesy of Dr. Thomas Sterling, Caltech

The Virtuous Economic Cycle drives the PC industry… & Beowulf [Figure: cycle diagram. Labels: Standards; Volume; Competition; Greater availability @ lower cost; Attracts users; Utility/value; Innovation; Creates apps, tools, training; Attracts suppliers; DOJ.]

BEOWULF-CLASS SYSTEMS • Cluster of PCs • Intel x86 • DEC Alpha • Mac Power PC • Pure M2COTS • Unix-like O/S with source • Linux, BSD, Solaris • Message passing programming model • PVM, MPI, BSP, homebrew remedies • Single user environments • Large science and engineering applications

Lessons from Beowulf • An experiment in parallel computing systems • Established vision: low cost high end computing • Demonstrated effectiveness of PC clusters for some (not all) classes of applications • Provided networking software • Provided cluster management tools • Conveyed findings to broad community • Tutorials and the book • Provided design standard to rally community! • Standards beget: books, trained people, software … virtuous cycle that allowed apps to form • Industry begins to form beyond a research project • Courtesy of Thomas Sterling, Caltech

Designs at chip level…any COTS options? • Substantially more programmability versus factory compilation • As systems move onto chips and chip sets become part of larger systems, Electronic Design must move from RTL to algorithms. • Verification and design of “GigaScale systems” will be the challenge.

The Productivity Gap [Figure: log-scale plot, 1981 to 2009, of logic transistors per chip (K) against design productivity (transistors per staff-month), with process generations marked from 2.5µm through .35µm to .10µm.] • Complexity growth rate: 58%/yr compound • Productivity growth rate: 21%/yr compound • Source: SEMATECH
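
The two compound rates are enough to show how quickly the gap opens. A minimal sketch, assuming both rates stay constant; the function names are hypothetical and the year counts are arbitrary examples.

```python
import math

COMPLEXITY_GROWTH = 1.58    # logic transistors per chip, factor per year
PRODUCTIVITY_GROWTH = 1.21  # transistors per staff-month, factor per year


def gap_factor(years: float) -> float:
    """How much faster complexity has grown than productivity after `years`."""
    return (COMPLEXITY_GROWTH / PRODUCTIVITY_GROWTH) ** years


def gap_doubling_time() -> float:
    """Years for the complexity/productivity ratio to double."""
    return math.log(2) / math.log(COMPLEXITY_GROWTH / PRODUCTIVITY_GROWTH)


for years in (1, 5, 10):
    print(f"after {years:>2} years the gap has widened {gap_factor(years):.1f}x")
print(f"gap doubles roughly every {gap_doubling_time():.1f} years")
```

Read as staffing, a widening factor of N means a full-complexity chip needs about N times the staff-months it did at the start, unless design moves to a higher level of abstraction or reuses more.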

What Is GigaScale? • Extremely large gate counts • Chips & chip sets • Systems & multiple-systems • High complexity • Complex data manipulation • Complex dataflow • Intense pressure for correct, 1st time • TTM, cost of failure, etc. impacts ability to have a silicon startup • Multiple languages and abstraction levels • Design, verification, and software

EDA Evolution: chips to systems • 1975 (Calma & CV): IC Designer; physical design • 1985 (Daisy, Mentor): Chip Architect; gates; 10K gates; simulation • 1995 (Synopsys & Cadence): System Architect plus ASIC Designer; RTL; 1M gates; testbench automation, emulation, formal verification • 2005 (e.g. Forte): GigaScale Architect plus SOC Designer; GigaScale; hierarchical verification • Courtesy of Forte Design Systems

If system-on-a-chip is the answer, what is the problem? • Small, high volume products • Phones, PDAs, • Toys & games (to sell batteries) • Cars • Home appliances • TV & video • Communication infrastructure • Plain old computers… and portables • Embeddable computers of all types where performance and/or power are the major constraints.

SOC Alternatives… not including C/C++ CAD Tools • The blank sheet of paper: FPGA • Auto design of a processor: Tensilica • Standardized, committee designed components*, cells, and custom IP • Standard components including more application specific processors*, IP add-ons plus custom • One chip does it all: SMOP • *Processors, Memory, Communication & Memory Links

Platform Exportation Tradeoffs and Reuse Model [Figure: implementation options placed along a programmability axis (low to high): Structured Custom, RTL Flow, FPGA, FPGA & GPP, ASIP, DSP, GPP. Other axis labels: MOPS/mW (low to high); time to develop/iterate new application (high to lower); cost to develop/iterate new application (high to lower). Layer labels: System, Application, Application Implementation, Architecture, Microarchitecture, Silicon Process.]

System-on-a-chip alternatives

Xilinx: 10M gates, 500M transistors, 0.12 micron

Tensilica Approach: Compiled Processor Plus Development Tools • Describe the processor attributes from a browser-like interface • Using the processor generator, create... • Tailored, HDL µP core (ALU, I/O, Timer, Pipe, Cache, Register File, MMU) • Standard cell library targeted to the silicon process • Customized Compiler, Assembler, Linker, Debugger, Simulator • Courtesy of Tensilica, Inc. http://www.tensilica.com • Richard Newton, UC/Berkeley

EEMBC Networking Benchmark • Benchmarks: OSPF, Route Lookup, Packet Flow • Xtensa with no optimization comparable to 64b RISCs • Xtensa with optimization comparable to high-end desktop CPUs • Xtensa has outstanding efficiency (performance per cycle, per watt, per mm²) • Xtensa optimizations: custom instructions for route lookup and packet flow • Colors: Blue-Xtensa, Green-Desktop x86s, Maroon-64b RISCs, Orange-32b RISCs

EEMBC Consumer Benchmark • Benchmarks: JPEG, Grey-scale filter, Color-space conversion • Xtensa with no optimization comparable to 64b RISCs • Xtensa with optimization beats all processors by 6x (no JPEG optimization) • Xtensa has exceptional efficiency (performance per cycle, per watt, per mm²) • Xtensa optimizations: custom instructions for filters, RGB-YIQ, RGB-CMYK • Colors: Blue-Xtensa, Green-Desktop x86s, Maroon-64b RISCs, Orange-32b RISCs

Free 32 bit processor core

Complex SOC architecture Synopsys via Richard Newton, UC/B

UMS Architecture • Memory bandwidth scales with processing • Scalable processing, software, I/O • Each app runs on its own pool of processors • Enables durable, portable intellectual property

Cradle UMS Design Goals • Minimize design time for applications • Efficient programming model • High reusability accelerates derivative development • Cost/Performance • Replace ASICs, FPGAs, ASSPs, and DSPs • Low power for battery powered appliances • Flexibility • Cost effective solution to address fragmenting markets • Faster return on R&D investments

Universal Microsystem (UMS) [Figure: block diagram. Quad 1, Quad 2, Quad 3 … Quad “n” and an I/O Quad on a Global Bus, with SDRAM control and a PLA Ring.] • Each Quad has 4 RISCs, 8 DSPs, and Memory • Unique I/O subsystem keeps interfaces soft

The Universal Micro System (UMS): An off the shelf “Platform” for Product Line Solutions • Superior Digital Signal Processing (Single Clock FP-MAC) • Local Memory that scales with additional processors • Scalable real time functions in software using small fast processors (QUAD) • Intelligent I/O Subsystem (Change Interfaces without changing chips) • 250 MFLOPS/mm²

VPN Enterprise Gateway
Five-quad configuration: • Five quads; two 10/100 Ethernet ports at wire speed; one T1/E1/J1 interface • Handles 250 end users and 100 routes • Does key handling for IPSec • Delivers 100Mbps of 3DES • Firewall • IP Telephony • O/S for user interactions
Single-quad configuration: • Single quad; two 10/100 Ethernet ports at wire speed; one T1/E1/J1 interface • Handles 250 end users and 100 routes • Does key handling for IPSec • Delivers 50Mbps of 3DES

UMS Application Performance

Table 2: Performance of Kernels on UMS

Application                        MSPs    Comments
MPEG Video Decode                  4       720x480, 9 Mbits/sec
                                   6       720x480, 15 Mbits/sec
MPEG Video Encode                  10-16   32²/128² search area
AC3 Audio Decode                   1
Modems                             0.5     V90
                                   3       G.Lite
                                   4       ADSL
Ethernet Router (Level 3 + QOS)    0.5     Per 100Mb channel
                                   4       Per Gigabit channel
Encryption                         1       3DES 15Mb/s
                                   1       MD5 425Mb/s
3D geom, lite, render              4       1.6M Polygons/sec
DV Encode/Decode                   8       Camcorder

• Architecture permits scalable software • Supports two Gigabit Ethernets at wire speed; four fast Ethernets; four T-1s, USB, PCI, 1394, etc. • MSP is a logical unit of one PE and two DSEs
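
Because each kernel is costed in MSPs, sizing a part for a product becomes simple addition. The sketch below sums MSP costs for a workload and checks them against a quad count. It assumes that a quad's 4 RISCs and 8 DSPs correspond to 4 MSPs (one PE plus two DSEs each), an inference from the slides rather than a stated figure; the dictionary keys, function names, and the example workload are hypothetical.

```python
# MSP cost per kernel, copied from a subset of Table 2.
MSP_COST = {
    "mpeg_decode_9mbps": 4,
    "mpeg_decode_15mbps": 6,
    "ac3_audio_decode": 1,
    "modem_v90": 0.5,
    "modem_adsl": 4,
    "router_100mb_channel": 0.5,
    "router_gigabit_channel": 4,
    "3des_15mbps": 1,
    "dv_codec": 8,
}

# Assumption: 4 RISCs + 8 DSPs per quad map to 4 MSPs (1 PE + 2 DSEs each).
MSPS_PER_QUAD = 4


def msps_needed(workload: dict) -> float:
    """Total MSPs for a workload given as {kernel name: instance count}."""
    return sum(MSP_COST[kernel] * count for kernel, count in workload.items())


def fits(workload: dict, quads: int) -> bool:
    """True if the workload fits within the MSPs of `quads` quads."""
    return msps_needed(workload) <= quads * MSPS_PER_QUAD


# Hypothetical set-top box: one video stream, audio, and a V90 modem.
set_top_box = {"mpeg_decode_9mbps": 1, "ac3_audio_decode": 1, "modem_v90": 1}
print(msps_needed(set_top_box), "MSPs; fits in 2 quads:", fits(set_top_box, 2))
```

This ignores memory bandwidth and I/O contention, so it gives a lower bound on the part size rather than a guarantee.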

Cradle: Universal Microsystem, trading Verilog & hardware for C/C++ • UMS : VLSI = microprocessor : special systems • Software : Hardware • Single part for all apps • App spec’d @ run time using FPGA & ROM • 5 quad mPs at 3 Gflops/quad = 15 Gflops • Single shared memory space, caches • Programmable periphery including: 1 GB/s; 2.5 Gips; PCI, 100 baseT, firewire • $4 per flops; 150 mW/Gflops
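
The peak-performance and power figures on this slide combine by straightforward multiplication. A small sketch, assuming peak throughput scales linearly with quad count and that the 150 mW/Gflops figure applies at peak; the function names are hypothetical.

```python
GFLOPS_PER_QUAD = 3.0   # from the slide
MW_PER_GFLOPS = 150.0   # from the slide


def peak_gflops(quads: int) -> float:
    """Peak throughput, assuming linear scaling across quads."""
    return quads * GFLOPS_PER_QUAD


def peak_power_watts(quads: int) -> float:
    """Power at peak throughput, assuming the mW/Gflops figure holds at peak."""
    return peak_gflops(quads) * MW_PER_GFLOPS / 1000.0


for quads in (1, 5):
    print(f"{quads} quad(s): {peak_gflops(quads):.0f} Gflops, "
          f"{peak_power_watts(quads):.2f} W")
```

For the five-quad part this gives the 15 Gflops quoted on the slide, at a little over 2 W under the stated assumptions.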

Silicon Landscape 200x • Increasing cost of fabrication and mask • $7M for high-end ASSP chip design • Over $650K for masks alone and rising • SOC/ASIC companies require $7-10M business guarantee • Physical effects (parasitics, reliability issues, power management) are more significant design issues • These must now be considered explicitly at the circuit level • Design complexity and “context complexity” are sufficiently high that design verification is a major limitation on time-to-market • Fewer design starts, higher-design volume… implies more programmable platforms • Richard Newton, UC/Berkeley

The End

General-Purpose Computing vs Platform-Based Design [Figure: two layered stacks. General-purpose computing: Application(s) above an Instruction Set Architecture (SPARC, 360, 3000, …) above a “Physical Implementation”. Platform-based design: Application(s) above a Platform (Microarchitecture & Software) above Synthesizable RTL (Verilog, VHDL, …) above a Physical Implementation (ASIC, FPGA).]

The Energy-Flexibility Gap [Figure: energy efficiency in MOPS/mW (or MIPS/mW), log scale from 0.1 to 1000, plotted against flexibility (coverage).] • Dedicated HW (MUD): 100-200 MOPS/mW • Reconfigurable processor/logic (Pleiades): 10-50 MOPS/mW • ASIPs, DSPs (1 V DSP): 3 MOPS/mW • Embedded µProcessors (LPArm): 0.5-2 MIPS/mW • Source: Prof. Jan Rabaey, UC Berkeley

Approaches to Reuse • SOC as the Assembly of Components? • Alberto Sangiovanni-Vincentelli • SOC as a Programmable Platform? • Kurt Keutzer

Component-Based Programmable Platform Approach • Application-Specific Programmable Platforms (ASPP) • These platforms will be highly-programmable • They will implement highly-concurrent functionality • Assemble components from parameterized library • Integrate using programmable approach to on-chip communication • Intermediate language that exposes programmability of all aspects of the microarchitecture (“assembly language for processor”) • Richard Newton, UC/Berkeley

Compact Synthesized Processor, Including Software Development Environment • Use virtually any standard cell library with commercial memory generators • Base implementation is less than 25K gates (~1.0 mm² in 0.25µm CMOS) • Power dissipation in 0.25µm standard cell is less than 0.5 mW/MHz • [Figure: core drawn to scale on a typical $10 IC (3-6% of 60 mm²)] • Courtesy of Tensilica, Inc. http://www.tensilica.com
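
The area and power figures above translate directly into a budget for dropping such a core onto a larger chip. A minimal sketch of that arithmetic; the 200 MHz clock is an arbitrary example, not a figure from the slide, and the function names are hypothetical.

```python
DIE_AREA_MM2 = 60.0    # the "typical $10 IC" on the slide
MAX_MW_PER_MHZ = 0.5   # upper bound quoted for 0.25 µm standard cell


def core_area_range_mm2(die_mm2: float = DIE_AREA_MM2,
                        lo_frac: float = 0.03,
                        hi_frac: float = 0.06) -> tuple:
    """Area taken by the core when it occupies 3-6% of the die."""
    return die_mm2 * lo_frac, die_mm2 * hi_frac


def max_power_mw(clock_mhz: float) -> float:
    """Upper bound on core power at a given clock, from the mW/MHz figure."""
    return MAX_MW_PER_MHZ * clock_mhz


lo, hi = core_area_range_mm2()
print(f"core area: {lo:.1f} to {hi:.1f} mm^2 of a {DIE_AREA_MM2:.0f} mm^2 die")
print(f"power at a hypothetical 200 MHz clock: under {max_power_mw(200):.0f} mW")
```

The 3-6% range is larger than the ~1.0 mm² base implementation, which is consistent with the range covering a configured core with options and memories rather than the bare base core, though the slide does not say so explicitly.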
