All the chips outside… and around the PC what new platforms? Apps? Challenges, what’s interesting, and what needs doing

Gordon Bell, Bay Area Research Center, Microsoft Corporation

Architecture changes when everyone and everything is mobile! Power, security, RF, WWW, display, data-types e.g. video & voice… it’s the application of architecture!

The architecture problem • The apps • Data-types: video, voice, RF, etc. • Environment: power, speed, cost • The material: clock, transistors… • Performance… it’s about parallelism • Program & programming environment • Network e.g. WWW and Grid • Clusters • Multiprocessors • Storage, cluster, and network interconnect • Processor and special processing • Multi-threading and multiple processor per chip • Instruction Level Parallelism vs • Vector processors

IP On Everything

poochi

Sony Playstation export limits

Non-PC devices and Internet. PC At An Inflection Point? It needs to continue to be upward. These scalable systems provide the highest technical (Flops) and commercial (TPC) performance. They drive microprocessor competition!

The Dawn Of The PC-Plus Era, Not The Post-PC Era… devices aggregate via PCs!!! Consumer PCs; Mobile Companions; TV/AV; Household Management; Communications; Automation & Security

PC will prevail for the next decade as a dominant platform … 2nd to smart, mobile devices • Moore’s Law increases performance; and alternatively reduces prices • PC server clusters with low cost OS beat proprietary switches, smPs, and DSMs • Home entertainment & control … • Very large disks (1TB by 2005) to “store everything” • Screens to enhance use • Mobile devices, etc. dominate WWW >2003! • Voice and video become important apps! C = Commercial; C’ = Consumer

Where’s the action? Problems? • Constraints: Speech, video, mobility, RF, GPS, security… Moore’s Law, including network speed • Scalability and high performance processing • Building them: Clusters vs DSM • Structure: where’s the processing, memory, and switches (disk and ip/tcp processing) • Micros: getting the most from the nodes • Not ISAs: Change can delay Moore’s Law effect … and wipe out software investment! Please, please, just interpret my object code! • System on a chip alternatives… apps drive • Data-types (e.g. video, voice, RF), performance, portability/power, and cost

High Performance Computing A 60+ year view

High performance architecture/program timeline, 1950 to 2000. Technology: Vtubes, Trans., MSI (mini), Micro, RISC, nMicr. Sequential programming (single execution stream) throughout. SIMD, Vector. Parallelization. Parallel programs aka Cluster Computing: multicomputers, MPP era, ultracomputers (10X in size & price!; 10x MPP; “in situ” resources, 100x in //sm), NOW, VLSCC, geographically dispersed Grid.

Computer types -------- Connectivity-------- WAN/LAN SAN DSM SM Networked Supers… GRID VPPuni NEC mP NEC super Cray X…T (all mPv) Clusters micros vector Legion Condor Beowulf NT clusters T3E SP2(mP) NOW SGI DSM clusters & SGI DSM Mainframes Multis WSs PCs

Technical computer types WAN/LAN SAN DSM SM Old World (one program stream) New world: Clustered Computing (multiple program streams) Networked Supers… GRID VPPuni NEC mP T series NEC super Cray X…T (all mPv) micros vector Legion Condor Beowulf SP2(mP) NOW SGI DSM clusters & SGI DSM Mainframes Multis WSs PCs

Dead Supercomputer Society

ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research, Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, Denelcor, Elxsi, ETA Systems, Evans and Sutherland Computer, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories, MasPar, Meiko, Multiflow, Myrias, Numerix, Prisma, Tera, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Vitesse Electronics

SCI Research c1985-1995 • 35 university and corporate R&D projects • 2 or 3 successes… • All the rest failed to work or be successful

How to build scalables? To cluster or not to cluster… don’t we need a single, shared memory?

Application Taxonomy. Technical: General purpose, non-parallelizable codes (PCs have it!); Vectorizable; Vectorizable & //able (Supers & small DSMs); Hand tuned, one-of; MPP coarse grain; MPP embarrassingly // (Clusters of PCs...). Commercial: Database; Database/TP; Web Host; Stream Audio/Video. If central control & rich then IBM or large SMPs else PC Clusters

SNAP … c1995: Scalable Network And Platforms. A View of Computing in 2000+. We all missed the impact of WWW! Gordon Bell, Jim Gray

SNAP: computing built entirely from PCs. Legacy mainframe & minicomputer servers & terminals; Portables; Wide-area global network; Mobile Nets; Wide & Local Area Networks for: terminal, PC, workstation, & servers; Person servers (PCs); scalable computers built from PCs; Centralized & departmental uni- & mP servers (UNIX & NT); Centralized & departmental servers built from PCs; ??? TC=TV+PC home ... (CATV or ATM or satellite). A space, time (bandwidth), & generation scalable environment

*IBM Bell Prize and Future Peak Tflops (t) Petaflops study target NEC CM2 XMP NCube

Top 10 tpc-c. Top two Compaq systems are: 1.1 & 1.5X faster than IBM SPs; 1/3 price of IBM; 1/5 price of SUN

Courtesy of Dr. Thomas Sterling, Caltech

Five Scalabilities
Size scalable -- designed from a few components, with no bottlenecks
Generation scaling -- no rewrite/recompile or user effort to run across generations of an architecture
Reliability scaling -- choose any level
Geographic scaling -- compute anywhere (e.g. multiple sites or in situ workstation sites)
Problem x machine scalability -- ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer. Problem x machine space => run time: problem scale, machine scale (#p), and run time together imply speedup and efficiency.
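The speedup and efficiency mentioned under problem x machine scalability can be sketched with the usual textbook definitions, S(p) = T(1)/T(p) and E(p) = S(p)/p. This is a minimal illustration only: the function names and the run times below are hypothetical and do not come from the talk.

```python
def speedup(t1: float, tp: float) -> float:
    """Speedup S(p) = T(1) / T(p) for one fixed problem size."""
    return t1 / tp


def efficiency(t1: float, tp: float, p: int) -> float:
    """Efficiency E(p) = S(p) / p, the fraction of ideal linear speedup."""
    return speedup(t1, tp) / p


# Hypothetical run times in seconds, keyed by processor count (#p).
run_times = {1: 100.0, 4: 30.0, 16: 10.0}

for p, tp in run_times.items():
    s = speedup(run_times[1], tp)
    e = efficiency(run_times[1], tp, p)
    print(f"p={p:2d}  speedup={s:5.2f}  efficiency={e:4.2f}")
```

Repeating the measurement for several problem sizes fills in the problem x machine space: a code is scalable on a given machine where efficiency stays acceptably high as both grow.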

Why I gave up on large smPs & DSMs • Economics: Perf/Cost is lower…unless a commodity • Economics: Longer design time & life. Complex. => Poorer tech tracking & end of life performance. • Economics: Higher, uncompetitive costs for processor & switching. Sole sourcing of the complete system. • DSMs … NUMA! Latency matters. Compiler, run-time, O/S locate the programs anyway. • Aren’t scalable. Reliability requires clusters. Start there. • They aren’t needed for most apps… hence, a small market unless one can find a way to lock in a user base. Important as in the case of IBM Token Rings vs Ethernet.

FVCORE Performance: Finite Volume Community Climate Model, joint code development by NASA, LLNL and NCAR (figure; systems plotted: SX-5, SX-4, Max C90-16, Max T3E)

Architectural Contrasts – Vector vs Microprocessor
Vector System: CPU, vector registers (8 KBytes), memory
Microprocessor System: CPU, 1st & 2nd Lvl caches (8 MBytes), memory
Figure labels, in the order given: 500 MHz, 600 MHz; Two results per clock, Two results per clock (will be 4 in next Gen SGI); Vector lengths arbitrary, Vector lengths fixed; Vectors fed at low speed, Vectors fed at high speed
Cache based systems are nothing more than “vector” processors with a highly programmable “vector” register set (the caches). These caches are 1000x larger than the vector registers on a Cray vector system, and provide the opportunity to execute vector work at a very high sustained rate. In particular, note 512 CPU Origins contain 4 GBytes of cache. This is larger than most problems of interest, and offers a tremendous opportunity for high performance across a large number of CPUs. This has been borne out in fact at NASA Ames.
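The two size claims on this slide (caches about 1000x larger than vector registers, and 4 GBytes of cache across 512 CPUs) follow directly from the per-CPU figures. A quick check, assuming binary units (1 KByte = 1024 bytes); the variable names are illustrative:

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

vector_register_bytes = 8 * KB  # per CPU on the vector system (from the slide)
cache_bytes = 8 * MB            # 1st & 2nd level caches per CPU (from the slide)
cpus = 512                      # Origin configuration quoted on the slide

# How much larger the "programmable vector register set" (the cache) is.
cache_to_register_ratio = cache_bytes // vector_register_bytes

# Total cache across the whole machine, in GBytes.
aggregate_cache_gb = cpus * cache_bytes / GB

print(cache_to_register_ratio)  # 1024, i.e. roughly 1000x
print(aggregate_cache_gb)       # 4.0
```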

Convergence to one architecture mPs continue to be the main line

“Jim, what are the architectural challenges … for clusters?” • WANs (and even LANs) faster than backplanes at 40 Gbps • End of busses (fc=100 MBps)… except on a chip • What are the building blocks or combinations of processing, memory, & storage? • Infiniband http://www.infinibandta.org starts at OC48, but it may not go far or fast enough if it ever exists. OC192 is being deployed.

What is the basic structure of these scalable systems? • Overall • Disk connection, especially wrt fiber channel • SAN, especially with fast WANs & LANs

Modern scalable switches … also hide a supercomputer • Scale from <1 to 120 Tbps of switch capacity • 1 Gbps ethernet switches scale to 10s of Gbps • SP2 scales from 1.2 Gbps

SNAP Architecture

ISTORE Hardware Vision
• System-on-a-chip enables computer and memory without significantly increasing size of disk
• 5-7 year target:
• MicroDrive: 1.7” x 1.4” x 0.2”
• 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
• 2006: 9 GB, 50 MB/s ? (1.6X/yr capacity, 1.4X/yr BW)
• Integrated IRAM processor
• 2x height
• Connected via crossbar switch
• growing like Moore’s law
• 16 Mbytes; 1.6 Gflops; 6.4 Gops
• 10,000+ nodes in one rack!
• 100/board = 1 TB; 0.16 Tflops
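The 2006 MicroDrive figures are simple compound-growth projections from the 1999 numbers, and the per-board totals are 100 nodes multiplied out. A minimal sketch that reproduces the arithmetic; the helper name `project` is hypothetical, and decimal units (1 GB = 1000 MB) are assumed:

```python
def project(value: float, growth_per_year: float, years: int) -> float:
    """Compound a starting value forward by a fixed yearly growth factor."""
    return value * growth_per_year ** years


years = 2006 - 1999  # 7 years

capacity_mb = project(340, 1.6, years)     # 1999: 340 MB, growing 1.6X/yr
bandwidth_mb_s = project(5, 1.4, years)    # 1999: 5 MB/s, growing 1.4X/yr

nodes_per_board = 100
board_capacity_tb = nodes_per_board * 9 / 1000   # 9 GB per node, in TB
board_tflops = nodes_per_board * 1.6 / 1000      # 1.6 Gflops per node, in Tflops

print(f"2006 capacity  ~ {capacity_mb / 1000:.1f} GB")
print(f"2006 bandwidth ~ {bandwidth_mb_s:.0f} MB/s")
print(f"board: {board_capacity_tb:.1f} TB, {board_tflops:.2f} Tflops")
```

The projections land close to the slide's rounded 9 GB and 50 MB/s, and 100 nodes give roughly 1 TB and 0.16 Tflops per board.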

The Disk Farm? or a System On a Card? The 500 GB disc card (14"): an array of discs. Can be used as 100 discs, 1 striped disc, 50 FT discs, ....etc. LOTS of accesses/second and bandwidth. A few disks are replaced by 10s of Gbytes of RAM and a processor to run Apps!!

Map of Gray Bell Prize results Redmond/Seattle, WA single-thread single-stream tcp/ip via 7 hops desktop-to-desktop …Win 2K out of the box performance* New York Arlington, VA San Francisco, CA 5626 km 10 hops

Ubiquitous 10 GBps SANs in 5 years • 1 Gbps Ethernet is a reality now. • Also FiberChannel, MyriNet, GigaNet, ServerNet, ATM,… • 10 Gbps x4 WDM deployed now (OC192) • 3 Tbps WDM working in lab • In 5 years, expect 10x, wow!! (Chart labels: 1 GBps, 120 MBps (1 Gbps), 80 MBps, 40 MBps, 20 MBps, 5 MBps)

The Promise of SAN/VIA: 10x in 2 years http://www.ViArch.org/ • Yesterday: • 10 MBps (100 Mbps Ethernet) • ~20 MBps tcp/ip saturates 2 cpus • round-trip latency ~250 µs • Now • Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet,… • Fast user-level communication • tcp/ip ~ 100 MBps 10% cpu • round-trip latency is 15 µs • 1.6 Gbps demoed on a WAN
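The "yesterday" and "now" figures can be compared directly. The sketch below computes the gains implied by the slide's numbers. Two points are assumptions rather than statements from the talk: "10% cpu" is read as 10% of one CPU, and the raw bit-to-byte conversion ignores protocol overhead (which is why 100 Mbps comes out at 12.5 MBps raw against the slide's rounded 10 MBps).

```python
def mbps_to_MBps(mbps: float) -> float:
    """Raw line rate in megabytes/s: 8 bits per byte, no protocol overhead."""
    return mbps / 8


# Figures from the slide; "cpus" is the CPU capacity consumed by tcp/ip.
yesterday = {"tcp_MBps": 20.0, "cpus": 2.0, "rtt_us": 250.0}
now = {"tcp_MBps": 100.0, "cpus": 0.10, "rtt_us": 15.0}  # assumes 10% of ONE cpu

throughput_gain = now["tcp_MBps"] / yesterday["tcp_MBps"]
latency_gain = yesterday["rtt_us"] / now["rtt_us"]

# CPUs consumed per MBps moved, then and now.
cpu_cost_then = yesterday["cpus"] / yesterday["tcp_MBps"]
cpu_cost_now = now["cpus"] / now["tcp_MBps"]
cpu_cost_gain = cpu_cost_then / cpu_cost_now

print(f"raw 100 Mbps = {mbps_to_MBps(100)} MBps, raw 1 Gbps = {mbps_to_MBps(1000)} MBps")
print(f"tcp/ip throughput gain: {throughput_gain:.1f}x")
print(f"round-trip latency gain: {latency_gain:.1f}x")
print(f"cpu cost per MBps gain: {cpu_cost_gain:.0f}x")
```

Under these assumptions the larger change is not the wire speed but the CPU cost per byte moved, which is what fast user-level communication targets.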

Processor improvements… 90% of ISCA’s focus

We get more of everything

Mainframes, minis, micros, and risc

Computer ops/sec x word length / $

Growth of microprocessor performance (figure): performance in Mflop/s, log scale from 0.01 to 10000, vs. year, 1980 to 1998. Supers: Cray 1S, Cray X-MP, Cray 2, Cray Y-MP, Cray C90, Cray T90. Micros: 8087, 80287, 6881, 80387, R2000, i860, RS6000/540, RS6000/590, Alpha.

Albert Yu predictions ‘96
When             2000    2006     Growth
Clock (MHz)      900     4000     4.4x
MTransistors     40      350      8.75x
Mops             2400    20,000   8.3x
Die (sq. in.)    1.1     1.4      1.3x

Processor Limit: DRAM Gap (figure: performance, log scale from 1 to 1000, vs. year, 1980 to 2000). CPU (“Moore’s Law”): µProc 60%/yr. DRAM: 7%/yr. Processor-Memory Performance Gap: grows 50% / year. • Alpha 21264 full cache miss / instructions executed: 180 ns/1.7 ns = 108 clks x 4 or 432 instructions • Caches in Pentium Pro: 64% area, 88% transistors • *Taken from Patterson-Keeton Talk to SigMod
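Both numbers on this slide can be reproduced from the quoted rates. The sketch below makes two assumptions that are not stated on the slide: the 1.7 ns clock period is treated as a rounded 600 MHz clock, and "x 4" is read as four instructions issued per clock.

```python
cpu_growth = 1.60   # µProc performance grows 60%/yr (from the slide)
dram_growth = 1.07  # DRAM performance grows 7%/yr (from the slide)

# The gap is the ratio of the two curves, so it compounds at their quotient.
gap_growth = cpu_growth / dram_growth
print(f"gap grows about {100 * (gap_growth - 1):.0f}% per year")

miss_ns = 180       # full cache miss latency (from the slide)
clock_mhz = 600     # ASSUMPTION: 1.7 ns on the slide is a rounded 600 MHz period
issue_width = 4     # ASSUMPTION: "x 4" means 4 instructions issued per clock

miss_clocks = miss_ns * clock_mhz / 1000         # ns * cycles-per-ns
lost_instruction_slots = miss_clocks * issue_width

print(f"one miss costs {miss_clocks:.0f} clocks, "
      f"or {lost_instruction_slots:.0f} instruction slots")
```

With the rounded 1.7 ns period the division gives about 106 clocks; the slide's 108 matches an exact 600 MHz clock.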

The “memory gap” • Multiple e.g. 4 processors/chip in order to increase the ops/chip while waiting for the inevitable access delays • Or alternatively, multi-threading (MTA) • Vector processors with a supporting memory system • System-on-a-chip… to reduce chip boundary crossings

If system-on-a-chip is the answer, what is the problem? • Small, high volume products • Phones, PDAs, • Toys & games (to sell batteries) • Cars • Home appliances • TV & video • Communication infrastructure • Plain old computers… and portables

SOC Alternatives… not including C/C++ CAD Tools • The blank sheet of paper: FPGA • Auto design of a basic system: Tensilica • Standardized, committee designed components*, cells, and custom IP • Standard components including more application specific processors*, IP add-ons and custom • One chip does it all: SMOP. *Processors, Memory, Communication & Memory Links
