The CRAY-1 Computer System Richard M. Russell Presented by Andrew Waterman ECE259 Spring 2008
Background • CRAY-1 by no means the first vector machine • 1960s: Westinghouse Solomon/ILLIAC IV • 1974: CDC STAR-100 • “I never, ever want to be a pioneer” --Cray • STAR-100, ILLIAC IV: who's this Amdahl dude? • 1972: Cray Research formed after spat with CDC • Seymour Cray wanted to start from scratch on 8600; CDC brass, not so much • 1976: first CRAY-1 deployed at Los Alamos
CRAY-1 Architecture • 5-ton, vector uniprocessor • Word size = 64 bits • 80 MHz clock • 8MB RAM in 16 banks @ 20 MHz • fcpu/fmem = 4 (!!) • Fairly RISCy 16- or 32-bit instructions • Load/store; register-register operations
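The fcpu/fmem = 4 ratio is why the memory is split into 16 banks: a bank that has just been accessed stays busy for four CPU cycles. A minimal sketch of how access stride interacts with that, assuming simple low-order interleaving (bank = word address mod 16) and one request issued per cycle. The function name and the model itself are illustrative assumptions, not taken from the slides.

```python
NUM_BANKS = 16          # from the slide: 16 banks
BANK_BUSY_CYCLES = 4    # from the slide: fcpu/fmem = 4


def cycles_for_stride(stride, n_words=64):
    """Cycles needed to issue n_words accesses at a fixed word stride.

    Assumed model: low-order interleaving, one request per cycle,
    and a request stalls until its bank is free again.
    """
    bank_free_at = [0] * NUM_BANKS
    cycle = 0
    for i in range(n_words):
        bank = (i * stride) % NUM_BANKS
        start = max(cycle, bank_free_at[bank])   # stall if bank still busy
        bank_free_at[bank] = start + BANK_BUSY_CYCLES
        cycle = start + 1                        # next request one cycle later
    return cycle


if __name__ == "__main__":
    for stride in (1, 4, 8, 16):
        print(stride, cycles_for_stride(stride))
```

Under this model a unit-stride stream of 64 words issues in 64 cycles, while a stride of 16 hits the same bank every time and takes 253 cycles.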
Scalar Operation and Octal Annoyance • 10₈ (= 8) A-registers for 24-bit address calculations • 100₈ (= 64) B-registers serve as backing store for A-registers • 10₈ (= 8) S-registers for source/dest of scalar integer/FP insns • T is to S as B is to A • 11₈ (= 9) pipelined scalar FUs • Address add, mult • Integer add, shift, logic, pop count • FP add, mult, reciprocal
Scalar Operation • Protection without virtual memory • Base & limit address regs • Ld $dest,$addr actually loads from $base+$addr • Program killed if $base+$addr >= $limit • A handful of registers for interrupts, exceptions, etc.
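The base-and-limit check described on this slide can be sketched in a few lines. This follows the slide's wording literally (fault when base + addr >= limit). The names `translate`, `load` and `ProtectionFault` are hypothetical, and the real hardware's register granularity is not modelled.

```python
class ProtectionFault(Exception):
    """Raised when a program touches memory outside its region."""


def translate(addr, base, limit):
    """Map a program-relative address to a physical one, or fault."""
    physical = base + addr
    if physical >= limit:            # condition as stated on the slide
        raise ProtectionFault(f"address {addr} out of bounds")
    return physical


def load(memory, addr, base, limit):
    """Ld $dest,$addr: the program sees addr, memory sees base+addr."""
    return memory[translate(addr, base, limit)]


if __name__ == "__main__":
    memory = list(range(100))
    print(load(memory, 5, base=10, limit=20))   # reads physical word 15
```

Every program can be linked as if it starts at address 0, and relocation plus protection costs one add and one compare per access, with no page tables.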
OS and Front End • COS (Cray Operating System) handles job scheduling, storage management (tapes!), other I/O, checkpointing • Packaged with CAL (assembler) • ...and CFT (Fortran compiler), more later • Command-line interface and job submission via separate front-end computer, e.g. VAX
Vector Operation (Finally!) • 8x64-word V-registers • Vector Length Register • Indicates # ops performed by vector insns • Set from contents of an A-register • Vector Mask Register • Indicates which elements in vector to operate on • Set by vector test insns (e.g. VM[i] := ($Vk[i] == 0)) • 6 Vector FUs • integer add, shift, bitwise logic • FP via scalar FPU: add, mult, reciprocal
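A rough software model of how the Vector Length and Vector Mask registers shape an operation, following the slide's description. The function names are made up for illustration, and the masked add is a simplification of "operate only where the mask is set", not a specific CRAY-1 instruction.

```python
MAX_VL = 64  # elements per V-register, from the slide


def vector_test_zero(vk, vl):
    """Model of VM[i] := (Vk[i] == 0) for the first vl elements."""
    return [vk[i] == 0 for i in range(vl)]


def masked_add(vd, va, vb, vl, vm):
    """Add va and vb element-wise for i < vl, only where vm[i] is set.

    Elements that are masked off, or beyond vl, keep their old value.
    """
    assert vl <= MAX_VL
    result = list(vd)
    for i in range(vl):
        if vm[i]:
            result[i] = va[i] + vb[i]
    return result


if __name__ == "__main__":
    vk = [0, 7, 0, 3]
    vm = vector_test_zero(vk, vl=4)
    print(vm)
    print(masked_add([9, 9, 9, 9], [1, 2, 3, 4], [10, 20, 30, 40], 4, vm))
```

Here the mask selects elements 0 and 2, so only those two positions of the destination are overwritten.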
Vector Load/Store Architecture • Big departure from STAR 100: register-register ops • CRAY-1 memory bandwidth == 80Mword/s == 1word/cycle • If all 2-source insns are memory-memory, then IPC=1/3! (and that assumes no bank conflicts!) • Solution: the RISC approach • Combined with chaining (next), can sustain >> 1 flop/cycle
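The IPC = 1/3 figure is plain bandwidth arithmetic: each two-source memory-memory operation moves three words (two loads, one store) through a port that delivers one word per cycle. A small sketch of that calculation, with a hypothetical helper name:

```python
from fractions import Fraction

CLOCK_MHZ = 80               # from the slide
MEM_WORDS_PER_CYCLE = 1      # from the slide: 80 Mword/s at 80 MHz


def results_per_cycle(words_per_cycle, sources, dests=1):
    """Upper bound on results per cycle when every operand is in memory."""
    return Fraction(words_per_cycle, sources + dests)


if __name__ == "__main__":
    ipc = results_per_cycle(MEM_WORDS_PER_CYCLE, sources=2)
    print(ipc)                       # bound on element results per cycle
    print(float(ipc * CLOCK_MHZ))    # same bound in millions of results/s
```

With register-register operations the operands already sit in V-registers, so this bound stops applying to the arithmetic itself.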
Chaining • Pipeline bypass meets vectors • Consider SAXPY vector expression a*X+Y • Slow approach: compute a*X (64 mults), then compute a*X+Y (64 adds) • Total latency: 128+mult latency+add latency • since, in CRAY-1, all FUs are pipelined • But... no fundamental serialization requirement • As soon as a*X[0] is computed, can compute a*X[0]+Y[0] • Total latency: 64+mult latency+add latency (speedup of almost 2)
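The two latency formulas above are easy to compare numerically. The sketch below uses the slide's expressions directly. The 7-cycle multiply and 6-cycle add latencies are assumed values chosen for illustration, and the function names are hypothetical.

```python
def unchained_latency(n, mult_latency, add_latency):
    """All n multiplies issue, then all n adds: 2n + both pipeline latencies."""
    return 2 * n + mult_latency + add_latency


def chained_latency(n, mult_latency, add_latency):
    """Adds start as multiply results emerge: n + both pipeline latencies."""
    return n + mult_latency + add_latency


if __name__ == "__main__":
    n, mult, add = 64, 7, 6          # latencies are illustrative assumptions
    slow = unchained_latency(n, mult, add)
    fast = chained_latency(n, mult, add)
    print(slow, fast, round(slow / fast, 2))
```

With these numbers the totals are 141 and 77 cycles, a speedup of about 1.83, which is the "almost 2" on the slide. The speedup approaches 2 as the vector length grows relative to the pipeline latencies.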
Chaining Example • Assume: 8-element vectors, single-cycle ops
mul.ds $v2,$v3,$s1
add.d $v1,$v2,$v1
• Without chaining:
m m m m m m m m a a a a a a a a
• With chaining:
m m m m m m m m
  a a a a a a a a
Vector Startup Times • For vector ops to be worth using at all, startup overhead must be small • CRAY-1 can issue a vector insn every cycle, assuming no structural hazards on FUs • Result: vector performance > scalar performance for as few as four elements/vector
Cray Fortran Compiler (CFT) • Important insight: hand-coding assembly sucks • The actual important insight: most vectorizable code is of the embarrassingly-parallel variety • Even with 1970s compiler technology, innermost-loop parallelism is low-hanging fruit • Exploit this—make the compiler do the heavy lifting • CFT is pretty good for branchless inner loops • ...but doesn't even attempt to vectorize code with IFs • So any use of the Vector Mask register must be hand-coded • Upshot: a good start, but not quite there
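To make "branchless inner loop" concrete: a loop like SAXPY has no IFs and no dependence between iterations, so it maps directly onto 64-element vector operations. One common way for a vectorizing compiler to handle loops longer than a vector register is strip-mining, sketched below. The slides do not say exactly what code CFT generated, so treat this as an illustration of the general idea.

```python
MAX_VL = 64  # V-register length, from the Vector Operation slide


def saxpy_strip_mined(a, x, y):
    """Compute a*x[i] + y[i] in strips of at most MAX_VL elements.

    Each strip stands in for: set VL, vector load x and y,
    vector multiply, vector add, vector store.
    """
    n = len(x)
    result = [0] * n
    for start in range(0, n, MAX_VL):
        vl = min(MAX_VL, n - start)              # value for the VL register
        vx = x[start:start + vl]                 # "vector load"
        vy = y[start:start + vl]
        vr = [a * xi + yi for xi, yi in zip(vx, vy)]
        result[start:start + vl] = vr            # "vector store"
    return result


if __name__ == "__main__":
    x = list(range(150))
    y = list(range(150, 300))
    print(saxpy_strip_mined(2, x, y)[:5])
```

A 150-element loop becomes three strips of 64, 64 and 22 elements. An IF inside the loop body would need the Vector Mask register, which is the case CFT leaves to hand-coding.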
Analysis • Extremely fast computer for 1976 • Thought experiment: what if CRAY-1's parameters scaled with Moore's Law? (32 years == 21 doublings at 18 months each) • 200,000 transistors => 400 billion transistors • 8MB main memory => 16TB main memory • 80 MHz clock => ~170 THz (if only) • For a (merely) 2nd-generation vector processor, the CRAY-1 was ahead of its time (I think) • I'm not the only one: it was commercially phenomenal • However, design techniques (discrete logic) are totally unscalable
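The thought-experiment numbers can be checked with a few lines of arithmetic. The sketch assumes one doubling every 18 months, which is what turns 32 years into 21 doublings, and applies the same factor to every parameter.

```python
YEARS = 32
MONTHS_PER_DOUBLING = 18                         # assumed doubling period
DOUBLINGS = YEARS * 12 // MONTHS_PER_DOUBLING    # 21 whole doublings


def scale(value, doublings=DOUBLINGS):
    """Scale a 1976 parameter by 2**doublings."""
    return value * 2 ** doublings


if __name__ == "__main__":
    print(DOUBLINGS)
    print(scale(200_000))            # transistors
    print(scale(8 * 2 ** 20))        # bytes of main memory
    print(scale(80_000_000))         # clock in Hz
```

The results are roughly 4.2e11 transistors, 2^44 bytes (16 TB) of memory, and a clock of about 1.7e14 Hz.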
Questions?