
Computing beyond a Million Processors - bio-inspired massively-parallel architectures

Computing beyond a Million Processors - bio-inspired massively-parallel architectures. Andrew Brown, The University of Southampton, adb@ecs.soton.ac.uk. Steve Furber, The University of Manchester, steve.furber@manchester.ac.uk. SBF is supported by a Royal Society-Wolfson Research Merit Award.


Presentation Transcript


  1. Computing beyond a Million Processors - bio-inspired massively-parallel architectures Andrew Brown, The University of Southampton, adb@ecs.soton.ac.uk Steve Furber, The University of Manchester, steve.furber@manchester.ac.uk SBF is supported by a Royal Society-Wolfson Research Merit Award

  2. Outline • Computer Architecture Perspective • Building Brains • Living with Failure • Design Principles • The SpiNNaker system • Concurrency • Conclusions

  3. Multi-core CPUs • High-end uniprocessors • diminishing returns from complexity • wire vs transistor delays • Multi-core processors • cut-and-paste • simple way to deliver more MIPS • Moore’s Law • more transistors • more cores … but what about the software?

  4. Multi-core CPUs • General-purpose parallelization • an unsolved problem • the ‘Holy Grail’ of computer science for half a century? • but imperative in the many-core world • Once solved • few complex cores, or many simple cores? • simple cores win hands-down on power-efficiency!

  5. Back to the future • Imagine… • a limitless supply of (free) processors • load-balancing is irrelevant • all that matters is: • the energy used to perform a computation • formulating the problem to avoid synchronisation • abandoning determinism • How might such systems work?

  6. Bio-inspiration • How can massively parallel computing resources accelerate our understanding of brain function? • How can our growing understanding of brain function point the way to more efficient parallel, fault-tolerant computation?

  7. Outline • Computer Architecture Perspective • Building Brains • Living with Failure • Design Principles • The SpiNNaker system • Concurrency • Conclusions

  8. Building brains • Brains demonstrate • massive parallelism (10^12 neurons) • massive connectivity (10^15 synapses) • excellent power-efficiency • much better than today’s microchips • low-performance components (~100 Hz) • low-speed communication (~ metres/sec) • adaptivity – tolerant of component failure • autonomous learning

  9. Neurons • Multiple inputs (dendrites) • Single output (axon) • digital “spike” • fires at 10s to 100s of Hz • output connects to many targets • Synapse at input/output connection (www.ship.edu/~cgboeree/theneuron.html)

  10. Neurons • A flexible biological control component • very simple animals have a handful • bees: 850,000 • humans: 10^12 (photo courtesy of the Brain Mind Institute, EPFL)

  11. Neurons • Regular high-level structure • e.g. 6-level cortical microarchitecture • low-level vision, to • language, etc. • Random low-level structure • adapts over time (faculty.washington.edu/rhevner/Miscellany.html)

  12. Neural Computation • [figure: inputs x1…x4, weighted by w1…w4, summed and passed through a non-linearity f to give output y] • To compute we need: • Processing • Communication • Storage • Processing: abstract model • linear sum of weighted inputs • ignores non-linear processes in dendrites • non-linear output function • learn by adjusting synaptic weights
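The abstract model above is a one-liner in code. A minimal sketch (the names `neuron_output` and `step` are mine, not the deck's):

```python
def neuron_output(weights, inputs, f):
    """Abstract neuron: non-linear function f of a weighted input sum."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return f(activation)

# A hard threshold as the output non-linearity
step = lambda a: 1 if a >= 1.0 else 0

print(neuron_output([0.5, 0.5, 0.5], [1, 1, 0], step))  # 0.5 + 0.5 >= 1.0: fires
```

Learning then amounts to adjusting the entries of `weights`.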

  13. Processing • Leaky integrate-and-fire model • inputs are a series of spikes • total input is a weighted sum of the spikes • neuron activation is the input with a “leaky” decay • when activation exceeds threshold, output fires • habituation, refractory period, …?
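The leaky integrate-and-fire dynamics fit in a few lines. A sketch with illustrative parameter values (the leak factor, threshold and input current are not from the deck):

```python
def lif_step(v, i_in, leak=0.9, v_thresh=1.0, v_reset=0.0):
    """One time step of a leaky integrate-and-fire neuron:
    the activation decays ("leaks"), accumulates input, and the
    neuron fires when it crosses threshold."""
    v = leak * v + i_in
    if v >= v_thresh:
        return v_reset, True      # spike: reset the activation
    return v, False

# Drive the neuron with a constant input and count spikes
v, spikes = 0.0, 0
for _ in range(20):
    v, fired = lif_step(v, 0.3)
    spikes += fired

print(spikes)  # fires every 4 steps: 5 spikes in 20 steps
```

Habituation and a refractory period would be extra state on top of this core loop.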

  14. Processing (www.izhikevich.com) • Izhikevich model • two variables, one fast (v), one slow (u): v' = 0.04v^2 + 5v + 140 − u + I, u' = a(bv − u) • neuron fires when v > 30; then: v ← c, u ← u + d • a, b, c & d select behaviour
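A minimal sketch of the Izhikevich model, using the Euler scheme from the original 2003 paper (two 0.5 ms sub-steps for v per 1 ms step); the parameter defaults a=0.02, b=0.2, c=−65, d=8 give the regular-spiking behaviour:

```python
def izhikevich(I, a=0.02, b=0.2, c=-65.0, d=8.0, ms=1000):
    """Izhikevich two-variable neuron: v is the fast membrane variable,
    u the slow recovery variable. Returns the spike count over `ms` ms."""
    v, u, spikes = c, b * c, 0
    for _ in range(ms):
        if v >= 30.0:                # fired on the previous step:
            v, u, spikes = c, u + d, spikes + 1   # reset v, bump u
        for _ in range(2):           # two 0.5 ms sub-steps for v
            v += 0.5 * (0.04 * v * v + 5.0 * v + 140.0 - u + I)
        u += a * (b * v - u)
    return spikes

print(izhikevich(10.0))  # constant input: repetitive firing
print(izhikevich(0.0))   # no input: stays at rest
```

Changing a, b, c and d selects bursting, chattering and other firing patterns without changing the equations.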

  15. Communication • Spikes • biological neurons communicate principally via ‘spike’ events • asynchronous • information is only: • which neuron fires, and • when it fires
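Since the only information is *which* neuron fired and *when*, a spike can travel as a bare routing key identifying its source; arrival time carries the "when". The field layout below is invented for illustration and is not SpiNNaker's actual packet format:

```python
def make_key(x, y, core, neuron):
    """Pack a spike source into a 32-bit routing key: 8-bit chip x and y,
    5-bit core, 11-bit neuron index. (Illustrative layout only.)"""
    return (x << 24) | (y << 16) | (core << 11) | neuron

def split_key(key):
    """Recover (x, y, core, neuron) from a routing key."""
    return key >> 24, (key >> 16) & 0xFF, (key >> 11) & 0x1F, key & 0x7FF

key = make_key(3, 5, 17, 1000)
print(split_key(key))  # (3, 5, 17, 1000)
```

The payload-free packet is what makes asynchronous, fire-and-forget delivery practical.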

  16. Storage • Synaptic weights • stable over long periods of time • with diverse decay properties? • adaptive, with diverse rules • Hebbian, anti-Hebbian, LTP, LTD, ... • Axon ‘delay lines’ • Neuron dynamics • multiple time constants • Dynamic network states

  17. Outline • Building Brains • Computer Architecture Perspective • Living with Failure • Design Principles • The SpiNNaker system • Concurrency • Conclusions

  18. The Good News... • Transistors per Intel chip • [chart: millions of transistors per chip, 1970–2000, log scale — 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, Pentium II, Pentium III, Pentium 4 — rising from ~0.001M to ~100M]

  19. ...and the Bad News • Device variability • & • Component failure

  20. Atomic-scale devices • the simulation paradigm now • a 22 nm MOSFET: in production 2008 • a 4.2 nm MOSFET: in production 2023

  21. A view from Intel • The Good News: • we will have 100 billion transistor ICs • The Bad News: • billions will fail in manufacture • unusable due to parameter variations • billions more will fail over the first year of operation • intermittent and permanent faults (Shekhar Borkar, Intel Fellow)

  22. A view from Intel • Conclusions: • one-time production test will be out • burn-in to catch infant mortality will be impractical • test hardware will be an integral part of the design • dynamically self-test, detect errors, reconfigure, adapt, ... (Shekhar Borkar, Intel Fellow)

  23. Outline • Building Brains • Computer Architecture Perspective • Living with Failure • Design Principles • The SpiNNaker system • Concurrency • Conclusions

  24. Design principles • Virtualised topology • physical and logical connectivity are decoupled • Bounded asynchrony • time models itself • Energy frugality • processors are free • the real cost of computation is energy

  25. Outline • Building Brains • Computer Architecture Perspective • Living with Failure • Design Principles • The SpiNNaker system • Concurrency • Conclusions

  26. SpiNNaker project • Multi-core CPU node • 20 ARM968 processors • to model large-scale systems of spiking neurons • Scalable up to systems with 10,000s of nodes • over a million processors • >10^8 MIPS total • Power ~25 mW/neuron

  27. SpiNNaker project

  28. SpiNNaker project • Fault-tolerant architecture for large-scale neural modelling • A billion neurons in real time • A step-function increase in the scale of neural computation • Cost- and energy-efficient

  29. SpiNNaker system

  30. CMP node

  31. ARM968 subsystem

  32. GALS organization • clocked IP blocks • self-timed interconnect • self-timed inter-chip links

  33. Outline • Building Brains • Computer Architecture Perspective • Living with Failure • Design Principles • The SpiNNaker system • Concurrency • Conclusions

  34. Circuit-level concurrency • [figure: Tx → Rx link carrying data with ack return; din (2 phase), dout (4 phase), ¬reset, ¬ack] • Delay-insensitive comms • 3-of-6 RTZ on chip • 2-of-7 NRZ off chip • Deadlock resistance • Tx & Rx circuits have high deadlock immunity • Tx & Rx can be reset independently • each injects a token at reset • true transition detector filters surplus token
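The coding choice is easy to check by enumeration: an m-of-n delay-insensitive code has C(n, m) codewords, so 3-of-6 gives 20 and 2-of-7 gives 21 — both enough for 16 data symbols plus control. A small sketch:

```python
from itertools import combinations

def codewords(n, m):
    """All n-bit words with exactly m wires asserted (an m-of-n DI code)."""
    return [sum(1 << i for i in c) for c in combinations(range(n), m)]

print(len(codewords(6, 3)), len(codewords(7, 2)))  # 20 21
```

Because every valid codeword has exactly m transitions, the receiver knows a symbol is complete without any timing assumption — the essence of delay-insensitive signalling.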

  35. System-level concurrency • Breaking symmetry • any processor can be Monitor Processor • local ‘election’ on each chip, after self-test • all nodes are identical at start-up • addresses are computed relative to node with host connection (0,0) • system initialised using flood-fill • nearest-neighbour packet type • boot time (almost) independent of system scale
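The flood-fill boot can be sketched as a breadth-first wave spreading from the host node at (0,0), each configured node configuring its nearest neighbours. The point is that boot depth tracks the machine's diameter, not its node count. (A rectangular-mesh simplification — the real SpiNNaker fabric is a triangular torus.)

```python
from collections import deque

def boot_depth(width, height):
    """Number of flood-fill waves needed to configure a width x height
    mesh from the host node at (0, 0)."""
    hops = {(0, 0): 0}
    q = deque([(0, 0)])
    while q:
        x, y = q.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # NN links
            n = (x + dx, y + dy)
            if 0 <= n[0] < width and 0 <= n[1] < height and n not in hops:
                hops[n] = hops[(x, y)] + 1   # configured one wave later
                q.append(n)
    return max(hops.values())

print(boot_depth(4, 4), boot_depth(16, 16))  # 6 30: 16x the nodes, only 5x the waves
```

This is why boot time is (almost) independent of system scale: waves grow with the perimeter while node count grows with the area.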

  36. Application-level concurrency • [figure: sleeping core woken by Packet Received, DMA Completion and Timer Millisecond interrupts (priorities 1–3), invoking fetch_Synaptic_Data(), update_Neurons(), update_Stimulus(), then goto_Sleep() when idle] • Event-driven real-time software • spike packet arrived • initiate DMA • DMA of synaptic data completed • process inputs • insert axonal delay • 1 ms Timer interrupt
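The priority scheme can be sketched as a tiny event kernel: pending events are drained highest-priority-first, and the core sleeps when nothing is left. Event names echo the slide's diagram, but the kernel itself is an illustrative sketch, not the real SpiNNaker API:

```python
import heapq

# Priority 1 runs first (spike arrival pre-empts everything else)
PRIORITY = {"packet_received": 1, "dma_complete": 2, "timer_1ms": 3}

def drain_events(pending):
    """Run queued event handlers in priority order, then sleep."""
    heap = [(PRIORITY[e], i, e) for i, e in enumerate(pending)]
    heapq.heapify(heap)
    log = []
    while heap:
        _, _, event = heapq.heappop(heap)
        log.append(event)            # the handler body would run here
    log.append("sleep")              # nothing pending: low-power sleep
    return log

print(drain_events(["timer_1ms", "packet_received", "dma_complete"]))
```

There is no main loop blocking on anything: all work is a reaction to an event, which is what keeps a million cores busy without global synchronisation.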

  37. Application-level concurrency • Cross-system delay << 1ms • hardware routing • ‘emergency’ routing • failed links • Congestion • if all else fails • drop packet

  38. Biological concurrency • Firing rate population codes • N neurons • diverse tuning • collective coding of a physical parameter • accuracy • robust to neuron failure (Neural Engineering, Eliasmith & Anderson 2003)
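A population-vector sketch of the idea, with cosine tuning curves assumed purely for illustration: each neuron prefers a direction, an angle is decoded from the whole population, and knocking out a few neurons barely moves the estimate:

```python
import math

def population_decode(theta, n=64, failed=frozenset()):
    """Encode angle theta with n cosine-tuned neurons, then decode it
    as a population vector. Dead neurons simply contribute nothing."""
    est_x = est_y = 0.0
    for i in range(n):
        if i in failed:
            continue                              # failed neuron
        pref = 2 * math.pi * i / n                # preferred direction
        rate = max(0.0, math.cos(theta - pref))   # firing-rate tuning curve
        est_x += rate * math.cos(pref)
        est_y += rate * math.sin(pref)
    return math.atan2(est_y, est_x)

# Decoding survives the loss of a few neurons
print(population_decode(1.0), population_decode(1.0, failed={3, 7, 11}))
```

No single neuron is essential, which is exactly the fault tolerance the deck argues silicon will need.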

  39. Biological concurrency • Single spike/neuron codes • choose N to fire from a population of M • order of firing may or may not matter
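The capacity of such a code is simple combinatorics: C(M, N) distinct codes if only the firing set matters, M!/(M−N)! if firing order matters too. For example, choosing 10 firing neurons from a population of 100:

```python
from math import comb, perm, log2

M, N = 100, 10
unordered = comb(M, N)   # only *which* N neurons fired matters
ordered = perm(M, N)     # the *order* of firing matters too
print(round(log2(unordered)), round(log2(ordered)))  # 44 66 bits per code
```

Order sensitivity buys roughly 22 extra bits here, which is why whether order matters is worth asking.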

  40. Outline • Building Brains • Computer Architecture Perspective • Living with Failure • Design Principles • The SpiNNaker system • Concurrency • Conclusions

  41. Software progress • ARM SoC Designer SystemC model • 4 chip x 2 CPU top-level Verilog model • Running: • boot code • Izhikevich model • PDP2 codes • Basic configuration flow • …it all works!

  42. Where might this lead? • Robots • iCub EU project • open humanoid robot platform • mechanics, but no brain!

  43. Conclusions • Many-core processing is coming • soon we will have far more processors than we can program • When (if?) we crack parallelism… • more small processors are better than fewer large processors • synchronization, coherent global memory, determinism, are all impediments • Biology suggests a way forward! • but we need new theories of biological concurrency

  44. UoM SpiNNaker team
