Course contents • Basic Concepts • Computer Design • Computer Architectures for AI • Computer Architectures in Practice
Types of Parallelism • Focus: parallel systems and languages • Types of parallelism (//): • Functional //: derived from the logic of the problem's solution • Data //: based on data structures that can be regarded as a (huge) number of independent units of work • Various grains: images, matrices, neurons, genes, associations… • Custom data-parallel languages: DAP Prolog, Connection Machine LISP and C*
Level of Functional Parallelism • At different grains: • Instruction-level (fine-grain) // • Loop-level (medium-grain) // • Procedure-level (large-grain) // • Program-level (coarse-grain) // • ILP and LLP: discussed in Part 2 • Procedure-level //: the unit executed in // is a procedure • Program-level //: the basic units are the user programs (inherently independent) • A code sketch of loop-level and procedure-level // follows.
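A minimal sketch of loop-level and procedure-level parallelism in code, assuming OpenMP is available; the arrays and the two helper procedures (sum_array, max_array) are illustrative only, not part of the course material:

#include <stdio.h>

#define N 1000

/* Two independent "procedures" that can act as large-grain units of work. */
static double sum_array(const double *v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += v[i];
    return s;
}

static double max_array(const double *v, int n)
{
    double m = v[0];
    for (int i = 1; i < n; i++) if (v[i] > m) m = v[i];
    return m;
}

int main(void)
{
    static double a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = N - i; }

    /* Loop-level //: iterations of one loop are executed in parallel. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) a[i] = a[i] * 2.0 + 1.0;

    /* Procedure-level //: whole procedures are executed in parallel. */
    double s = 0.0, m = 0.0;
    #pragma omp parallel sections
    {
        #pragma omp section
        s = sum_array(a, N);
        #pragma omp section
        m = max_array(b, N);
    }

    printf("sum = %f, max = %f\n", s, m);
    return 0;
}

(Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp.)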
Types of parallel architectures • Flynn’s taxonomy • Single stream of instructions, executed by a single processor (SISD) • Single stream of instructions, executed by multiple processors (SIMD) • Multiple streams of instructions, executed by multiple processors (MIMD) • Multiple streams of instructions, executed by a single processor (MISD) • Often said to be defined only for the sake of symmetry (though…) • Or useful for dependability strategies based on time redundancy • Variants: redundant computations with voting
Classification • Data parallel (in the following) • Vector architectures • Associative and neural • SIMD • Systolic • Functional parallel • ILP • VLIW and superscalar (part 2) • Thread-level // • Process-level // • Distributed-memory MIMD (multi-computers) • Shared-memory MIMD (multi-processors)
Data-parallel architectures • Source: part III of Sima/Fountain/Kacsuk • Main idea: the computer works sequentially, but on a parallel data set • E.g. on a per-matrix vs. a per-byte basis (numerical analysis) • E.g. portions of images (image processing, machine vision, pattern recognition...) • E.g. portions of databases (associative and neural)
Data-parallel architectures • This corresponds to four “classes” of data parallel architectures: • SIMD • Systolic and pipelined • Vector • Associative and Neural • The rest of this part focuses on SIMD, systolic/pipelined, associative and neural architectures
Data-parallel architectures • Data-parallel architectures • SIMD • Associative and Neural • Systolic and pipelined
SIMD systems • SIMD = Single Instruction, Multiple Data • “First” ideas: von Neumann’s cellular automata (‘51) • Classical structure (Unger, ‘58): • Two-dimensional array of PEs, each connected to its four nearest neighbors • All PEs execute the same instruction at the same time • Each PE includes local memory • PEs can be programmed to perform various functions • Data propagate “quickly” through the array • SIMD computers added • A host computer / instruction sequencer • A dedicated I/O & mass storage system
SIMD systems • 1963: Solomon (Westinghouse) • Mesh of 32x64 PEs. • Constructed? • 1963-68: Illiac III and IV (Illinois U.) • III: 32x32 simple PEs; IV: 8x8 complex PEs. • Used effectively by NASA • 1960s: CLIP cellular logic image processor (U. London) • CLIP4 (‘80): 96x96 bit-serial PEs • Used effectively for low-level image processing • 1973: DAP distributed array processor (Reddaway) • Different sizes; prototype: 32x32, 1KB RAM (!) • Long range and near-neighbor interconnections • Custom languages (DAP-Prolog) • Still in production!
SIMD systems • 1983: MPP massively parallel processor (Goodyear) • 128x128, bit-serial PEs • Improved data I/O • 1980s: Connection Machine (Thinking Machines) • Thousands of bit-serial PEs, later hundreds of SPARC processors • Custom languages (CM/Lisp, C*, C/PARIS…) • Widely used • 1990: MasPar systems (MasPar) • Complex PEs • Long-range connections via a multi-level crossbar switch • Still a commercially active field today
SIMD design space • Granularity • How many data elements per PE? • Few data elements per PE: fine-grained (many simple PEs) • Many data elements per PE: coarse-grained (fewer, more powerful PEs) • PE complexity • Often fine-grained implies simple PEs, and vice versa • Application-dependent design choice • Economic and technological constraints • Connectivity • Minimum number of communication steps needed in the worst case - DIAMETER (e.g. a p x p mesh without wrap-around has diameter 2(p-1)) • Maximum amount of data that the network can transport from one set of PEs to another in one operation - BANDWIDTH
SIMD design space • Autonomy, in increasing order: • All PEs execute the same function • All active PEs execute the same function • PEs can autonomously select a portion of the input space and autonomously decide where their output shall be stored • PEs can autonomously choose their data paths • PEs can autonomously choose their function (from a predefined set) • Different, very simple algorithms may then be executed synchronously by different PEs • Instruction sequencing is always global! • A sketch of activity control (the second level) follows.
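A minimal sequential sketch of the “all active PEs execute the same function” level; the number of PEs, the data values and the chosen operation (negate negative values) are assumptions made purely for illustration:

#include <stdio.h>

#define N_PE 8   /* assumed number of processing elements */

int main(void)
{
    /* Each PE holds one local data element and an activity flag;
       the instruction stream itself is global and identical for all PEs. */
    int data[N_PE]   = {3, -1, 7, 0, -5, 2, 9, -4};
    int active[N_PE];

    /* Global instruction 1: every PE tests its own datum and sets its
       activity bit - this is how the "active" subset is formed.          */
    for (int pe = 0; pe < N_PE; pe++)
        active[pe] = (data[pe] < 0);

    /* Global instruction 2: the same operation is broadcast to all PEs,
       but only the active ones actually perform it.                      */
    for (int pe = 0; pe < N_PE; pe++)
        if (active[pe]) data[pe] = -data[pe];

    for (int pe = 0; pe < N_PE; pe++) printf("%d ", data[pe]);
    printf("\n");
    return 0;
}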
Connection Machine Connection Machine CM-1 in the Computer Museum, Boston, MA
Connection Machine • The Connection Machine was the first commercial computer designed expressly to work on simulating intelligence and life • A massively parallel supercomputer with up to 65,536 processors • Conceived by Danny Hillis while he was a graduate student under Marvin Minsky at the MIT Artificial Intelligence Lab
Connection Machine • Modelled after the structure of a human brain: rather than relying on a single powerful processor to perform calculations one after another, the data was distributed over the tens of thousands of processors, all of which could perform calculations simultaneously • The structures for communication and transfer of data between processors could change as needed depending on the nature of the problem, making the mutability of the connections between processors more important than the processors themselves, hence the name “Connection Machine” • Initially, 1-bit processors!
Connection Machine • CM-1: fine-grained • CM-5: coarse-grained: • Internal 64-bit organization • 40 MHz (now much higher…) • Data and control networks • 4 FP vector processors with separate data paths to memory • 32 MB memory (now much more…) • Parallel versions of C and Fortran
Connection Machine • Custom programming languages • C/Paris (common C plus the Paris library) • Each processor executes the same action and communicates with its nearest neighbours • *C (SIMD extension of C) • *Lisp (SIMD extension of the functional language LISP) • The higher the number of processors, the lower the system reliability • MTBF becomes very low (failures are frequent)
Connection Machine • References • http://mission.base.com/tamiko/theory/cm_txts/di-frames.html • Hillis, W. Daniel. The Connection Machine. MIT Press, Cambridge, MA, 1985. • Hillis, W. Daniel. “The Connection Machine.” Scientific American, vol. 256, June 1987, pp. 108–115.
Connection Machine Connection Machine CM-2 in the Museum of American History,Smithsonian Institution, Washington DC.
Data-parallel architectures • Data-parallel architectures • SIMD • Associative and Neural • Systolic and pipelined
Associative Architectures (Fig. 12.1 from the Sima/Fountain/Kacsuk book) • Common basic idea: • We need to classify an unknown item in order to perform the “right” action on it • This classification is based on its complete or partial similarity to known data
Associative processing • How does associative processing work? • Incoming data (images, genes, grammar tokens, matches) are regarded simply as, e.g., a string of bits (their binary representation) • Each bit of this representation is compared with the corresponding bit of the representation of each item in a database of known items, e.g. (SIZEDB = number of bits per item, DB[k] = the k-th known item): figure_of_merit = 0; for (i = 0; i < SIZEDB; i++) if (input[i] == DB[k][i]) figure_of_merit += Weight[i]; else figure_of_merit -= Weight[i]; • A weight is used to give different importance to specific bits • A self-contained version of this loop is sketched below.
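A minimal runnable sketch of the weighted matching loop above; the database contents, item size, weights and the unknown input are illustrative assumptions, not material from the slides:

#include <stdio.h>

#define N_ITEMS 4   /* assumed number of known items in the database */
#define SIZEDB  8   /* assumed number of bits per item               */

/* Assumed known items, per-bit weights, and an unknown input. */
static const int DB[N_ITEMS][SIZEDB] = {
    {0,0,0,0,1,1,1,1},
    {1,1,1,1,0,0,0,0},
    {1,0,1,0,1,0,1,0},
    {0,1,1,0,0,1,1,0},
};
static const int Weight[SIZEDB] = {1,1,1,1,2,2,2,2};
static const int input[SIZEDB]  = {1,0,1,0,1,0,1,1};

int main(void)
{
    int best_k = -1, best_fom = -1000000;

    for (int k = 0; k < N_ITEMS; k++) {
        /* Weighted bit-by-bit comparison, exactly as in the slide's loop. */
        int figure_of_merit = 0;
        for (int i = 0; i < SIZEDB; i++) {
            if (input[i] == DB[k][i]) figure_of_merit += Weight[i];
            else                      figure_of_merit -= Weight[i];
        }
        printf("item %d: figure of merit = %d\n", k, figure_of_merit);
        if (figure_of_merit > best_fom) { best_fom = figure_of_merit; best_k = k; }
    }
    printf("best (most similar) item: %d\n", best_k);
    return 0;
}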
Associative processing • Key choice: the data item used in the basic matching process • Working with raw bits may be difficult and less appropriate than working with more natural data items, e.g.: • Images -> pixels • Spell checker -> letters of the alphabet • Speech recognition -> phonetic atoms (phonemes)
Parallel Associative Processing • Where is the parallelism in associative processing? • At multiple levels • One may compare all letters of each word in parallel • Or do the same, but comparing sets of words in parallel • Or do the same, but with multiple databases • …and so forth • This is like working with parallelism at multiple levels: • ILP • Loop level • Process level, etc. • But here the steady state can be sustained more easily
The Associative String Processor • The Brunel ASP: a massively parallel, fine-grain associative architecture • Currently developed by Aspex Microsystems Ltd., UK • Building block: the ASP module • A set of ASP modules is attached to a control bus and data communication networks • The latter can be used to set up general-purpose and application-specific topologies • Crossbar, mesh, binary n-cube, …
The Associative String Processor • “The” ASP programming language is PROLOG • Scheme (block diagram of the ASP, not reproduced here)
Neural Computing • Brief description (NNs are covered in other courses of the MAI programme) • Neural computers are trained rather than programmed • After training, they can, e.g., classify an input datum as belonging to a certain class out of a set of possible classes • Paradigms of Neural Computing include • Classification • Reduction in the quantity of information necessary to classify some entities • Training set = a set of (parameters, classification) pairs • Processing = (parametric description, ?), i.e. the network fills in the missing classification • Multi-layer perceptrons (building on Rosenblatt’s perceptron, ‘58)
Neural Computing • Transformation • Moving from one representation to another (e.g., converting a written text into speech) • Same amount of data, but with a different representation • Minimization • Hopfield networks (Hopfield, ’82) • A net of all-to-all connected binary threshold logic units with weighted connections between units • A number of patterns are stored in the network • A partial or corrupted pattern is input; the network recognizes it as one of its stored patterns (a sketch follows)
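A minimal sketch of the Hopfield store/recall idea in plain C; the network size, the two stored patterns and the corrupted input are assumptions made only for illustration (Hebbian storage, asynchronous threshold updates):

#include <stdio.h>

#define N 8   /* assumed number of binary (+1/-1) units */
#define P 2   /* assumed number of stored patterns      */

static int W[N][N];                      /* weighted all-to-all connections */
static const int patterns[P][N] = {      /* assumed example patterns        */
    { 1,  1,  1,  1, -1, -1, -1, -1},
    { 1, -1,  1, -1,  1, -1,  1, -1},
};

/* Hebbian storage: W[i][j] accumulates the correlation of units i and j
   over all stored patterns (no self-connections).                        */
static void store(void)
{
    for (int p = 0; p < P; p++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (i != j) W[i][j] += patterns[p][i] * patterns[p][j];
}

/* Recall: repeatedly update each unit with a threshold rule until the
   state stops changing, i.e. the network has settled on a pattern.     */
static void recall(int s[N])
{
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < N; i++) {
            int h = 0;
            for (int j = 0; j < N; j++) h += W[i][j] * s[j];
            int next = (h >= 0) ? 1 : -1;
            if (next != s[i]) { s[i] = next; changed = 1; }
        }
    }
}

int main(void)
{
    store();
    /* A corrupted version of the first stored pattern (two units flipped). */
    int s[N] = { 1,  1, -1,  1, -1, -1,  1, -1};
    recall(s);
    for (int i = 0; i < N; i++) printf("%2d ", s[i]);
    printf("\n");
    return 0;
}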
Neural Computing …converted in analogform and multiplied with input… …and compared witha threshold (activate/don’t activate neuron) Weights: stored in digital form… • Masumoto (Fujitsu, 1993) developed a multi-purpose analog neuro-processor • One neuron on a single chip
Data-parallel pipeline systems • Derived from the basic functions and algorithms required by low-level image processing, in which • The same function needs to be applied to a long stream of data • The output value of a pixel is a function of the values of a few neighbor pixels
Data-parallel pipeline systems • The idea: the input image is represented as a linear stream of pixels • Custom-configured shift registers transform logically neighboring pixels into physically neighboring pixels (see the sketch below)
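A minimal simulation of the shift-register idea, assuming a tiny test image and a 3x3 neighborhood sum as the local operation; the image size, its contents and the operation are illustrative only (border windows wrap around in this simplified linear-stream model):

#include <stdio.h>

#define W 8             /* assumed image width                           */
#define H 6             /* assumed image height                          */
#define LINE (2*W + 3)  /* delay line spanning two image rows + 3 pixels */

int main(void)
{
    unsigned char img[H][W], sr[LINE] = {0};
    int filled = 0;

    /* Assumed test image: a simple gradient. */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            img[y][x] = (unsigned char)(x + y);

    /* Stream the image pixel by pixel through the shift register. */
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            /* Shift: the oldest pixel falls out, the new one enters. */
            for (int i = 0; i < LINE - 1; i++) sr[i] = sr[i + 1];
            sr[LINE - 1] = img[y][x];
            if (++filled < LINE) continue;

            /* The 3x3 neighborhood of the pixel W+1 positions back in the
               stream is now available at fixed taps of the delay line.    */
            int sum = 0;
            for (int r = 0; r < 3; r++)
                for (int c = 0; c < 3; c++)
                    sum += sr[r * W + c];

            int centre = y * W + x - W - 1;  /* linear position of the window centre */
            printf("3x3 sum around pixel %d = %d\n", centre, sum);
        }
    }
    return 0;
}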
Data-parallel pipeline systems • Example: the DIP-1 system (Gerritsen, ’78, TU Delft) • Characteristics: • Rather than shift registers, actual computation of the addresses of the neighboring pixels • A mixture of fixed-function and programmable elements • A set of binary-function, 3x3-neighborhood look-up tables • Performance: much faster than leading-edge general-purpose workstations of 1978 – for image processing only • Usability: • Too specialized • Too difficult to program (microprogrammed machine)
Systolic architectures • A specialization of SIMD, devised by Kung and Leiserson in 1978 • Named after an analogy with the pumping action of the heart: • Data are “pumped” through the system and fed to PEs that produce a stream of results • What is systolic computing?
Systolic computing: An example • Matrix multiplication • Simple, constant computation • All values of row n and col m are needed • All data in row n and col m can be accessed sequentially (with fixed stride)… • …and computed independently • …i.e. in parallel
Systolic computing: An example • Systolic computer: • A square matrix of PEs is available • Simple PE = multiplier, adder, storage unit • Fixed-function PEs • Each PE is only connected to its nearest neighbors • Connections are unidirectional • The clock pulses are the only control signal, pacing all PEs
Systolic computing: An example • Row and column values enter according to a fixed, predefined scheme • Each “move” takes 1 cycle time t
Systolic computing: An example

Time  PE1,1    PE1,2    PE1,3    PE1,4    PE1,5    PE1,6
t     A11xB11  wait     wait     wait     wait     wait
2t    A12xB12  A11xB21  wait     wait     wait     wait
3t    A13xB13  A12xB22  A11xB31  wait     wait     wait
4t    A14xB14  A13xB23  A12xB32  A11xB41  wait     wait
5t    A15xB15  A14xB24  A13xB33  A12xB42  A11xB51  wait
6t    A16xB16  A15xB25  A14xB34  A13xB43  A12xB52  A11xB61
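A minimal sequential simulation of this wavefront behaviour: A values flow in from the left and B values from the top, each skewed by one cycle per index (the "wait" entries above), and every PE performs one multiply-accumulate per clock pulse while forwarding its operands to its nearest neighbours. The matrix size and values are assumed for illustration, and the sketch uses the standard C = A x B indexing:

#include <stdio.h>

#define N 3   /* assumed matrix size (the schedule above uses 6) */

int main(void)
{
    /* Assumed input matrices. */
    int A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    int B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    int C[N][N] = {{0}};

    /* a[i][j], b[i][j]: the operands currently held by PE(i,j). */
    int a[N][N] = {{0}}, b[N][N] = {{0}};

    for (int t = 0; t < 3 * N - 2; t++) {
        /* Nearest-neighbour, unidirectional moves: a shifts right, b shifts down. */
        for (int i = 0; i < N; i++)
            for (int j = N - 1; j > 0; j--) a[i][j] = a[i][j - 1];
        for (int j = 0; j < N; j++)
            for (int i = N - 1; i > 0; i--) b[i][j] = b[i - 1][j];

        /* Feed the boundary PEs: row i of A (column j of B) starts after
           i (j) "wait" cycles; a zero operand models a waiting PE.        */
        for (int i = 0; i < N; i++)
            a[i][0] = (t >= i && t < i + N) ? A[i][t - i] : 0;
        for (int j = 0; j < N; j++)
            b[0][j] = (t >= j && t < j + N) ? B[t - j][j] : 0;

        /* On every clock pulse, every PE performs the same multiply-accumulate. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] += a[i][j] * b[i][j];
    }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%4d", C[i][j]);
        printf("\n");
    }
    return 0;
}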
Q1 from Exam of 2 Sept 03 • A Genetic Algorithm (GA) is an evolutionary algorithm which generates each individual from some encoded form known as a “chromosome”. Chromosomes are combined or mutated to breed new individuals. “Crossover”, the kind of recombination of chromosomes found in reproduction in nature, is often also used in GAs. Here, an offspring’s chromosome is created by joining segments chosen alternately from each of two parents’ chromosomes, which are of fixed length. The resulting individual can then be evaluated through a fitness function. Only the best-fitting individuals are allowed to become officially part of the next generation. Generations follow generations until the fitness function reaches some threshold value. • Assume a fixed population of n individuals and a fitness function f. • Assume that CPUtime(f) = t ms
Q1… (continued) • Sketch a computer architecture for this family of GAs for n and t very large • Motivate your choices.
Solution • For n and t very large • This means “a lot of computation” • Is this a data-parallel or a functional-parallel problem? • The same function is applied to each work unit -> data parallel • Which grain for the data? • n is very large, so coarse grain seems OK • Which grain for the processing units? • t is very large, so coarse grain seems OK
Solution • Architecture: one characterised by a data-parallel structure, able to operate on large chunks of data with powerful PEs • Tightly coupled MIMD (multiprocessor) or network of workstations • Each processor does the same work on different data (SIMD scheme mapped onto a MIMD architecture) • Each processor is quite powerful and can “crunch” a large work unit • Reduced need for synchronisation • A master PE could dispatch work units on demand (a sketch follows)
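A minimal sketch of the “same function on different data” scheme, assuming OpenMP on a shared-memory multiprocessor; the dummy fitness function, the population encoding and all sizes are placeholders, and schedule(dynamic) stands in for a master dispatching work units on demand:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N_INDIVIDUALS 10000   /* “n very large” (illustrative value) */
#define CHROM_LEN     64      /* assumed fixed chromosome length     */

/* Placeholder for the fitness function f: in the exercise each call costs
   t ms with t very large; here it is just a dummy computation.            */
static double fitness(const int *chrom)
{
    double s = 0.0;
    for (int i = 0; i < CHROM_LEN; i++) s += chrom[i];
    return s;
}

int main(void)
{
    static int population[N_INDIVIDUALS][CHROM_LEN];
    static double fit[N_INDIVIDUALS];

    for (int i = 0; i < N_INDIVIDUALS; i++)
        for (int j = 0; j < CHROM_LEN; j++)
            population[i][j] = rand() % 2;

    printf("using up to %d threads\n", omp_get_max_threads());

    /* Data-parallel step: every worker applies the same function f to its own
       individuals; the dynamic schedule hands out chunks of work on demand,
       so very little synchronisation is needed between the powerful PEs.     */
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < N_INDIVIDUALS; i++)
        fit[i] = fitness(population[i]);

    printf("fitness of individual 0: %f\n", fit[0]);
    return 0;
}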
An exercise for you • Now sketch a computer architecture for this family of GAs for n and t very small • Motivate your choices.
Program • Part 1 • Part 2.1, 2.2 • Part 2.3 • Intro to parallel processing • Instruction level parallelism • Intro • VLIW • Dynamic scheduling • Slides 2.3: 79 — 99 • Slides 2.3: 130 — 137, 157 — 166 • Part 3