Master Program (Laurea Magistrale) in Computer Science and Networking
High Performance Computing Systems and Enabling Platforms
Marco Vanneschi
• Prerequisites Revisited
• 1.3. Assembler level, CPU architecture, and I/O
Uniprocessor architecture
The firmware interpretation of any assembly language instruction has a decentralized organization, even for simple uniprocessor systems. The interpreter functionalities are partitioned among the various processing units (belonging to the CPU, M, I/O subsystems), each unit having its own Control Part and its own microprogram. Cooperation of units occurs through explicit message passing.
[Figure: a uniprocessor architecture. The levels Applications, Processes, Assembler, Firmware, Hardware are shown alongside the physical organization: the CPU (Processor, Cache Memory, Memory Management Unit, Interrupt Arbiter), the Main Memory, the DMA Bus, the I/O Bus, and the I/O units I/O1 … I/Ok.]
Assembly level characteristics
• RISC (Reduced Instruction Set Computer) vs CISC (Complex Instruction Set Computer)
• “The MIPS/PowerPC vs the x86 worlds”
RISC:
• basic elementary arithmetic operations on integers,
• simple addressing modes,
• intensive use of General Registers, LOAD/STORE architecture
• complex functionalities are implemented at the assembler level by (long) sequences of simple instructions
• rationale: powerful optimizations are (feasible and) made much easier:
• firmware architecture
• code compilation for caching and Instruction Level Parallelism
Assembly level characteristics
CISC:
• includes complex instructions for arithmetic operations (reals, and more),
• complex nested addressing modes,
• instructions corresponding to (parts of) system primitives and services:
• process cooperation and management
• memory management
• graphics
• networking
• …
RISC and CISC approaches
[Figure: RISC and CISC approaches across the levels Applications, Processes, Assembler, Firmware, Hardware. In the RISC architecture, a primitive (Primitive 1) used by a process is compiled into a long sequence of simple instructions, each interpreted by a firmware microprogram. In the CISC architecture, a primitive (Primitive 2) is compiled into a short sequence of proper instructions, possibly a single instruction, interpreted by a firmware microprogram.]
Test: to be submitted and discussed
Basically, CISC implements at the firmware level what RISC implements at the assembler level. Assume:
• no Instruction Level Parallelism
• same technology: clock cycle, memory access time, etc.
• Why could the completion time be reduced when implementing the same functionality at the firmware level (“in hardware” … !) compared to the implementation at the assembler level (“in software” … !)?
• Under which conditions is the ratio of completion times significant? Which order of magnitude?
• Exercise: answer these questions in a non-trivial way, i.e. exploiting the concepts and features of the levels and of level structuring.
• Write the answer (1–2 pages), submit the work to me, and discuss it at Question Time.
• Goal: to highlight background problems, weaknesses, … to gain aptitude.
Performance parameters
Average SERVICE TIME per instruction:
• average service time of individual instructions
• probability MIX (occurrence frequencies of instructions) for a given APPLICATION AREA
• together they give the global average SERVICE TIME per instruction
• CPI = average number of Clock cycles Per Instruction
• Performance = average number of executed instructions per second (MIPS, GIPS, TIPS, …) = processing bandwidth of the system at the assembler-firmware levels
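A minimal sketch of how these parameters fit together, assuming T denotes the global average service time per instruction and τ the clock cycle time (standard definitions; these symbols are not written explicitly in the slide above):
T = CPI × τ
Performance = 1 / T = f_clock / CPI   [instructions per second]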
Performance parameters
Completion time of a program: Tc = m × T, where m = average number of executed instructions and T = average service time per instruction.
• Benchmark approach to performance evaluation (e.g., SPEC benchmarks)
• Comparison of distinct machines, differing at the assembler (and firmware) level
• On the contrary: the Performance parameter is meaningful only for the comparison of different firmware implementations of the same assembler machine.
This simple formula holds rigorously for uniprocessors without Instruction Level Parallelism. It holds with good approximation for Instruction Level Parallelism (m = stream length), or for parallel architectures.
RISC vs CISC
• RISC decreases T and increases m
• CISC increases T and decreases m
• Where is the best trade-off?
• For Instruction Level Parallelism machines: RISC optimizations may be able to achieve a T decrease greater than the m increase
• cases in which the execution bandwidth has greater impact than the latency
• Where execution latency dominates, CISC achieves lower completion time
• e.g. process cooperation and management, services, special functionalities, …
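A purely illustrative comparison under the completion time formula Tc = m × T (the numbers are hypothetical, chosen only to show the trade-off; they do not come from the course material): suppose a functionality compiles into m = 20 RISC instructions of average service time T = 2τ, versus a single CISC instruction whose microprogram takes 25τ. Then Tc,RISC = 20 × 2τ = 40τ while Tc,CISC = 25τ, so the firmware (CISC) implementation wins on latency. If Instruction Level Parallelism reduces the effective RISC service time to about 1τ per instruction, Tc,RISC drops to about 20τ and the comparison reverses.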
A possible trade-off
• Basically a RISC machine with the addition of specialized co-processors or functional units for complex tasks
• graphic parallel co-processors
• vectorized functional units
• network/communication co-processors
• …
• Concept: modularity can lead to optimizations
• separation of concerns and its effectiveness
• Of course, a CISC machine with “very optimized” compilers can achieve the same/better results
• A matter of COMPLEXITY and COSTS / INVESTMENTS
Modularity and optimization of the complexity / efficiency ratio
[Figure: the uniprocessor architecture of the previous slides (Processor, Cache Memory, Memory Management Unit, Interrupt Arbiter, Main Memory, I/O units I/O1 … I/Ok), annotated as follows.]
• In turn, an advanced I/O unit can be a parallel architecture (vector / SIMD co-processor / GPU / …), or a powerful multiprocessor-based network interface, or …
• For interprocess communication and management: communication co-processors
• In turn, the Processor can be implemented as a (large) collection of parallel, pipelined units, e.g. pipelined hardware implementation of arithmetic operations on reals, or vector instructions
A “didactic RISC”
• See Patterson – Hennessy: the MIPS assembler machine
• In the courses of Computer Architecture in Pisa: D-RISC
• a didactic version (on paper) of MIPS with few, inessential simplifications
• useful for explaining concepts and techniques in computer architecture and compiler optimizations
• includes many features for code optimizations
D-RISC in a nutshell
• 64 General Registers: RG[64]
• Any instruction is represented in a single 32-bit word
• All arithmetic-logical and branch/jump instructions operate on RG
• Integer arithmetic only
• Target addresses of branch/jump: relative to the Program Counter, or RG contents
• Memory accesses can be done by LOAD and STORE instructions only
• Logical addresses are generated, i.e. the CPU “sees” the Virtual Memory of the running process
• Logical address computed as the sum of two RG contents (Base + Index mode)
• Input/Output: Memory Mapped I/O through LOAD/STORE
• No special I/O instructions
• Special instructions for:
• interrupt control: mask, enable/disable
• process termination
• context switching: minimal support for the MMU
• caching optimizations: annotations for prefetching, reuse, de-allocation, memory re-write
• indivisible sequences of memory accesses and annotations (multiprocessor application)
Example of compilation
int A[N], B[N];
for (i = 0; i < N; i++)
    A[i] = A[i] + B[i];
• R = keyword denoting “register address”
• For clarity of the examples, register addresses are indicated by symbols (e.g. RA = R27, then the base address of A is the content of RG[27])
• REGISTER ALLOCATIONS and INITIALIZATIONS at COMPILE TIME:
• RA, RB: addresses of RG initialized at the base addresses of A, B
• Ri: address of RG containing variable i, initialized at 0
• RN: address of RG initialized at the constant N
• Ra, Rb, Rc: addresses of RG containing temporaries (not initialized)
• Virtual Memory denoted by VM
• Program Counter denoted by IC
• “NORMAL” SEMANTICS of SOME D-RISC INSTRUCTIONS:
• LOAD RA, Ri, Ra :: RG[Ra] = VM[ RG[RA] + RG[Ri] ], IC = IC + 1
• ADD Ra, Rb, Rc :: RG[Rc] = RG[Ra] + RG[Rb], IC = IC + 1
• IF < Ri, RN, LOOP :: if RG[Ri] < RG[RN] then IC = IC + offset(LOOP) else IC = IC + 1
Compiled D-RISC code:
LOOP:  LOAD RA, Ri, Ra
       LOAD RB, Ri, Rb
       ADD Ra, Rb, Rc
       STORE RA, Ri, Rc
       INCR Ri
       IF < Ri, RN, LOOP
       END
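To make the “normal” semantics above concrete, here is a minimal C sketch of an interpreter for the compiled loop, assuming RG and VM are plain integer arrays and that the register numbers and the small data set are hypothetical choices made only for this illustration; it is not the D-RISC firmware, only a picture of how each instruction updates RG, VM and IC.

#include <stdio.h>

#define N 8

int RG[64];       /* general registers              */
int VM[1024];     /* (toy) virtual memory           */
int IC;           /* program counter (IC)           */

/* hypothetical register assignments, in the spirit of the slide's example */
enum { RA = 27, RB = 28, Ri = 29, RN = 30, Ra = 1, Rb = 2, Rc = 3 };

int main(void) {
    /* compile-time initializations: base addresses of A and B in VM, i = 0, constant N */
    RG[RA] = 0;      /* A starts at VM[0] */
    RG[RB] = N;      /* B starts at VM[N] */
    RG[Ri] = 0;
    RG[RN] = N;
    for (int k = 0; k < N; k++) { VM[RG[RA] + k] = k; VM[RG[RB] + k] = 10 * k; }

    /* interpretation of the compiled loop: each line mirrors one D-RISC instruction */
    IC = 0;
    for (;;) {
        RG[Ra] = VM[RG[RA] + RG[Ri]]; IC++;   /* LOAD  RA, Ri, Ra */
        RG[Rb] = VM[RG[RB] + RG[Ri]]; IC++;   /* LOAD  RB, Ri, Rb */
        RG[Rc] = RG[Ra] + RG[Rb];     IC++;   /* ADD   Ra, Rb, Rc */
        VM[RG[RA] + RG[Ri]] = RG[Rc]; IC++;   /* STORE RA, Ri, Rc */
        RG[Ri] = RG[Ri] + 1;          IC++;   /* INCR  Ri         */
        if (RG[Ri] < RG[RN]) { IC -= 5; }     /* IF <  Ri, RN, LOOP: backward offset */
        else { IC++; break; }                 /* fall through to END */
    }

    for (int k = 0; k < N; k++) printf("A[%d] = %d\n", k, VM[k]);
    return 0;
}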
Compilation rules
compile(if C then B):
        IF (not C) CONTINUE
        compile(B)
CONTINUE: …

compile(if C then B1 else B2):
        IF (C) THEN
        compile(B2)
        GOTO CONTINUE
THEN:   compile(B1)
CONTINUE: …

compile(while C do B):
LOOP:   IF (not C) CONTINUE
        compile(B)
        GOTO LOOP
CONTINUE: …

compile(do B while C):
LOOP:   compile(B)
        IF (C) LOOP
CONTINUE: …
I/O transfers
[Figure: the uniprocessor architecture with the internal memories MI/O1 … MI/Ok of the I/O units, the Main Memory, the DMA Bus, the I/O Bus, the CPU (Processor, Cache Memory, Memory Management Unit, Interrupt Arbiter), and the I/O units I/O1 … I/Ok.]
• DIRECT MEMORY ACCESS: I/O units can access the Main Memory
• MEMORY MAPPED I/O: the internal memories of the I/O units are extensions of the Main Memory
• fully transparent to the process: LOAD and STORE contain logical addresses
• compiler annotations drive the physical memory allocation
• SHARED MEMORY: both techniques support sharing of memory between CPU-processes and I/O-processes (I/O Memory and/or Main Memory)
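A generic C sketch of the memory-mapped I/O idea described above: a word of an I/O unit's internal memory is reached with ordinary loads and stores through a pointer, with no special I/O instructions. The address 0x40000000 and the register layout are purely hypothetical, not taken from D-RISC or from the course material.

#include <stdint.h>

/* hypothetical layout of an I/O unit's internal memory, mapped into the address space */
typedef struct {
    volatile uint32_t control;   /* written by the CPU process  */
    volatile uint32_t status;    /* written by the I/O unit     */
    volatile uint32_t data;      /* shared data word            */
} io_unit_regs;

#define IO_UNIT_BASE ((io_unit_regs *) 0x40000000u)   /* hypothetical mapped address */

/* start an operation and wait for completion: ordinary LOAD/STORE traffic only */
uint32_t io_read_word(void) {
    IO_UNIT_BASE->control = 1u;             /* STORE to the mapped control word */
    while (IO_UNIT_BASE->status == 0u)      /* LOADs polling the mapped status  */
        ;
    return IO_UNIT_BASE->data;              /* LOAD of the mapped data word     */
}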
Interrupt handling
• Signaling of asynchronous events (asynchronous wrt the CPU process) from the I/O subsystem
• In the microprogram of every instruction:
• firmware phase: test of the interrupt signal and call of a proper assembler procedure (HANDLER)
• HANDLER: specific treatment of the event signaled by the I/O unit
• Instead, exceptions are synchronous events (e.g. memory fault, protection violation, arithmetic error, and so on), i.e. generated by the running process
• Similar technique for exception handling (firmware phase + handler)
Firmware phase of interrupt handling
[Figure: timing of the firmware phase among P, MMU, Interrupt Arbiter and the I/O units — interrupt acknowledgement to the selected I/O unit, then communication of 2 words from the I/O unit and call of the handler procedure.]
• The I/O unit identifier is not associated with the interrupt signal
• Once the interrupt has been acknowledged by P, the I/O unit sends a message to the CPU consisting of 2 words (via the I/O Bus):
• an event identifier
• a parameter
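As a minimal illustration of the mechanism described above (not the actual D-RISC firmware): the per-instruction interrupt test, the two-word message from the acknowledged I/O unit, and the dispatch to an assembler-level handler can be sketched in C as follows; names such as interrupt_pending and ack_and_receive are hypothetical hooks assumed for the example.

#include <stdint.h>
#include <stdbool.h>

/* two-word message sent by the acknowledged I/O unit over the I/O Bus */
typedef struct {
    uint32_t event_id;    /* which event occurred        */
    uint32_t parameter;   /* event-specific parameter    */
} io_message;

/* hypothetical hardware-side hooks, assumed to exist */
extern bool       interrupt_pending(void);          /* interrupt signal raised?        */
extern io_message ack_and_receive(void);            /* ack selected unit, get 2 words  */
extern void       execute_one_instruction(void);    /* normal semantics of the instr.  */

/* assembler-level handlers, indexed by the event identifier */
typedef void (*handler_t)(uint32_t parameter);
extern handler_t handlers[];

/* interpretation loop: after every instruction the interrupt signal is tested;
   if raised, the 2-word message is read and the proper HANDLER is called */
void interpreter_loop(void) {
    for (;;) {
        execute_one_instruction();
        if (interrupt_pending()) {
            io_message m = ack_and_receive();
            handlers[m.event_id](m.parameter);
        }
    }
}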