A Characterization of Processor Performance in the VAX-11/780. From the ISCA Proceedings, 1984. Emer & Clark.
Overview • It used a micro-PC histogram which was, at the time, a novel approach to evaluating processor performance. • This approach attempted to quantify actual performance at the microinstruction level. • How did they do this and what were the results?
What is the motivation? • The VAX 8200 was being designed as the first VAX microprocessor, and the design included a CPU spread across 3 chips with the microcode on another 5 chips. • Chip crossings were expensive. • Would a two-level hierarchical microcode store work better? (Some microcode on the processor chip and the rest on the microcode chips.)
Motivation Continued • The question they were trying to answer was “With different latencies for different microinstructions, what would the performance be?” • As we will see, this approach did not answer the question.
Why is the VAX a good candidate for this type of study? • Think about studying a CISC. Would this help in designing a RISC? • Yes, if you can get quantitative numbers on how often each instruction occurs and whether its complex components are actually used. Then you can remove instructions that are rarely used and possibly map them onto a combination of the remaining instructions. However, this comes at the cost of having fewer instructions from which to choose.
Definitions What terms does this paper use and what do they mean?
What Is a Microinstruction? An actual instruction could be 'add the contents of registers X and Y'; a microinstruction would be more like 'write out register X to bus Z' or 'read data bus into register Q'. These are very basic actions that can be woven together to implement the actual instruction set of the machine. Taken together, this set of microinstructions is called the microprogram, or microcode. Thanks to Neal Harman at www-compsci.swan.ac.uk/~csneal/HPM/into.html
What is a Read Stall? Occurs when there is a cache miss on a D-stream read. The requesting microinstruction waits while the data is being retrieved. This takes a minimum of 6 cycles on the VAX.
What is a Write Stall? Occurs any time a write is attempted less than 6 cycles after a previous write. Can be minimized by a microprogram that writes no more often than every 6 cycles.
What is an IB Stall? Occurs when the instruction buffer (IB) does not hold enough bytes to satisfy the microcode’s request. Occurs during I-stream processing.
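The write-stall rule above can be illustrated with a minimal sketch. The model is an assumption made for illustration, not taken from the paper: write attempts are given as cycle numbers in stall-free program time, a write that arrives too early waits until 6 cycles have passed since the previous write, and every stall delays all later attempts by the same amount. The function name `write_stall_cycles` is hypothetical.

```python
WRITE_GAP_CYCLES = 6  # minimum spacing between writes, per the slide


def write_stall_cycles(attempt_cycles):
    """Return the total stall cycles for writes attempted at the given cycles.

    Hypothetical model: attempt times are in stall-free program time, and
    each stall pushes every later attempt back by the same number of cycles.
    """
    total_stall = 0
    last_write = None  # actual cycle at which the previous write started
    for attempt in attempt_cycles:
        actual = attempt + total_stall
        if last_write is not None and actual - last_write < WRITE_GAP_CYCLES:
            stall = WRITE_GAP_CYCLES - (actual - last_write)
            total_stall += stall
            actual += stall
        last_write = actual
    return total_stall


# Two writes only 2 cycles apart: the second one waits 4 cycles.
print(write_stall_cycles([0, 2, 20]))
# Writes spaced exactly 6 cycles apart never stall.
print(write_stall_cycles([0, 6, 12]))
```

Under this model, a microprogram that spaces its writes at least 6 cycles apart incurs no write stalls at all, which is the minimization the slide describes.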
What is an Architectural Event? An event that would occur in any implementation of the VAX architecture.
What is an Implementation Event? An event that is dependent on the particular implementation of the VAX architecture.
What are Operand Specifiers? The operand specifiers follow the opcode in the I-stream and indicate the type of each operand, for example whether a read operand is in a register or in memory addressed by a register.
Micro-PC Histogram Techniques • They developed their own special-purpose hardware with up to 16,000 addressable count locations. As the microcode executed, the hardware incremented the location selected by the current microcode address. • The counter capacity was sufficient to collect data for 1 to 2 hours of heavy processing on the CPU. • The great strength of this approach is its ability to classify every processor cycle and thus to establish the duration of events.
Micro-PC Histogram Techniques Continued • This technique captures data that is under the direct control of the microcode. • Both live and synthetic environments were used for data collection. • The VMS Null process, which runs when the system is idle, was excluded.
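The counting idea behind the micro-PC histogram can be sketched in a few lines. This is a software illustration only; the real monitor was dedicated hardware. The trace, the microcode addresses, and the labels below are made up for the example.

```python
NUM_COUNTERS = 16_000  # addressable count locations, per the slide


def collect_histogram(micro_pc_trace):
    """Count how many cycles were spent at each microcode address."""
    counts = [0] * NUM_COUNTERS
    for micro_pc in micro_pc_trace:  # one entry per processor cycle
        counts[micro_pc] += 1
    return counts


# Hypothetical microcode addresses and what the microcode there does.
LABELS = {0x100: "decode", 0x2A0: "add execute", 0x2A1: "operand fetch"}
trace = [0x100, 0x2A0, 0x100, 0x2A0, 0x2A1, 0x2A1, 0x2A1]

hist = collect_histogram(trace)
total_cycles = sum(hist)
for addr, label in LABELS.items():
    share = hist[addr] / total_cycles
    print(f"{label:14s} {hist[addr]:3d} cycles ({share:.0%})")
```

Because every cycle increments exactly one counter, the counts sum to the total number of cycles observed, which is what lets the technique attribute every processor cycle to some activity.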
Disadvantages of the Micro-PC Histogram Technique • Data such as instruction stream memory references are not part of the data collected because these events are controlled by the hardware. • To save space, some microcode shares microinstructions. In these cases it is not possible to differentiate between the sharers of the code. • Only average behavior is captured because there are no mechanisms to capture the variations of the statistics during the measurement.
Opcodes • The technique cannot distinguish all opcodes. • Opcodes can be grouped together. • Those classified as Simple (moves, branches and other simple instructions) occur much more frequently than other opcodes. This is no real surprise.
How did they determine Opcode frequency? • They counted the number of times each microinstruction was seen. • They knew which microinstructions were included in each Opcode. • Then, by setting up a system of linear equations, they were able to “back into” the Opcode frequencies.
How did they determine Opcode frequency? – Cont. • Some Opcodes shared microinstructions or even used exactly the same microinstructions (such as add and subtract). As a result, the best data they could glean from the linear equations was the frequency of groups of Opcodes.
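The "back into" step can be sketched as solving a small linear system. Everything in this example is hypothetical: the three opcode groups, the three microinstructions, and the usage matrix are invented to show the idea, and the paper's actual equations are not reproduced here. Each observed microinstruction count is modeled as the sum, over opcode groups, of how many times the group executes that microinstruction multiplied by the group's frequency.

```python
from fractions import Fraction


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination.

    Assumes a square, non-singular system; a singular one raises StopIteration.
    """
    n = len(rhs)
    rows = [[Fraction(v) for v in row] + [Fraction(b)]
            for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[n] for row in rows]


GROUPS = ["simple", "field", "float"]  # hypothetical opcode groups
# uses[m][g]: executions of microinstruction m per instruction of group g.
uses = [
    [1, 1, 1],  # m0: shared by every group
    [2, 1, 0],  # m1: twice per simple, once per field instruction
    [0, 0, 3],  # m2: three times per float instruction
]
observed_counts = [100, 175, 15]  # made-up histogram counts for m0, m1, m2

frequencies = solve_linear_system(uses, observed_counts)
for group, freq in zip(GROUPS, frequencies):
    print(f"{group}: {int(freq)} instructions")
```

If two opcodes used exactly the same microinstructions, their columns in `uses` would be identical and the system would have no unique solution. Merging such opcodes into one group, as the slide describes for add and subtract, removes the duplicate column.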
PC-Changing Instructions • The most interesting for the purposes of this paper are the conditional branches that actually branch. • These account for 38.5% of all instructions.
Memory Operations • The ratio of reads to writes is about 2 to 1 on the VAX. • To directly measure the contribution of each type of instruction group on overall performance, the results are in terms of events per average instruction.
Average Instruction Size • This is the only true architectural feature of the I-stream. • To calculate it, we combine the average number of operand specifiers per instruction, which they were able to determine was 1.48, with an average specifier size of 1.68 bytes, which is based on displacement figures from a previous paper by the authors. • Branch displacements are given as .31 per instruction.
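The arithmetic behind the average instruction size can be sketched as follows. The sketch reads 1.48 as the number of operand specifiers per instruction and 1.68 as the bytes per specifier. It also assumes a one-byte opcode and one byte per branch displacement; those two sizes are assumptions for illustration and are not stated on the slide.

```python
OPCODE_BYTES = 1.0            # assumption: one-byte opcode
SPECIFIERS_PER_INSTR = 1.48   # from the slide
BYTES_PER_SPECIFIER = 1.68    # from the slide
BRANCH_DISP_PER_INSTR = 0.31  # from the slide
BYTES_PER_BRANCH_DISP = 1.0   # assumption: one byte per branch displacement

average_size = (
    OPCODE_BYTES
    + SPECIFIERS_PER_INSTR * BYTES_PER_SPECIFIER
    + BRANCH_DISP_PER_INSTR * BYTES_PER_BRANCH_DISP
)
print(f"Average instruction size: {average_size:.2f} bytes")
```

Under these assumptions the specifiers contribute roughly 2.5 bytes, so they account for most of the average instruction's size.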
I-stream References • The VAX uses an 8-byte Instruction Buffer (IB). • This buffer makes a cache reference any time one or more bytes are empty. • The IB is controlled by hardware so the micro-PC histogram does not include counts of IB references. • A previous paper found that there were 2.2 cache references per instruction on average.
Cache Misses • The cache is controlled by hardware, so we have to rely on previous work, which found .28 cache read misses per instruction, of which .18 were due to the I-stream and .10 to the D-stream. • These misses are referred to as microcode stalls.
Translation Buffer (TB) Misses • The TB is controlled by microcode, so it can be measured. • The TB miss triggers a trap and the cycles are counted by a micro-routine for the duration of the trap. • The results were .029 misses per instruction (.02 for D-stream and .009 for I-stream). • The average number of cycles used to service a miss was 21.6 of which 3.5 were read stalls because the page table was not in cache.
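The TB figures above combine into a per-instruction cost with simple arithmetic. The sketch below uses only the numbers on the slide; the variable names are hypothetical.

```python
TB_MISSES_PER_INSTR = {"D-stream": 0.020, "I-stream": 0.009}  # from the slide
CYCLES_PER_MISS = 21.6             # average service time, from the slide
READ_STALL_CYCLES_PER_MISS = 3.5   # part of the 21.6 cycles, from the slide

misses_per_instr = sum(TB_MISSES_PER_INSTR.values())
tb_cycles_per_instr = misses_per_instr * CYCLES_PER_MISS
read_stall_share = READ_STALL_CYCLES_PER_MISS / CYCLES_PER_MISS

print(f"TB misses per instruction:       {misses_per_instr:.3f}")
print(f"TB miss cycles per instruction:  {tb_cycles_per_instr:.2f}")
print(f"Share of miss time in read stall: {read_stall_share:.0%}")
```

So although a single TB miss is expensive, misses are rare enough that they add well under one cycle to the average instruction.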
Stalls • The occurrence and duration of read, write and IB stalls are all specific to the implementation. • The duration is measurable by the micro-PC technique but the frequency is not.
Summary – What is Important and Applicable Today? • This paper was one of the first to begin to quantify performance. • Today, quantifying performance is expected, and many benchmarks have been developed by which systems can be compared.
Summary – What is Important and Applicable Today? • Based on this and other work, architectural design has moved from an art form to a science. • The fact that 83.6% of all instructions are simple supports the idea of RISC.
Summary – What is not as Relevant Today? • Floating point was a very small percentage of instructions (3.62%). Today the number of floating point instructions would probably be much higher because of all the graphics used in modern computers. • Call/Ret, which was 3.22%, would probably be much higher today with the advent of object-oriented languages, which use numerous function calls, overloaded functions, virtual functions, etc.