ARM1176JZF-S (iPhone 3G) Jeff Brantley, Chris Gregg, Bill Stitson
Processor Overview • Features • Designed for consumer and wireless products • RISC processor with Harvard architecture • Vector Floating Point (VFP) coprocessor • Branch prediction • “TrustZone” security built into the CPU • Instruction and data caches • 8-stage pipeline • 32-bit and 16-bit (“Thumb”) instruction sets, and “Jazelle” technology for Java execution
Memory Hierarchy • Harvard architecture: separate data and instruction caches • Allows simultaneous access • 64-bit datapaths • L1 Cache • up to 64KB in size • 4-way set associative • virtual index, physical tag • 8 words per line, critical word first on miss • Round robin or pseudo-random replacement policy [1]
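The L1 parameters above fix how an address is split on a lookup. The sketch below works through that arithmetic for the largest size the slide lists. It assumes 32-bit (4-byte) words, and the function name `lookup_fields` is hypothetical. To keep the model unambiguous it stores the whole physical line address as the tag, which is a simplification and not a statement about the real tag width.

```python
# Assumed configuration: the largest L1 size mentioned on the slide.
CACHE_BYTES = 64 * 1024   # "up to 64KB"
WAYS = 4                  # 4-way set associative
WORDS_PER_LINE = 8        # 8 words per line
BYTES_PER_WORD = 4        # assumption: 32-bit words

LINE_BYTES = WORDS_PER_LINE * BYTES_PER_WORD       # 32 bytes per line
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)      # 512 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1          # 5 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1             # 9 bits select a set


def lookup_fields(vaddr, paddr):
    """Return (set_index, tag, byte_offset) for one cache access.

    Virtual index, physical tag: the set is chosen from the virtual
    address, so it can be selected while translation is still in progress;
    the tag compared within the set comes from the physical address.
    """
    byte_offset = vaddr & (LINE_BYTES - 1)
    set_index = (vaddr >> OFFSET_BITS) & (NUM_SETS - 1)
    # Simplification: keep the full physical line address as the tag.
    tag = paddr >> OFFSET_BITS
    return set_index, tag, byte_offset
```

With these numbers the index and offset together cover address bits 0 to 13, so for a 64 KB cache the virtual index reaches above a 4 KB page boundary.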
Level 2 Interface • “high-bandwidth interface to second level caches, on-chip RAM, peripherals, and interfaces to external memory” [1] • Level 2 interconnect 64-bit wide interfaces: • Instruction Fetch • Data Read/Write • DMA • Peripheral Interface is 32 bits wide
Translation Lookaside Buffer (TLB) • MicroTLBs • One each for instructions, data • 10 entries • Fully associative • Round-robin or random replacement • Single Main TLB • Contains a fully-associative region of 8 lockable elements • Misses handled by two-level page table
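A toy model of one MicroTLB, assuming the slide's parameters (10 entries, fully associative) and the round-robin policy. The class and function names are hypothetical, and a plain dictionary stands in for everything behind the MicroTLB (the main TLB and the two-level page table walk).

```python
class MicroTLB:
    """Fully associative TLB with round-robin replacement (toy model)."""

    def __init__(self, entries=10):
        self.slots = [None] * entries   # each slot: (virtual_page, physical_page)
        self.next_victim = 0            # round-robin replacement pointer

    def lookup(self, vpage):
        # Fully associative: any slot may hold any page, so check them all.
        for slot in self.slots:
            if slot is not None and slot[0] == vpage:
                return slot[1]
        return None

    def fill(self, vpage, ppage):
        # Overwrite the slot under the pointer, then advance it.
        self.slots[self.next_victim] = (vpage, ppage)
        self.next_victim = (self.next_victim + 1) % len(self.slots)


def translate(tlb, backing, vpage):
    """Return (physical_page, hit). `backing` is a dict standing in for
    the main TLB and page table walk (an assumption of this sketch)."""
    ppage = tlb.lookup(vpage)
    if ppage is not None:
        return ppage, True
    ppage = backing[vpage]
    tlb.fill(vpage, ppage)
    return ppage, False
```

Touching eleven distinct pages in sequence is enough to evict the first one, since the pointer wraps after ten fills regardless of how recently an entry was used.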
Coprocessor Interface • Core processor can interface to on-chip coprocessors • Instruction set supports up to 16 coprocessors • Two of these are used by the VFP • Coprocessors are intended to run in step with the core and share data • Two-cycle delay: “generous timing margins” [1] • Loose synchronization via token queues • Core may flush the coprocessor pipeline or cancel instructions • Only one coprocessor is “active” at one time • Not so bad: calls to driver software = core instructions • Allows much of the interface to be shared, which saves cost
VFP Coprocessor • Uses a dedicated interface to the processor • Implements the IEEE 754 Standard for Binary Floating-Point Arithmetic • 64-bit load and store buses • 3 independent, parallel pipelines: • Load and store • Multiply and accumulate • Divide and square root • Short vector instructions: up to 8 single-precision or 4 double-precision elements • No branch instructions
Branch Prediction • Branch Prediction (BP) can be turned on and off with a control register. • Provides high level of control • The ARM processor performs two types of BP • Dynamic: performed in the Prefetch Unit • Static: performed by the integer core (and the first time, before historical data exists) • Branch folding • After prediction, the branch instruction is completely removed from the instruction stream presented to the pipeline.
Dynamic Branch Prediction • Dynamic Branch Prediction is the “first line” of branch prediction: if history exists, it will be used. • The Branch Target Address Cache (BTAC) holds virtual target addresses of previous branches • 128-entry, direct mapped cache • Includes a 2-bit branch prediction history. • A BTAC hit produces a branch prediction with zero cycle delay • Both branches (resolved taken and not taken) are stored in the BTAC, which improves performance. • Branch folding is done for almost all dynamically predicted branches.
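The BTAC behaviour described above can be sketched as a small simulation. Only the size (128 entries), the direct mapping, and the 2-bit history come from the slide. The index function, the counter encoding (0 to 3, predict taken at 2 or above), the initial counter value, and all names are assumptions made for illustration.

```python
BTAC_ENTRIES = 128  # 128-entry, direct mapped


class BTAC:
    """Toy branch target address cache with a 2-bit history per entry."""

    def __init__(self):
        # Each entry is [tag, target, counter], or None if never allocated.
        self.entries = [None] * BTAC_ENTRIES

    @staticmethod
    def _index_and_tag(pc):
        word = pc >> 2  # assumption: 4-byte aligned ARM instructions
        return word % BTAC_ENTRIES, word // BTAC_ENTRIES

    def predict(self, pc):
        """Return (hit, predicted_taken, target); the last two are None on a miss."""
        index, tag = self._index_and_tag(pc)
        entry = self.entries[index]
        if entry is None or entry[0] != tag:
            return False, None, None
        return True, entry[2] >= 2, entry[1]

    def update(self, pc, taken, target):
        """Record a resolved branch, whether it was taken or not."""
        index, tag = self._index_and_tag(pc)
        entry = self.entries[index]
        if entry is None or entry[0] != tag:
            # Allocate (or replace a conflicting branch), starting weakly
            # in the resolved direction (assumed initial state).
            self.entries[index] = [tag, target, 2 if taken else 1]
            return
        entry[1] = target
        # Saturating 2-bit counter.
        entry[2] = min(3, entry[2] + 1) if taken else max(0, entry[2] - 1)
```

A miss in `predict` is the case where a static predictor would have to take over; two branches whose word addresses differ by a multiple of 128 map to the same entry and evict each other, which is the conflict-miss case.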
Static Branch Prediction • Static Branch Prediction is based only on the characteristics of the branch instruction (i.e., it does not use history) • Simple: • All forward conditional branches are predicted not taken, and all backward conditional branches are predicted taken. • “Around 65% of all branches are preceded by enough non-branch cycles to be completely predicted.” [1] • The static branch predictor is used • on compulsory misses (i.e., the first time a branch is encountered) • when there are capacity or conflict misses in the BTAC
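The static rule needs only the branch direction, which makes it easy to sketch. The function names below are hypothetical, and the code models conditional branches only; a branch whose target equals its own address is treated as forward here, which is an arbitrary choice of this sketch.

```python
def static_predict_taken(branch_pc, target_pc):
    """Backward branches are predicted taken, forward branches not taken."""
    return target_pc < branch_pc


def count_mispredictions(branch_pc, target_pc, outcomes):
    """Count how often the fixed static prediction disagrees with `outcomes`,
    a sequence of booleans (True = the branch was actually taken)."""
    prediction = static_predict_taken(branch_pc, target_pc)
    return sum(1 for taken in outcomes if taken != prediction)
```

The rule suits loops: a loop-closing branch jumps backward and is taken on every iteration except the last, so a loop that runs ten times costs a single misprediction on exit.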
TrustZone • The ARM1176 processors implement “TrustZone” security extensions that “provide a secure environment for software” [1] • The hardware is partitioned so that the resources are physically separated on the chip, creating a strong boundary between the Normal World and the Secure World [2] • Two virtual processors are created from the one physical processor, removing the need for a separate processor dedicated to security • TrustZone-aware hardware such as DMA controllers allows secure data transfer • Examples of TrustZone uses range from secure PIN entry from the keyboard to Digital Rights Management of multimedia data.
Integer Pipeline • Up to 4 instructions fetched • Static branch prediction in Fe2 • Decode/Issue can hold branch alongside other instruction • Non-blocking loads • Hit Under Miss (HUM) buffer
Jazelle • Java hardware acceleration • Java bytecode translated to ARM instruction(s) • Extra decode logic between Fetch and Decode stages • Extension of ARM instruction set • Limited (unpublished) subset of Java bytecodes • Instructions to enter and exit Jazelle state • Unsupported bytecodes interpreted in software by JVM • Requires Jazelle-aware JVM • Relatively proprietary • Free/Open Source JVMs cannot take advantage of it
Thumb • 16-bit extension to 32-bit ARM ISA • “Most commonly used” ARM instructions in 16-bit form • Enables higher code density • “Reduces memory bandwidth and size requirements by up to 35%” [4] • Like Jazelle, requires extra pre-decode translation hardware • Can link Thumb-compiled code optimized for space against performance-critical code compiled to 32-bit ARM
References • [1] “ARM1176JZF-S Processor Technical Reference Manual”, ARM Limited, document number ARM DDI 0301F, 2004-2007. • [2] “TrustZone Hardware Architecture”, ARM Limited, http://www.arm.com/products/security/trustzone/hardware.html, downloaded Dec. 4, 2009. • [3] “Trust Zone System Design”, http://www.arm.com/products/security/trustzone/systemdesign.html, downloaded Dec. 4, 2009. • [4] “ARM1176JZ(F)-S”, ARM Limited, http://www.arm.com/products/CPUs/ARM1176.html, downloaded Dec. 4, 2009.