A Fast On-Chip Profiler Memory

Roman Lysecky, Susan Cotterell, Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported in part by the National Science Foundation.

Outline
• Introduction
• Problem Definition
• Profiling Techniques
• Pipelined Binary Search Tree
• ProMem
• Conclusions

Introduction
Goal: determine the number of times each target pattern appears on the bus.
Our solution: add an on-chip profiler memory (ProMem) to the monitored embedded bus.
• Accepts 1 pattern/cycle
• Keeps exact counts
[Figure: SoC with processor, I$, D$, memory, bridge, and peripherals; ProMem monitors an embedded bus]

Introduction
[Figure: profiling prog.c shows where most instructions are executed (Loop A … Loop N); the profile information is used to move Loop A to hardware through synthesis and FPGA configuration, next to the processor, memory, and peripherals]
prog.c:
    void compute() {
        // small Loop A
        for(i=0;…;…) …
        // small Loop B
        for(x=0;…;…)
    }

Introduction
• Profiling can be used to solve many problems
  • Optimization of frequently executed subroutines
  • Mapping frequently executed code and data to non-interfering cache regions
  • Synthesis of optimized hardware for common cases
  • Identifying frequent loops to map to a small low-power loop cache
  • Many others!

Problem Definition
Input patterns on bus B: P = {p1, …, pm}
Target patterns: TP = {tp1, …, tpm}
Target pattern counts: CTP = {ctp1, …, ctpm}
• Objective
  • Count the number of times each target pattern appears on bus B
• Requirements
  • Accept an input pattern on every clock cycle
  • Monitor any bus, e.g., deeply embedded buses in SoCs
  • Non-intrusive
  • Exact target pattern counts
Example counts:
TP     CTP
tp1    11203
tp2    8876
…      …
tpm    ctpm
[Figure: processor, memory, and peripherals on a bus carrying input patterns p1, p2, …, pm]
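As a point of reference, the counting objective can be written as a short software model. This is only a sketch of the required behavior, not of ProMem itself; the function name `count_target_patterns` and the sample bus values are hypothetical.

```python
def count_target_patterns(bus_trace, target_patterns):
    """Return exact counts CTP for each target pattern seen in the bus trace."""
    counts = {tp: 0 for tp in target_patterns}
    for pattern in bus_trace:          # one input pattern per clock cycle
        if pattern in counts:          # non-target patterns are ignored
            counts[pattern] += 1
    return counts


# Hypothetical trace of addresses observed on bus B.
trace = [0x10, 0x20, 0x10, 0x30]
result = count_target_patterns(trace, [0x10, 0x30, 0x40])
```

The hardware problem is to produce the same counts while accepting a new pattern on every cycle and without disturbing the monitored system.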

Profiling Techniques - Software
• Instrumenting software
  • Adding code to count frequencies of desired code regions
• Problems
  • Incurs runtime overhead
  • Possibly changes program behavior
  • Increases code size
prog.c: for( … ){ … ctpm++; }
[Figure: processor, memory, and peripherals on a bus carrying input patterns p1, p2, …, pm]

Profiling Techniques - Software
• Periodic sampling
  • Interrupt the processor at a periodic interval
  • Read the program counter and other internal registers
• Problems
  • Disrupts runtime behavior during the interrupt
  • Inaccurate
prog.c: // ISR period = 10ms
        ISR{ //update profile info }
[Figure: processor, memory, and peripherals on a bus carrying input patterns p1, p2, …, pm]
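The inaccuracy of periodic sampling can be illustrated with a toy model. This sketch assumes a sampler that looks at every `period`-th pattern and scales the result; the function name and the trace are made up for illustration.

```python
def sampled_counts(trace, target_patterns, period):
    """Estimate counts by sampling every `period`-th pattern and scaling up."""
    counts = {tp: 0 for tp in target_patterns}
    for pattern in trace[::period]:
        if pattern in counts:
            counts[pattern] += 1
    return {tp: c * period for tp, c in counts.items()}


# Hypothetical trace: pattern "A" appears 95 times, then "B" appears 5 times.
trace = ["A"] * 95 + ["B"] * 5
estimate = sampled_counts(trace, ["A", "B"], period=10)
# estimate is {"A": 100, "B": 0}, while the exact counts are A = 95 and B = 5.
```

In this example the rare pattern is missed entirely, which is why the requirements call for exact counts.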

Profiling Techniques - Software
• Simulation
  • Execute the application on an instruction set simulator (ISS)
  • The simulator keeps track of profile information
• Problems
  • The external environment is difficult to model, which leads to inaccuracy
  • Extremely slow
[Figure: prog.c runs on the ISS, which produces the profile information]

Profiling Techniques - Hardware
• Logic analyzer
  • Probes placed directly on the bus to be monitored
• Problems
  • Cannot monitor embedded buses
[Figure: processor, memory, and peripherals on a bus carrying input patterns p1, p2, …, pm]

Profiling Techniques - Hardware
• Processor support
  • Mainly event counters
  • Monitored events include cache misses, pipeline stalls, etc.
• Problems
  • Few registers available
  • Reconfiguration needed to obtain a complete profile
  • Leads to inaccuracy
[Figure: processor, memory, and peripherals on a bus carrying input patterns p1, p2, …, pm]

Profiling Techniques - Hardware
• Content-addressable memories (CAMs)
  • Fast search for a key in a large data set
  • Returns the address at which the key resides in a memory
• Types
  • Fully associative
  • RAM coupled with a smart controller
[Figure: a CAM attached to the bus between processor, memory, and peripherals, observing input patterns p1, p2, …, pm]

Profiling Techniques - Hardware
• Fully associative CAMs
  • Simultaneously compare every location with the key
• Problems
  • Do not scale well to larger memories
  • Increased access time as CAM size grows
  • Large power consumption
[Figure: CAM with one "=" comparator per stored target pattern tp1, tp2, tp3, …, tpm]

Profiling Techniques - Hardware
• RAM coupled with a smart controller
  • Efficient lookup data structure in memory, such as a binary tree or Patricia trie
• Problems
  • Multiple-cycle lookup
[Figure: CAM built from an SRAM and a controller (Ctrl), attached to the bus carrying input patterns p1, p2, …, pm]
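To make the multiple-cycle lookup concrete, the sketch below counts memory accesses for a binary search over a sorted target list, assuming one SRAM access per cycle. The function name `lookup_with_accesses` and the data are hypothetical and stand in for whatever structure the controller actually uses.

```python
def lookup_with_accesses(sorted_targets, key):
    """Binary search; return (index or None, number of memory accesses)."""
    lo, hi = 0, len(sorted_targets) - 1
    accesses = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        accesses += 1                      # one SRAM read per probe
        if sorted_targets[mid] == key:
            return mid, accesses
        if key < sorted_targets[mid]:
            hi = mid - 1
        else:
            lo = mid + 1
    return None, accesses


targets = list(range(15))                  # 15 hypothetical target patterns
best = lookup_with_accesses(targets, 7)    # (7, 1): found on the first probe
worst = lookup_with_accesses(targets, 0)   # (0, 4): four probes needed
```

With a new pattern arriving every cycle, a lookup that can take four cycles cannot keep up on its own.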

Observations
• A one-cycle lookup is not necessary
• We only need to accept one input pattern every cycle

Queueing
• Hold input patterns in a queue until we are able to process them
• Problems
  • Does not work with patterns arriving every clock cycle
[Figure: bus B feeds a FIFO placed in front of the CAM (SRAM plus controller)]
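A small model shows why queueing alone fails when a pattern arrives every cycle. It assumes one arrival per cycle and a fixed lookup latency; the function name `fifo_occupancy` and the latency value are illustrative assumptions.

```python
def fifo_occupancy(num_cycles, lookup_cycles):
    """Return the FIFO depth after each cycle, with one arrival per cycle."""
    depth = 0        # patterns waiting in the FIFO
    busy = 0         # cycles left on the lookup in progress
    history = []
    for _ in range(num_cycles):
        depth += 1                      # a new pattern arrives every cycle
        if busy == 0:                   # lookup unit is free: start the next one
            depth -= 1
            busy = lookup_cycles
        busy -= 1
        history.append(depth)
    return history


slow = fifo_occupancy(12, lookup_cycles=4)   # depth keeps growing
fast = fifo_occupancy(12, lookup_cycles=1)   # depth stays at zero
```

With a 4-cycle lookup the depth reaches 9 after 12 cycles and keeps growing, so any finite FIFO eventually overflows.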

Pipelining
• Implemented in processors so that an instruction can be executed every cycle
• Can we use pipelining to solve our problem?

Pipelined CAM
• Large CAMs require long access times
• Partition the large CAM into several smaller CAMs
  • Requires pipelining to reduce access time
• Provides a solution to the access time problem
• Requires large area
• Large power consumption
[Figure: one large CAM split into four smaller CAMs, each followed by a pipeline register]

Pipelined CAM
• Entries can be stored in a CAM in any order
  • Requires sequential lookup in the pipelined CAM approach
• Is there a benefit to sorting the entries?
  • Not necessary to search all entries
  • Leads to faster lookup time
• A tree structure provides an inherently sorted structure
  • Search time remains a problem
  • Can we pipeline the structure?

Pipelined Tree
• Solves the access time problem
  • One memory access per level
• Solves the area problem
  • Single comparator per level
  • Each level grows by a factor of two
  • For large memories, comparators are negligible
[Figure: tree levels with one "=" comparator per level]
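The area argument can be checked with a little arithmetic. The sketch below assumes a full tree with one memory per level, where level s holds 2^s entries and needs one comparator; the function name `tree_resources` is hypothetical.

```python
def tree_resources(num_stages):
    """Entries per level, total entries, and comparators for a pipelined tree."""
    entries = [2 ** s for s in range(num_stages)]   # each level doubles
    return {
        "entries_per_stage": entries,
        "total_entries": sum(entries),              # equals 2**num_stages - 1
        "comparators": num_stages,                  # one comparator per level
    }


small = tree_resources(4)    # 15 entries, 4 comparators
large = tree_resources(10)   # 1023 entries, 10 comparators
```

A fully associative CAM compares every location at once, so it would need one comparator per entry, while the comparator count here grows only with the number of levels.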

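Pipelined Binary Search Tree
• Root node: h
• Each node has at most two children
• Right child < parent
• Left child > parent
[Figure: tree drawn with larger keys on the left. Level 0: h. Level 1: j, d. Level 2: k, i, f, b. Level 3: g, e, c, a]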

Pipelined Binary Search Tree
Searching for input pattern: f
• Stage 0 (h): f < h, go right
• Stage 1 (j, d): f > d, go left
• Stage 2 (k, i, f, b): f = f, found!
• Stage 3 (g, e, c, a)

Pipelined Binary Search Tree
Searching for input pattern: e
• Stage 0 (h): e < h, append 0 to address (address: 0)
• Stage 1 (j, d at addresses 1, 0): e > d, append 1 to address (address: 01)
• Stage 2 (k, i, f, b at addresses 11, 10, 01, 00): e < f, append 0 to address (address: 010)
• Stage 3 (g, e, c, a at addresses 011, 010, 001, 000): e = e, found!
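The address-building search can be sketched in a few lines. The per-stage memories below are transcribed from the example tree (index = address within the stage, `None` = unused entry); the names `STAGES` and `search` are hypothetical, and the code models the lookup rule only, not the hardware.

```python
# Stage memories for the example tree: address 0 holds the smaller child.
STAGES = [
    ["h"],
    ["d", "j"],
    ["b", "f", "i", "k"],
    ["a", "c", "e", "g", None, None, None, None],
]


def search(stages, pattern):
    """Return (stage, address) where the pattern is stored, or None."""
    addr = 0
    for s, memory in enumerate(stages):
        node = memory[addr]
        if node is None:
            return None                    # fell off the tree
        if pattern == node:
            return s, addr
        # Append 0 if the pattern is smaller than the node, 1 if larger.
        addr = (addr << 1) | (1 if pattern > node else 0)
    return None


found_e = search(STAGES, "e")   # (3, 0b010): stage 3, address 010
found_f = search(STAGES, "f")   # (2, 0b01): stage 2, address 01
```

Each stage only reads its own memory at the address handed over by the previous stage, which is what makes the structure easy to pipeline.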

Pipelined Binary Search Tree
Searching for input patterns: e, f (f enters one cycle after e)
• Stage 0 (h): e < h, append 0 to address; then f < h, append 0 to address
• Stage 1 (j, d): e > d, append 1 to address; then f > d, append 1 to address (address: 01)
• Stage 2 (k, i, f, b): e < f, append 0 to address (address: 010); then f = f, found!
• Stage 3 (g, e, c, a): e = e, found!
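A cycle-by-cycle model of the two overlapping searches might look like the following. It is a behavioral sketch under simple assumptions (one pipeline register per stage, one new pattern per cycle); `STAGES` and `run_pipeline` are hypothetical names, and the memories are transcribed from the example tree.

```python
STAGES = [
    ["h"],
    ["d", "j"],
    ["b", "f", "i", "k"],
    ["a", "c", "e", "g", None, None, None, None],
]


def run_pipeline(stages, patterns):
    """Feed one pattern per cycle; return (per-stage counts, cycles used)."""
    n = len(stages)
    counts = [[0] * len(memory) for memory in stages]
    regs = [None] * n            # regs[s]: (pattern, address) entering stage s
    pending = list(patterns)
    cycles = 0
    while pending or any(r is not None for r in regs):
        regs[0] = (pending.pop(0), 0) if pending else None
        nxt = [None] * n
        for s, reg in enumerate(regs):
            if reg is None:
                continue         # stage not enabled this cycle
            pattern, addr = reg
            node = stages[s][addr]
            if node == pattern:
                counts[s][addr] += 1          # found: update the count
            elif node is not None and s + 1 < n:
                bit = 1 if pattern > node else 0
                nxt[s + 1] = (pattern, 2 * addr + bit)   # enable next stage
        regs = nxt
        cycles += 1
    return counts, cycles


counts, cycles = run_pipeline(STAGES, ["e", "f"])
# counts[3][2] == 1 (e), counts[2][1] == 1 (f), cycles == 4
```

Both searches finish in the same cycle here, and in this model m patterns need at most m + 3 cycles on the four-stage example, so throughput stays at one pattern per cycle.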

Pipelined Binary Search Tree
Each stage is implemented with a standard memory:
• Stage 0: h
• Stage 1: j (address 1), d (address 0)
• Stage 2: k (11), i (10), f (01), b (00)
• Stage 3: g (011), e (010), c (001), a (000); the remaining entries are unused (-)

ProMem – Module Design
[Figure: ProMem stage s. Inputs: input pattern ps_i, search address As_i, enable cen_i. Pipeline registers hold ps, As, and cen. Outputs to the next stage: input pattern ps_o, search address As+1_o, enable cen_o]

ProMem – Module Design
[Figure: ProMem stage s as before, with a target pattern memory TPMs (2^s × w) added; ports rd, addr, dout]

ProMem – Module Design
• Search for the target pattern: compare (>, =) the input pattern with the TPMs output
• Target pattern found
• Target pattern not found: enable the next stage
[Figure: ProMem stage s with pipeline registers, TPMs (2^s × w), and a comparator]

ProMem – Module Design
[Figure: ProMem stage s as before, with a target pattern count memory CMs (2^s × c) added; ports rd, addr, wr, dout]

ProMem – Module Design
• When the target pattern is found, update the count value (+1)
[Figure: ProMem stage s with pipeline registers, TPMs (2^s × w), CMs (2^s × c), comparator, and a +1 incrementer on the count memory]

ProMem – Module Design
Each module consists of:
• Pipeline register (ps, As, cen)
• Memories: TPMs (2^s × w) and CMs (2^s × c)
• Module controller (comparator and +1 count update)
[Figure: complete ProMem stage s with inputs ps_i, As_i, cen_i and outputs ps_o, As+1_o, cen_o]
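The behavior of a single module can be approximated in software. The class below is a sketch inferred from the block diagram: the class name `ProMemStage`, the `step` method, and the use of `None` for unused entries are assumptions, while `tpm` and `cm` mirror the target pattern and count memories.

```python
class ProMemStage:
    """Behavioral sketch of one ProMem stage (not a hardware description)."""

    def __init__(self, targets):
        self.tpm = list(targets)            # target pattern memory
        self.cm = [0] * len(self.tpm)       # target pattern count memory

    def step(self, p_i, a_i, cen_i):
        """Process one pattern; return (p_o, a_o, cen_o) for the next stage."""
        if not cen_i:
            return p_i, 0, False            # stage disabled this cycle
        target = self.tpm[a_i]
        if target is None:
            return p_i, 0, False            # unused entry: search ends
        if p_i == target:
            self.cm[a_i] += 1               # found: increment the count
            return p_i, 0, False
        # Not found: extend the address and enable the next stage.
        a_o = (a_i << 1) | (1 if p_i > target else 0)
        return p_i, a_o, True


stage1 = ProMemStage(["d", "j"])            # stage 1 of the example tree
miss = stage1.step("e", 0, True)            # ("e", 1, True): continue search
hit = stage1.step("d", 0, True)             # ("d", 0, False): count updated
```

Chaining such stages, with a register between each pair, gives the pipelined search described earlier.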

ProMem - Interface
• Simple interface
• Internal interface
  • Enable signal
  • Connection to monitored bus
• External interface
  • Read enable
  • Write enable
  • Connection to ProMem pattern input bus
[Figure: ProMem attached to the bus carrying input patterns p1, p2, …, pm, with signals clk, cen, addr, ren, and wen]

ProMem - Layout
• Efficient layout
  • Achieved by simply abutting each module with the next
  • Results in very short bus wires between modules
[Figure: ProMem stage s module, as shown in the module design slides]

ProMem Results – Area*
• Module overhead is only 1%
*Area obtained using the UMC 0.18 µm technology library provided by Artisan Components

ProMem Results – vs. CAM
• The CAM design is 46% larger than ProMem

ProMem Results – Timing vs. CAM
• CAM access time grows with CAM size
• ProMem access time remains constant (due to pipelining)

Conclusions
• Introduced a new memory structure specifically for fast on-chip profiling
  • One pattern per cycle throughput
  • Simple interface to the monitored bus
  • Efficient design that is very scalable
