Cache Automaton

Cache Automaton Arun Subramaniyan Jingcheng Wang EzhilR.M.Balasubramanian David Blaauw Dennis Sylvester Reetuparna Das University of Michigan – Ann Arbor MICRO-50, Boston, USA – October 17th, 2017

Pattern matching in abundance … 127.0.0.1 - - [07/Dec/2016:11:04:58 +0100] "GET /HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:49.0) Gecko/20100101 Firefox/49.0" <bookid="bk101"><author>Gambardella, Matthew</author><title>XML Developer's Guide</title><genre>Computer</genre><price>44.95</price><publish_date>2000-1001</publish_date><description>An in-depth look at creating applications with XML.</description></book> Log Processing / right/JJ to/[^\s]+ / / the/[^\s]+ back/RB / / [^/]+/DT longer/[^\s]+ / / ,/[^\s]+ have/VB / / [^/]+/VBD by/[^\s]+ / #alert tcp$ExXTERNAL_NETany -> $HOME_NET $HTTP_PORTS (msg:"PROTOCOL-SCADA Cogent DataHub server-side information disclosure"; flow:to_server,established; content:".asp."; nocase; http_uri; pcre:"/\x2easp\x2e($|\?)/iU"; metadata:servicehttp; reference:cve,2011-3502; classtype:web-application-attack; sid:20174; rev:4;)xx XML Parsing Natural Language Processing Network Intrusion Detection Donald Duck Donald Fauntleroy Duck Donald F Duck D F Duck Video Decoding /([STX])(.{2}?)([DBEZX])/ /([RKX])(.{2,3}?)([DBEZX])(.{2,3}?)([YX])/ /([GX])([^EDRKHPFYW])(.{2}?)([STAGCNBX])([^P])/ Data Analytics Motif search Particle Path Tracking

Finite State Automata extract patterns \n \n x \s S2 S0 S1 x \s, \n x,\s Σ = {x, \s, \n}

Compute-Centric Architectures Switch-Case 1 while(c != EOF) { 2 if (c == ‘\n’) { putchar('\n'); *state = S0; } 3 else 4 switch(*state) { 5 caseS0: 6 if(c != ‘\s’) { 7 putchar(c); *state = S1; 8 } break; 9 caseS1: if(c == ‘\s’) { *state = S2; } 10 else{ putchar(c); } break; 11 caseS2: break; 12 } 13 } ! ! Irregular memory accesses Branch Mispredictions

Compute-Centric Architectures Table-Lookup 1 structT_table [3][3] = { 2 { {S0, S1}, {S0, S0}, {S0, S0} }, 3 { {S1, S1}, {S1, S2}, {S1, S0} }, 4 { {S2, S2}, {S2, S2}, {S2, S0} } }; 5 6 while(c != EOF) { 7 int id1 = (c == ‘\s’) ? 0 : (c == ‘\n’) ? 1 : 2 8 *state = T_table [*state] [id] [1] 9 putchar(c); 10 } ! ! Limited state transitions per cycle Memory bandwidth bottlenecked

Memory-Centric Architectures 1 rank with 8 chips Micron’s Automata Processor (AP) 48k state transitions per cycle per chip 256xfaster than CPU 170xfaster than GPU (iNFAnt2) [ANMLZoo, Wadden et al. ’16 ]

In-memory automata processing S1 S2 S3 S4 S5 S6 b i b i b report report start b a a i start Equivalent S1 i a a a report report a a a a report b

In-memory automata processing S6 S1 S2 S3 S4 S5 b i b i b report report start b a a i start Equivalent S1 report i a a a report report a a a a Each state has incoming transitions on one input symbol b

In-memory automata processing S6 S1 S2 S3 S4 S5 b b i b i b report report start report b report a a i start start Equivalent S1 S1 i i a a a report report a b a a a a a report Equivalent a a b Homogeneous NFA

State representation 0 0 0 1 1 1 .. .. .. .. .. 1 1 1 0 0 .. .. 1 .. .. .. 0 0 0 0 0 .. 1 .. .. .. .. 0 0 0 97 One-hot encoding 98 * b a 8-bit ASCII alphabet 255 255 255

In-memory automata processing Input symbols b report report start S1 i a b a report a a

In-memory automata processing aba. . . Input symbols b report report start S1 i a b a report a a State-Match 1

Massively Parallel State-Match Repurpose memory columns as FSM states S2 S1 S0 Sn b a a 0 Repurpose row address as input symbol Input symbols . . a ba 255 Active State Vector 110 1 0 1 1

Massively Parallel State-Match Repurpose memory columns as FSM states S2 S1 S0 Sn b a a 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 Repurpose row address as input symbol 0 1 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 Input symbols 0 0 0 0 0 0 0 . . a ba 0 0 0 0 0 0 0 255 Active State Vector 110 1 0 1 1

Massively Parallel State-Match Repurpose memory columns as FSM states S2 S1 S0 Sn b a a 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 Repurpose row address as input symbol 0 1 0 0 0 0 0 1 0 Bit-parallel state-match 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 Input symbols 0 0 0 0 0 0 0 . . a ba 0 0 0 0 0 0 0 255 Active State Vector 1 0 1 0 1 1 0 110 1 0 1 1

In-memory automata processing aba. . . Input symbols b report report start S1 i a b a report a a State-Match 1 State-Transition 2

Massively Parallel State-Transition Repurpose memory columns as FSM states S2 S1 S0 Sn b a a 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 Repurpose row address as input symbol 0 1 0 0 0 0 0 1 0 Bit-parallel state-match 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 Input symbols 0 0 0 0 0 0 0 . . a ba 0 0 0 0 0 0 0 255 Active State Vector 1 0 1 0 1 1 0 1 10 1 0 1 1 &

Massively Parallel State-Transition Repurpose memory columns as FSM states S2 S1 S0 Sn b a a 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 Repurpose row address as input symbol 0 1 0 0 0 0 0 1 0 Bit-parallel state-match 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 Input symbols 0 0 0 0 0 0 0 . . a ba 0 0 0 0 0 0 0 255 Bit-parallel state-transition Active State Vector Custom interconnect

Massively Parallel State-Transition Repurpose memory columns as FSM states S2 S1 S0 Sn b a a 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 Repurpose row address as input symbol 0 1 0 0 0 0 0 1 0 Bit-parallel state-match 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 Input symbols 0 0 0 0 0 0 0 . . a ba 0 0 0 0 0 0 0 255 Active State Vector 0 0 1 0 0 1 1 Custom interconnect

Cache Automaton ? ! DRAM technology energy inefficient and AP is slow (~133 MHz) Low packing density (12 MB AP = 200 MB regular DRAM) Off-chip host to accelerator communication overhead ! ! Are SRAM-based caches suitable for automata processing ?

Outline • Motivation • In-Memory Automata Processing • Opportunity • Challenges • Cache Automaton • Compiler • Evaluation

Opportunity ✓ SRAM-based caches are faster, more energy efficient than DRAM ✓ Integrated on processor dies with performance-optimized logic ! Is cache capacity an issue for large NFA ?

Opportunity

Challenges 1 Using cache hierarchy is slow

Using the cache hierarchy is slow ! L1 cache ~ 2 cycles L2 cache ~ 12 cycles L3 cache ~ 36 cycles Credits: Intel Low operating frequency ~ 111 MHz

Challenges 2 Efficient state-match and state-transition in cache

Column multiplexing reduces bit-parallelism Column-multiplexing Repurpose memory columns to store FSM states 0 Repurpose row address as Input symbol 1 0 1 0 1 1 0 Row decoder 255 Bit-level parallelism State-match

Scalable interconnect for state-transition in cache Crossbar ? ! Overprovisioned w/ dynamic arbitration, multi-bit ports Infeasible Crossbar 100,000 x 100,000 100,000 states Network Architecture ? ? ? ? ?

1 In-situ computation cognizant of cache geometry can provide benefits

Intel Xeon Last-Level Cache 2.5 MB Slice CBOX Way 2 Way 1 Way 20 Way 19 32kB data bank 16kB subarray 16kB subarray Tag, State, LRU LS

Intel Xeon Last-Level Cache 2.5 MB Slice Chunk 63 2:1 Chunk 62 16kB subarray I/O SRAM SRAM 4:1 4:1 2:1 CBOX Decoder 2:1 SRAM I/O 4:1 SRAM 4:1 Chunk 1 2:1 Chunk 0 Way 2 Way 1 Way 20 Way 19 32kB data bank 16kB subarray 16kB subarray Tag, State, LRU LS

Intel Xeon Last-Level Cache Chunk 63 2.5 MB Slice 2:1 Chunk 62 16kB subarray 8kB SRAM array I/O SRAM SRAM 4:1 4:1 255 2:1 CBOX 0 Decoder WL 2:1 Row decoder SRAM I/O 4:1 SRAM 4:1 Chunk 1 2:1 Chunk 0 255 Way 2 /BLB BL Way 1 Way 20 Way 19 32kB data bank 16kB subarray 16kB subarray LS Tag, State, LRU

Intel Xeon Last-Level Cache Chunk 63 2.5 MB Slice 2:1 Chunk 62 16kB subarray 8kB SRAM array I/O SRAM SRAM 4:1 4:1 255 2:1 CBOX 0 Decoder WL 2:1 Row decoder SRAM I/O 4:1 SRAM 4:1 Chunk 1 2:1 Chunk 0 255 Way 2 /BLB BL Way 1 Way 20 Way 19 32kB data bank 16kB subarray 16kB subarray @ 4 GHz LS Tag, State, LRU

2 Sense-amplifier cycling enables highly parallel, low latency state-match

Accelerating state-match is challenging … Column-multiplexing Read sequence ! 4 cycles to read 4 adjacent bits (state-match results)

Sense-amplifier cycling Read sequence ! 4 cycles to read 4 adjacent bits (state-match results) Read sequence (optimized) ✓ < 2 cycles to read 4 adjacent bits (state-match results)

3 8T-based SRAM arrays can be repurposed to compactly encode state-transitions

Accelerating state-transition 2T read stack Crosspoint Enable bit Compact 8T SRAM-based switches as building block of programmable interconnect

4 Real-world NFA can be grouped into dense regions with sparse connectivity

Hierarchical network architecture L-switch L-switch L-switch G-switch Hierarchical switch topology scales to large NFA Real-world NFA Dense Sparse

Putting it together … L-switch CBOX Way 2 Way 1 Way 20 Way 19 16kB subarray Tag, State, LRU LS Array_HPA[16] = 1 Local Switch Array_LPA[16] = 0

Cache Automaton Architecture G-switch-1 L-switch CBOX Way 2 Way 1 Way 20 Way 19 16kB subarray Tag, State, LRU LS Array_HPA[16] = 1 Local Switch Array_LPA[16] = 0

Cache Automaton Architecture G-switch-4 G-switch-1 L-switch Output buffer CBOX Input buffer H-bus Way 2 Way 2 Way 1 Way 1 Way 19 Way 20 Way 20 Way 19 16kB subarray Tag, State, LRU LS Array_HPA[16] = 1 Local Switch Array_LPA[16] = 0

Cache Automaton Architecture G-switch-4 G-switch-1 L-switch 256 state partition (2 x 4KB SRAM Arrays) STE(255) STE(0) STE(1) STE(2) 0 Chunk 63 2:1 Chunk 62 8-bit Input I/O SRAM SRAM 4:1 4:1 Output buffer CBOX 2:1 Input buffer Output bit Decoder 2:1 Row decoder 255 256 b Match Vector I/O SRAM 4:1 SRAM To G-Switch-1 256 b 4:1 16b To G-Switch-4 8b 16b From G-Switch-1 Chunk 1 2:1 From G-Switch-4 Active State Vector Chunk 0 8b 256 b Way 2 Way 1 Way 20 Way 19 256 b 280x256 L-Switch Report Vector 16kB subarray Tag, State, LRU LS Array_HPA[16] = 1 Local Switch Array_LPA[16] = 0

Merging connected components a a r a c c a r a c m e c a r a m e c t e l l e t r * start * start * start * start * start Merging connected components

Cache Automaton designs Performance-optimized (CA_P) ✓ Small connected components -> low radix switches Redundant state activity Space-optimized (CA_S) ✓ Low footprint, no redundant state activity Large connected components -> high radix switches and careful mapping

Cache Automaton

Cache Automaton

Presentation Transcript

Cache

Cellular Automaton

Finite state automaton (FSA)

Airbag Automaton

Cache

Cache

Cache-Based Scalable Deep Packet Inspection with Predictive Automaton

Predator prey cellular automaton

Deterministic Finite Automaton (DFA)

Deterministic Finite Automaton

Cache

automaton —

Cellular Automaton

Non-deterministic finite automaton

Pushdown Automaton (PDA)

Lecture7 Pushdown Automaton

Generalized Buchi automaton

Finite Automaton: Example 1

Finite state automaton (FSA)

Cache?

Cache