Beyond Bloom Filters: From Approximate Membership Checks to Approximate State Machines

By F. Bonomi et al. Presented by Kenny Cheng, Tonny Mak Yui Kuen

Introduction • Motivation • Objectives • Problem statements

A) Motivation • Increasing trend to keep flow state in routers • Storing state for a large number of flows needs a lot of memory (~100 bits per flow) • If the memory footprint can be reduced, fast on-chip memory becomes feasible and performance improves

B) Objectives • Introduce the idea of an Approximate Concurrent State Machine (ACSM), which sacrifices some accuracy to reduce memory size • Introduce and compare several solutions to the ACSM problem • Find the approach with the highest accuracy-to-memory ratio

C) Problem statements • Describe 3 techniques based on Bloom filters and hashing, and evaluate them using both theoretical analysis and simulation

Bloom Filter • A data structure proposed by Bloom in 1970 • Designed for membership test, i.e. to test whether an element exists in a set • Fast and compact • Chance of false positive, i.e. an element not in the set may be wrongly identified • No false negative, i.e. an element in the set must be identified correctly

How a Bloom Filter Works • A bit array, initially all zeros • k hash functions (1, 2, 3, ..., k)

How a Bloom Filter Works: Insertion • Hash the element x using the k hash functions to get k indices in the bit array • Set those bits to 1

How a Bloom Filter Works: Lookup • Hash the element x using the k hash functions • If all corresponding bits are 1, report it as in the set (possibly a false positive) • If any bit is 0, it is definitely not in the set

How a Bloom Filter Works: Deletion • Sorry, no deletion • You cannot tell whether a bit is also used by other elements, so you cannot simply clear it
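The insertion and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the class name `BloomFilter` and the way k indices are derived (salting SHA-256 with the hash-function number) are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # bit array, all zeros initially

    def _indices(self, item: str):
        # Assumption: simulate k hash functions by salting SHA-256
        # with the function number i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        # Set the k bits selected by the hash functions.
        for idx in self._indices(item):
            self.bits[idx] = 1

    def lookup(self, item: str) -> bool:
        # True means "possibly in the set" (false positives can occur);
        # False means "definitely not in the set".
        return all(self.bits[idx] for idx in self._indices(item))


bf = BloomFilter(m=1024, k=3)
bf.insert("flow-a")
print(bf.lookup("flow-a"))  # an inserted element is always found
```

There is deliberately no `delete` method: clearing bits could remove other elements that share them.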

Counting Bloom Filter • Replace each bit with a counter • For insertion, increment the counters • For deletion, decrement the counters • Problems: more space, counter overflow
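A counting Bloom filter can be sketched as follows. The names and the overflow policy are assumptions for illustration: here a counter that reaches `max_count` is treated as saturated and is never changed again, which is one common way to handle overflow, not necessarily the one used in the paper.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter sketch with saturating counters."""

    def __init__(self, m: int, k: int, max_count: int = 15) -> None:
        self.m = m
        self.k = k
        self.max_count = max_count  # e.g. 15 for 4-bit counters (assumed)
        self.counters = [0] * m

    def _indices(self, item: str):
        # Assumption: k hash functions simulated by salting SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for idx in self._indices(item):
            if self.counters[idx] < self.max_count:
                self.counters[idx] += 1

    def delete(self, item: str) -> None:
        # Only meaningful for items that were actually inserted.
        # Saturated counters are left alone because their true value is unknown.
        for idx in self._indices(item):
            if 0 < self.counters[idx] < self.max_count:
                self.counters[idx] -= 1

    def lookup(self, item: str) -> bool:
        return all(self.counters[idx] > 0 for idx in self._indices(item))


cbf = CountingBloomFilter(m=1024, k=3)
cbf.insert("flow-a")
cbf.delete("flow-a")
print(cbf.lookup("flow-a"))  # the only element was deleted again
```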

3 Approaches to ACSM • Approaches: 1. Direct Bloom Filter (DBF) 2. Stateful Bloom Filter 3. Fingerprint-compressed Filter (FCF) • Operations to implement: 1. Insert(flow, state) 2. Lookup(flow), returns the state 3. Delete(flow) 4. Update(flow, new_state)

Direct Bloom Filter Approach • Use a counting Bloom filter • 4 operations: Insert – insert the (flow_id, state) pair; Lookup – if the state is not provided, look up every state and return “don’t know” if more than one state is found; Delete – lookup + decrement counters; Update – delete old + insert new • Improvement: use timing-based deletion to handle non-terminated flows
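The four operations of the Direct Bloom Filter approach might look like this in Python. This is a sketch under stated assumptions: the class name `DirectBloomACSM`, the `DONT_KNOW` sentinel, states numbered 1..num_states, and the SHA-256 salting scheme are all hypothetical choices for the example, and counter overflow is ignored.

```python
import hashlib

DONT_KNOW = "DK"  # hypothetical sentinel for the "don't know" answer


class DirectBloomACSM:
    """Sketch: counting Bloom filter keyed on (flow_id, state) pairs."""

    def __init__(self, m: int, k: int, num_states: int) -> None:
        self.m = m
        self.k = k
        self.num_states = num_states
        self.counters = [0] * m

    def _indices(self, flow_id: str, state: int):
        # Assumption: hash the (flow_id, state) pair with k salted hashes.
        for i in range(self.k):
            key = f"{i}:{flow_id}:{state}".encode()
            digest = hashlib.sha256(key).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def _present(self, flow_id: str, state: int) -> bool:
        return all(self.counters[i] > 0 for i in self._indices(flow_id, state))

    def insert(self, flow_id: str, state: int) -> None:
        for i in self._indices(flow_id, state):
            self.counters[i] += 1

    def lookup(self, flow_id: str):
        # The state is not stored anywhere, so every state must be probed.
        found = [s for s in range(1, self.num_states + 1)
                 if self._present(flow_id, s)]
        if not found:
            return None          # flow not in the filter
        if len(found) > 1:
            return DONT_KNOW     # ambiguous: more than one state matches
        return found[0]

    def delete(self, flow_id: str, state: int) -> None:
        # Lookup first, then decrement the counters.
        if self._present(flow_id, state):
            for i in self._indices(flow_id, state):
                self.counters[i] -= 1

    def update(self, flow_id: str, old_state: int, new_state: int) -> None:
        self.delete(flow_id, old_state)
        self.insert(flow_id, new_state)


acsm = DirectBloomACSM(m=4096, k=3, num_states=10)
acsm.insert("flow-a", 1)
acsm.update("flow-a", 1, 2)
print(acsm.lookup("flow-a"))
```

Note that a lookup costs up to `num_states * k` probes, which is the weakness the Stateful Bloom Filter approach addresses.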

Timing-based Deletion • Add a timing bit to each cell • Set the bit if the cell is touched • Periodically clear untouched cells and reset the timing bits • Alternative to DBF: use a standard Bloom filter instead of a counting one, and delete elements only by timing-based deletion
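The timing-bit idea can be sketched on top of a standard (non-counting) Bloom filter, matching the alternative in the last bullet. The names `TimedBloomFilter` and `sweep` are hypothetical, and treating a successful lookup as a "touch" is an assumption of this sketch.

```python
import hashlib


class TimedBloomFilter:
    """Sketch: standard Bloom filter whose cells expire if left untouched."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m
        self.touched = [0] * m  # one timing bit per cell

    def _indices(self, item: str):
        # Assumption: k hash functions simulated by salting SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for idx in self._indices(item):
            self.bits[idx] = 1
            self.touched[idx] = 1

    def lookup(self, item: str) -> bool:
        idxs = list(self._indices(item))
        hit = all(self.bits[i] for i in idxs)
        if hit:
            # Assumption: a successful lookup counts as touching the cells.
            for i in idxs:
                self.touched[i] = 1
        return hit

    def sweep(self) -> None:
        # Run periodically: clear cells not touched since the last sweep,
        # then reset every timing bit.
        for i in range(self.m):
            if not self.touched[i]:
                self.bits[i] = 0
            self.touched[i] = 0


tbf = TimedBloomFilter(m=1024, k=3)
tbf.insert("flow-a")
tbf.sweep()   # survives: touched by the insert
tbf.sweep()   # expires: not touched since the previous sweep
print(tbf.lookup("flow-a"))
```

A flow that stays active keeps refreshing its timing bits, while a flow that never terminates cleanly disappears after at most two sweeps.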

Stateful Bloom Filter Approach • The Direct Bloom Filter doesn’t store the state of a flow, so a lookup has to try every state • Improvement: add a state value to each cell for faster lookup • Hash the flow_id only, instead of the (flow_id, state) pair • Introduce a “don’t know” (DK) state for when a collision occurs • Keep timing-based deletion

Stateful Bloom Filter Approach • Insert, modify, delete – similar to the Direct Bloom Filter; set the cell value to DK on collision (counter > 1) • Lookup: if all cells are DK, return DK; if all cells are either state i or DK, return state i; if more than one state other than DK is found, return “not found”
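The lookup rule above can be written as a small function, shown here with a minimal stateful filter around it. This is one reading of the slide, not a definitive implementation: it assumes a cell turns into DK only when a different state collides with it, and the names `StatefulBloomFilter`, `resolve_state`, and `DK` are hypothetical.

```python
import hashlib

DK = "DK"  # hypothetical sentinel for the "don't know" state


def resolve_state(values):
    """Apply the lookup rule to the k cell values of one flow."""
    non_dk = {v for v in values if v != DK}
    if not non_dk:
        return DK                    # all cells are DK
    if len(non_dk) == 1:
        return next(iter(non_dk))    # all cells are state i or DK
    return None                      # conflicting states: "not found"


class StatefulBloomFilter:
    """Sketch: each cell stores a state value and a counter."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.values = [None] * m
        self.counts = [0] * m

    def _indices(self, flow_id: str):
        # Only the flow_id is hashed, not the (flow_id, state) pair.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{flow_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, flow_id: str, state) -> None:
        for idx in self._indices(flow_id):
            if self.counts[idx] == 0:
                self.values[idx] = state
            elif self.values[idx] != state:
                self.values[idx] = DK   # assumed collision rule
            self.counts[idx] += 1

    def lookup(self, flow_id: str):
        idxs = list(self._indices(flow_id))
        if any(self.counts[i] == 0 for i in idxs):
            return None                 # an empty cell: flow was never inserted
        return resolve_state([self.values[i] for i in idxs])

    def delete(self, flow_id: str) -> None:
        for idx in self._indices(flow_id):
            if self.counts[idx] > 0:
                self.counts[idx] -= 1
                if self.counts[idx] == 0:
                    self.values[idx] = None


sbf = StatefulBloomFilter(m=4096, k=3)
sbf.insert("flow-a", 2)
print(sbf.lookup("flow-a"))  # 2
```

A lookup now needs only k probes, regardless of the number of states.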

Fingerprint-compressed Filter Approach • Store a fingerprint of the flow + its state in a d-left hash table [Figure: d hash tables (1, 2, ..., d); each cell holds a fingerprint, e.g. 0110111010, and a state]

Fingerprint-compressed Filter Approach • Insert – hash the element, find the corresponding bucket in each hash table, and insert the fingerprint + state into the bucket with the fewest elements (choose the left-most one to break ties) • Lookup – retrieve the state stored with the fingerprint • Delete – remove the fingerprint • Update – update in place, or remove old + add new • Return DK when a fingerprint is found in multiple buckets • Timing-based deletion can still be applied
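The operations above can be sketched with a toy d-left table. Several simplifications are assumptions of this example and not taken from the paper: buckets are Python dicts with unbounded capacity, the fingerprint and the d bucket indices come from independently salted SHA-256 hashes, and the names `FingerprintFilter` and `DONT_KNOW` are hypothetical.

```python
import hashlib

DONT_KNOW = "DK"  # hypothetical sentinel


class FingerprintFilter:
    """Sketch: d sub-tables; each bucket maps fingerprint -> state."""

    def __init__(self, d: int = 4, buckets: int = 64, fp_bits: int = 10) -> None:
        self.d = d
        self.buckets = buckets
        self.fp_bits = fp_bits
        self.tables = [[{} for _ in range(buckets)] for _ in range(d)]

    def _hash(self, salt, flow_id: str) -> int:
        digest = hashlib.sha256(f"{salt}:{flow_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def _fingerprint(self, flow_id: str) -> int:
        return self._hash("fp", flow_id) % (1 << self.fp_bits)

    def _candidates(self, flow_id: str):
        # One candidate bucket per sub-table, ordered left to right.
        return [self.tables[t][self._hash(t, flow_id) % self.buckets]
                for t in range(self.d)]

    def insert(self, flow_id: str, state) -> None:
        fp = self._fingerprint(flow_id)
        # min() returns the first minimum, i.e. the left-most bucket on ties.
        target = min(self._candidates(flow_id), key=len)
        target[fp] = state

    def lookup(self, flow_id: str):
        fp = self._fingerprint(flow_id)
        hits = [b[fp] for b in self._candidates(flow_id) if fp in b]
        if not hits:
            return None
        if len(hits) > 1:
            return DONT_KNOW  # fingerprint found in multiple buckets
        return hits[0]

    def update(self, flow_id: str, new_state) -> bool:
        fp = self._fingerprint(flow_id)
        for bucket in self._candidates(flow_id):
            if fp in bucket:
                bucket[fp] = new_state  # direct update in place
                return True
        return False

    def delete(self, flow_id: str) -> bool:
        fp = self._fingerprint(flow_id)
        for bucket in self._candidates(flow_id):
            if fp in bucket:
                del bucket[fp]
                return True
        return False


fcf = FingerprintFilter()
fcf.insert("flow-a", 1)
fcf.update("flow-a", 2)
print(fcf.lookup("flow-a"))  # 2
```

A hardware-oriented version would use fixed-size buckets and pack each fingerprint and state into a few bits, which is where the memory saving comes from.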

Simulation • To investigate the size/accuracy trade-off for the 3 approaches • State machine: 10 states • Legal state changes: 1 → 2 → 3 → … → 10 • Run for 1 million flows • About 60000 simultaneous flows • 100 ± 40 packets for each flow • Some packets trigger state change

Simulation • 3 kinds of simulation flows • Interesting flows (30%) – flows with legal state changes only, always complete • Noise flows (30%) – flows with random (can be legal or illegal) state changes, never complete • Random flows (40%) – flows without state change

Simulation • False positive rate: % of completed flows that are not interesting flows • False negative rate: % of interesting flows that do not complete
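The two metrics can be computed as below. The function names and the counts in the usage lines are hypothetical and chosen only to show the arithmetic; they are not results from the paper.

```python
def false_positive_rate(completed_total: int, completed_interesting: int) -> float:
    """Percentage of flows reported complete that were not interesting."""
    if completed_total == 0:
        return 0.0
    return 100.0 * (completed_total - completed_interesting) / completed_total


def false_negative_rate(interesting_total: int, completed_interesting: int) -> float:
    """Percentage of interesting flows that were not reported complete."""
    if interesting_total == 0:
        return 0.0
    return 100.0 * (interesting_total - completed_interesting) / interesting_total


# Hypothetical counts, for illustration only.
print(false_positive_rate(completed_total=1000, completed_interesting=990))
print(false_negative_rate(interesting_total=300000, completed_interesting=297000))
```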

Applications • ACSMs fit application-level QoS: • Video congestion control • Peer-to-Peer (P2P) traffic identification

Video congestion control • Applies to MPEG video streaming • 3 kinds of frames in MPEG video: I frame – scene information; P frame – differential information; B frame – least important information • Up to 30% of B frames can be dropped with acceptable quality • Need to keep track of the current frame

Video congestion control • Use FCF ACSM to keep track of state • Experimentally the highest false positive rate acceptable is 0.37% • This requires a memory size of 27 bits per flow (about ¼ compared to original 100 bits)

P2P Traffic Identification • Limit P2P flows to improve quality for other applications • One possible way to identify a P2P flow: concurrent TCP and UDP flows • Use ACSM for real-time P2P identification

Conclusion • ACSMs are feasible • The FCF approach performs best of the three • Two potential applications of ACSM are introduced • ACSM may benefit QoS applications, which are fault-tolerant

Comments • Authors focus on accuracy and memory size, but not real performance • FCF approach may not perform well on hardware

- End - Question & Answer
