Space-efficient Tracking of Persistent Items in a Massive Data Stream. Bibudh Lahiri and Srikanta Tirthapura, Electrical & Computer Engg., Iowa State University. Jaideep Chandrashekar, Technicolor Labs, Palo Alto. ACM DEBS 2011.
Temporal Persistence: A Not-so-Discussed Problem in Data Streams • Motivation from security, formulated as a problem on streams • Botnets, port scans, click fraud • Appear in a temporally regular manner • Do the damage, yet stay under the radar • Not necessarily in large volume (stealthy) • Heavy-hitter algorithms do not work
State of the Art in Data Stream Research • Frequency moments, heavy hitters, entropy, variance • Enough to know how many times i ∈ {1, …, m} occurs in the stream, for all i • Persistence: When does i occur in the stream? In how many slots, in total?
Persistent Behavior in Botnet Traffic • Giroire et al. [1] • Consecutive connections to the same destination are often separated by an hour or more • Most bots occur in 100% of slots in a window when slot length (s) = 1 hr • MyBot-8926 occurs in 100% of slots when s = 16 hrs! • Li et al. [2] • Periodic botnet events about every ½ hr • [1] “Exploiting Temporal Persistence to Detect Covert Botnet Channels”, RAID 2009 • [2] “Automating Analysis of Large-scale Botnet Probing Events”, ASIACCS 2009
Problem Definition • Time is split into slots 1, 2, …, n of equal length • Stream S = {(d_i, t_i)}; d_i: item ID, t_i ∈ {1, 2, …, n} • Window S_{l,r} over [l, r] = {(d_i, t_i) ∈ S | l ≤ t_i ≤ r} • p_d(l, r) = persistence of d in S_{l,r} = number of distinct slots in [l, r] in which d appears • Example (slots 1 to 7): slot 1: a, d, b; slot 2: c, d, e; slot 3: a, c, d, b; slot 4: a, b, a, c; slot 5: b, c; slot 6: a, b, b; slot 7: c, c, d, c • p_a(4,7) = 2, p_b(4,7) = 3, p_c(4,7) = 3, p_d(4,7) = 1
Problem Definition • Item d is α-persistent in S_{l,r} if it appears in at least α(r−l+1) slots • With α = 0.5, a, b and c are α-persistent in [4,7] (p_a(4,7) = 2, p_b(4,7) = 3, p_c(4,7) = 3, each ≥ 0.5 · 4 = 2); d is not (p_d(4,7) = 1) • Goal: detect α-persistent items
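The definition above can be checked directly with a naive exact tracker that stores every distinct (item, slot) pair. The sketch below is only an illustration: the function names and the `SLOTS` encoding are assumptions, with slot contents read off the example on the slide.

```python
from collections import defaultdict

def persistence(stream, l, r):
    """Exact p_d(l, r): number of distinct slots in [l, r] where each item d appears."""
    slots_seen = defaultdict(set)
    for d, t in stream:
        if l <= t <= r:
            slots_seen[d].add(t)  # repeats within a slot collapse in the set
    return {d: len(slots) for d, slots in slots_seen.items()}

def alpha_persistent(stream, l, r, alpha):
    """Items appearing in at least alpha * (r - l + 1) slots of [l, r]."""
    threshold = alpha * (r - l + 1)
    return {d for d, p in persistence(stream, l, r).items() if p >= threshold}

# Slot contents as read from the slide's example (slots 1 to 7).
SLOTS = {1: "adb", 2: "cde", 3: "acdb", 4: "abac", 5: "bc", 6: "abb", 7: "ccdc"}
STREAM = [(d, t) for t, items in SLOTS.items() for d in items]

print(persistence(STREAM, 4, 7))
print(alpha_persistent(STREAM, 4, 7, 0.5))
```

This keeps one set per distinct item, which is the kind of per-item state the small-space algorithms in the talk aim to avoid.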
Our Contributions • Lower bound: exact tracking needs Ω(|D| · log nα) space • Approximate tracking: • Detect items with p_d ≥ (α−ε)n with high probability • Items with p_d < (α−ε)n are not reported as persistent
Our Contributions • First algorithm for this problem with any provable guarantee • Small-space algorithm • Space complexity O(1/ε) for Zipfian distributions • Up to 85% less physical memory than the naïve algorithm • Typical FPR < 1%, FNR < 4%
Talk Organization • Introduction • Fixed-window algorithm • Sliding-window algorithm • Evaluation
Approximate Tracking • Detect items with p_d ≥ (α−ε)n with high probability • Do not report items with p_d < (α−ε)n • Fixed window: p_d computed over slots [1, n]
Intuition: Fixed-Window Algorithm • “Sample and count” • Sample a random element in the stream • Once sampled, count occurrences of the item exactly • Persistence: count only one occurrence per slot • Sampling method • Send every (d,t) through a hash-based filter • (d,t) passes when its hash value h(d,t) falls below a small threshold; chance of passing << 1 (in fact, 2/(εn))
Intuition: Fixed-Window Algorithm • Same d, same t: h(d,t) remains the same • Re-occurrences in the same slot do not help • Same d, different t: the h(d,t)’s are independent • Tuple (d, t_d, n_d) is initialized when (d,t) first passes the filter • Persistent item: enough chances to cross the filter • Transient item: fewer chances
Intuition: Fixed-Window Algorithm • [Figure: worked example over slots 1 to 4, with filter threshold 1/2. For each arriving (d,t): if d ∈ S and t_d < t, its tuple is updated, e.g. (c,2,1) → (c,4,2) and (a,3,1) → (a,4,2); if d ∉ S, a tuple is created only when h(d,t) < 1/2, e.g. (c,2) and (a,3) pass, while (a,1), (b,1) and (f,3) do not.]
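The “sample and count” idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hash `h` (SHA-256 mapped to [0, 1)), the class and method names, and the reporting rule n_d ≥ (α−ε)n are all hypothetical choices, and slots are assumed to arrive in non-decreasing order.

```python
import hashlib

def h(d, t, seed=0):
    """Hypothetical hash filter: maps (d, t) deterministically to [0, 1)."""
    digest = hashlib.sha256(f"{seed}|{d}|{t}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class FixedWindowSketch:
    def __init__(self, n, alpha, eps, seed=0):
        self.n, self.alpha, self.eps, self.seed = n, alpha, eps, seed
        self.tau = min(1.0, 2.0 / (eps * n))  # filter pass probability 2/(eps*n)
        self.sketch = {}                      # d -> [t_d, n_d]

    def observe(self, d, t):
        entry = self.sketch.get(d)
        if entry is not None:
            # d is already tracked: count at most one occurrence per slot.
            if entry[0] < t:
                entry[0] = t
                entry[1] += 1
        elif h(d, t, self.seed) < self.tau:
            # (d, t) passed the filter: start tracking d from slot t.
            self.sketch[d] = [t, 1]

    def report(self):
        # Assumed reporting rule: n_d never exceeds p_d, so an item with
        # p_d < (alpha - eps) * n can never reach this cutoff.
        cutoff = (self.alpha - self.eps) * self.n
        return {d for d, (_, n_d) in self.sketch.items() if n_d >= cutoff}

sk = FixedWindowSketch(n=100, alpha=0.7, eps=0.2)
for t in range(1, 101):
    sk.observe("persistent", t)       # appears in every slot
    if t % 25 == 0:
        sk.observe("transient", t)    # appears in only 4 slots
print(sk.report())
```

Because `h` depends only on (d, t), repeating an item within a slot gives it no extra chance of being sampled, while an item that appears in many slots gets many independent chances.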
Performance: Fixed-Window Algorithm • False negatives: p_d ≥ αn ⇒ Pr(reported transient) ≤ e^{-2} ≈ 13.5% • Drops to δ with O(log(1/δ)) parallel instances • p_d < (α−ε)n ⇒ d is never reported as persistent • Space = O(P · log(1/δ)/(εn)), where P = ∑_{d ∈ D(S)} p_d • Reduces to O(1/ε) for Zipfian distributions • Processing time per element: O(log(1/δ))
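One way to see where the e^{-2} bound and the O(log(1/δ)) instances could come from is sketched below. It rests on assumptions that the slides do not spell out: each (d,t) passes the filter independently with probability 2/(εn), εn is an integer ≥ 2, and d is reported when its counter satisfies n_d ≥ (α−ε)n. Under these assumptions, an item with p_d ≥ αn that passes the filter in its j-th distinct slot, j ≤ εn, ends with n_d = p_d − j + 1 > (α−ε)n, so it can only be missed if it fails the filter in all of its first εn slots.

```latex
% Assumed: pass probability 2/(\varepsilon n) per (d,t), independent across slots;
% report rule n_d \ge (\alpha - \varepsilon) n; k independent parallel instances.
\begin{align*}
\Pr[\text{$d$ missed}]
  &\le \Pr[\text{$d$ fails the filter in each of its first $\varepsilon n$ slots}] \\
  &= \left(1 - \frac{2}{\varepsilon n}\right)^{\varepsilon n}
   \le e^{-2} \approx 0.135, \\
\Pr[\text{$d$ missed by all $k$ instances}]
  &\le e^{-2k} \le \delta
  \quad \text{whenever } k \ge \tfrac{1}{2}\ln\frac{1}{\delta}.
\end{align*}
```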
Sliding-Window Algorithm • p_d^c: persistence of d in [c−n+1, c] • Detect items with p_d^c ≥ (α−ε)n with high probability • Do not report items with p_d^c < (α−ε)n • Intuition • Start a new fixed-window data structure S_t in every distinct slot t where d occurs • Won’t that take too much space? • No…
Intuition: Sliding-Window Algorithm • Observations • Only in a few slots will d pass the filter and initialize S_t • In [c−n+1, …, j, …, c], if d first passes the filter in slot j, then S_j represents p_d^c most accurately • Note: we save the space for S_{c−n+1}, S_{c−n+2}, …, S_{j−1} • At c, we can discard any S_r where r ≤ c−n • Sketch is {(d, t, n_{d,t}, t_{d,t})} • t: slot when initialized; n_{d,t}: how many slots counted; t_{d,t}: most recent slot
Intuition: Sliding-Window Algorithm • [Figure: worked example over slots 1 to 4, with filter threshold 1/2. Each (d,t) that passes the filter starts a new tuple (d, t, 1, t), and every existing tuple for d is updated once per slot, e.g. (c,2,1,2) → (c,2,2,3), (c,3,1,3) → (c,3,2,4), (a,2,1,2) → (a,2,2,3), (a,3,1,3) → (a,3,2,4); (a,4) starts (a,4,1,4), while (a,1), (b,1) and (f,3) do not pass.]
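The sliding-window bookkeeping can be sketched along the same lines. As before, this is a hedged illustration rather than the authors' code: the hash `h`, the class and method names, the reporting rule, and the choice to answer a query from the earliest in-window tuple are assumptions made for the example, and `c` is assumed to be the current (latest) slot.

```python
import hashlib

def h(d, t, seed=0):
    """Hypothetical hash filter: maps (d, t) deterministically to [0, 1)."""
    digest = hashlib.sha256(f"{seed}|{d}|{t}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class SlidingWindowSketch:
    def __init__(self, n, alpha, eps, seed=0):
        self.n, self.alpha, self.eps, self.seed = n, alpha, eps, seed
        self.tau = min(1.0, 2.0 / (eps * n))
        self.tuples = {}  # d -> {start slot t: [n_{d,t}, t_{d,t}]}

    def observe(self, d, t):
        entries = self.tuples.get(d)
        if entries is not None:
            for rec in entries.values():
                if rec[1] < t:        # count at most once per slot
                    rec[0] += 1
                    rec[1] = t
            if t in entries:          # a tuple already started in this slot
                return
        if h(d, t, self.seed) < self.tau:
            self.tuples.setdefault(d, {})[t] = [1, t]

    def expire(self, c):
        """Discard tuples that started at or before slot c - n."""
        for d in list(self.tuples):
            entries = self.tuples[d]
            for start in [s for s in entries if s <= c - self.n]:
                del entries[start]
            if not entries:
                del self.tuples[d]

    def estimate(self, d, c):
        """Count from the earliest tuple started inside [c - n + 1, c]."""
        starts = [s for s in self.tuples.get(d, {}) if c - self.n < s <= c]
        return self.tuples[d][min(starts)][0] if starts else 0

    def report(self, c):
        cutoff = (self.alpha - self.eps) * self.n  # assumed reporting rule
        return {d for d in self.tuples if self.estimate(d, c) >= cutoff}

sw = SlidingWindowSketch(n=100, alpha=0.7, eps=0.2)
for t in range(1, 151):
    sw.observe("persistent", t)
    sw.expire(t)
print(sw.estimate("persistent", 150), sw.report(150))
```

Only slots in which (d,t) passes the filter create a tuple, so the number of tuples per item stays small relative to the number of slots in which the item appears.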
Evaluation • Typically skewed distribution • 885 million packets, 30-sec slots ⇒ 350 slots in ~3 hrs of data • Query windows: [1,100], [26,125], …, [251,350] • In the [1,100] window, ~570k distinct IPs, but ~500k of them occur in < 10 slots • Storing a counter for every distinct item is a waste of space
Evaluation • FNR is mostly within 5%, even when ε = 0.49 for α = 0.7 • Even the highest FPR is < 3% • Small-space algorithm saves up to 85% space compared to the naïve algorithm • 445 MB instead of 3 GB
Summary • Persistent items: important in their own right • Motivation: botnet detection, port scans • Exact solution needs to store all distinct items • Approximate, small-space solutions for fixed and sliding windows • Asymptotically the same space for both • 70-85% saving in memory for typical values of α (0.5, 0.7) and ε (0.4α to 0.6α)