Explore how the share of fish in a catch (a sample) can be used to infer that the same species make up a similar share of the entire sea. The method is applied to analyze Internet traffic features using empirical distributions.
Statistical Reconstruction of Largest Contributors to Network Traffic (Fisherman's Dilemma)
VALERY KANEVSKY, Agilent Laboratories
Fisherman's Dilemma
How does this catch represent the most numerous species in the sea? (Pike, Trout, Salmon)
Agilent Technologies
Packet Samples
[Figure: packets grouped into Sample 1, Sample 2, …, arranged by Destination]
Fisherman's Formulation: If a certain % of the fish he catches (samples) belong to a given set of species, then how likely is it that fish from the same set of species constitute "almost" the same % of the entire sea?
Mathematical Formulation
Let S be a finite or enumerable set of features/characteristics of Internet traffic and F an a priori probability distribution over S. Let n be the size of a sample from S (drawn with replacement) and s the set of distinct features observed in it. Let u be a subset of s and Fn(u) the empirical distribution, i.e., the fraction of the sample with features in u. If u is a subset of high contributors observed in a sample, i.e., Fn(u) ≥ a + ε (0 < a < 1, ε > 0), then how likely is it that F(u) > a? That is, what is the confidence of the inference Fn(u) ≥ a + ε ⇒ F(u) > a? In other words, what is the probability P(F(u) > a), and how does it depend on the sample size n, the contribution level a, and the error margin ε?
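The empirical fraction Fn(u) and the inference above can be sketched in a few lines of Python. The feature names, sample values, and thresholds below are hypothetical illustrations, not data from the talk:

```python
from collections import Counter

def empirical_fraction(sample, u):
    """F_n(u): fraction of the sample whose feature lies in the set u."""
    counts = Counter(sample)
    return sum(counts[f] for f in u) / len(sample)

# Hypothetical sample of destination features drawn from S
sample = ["d1", "d2", "d1", "d3", "d1", "d2", "d1", "d4", "d1", "d2"]
u = {"d1", "d2"}      # candidate set of high contributors
a, eps = 0.5, 0.1     # contribution level a and error margin epsilon

fn_u = empirical_fraction(sample, u)
# The inference under study: Fn(u) >= a + eps  =>  F(u) > a
if fn_u >= a + eps:
    print(f"Fn(u) = {fn_u:.2f} >= {a + eps:.2f}: infer F(u) > {a}")
```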
What's the difference?
Classical statistics: Given some a priori assumption about the underlying distribution and a sample, estimate the probability of an event E (e.g., E = {traffic related to a given set of features w that constitutes at least a% of the total}) along with a confidence interval and corresponding confidence level.
Current context: The set of features w whose corresponding traffic constitutes at least a% of the total depends on the random sample, as opposed to being fixed as in the classical case.
We do not estimate the "true" fraction F(u) of the traffic related to a given set u of features! Neither do we estimate the probability P(F(u) > a) that the true fraction F(u), for a given u, contributes at level a, since the latter is either 1 or 0. We estimate the confidence in the inference Fn(u) ≥ a + ε ⇒ F(u) > a.
Statistical Game
Every sample of size n yields a set of contributors at a certain level a + ε. After N such samplings we generate a collection: set1, …, setN. We want these sets to be contributors at level a 99% of the time. For a given ε, the goal is to find a sample size n that makes the previous assertion true.
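The statistical game can be simulated directly: repeatedly sample from a known distribution, build the observed contributor set at level a + ε, and count how often the true fraction of that set exceeds a. The distribution and parameters below are illustrative assumptions, not values from the talk:

```python
import random

# A hypothetical "true" distribution F over features 0..6
features = list(range(7))
probs = [0.20, 0.15, 0.10, 0.15, 0.30, 0.05, 0.05]

def play_round(n, a, eps, rng):
    """One round of the game: draw a sample of size n, greedily take the
    most frequent observed features until their empirical share reaches
    a + eps, then check whether that set truly contributes above a."""
    sample = rng.choices(features, weights=probs, k=n)
    counts = sorted(((sample.count(f), f) for f in set(sample)), reverse=True)
    u, share = [], 0.0
    for c, f in counts:
        u.append(f)
        share += c / n
        if share >= a + eps:
            break
    true_share = sum(probs[f] for f in u)
    return true_share > a

rng = random.Random(0)
rounds = 200
hits = sum(play_round(500, a=0.6, eps=0.05, rng=rng) for _ in range(rounds))
print(f"Inference held in {hits}/{rounds} rounds")
```

With a sample size this large relative to the distribution, the inference should hold in nearly every round.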
Test underlying assumptions
We cannot test the theorem itself, provided it is correct, but we can test the underlying assumptions by looking at actual packet traffic data records and finding out how often the inference Fn(u) ≥ a + ε ⇒ F(u) > a holds, compared with the value guaranteed by the theorem.

LBL-TCP-3
Description: This trace contains two hours' worth of all wide-area TCP traffic between the Lawrence Berkeley Laboratory and the rest of the world.
Format: The trace was reduced from tcpdump format to ASCII using the sanitize-tcp and sanitize-syn-fin scripts. The first script was used to produce lbl-tcp-3.tcp, which has six columns: timestamp, (renumbered) source host, (renumbered) destination host, source TCP port, destination TCP port, and number of data bytes (zero for "pure-ack" packets). The second script generated lbl-tcp-3.sf, which includes the same first five columns, plus TCP flags (SYN/FIN/RST/PSH etc.), sequence number, and acknowledgement number (0 for the initial SYN).
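A minimal parser for the documented six-column lbl-tcp-3.tcp line format might look like this (the example record is fabricated for illustration; only the column layout comes from the trace description):

```python
def parse_lbl_tcp_line(line):
    """Parse one line of the lbl-tcp-3.tcp six-column ASCII format:
    timestamp, (renumbered) source host, (renumbered) destination host,
    source TCP port, destination TCP port, data bytes (0 for pure acks)."""
    ts, src, dst, sport, dport, nbytes = line.split()
    return {
        "timestamp": float(ts),
        "src": int(src), "dst": int(dst),
        "sport": int(sport), "dport": int(dport),
        "bytes": int(nbytes),
    }

# Hypothetical record in the documented format
rec = parse_lbl_tcp_line("0.010445 1 2 1023 25 512")
print(rec["dst"], rec["bytes"])
```

Grouping such records by destination host yields exactly the feature samples the fisherman's formulation works with.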
What do we need?
We need a uniform estimate for the confidence of the inference Fn(u) ≥ a + ε ⇒ F(u) > a over a collection of subsets of features u that may appear as subsets of "large contributors".
Warning! All else being equal, the larger the collection of subsets, the lower the confidence level may be.
How to select a Collection?
A collection has to be:
"Well defined"
"Tractable" ("easy to compute")
As small as possible
Minimal subsets
One candidate is the collection of all subsets of features found in a sample. Though obvious, this choice may not be terribly good. Can we do better?
Definition: Given a contribution level a%, a subset of features u is called minimal if there is no other subset of s of smaller cardinality that contributes to the same level.
For a given set of features there can be more than one minimal subset present in s, though their number, generally speaking, shrinks as a% approaches 100%.
An answer
Let p1, …, pk, … be an ordered a priori distribution of features. For a given sample size n, k is defined as the smallest solution of the inequality

p1 + … + pk > 1 − 1/n

Average number of minimal subsets: EL < Constant · 2^k / k^(1/2), where Constant ≈ 2.36.

P(F(u) > a) > 1 − 2 · EL · e^(−2ε²n)

Can e^(−2ε²n) offset the growing term 2^k / k^(1/2)? It depends on how fast k grows with n.
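The defining inequality for k can be solved by a simple scan over the sorted a priori probabilities. The geometric distribution below is an illustrative assumption, chosen because its cumulative sums have a closed form (p1 + … + pk = 1 − 2^(−k)):

```python
def smallest_k(probs, n):
    """Smallest k such that p1 + ... + pk > 1 - 1/n, where probs is the
    a priori feature distribution sorted in decreasing order."""
    target = 1.0 - 1.0 / n
    total = 0.0
    for k, p in enumerate(probs, start=1):
        total += p
        if total > target:
            return k
    return len(probs)

# Illustrative geometric distribution p_i = (1/2)^i
probs = [0.5 ** i for i in range(1, 40)]
print(smallest_k(probs, 100))   # need 1 - 2^-k > 0.99  =>  k = 7
```

Here k grows only logarithmically in n, which is the favorable regime for the bound above.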
Examples and Analysis

Destination   Sample 1   Sample 2
1             20%        0%
2             15%        25%
3             10%        5%
4             15%        15%
5             30%        35%
6             5%         5%
7             5%         15%

Contributors to 90%:  Sample 1: (1,2,3,4,5)   Sample 2: (5,2,4,7)
Contributors to 60%:  Sample 1: (5,1,3), (5,2,4), (5,1,2)   Sample 2: (5,2)
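Minimal contributor subsets for the table above can be enumerated by brute force, checking all subsets of increasing cardinality until some reach the level (shares are kept in integer percent to avoid floating-point edge cases at the threshold):

```python
from itertools import combinations

def minimal_contributors(shares, a):
    """Return all minimal subsets: subsets of the smallest cardinality
    whose shares sum to at least the contribution level a."""
    feats = list(shares)
    for size in range(1, len(feats) + 1):
        found = [set(c) for c in combinations(feats, size)
                 if sum(shares[f] for f in c) >= a]
        if found:
            return found
    return []

# Shares from the table, in integer percent
sample1 = {1: 20, 2: 15, 3: 10, 4: 15, 5: 30, 6: 5, 7: 5}
sample2 = {1: 0, 2: 25, 3: 5, 4: 15, 5: 35, 6: 5, 7: 15}

print(minimal_contributors(sample1, 60))
print(minimal_contributors(sample2, 60))
```

Note the enumeration returns every subset of minimal cardinality reaching the level, so for Sample 1 at 60% it also surfaces (5,1,4), beyond the sets shown on the slide.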
Exponential distribution
Cumulative distribution function: F(l) = 1 − λ^l (0 < λ < 1)
Confidence level: 99%; error margin: ε = 5%
k(n) ≈ ln(n)/ln(1/λ) + 1
EL < e^(13/12) · (2/π)^(1/2) · (ln(1/λ))^(1/2) · n^(−ln 2/ln λ) / (ln(n))^(1/2)
Example: when λ = 1/2, n ≈ 5150
Confidence: exponential
Power Law
A qualitatively different result follows if the tail of the distribution is "heavier" than exponential, e.g., obeys a power law: F(l) = 1 − l^(−γ). In this case k(n) = n^(1/γ), and EL = O(2^(n^(1/γ)) / n^(1/2)) for an arbitrary value of a when γ > 1. For instance, if γ = 2, under the same conditions as in the previous case, n ≈ 79,000.
Confidence: power law
Weibull
Distribution: F(x) = 1 − e^(−x^c). In this case k = (ln(n))^(1/c). To achieve the same 99% confidence level with c = 0.3, n should be around 2,083,500. The remarkable increase in sample size is due to the fact that although the tail of the Weibull distribution goes to zero faster than, say, a power law with γ = 2, the decay only kicks in for large values of n: e^(−n^0.3) becomes smaller than 1/n² only when n is about 22,000.
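How fast k(n) grows under each tail model is what determines whether e^(−2ε²n) can offset 2^k. A quick numerical comparison under the stated parameters (λ = 1/2, γ = 2, c = 0.3), including a check of the crossover claim for the Weibull tail:

```python
import math

# k(n): number of features needed to cover all but 1/n of the mass,
# under the three tail models discussed in the slides.
def k_exponential(n, lam=0.5):
    return math.log(n) / math.log(1 / lam) + 1

def k_power_law(n, gamma=2):
    return n ** (1 / gamma)

def k_weibull(n, c=0.3):
    return math.log(n) ** (1 / c)

for n in (5_150, 79_000, 2_083_500):
    print(f"n={n}: k_exp={k_exponential(n):.1f}, "
          f"k_pow={k_power_law(n):.1f}, k_wei={k_weibull(n):.1f}")

# Crossover claim: exp(-n^0.3) drops below 1/n^2 only around n = 22,000
print(math.exp(-22_000 ** 0.3) < 1 / 22_000 ** 2)
```

Logarithmic growth of k (exponential tail) keeps the bound cheap; polynomial growth (power law) and the slow-starting Weibull decay push the required n up by orders of magnitude.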
Confidence: Weibull
Acknowledgments: Andrei Broido, Jim Davis, Sergey Nagaev, Graham Pollock, Joe Sventek, Lance Tatman