Explore how the share of fish in a catch (a sample) can be used to infer that the same species make up a similar share of the entire sea. The method is applied to analyze Internet traffic features using empirical distributions.
Statistical Reconstruction of Largest Contributors to Network Traffic (Fisherman's Dilemma)
VALERY KANEVSKY, Agilent Laboratories
Fisherman's Dilemma
How does this catch represent the most numerous species in the sea? (Pike, Trout, Salmon)
Agilent Technologies
Packet Samples
[Figure: packets grouped into Sample 1, Sample 2, …, arranged by Destination]
Fisherman's Formulation: If a certain % of the fish he catches (samples) belong to a given set of species, then how likely is it that fish from the same set of species constitute "almost" the same % of the entire sea?
Mathematical Formulation
Let S be a finite or enumerable set of features/characteristics of Internet traffic and F an a priori probability distribution over S. Let n be the size of a sample from S (drawn with replacement) and s the set of distinct features observed in it. Let u be a subset of s and Fn(u) the empirical distribution, i.e., the fraction of the sample with features in u. If u is a subset of high contributors observed in a sample, i.e., Fn(u) ≥ a + ε (0 < a < 1, ε > 0), then how likely is it that F(u) > a? That is, what is the confidence of the inference Fn(u) ≥ a + ε ⇒ F(u) > a? In other words, what is the probability P(F(u) > a), and how does it depend on the sample size n, the contribution level a, and the error margin ε?
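The empirical fraction Fn(u) and the inference above can be sketched in a few lines of Python. The feature names, sample values, and thresholds below are hypothetical illustrations, not data from the talk:

```python
from collections import Counter

def empirical_fraction(sample, u):
    """F_n(u): fraction of the sample whose feature lies in the set u."""
    counts = Counter(sample)
    return sum(counts[f] for f in u) / len(sample)

# Hypothetical sample of destination features drawn from S
sample = ["d1", "d2", "d1", "d3", "d1", "d2", "d1", "d4", "d1", "d2"]
u = {"d1", "d2"}      # candidate set of high contributors
a, eps = 0.5, 0.1     # contribution level a and error margin epsilon

fn_u = empirical_fraction(sample, u)
# The inference under study: Fn(u) >= a + eps  =>  F(u) > a
if fn_u >= a + eps:
    print(f"Fn(u) = {fn_u:.2f} >= {a + eps:.2f}: infer F(u) > {a}")
```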
What's the difference?
Classical statistics: Given some a priori assumption about the underlying distribution and a sample, estimate the probability of an event E (e.g., E = {traffic related to a given set of features w that constitutes at least a% of the total}) along with a confidence interval and corresponding confidence level.
Current context: The set of features w whose corresponding traffic constitutes at least a% of the total depends on the random sample, as opposed to being fixed as in the classical case.
We do not estimate the "true" fraction F(u) of the traffic related to a given set u of features! Neither do we estimate the probability P(F(u) > a) that the true fraction F(u), for a given u, contributes at level a, since the latter is either 1 or 0. We estimate the confidence in the inference Fn(u) ≥ a + ε ⇒ F(u) > a.
Statistical Game
Every sample of size n yields a set of contributors at a certain level a + ε. After N such samplings we generate a collection: set1, …, setN. We want these sets to be contributors at level a 99% of the time. For a given ε, the goal is to find a sample size n that makes the previous assertion true.
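The statistical game can be simulated directly: repeatedly sample from a known distribution, build the observed contributor set at level a + ε, and count how often the true fraction of that set exceeds a. The distribution and parameters below are illustrative assumptions, not values from the talk:

```python
import random

# A hypothetical "true" distribution F over features 0..6
features = list(range(7))
probs = [0.20, 0.15, 0.10, 0.15, 0.30, 0.05, 0.05]

def play_round(n, a, eps, rng):
    """One round of the game: draw a sample of size n, greedily take the
    most frequent observed features until their empirical share reaches
    a + eps, then check whether that set truly contributes above a."""
    sample = rng.choices(features, weights=probs, k=n)
    counts = sorted(((sample.count(f), f) for f in set(sample)), reverse=True)
    u, share = [], 0.0
    for c, f in counts:
        u.append(f)
        share += c / n
        if share >= a + eps:
            break
    true_share = sum(probs[f] for f in u)
    return true_share > a

rng = random.Random(0)
rounds = 200
hits = sum(play_round(500, a=0.6, eps=0.05, rng=rng) for _ in range(rounds))
print(f"Inference held in {hits}/{rounds} rounds")
```

With a sample size this large relative to the distribution, the inference should hold in nearly every round.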
Test underlying assumptions
We cannot test the theorem itself, provided it is correct, but we can test the underlying assumptions by looking at actual packet traffic data records and finding out how often the inference Fn(u) ≥ a + ε ⇒ F(u) > a holds, compared with the value guaranteed by the theorem.

LBL-TCP-3
Description: This trace contains two hours' worth of all wide-area TCP traffic between the Lawrence Berkeley Laboratory and the rest of the world.
Format: The trace was reduced from tcpdump format to ASCII using the sanitize-tcp and sanitize-syn-fin scripts. The first script was used to produce lbl-tcp-3.tcp, which has six columns: timestamp, (renumbered) source host, (renumbered) destination host, source TCP port, destination TCP port, and number of data bytes (zero for "pure-ack" packets). The second script generated lbl-tcp-3.sf, which includes the same first five columns, plus TCP flags (SYN/FIN/RST/PSH etc.), sequence number, and acknowledgement number (0 for the initial SYN).
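A minimal parser for the documented six-column lbl-tcp-3.tcp line format might look like this (the example record is fabricated for illustration; only the column layout comes from the trace description):

```python
def parse_lbl_tcp_line(line):
    """Parse one line of the lbl-tcp-3.tcp six-column ASCII format:
    timestamp, (renumbered) source host, (renumbered) destination host,
    source TCP port, destination TCP port, data bytes (0 for pure acks)."""
    ts, src, dst, sport, dport, nbytes = line.split()
    return {
        "timestamp": float(ts),
        "src": int(src), "dst": int(dst),
        "sport": int(sport), "dport": int(dport),
        "bytes": int(nbytes),
    }

# Hypothetical record in the documented format
rec = parse_lbl_tcp_line("0.010445 1 2 1023 25 512")
print(rec["dst"], rec["bytes"])
```

Grouping such records by destination host yields exactly the feature samples the fisherman's formulation works with.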
What do we need?
We need a uniform estimate for the confidence of the inference Fn(u) ≥ a + ε ⇒ F(u) > a over a collection of subsets of features u that may appear as subsets of "large contributors".
Warning! All else being equal, the larger the collection of subsets, the lower the confidence level may be.
How to select a Collection?
A collection has to be:
"Well defined"
"Tractable" ("easy to compute")
As small as possible
Minimal subsets
One candidate is the collection of all subsets of features found in a sample. Though obvious, this choice may not be terribly good. Can we do better?
Definition: Given a contribution level a%, a subset of features u is called minimal if there is no other subset of s of smaller cardinality that contributes to the same level.
For a given set of features there can be more than one minimal subset present in s, though their number, generally speaking, shrinks as a% approaches 100%.
An answer
Let p1, …, pk, … be an ordered a priori distribution of features. For a given sample size n, k is defined as the smallest solution of the inequality

p1 + … + pk > 1 − 1/n

Average number of minimal subsets: EL < Constant · 2^k / k^(1/2), where Constant ≈ 2.36.

P(F(u) > a) > 1 − 2 · EL · e^(−2ε²n)

Can e^(−2ε²n) offset the growing term 2^k / k^(1/2)? It depends on how fast k grows with n.
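The defining inequality for k can be solved by a simple scan over the sorted a priori probabilities. The geometric distribution below is an illustrative assumption, chosen because its cumulative sums have a closed form (p1 + … + pk = 1 − 2^(−k)):

```python
def smallest_k(probs, n):
    """Smallest k such that p1 + ... + pk > 1 - 1/n, where probs is the
    a priori feature distribution sorted in decreasing order."""
    target = 1.0 - 1.0 / n
    total = 0.0
    for k, p in enumerate(probs, start=1):
        total += p
        if total > target:
            return k
    return len(probs)

# Illustrative geometric distribution p_i = (1/2)^i
probs = [0.5 ** i for i in range(1, 40)]
print(smallest_k(probs, 100))   # need 1 - 2^-k > 0.99  =>  k = 7
```

Here k grows only logarithmically in n, which is the favorable regime for the bound above.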
Examples and Analysis

Destination   Sample 1   Sample 2
1             20%        0%
2             15%        25%
3             10%        5%
4             15%        15%
5             30%        35%
6             5%         5%
7             5%         15%

Contributors to 90%:  Sample 1: (1,2,3,4,5)   Sample 2: (5,2,4,7)
Contributors to 60%:  Sample 1: (5,1,3), (5,2,4), (5,1,2)   Sample 2: (5,2)
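Minimal contributor subsets for the table above can be enumerated by brute force, checking all subsets of increasing cardinality until some reach the level (shares are kept in integer percent to avoid floating-point edge cases at the threshold):

```python
from itertools import combinations

def minimal_contributors(shares, a):
    """Return all minimal subsets: subsets of the smallest cardinality
    whose shares sum to at least the contribution level a."""
    feats = list(shares)
    for size in range(1, len(feats) + 1):
        found = [set(c) for c in combinations(feats, size)
                 if sum(shares[f] for f in c) >= a]
        if found:
            return found
    return []

# Shares from the table, in integer percent
sample1 = {1: 20, 2: 15, 3: 10, 4: 15, 5: 30, 6: 5, 7: 5}
sample2 = {1: 0, 2: 25, 3: 5, 4: 15, 5: 35, 6: 5, 7: 15}

print(minimal_contributors(sample1, 60))
print(minimal_contributors(sample2, 60))
```

Note the enumeration returns every subset of minimal cardinality reaching the level, so for Sample 1 at 60% it also surfaces (5,1,4), beyond the sets shown on the slide.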
Exponential distribution
Cumulative distribution function: F(l) = 1 − λ^l (0 < λ < 1)
Confidence level: 99%; error margin: ε = 5%
k(n) ≈ ln(n)/ln(1/λ) + 1
EL < e^(13/12) · (2/π)^(1/2) · (ln(1/λ))^(1/2) · n^(−ln 2/ln λ) / (ln(n))^(1/2)
Example: when λ = 1/2, n ≈ 5150
Confidence: exponential
Power Law
A qualitatively different result follows if the tail of the distribution is "heavier" than exponential, e.g., obeys a power law: F(l) = 1 − l^(−γ). In this case k(n) = n^(1/γ), and EL = O(2^(n^(1/γ)) / n^(1/2)) for an arbitrary value of a when γ > 1. For instance, if γ = 2, under the same conditions as in the previous case, n ≈ 79,000.
Confidence: power law
Weibull
Distribution: F(x) = 1 − e^(−x^c). In this case k = (ln(n))^(1/c). To achieve the same 99% confidence level with c = 0.3, n should be around 2,083,500. The remarkable increase in sample size is due to the fact that although the tail of the Weibull distribution goes to zero faster than, say, a power law with γ = 2, the decay only kicks in for large values of n: e^(−n^0.3) becomes smaller than 1/n² only when n is about 22,000.
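How fast k(n) grows under each tail model is what determines whether e^(−2ε²n) can offset 2^k. A quick numerical comparison under the stated parameters (λ = 1/2, γ = 2, c = 0.3), including a check of the crossover claim for the Weibull tail:

```python
import math

# k(n): number of features needed to cover all but 1/n of the mass,
# under the three tail models discussed in the slides.
def k_exponential(n, lam=0.5):
    return math.log(n) / math.log(1 / lam) + 1

def k_power_law(n, gamma=2):
    return n ** (1 / gamma)

def k_weibull(n, c=0.3):
    return math.log(n) ** (1 / c)

for n in (5_150, 79_000, 2_083_500):
    print(f"n={n}: k_exp={k_exponential(n):.1f}, "
          f"k_pow={k_power_law(n):.1f}, k_wei={k_weibull(n):.1f}")

# Crossover claim: exp(-n^0.3) drops below 1/n^2 only around n = 22,000
print(math.exp(-22_000 ** 0.3) < 1 / 22_000 ** 2)
```

Logarithmic growth of k (exponential tail) keeps the bound cheap; polynomial growth (power law) and the slow-starting Weibull decay push the required n up by orders of magnitude.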
Confidence: Weibull
Acknowledgments: Andrei Broido, Jim Davis, Sergey Nagaev, Graham Pollock, Joe Sventek, Lance Tatman