An Effective Similarity Metric for Application Traffic Classification. Jae Yoon Chung (1), Byungchul Park (1), Young J. Won (1), John Strassner (2), and James W. Hong (1, 2). {dejavu94, fates, yjwon, johns, jwkhong}@postech.ac.kr. April 20, 2010. (1) Dept. of Computer Science and Engineering, POSTECH, Korea. (2) Division of IT Convergence Engineering, POSTECH, Korea
Contents • Introduction • Related Work • Research Goal • Proposed Methodology • Evaluation • Conclusion and Future Work
Introduction • Traffic classification for network management • Network planning • QoS management • Security • Etc. • Diversity of today’s Internet traffic • New types of network applications • Increase of P2P traffic • Various techniques for avoiding detection • From document classification to traffic classification • Document classification in natural language processing • Comparing packet payload vectors is analogous to document classification
Related Work • Well-known port-based classification • Low complexity • Low accuracy (approximately 50-70%) • Signature-based classification • High reliability • Exhaustive search for signatures • e.g., Snort, LASER • Behavior-based classification • Focuses on traffic patterns and connection behaviors • Questionable accuracy • e.g., BLINC • Machine learning-based classification • Utilizes statistical information • High computing resource consumption • e.g., SVM, Bayesian network • Similarity-based classification • Utilizes a document classification approach • Questionable scalability • e.g., flow similarity calculation [IPOM ’09]
Summary of IPOM 2009 • Proposed a new traffic classification approach • Utilizes a document classification approach based on Cosine similarity calculation • Proposes a new packet representation using the Vector Space Model • Proposes a flow similarity calculation methodology that compares the packets in a flow sequentially • Validates the methodology using real-world traffic on our campus backbone network • Limitations • Cannot classify flows in an asymmetric routing environment • No comparison of Cosine similarity with other similarity metrics • Cosine similarity is a common similarity metric for human-document classification • High variation of the similarity value according to term frequency
Research Goals • Propose a new traffic classification algorithm • Automate the signature generation step • Generate application vectors, which serve as alternative signatures, using simple vector operations • Form groups according to traffic type and operation within single-application traffic • Accurate and feasible traffic classification algorithm • Classify application traffic using similarity calculation • Solve the asymmetric routing classification problem • Validate using real-world network traffic to compare similarity metrics • Complexity analysis • Compare three similarity metrics for traffic classification • Jaccard similarity: counts fragments of the signature • Cosine similarity: applies a high weighting to the signature • RBF similarity: Euclidean distance between packets
Vector Space Modeling • Vector Space Modeling • An algebraic model representing text documents as vectors • Widely used for document classification • Categorizes electronic documents based on their content (e.g., e-mail spam filtering) • Document classification vs. traffic classification • Document classification • Find documents among stored text documents that satisfy certain information queries • Traffic classification • Classify network traffic according to the type of application based on traffic information
Payload Vector Conversion (1/2) • Definition of a word in a payload • Payload data within an i-byte sliding window • |Word set| = $2^{8 \times \text{sliding window size}}$ • Definition of payload vector • A term-frequency vector, as in NLP • Payload Vector = $[w_1\ w_2\ \dots\ w_n]^T$
Payload Vector Conversion (2/2) • [Figure: successive words extracted from a payload] • The word size is 2 and the word set size is $2^{16}$ • The simplest case for representing the order of content in payloads
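The conversion described on these two slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `payload_vector` is hypothetical, the 1-byte stride of the sliding window is an assumption, and the vector is stored sparsely (only words that occur) rather than as a dense array of all $2^{16}$ dimensions.

```python
from collections import Counter

def payload_vector(payload: bytes, window: int = 2) -> Counter:
    """Return a sparse term-frequency vector of `window`-byte words.

    Assumption: the window advances one byte at a time, so a payload of
    n bytes yields n - window + 1 overlapping words.
    """
    if window < 1:
        raise ValueError("window must be at least 1 byte")
    words = (payload[i:i + window] for i in range(len(payload) - window + 1))
    return Counter(words)

# Example: the 4-byte payload b"abab" yields the words b"ab", b"ba", b"ab".
example = payload_vector(b"abab")
```

With a 2-byte window, overlapping words preserve the local order of bytes, which is what the slide means by the simplest case for representing the order of content.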
Similarity Metrics for Traffic Classification • Jaccard similarity • The size of the intersection of the sample sets X and Y divided by the size of their union • Cosine similarity • Compares two n-dimensional vectors X and Y by finding the cosine of the angle between them • RBF similarity • A radial basis function of the Euclidean distance between two vectors X and Y
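The three metrics can be sketched over sparse term-frequency vectors as follows. This is an illustration under stated assumptions: the function names are hypothetical, Jaccard is computed over the sets of distinct words, and the RBF form `exp(-gamma * distance²)` with its `gamma` parameter is the common Gaussian kernel, which the slides do not spell out.

```python
import math
from collections import Counter

def jaccard_similarity(x: Counter, y: Counter) -> float:
    """|X ∩ Y| / |X ∪ Y| over the sets of distinct words."""
    xs, ys = set(x), set(y)
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0

def cosine_similarity(x: Counter, y: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(count * y[word] for word, count in x.items())
    norm_x = math.sqrt(sum(c * c for c in x.values()))
    norm_y = math.sqrt(sum(c * c for c in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

def rbf_similarity(x: Counter, y: Counter, gamma: float = 0.1) -> float:
    """Gaussian RBF of the Euclidean distance; gamma is an assumed parameter."""
    squared_distance = sum((x[w] - y[w]) ** 2 for w in set(x) | set(y))
    return math.exp(-gamma * squared_distance)

a = Counter({b"ab": 2, b"ba": 1})
b = Counter({b"ab": 1, b"bc": 1})
print(jaccard_similarity(a, b), cosine_similarity(a, b), rbf_similarity(a, b))
```

All three functions return values between 0 and 1 for non-negative term frequencies, so their outputs can be compared against a common threshold.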
Application Vector Heuristics • Application vector • Represents typical packets generated by the target application as the center (basis) of each cluster • Application vector generator • Reads packets from the target application trace • Divides the packets into several types of clusters without any pre-processing • [Figure: the application vector generator takes an application trace and produces traffic clusters and application vectors]
Application Vector Generation • Unsupervised grouping within single-application traffic • Provide fine-grained classification • Classify single-application traffic according to traffic types • [Figure: packets 1-6 of one application's traffic grouped into Cluster 1 and Cluster 2, represented by application vectors 1 and 2]
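The slides do not specify the grouping procedure, so the following is only one possible sketch of unsupervised application vector generation. It assumes a greedy threshold clustering: each packet joins the most similar existing cluster if the similarity reaches a threshold, otherwise it starts a new cluster, and each application vector is the element-wise mean (centroid) of its cluster. The names `generate_application_vectors`, `centroid`, and the `threshold` value are hypothetical, and Cosine similarity is used here purely for illustration.

```python
import math
from collections import Counter

def cosine(x, y) -> float:
    """Cosine similarity between two sparse vectors stored as mappings."""
    dot = sum(c * y.get(w, 0) for w, c in x.items())
    nx = math.sqrt(sum(c * c for c in x.values()))
    ny = math.sqrt(sum(c * c for c in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def centroid(vectors) -> dict:
    """Element-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {word: count / len(vectors) for word, count in total.items()}

def generate_application_vectors(packets, threshold: float = 0.5) -> list:
    """Greedy clustering (an assumption); returns one centroid per cluster."""
    clusters, centers = [], []
    for vec in packets:
        best, best_sim = None, threshold
        for idx, center in enumerate(centers):
            sim = cosine(vec, center)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            # No existing cluster is similar enough: start a new one.
            clusters.append([vec])
            centers.append(centroid([vec]))
        else:
            clusters[best].append(vec)
            centers[best] = centroid(clusters[best])
    return centers

packets = [Counter({b"ab": 2}), Counter({b"ab": 1}), Counter({b"zz": 3})]
centers = generate_application_vectors(packets)
```

In this toy run the first two packets share a cluster and the third forms its own, which mirrors the idea of separating traffic types within a single application's trace.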
Two-stage Traffic Classification • Packet-level clustering • Classify single packets regardless of flow information • Compare payload vectors with application vectors by calculating similarity values • Mark each packet with its application and priority • Allow permutation of the packet sequence • Flow-level classification • Rearrange packets according to flow information • Ignore mis-clustered packets caused by protocol ambiguities • HTTP for Web • HTTP for P2P
Two-stage Traffic Classification • [Figure: in Stage 1, packets from backbone traffic (F1 P1 to F2 P4) are clustered against application vectors 1-3 (BitTorrent, FileGuri, Melon) regardless of flow; in Stage 2, packets are regrouped into Flow 1 and Flow 2 and mis-clustered packets are ignored]
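The two stages can be sketched as follows. This is a hedged illustration rather than the authors' algorithm: the function names, the `threshold` value, the use of Jaccard similarity in Stage 1, and the majority vote used in Stage 2 to ignore mis-clustered packets are all assumptions. Flow identifiers are plain strings here; in practice they would be derived from the flow five-tuple.

```python
from collections import Counter, defaultdict

def jaccard(x, y) -> float:
    """Jaccard similarity over the sets of distinct words."""
    xs, ys = set(x), set(y)
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0

def classify_packet(vec, app_vectors, threshold: float = 0.3):
    """Stage 1: label one packet with the most similar application, or None."""
    best_app, best_sim = None, threshold
    for app, app_vec in app_vectors:
        sim = jaccard(vec, app_vec)
        if sim >= best_sim:
            best_app, best_sim = app, sim
    return best_app

def classify_flows(packets, app_vectors, threshold: float = 0.3) -> dict:
    """Stage 2: regroup labeled packets by flow and take a majority vote."""
    votes = defaultdict(Counter)
    for flow_id, vec in packets:
        label = classify_packet(vec, app_vectors, threshold)
        if label is not None:
            votes[flow_id][label] += 1
    # The minority labels in a flow are treated as mis-clustered and dropped.
    return {flow: tally.most_common(1)[0][0] for flow, tally in votes.items()}

app_vectors = [
    ("BitTorrent", Counter({b"Bi": 1, b"it": 1, b"tT": 1})),
    ("Melon", Counter({b"Me": 1, b"el": 1})),
]
packets = [
    ("F1", Counter({b"Bi": 1, b"it": 1})),
    ("F1", Counter({b"Bi": 1, b"it": 1, b"tT": 1})),
    ("F1", Counter({b"Me": 1, b"el": 1})),  # mis-clustered packet in flow F1
    ("F2", Counter({b"Me": 1})),
    ("F3", Counter({b"zz": 1})),            # matches no application vector
]
result = classify_flows(packets, app_vectors)
```

Because Stage 1 labels each packet independently, the order in which packets arrive does not matter, and a flow can still be labeled when only some of its packets are observed.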
Classifying Real-world Traffic • Fixed-port applications • Traffic trace captured with an optical tap on one of the two Internet junctions at POSTECH • Ground-truth traffic • Some active flows among the application traffic, distinguished by use of the active port number • Target applications • FileGuri, ClubBox, Melon, BigFile • Untraceable-port applications • Traffic Measurement Agent (TMA) • Monitors the network interface of the host • Records log data (flow five-tuples, process name, packet count, etc.) • Target applications • eMule, BitTorrent • [Figure: backbone traffic, ground-truth traffic, and target application traffic]
Classification Accuracy • Classification accuracy comparison • Fixed-port applications • FileGuri, ClubBox, Melon, BigFile • Untraceable-port applications • eMule, BitTorrent • Jaccard similarity • Reliable: counts common segments • Cosine similarity • Emphasizes common segments: cannot distinguish ambiguous packets • RBF similarity • The parameter is difficult to set: a guideline for choosing it is needed • BitTorrent traffic on the backbone network • Traffic over-classification by Cosine similarity • High false positive rate of Cosine similarity
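One plausible reading of the Cosine over-classification noted above is that a single high-frequency word shared by two otherwise unrelated payloads dominates the Cosine value, while Jaccard only counts it once. The toy vectors below are invented for illustration and are not taken from the evaluation traces; the function names are hypothetical.

```python
import math
from collections import Counter

def jaccard(x: Counter, y: Counter) -> float:
    """Jaccard similarity over the sets of distinct words."""
    xs, ys = set(x), set(y)
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0

def cosine(x: Counter, y: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(c * y[w] for w, c in x.items())
    nx = math.sqrt(sum(c * c for c in x.values()))
    ny = math.sqrt(sum(c * c for c in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

# Invented example: both vectors contain many zero-byte words (padding),
# but share no other content.
app_vector = Counter({b"\x00\x00": 50, b"BT": 1, b"Tp": 1})
other_packet = Counter({b"\x00\x00": 40, b"XX": 1, b"YY": 1})

cos_value = cosine(app_vector, other_packet)   # dominated by the padding word
jac_value = jaccard(app_vector, other_packet)  # 1 shared word out of 5 distinct
print(f"cosine={cos_value:.3f}, jaccard={jac_value:.3f}")
```

Under this reading, any threshold low enough to accept genuine matches would also accept the unrelated packet under Cosine similarity, which is consistent with the high false positive rate reported on the slide.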
Conclusion and Future Work • Developed a new traffic classification approach • Applies the document classification approach to traffic classification • Unsupervised classification to form clusters within single-application traffic • Two-stage classification algorithm to solve the asymmetric routing classification problem • Linear time complexity • Compared three similarity metrics • Provides a guideline for selecting similarity metrics for traffic classification • Provides soft classification that represents similarity as a numerical value ranging from 0 to 1 • Future Work • Enhance the unsupervised classification methodology for automated signature generation • Extract orthogonal application vectors to improve scalability