Valentín Carela-Español Pere Barlet-Ros Josep Solé-Pareta {vcarela, pbarlet, pareta}@ac.upc

Identification of Network Applications based on Machine Learning Techniques
COST-TMA Meeting, Samos 2008
Valentín Carela-Español, Pere Barlet-Ros, Josep Solé-Pareta
{vcarela, pbarlet, pareta}@ac.upc.edu

Outline
• Scenario and objectives
• Existing solutions
  • Well-known ports
  • Payload based (pattern matching)
  • Machine Learning
    • Supervised
    • Unsupervised
• Proposed method
• Results
• Conclusions and Future work

Scenario and objectives
• Scenario: SMARTxAC, the Traffic Monitoring and Analysis System for the Anella Científica
  • Real-time classification
  • Independent of packet contents
  • High-speed link
• Objectives:
  • Develop an ML technique to identify applications in SMARTxAC
  • Automate the ML training phase
  • Adapt our solution to Netflow
  • Study how sampling affects the solution

Existing Solutions
• Well-known ports
  • Computationally lightweight
  • Very low accuracy
• Payload based (pattern matching)
  • High accuracy
  • Packet contents are required
  • Computationally expensive
  • Content encryption
  • Privacy legislation
• Consequence: neither is a feasible solution
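As a minimal sketch of the well-known-ports approach listed above, assuming a small hand-picked port table (the table contents and the function name `classify_by_port` are illustrative and not part of SMARTxAC):

```python
# Illustrative port table (assumption); a real deployment would use a fuller registry.
WELL_KNOWN_PORTS = {
    22: "ssh",
    25: "smtp",
    53: "dns",
    80: "http",
    443: "https",
}


def classify_by_port(sport: int, dport: int) -> str:
    """Return the application registered for either port, else 'unknown'."""
    for port in (dport, sport):
        app = WELL_KNOWN_PORTS.get(port)
        if app is not None:
            return app
    return "unknown"


print(classify_by_port(51234, 80))    # a web flow on its registered port
print(classify_by_port(51234, 6881))  # an application on an unregistered port
```

One dictionary lookup per flow is what makes this approach lightweight. Any application that uses dynamic or non-standard ports ends up unknown or mislabeled, which is where the low accuracy comes from.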

Existing Solutions
• Machine Learning Techniques
  • Difficult training phase
  • Packet contents are not required
  • High accuracy
  • Computationally viable
• Two main possibilities:
  • Supervised methods:
    • Better accuracy for expected classes
    • Need a complete pre-labeled dataset
    • Hard to detect when retraining is needed
    • No detection of new classes
  • Unsupervised methods:
    • Do not need a fully labeled dataset
    • Automatic detection of new classes
    • Better accuracy for new classes

Proposed method
• Supervised identification based on the C4.5 algorithm
  • Developed by Ross Quinlan as an extension of ID3
  • Based on the construction of a classification tree
• Training set
  • Actual traffic flows
  • Pairs <flow features, application>
  • The feature vector contains relevant characteristics of the traffic flow
  • The application is identified using L7-filter
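C4.5 picks the attribute test at each tree node using the gain ratio, that is, information gain normalized by the split information. The sketch below computes it for one binary threshold test on a numeric feature. It covers only this split criterion, not the full algorithm (no pruning or missing-value handling), and the sample ports and labels are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` on one numeric feature."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    info_gain = entropy(labels) - (weights[0] * entropy(left) + weights[1] * entropy(right))
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


# Hypothetical training pairs: destination port -> application label.
dports = [80, 80, 443, 6881, 6882, 6883]
apps = ["web", "web", "web", "p2p", "p2p", "p2p"]
print(gain_ratio(dports, apps, 443))  # split that separates the two classes
print(gain_ratio(dports, apps, 80))   # split that leaves one side mixed
```

A tree builder would evaluate candidate thresholds on every feature, keep the test with the highest gain ratio, and recurse on the two resulting subsets.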

Machine Learning process
1) Collection of the training set
  • Representative flows of the environment to be monitored
2) Automatic flow classification → application class
  • Pattern matching using L7-filter
  • It can be simplified if an artificial training set is used in 1)
3) Feature extraction from the training flows
4) Construction of a C4.5 classification tree
  • E.g. using Weka
5) Deployment of the tree obtained in 4) in the monitoring system
6) Retraining of the system
  • Starting from phase 1)
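Step 2 can be pictured as regular-expression matching on the first payload bytes of each training flow. The sketch below is a simplified stand-in for that idea: the two signatures are illustrative and are not the actual L7-filter patterns, and `label_flow`, `build_training_set` and the sample flows are hypothetical.

```python
import re

# Simplified example signatures (assumption: NOT the real L7-filter patterns).
SIGNATURES = [
    ("http", re.compile(rb"^(GET|POST|HEAD) \S+ HTTP/1\.[01]")),
    ("ssh", re.compile(rb"^SSH-\d\.\d+-")),
]


def label_flow(first_payload: bytes) -> str:
    """Return the first application whose signature matches the payload."""
    for app, pattern in SIGNATURES:
        if pattern.search(first_payload):
            return app
    return "unknown"


def build_training_set(flows):
    """Pair each flow's feature vector with its pattern-matched label."""
    return [(features, label_flow(payload)) for features, payload in flows]


# Hypothetical captured flows: (feature vector, first payload bytes).
flows = [
    ({"dport": 80, "bytes_out": 420}, b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n"),
    ({"dport": 22, "bytes_out": 96}, b"SSH-2.0-OpenSSH_4.7\r\n"),
    ({"dport": 6881, "bytes_out": 68}, b"\x13BitTorrent protocol"),
]
for features, app in build_training_set(flows):
    print(features["dport"], app)
```

Payload is needed only while building the training set. Once the tree is deployed, classification uses the feature vector alone.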

Accuracy

Netflow Accuracy

Accuracy

Features Accuracy
• Best normal feature subset: dport, bytes_out, avg_out_size, sport, avg_in_size, push_in
• Best Netflow feature subset: dport, bytes, push
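To make the six features of the first subset concrete, here is a sketch of how they might be computed from the packets of one flow. The slides do not define the features, so the semantics are assumptions: "out" is taken as the initiator-to-responder direction, the `avg_*_size` values as mean packet sizes in bytes, and `push_in` as the number of inbound packets with the TCP PSH flag set. The `Packet` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    outbound: bool  # True if sent by the flow initiator (assumed convention)
    size: int       # packet length in bytes
    push: bool      # TCP PSH flag set


def extract_features(sport, dport, packets):
    """Build the six-feature vector; the feature semantics are assumed."""
    out_sizes = [p.size for p in packets if p.outbound]
    in_sizes = [p.size for p in packets if not p.outbound]
    return {
        "dport": dport,
        "bytes_out": sum(out_sizes),
        "avg_out_size": sum(out_sizes) / len(out_sizes) if out_sizes else 0.0,
        "sport": sport,
        "avg_in_size": sum(in_sizes) / len(in_sizes) if in_sizes else 0.0,
        "push_in": sum(1 for p in packets if not p.outbound and p.push),
    }


packets = [
    Packet(outbound=True, size=60, push=False),
    Packet(outbound=False, size=60, push=False),
    Packet(outbound=True, size=400, push=True),
    Packet(outbound=False, size=1500, push=False),
    Packet(outbound=False, size=900, push=True),
]
print(extract_features(51234, 80, packets))
```

The Netflow subset is shorter (dport, bytes, push), so a Netflow-based extractor would have fewer values to work from.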

How does sampling affect the accuracy?
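A small simulation can show why sampling matters for flow features. The sketch assumes independent random packet sampling at a fixed rate, which is only one possible sampling scheme, and uses a made-up flow. The function names are hypothetical.

```python
import random


def sample_packets(sizes, rate, rng):
    """Keep each packet independently with probability `rate`."""
    return [s for s in sizes if rng.random() < rate]


def estimate_bytes(sampled_sizes, rate):
    """Scale the sampled byte count by 1/rate to estimate the original total."""
    return sum(sampled_sizes) / rate


rng = random.Random(2008)  # fixed seed so the sketch is repeatable
flow = [1500] * 200 + [60] * 200  # hypothetical flow of 400 packets
sampled = sample_packets(flow, 0.1, rng)
print(len(flow), len(sampled))
print(sum(flow), estimate_bytes(sampled, 0.1))
```

Volume features such as the byte count can be rescaled by the inverse of the sampling rate, but the estimate is noisy for short flows, and a flow with only a few packets may not be seen at all. Features computed from sampled flows can therefore differ from those the tree was trained on.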

Conclusions and Future Work
• Machine learning techniques are a good solution to identify applications
• Identification in sampled scenarios is still an open problem
Future work:
• Find a more accurate automatic system to label the dataset
• Build early decision trees to identify the flow as soon as possible
• Find features that achieve higher accuracy and are more resilient to sampling
• Test with traces from other networks to check the generality of the solution

Thank you for your attention Questions?
