
Identification of Network Applications based on Machine Learning Techniques
COST-TMA Meeting, Samos 2008
Valentín Carela-Español, Pere Barlet-Ros, Josep Solé-Pareta {vcarela, pbarlet, pareta}@ac.upc.edu


Presentation Transcript


  1. Identification of Network Applications based on Machine Learning Techniques. COST-TMA Meeting, Samos 2008. Valentín Carela-Español, Pere Barlet-Ros, Josep Solé-Pareta {vcarela, pbarlet, pareta}@ac.upc.edu

  2. Outline
     • Scenario and objectives
     • Existing solutions
       • Well-known ports
       • Payload based (pattern matching)
       • Machine Learning
         • Supervised
         • Unsupervised
     • Proposed method
     • Results
     • Conclusions and Future work

  3. Scenario and objectives
     • Scenario: SMARTxAC, the Traffic Monitoring and Analysis System for the Anella Científica
       • Real-time classification
       • Independent of packet contents
       • High-speed link
     • Objectives:
       • Develop an ML technique to identify applications in SMARTxAC
       • Automate the ML training phase
       • Adapt our solution to Netflow
       • Study how sampling affects the solution


  5. Existing Solutions
     • Well-known ports
       • Computationally lightweight
       • Very low accuracy
     • Payload based (pattern matching)
       • High accuracy
       • Packet contents are required
       • Computationally expensive
       • Content encryption
       • Privacy legislation
     • Consequence: not feasible solutions
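As an illustration of why the well-known ports approach is computationally lightweight yet inaccurate, here is a minimal Python sketch. The port table and the function name `classify_by_port` are hypothetical and not taken from the presentation.

```python
# Hypothetical port-to-application table (illustrative only, far from complete).
WELL_KNOWN_PORTS = {
    25: "smtp",
    53: "dns",
    80: "http",
    443: "https",
}


def classify_by_port(sport: int, dport: int) -> str:
    """Guess the application from the port numbers alone, or return "unknown"."""
    # Try the destination port first, then the source port (covers reply traffic).
    for port in (dport, sport):
        app = WELL_KNOWN_PORTS.get(port)
        if app is not None:
            return app
    return "unknown"


print(classify_by_port(51234, 80))    # labelled "http" whatever the payload really is
print(classify_by_port(40000, 6881))  # port not in the table, so "unknown"
```

The lookup is a single dictionary access per port, but any application that runs on a non-standard port, or that reuses port 80, is mislabelled or missed.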

  6. Existing Solutions
     • Machine Learning Techniques
       • Difficult training phase
       • Packet contents are not required
       • High accuracy
       • Computationally viable
     • Two main possibilities:
       • Supervised methods:
         • Better accuracy for expected classes
         • Need a complete pre-labeled dataset
         • Hard to detect when retraining is needed
         • No detection of new classes
       • Unsupervised methods:
         • Do not need a fully labeled dataset
         • Automatic detection of new classes
         • Better accuracy for new classes


  8. Proposed method
     • Supervised identification based on the C4.5 algorithm
       • Developed by Ross Quinlan as an extension of ID3
       • Based on the construction of a classification tree
     • Training set
       • Actual traffic flows
       • Pairs <flow features, application>
       • The feature vector contains relevant characteristics of traffic flows
       • The application is identified using L7-filter
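C4.5 builds its classification tree by repeatedly choosing the feature whose split scores the best gain ratio. The following Python sketch shows that criterion for a discrete feature only. C4.5 also handles continuous features through thresholds, which is omitted here. The function names and the toy flows are illustrative assumptions, not material from the presentation.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio obtained by splitting `labels` on the discrete feature `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the partition induced by the feature itself.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy training pairs <flow feature, application> (made-up values).
apps = ["http", "http", "smtp", "smtp"]
print(gain_ratio([80, 80, 25, 25], apps))  # dport separates the classes perfectly
print(gain_ratio([1, 2, 1, 2], apps))      # an uninformative feature
```

In this toy data the destination port would be chosen as the root split, because it separates the two applications completely while the second feature carries no information.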

  9. Machine Learning process
     1) Collection of the training set
        • Representative flows of the environment to be monitored
     2) Automatic flow classification → application class
        • Pattern matching using L7-filter
        • It can be simplified if an artificial training set is used in 1)
     3) Feature extraction from the training flows
     4) Construction of a C4.5 classification tree
        • E.g. using Weka
     5) Deployment of the tree obtained in 4) in the monitoring system
     6) Retraining of the system
        • Starting from phase 1)
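Step 3) can be sketched in Python as follows. The feature names come from the feature subsets listed on the Features Accuracy slide, but their exact definitions are not given in the transcript. The meanings used here are assumptions: "out" is taken as the direction of the flow initiator, and `push_in` as the count of inbound packets with the TCP PSH flag. The `Packet` type and `extract_features` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    outbound: bool  # True if sent by the flow initiator (assumed meaning of "out")
    size: int       # packet length in bytes
    push: bool      # TCP PSH flag set


def extract_features(sport, dport, packets):
    """Build one feature vector for a flow from its packets (assumed definitions)."""
    out_sizes = [p.size for p in packets if p.outbound]
    in_sizes = [p.size for p in packets if not p.outbound]
    return {
        "sport": sport,
        "dport": dport,
        "bytes_out": sum(out_sizes),
        "avg_out_size": sum(out_sizes) / len(out_sizes) if out_sizes else 0.0,
        "avg_in_size": sum(in_sizes) / len(in_sizes) if in_sizes else 0.0,
        "push_in": sum(1 for p in packets if not p.outbound and p.push),
    }


flow = [
    Packet(outbound=True, size=100, push=False),
    Packet(outbound=False, size=1500, push=True),
    Packet(outbound=True, size=300, push=True),
    Packet(outbound=False, size=500, push=False),
]
print(extract_features(40000, 80, flow))
```

Each resulting dictionary, paired with the application label obtained in step 2), would form one row of the training set passed to the tree builder in step 4).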


  11. Accuracy

  12. Netflow Accuracy

  13. Accuracy

  14. Features Accuracy
     • Best normal feature subset: dport, bytes_out, avg_out_size, sport, avg_in_size, push_in
     • Best Netflow feature subset: dport, bytes, push

  15. How does sampling affect it?


  17. Conclusions and Future Work
     • Machine learning techniques are a good solution to identify applications
     • Identification in sampled scenarios is still an open problem
     • Future work:
       • Find a more accurate automatic system to label the dataset
       • Build early decision trees to identify the flow as soon as possible
       • Find features that achieve higher accuracy and are more resilient to sampling
       • Test with traces from other networks to check the generality of the solution

  18. Thank you for your attention. Questions?
