Passive inference: Troubleshooting the Cloud with Tstat
Alessandro Finamore <alessandro.finamore@polito.it>
TMA (Traffic Monitoring and Analysis)
4th TMA PhD School, London, Apr 16th, 2014
Active vs. passive inference
• Active inference:
  • Study cause/effect relationships, i.e., inject some traffic into the network to observe a reaction
  • PRO: world-wide scale (e.g., PlanetLab)
  • CONS: synthetic benchmarks suffer from a lack of generality
• Passive inference:
  • Study traffic properties just by observing the traffic, without interfering with it
  • PRO: study traffic generated by actual Internet users
  • CONS: limited number of vantage points
The network monitoring playground
• Deploy some vantage points
• Collect some measurements
• Extract analytics
• Example questions: What is the performance of a cache? What is the performance of YouTube video streaming?
• Challenges?
  • Automation
  • Flexibility/Openness
[Diagram labels: passive probe, data, post-processing]
Pushing the paradigm further with mPlane
• FP7 European project about the design and implementation of a measurement plane for the Internet
• Large scale
  • Vantage points deployed on a worldwide scale
• Flexible
  • Offers APIs for integrating existing measurement frameworks
  • Not strictly bound to specific “use cases”
• Intelligent
  • Automate/simplify the process of “cooking” raw data
  • Identify anomalies and unexpected events
  • Provide root-cause-analysis capabilities
mPlane consortium
• FP7 IP
• 3 years long
• 11 Meuro
• 16 partners
  • 3 operators
  • 6 research centers
  • 5 universities
  • 2 small enterprises
People (slide labels: Coordinator, WP1 to WP7):
• Marco Mellia, POLITO
• Saverio Nicolini, NEC
• Dina Papagiannaki, Telefonica
• Ernst Biersack, Eurecom
• Tivadar Szemethy, NetVisor
• Brian Trammell, ETH
• Fabrizio Invernizzi, Telecom Italia
• Andrea Fregosi, Fastweb
• Dario Rossi, ENST
• Pedro Casas, FTW
• Guy Leduc, Univ. Liege
• Pietro Michiardi, Eurecom
Pushing the paradigm further with mPlane
• Integration with existing monitoring frameworks
• Active and passive analysis for iterative root-cause analysis
[Diagram labels: active probe, passive probe, data, control, post-processing]
What else besides mPlane?
• “From global measurements to local management”
  • Specific Targeted Research Project (STReP)
  • 3 years (2 left), 10 partners, 3.8 Meuro
  • Build a measurement framework out of probes
• IETF, Large-Scale Measurement of Broadband Performance (LMAP)
  • Standardization effort on how to do broadband measurements
  • Defining the components, protocols, rules, etc.
  • It does not specifically target adding “a brain” to the system
… is a sort of “mPlane use case”
Strong similarities in the core of the architecture
The network monitoring trinity
Try not to focus on just one aspect, but rather on “mastering the trinity”:
• Raw measurements
• Repository
• Post-processing
Focus on: How to process network traffic? How to scale to 10Gbps?
Tstat: http://tstat.polito.it
• The passive sniffer developed at Polito over the last 10 years
• Question: Which are the most popular services accessed?
• Question: How are CDNs/datacenters composed?
[Diagram labels: Private network, Border router, Rest of the world, Traffic stats]
Tstat: http://tstat.polito.it
• The passive sniffer developed at Polito over the last 10 years
• Per-flow stats including
  • Several L3/L4 metrics (e.g., #pkts, #bytes, RTT, TTL, etc.)
  • Traffic classification
    • Deep Packet Inspection (DPI)
    • Statistical methods (Skype, obfuscated P2P)
• Different output formats (logs, RRDs, histograms, pcap)
• Runs on off-the-shelf HW
  • Up to 2Gb/s with a standard NIC
• Currently adopted in real network scenarios (campus and ISP)
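To make “per-flow stats” concrete, here is a minimal Python sketch of per-flow accounting. It is not Tstat's implementation: the packet tuple layout, the `FlowStats` fields, and the `flow_key` and `aggregate` helpers are all hypothetical names chosen for illustration, and only packet and byte counters are kept.

```python
from dataclasses import dataclass


@dataclass
class FlowStats:
    """Hypothetical per-flow record (a tiny subset of what a real probe logs)."""
    packets: int = 0
    bytes: int = 0
    first_ts: float = 0.0
    last_ts: float = 0.0


def flow_key(src, sport, dst, dport, proto):
    """Direction-independent key: both directions of a connection share one flow."""
    a, b = (src, sport), (dst, dport)
    return (proto,) + (a + b if a <= b else b + a)


def aggregate(packets):
    """packets: iterable of (ts, src, sport, dst, dport, proto, length)."""
    flows = {}
    for ts, src, sport, dst, dport, proto, length in packets:
        key = flow_key(src, sport, dst, dport, proto)
        stats = flows.get(key)
        if stats is None:
            # First packet of a new flow: remember when it started.
            stats = flows[key] = FlowStats(first_ts=ts)
        stats.packets += 1
        stats.bytes += length
        stats.last_ts = ts
    return flows
```

A real probe would also track RTT, TTL, TCP state and flow expiry; the sketch only shows the keyed-aggregation idea behind a flow table.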
Research/technology challenge
• Challenge: Is it possible to build a “full-fledged” passive probe that copes with >10Gbps?
• Ad-hoc NICs are too expensive (>10 keuro)
• Software solutions built on top of common Intel NICs
  • ntop DNA
  • netmap
  • PFQ
By offering direct access to the NIC (i.e., bypassing the kernel stack), these libraries can count packets at wire speed… but what about doing real processing?
[ACM Queue] Revisiting Network I/O APIs: The netmap Framework
[PAM’12] PFQ: a Novel Engine for Multi-Gigabit Packet Capturing With Multi-Core Commodity Hardware
[IMC’10] High Speed Network Traffic Analysis with Commodity Multi-core Systems
Possible system architecture
[Pipeline diagram: Read pkts, Dispatch / Scheduling, consumer1, consumer2, …, consumerN, out1, out2, …, outN, merge]
• Read pkts: one or more processes for reading? Depends…
• Dispatch / Scheduling: per-flow packet scheduling is the simplest option, but
  • What about correlating multiple flows (e.g., DNS/TCP)?
  • What about scheduling per traffic class?
• Consumers: how to organize the analysis modules workflow?
  • N identical consumer instances?
  • Within each consumer, a single execution flow?
• Output: if needed, design “mergeable” output
• A solution based on libDNA is under testing
[Plot: % pkts dropped vs. wire speed [Gbps]; 2 Tstat + libDNA (synth. traffic); annotation: margin to improve]
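The per-flow dispatch option can be sketched as a hash over the flow identifiers. This is only an illustration under assumptions: `consumer_for` is a hypothetical helper, CRC32 is an arbitrary choice of hash, and nothing here reflects how libDNA or Tstat actually distribute packets.

```python
import zlib


def consumer_for(src, sport, dst, dport, proto, n_consumers):
    """Pick a consumer index for a packet so that a whole flow stays together.

    The two endpoints are sorted before hashing, so both directions of the
    same connection are dispatched to the same consumer.
    """
    a = f"{src}:{sport}"
    b = f"{dst}:{dport}"
    lo, hi = sorted((a, b))
    digest = zlib.crc32(f"{proto}|{lo}|{hi}".encode())
    return digest % n_consumers
```

This also shows why the slide's question about correlated flows matters: a client's DNS query and the TCP connection that follows have different 5-tuples, so they may land on different consumers. Hashing on the client address alone would keep them together, at the cost of a less even load split.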
Other traffic classification tools?
• WAND (Shane Alcock), http://research.wand.net.nz
  • Libprotoident: traffic classification using 4 bytes of payload
  • Libtrace: rebuilds TCP/UDP, plus other tools for processing pcaps
• ntop (Luca Deri), http://www.ntop.org/products/ndpi
  • nDPI: a superset of OpenDPI
• l7filter, but it is known to be inaccurate
• The literature is full of statistical/behavioral traffic classification methodologies [1,2] but, AFAIK,
  • no real deployment
  • no open source tool released
A fancy classifier does not help if you do not have proper flow characterization
[1] “A survey of techniques for internet traffic classification using machine learning”, IEEE Communications Surveys & Tutorials, 2009
[2] “Reviewing Traffic Classification”, LNCS Vol. 7754, 2013
Measurement frameworks
• RIPE Atlas, https://atlas.ripe.net
  • World-wide deployment of inexpensive active probes
  • User Defined Measurements (UDM), credit based
  • Ping, traceroute/traceroute6, DNS, HTTP
• Google M-Lab Network Diagnostic Test (NDT), http://mlab-live.appspot.com/tools/ndt
  • Connectivity and bandwidth speed
  • Publicly available data… but IMO not straightforward to use
Recent research activities
• Raw measurements. Focus on: How to process network traffic? How to scale to 10Gbps?
• Repository. Focus on: How to export/consolidate data continuously? What about BigData?
• Post-processing
(Big)Data export frameworks • Overcrowded scenario https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Logging_Solutions_Recommendation
(Big)Data export frameworks
• Overcrowded scenario
  • All general purpose frameworks
  • Data center scale
  • Emphasis on throughput and/or real-time and/or consistency, etc.
  • Typically designed/optimized for HDFS
• log_sync, “ad-hoc” solution @ POLITO
  • Designed to manage a few passive probes
  • Emphasis on throughput and data consistency
Data management @ POLITO
• NAS: ~40TB (3TB x 12) = 1 year of data
• Cluster
  • 11 nodes = 9 data nodes + 2 namenodes
  • 416GB RAM = 32GB x 9 + 64GB x 2
  • ~32TB HDFS
  • Single 6-core CPU per node = 66 cores (x2 with HT)
  • Debian 6 + CDH 4.5.0
• Gateway
  • log_sync (client)
  • pre-processing (dual 4-core, 3TB disk, 16GB RAM)
[Diagram labels: probe1 … probeN (ISP/Campus), gateway, NAS, cluster, log_sync (server) x2]
BigData = Hadoop?
• Almost true, but there are other NoSQL solutions
  • MongoDB, Redis, Cassandra, Spark, Neo4j, etc. http://nosql-database.org
• How to choose? Not so easy to say, but
  • Avoid BigData frameworks if you have just a few GB of data
  • Sooner or later you are going to do some coding, so pick something that seems “comfortable”
• Fun fact: MapReduce is a NoSQL paradigm, but people are used to SQL queries
  • Rise of Pig, Hive, Impala, Shark, etc., which allow SQL-like queries on top of MapReduce
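To illustrate the “fun fact”, here is a toy, single-process Python sketch of the map/shuffle/reduce steps that a SQL-like query such as `SELECT server, SUM(bytes) FROM log GROUP BY server` boils down to. The record layout and the function names are hypothetical, and no Hadoop, Pig or Hive API is involved.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs; here the key is the server, the value its bytes."""
    server, n_bytes = record
    yield server, n_bytes


def shuffle(pairs):
    """Group all values by key (the framework does this between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Aggregate the values of one key, i.e., the SUM(...) of the SQL query."""
    return key, sum(values)


def run(records):
    pairs = (kv for record in records for kv in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

For example, `run([("a", 1), ("b", 2), ("a", 3)])` gives one total per server. Tools offering SQL-like queries spare the user from writing these phases by hand.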
Recent research activities
• Raw measurements. Focus on: How to process network traffic? How to scale to 10Gbps?
• Repository. Focus on: How to export/consolidate data continuously? What about BigData?
• Post-processing. Focus on: Case study of an Akamai “cache” performance
“DBStream: an Online Aggregation, Filtering and Processing System for Network Traffic Monitoring”, TRAC’14
Monitoring an Akamai cache
• Focusing on a vantage point of ~20k ADSL customers
• 1 week of HTTP logs (May 2012)
• Content served by the Akamai CDN
• The ISP hosts an Akamai “preferred cache” (a specific /25 subnet)
[Plot with three “?” annotations]
Reasoning about the problem
• Q1: Is this affecting specific FQDNs?
• Q2: Are the variations due to “faulty” servers?
• Q3: Was this triggered by CDN performance issues?
• Etc…
How to automate/simplify this reasoning?
DBStream (FTW)
• Continuous big data analytics
• Flexible processing language
• Full SQL processing capabilities
• Processing in small batches
• Storage for post-mortem analysis
Q1: Is this affecting a specific FQDN? NO!!
• Select the top 500 Fully Qualified Domain Names (FQDN) served by Akamai
• Check if they are served by the preferred /25 subnet
• Repeat every 5 min
• The two sets have “services” in common
• Same results when extending to more than 500 FQDNs
[Plot legend: Preferred /25 subnet, Other subnets; annotations: “FQDN not served by the preferred cache”, “FQDN hosted by the preferred cache, except during the anomaly”]
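The Q1 check can be sketched in a few lines of Python. This is an assumption-laden illustration, not the DBStream query used in the talk: the record layout `(ts, fqdn, server_ip)` and the function name are hypothetical, and `192.0.2.0/25` is a documentation prefix standing in for the real preferred subnet.

```python
import ipaddress
from collections import Counter, defaultdict

# Placeholder subnet (documentation range), NOT the real Akamai preferred cache.
PREFERRED = ipaddress.ip_network("192.0.2.0/25")
BIN_SECONDS = 300  # 5-minute bins, as on the slide


def served_by_preferred(records, top_n=500):
    """records: list of (ts, fqdn, server_ip) taken from HTTP logs.

    Returns {bin_start: {fqdn: fraction of requests served by PREFERRED}}
    for the top_n most requested FQDNs.
    """
    popularity = Counter(fqdn for _, fqdn, _ in records)
    top = {fqdn for fqdn, _ in popularity.most_common(top_n)}

    total = defaultdict(Counter)
    inside = defaultdict(Counter)
    for ts, fqdn, server_ip in records:
        if fqdn not in top:
            continue
        bin_start = int(ts // BIN_SECONDS) * BIN_SECONDS
        total[bin_start][fqdn] += 1
        if ipaddress.ip_address(server_ip) in PREFERRED:
            inside[bin_start][fqdn] += 1

    return {
        b: {fqdn: inside[b][fqdn] / n for fqdn, n in counts.items()}
        for b, counts in total.items()
    }
```

An FQDN whose fraction drops from close to 1 to close to 0 in some bins matches the “hosted by the preferred cache, except during the anomaly” pattern. Note that `records` is iterated twice, so it must be a list rather than a one-shot iterator.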
Q2: Are the variations due to “faulty” servers? NO!!
• Compute the traffic volume per IP address
• Check the behavior during the disruption
• Repeat every 5 min
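A minimal Python sketch of the Q2 check follows, assuming hypothetical log records of the form `(ts, server_ip, n_bytes)`. The helper names are illustrative and the addresses used with them would be whatever the logs contain.

```python
from collections import defaultdict


def volume_per_server(records, bin_seconds=300):
    """records: iterable of (ts, server_ip, n_bytes).

    Returns {server_ip: {bin_start: bytes served in that 5-minute bin}}.
    """
    volume = defaultdict(lambda: defaultdict(int))
    for ts, server_ip, n_bytes in records:
        bin_start = int(ts // bin_seconds) * bin_seconds
        volume[server_ip][bin_start] += n_bytes
    return volume


def silent_servers(volume, bin_start):
    """Servers seen at some point that served nothing in the given bin."""
    return sorted(
        ip for ip, bins in volume.items() if bins.get(bin_start, 0) == 0
    )
```

If only a handful of addresses went silent during the disruption, “faulty” servers would be a plausible explanation; a drop spread across all addresses of the subnet points elsewhere, which is consistent with the “NO!!” on the slide.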
Q3: Was this triggered by performance issues? NO!!
• Compute the distribution of the server query elaboration time
  • It is the time between the TCP ACK of the HTTP GET and the reception of the first byte of the reply
• Focus on the traffic of the preferred /25 subnet
• Compare the quartiles of the server elaboration time every 5 min
• Performance decreases right before the anomaly @6pm
[Sequence diagram, as seen by a passive probe between client and server: SYN, SYN+ACK, ACK, GET, ACK, DATA; the query processing time is the interval between the ACK of the GET and DATA]
[Plot annotations: YES!!, NO!!, YES!!]
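The quartile comparison can be sketched as follows, assuming the probe has already extracted, for each request, the timestamp of the ACK to the GET and the timestamp of the first DATA byte. The sample layout and the function name are hypothetical.

```python
import statistics
from collections import defaultdict


def elaboration_quartiles(samples, bin_seconds=300):
    """samples: iterable of (ack_ts, first_data_ts), both observed at the probe.

    Returns {bin_start: (q1, median, q3)} of the elaboration time
    (first_data_ts - ack_ts) for each 5-minute bin.
    """
    by_bin = defaultdict(list)
    for ack_ts, first_data_ts in samples:
        bin_start = int(ack_ts // bin_seconds) * bin_seconds
        by_bin[bin_start].append(first_data_ts - ack_ts)

    quartiles = {}
    for bin_start, values in sorted(by_bin.items()):
        # Bins with fewer than two samples are skipped: quartiles of a
        # single value carry no information.
        if len(values) >= 2:
            quartiles[bin_start] = tuple(statistics.quantiles(values, n=4))
    return quartiles
```

Plotting the three series over time makes a shift in server-side processing time visible. Since the probe sits between client and server, the measured interval also includes the probe-to-server round trip, so it is an upper bound on the pure elaboration time.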
Reasoning about the problem
• Q1: Is this affecting only specific services? NO
• Q2: Are the variations due to “faulty” servers? NO
• Q3: Was this triggered by CDN performance issues? NO
• What else?
  • Do other vantage points report the same problem? YES!
  • What about extending the time period?
    • The anomaly is present along the whole period we considered
    • Ongoing extension of the analysis to more recent data sets (possibly also exposing other effects/anomalies)
  • Routing? TODO: Route Views
  • DNS mapping? TODO: RIPE Atlas + ISP active probing infrastructure
  • Other suggestions are welcome
…ok, but what are the final takeaways?
• Try to automate your analysis
• Think about what you measure and be creative, especially for visualization
• Broaden your perspective
  • multiple vantage points
  • multiple data sources
  • analysis on large time windows
• Don’t be afraid to ask for opinions
?? || ## <alessandro.finamore@polito.it> TMA Traffic monitoring and Analysis