Crowdsourcing Service-Level Network Event Detection. David Choffnes (Northwestern University*), Fabián Bustamante (Northwestern University), Zihui Ge (AT&T Labs). *Currently at U. Washington.
Internet driven by services • Internet activity increasingly driven by services • VoIP, video streaming, games, file downloads • User experience as a key benchmark • Largely determined by frequency, duration and severity of network problems (events) • To minimize impact on users • Identify problems affecting end-to-end performance • Online, reliable detection and isolation -- potentially across networks Crowdsourcing Network Monitoring
Detecting network events • Variety of existing detection approaches • Internal link/router monitoring • Limited to single administrative domain • BGP monitoring • Identifies only control-plane issues • Distributed active probing • Overhead/cost scales with size of monitored network • Limited or no visibility into end-to-end performance • Particularly problematic for edge networks • Need a scalable solution that captures what the user sees
Crowdsourcing Event Monitoring (CEM) • If enough hosts see the same performance problem, at the same time, in the same network, the problem is likely to be the network • Push monitoring to the edge systems themselves • Monitor from inside network-intensive applications • Detect drops in performance
Outline • … • Crowdsourcing network monitoring • General approach • Case study using confirmed network problems • Wide area evaluation • System implementation • BitTorrent extension (NEWS) installed by >48k users • Conclusion
System Requirements • Scalability • Use passive monitoring, fully distribute detection • Localization in time and space • Online detection • Isolation to network regions • Privacy • Use only network location, not identity • Reliability from uncontrolled hosts • Probability analysis to identify likely real events • Adoption • Build inside popular applications and/or use incentives
Approach and architecture [Diagram labels: Performance signals (e.g., upload/download rate); Local Event Detection; Distributed System]
Approach and architecture • Local detection • Passively monitor local performance information (signals) • General (e.g., transfer rates) and application specific (e.g., content availability in BitTorrent) • Detect drops in performance • E.g., dropped video frame, sudden drop in throughput • Filter out cases that are normal application behavior • E.g., BitTorrent peer finishes downloading but still seeds • Publish information only about these suspected local events
System Architecture [Diagram labels: Performance signals; Local Event Detection; Local Events; Distributed Storage; Distributed System]
Approach and architecture • Group corroboration • Gather information about local events in same network • Identify synchronous problems that are unlikely to occur by chance • Likelihood ratio to distinguish network events from coincidence • Who can identify network events? • Each user can detect events separately • Any third party with access to distributed storage can do the same (e.g., network operators)
System Architecture [Diagram labels: Performance signals; Local Event Detection; Local Events; Distributed Storage; Remote Events; Group Corroboration; Confirmed; Tap on DS; Distributed System; ISP Operator]
Evaluating the approach • Participatory monitoring challenges • Needs large-scale adoption, active users, distributed storage • Edge traces are rare • P2P applications are a natural fit • Used worldwide, generates diverse flows • BitTorrent is one of the most popular • Consumes large amounts of bandwidth • Vuze client allows extensibility, piggyback on existing users • Built-in distributed storage (DHT) • Ono dataset for traces • Installed by more than 1 million BitTorrent users • Network and BitTorrent-specific information from hundreds of thousands of users worldwide
CEM Case Study • Evaluate effectiveness of our approach using BitTorrent • How (well) does it work? • Case study: British Telecom (BT) network • Provides confirmed events through a Web interface • 27 April 2009, 3:54 PM • “We are aware of a network problem which may be affecting access to the internet in certain areas. Our engineers are working to resolve the problem. We apologize for any inconvenience this may cause.” • Callout: “Enough users complained about the network being slow and we’re looking into it.” • Resolved: 27 April 2009, 8:50 PM • Callout: “As of 9PM, we’re pretty sure we fixed the problem so we marked it resolved.” • Similar to events seen in other networks
Local detection in BitTorrent • Peers monitor multiple performance signals • General – e.g. transfer rates, number of connected peers • Protocol specific – Torrent availability • Detect drops in throughput as local events • Individual signals • Noisy • Uncontrolled duration • Wide range of values
Moving-average smoothing reveals events [Figure annotations: performance drops around 10:54; further drop at 14:50; final recovery at ~17:30]
Group corroboration • Given locally detected events, why would they occur at the same time? • Service-specific problems (e.g., no seeder) • Use application level information • Coincidence (e.g., noisy local detection) • Union probability • Problem isolated to one or more networks • Group hosts according to network location
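The grouping step above can be sketched as follows. This is a minimal illustration, not the NEWS implementation: the function name `corroborate`, the fixed time buckets, the `min_hosts` cutoff, and the network labels are all assumptions made for the example.

```python
from collections import defaultdict


def corroborate(local_events, bucket_seconds=60, min_hosts=3):
    """Group local events by (network, time bucket) and keep the groups
    reported by at least `min_hosts` distinct hosts.

    `local_events` is an iterable of (host_id, network_id, timestamp) tuples.
    The bucket size and host cutoff are illustrative, not values from the talk.
    """
    groups = defaultdict(set)
    for host, network, timestamp in local_events:
        bucket = int(timestamp // bucket_seconds)
        # A set counts each host once per bucket, even if it reports twice.
        groups[(network, bucket)].add(host)
    return {key: hosts for key, hosts in groups.items() if len(hosts) >= min_hosts}


# Hypothetical events: three hosts in "net-A" and one in "net-B" report drops.
events = [
    ("h1", "net-A", 1000),
    ("h2", "net-A", 1010),
    ("h3", "net-A", 1015),
    ("h4", "net-B", 1012),
    ("h1", "net-A", 1019),
]
suspected = corroborate(events)
print(suspected)
```

Fixed buckets split events that straddle a bucket boundary; a sliding window would avoid that at the cost of more bookkeeping.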
Coincidence in network events • Coincidence assumes that local events are occurring independently • Calculate this using union probability (Pu) • P(Lh): probability for host h seeing a local event • For large n, likelihood of coincidence is very small
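The slide's formula for Pu did not survive extraction, so the sketch below does not reproduce it. It uses the simplest independence model instead: if hosts see local events independently, the chance that all n of them see one in the same interval is the product of the per-host probabilities P(Lh). The function name and the example probability of 0.1 are assumptions for illustration.

```python
from math import prod


def coincidence_probability(local_probs):
    """Probability that every host sees a local event in the same interval,
    assuming the hosts' events are independent.

    `local_probs` holds one P(L_h) value per host. This is an illustrative
    independence model, not necessarily the exact Pu used in the talk.
    """
    if not local_probs:
        raise ValueError("need at least one host")
    return prod(local_probs)


# With a hypothetical per-host probability of 0.1, adding hosts shrinks
# the chance of coincidence geometrically.
for n in (3, 6, 9):
    print(n, coincidence_probability([0.1] * n))
```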
Likelihood ratio • Are detected network problems occurring more often than chance? • Comparing probabilities • Coincidence (Pu) • Network (Pe) • Measure how often synchronized events occur in same network • Likelihood ratio: LR = Pe/Pu • LR > 1: Events more likely due to the network than chance • Empirically derive a stricter LR threshold • Use LR as a tuning knob to control rate of event detection
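A minimal sketch of the LR test described above. It assumes Pe is estimated as the fraction of observed intervals in which synchronized events occurred; the function names, the interval counts, and the example threshold of 2.0 are hypothetical.

```python
def empirical_rate(intervals_with_sync_events, total_intervals):
    """Estimate Pe as the fraction of intervals with synchronized events."""
    if total_intervals <= 0:
        raise ValueError("total_intervals must be positive")
    return intervals_with_sync_events / total_intervals


def likelihood_ratio(p_network, p_coincidence):
    """LR = Pe / Pu, as defined on the slide."""
    if p_coincidence <= 0:
        raise ValueError("p_coincidence must be positive")
    return p_network / p_coincidence


def is_network_event(p_network, p_coincidence, threshold=1.0):
    """Flag an event when LR exceeds the threshold (the tuning knob)."""
    return likelihood_ratio(p_network, p_coincidence) > threshold


# Hypothetical numbers: synchronized events in 12 of 10,000 intervals,
# against a coincidence probability of 0.001.
pe = empirical_rate(12, 10_000)
print(likelihood_ratio(pe, 0.001))
print(is_network_event(pe, 0.001))                  # default threshold LR > 1
print(is_network_event(pe, 0.001, threshold=2.0))   # stricter threshold
```

Raising the threshold trades detection rate for fewer spurious events, which is the sense in which LR works as a tuning knob.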
Likelihood ratios in BT Yahoo [Figure: two panels, W=10, σ=1.5 and W=20, σ=2.0. Annotations: congestion event after recovery; most events no more likely than chance; all LR>1 correspond to actual network events]
Wide-area evaluation • Gold standard: False positives/negatives… • Almost no ISPs want to publish records of network problems • Existing approaches do not target service-level events • In short, there is no “ground truth” • Affects all research in this domain • What we can do • Find ISPs reporting network events via public interfaces • Work with ISPs under NDAs • Compare our approach with ISP information • Only works where we have coverage (users)
Evaluation criteria • Coverage • Confirmed events • Number of networks covered worldwide • Cross-network events • Efficiency • Event detection rate • Overhead
Effectiveness – BT Yahoo • One month from BT Yahoo • Detected: 181 events • 54 occur during confirmed problems • The remaining 127 are not necessarily false positives • Even if they all were, that is only about 4 events per day
Effectiveness – North American ISP • One month of outage data • Varies according to number of subscribers (S) in each region • S > 10000 • We detect 50% (38% more may be detected but we don’t have enough users to confirm) • 10000 > S > 1000 • 67% may be detected, but there is not sufficient corroboration
Robustness to parameters • Sensitive local detection (MA settings: 1.5σ, w=10) • Less sensitive detection (MA settings: 2.2σ, w=20) • Robust to various detection settings, populations • Number of users in a network is not strongly correlated with number of events detected • Network problems detected only 2% of the time for small MA deviations, 0.75% of the time for large ones • Can be filtered with likelihood ratio threshold
Summary • Service-level monitoring through crowdsourcing • Push monitoring to applications at the edge • Scalable, distributed approach to detection • Evaluation using large-scale P2P trace data
NEWS implementation and deployment • Plugin for the Vuze BitTorrent client • More than 48,000 installs • Core classes for event detection only ~1,000 LOC • Lots more code for UI, user notifications • Event detection • Local detection based on 15-second samples of performance information • Transfer rates • Torrent state (leech/seed) • Content availability • Group corroboration and localization • Publishes event information to built-in DHT • Uses BGP prefix, ASN information for group corroboration (already collected by Vuze)
Food for thought • Open issues • Which network groupings are best? • Whois listings, topologies, ISP-specific… • Where is the ground truth? • Crowdsourcing event labeling (Newsight) • Can we apply these principles to other services? • VoIP, video streaming, CDNs
Questions?
Backups
Related work • Crowdsourcing • Human computation [von Ahn] • Intrusion detection [Dash et al.] • Events detected • Layer-3 and below [Lakhina et al., Mahajan et al.] • End-to-end [Madhyastha et al., Zhang et al.] • Monitoring location • In network (Netflow) • Distributed probing [Feamster et al., Katz-Bassett et al.] • Edge systems [Shavitt et al., Simpson et al.] • Measurement technique • Active [Andersen et al.] • Passive [Zhang et al., Casado et al., …]
Snapshot of what we collect • Started (proper) collection in December 2007 • Daily stats (approximate) • 3 to 4 GB of compressed data • About 10 to 20 GB raw data • 2.5-3M traceroutes • 100-150M connection samples
Wide area events • Detected problems in the US, Europe and Asia • Identified potential cross network events • Use ISP relationships and correlate per-ISP events • Detected cases in seven countries
Robustness to parameters [Figure: two panels, networks ordered by # users and ordered by # events] • Robust to various detection settings, populations • Number of users in a network is not strongly correlated with number of events detected
How much does it cost? • Simulate using 3 stddev thresholds and 2 window sizes in parallel • Allows NEWS to detect multiple types of events • Model caching in DHT • Count number of DHT operations at any time • Goals • Low cost (does not affect user’s transfers) • Privacy preserving • No reliance on infrastructure
Events at each time step • Reads and writes are clustered • Expected for a system that detects events • Diurnal pattern
How much does it cumulatively cost? • One read every 10 seconds, one write every two minutes (spread over hundreds of users) • Reasonable load on a DHT • Kademlia caches values close to host • 38 bytes per read/write, about 4 B/s overhead
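The "about 4 B/s" figure can be checked with a short calculation. The sketch assumes each read and each write costs the quoted 38 bytes and that the two rates simply add; the function name is hypothetical.

```python
def dht_overhead_bytes_per_second(bytes_per_op=38, read_period_s=10, write_period_s=120):
    """Average DHT traffic from one read per `read_period_s` seconds plus
    one write per `write_period_s` seconds, each costing `bytes_per_op` bytes."""
    read_rate = bytes_per_op / read_period_s
    write_rate = bytes_per_op / write_period_s
    return read_rate + write_rate


overhead = dht_overhead_bytes_per_second()
print(f"{overhead:.2f} B/s")
```

Reads dominate the total: at these rates the writes add only about a third of a byte per second.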
Strawman cost • Decentralized approach not using summaries • 13 signals x 4 bytes = 52 bytes, every 15 seconds • Sharing incurs O(N²) cost • 1000 hosts: 34.6MB/s • Centralized approach to collecting data • 13 signals x 4 bytes = 52 bytes, every 15 seconds • 1000 hosts: 4 KB/s • Plus CPU/memory costs for processing this data • Ignores important issues • Who hosts this? • Privacy? • Cross-network events?
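A back-of-the-envelope sketch of the two strawman designs. It assumes payload bytes only (no protocol headers), that every host sends its 52-byte sample to every other host in the decentralized case, and to a single collector in the centralized case; the function names are hypothetical. Under these assumptions the totals for 1000 hosts come out near 3.5 MB/s and 3.5 KB/s, so the slide's figures may rest on assumptions not stated on it.

```python
def per_host_rate(signals=13, bytes_per_signal=4, period_s=15):
    """Bytes per second one host produces: 13 x 4 = 52 bytes every 15 s."""
    return signals * bytes_per_signal / period_s


def all_to_all_rate(n_hosts):
    """Aggregate traffic when every host sends its samples to every other
    host: N * (N - 1) flows, hence the O(N^2) growth."""
    return n_hosts * (n_hosts - 1) * per_host_rate()


def centralized_rate(n_hosts):
    """Aggregate traffic when every host reports to a single collector."""
    return n_hosts * per_host_rate()


n = 1000
print(f"all-to-all:  {all_to_all_rate(n) / 1e6:.2f} MB/s")
print(f"centralized: {centralized_rate(n) / 1e3:.2f} KB/s")
```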
NEWS UI • Notifications through non-intrusive interface • List of events for historical information
NEWS Usage [Figure]
The view from 1 million users [Figure values: 547,000; 231,000; 35,000; 1,096]
Extension to multiple signals • Leverage independent signals at each host • For example, upload rate and download rate • Even more unlikely that both signals are affected at the same time by coincidence
Detecting throughput drops • Moving average • Low cost • Few parameters • Well understood • Key parameters • Window size, deviation • Approach • Find mean for each window • Find how much next sample deviates from mean • Key questions • Can we find good window sizes, threshold deviations?
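The approach above (mean of each window, then deviation of the next sample) can be sketched as below. This is one plausible reading, not the NEWS code: the function name, the use of the population standard deviation, and the choice to keep flagged samples in later windows are assumptions. The defaults mirror the w=10, 1.5σ setting mentioned elsewhere in the talk.

```python
from collections import deque
from statistics import mean, pstdev


def detect_drops(samples, window=10, threshold=1.5):
    """Return indices of samples that fall more than `threshold` standard
    deviations below the mean of the preceding `window` samples."""
    recent = deque(maxlen=window)
    drops = []
    for i, value in enumerate(samples):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Only drops below the mean count; a perfectly flat window
            # (sigma == 0) is skipped rather than flagging every change.
            if sigma > 0 and (mu - value) > threshold * sigma:
                drops.append(i)
        # Assumption: flagged samples still enter the window afterwards.
        recent.append(value)
    return drops


# Hypothetical transfer rates with a sharp drop at indices 10 and 11.
rates = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 40, 42, 99]
print(detect_drops(rates, window=10, threshold=1.5))
```

A larger window or threshold makes detection less sensitive, which is the trade-off behind the "key questions" bullet.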
Relative likelihood of coincidence • Simulate different numbers of hosts seeing events independently with a normally distributed probability • The more peers seeing an event at the same time, the less likely it occurs by coincidence • Five orders of magnitude between 3 peers and 9 peers corroborating an event
Newsight • Ground truth hard to come by • Can we crowdsource event labeling? • Make information available to community • http://aqualab.cs.northwestern.edu/projects/news/newsight.html