Unlocking Systems and Data: The Key to Network Management Innovation
Charles Kalmanek, V.P., Internet & Network Systems Research, AT&T Labs-Research
2006 IEEE/IFIP Network Operations and Management Symposium

Vision for IP Network Management
Goal: A robust, global, multi-service IP/MPLS network
• Approach
    • Manage the entire network, not network elements
    • Instrument the network, rely on direct correlation of real data
    • Model interactions to predict the effects of actions in advance
    • Automate as much as possible, audit results
[Figure: measure/control loop around the network. Labels: Design goals, policies; Network-wide model (auditing, “what-if,” etc.); measure; Offered Traffic, Routing, Fault; Topology, Configuration, Workflow; control; Provisioning, Changes to the Network; Network.]

Why It’s Hard
• Scale & Diversity Challenges
    • Large, distributed networks (100,000s of NEs)
    • Complex, diverse building blocks
    • Ongoing maintenance, spanning multiple time zones
    • Fragile IP network control planes
    • Complex software systems on top
• Constant change
    • Architectural change, new features & services, new protocols…
    • Customers join, leave, change/upgrade service
    • Network “events” – failures, migrations, upgrades, etc.
• Measurement and data challenges
    • Inadequate implementation of the basics
    • Data often locked up in NM systems “smokestacks”
    • Diverse data sources, with highly variable data quality
    • Limited direct measurements of causality
    • Inadequate ability to trace events across the network

Tier-1 Service Provider Network
[Figure: network diagram. Labels: DWDM systems; OC-48 or OC-192; Intercity PoP; Metro PoP; LEC; Access Network; Customer-facing PE interfaces; CE; CPE; Customer Network.]
Rough stats:
• 100s of offices
• 100s of Ps, 1000s of PEs, 10000s of CEs
• 100,000s of transport facilities
Legend:
• PoP: Point-of-Presence
• P: Backbone (core) Router
• PE: Provider Edge Router
• CE: Customer Edge Router
(Enterprise customer networks rival ISPs in size & complexity!)

Unlocking Network Data
• Measurement data is essential to running the network
    • Marketing and customer acquisition
    • Network and customer care
    • Network engineering and capacity management
    • Research to improve / evolve the network
• If you don’t have the data, you can’t design, manage, secure, or improve the network
• If you can’t evolve systems, you can’t evolve the network
• Example 1: Fault/performance management
• Example 2: Router Provisioning

Network Troubleshooting
• Goals
    • Automate the entire life cycle of event detection and repair for every performance-impacting event
        • Detect, Localize, Diagnose, Fix, Verify
    • Drive short- and long-term network, operations & systems improvements
    • Use forensics to reveal chronic events
• Systems and Tools
    • Active and passive performance monitoring
    • Each data source has its unique value and limitations
    • Maintenance and troubleshooting require correlation across multiple data sets
        • Associations of customers to access circuits, router interfaces, network policies, network elements, monitoring systems, …

Example: Cross-Layer Troubleshooting
• IP composite link: multiple SONET links combined together
    • Example: 5 OC-192s
• IP routing does not take bandwidth into account
• On component failure: how to decide between mechanisms to take traffic off the link, as a function of remaining capacity?
[Figure: logical IP link between NY and LA carrying 3 units of traffic; with only 1 unit of capacity remaining, the same 3 units of traffic cause congestion.]

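The capacity question on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the policy used in the talk: the `CompositeLink` class, the `action_on_member_failure` function, and the 80% utilization threshold are all hypothetical, and capacities are in the abstract "units" used in the figure.

```python
from dataclasses import dataclass


@dataclass
class CompositeLink:
    """A logical IP link built from several equal-capacity components."""
    name: str
    member_capacity: float  # capacity of one component, in abstract units
    members_total: int      # number of components when fully healthy
    members_up: int         # number of components currently in service

    @property
    def remaining_capacity(self) -> float:
        return self.member_capacity * self.members_up


def action_on_member_failure(link: CompositeLink, offered: float,
                             max_utilization: float = 0.8) -> str:
    """Return 'keep' or 'cost_out' for the link (hypothetical policy).

    The link stays in routing only if the offered traffic fits within
    max_utilization of the capacity that is still available.
    """
    if link.members_up == 0:
        return "cost_out"
    utilization = offered / link.remaining_capacity
    return "keep" if utilization <= max_utilization else "cost_out"
```

With the figure's numbers (3 units of traffic against 1 unit of remaining capacity) this sketch returns `"cost_out"`; with all five components up it returns `"keep"`. The point is only that the decision needs transport-layer state that IP routing by itself does not see.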
Example: Cross-Layer Troubleshooting (cont.)
• Detect:
    • Packet loss from active measurements for a set of PE pairs
• Localize/Diagnose:
    • Temporal correlation: PE-PE measurement alerts occurring at the same time as flapping on several composite link members
    • Spatial correlation: paths where packet loss occurs contain flapping composite link components (PE-PE measurements mapped to paths via route monitoring)
• Diagnose:
    • Congestion due to composite link component flapping
• Fix:
    • Short term: “cost out” the link
    • Permanent: repair failing components
• Verify:
    • Packet loss alerts disappear

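The localize step combines a time test with a path test. The following is a minimal sketch of that join, assuming alerts and flap events have already been collected and that route monitoring has produced a map from each PE pair to the links on its path. The record types, the `correlate` function, and the 60-second window are illustrative assumptions, not details from the talk.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class LossAlert:
    time: float                 # seconds since some common epoch
    pe_pair: Tuple[str, str]    # (source PE, destination PE)


@dataclass(frozen=True)
class FlapEvent:
    time: float
    link: str                   # composite link component that flapped


def correlate(alerts: List[LossAlert], flaps: List[FlapEvent],
              paths: Dict[Tuple[str, str], Set[str]],
              window_s: float = 60.0) -> Dict[LossAlert, List[FlapEvent]]:
    """Map each alert to the flaps that are close in time AND on its path."""
    result: Dict[LossAlert, List[FlapEvent]] = {}
    for alert in alerts:
        links_on_path = paths.get(alert.pe_pair, set())
        matches = [
            flap for flap in flaps
            if abs(flap.time - alert.time) <= window_s   # temporal test
            and flap.link in links_on_path               # spatial test
        ]
        if matches:
            result[alert] = matches
    return result
```

An alert that matches no flap is left out of the result, so it remains a candidate for some other diagnosis.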
Example: Chronic Control Plane Outage
• Detect
    • Active performance monitoring shows high loss at a PE
• Localize/Diagnose
    • Correlation of performance alerts, fault data, routing updates, configuration, and workflow logs reveals recurring pattern
    • OSPF sessions flap during customer provisioning on some PE platforms
    • Diagnosis: BGP starves OSPF processing on this class of PEs
• Fix
    • Short-term: process changes to control provisioning on this class of PE
    • Long-term: better OSPF and BGP process scheduler for PE
• Verify
    • High loss disappears at the PE

Data Distribution Problem
• Many, diverse data feeds required
    • Labor-intensive and error-prone to create and maintain each feed
    • Ad-hoc development to convert, copy, encrypt, & ingest the data
• Several groups with business-critical functions need network data
    • Stringent delivery requirements (security, timeliness, reliability)
• Customer data
    • Access: location, circuit ID, IP addresses, CE platform, LEC interface, layer 2 info (Frame Relay, Ethernet, DSL, Private Line, …), router info (hardware, software version)
    • Trouble tickets
    • Performance and SLA reports
    • Service orders
• Network data
    • Network inventory
    • Route monitors, BGP tables
    • SNMP link utilization & faults
    • Syslog info (status, health, events)
    • Active path monitoring
    • Netflow
    • Other: workflow, VoIP, transport

Data Correlation Framework
• Flexible data/systems architecture
    • Pluggable data-source specific collectors
    • Data distribution bus
    • Common real time and archival data store
    • Variety of network management applications on top
• Evolving domain knowledge
    • It’s an iterative process: exploratory data mining (EDM)
    • Apply statistical tools, visualization, “hunches,” …
    • Export results to “case manager” for analysis
• Diagnosis engines
    • Near real-time drill down, forensics
    • Temporal and spatial event clustering
    • Scalable statistical mechanisms to uncover correlations

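As one concrete reading of "temporal event clustering," the sketch below groups events into bursts by the gap between successive timestamps. The talk does not say which clustering method the diagnosis engines use, so this gap-based rule, the `cluster_by_time` name, and the 30-second default are assumptions chosen for simplicity.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, source or description)


def cluster_by_time(events: List[Event], max_gap_s: float = 30.0) -> List[List[Event]]:
    """Group events so that successive events in a cluster are <= max_gap_s apart."""
    clusters: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        last_time = clusters[-1][-1][0] if clusters else None
        if last_time is not None and event[0] - last_time <= max_gap_s:
            clusters[-1].append(event)   # continues the current burst
        else:
            clusters.append([event])     # starts a new burst
    return clusters
```

Each resulting cluster is a candidate "incident" that can then be narrowed spatially, for example by keeping only events on a shared path or network element.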
Data/Systems Architecture
• Data Distribution Bus
    • Publish/subscribe system handling all incoming data feeds
    • Supports multiple transport options, normalizes data to “standard” formats
    • Reliably delivers data to consumers
• Data Store Component
    • Efficient long-term storage of operational data
    • Automatic generation of schema, loading scripts, access scripts, data aging allowing non-DBAs to manage warehouse
Network data is available to multiple applications allowing auditing, correlation, reporting, EDM, …
[Figure: layered architecture. Labels, top to bottom: Network Topology I/F, Customer Portal, Internal Portal; GUIs; Real-time Network Mgt Applications, End-to-end Reporting Application, Surveillance Application, Planning Application; Data Store Component (DSC); Data Distribution Bus (DDB); OA&M; Active Probe Collector, L3 Control Plane Collector, SNMP Collector, Netflow Collector, Syslog Collector, CDR Collector.]

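To make the publish/subscribe idea concrete, here is a toy in-memory bus in Python. It is only a sketch of the pattern: the class and method names are hypothetical, the syslog line format is invented for the example, and none of the transport, security, or reliable-delivery properties listed on the slide are modeled.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


class DataDistributionBus:
    """In-memory publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self._normalizers: Dict[str, Callable[[Any], dict]] = {}

    def register_normalizer(self, feed: str, normalizer: Callable[[Any], dict]) -> None:
        """Install the function that converts raw feed data to a standard dict."""
        self._normalizers[feed] = normalizer

    def subscribe(self, feed: str, callback: Callable[[dict], None]) -> None:
        self._subscribers[feed].append(callback)

    def publish(self, feed: str, raw: Any) -> dict:
        """Normalize one raw item and hand it to every subscriber of the feed."""
        normalize = self._normalizers.get(feed, lambda r: {"raw": r})
        record = {"feed": feed, **normalize(raw)}
        for callback in self._subscribers[feed]:
            callback(record)
        return record


def normalize_syslog(line: str) -> dict:
    """Parse a hypothetical '<router> <message>' syslog line."""
    router, _, message = line.partition(" ")
    return {"router": router, "message": message}
```

A collector would call `publish("syslog", line)` once per message, and both a surveillance application and the data store could subscribe to the same feed. That sharing replaces the per-consumer, hand-built feeds described under the data distribution problem.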
Router Provisioning
• Goal: translate service intent to network reality
    • Get hardware & circuits to the right place at the right time
    • Access & update network inventory databases
    • Configure routers to establish and verify the service
• Challenges
    • Huge diversity at network element layer (dependencies on hardware & software versions, physical configuration, vendor, etc.)
    • Low level configuration languages, no abstraction layer, multiple ways of achieving the same thing
    • Config generator must consider hardware limitations, service definition, customer order info, additional customer info, etc.
    • Commercial tools offer limited customizability, only solve pieces of the problem
    • Initial provisioning is only part of the life cycle problem (network-wide changes, firmware mgt, auditing, CE-PE coordination, change requests, …)

Auditing Provisioning
• Detect/Fix Discords
    • Non-compliance to architectural intent
        • e.g., errors in route-maps for VPNs crossing routing domains
    • Config time-bombs
        • e.g., gaps in the ACL perimeter defense
• Additional Benefits
    • Assessment, Bootstrapping automation, Decision Support
• Technology
    • Parsers, Algorithms, Rules and Queries encoding domain expertise: e.g., ACL analysis
[Figure: audit loop. Labels: Router configuration; polled; Configuration File Analysis; Low level standard form (tables); Customer/network database; queries; Discords; fix.]

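The parse-to-tables-then-query flow can be sketched as follows. The example checks one simple rule, that every interface has an inbound ACL applied, as a stand-in for the "gaps in the ACL perimeter defense" case. The sample configuration is a simplified IOS-style fragment, and the parser, rule, and function names are assumptions for illustration rather than the tooling described in the talk.

```python
from typing import Dict, List, Optional

SAMPLE_CONFIG = """\
interface Serial0/0
 ip address 10.0.0.1 255.255.255.252
 ip access-group 101 in
!
interface Serial0/1
 ip address 10.0.0.5 255.255.255.252
!
"""


def parse_interfaces(config_text: str) -> List[Dict[str, Optional[str]]]:
    """Flatten the config into one table row (dict) per interface."""
    rows: List[Dict[str, Optional[str]]] = []
    current: Optional[Dict[str, Optional[str]]] = None
    for line in config_text.splitlines():
        if line.startswith("interface "):
            current = {"interface": line.split(None, 1)[1].strip(), "acl_in": None}
            rows.append(current)
        elif current is not None and line.startswith(" "):
            parts = line.split()
            # Record an inbound ACL binding: "ip access-group <acl> in"
            if parts[:2] == ["ip", "access-group"] and len(parts) >= 4 and parts[3] == "in":
                current["acl_in"] = parts[2]
        else:
            current = None  # left the interface stanza
    return rows


def find_missing_inbound_acl(rows: List[Dict[str, Optional[str]]]) -> List[str]:
    """Query the table for discords: interfaces with no inbound ACL."""
    return [row["interface"] for row in rows if row["acl_in"] is None]
```

Once the configuration is in tabular form, each piece of domain expertise becomes a separate query over the rows, and the customer/network database can be joined in to limit a rule to, say, customer-facing interfaces.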
Automated CPE Router Provisioning
• Technical Questionnaire (Service Level)
    • E.g., Web form
• Logic: allocations of ports, IP addresses, VRFs, …
• Device/service specific templates, with embedded variables and callouts to computations and databases
    • E.g., callouts for ports, IP addresses, ACL clauses, …
• Detailed Device Configuration commands, bundled as a “configlet” (Network Element Level)

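The "logic" stage assigns concrete resources before any template is filled in. As a small illustration, the sketch below allocates the next free /30 from an address pool using Python's standard `ipaddress` module. The pool, the in-memory `allocated` set (standing in for an inventory database), and the function name are hypothetical.

```python
import ipaddress
from typing import Set


def allocate_subnet(pool: str, allocated: Set[ipaddress.IPv4Network],
                    prefix_len: int = 30) -> ipaddress.IPv4Network:
    """Return the first unallocated subnet of the given size and mark it used."""
    for subnet in ipaddress.ip_network(pool).subnets(new_prefix=prefix_len):
        if subnet not in allocated:
            allocated.add(subnet)
            return subnet
    raise RuntimeError(f"address pool {pool} is exhausted")
```

The two usable host addresses of the returned /30 (`list(subnet.hosts())`) would then be placed in the template context as the PE-side and CE-side WAN interface addresses.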
Template-driven Config Generation
• Executing templates in a given context (stored in a database) produces configs, similar to code generation
• Evolves easily to integrate new features, router models, access types, resiliency options
• Eliminates errors, reduces holds
• Ensures conformance to engineering guidelines
Example: BGP configuration (slide callouts: “Context Substitution” and “Functional Substitution”)
    router bgp <BGP_1.CE_ASN>
    no synchronization
    bgp log-neighbor-changes
    network <WAN_IF_1.NETIP:computeIpMask_Netip(<WAN_IF_1.IF_IP>,255.255.255.252)> mask 255.255.255.252
    network <WAN_IF_2.NETIP:computeIpMask_Netip(<WAN_IF_2.IF_IP>,255.255.255.252)> mask 255.255.255.252
    network <ROUTER.LOOPBACKIP> mask 255.255.255.255

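A template like the one above can be executed with a small two-pass renderer, sketched here in Python. Several things are assumptions: that plain `<NAME>` placeholders are the "context substitution" and `<NAME:function(args)>` placeholders are the "functional substitution"; that `computeIpMask_Netip` returns the network address obtained by applying the mask to the IP; that the name before the colon is only a label for the computed value; and all context values. The actual template engine is not described in the talk.

```python
import ipaddress
import re

TEMPLATE = """\
router bgp <BGP_1.CE_ASN>
 no synchronization
 bgp log-neighbor-changes
 network <WAN_IF_1.NETIP:computeIpMask_Netip(<WAN_IF_1.IF_IP>,255.255.255.252)> mask 255.255.255.252
 network <ROUTER.LOOPBACKIP> mask 255.255.255.255
"""

# Hypothetical context; in the talk this is stored in a database.
CONTEXT = {
    "BGP_1.CE_ASN": 65001,
    "WAN_IF_1.IF_IP": "10.1.1.2",
    "ROUTER.LOOPBACKIP": "192.168.0.1",
}


def compute_ip_mask_netip(ip: str, mask: str) -> str:
    """Assumed semantics: network address of ip under mask."""
    return str(ipaddress.ip_interface(f"{ip}/{mask}").network.network_address)


FUNCTIONS = {"computeIpMask_Netip": compute_ip_mask_netip}

VAR = re.compile(r"<([A-Za-z0-9_.]+)>")
CALL = re.compile(r"<([A-Za-z0-9_.]+):(\w+)\(([^()<>]*)\)>")


def render(template: str, context: dict) -> str:
    """Fill a template; raises KeyError on an unknown variable or function."""
    # Pass 1, context substitution: replace every plain <NAME> placeholder.
    text = VAR.sub(lambda m: str(context[m.group(1)]), template)

    # Pass 2, functional substitution: evaluate <NAME:function(arg, ...)>.
    def call(match: re.Match) -> str:
        args = [arg.strip() for arg in match.group(3).split(",")]
        return FUNCTIONS[match.group(2)](*args)

    return CALL.sub(call, text)
```

Calling `render(TEMPLATE, CONTEXT)` yields a config with no placeholders left. Because a missing variable raises an error instead of producing a partial config, this style of generation supports the "eliminates errors" claim, and adding a new feature means editing a template or registering a function rather than changing the engine.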
Conclusions
• Unlocking data and fault/performance management systems enables innovation
    • Exploratory data mining and data correlation are essential to forensics and network maintenance automation
    • Approach: Flexible data distribution and data storage architecture
• Unlocking provisioning systems enables innovation
    • Bottom-up analysis is a useful tool for discord-detection, etc.
    • Template-driven approach allows network engineering to add new network features without new systems development
• Challenges are legion…
    • How to overcome proprietary data models and systems that thwart forensics?
    • How to efficiently find needles in (massive) data haystacks?
    • How to raise the level of provisioning abstraction?
    • How to reduce the systems drag on network feature and architecture change?