
Unlocking Systems and Data: The Key to Network Management Innovation

Charles Kalmanek, Internet & Network Systems Research V.P., AT&T Labs-Research. 2006 IEEE/IFIP Network Operations and Management Symposium.


Presentation Transcript


  1. Unlocking Systems and Data: The Key to Network Management Innovation
     Charles Kalmanek, Internet & Network Systems Research V.P., AT&T Labs-Research
     2006 IEEE/IFIP Network Operations and Management Symposium

  2. Vision for IP Network Management
     Goal: A robust, global, multi-service IP/MPLS network
     • Approach
       • Manage the entire network, not network elements
       • Instrument the network, rely on direct correlation of real data
       • Model interactions to predict the effects of actions in advance
       • Automate as much as possible, audit results
     [Figure labels: Design goals, policies; Network-wide model (auditing, “what-if,” etc.); measure (Offered Traffic, Routing, Fault, Topology, Configuration, Workflow); control (Provisioning, Changes to the Network); Network]
     CRK

  3. Why It’s Hard
     • Scale & Diversity Challenges
       • Large, distributed networks (100,000s of NEs)
       • Complex, diverse building blocks
       • Ongoing maintenance, spanning multiple time zones
       • Fragile IP network control planes
       • Complex software systems on top
     • Constant change
       • Architectural change, new features & services, new protocols…
       • Customers join, leave, change/upgrade service
       • Network “events”: failures, migrations, upgrades, etc.
     • Measurement and data challenges
       • Inadequate implementation of the basics
       • Data often locked up in NM systems “smokestacks”
       • Diverse data sources, with highly variable data quality
       • Limited direct measurements of causality
       • Inadequate ability to trace events across the network

  4. Tier-1 Service Provider Network
     [Figure labels: DWDM systems (OC-48 or OC-192), DWDM, PoP, Intercity, Metro, LEC, Access Network, Customer facing PE interfaces, P, PE, CE, CPE, Customer Network]
     • Rough stats:
       • 100s of offices
       • 100s of Ps, 1000s of PEs, 10000s of CEs
       • 100,000s of transport facilities
     • Legend:
       • PoP: Point-of-Presence
       • P: Backbone (core) Router
       • PE: Provider Edge Router
       • CE: Customer Edge Router
     • Enterprise customer networks rival ISPs in size & complexity!

  5. Unlocking Network Data
     • Measurement data is essential to running the network
       • Marketing and customer acquisition
       • Network and customer care
       • Network engineering and capacity management
       • Research to improve / evolve the network
     • If you don’t have the data, you can’t design, manage, secure, or improve the network
     • If you can’t evolve systems, you can’t evolve the network
     • Example 1: Fault/performance management
     • Example 2: Router Provisioning

  6. Network Troubleshooting
     • Goals
       • Automate the entire life cycle of event detection and repair for every performance-impacting event
       • Detect, Localize, Diagnose, Fix, Verify
       • Drive short- and long-term network, operations & systems improvements
       • Use forensics to reveal chronic events
     • Systems and Tools
       • Active and passive performance monitoring
       • Each data source has its unique value and limitations
       • Maintenance and troubleshooting require correlation across multiple data sets
       • Associations of customers to access circuits, router interfaces, network policies, network elements, monitoring systems, …

  7. Example: Cross-Layer Troubleshooting
     • IP composite link: multiple SONET links combined together
       • Example: 5 OC192s
     • IP routing does not take bandwidth into account.
     • On component failure: how to decide between mechanisms to take traffic off the link, as a function of remaining capacity?
     [Figure: logical IP link between NY and LA carrying 3 units of traffic; with 1 unit of capacity remaining, the same 3 units of traffic cause congestion]
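
The question on this slide (keep the composite link in service, or take traffic off it, given the remaining capacity) can be sketched as a threshold check. This is only an illustration: the `CompositeLink` class, the `choose_action` function, and the rule "cost out when offered traffic exceeds remaining capacity" are assumptions made for the example, not the policy described in the talk.

```python
from dataclasses import dataclass


@dataclass
class CompositeLink:
    """A logical IP link bundled from equal-capacity members (hypothetical model)."""
    members_total: int
    members_up: int
    member_capacity: float  # capacity units per member link

    def remaining_capacity(self) -> float:
        return self.members_up * self.member_capacity


def choose_action(link: CompositeLink, offered_traffic: float, headroom: float = 1.0) -> str:
    """Return "keep" or "cost_out" using a simple threshold rule (an assumption)."""
    if link.members_up == 0:
        return "cost_out"
    usable = link.remaining_capacity() * headroom
    return "keep" if offered_traffic <= usable else "cost_out"


# Numbers loosely follow the slide's figure: 3 units of traffic, 1 unit of capacity left.
degraded = CompositeLink(members_total=5, members_up=1, member_capacity=1.0)
print(choose_action(degraded, offered_traffic=3.0))  # prints "cost_out"
```

A `headroom` below 1.0 would take the link out of service earlier, before it is fully loaded; the right value depends on traffic engineering targets that the slides do not state.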

  8. Example: Cross-Layer Troubleshooting (cont.)
     • Detect:
       • Packet loss from active measurements for a set of PE pairs
     • Localize/Diagnose:
       • Temporal correlation: PE-PE measurement alerts occurring at the same time as flapping on several composite link members
       • Spatial correlation: paths where packet loss occurs contain flapping composite link components (PE-PE measurements mapped to paths via route monitoring)
     • Diagnose:
       • Congestion due to composite link component flapping
     • Fix:
       • Short term: “cost out” the link
       • Permanent: repair failing components
     • Verify:
       • Packet loss alerts disappear
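
The temporal and spatial correlation steps on this slide can be combined in a small join: a loss alert is attributed to a link only if that link flapped close in time and lies on the alert's PE-to-PE path. The sketch below is a minimal illustration; the tuple layouts, the `correlate` function, the link names, and the 60-second window are all assumptions, and real path data would come from route monitoring.

```python
from typing import Dict, List, Tuple

Alert = Tuple[float, Tuple[str, str]]  # (timestamp in seconds, (source PE, destination PE))
Flap = Tuple[float, str]               # (timestamp in seconds, composite link name)


def correlate(alerts: List[Alert], flaps: List[Flap],
              paths: Dict[Tuple[str, str], List[str]],
              window_s: float = 60.0) -> Dict[str, int]:
    """Count loss alerts that have a flap close in time on a link of the alert's path."""
    hits: Dict[str, int] = {}
    for alert_ts, pe_pair in alerts:
        links_on_path = set(paths.get(pe_pair, []))      # spatial filter
        matched = {link for flap_ts, link in flaps
                   if link in links_on_path
                   and abs(flap_ts - alert_ts) <= window_s}  # temporal filter
        for link in matched:
            hits[link] = hits.get(link, 0) + 1
    return hits


alerts = [(100.0, ("pe1", "pe2")), (130.0, ("pe3", "pe4")), (5000.0, ("pe1", "pe2"))]
flaps = [(95.0, "NY-LA"), (120.0, "NY-LA"), (4000.0, "CHI-DAL")]
paths = {("pe1", "pe2"): ["NY-LA"], ("pe3", "pe4"): ["NY-LA", "CHI-DAL"]}
print(correlate(alerts, flaps, paths))  # prints {'NY-LA': 2}
```

Links that accumulate many hits are candidates for the "Diagnose" step; a count alone does not establish that the flapping caused the loss.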

  9. Example: Chronic Control Plane Outage
     • Detect
       • Active performance monitoring shows high loss at a PE
     • Localize/Diagnose
       • Correlation of performance alerts, fault data, routing updates, configuration, and workflow logs reveals recurring pattern
       • OSPF sessions flap during customer provisioning on some PE platforms
       • Diagnosis: BGP starves OSPF processing on this class of PEs
     • Fix
       • Short-term: process changes to control provisioning on this class of PE
       • Long-term: better OSPF and BGP process scheduler for PE
     • Verify
       • High loss disappears at the PE
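
One way to surface the recurring pattern on this slide is to measure how many OSPF flaps on a PE fall inside provisioning windows taken from workflow logs. The following is a rough sketch under assumed inputs: the function name, the timestamp lists, and the window tuples are hypothetical and stand in for whatever the fault and workflow feeds actually provide.

```python
from typing import List, Tuple

Window = Tuple[float, float]  # (start, end) of one provisioning activity, in seconds


def flap_share_during_provisioning(flaps: List[float], windows: List[Window]) -> float:
    """Fraction of flap timestamps that fall inside any provisioning window."""
    if not flaps:
        return 0.0
    inside = sum(1 for ts in flaps
                 if any(start <= ts <= end for start, end in windows))
    return inside / len(flaps)


ospf_flaps = [105.0, 110.0, 900.0, 2010.0]
provisioning = [(100.0, 200.0), (2000.0, 2100.0)]
print(flap_share_during_provisioning(ospf_flaps, provisioning))  # prints 0.75
```

A high share on one class of PE and a low share on others would point toward a platform-specific interaction, which then needs the kind of diagnosis the slide describes; the ratio by itself is only a hint.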

  10. Data Distribution Problem
     • Many, diverse data feeds required
       • Labor-intensive and error-prone to create and maintain each feed
       • Ad-hoc development to convert, copy, encrypt, & ingest the data
     • Several groups with business critical functions need network data
       • Stringent delivery requirements (security, timeliness, reliability)
     • Customer data
       • Access: location, circuit ID, IP addresses, CE platform, LEC interface, layer 2 info (Frame Relay, Ethernet, DSL, Private Line,…), router info (hardware, software version)
       • Trouble tickets
       • Performance and SLA reports
       • Service orders
     • Network data
       • Network inventory
       • Route monitors, BGP tables
       • SNMP link utilization & faults
       • Syslog info (status, health, events)
       • Active path monitoring
       • Netflow
       • Other: workflow, VoIP, transport

  11. Data Correlation Framework
     • Flexible data/systems architecture
       • Pluggable data-source specific collectors
       • Data distribution bus
       • Common real time and archival data store
       • Variety of network management applications on top
     • Evolving domain knowledge
       • It’s an iterative process: exploratory data mining (EDM)
       • Apply statistical tools, visualization, “hunches,” …
       • Export results to “case manager” for analysis
     • Diagnosis engines
       • Near real-time drill down, forensics
       • Temporal and spatial event clustering
       • Scalable statistical mechanisms to uncover correlations
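
Temporal event clustering, listed above under diagnosis engines, can be illustrated with a simple gap-based grouping: events whose timestamps are within a chosen gap of the previous event join the same cluster. This is a generic technique used here as a sketch; the talk does not say which clustering method its diagnosis engines use, and the `cluster_by_time` name and 30-second gap are assumptions.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, description)


def cluster_by_time(events: List[Event], max_gap_s: float = 30.0) -> List[List[Event]]:
    """Group events so that consecutive events in a cluster are at most max_gap_s apart."""
    clusters: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        if clusters and event[0] - clusters[-1][-1][0] <= max_gap_s:
            clusters[-1].append(event)   # close enough to the previous event
        else:
            clusters.append([event])     # start a new cluster
    return clusters


events = [(0.0, "link flap"), (10.0, "loss alert"), (100.0, "OSPF down"),
          (120.0, "loss alert"), (500.0, "config change")]
print([len(c) for c in cluster_by_time(events)])  # prints [2, 2, 1]
```

Spatial clustering would add a second key, such as the router or path an event belongs to, before or after the time grouping.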

  12. Data/Systems Architecture
     • Data Distribution Bus
       • Publish/subscribe system handling all incoming data feeds
       • Supports multiple transport options, normalizes data to “standard” formats
       • Reliably delivers data to consumers
     • Data Store Component
       • Efficient long-term storage of operational data
       • Automatic generation of schema, loading scripts, access scripts, data aging allowing non-DBAs to manage warehouse
     • Network data is available to multiple applications allowing auditing, correlation, reporting, EDM, …
     [Figure labels: Network Topology I/F, Customer Portal, Internal Portal, GUIs; Real-time Network Mgt Applications, End-to-end Reporting Application, Surveillance Application, Planning Application; Data Store Component (DSC); Data Distribution Bus (DDB); OA&M; Active Probe Collector, L3 Control Plane Collector, SNMP Collector, Netflow Collector, Syslog Collector, CDR Collector]
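
The publish/subscribe role of the data distribution bus can be sketched in a few lines. This is an in-memory toy, not the system from the talk: the `DataDistributionBus` class, its method names, and the `host|tag|message` syslog line format are all invented for illustration, and reliable delivery, transports, and security are not modeled.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Record = Dict[str, str]
Normalizer = Callable[[str], Record]
Handler = Callable[[Record], None]


class DataDistributionBus:
    """In-memory publish/subscribe bus (illustrative; no persistence or retries)."""

    def __init__(self) -> None:
        self._normalizers: Dict[str, Normalizer] = {}
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def register_feed(self, feed: str, normalizer: Normalizer) -> None:
        self._normalizers[feed] = normalizer

    def subscribe(self, feed: str, handler: Handler) -> None:
        self._subscribers[feed].append(handler)

    def publish(self, feed: str, raw: str) -> int:
        """Normalize one raw message, deliver it, and return the number of consumers reached."""
        record = dict(self._normalizers[feed](raw), feed=feed)
        handlers = self._subscribers.get(feed, [])
        for handler in handlers:
            handler(record)
        return len(handlers)


def normalize_syslog(raw: str) -> Record:
    # Hypothetical "host|tag|message" line format, chosen only for this example.
    host, tag, message = raw.split("|", 2)
    return {"host": host, "tag": tag, "message": message}


bus = DataDistributionBus()
bus.register_feed("syslog", normalize_syslog)
surveillance: List[Record] = []
reporting: List[Record] = []
bus.subscribe("syslog", surveillance.append)
bus.subscribe("syslog", reporting.append)
delivered = bus.publish("syslog", "pe1|LINK-3-UPDOWN|Interface Serial0/0 changed state to down")
print(delivered, surveillance[0]["host"])  # prints "2 pe1"
```

The point of the design is that a collector is written once per data source, and each new application only adds a subscription instead of a new point-to-point feed.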

  13. Router Provisioning
     • Goal: translate service intent to network reality
       • Get hardware & circuits to the right place at the right time
       • Access & update network inventory databases
       • Configure routers to establish and verify the service
     • Challenges
       • Huge diversity at network element layer (dependencies on hardware & software versions, physical configuration, vendor, etc.)
       • Low level configuration languages, no abstraction layer, multiple ways of achieving the same thing
       • Config generator must consider hardware limitations, service definition, customer order info, additional customer info, etc.
       • Commercial tools offer limited customizability, only solve pieces of the problem
       • Initial provisioning is only part of the life cycle problem (network-wide changes, firmware mgt, auditing, CE-PE coordination, change requests, …)

  14. Auditing Provisioning
     • Detect/Fix Discords
       • Non-compliance to architectural intent, e.g., errors in route-maps for VPNs crossing routing domains
       • Config time-bombs, e.g., gaps in the ACL perimeter defense
     • Additional Benefits
       • Assessment, Bootstrapping automation, Decision Support
     • Technology
       • Parsers, Algorithms, Rules and Queries encoding domain expertise, e.g., ACL analysis
     [Figure labels: Router configuration, polled, Configuration File Analysis, Low level standard form (tables), queries, Customer/network database, Discords, fix]
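
The audit flow on this slide (parse router configurations into a low-level tabular form, then run queries that encode domain rules) can be sketched for one rule: every customer-facing interface should have an inbound ACL. The config below uses simplified IOS-style syntax, and the `parse_interfaces` and `find_acl_gaps` functions, the interface names, and the set standing in for the customer/network database are all assumptions for the example.

```python
import re
from typing import Dict, List, Optional, Set


def parse_interfaces(config: str) -> Dict[str, Optional[str]]:
    """Map each interface name to its inbound ACL name, or None if none is applied."""
    table: Dict[str, Optional[str]] = {}
    current: Optional[str] = None
    for line in config.splitlines():
        start = re.match(r"^interface\s+(\S+)", line)
        if start:
            current = start.group(1)
            table[current] = None
            continue
        if not line.startswith(" "):  # any unindented line ends the interface stanza
            current = None
            continue
        acl = re.match(r"^\s+ip access-group\s+(\S+)\s+in\b", line)
        if acl and current is not None:
            table[current] = acl.group(1)
    return table


def find_acl_gaps(config: str, customer_facing: Set[str]) -> List[str]:
    """Return customer-facing interfaces with no inbound ACL (or missing from the config)."""
    table = parse_interfaces(config)
    return sorted(name for name in customer_facing if table.get(name) is None)


config = "\n".join([
    "interface Serial0/0",
    " ip address 10.0.0.1 255.255.255.252",
    " ip access-group EDGE-IN in",
    "!",
    "interface Serial0/1",
    " ip address 10.0.0.5 255.255.255.252",
    "!",
    "interface Loopback0",
    " ip address 192.0.2.1 255.255.255.255",
])
print(find_acl_gaps(config, {"Serial0/0", "Serial0/1"}))  # prints ['Serial0/1']
```

Each further rule would be another query over the same parsed table, which is what makes the tabular intermediate form useful.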

  15. Automated CPE Router Provisioning
     • Technical Questionnaire (Service Level)
       • E.g., Web form
     • Logic: allocations of ports, IP addresses, VRFs, …
     • Device/service specific templates, with embedded variables and callouts to computations and databases
       • E.g., callouts for ports, IP addresses, ACL clauses, …
     • Detailed Device Configuration commands, bundled as a “configlet” (Network Element Level)

  16. Template-driven Config Generation
     • Executing templates in a given context (stored in a database) produces configs, similar to code generation
     • Evolves easily to integrate new features, router models, access types, resiliency options
     • Eliminates errors, reduces holds
     • Ensures conformance to engineering guidelines
     Example: BGP configuration (slide annotations: Context Substitution, Functional Substitution)
         router bgp <BGP_1.CE_ASN>
          no synchronization
          bgp log-neighbor-changes
          network <WAN_IF_1.NETIP:computeIpMask_Netip(<WAN_IF_1.IF_IP>,255.255.255.252)> mask 255.255.255.252
          network <WAN_IF_2.NETIP:computeIpMask_Netip(<WAN_IF_2.IF_IP>,255.255.255.252)> mask 255.255.255.252
          network <ROUTER.LOOPBACKIP> mask 255.255.255.255
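
A template like the BGP example on this slide can be expanded in two passes: first replace plain `<NAME.FIELD>` placeholders from a context, then evaluate placeholders that call a function. The sketch below is one reading of the slide's syntax, not the actual tool: the `render` function, the regular expressions, the context values, and the interpretation of `computeIpMask_Netip` as "apply the mask to the address and return the network address" are assumptions.

```python
import ipaddress
import re
from typing import Callable, Dict


def compute_ip_mask_netip(ip: str, mask: str) -> str:
    """Assumed meaning of computeIpMask_Netip: the network address of ip under mask."""
    return str(ipaddress.ip_network(f"{ip}/{mask}", strict=False).network_address)


FUNCTIONS: Dict[str, Callable[..., str]] = {"computeIpMask_Netip": compute_ip_mask_netip}

# <NAME.FIELD> with no function call: looked up directly in the context.
CONTEXT_RE = re.compile(r"<([A-Z0-9_]+\.[A-Z0-9_]+)>")
# <NAME.FIELD:func(arg1,arg2)> once the inner placeholders have been filled in.
FUNCTION_RE = re.compile(r"<[A-Z0-9_.]+:(\w+)\(([^()<>]*)\)>")


def render(template: str, context: Dict[str, str]) -> str:
    """Expand context placeholders, then function-call placeholders."""
    filled = CONTEXT_RE.sub(lambda m: context[m.group(1)], template)

    def call(match: "re.Match[str]") -> str:
        args = [arg.strip() for arg in match.group(2).split(",")]
        return FUNCTIONS[match.group(1)](*args)

    return FUNCTION_RE.sub(call, filled)


template = "\n".join([
    "router bgp <BGP_1.CE_ASN>",
    " network <WAN_IF_1.NETIP:computeIpMask_Netip(<WAN_IF_1.IF_IP>,255.255.255.252)> mask 255.255.255.252",
    " network <ROUTER.LOOPBACKIP> mask 255.255.255.255",
])
# Hypothetical context values; in the talk the context is stored in a database.
context = {"BGP_1.CE_ASN": "65001", "WAN_IF_1.IF_IP": "10.1.1.2", "ROUTER.LOOPBACKIP": "192.0.2.1"}
print(render(template, context))
```

With these values the second line renders as ` network 10.1.1.0 mask 255.255.255.252`. Adding a router feature then means adding a template or a function to `FUNCTIONS`, not changing the rendering code.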

  17. Conclusions
     • Unlocking data and fault/performance management systems enables innovation
       • Exploratory data mining and data correlation are essential to forensics and network maintenance automation
       • Approach: Flexible data distribution and data storage architecture
     • Unlocking provisioning systems enables innovation
       • Bottom-up analysis is a useful tool for discord-detection, etc.
       • Template driven approach allows network engineering to add new network features without new systems development
     • Challenges are legion…
       • How to overcome proprietary data models, systems thwarting forensics?
       • How to efficiently find needles in (massive) data haystacks?
       • How to raise the level of provisioning abstraction?
       • How to reduce the systems drag on network feature and architecture change?
