1. When Technology Falters: The CareGroup Network Outage John D. Halamka MD
CIO, CareGroup
CIO, Harvard Medical School
2. Agenda In-depth overview of the Network Outage
Key Lessons
The Sequel – SQL Slammer
Questions and Answers
3. CareGroup Network as Built
4. Timeline November 13, 2002 1:45pm
Napster-like internal attack
Change begins, redundant links cut
Callisma and Cisco on site
November 14, 2002
Spanning tree issues
WAN issues
CAP declared at 4:00pm
5. Core Switch Utilization
6. Timeline November 15, 2002
PACS Rebuild
Research/Cardiology rebuild
Reboot of core and distribution layer
November 16, 2002
VLAN mismatch
Redundant Core built as contingency
7. Core Switch Utilization
8. Root Cause Analysis CareGroup Network grew organically through mergers and acquisitions into a massive bridged, switched network that was not within Spanning Tree specifications
Equipment was not life cycle managed
Router/switch configuration was not in accordance with best practices, e.g. use of multicast dense mode
9. Spanning Tree Problems When TAC was first able to access and assess the network, we found the Layer 2 structure of the network to be unstable and out of specification with the 802.1D standard. The management VLAN (VLAN 1) was, in some locations, 10 Layer 2 hops from the root.
The conservative default values for the Spanning Tree Protocol (STP) impose a maximum network diameter of seven. This means that no two bridges in the network should be more than seven hops away from each other.
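The diameter rule above can be checked mechanically against an inventory of switch-to-switch links. The following is a minimal sketch, not a Cisco tool: the function names, the link list, and the switch names are hypothetical, and it assumes every link has equal cost so that a breadth-first search approximates the hop count from the root bridge.

```python
from collections import deque

# Limit taken from the slide: default STP timers assume a diameter of seven.
STP_MAX_DIAMETER = 7


def hops_from(root, links):
    """Return {switch: hop count from root} using breadth-first search.

    links is an iterable of (switch_a, switch_b) pairs, treated as undirected.
    Equal link cost is an assumption made for this illustration.
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    hops = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return hops


def switches_beyond_limit(root, links, limit=STP_MAX_DIAMETER):
    """List (switch, hops) pairs that sit more than `limit` hops from root."""
    too_far = [(sw, h) for sw, h in hops_from(root, links).items() if h > limit]
    return sorted(too_far, key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical daisy chain of 11 switches: the last one is 10 hops from root.
    chain = [(f"sw{i}", f"sw{i + 1}") for i in range(10)]
    for switch, hops in switches_beyond_limit("sw0", chain):
        print(f"{switch} is {hops} hops from root (limit {STP_MAX_DIAMETER})")
```

Running a check like this against as-built documentation is one way to notice a growing Layer 2 domain before the protocol becomes unstable.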
10. Key Lessons Partner with your network vendor
Encourage external audits of your network
Engage advanced engineering services
Avoid senior management blind spots
11. Key Lessons Avoid flat topology bridged switched networks.
13. Key Lessons Re-evaluate the enterprise architecture of your network
Routed core
Switched distribution and access layers
Robust Firewall
14. Key Lessons Life Cycle Manage your network
Eliminate Legacy Protocols
Recognize the value of new feature sets
Hardware must keep up with the demands of a changing organization – video over IP, IP telephony, bioinformatics, image management
15. Key Lessons Implement appropriate monitoring and diagnostic tools to maintain the health and hygiene of your network
Concord
NATKit
CiscoWorks
OpenView
16. Key Lessons Have a robust downtime plan
Out of band diagnostics
Dial up modems and computers in key clinical areas
Overview of CareGroup Disaster Recovery plan
17. Service Objectives
18. Protection Features
19. Protection Features
20. Protection Techniques Cost versus Benefit
21. Protection Techniques by Vulnerability
22. Key Lessons Implement Strict Change Control
Standards, configurations, devices, protocols, links, processes, procedures, or services
Prior review and approval of all network infrastructure changes
Multi-discipline membership
Changes classed as substantial, moderate, or minimal impact
23. Key Lessons Implement Strict Change Control (cont.)
Substantial changes require Cisco AES review
Changes scheduled 2am – 5am weekends
Changes require baseline, testing, and recovery plans
As-Built documentation to include overall, physical and logical diagrams
NCCB recommends expense allocation
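The change control rules on the two slides above lend themselves to an automated pre-check before a request reaches the review board. This is a minimal sketch under stated assumptions: the `ChangeRequest` fields and function names are hypothetical, "weekends" is read as Saturday or Sunday, and the 2am to 5am window is assumed to fall within a single calendar day.

```python
from dataclasses import dataclass
from datetime import datetime, time

IMPACT_CLASSES = ("substantial", "moderate", "minimal")
WINDOW_START = time(2, 0)  # 2am
WINDOW_END = time(5, 0)    # 5am


@dataclass
class ChangeRequest:
    """Hypothetical record of a proposed network infrastructure change."""
    summary: str
    impact: str
    start: datetime
    end: datetime
    has_baseline: bool = False
    has_test_plan: bool = False
    has_recovery_plan: bool = False
    aes_reviewed: bool = False


def in_change_window(start, end):
    """True if the change runs entirely inside 2am-5am on a Saturday or Sunday."""
    same_day = start.date() == end.date()
    weekend = start.weekday() >= 5  # Monday is 0, so 5 and 6 are the weekend
    return (same_day and weekend and start < end
            and start.time() >= WINDOW_START and end.time() <= WINDOW_END)


def change_violations(cr):
    """Return a list of rule violations; an empty list means the request passes."""
    problems = []
    if cr.impact not in IMPACT_CLASSES:
        problems.append("impact class must be substantial, moderate, or minimal")
    if cr.impact == "substantial" and not cr.aes_reviewed:
        problems.append("substantial change lacks Cisco AES review")
    if not in_change_window(cr.start, cr.end):
        problems.append("outside the 2am-5am weekend window")
    for present, name in ((cr.has_baseline, "baseline"),
                          (cr.has_test_plan, "test plan"),
                          (cr.has_recovery_plan, "recovery plan")):
        if not present:
            problems.append(f"missing {name}")
    return problems
```

A check of this kind does not replace prior review and approval by the board; it only filters out requests that are incomplete on their face.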
24. The Sequel – SQL Slammer Released at 12:30am on January 25, 2003
Infected East Coast at 12:40am
Microsoft SQL Server 2000 had been patched; however, Microsoft did not issue any patches or security warnings for Microsoft Data Engine 2000 (MSDE), which is included with numerous desktop products
25. Spread of the Worm
27. Exact effect on CareGroup MSDE and non-IS maintained databases infected
Network saturated by worm activity
Shut off links to Research areas
Blocked all traffic from the public internet
Network traffic levels returned to normal
28. Cleanup Restart of servers and desktops that were disrupted by the outage
Once all research areas had cleaned their desktops, we restored port 1433 connectivity
29. Further Lessons learned VPN as a security risk
Implement a scanning program to analyze research desktop and server vulnerabilities
Ensure you have modern network equipment that affords you the tools to control intra-VLAN traffic
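A scanning program of the kind recommended above could begin with something as simple as finding machines that accept connections on a database port. This is a minimal sketch, not the tool CareGroup used: the function names and host list are hypothetical, and it assumes TCP port 1433 (the port named on the cleanup slide) is the service of interest. It only detects a listening port and does not assess patch level.

```python
import socket


def tcp_port_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all as "not listening".
        return False


def find_exposed_hosts(hosts, port=1433, timeout=0.5, probe=tcp_port_open):
    """Return the hosts from `hosts` that accept connections on `port`.

    `probe` is injectable so the logic can be exercised without a network.
    """
    return [host for host in hosts if probe(host, port, timeout)]


if __name__ == "__main__":
    # Hypothetical research subnet addresses, for illustration only.
    candidates = ["192.0.2.10", "192.0.2.11"]
    for host in find_exposed_hosts(candidates):
        print(f"{host} is listening on TCP 1433")
```

The resulting list can be compared against the databases that IS knowingly maintains, which surfaces the desktop-embedded and non-IS-maintained instances described on the earlier slide.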
30. Conclusions Lifecycle manage your network just as you would your desktop
Ensure senior management understands the value of the network as a strategic asset
Build great downtime procedures including out of band connectivity just in case the technology falters