
Customer Debrief Shared Services Email Major Incidents January thru September


Presentation Transcript


1. Customer Debrief Shared Services Email Major Incidents January thru September
October 23, 2013

2. Agenda
• June Incident
• Root Cause Analysis
• Other Incidents
• SLA Reporting
• Follow Up
• Q & A

3. June Major Incident
• Began Friday, June 7 at approximately 8:40 AM
• Services impacted were Shared Services Email; typical reports were slow email, Outlook issues, and email unavailable
• Root cause was determined on June 11 through CTS and vendor analysis
• Issue was mitigated Tuesday, June 11 at approximately 8 PM

4. Root Cause Analysis
• A CTS customer with several thousand mailboxes was "NAT'ing" (Network Address Translation) their site behind a single IP address
• The Load Balancer sent all mailboxes from this one IP address to the same Client Access Server (CAS) — see the sketch below
• In turn, the CAS makes calls to Active Directory (AD) for directory services; the CAS-to-AD relationship is a persistent ("sticky") session
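As a rough illustration of the behavior described on this slide, the Python sketch below shows how source-IP persistence pins every client behind a single NAT address to one CAS, while clients with distinct addresses spread across the pool. This is a minimal sketch of the general technique, not CTS's actual load balancer configuration; the CAS names, pool size, and IP addresses are invented for the example.

```python
# Minimal sketch (assumed behavior, not the actual CTS load balancer) of
# source-IP persistence: each client is assigned a Client Access Server (CAS)
# based on its source IP, so mailboxes NAT'ed behind one address all land on
# the same CAS. Pool names and addresses are hypothetical.
from collections import Counter

CAS_POOL = ["CAS-01", "CAS-02", "CAS-03", "CAS-04"]  # hypothetical CAS farm

def pick_cas(source_ip: str) -> str:
    """Source-IP affinity: the same client IP always maps to the same CAS."""
    return CAS_POOL[hash(source_ip) % len(CAS_POOL)]

# A site with many distinct client IPs spreads roughly evenly across the pool.
normal_site = [f"10.1.{i // 250}.{i % 250}" for i in range(5000)]
print(Counter(pick_cas(ip) for ip in normal_site))

# Several thousand mailboxes NAT'ed behind a single IP all hit one CAS.
nat_site = ["198.51.100.7"] * 5000
print(Counter(pick_cas(ip) for ip in nat_site))
```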

5. Root Cause Analysis Cont'd
• The effect was an out-of-balance environment
• Normal load on the CAS/AD relationship is 9,000 to 12,000 sessions (note: one mailbox can open multiple sessions)
• In this case, the load spiked to as high as 20,000 sessions
• This caused queuing/delay as the CAS requested services from AD
• Solution: change the IP NAT'ing to span multiple IP addresses, thereby allowing the Load Balancer to balance the load (illustrated below)
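To put the slide's session numbers in context, here is a back-of-the-envelope Python sketch. Only the 9,000 to 12,000 normal range and the roughly 20,000-session spike come from the slide; the baseline load, mailbox count, sessions per mailbox, and CAS pool size are assumptions chosen to reproduce that spike.

```python
# Back-of-the-envelope check of the session numbers. Only the normal range and
# the ~20,000 spike come from the slide; everything else here is assumed.
NORMAL_RANGE = (9_000, 12_000)   # normal CAS-to-AD sessions per CAS (from slide)
BASELINE = 10_000                # assumed steady-state sessions already on each CAS
MAILBOXES = 5_000                # "several thousand mailboxes" (assumed count)
SESSIONS_PER_MAILBOX = 2         # one mailbox can open multiple sessions (assumed)
CAS_COUNT = 4                    # assumed size of the CAS pool

agency_sessions = MAILBOXES * SESSIONS_PER_MAILBOX   # 10,000 extra sessions

# Single NAT'ed IP: persistence pins every agency session to one CAS.
pinned_cas = BASELINE + agency_sessions
print(pinned_cas)                # 20,000, roughly the spike reported on the slide

# NAT spanned across multiple IPs: the balancer can spread the same sessions
# over the whole pool (assuming the addresses distribute evenly).
spread_cas = BASELINE + agency_sessions / CAS_COUNT
print(spread_cas)                # 12,500, back near the normal 9,000-12,000 band
```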

6. Root Cause Analysis Cont'd
• CTS notified the Customer Agency and began a plan to spread the Agency's mailbox load across multiple IPs
• This Agency was fully cooperative and implemented a large change within a week's time
• Agency change was complete on June 21
• CTS mitigation was removed
• Normal operation returned Monday, June 24

7. Other Incidents
• Monday, March 25: Email connectivity; rebooted CAS server
• Monday, April 22: Slow email; no cause found
• Tuesday, May 28 (Monday was Memorial Day): Email connectivity; no cause found

8. Other Incidents
• In each case the previous Sunday was "Patch Sunday"
• On Patch Sunday the CAS servers are cleared out, patched, and rebooted
• Therefore, on Monday as Customers came back to work, the Load Balancer began to redistribute the workload and the same NAT'ing issue occurred
• In hindsight these were all the same issue

9. Other Incidents

10. Follow Up
• Shared Services Email is a community environment and some exceptions have crept into the community
  • Cached vs. Online mode
  • Outlook best practices
  • Vault best practices
• The accumulation of exceptions can have a detrimental effect on the community as a whole
• CTS will review exceptions and work with Customers to realign with the community

  11. Questions and Answers
