Cerner Millennium System Stability: Getting to 99.99% Uptime • Eric Ried, Eric.Ried@Aurora.Org, Aurora Health Care, Inc., Milwaukee, Wisconsin • Steve Sonderman, Steve.Sonderman@Aurora.Org, Aurora Health Care, Inc., Milwaukee, Wisconsin
Aurora Health Care as an Organization • Largest private employer in the state of Wisconsin • 25,000 employees • 660 employed physicians • 3,400 physicians on staff • Comprises: • 13 hospitals • Over 100 clinics • Over 120 retail pharmacies • Homecare • Hospice • Other outpatient treatment centers
Cerner Millennium Installations Across Aurora • Facilities live on Millennium: • All 13 hospitals • 85 Clinics • Approximately 3.5 million patients in the record system.
Millennium Infrastructure at Aurora (Production Environment) • Hardware • 4 HP GS1280 Alpha Servers • Two 32-way • Two 24-way • Redundant configurations on application nodes. • 270 Citrix Servers • 8 Chart Servers • 3 RRD Servers (12 port total capacity) • 7 Multum Servers • Other servers include Document Imaging and BMDI
Millennium Infrastructure at Aurora (Production Environment) • Database • Oracle 9i • Current size of 6.0 TB • Applications • 6,500+ concurrent users during peak usage times
Stability History • Growth of Organization and Deployment schedule. • Aurora was growing as an organization. • Strong focus on rolling out the Electronic Health Record (Cerner Millennium) to integrate facilities. • To accommodate, numerous application and system changes were frequently made. • Frequent Service Packages taken for new functionality. • Operating in a “break/fix” mode. • Frequent system outages were occurring.
Stability History • July 2006 • Freeze placed on all non-essential Production Changes. • Change Control Process redesigned with creation of the Change Control Board (CCB) to oversee and approve all changes. • All deployment of new functionality, or of existing functionality to new facilities, was put on hold. • In-depth analysis of current stability issues began. • Set a short-term goal of 99.75% uptime.
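To put the uptime targets in perspective, the short sketch below converts an uptime percentage into an allowed downtime budget. It assumes uptime is measured over a full calendar year of continuous operation; the slides do not say how Aurora measured it, and the function name is illustrative only. Under that assumption, 99.75% allows roughly 21.9 hours of downtime per year, while 99.99% allows under an hour.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(uptime_percent: float,
                            period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the downtime (in minutes) allowed by an uptime target over a period.

    Assumption: the period is measured as continuous 24x7 operation.
    """
    return period_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for target in (99.75, 99.99):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% uptime allows about {minutes:.0f} minutes "
              f"({minutes / 60:.1f} hours) of downtime per year")
```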
Dedicated Production Monitoring • 7 AM – 6 PM Monitoring Monday – Friday • Production monitored by an Engineer/Administrator at all times throughout the business day. • Primary utilities used in Monitoring • SYSMON • MON PROC/TOPC • SPSMON • WATCH_QUOTA • BMC Patrol • Softek Panther (Panther Sensors)
Dedicated Production Monitoring • SYSMON: • Message queuing • Sharp drop in connections • Terminating servers • High connection count • MON PROC/TOPC: • All four nodes • Hung processes • High CPU usage (poor script performance)
Dedicated Production Monitoring • SPSMON: • Long running processes • Track down script in CCL • Correlate findings with Oracle statistics • Identify user trends
Dedicated Production Monitoring • Watch_Quota: • Monitored periodically throughout the business day • Quotas raised if trend is seen over time. • Reduces potential for memory resource issues.
Dedicated Production Monitoring • Graphs: • CPU Usage • Service Manager BG Device Count • Total BG Device Count • Monitor for sharp drops or increases in device count on one or all nodes. • Monitor for spikes in CPU usage.
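The "sharp drop or increase" check described on this slide can be illustrated with a small sketch. This is not how SYSMON, BMC Patrol, or Panther work internally; it is a hypothetical example that flags any sample differing from the previous one by more than a chosen percentage. The 30% threshold and the sample values are made up for illustration.

```python
def sharp_changes(samples, threshold_pct=30.0):
    """Return (index, percent_change) for each sample that moved more than
    threshold_pct relative to the previous sample.

    threshold_pct is a hypothetical tuning value, not one from the slides.
    """
    flagged = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev == 0:
            continue  # no baseline to compare against
        change = (cur - prev) / prev * 100
        if abs(change) > threshold_pct:
            flagged.append((i, round(change, 1)))
    return flagged


if __name__ == "__main__":
    # Made-up device counts sampled at regular intervals on one node.
    device_counts = [1200, 1210, 1195, 640, 650, 1190]
    for index, change in sharp_changes(device_counts):
        print(f"sample {index}: {change:+.1f}% versus previous sample")
```

The same function could be applied to CPU usage samples to flag spikes, with a threshold chosen to suit that metric.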
Softek Panther • Panther Sensors: • Shared Service Queue Backlog • Service Not Accepting Connections - Server Thrashing • Absence of Server (server deficit)
Redesigning the Change Control Process • Change Control Prior to the Production Freeze. • Planned changes were reviewed at a weekly meeting; however, they were implemented without going through a formal approval process. • Changes were often made at the discretion of the analysts and engineers. • Change windows were reserved for major system and application changes. • Server cycling was performed ad hoc, upon request.
Redesigning the Change Control Process • Transitioning from Change “Review” to true Change Control. • Change Control Board (CCB) was formed. • All application and technical changes examined for business need, priority, system impact, and risk, then categorized into change types. • Exempt • Pre-approved • Management (CCB) approval. • Distinct change windows created to minimize impact to the end user.
Redesigning the Change Control Process • Change Control Today • Standardized templates are used to gather information on the requested change to assist in identifying business need, priority, impact, and risk. • CCB meets daily Monday – Friday to review all requested (non-exempt and management approval) changes. • No changes made until approved and scheduled into a designated change window. • Changes that require server cycling are limited to certain windows.
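The gate described on the change control slides (three change types, and no change made until it is approved and scheduled into a window) can be sketched as a simple check. The class and field names below are hypothetical, and two points are assumptions rather than facts from the slides: that exempt changes skip both approval and scheduling, and that pre-approved changes still need a change window.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ChangeType(Enum):
    EXEMPT = "exempt"
    PRE_APPROVED = "pre-approved"
    MANAGEMENT_APPROVAL = "management (CCB) approval"


@dataclass
class ChangeRequest:
    summary: str
    change_type: ChangeType
    ccb_approved: bool = False
    change_window: Optional[str] = None  # None means not yet scheduled


def may_implement(change: ChangeRequest) -> bool:
    """Return True if the change may go into Production under the sketched rules."""
    if change.change_type is ChangeType.EXEMPT:
        return True  # assumption: exempt changes skip approval and scheduling
    approved = (change.ccb_approved
                or change.change_type is ChangeType.PRE_APPROVED)
    return approved and change.change_window is not None


if __name__ == "__main__":
    request = ChangeRequest("Cycle a server", ChangeType.MANAGEMENT_APPROVAL)
    print(may_implement(request))  # not approved, not scheduled
    request.ccb_approved = True
    request.change_window = "hypothetical evening window"
    print(may_implement(request))  # approved and scheduled
```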
Knowledge Sharing with Cerner • In August 2006, began working closely with Cerner on major system and stability issues. • A series of onsite meetings was held. • Began tracking all production issues and events that had a negative impact on the end user. • In cooperation with Cerner, identified and resolved major stability issues.
Service Outage Analysis • Tracking events • Hung processes, poorly performing scripts, Global Service Manager issues, Chart Server and RRD problems, Multum issues, network interruptions, etc. • Any other event that had a negative impact on system performance or on the end user. • Events tracked in an Excel spreadsheet.
Service Outage Analysis • Data Tracked:
Service Outage Analysis • A closer look at Impact: • NI (No Impact) • Events that do not directly impact system performance or end user experience (e.g., low quota on a middleware server, proactive pages for a low disk space warning). • IF (Incomplete Functionality) • A single function or component is not available or not working properly (e.g., RRD or BMDI is unavailable). • PD (Performance Degradation) • An event that causes a performance issue or impacts the end user, but all functions in the system are available (e.g., message queuing, server hangs using 100% of a CPU). • Outage • Millennium is unavailable or in a degraded state such that users are unable to function. At this point, the environment can be taken for diagnostics that require users to be off the system, and even shut down, to resolve the issue.
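The four impact levels lend themselves to a small enumeration, which makes tallying events by impact straightforward. The sketch below is illustrative only: the presenters tracked events in a spreadsheet, and the event descriptions here are made-up examples modeled on those given in the slide.

```python
from collections import Counter
from enum import Enum


class Impact(Enum):
    NI = "No Impact"
    IF = "Incomplete Functionality"
    PD = "Performance Degradation"
    OUTAGE = "Outage"


def tally_by_impact(events):
    """Count events per impact level, including levels with zero events."""
    counts = Counter(impact for _, impact in events)
    return {impact.name: counts.get(impact, 0) for impact in Impact}


if __name__ == "__main__":
    # Made-up (description, impact) pairs for illustration.
    events = [
        ("low disk space page", Impact.NI),
        ("RRD unavailable", Impact.IF),
        ("message queuing", Impact.PD),
        ("server hang at 100% CPU", Impact.PD),
    ]
    print(tally_by_impact(events))
```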
Service Outage Analysis • Events imported into an Access 2007 Database. • Weekly meetings held to review prior week’s events. • Like events linked to a common issue. • Issues examined for common trends and patterns (script, user, quota, behavior, etc…) • Issues assigned “To Do’s” for staff follow up. • Quick escalation to Cerner once issues are found.
Service Outage Analysis • SOA Database (Review of Daily Events): • Daily events imported into database. • Review event and link to a repeating issue (if applicable). • Update status.
Service Outage Analysis • SOA Database (Issue Maintenance): • Daily events that repeat and/or need follow-up tasks become an Issue. • Issues are tracked until a resolution is identified. • Once the resolution is implemented and the issue is verified to be resolved, it’s closed.
Service Outage Analysis • SOA Database (To Do’s): • Follow-up tasks are logged in the database and linked to the issue. • To Do reports are generated and sent to the necessary staff.
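The relationships described across the SOA Database slides (events linked to issues, To Do's linked to issues, issues closed only after the resolution is implemented and verified) can be sketched as a minimal data model. The original was an Access database whose schema is not shown, so every class, field, and function name below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToDo:
    task: str
    assignee: str
    done: bool = False


@dataclass
class Issue:
    title: str
    status: str = "open"
    event_ids: List[int] = field(default_factory=list)  # linked daily events
    todos: List[ToDo] = field(default_factory=list)

    def close_if_resolved(self, resolution_implemented: bool, verified: bool) -> bool:
        """Close the issue only when the fix is both implemented and verified."""
        if resolution_implemented and verified:
            self.status = "closed"
        return self.status == "closed"


def todo_report(issues: List[Issue]) -> Dict[str, List[str]]:
    """Group outstanding To Do's on open issues by the staff member assigned."""
    report: Dict[str, List[str]] = {}
    for issue in issues:
        if issue.status == "closed":
            continue
        for todo in issue.todos:
            if not todo.done:
                report.setdefault(todo.assignee, []).append(f"{issue.title}: {todo.task}")
    return report


if __name__ == "__main__":
    # Made-up issue, event IDs, and assignee for illustration.
    issue = Issue("Chart Server memory errors", event_ids=[101, 107])
    issue.todos.append(ToDo("Review distribution logs", "engineer_a"))
    print(todo_report([issue]))
    issue.close_if_resolved(resolution_implemented=True, verified=True)
    print(todo_report([issue]))
```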
Service Outage Analysis • Middleware Tuning (Hardening) • In-depth look into server configurations. • Paging File Limits • Number of instances • Kill times • Request class routing • Routed poorly performing or unstable scripts to dedicated servers (i.e., scripts that frequently hang or terminate servers). • Less impact on users, by isolating the above types of scripts to dedicated servers. • Installed a Service Package to correct an issue with the Clinical Event Server hanging, along with the Code Cache Synchronization package. • Became proactive rather than reactive.
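The idea behind routing unstable scripts to dedicated servers can be illustrated with a toy dispatcher. This is not Cerner's request class routing configuration, which the slides do not show; the pool names and script names below are invented purely to show the isolation principle, where a hang on the dedicated pool leaves the shared pool unaffected.

```python
DEDICATED_POOL = "dedicated"  # hypothetical pool for scripts known to hang
SHARED_POOL = "shared"        # hypothetical pool serving everything else


def route_request(script_name, unstable_scripts):
    """Pick a server pool for a script, isolating known-unstable ones.

    unstable_scripts is a set of lowercase script names identified through
    trend analysis (made-up names in the example below).
    """
    if script_name.lower() in unstable_scripts:
        return DEDICATED_POOL
    return SHARED_POOL


if __name__ == "__main__":
    unstable = {"slow_report_a", "hanging_query_b"}  # invented examples
    for name in ("SLOW_REPORT_A", "routine_lookup"):
        print(name, "->", route_request(name, unstable))
```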
Service Outage Analysis • Other Tuning (Hardening) • Chart Servers were consistently having memory resource issues, causing distributions to error or hang in-process. • Implemented an auto-reboot schedule for the Chart Servers. • Multum interaction/reaction requests were slow and would often hang. • Increased the number of Multum Servers and installed a corrective component fix. • User processes caused issues with Middleware processes. • Through trend analysis, identified specific tasks users were performing that caused inefficient calls to the database, servers to hang, etc... • Worked with application teams and the end user to modify their process and even preferences to prevent the issues from occurring.
Post-Production Freeze • Since the Production Freeze was lifted at the end of August 2006, we’ve accomplished the following: • Code upgrade from 2005.02.24 to 2005.02.53 • Achieved 99.99% uptime. • Provided a stable Millennium system for our end users. • Secured our ability to continue with scheduled deployments and implementing new functionality, thus meeting our strategic goals.
Current State • No unplanned Millennium downtime since mid-August 2006. • How we got here: • Dedicated Production Monitoring • Redesigning Change Control • Knowledge Sharing with Cerner • Service Outage Analysis
Current State • No unplanned Millennium downtime since mid-August 2006. • How we got here (continued): • Focus shifted from strong deployment schedule to stability. • Maintaining a stable code base (commitment to the same code base for one year with minimal exception packages).
Wrapping up… • Questions?
Wrapping up… • Contact Information: • Eric Ried, Supervisor, Cerner Technical Support, Aurora Health Care, 414.647.3068, Eric.Ried@Aurora.Org • Steve Sonderman, Software Systems Administrator, Aurora Health Care, 414.647.6422, Steve.Sonderman@Aurora.Org