CSE 190: Internet E-Commerce Lecture 14: Operations
Operations • Everything it takes to keep a web site up and running, 24x7 • Deployment Process • Monitoring (SNMP) • Build system • Link rot • Maintenance window • Load testing • Browser compliance • Log rotation • Database backups • Disk failure • Router failure • Robots • Staffing • Data centers • Expense of running a high-availability site is comparable to running a physical storefront
Deployment Process • Proceeds in three phases • Development • Within corporation, not accessible outside • Stage • Within internet environment • UAT run here • Only operations staff may access • Live • Accessible to outside world
Monitoring • SNMP (Simple Network Management Protocol) • Used to monitor both hardware and software • Provides: Counters, Values, Triggers, Statistics • Remote control of services • Information stored in MIB (Management Information Base) • RMON sometimes used as alternative to SNMPv2 • Software • HP OpenView
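The counters mentioned above are typically monotonically increasing values that a monitoring station polls at intervals. A minimal sketch of turning two polled samples into a rate and a trigger decision follows. It assumes a 32-bit counter (SNMP's Counter32 type wraps to zero after 2^32 - 1); it does not talk to a real SNMP agent, and the function names and threshold are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32  # a Counter32 value wraps back to 0 after 2**32 - 1


def counter_delta(previous: int, current: int, modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two counter samples, allowing for one wraparound."""
    return (current - previous) % modulus


def rate_per_second(previous: int, current: int, interval_seconds: float) -> float:
    """Average rate of counter increase over the polling interval."""
    return counter_delta(previous, current) / interval_seconds


def should_trigger(rate: float, threshold: float) -> bool:
    """Hypothetical trigger rule: fire when the rate exceeds the threshold."""
    return rate > threshold
```

For example, `rate_per_second(0, 6000, 60)` gives 100.0 units per second, and a sample pair that straddles a wrap such as `counter_delta(2**32 - 10, 20)` still yields the small positive delta 30 rather than a large negative number.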
Maintenance Window • Installation • Standard: J2EE standard web service descriptor (XML file with tarball of files) • InstallShield • Custom installation scripts • Upgrades • Defined time on Friday or weekend to upgrade site, posted on web site • Process: • Front page linked to ‘Site down’ • Load balancer redirected if appropriate • Application stops accepting new clients • (Pause) Application terminates all active sessions • Application upgraded • Sanity checks performed • Servers rebooted • Load balancer restored
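The upgrade process above is an ordered checklist, which lends itself to a small runner that executes each step in turn and stops at the first failure. The sketch below is an illustration under stated assumptions: every step action is a placeholder that simply reports success, and the names `run_window`, `placeholder`, and `UPGRADE_STEPS` are hypothetical rather than part of any real deployment tool.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_window(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that reports failure."""
    log: List[str] = []
    for name, action in steps:
        succeeded = action()
        log.append(f"{name}: {'ok' if succeeded else 'FAILED'}")
        if not succeeded:
            break  # later steps (e.g. restoring the load balancer) do not run
    return log


def placeholder() -> bool:
    """Stand-in for a real operation; always reports success."""
    return True


# Step names follow the slide; the actions are placeholders.
UPGRADE_STEPS: List[Step] = [
    ("link front page to 'Site down'", placeholder),
    ("redirect load balancer", placeholder),
    ("stop accepting new clients", placeholder),
    ("terminate active sessions", placeholder),
    ("upgrade application", placeholder),
    ("sanity checks", placeholder),
    ("reboot servers", placeholder),
    ("restore load balancer", placeholder),
]
```

Stopping on the first failure means a failed sanity check leaves the site behind the 'Site down' page instead of putting a broken build back into rotation.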
Link Rot • Link rot: the continual process by which links become invalid over time • Tracked with custom tools • Best practice: Pages have permanent URLs • Referral field: • Tracking this in logs shows who’s linking to what URL on your site
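As a rough illustration of tracking the referral field, the sketch below tallies which external pages link to which URLs, and can be restricted to requests that returned 404 to find inbound links that have rotted. It assumes logs in the common "combined" access log layout (request, status, size, then quoted referrer); the regular expression and the function name `referrers_by_path` are hypothetical and would need adjusting to the actual log format in use.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional

# Matches: "GET /path HTTP/1.0" 200 2326 "http://referrer/..."
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "(?P<referer>[^"]*)"'
)


def referrers_by_path(lines: Iterable[str], status: Optional[str] = None) -> Dict[str, Counter]:
    """Count referrers per requested path, optionally only for one status code."""
    result: Dict[str, Counter] = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # line is not in the assumed format
        referer = match.group("referer")
        if referer in ("", "-"):
            continue  # no referrer was sent
        if status is not None and match.group("status") != status:
            continue
        result[match.group("path")][referer] += 1
    return dict(result)
```

Calling `referrers_by_path(lines, status="404")` lists the outside pages still pointing at URLs that no longer exist, which is the information needed to add redirects or contact the linking site.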
Load Testing • Network load (60% bandwidth max) • Average page size (~20-30k) • CPU load: Occurs at at least three levels • HTTP level • Application level • DB query level • Metrics: maximum number of simultaneous users, latency vs. users • Memory usage (256 M – 1 G per machine) • Disk I/O load • 1 Gb per machine typical • Tools • Mercury Interactive: WinRunner • Segue: SilkTest • Rational: SiteLoad • Microsoft: WCAT
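The network figures above combine into a quick capacity estimate: if the link should stay under 60% utilization and an average page is roughly 20-30k, the sustainable page rate follows by division. The sketch below works this through. The T1 link speed (1.544 Mbit/s), the 25 KB page size, and the reading of "k" as 1024 bytes are assumptions chosen for illustration, and the function name is hypothetical.

```python
def max_pages_per_second(link_bits_per_second: float,
                         page_bytes: float,
                         utilization_cap: float = 0.60) -> float:
    """Pages per second the link can carry while staying under the utilization cap."""
    usable_bytes_per_second = link_bits_per_second * utilization_cap / 8
    return usable_bytes_per_second / page_bytes


# Assumed example values, not taken from the lecture.
T1_BITS_PER_SECOND = 1_544_000
AVERAGE_PAGE_BYTES = 25 * 1024

pages = max_pages_per_second(T1_BITS_PER_SECOND, AVERAGE_PAGE_BYTES)
print(f"about {pages:.1f} pages per second")
```

Under these assumptions the estimate comes out to about 4.5 pages per second, which is a network ceiling only; CPU, memory, and disk limits still have to be measured separately.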
Browser Compatibility • Cost of testing proportional to the number of platforms you’re compatible with • The same product isn’t the same on different operating systems • E.g. IE4.5 isn’t the same on Mac vs. Windows • Incompatible DOMs between MS, Netscape, Mozilla • Browser archive • http://browsers.evolt.org/
Robots • Robots: Automatically traverse web pages to retrieve documents, link structure, data • Used for: • Indexing • HTML validation • Link validation • Mirroring • Problems: • Too much rapid access from single IP • May be indexing dynamic, obsolete data • Robot exclusion file:
# /robots.txt file for mysite.com
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /jsp
Disallow: /logs
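To see how a well-behaved robot would interpret the exclusion file above, the following sketch feeds the same rules to Python's standard `urllib.robotparser` module. The rules are copied from the slide; the user agent name `somebot` and the example paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# /robots.txt file for mysite.com
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /jsp
Disallow: /logs
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# An empty Disallow means nothing is off limits for that robot.
print(parser.can_fetch("webcrawler", "/logs/access.log"))  # True
# "Disallow: /" bans the robot from the whole site.
print(parser.can_fetch("lycra", "/index.html"))            # False
# Every other robot is kept out of /jsp and /logs only.
print(parser.can_fetch("somebot", "/jsp/cart.jsp"))        # False
print(parser.can_fetch("somebot", "/index.html"))          # True
```

Note that the file is advisory: it only restrains robots that choose to honor it, so it does not replace access control for sensitive paths.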
Failure Models • [Figure: failure rate over time. Hardware failure rate: burn-in, useful life, wear-out. Software failure rate: integration & test, useful life, obsolete.] • Mean Time To Failure (MTTF) = average amount of time the system is up • Mean Time Between Failures (MTBF) = average amount of time between failures • Mean Time To Repair (MTTR) = average amount of time the system is down after it fails: active repair time (diagnostics and repair) • Mean Down Time (MDT) = average amount of time the system is down after it fails: active repair time + preventive maintenance + logistics time (time spent waiting for personnel, etc.) • Intrinsic availability = MTTF / (MTTF + MTTR) • Operational availability = MTBF / (MTBF + MDT)
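The two availability ratios are easy to work through numerically. The sketch below computes intrinsic availability as MTTF / (MTTF + MTTR) and operational availability as MTBF / (MTBF + MDT), then converts an availability figure into expected downtime per year. The input values in the usage note are made up for illustration, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 8760.0  # 365 days * 24 hours


def intrinsic_availability(mttf_hours: float, mttr_hours: float) -> float:
    """MTTF / (MTTF + MTTR): counts only active repair time as downtime."""
    return mttf_hours / (mttf_hours + mttr_hours)


def operational_availability(mtbf_hours: float, mdt_hours: float) -> float:
    """MTBF / (MTBF + MDT): also counts maintenance and logistics time."""
    return mtbf_hours / (mtbf_hours + mdt_hours)


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR
```

With an assumed MTTF of 999 hours and MTTR of 1 hour, intrinsic availability is 0.999, or roughly 8.76 hours of downtime per year. If waiting for personnel stretches mean down time to 4 hours against an MTBF of 996 hours, operational availability drops to 0.996, about 35 hours per year.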
When things go wrong • Network operations • Software recovers from common failures • Network staff paged by email if server not available (via SNMP) • Usually rotating assignment • Application developers may be called in if restarting servers, etc. fails completely, and only if it doesn’t look like a network problem.
Data Centers • Data centers: Host your machines in their own premises • Also called “colocation” • Features • Security: controlled entrance, exit • Weather: maintained temperature, humidity • Power: Backup power, available circuits • Bandwidth: OC-192 connections • Monitoring: 24/7 staff, may reboot misbehaving machines • Machines typically arranged in “cages”; 1u, 2u machines • Server blades • Examples • NTT / Verio • Exodus / Global Crossing