Life of a Cell
Woes and Wins

The Conundrum
Distribute -- on-line -- millions of pages of aircraft maintenance documentation in a system that the FAA requires to be foolproof:
• No downtime
• All data identical for every mechanic worldwide. “Always”
Dexter "Kim" Kimball (dhk@ccre.com)

Business Risks
An airplane cannot leave the gate if maintenance documentation is unavailable.
An airplane stuck at the gate causes the airline to lose lots of money (system-wide).
Hasn’t been done before.

Business Drivers
Faster access to documentation translates to millions of dollars a year in recovered revenue
• No such thing as “I did that yesterday, I’ll just wing it” – documents change daily
• New document is printed and carried aboard the aircraft (or you’re busted)
• Search times and print times must be low

Business Drivers
Consistency of documentation eliminates “flip flop” maintenance costs
• I use procedure A and perform X
• Downline – old documents ... “Hey, who did that? But uh oh, I can fix it.” Procedure B
• Downline – new documents, Procedure A ....

Business Drivers
• Safety
• An incident involving a fatality drops ticket sales by 50% for two weeks.
• If the incident cannot be explained, ticket sales remain off until it is
• US Airways 737 (1994?), Pittsburgh, almost put airline out of business
• Airline people really do care about the people they’re responsible for

The Plan
Be the first airline to gain competitive advantage by going to 100% online documentation
Retire microfilm/microfiche completely
Don’t lose shirt

The Technologies
• Excalibur Technologies “EFS” (Electronic File System)
• Transarc AFS 3.3
• HP Servers
• Bunch’o’stuff to convert manuals to TIF
• Windows 3.1 target user platform

The Process
Scan microfiche/film manual pages to TIF
• EFS: OCR TIFs
• AFS: Store TIF pages
• EFS: Index TIFs (OCR output), keyword indexes
• AFS: Store index
• AFS: Replicate to strategically placed fileservers
• Mechanics and engineers:
• Click on index icon (File cabinet)
• Keyword search
• EFS client on Windows 3.1 desktop requests data from EFS server running on AFS fileserver

Worldwide airline, worldwide cell
• Fileserver locations decided by
• Location on corporate backbone
• Connectivity from other linestations (smaller airports)
• Number of linestations that can be served from location
• Paranoia (overdesigned by 2x)

Domestic Fileserver Locations

End User Workstations
• Every hangar -- many per “dock”
• Every gate – 2x, independent LANs
• Every engineering department
• Facilities for support of in-air aircraft (worldwide)

AFS Client Locations
• Minimal
• No supported Windows 3.1 AFS client
• EFS client requests data from AFS client

Number of users
• 40000 human users
• “I forgot my password” puts airline out of business
• 1500 workstations – workstation hostname is “user” and is written on front of workstation

Woes and Wins
• Network – shoving data into your LAN
• Replication management
• Who is authorized
• You want me to release how many volumes?
• vos release times
• FAA – the system will not go down! All replicas will be identical
• Let’s use a really big cache for Seattle!

Woe: Network
How to get 300–600 GB of data to each fileserver for the initial load of ROs?
• Slow links to small airports
• Slow links to international server locations
• Fast links heavily trafficked
• vos release can beat the * out of a network
• An airline is always in operation – no magic window of opportunity

Win: Network
• Can’t use vos release over the network
• Hey, we have lots of those airplane things
• Load local (SFO) fileserver array with disks, set up vicep’s
• vos addsite to fileserver/array; vos release
• vgexport – OS says bye to volume groups
• vos remsite; remove drives
• Fly to wherever; vgimport, vos addsite / vos release. Rio, anyone?

Woes: Replication Management
15000 RW volumes, all replicated
• Who’s authorized to issue vos release?
• Which volumes to release? EFS randomly places data ...
• How many volumes did you say to release?

Win: Replication Management
• Authorization/automation
• Per fleet per manual vosrel PTS group
• PTS group on every relevant volume root node
• User interface writes record to work queue, a file in /afs
• Requester; manual/index; priority
• Fileserver cron job compares requester with vosrel PTS group, figures out volume list, performs vos release -localauth

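The queue-and-authorize step can be sketched roughly as below. This is only an illustration: the slide names the record fields (requester; manual/index; priority) but not the file format, so the semicolon-separated layout, the `Request` type, the manual names, and the rule that a lower number means higher priority are all assumptions. Group membership is passed in as a plain dictionary rather than looked up from PTS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    requester: str  # who asked for the release
    manual: str     # manual or index to release (names below are made up)
    priority: int   # assumption: lower number = more urgent


def parse_queue(lines):
    """Parse 'requester;manual;priority' records, skipping blank lines."""
    requests = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        requester, manual, priority = (field.strip() for field in line.split(";"))
        requests.append(Request(requester, manual, int(priority)))
    return requests


def authorized(requests, group_members):
    """Keep requests whose requester belongs to the vosrel group for that manual."""
    allowed = [r for r in requests if r.requester in group_members.get(r.manual, set())]
    return sorted(allowed, key=lambda r: r.priority)


# Hypothetical queue contents and group memberships.
queue = ["alice;747/amm;2", "mallory;747/amm;1", "", "bob;737/ipc;1"]
groups = {"747/amm": {"alice"}, "737/ipc": {"bob"}}

todo = authorized(parse_queue(queue), groups)
print([(r.requester, r.manual) for r in todo])
# [('bob', '737/ipc'), ('alice', '747/amm')]
```

The request from `mallory` is dropped because that name is not in the group for the manual; a real job would then expand each surviving manual into its volume list before releasing.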
Woe: Replication Management
• Which volumes to release?
• Well known volume tree and consistent naming conventions
• Release all volumes for requested manual
• Who cares, really? How many can there be?
• Sometimes 4000+ volumes per night
• vos release is slowish – doesn’t check to see if volume is unchanged; looks at contents
• Release cycle > 24 hours, queue issue. OW!

Win: Replication Management
• Filter release requests
• Compare RO dates, RW dates – if RW not changed and all ROs same date, skip it
• Filter: 3 seconds
• vos release “no op” – 30 seconds
• Small fraction of volumes for given manual are actually changed
• Sometimes 0 changed; sometimes < 1%; usually small fraction of total

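A minimal sketch of the filter rule, plus the arithmetic behind it, might look like this. The function name and the way dates are passed in are assumptions; the slide only states the rule (skip when the RW is unchanged and all ROs share one date) and the two timings. The time for a real release of a changed volume is not given, so it is left out of the estimate.

```python
from datetime import datetime


def needs_release(rw_last_update, ro_dates):
    """Return False only when the RW is unchanged and all ROs share one date."""
    if not ro_dates:
        return True                      # no replicas yet: must release
    if len(set(ro_dates)) > 1:
        return True                      # replicas disagree with each other
    return rw_last_update > ro_dates[0]  # RW changed after the ROs were made


VOLUMES = 4000   # "sometimes 4000+ volumes per night"
NOOP_S = 30      # one no-op vos release
FILTER_S = 3     # one filter check

noop_hours = VOLUMES * NOOP_S / 3600
filter_hours = VOLUMES * FILTER_S / 3600
print(round(noop_hours, 1), round(filter_hours, 1))
# 33.3 3.3
```

Even if every release were a no-op, 4000 volumes at 30 seconds each come to about 33 hours, which matches the "release cycle > 24 hours" problem. Checking all 4000 with the filter costs about 3.3 hours, before adding the real releases for the small fraction that changed.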
Woe: FAA – the system will not fail!!
• FAA requires 100% uptime, else won’t approve system and airline can go fish
• Yeah, right!

Win: FAA – the system will not fail!!
• Data outage vs. system outage
• Replication, of course
• Multiple configurations for EFS client
• Crude failover
• No data outage for six years and counting
• Well, there were a couple of times when ... but we fixed that ...

Woe: FAA – replicas will be identical
• Several million RW files × 5 replicas
• Have to prove that all files are identical across the 5 ROs for a given volume

Win: FAA – replicas will be identical
• Tree crawler!
• A little cheesy – “ls -l | cksum” each directory in volume and compare results
• Known “bad case” looked for 6x per day
• Key: “fs setserverprefs” – I prefer you, now you, now you, now you
• Dedicated client, no mounted .backups

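The tree-crawler idea can be sketched as follows. This is not the original tool: it swaps `ls -l | cksum` for a SHA-256 digest over each directory's sorted names and file sizes, and it compares several directory trees side by side instead of re-reading one path after switching server preferences. Like the original, it checks listings rather than file contents, so a same-size content change would go unnoticed.

```python
import hashlib
import os
import tempfile


def directory_digests(root):
    """Map each directory (relative to root) to a digest of its listing."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        entries = [f"d {name}" for name in dirnames]
        for name in sorted(filenames):
            size = os.path.getsize(os.path.join(dirpath, name))
            entries.append(f"f {name} {size}")
        listing = "\n".join(entries).encode()
        digests[os.path.relpath(dirpath, root)] = hashlib.sha256(listing).hexdigest()
    return digests


def differing_directories(replica_roots):
    """Return sorted relative directories whose digests differ between replicas."""
    all_digests = [directory_digests(root) for root in replica_roots]
    names = set().union(*all_digests)
    return sorted(d for d in names if len({dg.get(d) for dg in all_digests}) > 1)


# Demo with two throwaway trees standing in for two replicas.
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    for root in (a, b):
        os.mkdir(os.path.join(root, "ch01"))
        with open(os.path.join(root, "ch01", "page1.tif"), "wb") as f:
            f.write(b"x" * 10)
    same = differing_directories([a, b])
    with open(os.path.join(b, "ch01", "page1.tif"), "ab") as f:
        f.write(b"y")  # simulate one replica drifting
    drifted = differing_directories([a, b])

print(same, drifted)
# [] ['ch01']
```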
Woe: Let’s use a really big cache
• It seemed like a really good idea
• 20% of files change per quarter -- under 2% per week
• Average file size 10K
• Oops, the indexes are monolithic and 300 MB ... but don’t change often
• Let’s try a 12 GB cache!
• “Hello? I’ve got twenty minutes to turn the shuttle. It takes fifteen minutes to ...”

Win: Let’s not use a really big cache
• AFS client (still, I believe) chokes on a large cache
• 12 GB =~ 1,200,000 cache “Vfiles”
• At garbage collection time, cache purge looks for LRU
• Gee, that takes a long time. Is the machine dead?
• Let’s try a 3 GB cache!
• (Worked indefinitely from 3.3 through 3.6)

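The Vfile estimate is simple division, shown here as a worked example. It assumes one cache file per 10K of cache (the stated average file size, taken as 10,000 bytes) and decimal gigabytes; the function name is made up for illustration.

```python
AVG_FILE_BYTES = 10_000  # "Average file size 10K", treated as 10,000 bytes


def approx_vfiles(cache_gb):
    """Rough count of cache files if each holds one average-sized file."""
    return cache_gb * 1_000_000_000 // AVG_FILE_BYTES


print(approx_vfiles(12), approx_vfiles(3))
# 1200000 300000
```

Dropping from 12 GB to 3 GB cuts the number of entries the purge has to scan for a least-recently-used candidate from about 1.2 million to about 300,000.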
Other smidgeons
• vos release manager
• Does volume need to be released?
• Are all the relevant fileservers available?
• Is there a sync site for the VLDB?
• Do it
• Did it?
• Check VLDB entry
• Compare dates

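The checklist reads naturally as a guarded sequence. The sketch below takes each check as a caller-supplied function, because the slides do not say how the real manager queried the fileservers or the VLDB; every name here, including the volume name, is hypothetical.

```python
def managed_release(volume, needs_release, servers_up, vldb_has_sync_site,
                    release, verify):
    """Walk the checklist in order and return a short status string."""
    if not needs_release(volume):
        return "skipped: unchanged"
    if not servers_up(volume):
        return "deferred: fileserver unavailable"
    if not vldb_has_sync_site():
        return "deferred: no VLDB sync site"
    release(volume)  # "Do it"
    # "Did it?": the caller's verify step stands in for checking the
    # VLDB entry and comparing dates.
    return "released" if verify(volume) else "failed: verification"


log = []
status = managed_release(
    "manual.747.amm",  # made-up volume name
    needs_release=lambda v: True,
    servers_up=lambda v: True,
    vldb_has_sync_site=lambda: True,
    release=log.append,
    verify=lambda v: v in log,
)
blocked = managed_release(
    "manual.747.amm",
    needs_release=lambda v: True,
    servers_up=lambda v: True,
    vldb_has_sync_site=lambda: False,
    release=log.append,
    verify=lambda v: v in log,
)
print(status, "|", blocked, "|", log)
# released | deferred: no VLDB sync site | ['manual.747.amm']
```

Running the precondition checks before the release means a missing fileserver or a VLDB without a sync site defers the work instead of leaving replicas half-updated.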
Other smidgeons
• Data reasonableness checks
• Do files pointed to by index actually exist?
• If not, do not vos rel the index
• Avoids the data outage of “empty index” – for example *(bad day)*

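A reasonableness check of this kind could be as small as the sketch below. It assumes the index can be reduced to a list of relative file paths, which the slides do not confirm, and it treats an empty list as not releasable, following the "empty index" remark. The function and file names are illustrative.

```python
import os
import tempfile


def missing_targets(index_paths, root):
    """Return the indexed paths that do not exist as files under root."""
    return [p for p in index_paths if not os.path.isfile(os.path.join(root, p))]


def ok_to_release_index(index_paths, root):
    """Allow release only for a non-empty index whose targets all exist."""
    return bool(index_paths) and not missing_targets(index_paths, root)


with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "page1.tif"), "wb").close()
    good = ok_to_release_index(["page1.tif"], root)
    bad = ok_to_release_index(["page1.tif", "page2.tif"], root)
    empty = ok_to_release_index([], root)

print(good, bad, empty)
# True False False
```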
Other smidgeons
• popcache
• Index files: monolithic and large
• Fileservers: overseas, slow networks
• Initial search of newly released index could take many minutes
• Cat indexes to /dev/null every five minutes
• If index unchanged, local cached copy is used
• If index changed, pulled from fileserver and user doesn’t pay penalty for first search

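The cache-warming trick amounts to reading each index and throwing the bytes away. Here is a minimal Python equivalent of `cat index > /dev/null`; the function names are made up, and the five-minute schedule is assumed to come from cron or a similar scheduler rather than from the script itself.

```python
import os
import tempfile


def read_and_discard(path, chunk_size=1 << 20):
    """Read a file to the end in chunks, discarding the data; return bytes read."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total


def popcache(paths):
    """Touch every index so the client cache holds a current copy."""
    return {path: read_and_discard(path) for path in paths}


# Demo on a throwaway file standing in for a large index.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "index.dat")
    with open(path, "wb") as f:
        f.write(b"\0" * 3_000_000)
    sizes = list(popcache([path]).values())

print(sizes)
# [3000000]
```

On an AFS client, reading an unchanged file is served from the local cache, so the periodic pass is cheap until a new index is released; then the scheduled read absorbs the slow fetch instead of a mechanic's first search.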
Other smidgeons
• Anyone here ever have these?
• AFS is complaining about the network, so AFS broke the network
• AFS is the network’s canary in a cage
• We could do the whole thing with NFS!
• AFS isn’t POSIX compliant. Yay DFS!
• A file lock resides on disk. File in RO volume can’t be locked. (Oh yes it can.)
• HP T500 goes to sleep?
• We could do the whole thing on a Kenmore!

Outcome: AFS Rules
• The airline became the first airline (and may still be the only one) to place 100% of its aircraft maintenance documentation on line
• The system has run reliably for 5+ years
• So of course it’s time to replace it
• There are three server locations in the US, one each in Europe, Hong Kong, Narita, Sydney, Montevideo, Rio de J
• Mechanics no longer mash the microfilm reader
This system was enabled by AFS