Managing an Archival File Service with VERITAS HSM Keith Michaels The Boeing Company keith.r.michaels@boeing.com
Presentation Goals • Describe the service we provide using HSM • Describe how we converted a large existing file service to use VERITAS HSM in one weekend • Share our experiences administering a large HSM installation
Team Members • The Boeing Archive Service Team • Ed Alejandro • David Atler • Gareth Beale • William Julien • Shamus McBride • Keith Michaels • John Prescott • Rodger Wessling
Our Business • Airplanes, aerospace, missiles • Large Information Management Infrastructure • Seattle, Washington • Our niche: Long Term Preservation of Digital Information
Configuration • System: 1 Sun 6500 for VERITAS HSM, 1 Sun 2000E for OSM • Robotics: 2 StorageTek Genesis silos • Tape drives: 16 STK 9491 drives (8 for OSM) • Tape type: 3490E, 3490EE • Silo slots occupied: 8,500 • Tapes at vault: 7,100 • Online storage: 200 GB (mirrored) • File count: 16 million • Total storage: 8 TB • Percent of files online: 15% • HSMs: 12 (1 per filesystem) • User accounts: 1,600 • Recalls/day: 4,000 files, 7 GB • New data/day: 15,000 files, 14 GB
Hardware configuration • [Block diagram] The old server (OSM, UFS filesystems) and the new server (VERITAS HSM, VxFS filesystems) each serve NFS, ftp, and samba over the FDDI network; each has its own disk arrays, and both share the tape drives in Silo 1 and Silo 2.
Service Description • The Boeing Archive File Service (AFS) • Definition of archive: an archive is a repository for the protection and storage of inactive digital data, according to specific retention criteria. • Purpose of archiving: to provide a secure repository for digital information that needs to be kept for a long time. • Our guarantee: your data, expressed as a Unix file, will be retrievable back to that same file indefinitely, or until explicitly removed by the owner.
HSM File Server or Archive? • The technology you use depends on the service you want to provide • Easy: • Service: File Server / Disk Extension • Technology: HSM • Harder: • Service: Archiving • Technology: • Document Management System • Electronic Library • Database • “Enhanced” HSM
HSM Properties • Transparency: • Avoids users having to make decisions about data residency • When it works, it’s invisible • When it doesn’t work, it’s not obviously the culprit. • But: transparency can cause user frustration: • Inconsistent response (online vs. offline) • Lacks feedback to users
HSM Properties (cont.) • NUFA (Non-Uniform File Access) is acceptable if it's bounded and predictable, but in practice HSM tends to be unpredictable. • HSM controls that improve predictability: • MIGSTOP • migtie • quotas • Advice: • Downplay transparency • Tell users everything is always offline
Part 2: Migrating to VERITAS HSM • Timeline: • 1992-1996: UniTree • 1996-1999: Open Storage Manager (OSM) • 1999-present: VERITAS HSM • Migration goals: zero impact • Minimum downtime • Transparent access to existing files during the data migration
Option 1: the obvious way • Hard cutover after move: • Bring up the new system • Shut down old system • Move all the files over to it • Resume services on new system • Big Problem: the files are all off-line! • With a tape mount for each file, we estimated a hard cutover like this would take 80 years! • GOAL: mount each tape only once
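(A rough sanity check on that figure, assuming about 3 minutes per tape mount and recall: 14,000,000 files × 3 minutes ≈ 42,000,000 minutes ≈ 80 years of back-to-back mounting.)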
Configuration • [Block diagram repeated] The same hardware picture as above: the old OSM/UFS server and the new VERITAS HSM/VxFS server on the FDDI network, sharing the tape drives in both silos.
Option 2: “soft” cutover • Same as Option 1, except allow user access on the old system while migration is in progress. • Problem: changes made during the conversion must be reconciled before cutting over. • Reconciliation can be run repeatedly while the old system stays in service; only the final “delta” has to be applied during downtime.
Option 3: cutover before move • Set up new system, and switch users over to it immediately. • Then fill it up by copying files over. • Big Problem: Users cannot access old files until they have been copied over, which can take months.
Option 3 Solution: piggyback HSMs • Solution: Populate new system with “stubs” (slices) which point to the old files. “Piggy-back” system 2 on top of system 1. • Must generate a complete configuration on the new system up-front. • Takes downtime proportional to number of files, but does not require any tape mounts to cut over. • This method was used in 1999 to convert from OSM to VERITAS HSM. • Completed over weekend of July 17-18, 1999 with a total downtime of 70 hours.
Option 3: block diagram • [Block diagram] The old system runs OSM over UFS, with the existing files online on disk or offline on tape; the new system runs VERITAS HSM over a reconstructed VxFS filesystem whose VERITAS media are initially empty, holding only ft-method “pointers” back to the online and offline files on the old system.
Implementing Option 3 • Problem: • populate a VERITAS HSM server with 14 million files in an efficient and timely manner, with minimal user impact. • Process must be: • Robust: restartable from intermediate checkpoints • Auditable: need an independent verification process • Automatic: self-starting, self-throttling, self-checking
High-level Design (Phase 1) • Two Phase approach: • Phase 1: • Shut down production • Scan all OSM filesystems, build FHDB, VOLDB, and remote “ft” volumes • Move databases over to VERITAS side • Run migreconstruct to populate HSM filesystems with zero-length “slices” (pointers to files on old system). • Set file attributes on all files, using verification file • Cut over IP address • Resume production on new system
High Level Design (Phase 2) • Phase 2: • Copy all OSM files over to new system, replacing HSM “slice” • remigrate all new VERITAS HSM files • set migration date in FHDB to zero • push the transferred files out to VERITAS HSM tape • delete empty ft volumes • retire old OSM system • We are now at the beginning of Phase 2. • Estimate for completion: December, 2000.
Phase 2: approach “A” • Access all files on the new server, letting VERITAS HSM retrieve the files across the network from the old server. • Disadvantages: • requires OSM active for duration of phase 2 • Hard to order tapes optimally (requires OSM internal knowledge). • Must generate lists of files on each tape
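A minimal sketch of what an approach-“A” driver could look like, assuming the per-tape file lists have already been generated (the list location and naming, /var/tmp/phase2/tape_NNNN.list, are hypothetical) and assuming that simply reading a migrated file is enough to make VERITAS HSM stage the data across the network via the ft method:

    #!/usr/bin/env python
    # Sketch only: per-tape lists are plain text, one pathname per line.
    import glob
    import sys

    def stage_file(path):
        """Trigger a recall by reading the first byte of the file."""
        try:
            with open(path, "rb") as f:
                f.read(1)          # any access forces the HSM to stage the data
            return True
        except OSError as err:
            print("FAILED %s: %s" % (path, err), file=sys.stderr)
            return False

    def main():
        # Process one OSM tape's worth of files at a time, so each source
        # tape should only need to be mounted once (hypothetical naming).
        for listfile in sorted(glob.glob("/var/tmp/phase2/tape_*.list")):
            print("staging files from", listfile)
            with open(listfile) as lf:
                for line in lf:
                    stage_file(line.rstrip("\n"))

    if __name__ == "__main__":
        main()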
Phase 2: approach “B” • Read in the OSM tapes directly on the VERITAS side. • Advantages: • Very efficient: no network transfer • OSM not needed • Disadvantages: • OSM tape format is proprietary • Requires a lot of OSM internal knowledge • tapes alone do not contain correct pathnames
Phase 1: Process Details • Based on “ft” method of VERITAS HSM • GOAL: make HSM think all the files on the old system have been previously migrated with ft method. • An ft volume consists of: • hostname • username • password • directory path
Create ft volumes • Migration to an ft volume causes a file transfer of the file data to an internal filename: • 3E8M856.0.0 • And also creates a “control file” containing the FHDB and VOLDB entries: 3E8M856.0.0.GLABEL • These are both created in the ft volume directory • We want to build ft volumes from scratch and populate them with preexisting files: • Use symbolic links to the existing files: /move/volumes/user1/10/N1000/3E8862.0.0 -> /user1/agape/pet118/david/xpiglet/Psymbols.dat
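A minimal sketch of the symlink trick for one file, in Python; the volume directory and internal filename below just mirror the slide's example, and the .GLABEL control files (which the real tooling also generates) are not shown:

    import os

    def add_to_ft_volume(volume_dir, internal_name, real_path):
        """Make an ft volume entry point at a preexisting file instead of
        copying it: the HSM sees "migrated" data, but no data has moved."""
        os.makedirs(volume_dir, exist_ok=True)
        link = os.path.join(volume_dir, internal_name)
        # e.g. /move/volumes/user1/10/N1000/3E8862.0.0 -> /user1/.../Psymbols.dat
        if not os.path.islink(link):
            os.symlink(real_path, link)
        return link

    # usage (paths taken from the slide's example):
    add_to_ft_volume("/move/volumes/user1/10/N1000",
                     "3E8862.0.0",
                     "/user1/agape/pet118/david/xpiglet/Psymbols.dat")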
Scan the OSM filesystems • Locally written conversion program: scanfs: • Scans all OSM (UFS) filesystems and produces: • A fully populated /move filesystem containing all the ft volumes • FHDB • VOLDB • Verification file • Speed: 300 files / second • Total elapsed time: 14 hours • Each ft volume was limited to 2000 files
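A rough sketch of what a scanfs-style pass looks like; this is not the Boeing program, and the record layout, batching, and output paths are illustrative assumptions:

    import os, stat

    BATCH = 2000          # files per ft volume, as on the slide

    def scan(filesystem, verification_file):
        """Walk one OSM filesystem, batching files into ft volumes and
        writing one verification record per file (fields illustrative)."""
        count, volume = 0, 0
        with open(verification_file, "w") as out:
            for dirpath, dirnames, filenames in os.walk(filesystem):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    st = os.lstat(path)
                    if not stat.S_ISREG(st.st_mode):
                        continue            # specials handled elsewhere
                    if count and count % BATCH == 0:
                        volume += 1         # start a new ft volume
                    # record used later to verify and to set owner/group/perms
                    # NOTE: '|' can appear in filenames; a real format needs escaping
                    out.write("%d|%s|%d|%d|%d|%o|%d\n" % (
                        volume, path, st.st_size, st.st_uid, st.st_gid,
                        stat.S_IMODE(st.st_mode), int(st.st_mtime)))
                    count += 1
        return count

    # usage: scan("/user1", "/var/tmp/phase1/user1.verify")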
Reconstruct the filesystems • migreconstruct: existing HSM utility • Uses FHDB and VOLDB info to create a filesystem • Creates zero-length “slices” • Initializes the inode and DMAPI attributes with the file handle • Sets status to “migrated” • Used a “customized” (stripped down) version of migreconstruct • Combined with VxFS mount options for speed: • nolog,convosync=delay,mincache=tmpcache
Verification • All files were verified after migreconstruct, checking for: • existence • length • all other inode fields, except ctime • Driven by the verification file created by scanfs • migreconstruct did NOT set the correct owner, group, permissions, etc.; those were set from the verification file • The only inode field not retained: ctime
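A minimal sketch of the post-migreconstruct check, assuming a verification record format like the one sketched above (volume, path, size, uid, gid, mode, mtime) and deliberately ignoring ctime:

    import os, stat

    def verify(verification_file):
        """Compare each reconstructed slice against the recorded inode
        fields; ctime is skipped because it cannot be preserved."""
        bad = 0
        with open(verification_file) as vf:
            for line in vf:
                _, path, size, uid, gid, mode, mtime = line.rstrip("\n").split("|")
                try:
                    st = os.lstat(path)
                except FileNotFoundError:
                    print("MISSING", path); bad += 1; continue
                checks = (st.st_size == int(size),
                          st.st_uid  == int(uid),
                          st.st_gid  == int(gid),
                          stat.S_IMODE(st.st_mode) == int(mode, 8),
                          int(st.st_mtime) == int(mtime))
                if not all(checks):
                    print("MISMATCH", path); bad += 1
        return bad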
Nobody files • Caused by UID mapping: a large number of files on the old system had the invalid UID -2 (“nobody”) • VxFS 3.3.1 does not support such files (bug #28313) • This has been fixed in VxFS 3.3.2
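A small sketch, assuming the goal is just to locate such files ahead of time so ownership can be fixed before conversion (UID -2 shows up as a large unsigned value on most systems, so both spellings are checked):

    import os

    NOBODY_UIDS = {-2, 65534, 4294967294}   # common encodings of UID -2

    def find_nobody_files(root):
        """Yield paths whose owner is the invalid 'nobody' UID."""
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    if os.lstat(path).st_uid in NOBODY_UIDS:
                        yield path
                except OSError:
                    pass

    # usage: for p in find_nobody_files("/user1"): print(p)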
Special Cases Handled • Hard-linked files • Symbolic links • Named pipes • Zero-length files • Funny characters in file names: \n, |, unprintables, etc. • Future-dated files
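A sketch of how such cases could be flagged during the scan, under the assumption that anything not a plainly named, ordinary file gets routed to special handling (the categories simply mirror the list above):

    import os, stat, string, time

    PRINTABLE = set(string.printable) - set("\n|")

    def classify(path, seen_inodes):
        """Return a tag for files needing special handling, else None."""
        st = os.lstat(path)
        if stat.S_ISLNK(st.st_mode):
            return "symlink"
        if stat.S_ISFIFO(st.st_mode):
            return "named pipe"
        if stat.S_ISREG(st.st_mode):
            if st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)
                tag = "hard link" if key in seen_inodes else "hard link (first)"
                seen_inodes.add(key)
                return tag
            if st.st_size == 0:
                return "zero length"
            if any(c not in PRINTABLE for c in os.path.basename(path)):
                return "funny characters"
            if st.st_mtime > time.time():
                return "future dated"
        return None

    # usage: seen = set(); tag = classify("/user1/somefile", seen)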
Credits • Thanks to VERITAS Consultation: • For assistance developing the conversion process. • And the HSM Development staff: • For improvements to migreconstruct, migpsweep, numerous bugfixes, etc.
PART 3: Living with HSM 3.2 • HSM is often sold as “automating” storage management: • HSM does not run itself! • Requires daily management: • Handling filesystem full (not totally preventable) • export/import, moving files around • figuring out why something won’t reload • explaining to users why their job is hung • resource management: tape drive/disk cache utilization, capacity planning • tape management: adding tapes, vaulting, replacing failed tapes
Scalability issues with HSM 3.2 • We push HSM harder than most: • Diverse user community • Limitations due to number of files (14.5 M) and number of HSMs (12 on one server): • Sequential databases require batch mode updating and recopying. • Locking strategy • Migration is batch oriented • Database has no rollback capability • Many utilities’ runtime is proportional to number of files: • migdbclean, migdbcheck, migdbrpt, voldb_info, migsweep, migmove, migbatch
Scalability issues with HSM 3.2 (cont.) • Tape striping is per-HSM, not per-system • No way to reserve some tape drives for retrieval, others for migration, backup, vault copy, etc. There is no coordinated sharing of drives between HSMs on the same system. • The recovery process after a problem with one HSM is sometimes to shut down migd, which affects unrelated HSMs. • Three sweeps are needed to free space under no-space conditions: • Search for existing purge candidates • Build the migration worklist and migrate files • Select purge candidates and release space
Wishlist • What we would like to see in VERITAS HSM • Better performance under heavy loads. • An RDBMS for all databases (transactions, recoverability, rollback, ad-hoc queries). • Better export/import between HSMs on same system. • Ability to track user-defined file attributes (metadata) for archiving purposes. • Scratch pool for free tapes. • Accounting system • Vaulting
Wishlist (cont.) • Enhanced reports: • “How much was migrated/recalled over the last 24 hours?” • “What percent of the time were users waiting for tape mounts?” • “What is the utilization of the disk cache (the working set)?” • Feedback to users: • If pending stage status were visible to users, they would be more patient. • Pop-up window: “The file you requested is off-line and will take 10 minutes to retrieve. Do you want to wait?” • We would like to guarantee that files migrate to tape within a fixed amount of time (for vaulting). • A system-wide administration tool for all HSMs on a server.
Conclusions • HSM 3.2 works as advertised. It does not solve all storage management problems, but it can still be useful: • Lesson 1: HSM does not run itself, especially if you push it hard. • Lesson 2: HSM has some scalability issues, especially with large numbers of files. • Lesson 3: HSM works best with big files and predictable arrival and recall rates, e.g., netbackup logs. • Lesson 4: Configure plenty of tape drives and disk cache to handle peak demand. • Lesson 5: Run migbatch frequently.
Our Plans for the Future • Upgrade HSM (new releases will address many scalability issues) • Add 10 STK 9840 tape drives, plus CPUs, memory, disk, and a 3rd silo • Add a third level in the migration hierarchy for very-low-usage data • Add metadata management capability to provide better archiving • Investigate SAN technology to allow remote HSM sites to be centrally managed and to share resources
Summary • Digital Archiving is a very useful service. • All HSMs are moderately complicated, and require expertise and effort to operate. • VERITAS HSM 3.2 works best with low to moderate usage. • Improvements are coming.