
Mass Storage Systems and AFS




  1. Mass Storage Systems and AFS Hartmut Reuter Rechenzentrum Garching der Max-Planck-Gesellschaft

  2. Overview • Rechenzentrum Garching (RZG) • Mass storage systems and HSM systems • Multiple Resident AFS as the software to use mass storage with AFS • Experiences with MR-AFS at RZG • Conclusions

  3. RZG == Rechenzentrum Garching • Common computer center for the IPP and the Max-Planck institutes • Supercomputing power: 816-processor Cray T3E, NEC SX4, Cray J90 • Mass storage: 2 StorageTek silos, 1 EMASS/Grau silo

  4. Extension of the cell ipp-garching.mpg.de • Sites at Garching, Berlin, Hannover, and Greifswald (map)

  5. Introduction • Data grow faster than disk capacities. • Today, activities in high-energy and plasma physics, geosciences, astronomy, and the digitizing of documents produce several TB/year. • Future experiments in high-energy and plasma physics are expected to produce 1 TB/day. • These data are needed world-wide, and AFS is an adequate way to share them.

  6. What is a Mass Storage System? • Disks are fast but expensive; active data should be stored there. • Space on tape in robot libraries is about a factor of 10 cheaper than on disk; for inactive data the slower access is good enough. • Moving data between disk (active files) and tape (inactive files) is the job of hierarchical storage management (HSM) systems. • On mainframes HSM systems have been in use since the seventies, on Unix for about a decade.

  7. Hierarchical Storage Management • HSM software migrates inactive files from disk onto tape and recalls them to disk whenever they are needed. • Examples of HSM systems under Unix: ADSM (IBM), DMF (Cray, SGI), FileServ (Cray, SGI), HPSS (IBM, ...), SamFS (SUN), Unitree (SUN, Convex, ...)

  8. A kind of disclaimer • A mass storage system is expensive: hardware and software are costly, it needs far more maintenance than disks, and you need someone to manage it. Therefore you should use a mass storage system only if you expect to have several TB of data. • Files in a mass storage system must be considered tape files. Therefore you should store only data you would store on tape anyway, and the files must be big enough that keeping them on tape instead of disk actually saves money.

  9. Problems implementing HSM for AFS • If you simply ran an HSM system on your AFS fileserver, problems could be caused by: • AFS files are not visible in the fileserver's /vicep-partitions; the HSM system might require that. • AFS abuses inode fields: a potential conflict with an HSM system doing the same. • HSM systems are too expensive to run on each fileserver. • Volumes could not be moved easily unless the HSM system is a client/server implementation and allows its metadata to be shared between clients. • To avoid all these problems, MR-AFS has been developed.

  10. Multiple Resident AFS (MR-AFS) • Developed at the Pittsburgh Supercomputing Center (psc.edu) by Jonathan Goldick, Chris Kirby, Bill Zumach et al. • Since 1995, development and maintenance at RZG. • Transarc's update releases have always been incorporated. • Presently used at 4 sites in Germany and 2 sites in the USA, together with the current AFS 3.4a or 3.5 (beta) client.

  11. Main features of MR-AFS • Files may be stored outside the volume's partition. • Fileservers can do I/O remotely (remioserver). • Fileservers can share HSM resources and disks. • On HSM systems the native UFS can be used (no abuse of inode fields, no special vfsck). • Files from any fileserver partition can be migrated into the HSM system (AFS-internal data migration). • Volumes can be moved between fileservers even if their files are in the HSM system or on shared disks.

  12. Residencies and the residency database • Shared residencies are disk partitions which are shared by all MR-AFS fileservers in the cell. • Fileservers learn about shared residencies from the residency database on the AFS database servers.

  13. The Residency Database • A Ubik database served by "rsserver", analogous to ptserver or vlserver (port 7010). • Describes the shared residencies: name and id, minimum file size, maximum file size, priority, availability (read / write / random access), minimum migration age (how old a file must be before it is copied here), whether the residency is wipable (MR-AFS-internal data migration), and the wiping thresholds and weight factors.
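
A residency record of the kind slide 13 describes could be modelled roughly as in the sketch below. All field names and types here are assumptions for illustration; the actual rsserver record layout is not shown in this talk.

    /* Hypothetical sketch of one residency record as described on slide 13.
     * Field names and types are illustrative, not the real MR-AFS structure. */
    #include <stdint.h>

    struct residency_sketch {
        char     name[32];            /* e.g. "stk_tape" */
        uint32_t id;                  /* e.g. 16, 64 (used as a bit mask elsewhere) */
        uint64_t min_file_size;       /* smallest file accepted, in bytes */
        uint64_t max_file_size;       /* largest file accepted, in bytes */
        int32_t  priority;            /* preference when several residencies match */
        uint32_t avail_flags;         /* read / write / random access */
        uint32_t min_migration_age;   /* how old a file must be to be copied here (seconds) */
        int32_t  wipable;             /* may files with copies elsewhere be wiped here? */
        uint32_t high_water_mark;     /* wiping starts above this usage (percent) */
        uint32_t low_water_mark;      /* wiping stops below this usage (percent) */
        int32_t  weight_age;          /* weight factors for the wipe-order calculation */
        int32_t  weight_size;
    };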

  14.-16. Shared Residencies • A fileserver can write a file: • on its own local disk, where the volume lives, • into a shared residency on the same server, • or into a shared residency on another server, via the "remioserver" daemon there.

  17. Remote I/O Interface • The "remioserver" does remote I/O on behalf of fileservers. • The new layer of generic I/O routines calls interface routines for local access directly, or a remote "remioserver" daemon via an rx call. • Interface routines exist for the afsinode interface (what standard AFS uses), the ufs interface for standard Unix filesystems, and the maxstrat interface (MaximumStrategy RAIDs); the layer is easily extensible to new interfaces (e.g. CD-ROM). • Higher layers don't need to know about remote access.
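
A minimal C sketch of such a generic layer is given below, assuming an illustrative table of function pointers (struct ih_ops, generic_read); these names and signatures are not taken from the MR-AFS source.

    /* One table of interface routines per storage type: afsinode, ufs,
     * maxstrat, or a "remio" table whose routines forward the same calls
     * to a remote remioserver daemon via rx.  Illustrative sketch only. */
    #include <stddef.h>
    #include <sys/types.h>

    struct ih_ops {
        int     (*iopen)(void *handle);
        ssize_t (*iread)(void *handle, void *buf, size_t len, off_t offset);
        ssize_t (*iwrite)(void *handle, const void *buf, size_t len, off_t offset);
        int     (*iclose)(void *handle);
    };

    /* Higher layers (fileserver, volserver, salvager, scanner) call only
     * generic entry points like this one and never need to know whether
     * the I/O ends up local or remote. */
    static ssize_t generic_read(const struct ih_ops *ops, void *handle,
                                void *buf, size_t len, off_t offset)
    {
        return ops->iread(handle, buf, len, offset);
    }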

  18. Generic I/O layer (diagram) • The volserver, fileserver, salvager and scanner all call the generic I/O layer, which uses the local afsinode or ufs interface routines directly, or reaches a remioserver on another machine via rx calls, which in turn uses its own afsinode or ufs routines.

  19. UFS Interface • Files are distributed by a hash code over 256 directories (for up to 125,000 files), 256 * 16 directories (for up to 2 million files), or 256 * 256 directories (for more files). • File names are built from the FileId (RW-volume, vnode, uniquifier) and a filetag (whose contents depend on the environment). • Primarily developed for archival residencies, the UFS interface can be used even for the local disk partitions, allowing one to run fileservers on platforms without full AFS-client support (Linux).
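
The directory layout can be pictured with a small sketch like the one below; the hash function and the exact name format are assumptions for illustration, not the real MR-AFS code.

    /* Sketch: map a FileId onto the hashed directory tree described on
     * slide 19 (256, 256*16 or 256*256 directories). */
    #include <stdio.h>

    int main(void)
    {
        unsigned int volume = 536896166u, vnode = 2128u, uniquifier = 16606u;
        unsigned int filetag = 0u;          /* contents depend on the environment */
        int levels = 2;                     /* 1: 256 dirs, 2: 256*16, 3: 256*256 */

        unsigned int hash = volume ^ vnode ^ uniquifier;   /* illustrative hash */
        char dir[32];

        if (levels == 1)
            snprintf(dir, sizeof dir, "%u", hash & 0xffu);
        else if (levels == 2)
            snprintf(dir, sizeof dir, "%u/%u", hash & 0xffu, (hash >> 8) & 0x0fu);
        else
            snprintf(dir, sizeof dir, "%u/%u", hash & 0xffu, (hash >> 8) & 0xffu);

        /* File name built from the FileId (RW-volume, vnode, uniquifier) and tag. */
        printf("%s/%u.%u.%u.%u\n", dir, volume, vnode, uniquifier, filetag);
        return 0;
    }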

  20. Archival Residencies • The staging disk of an HSM system can be declared a shared residency. • The fileservers can use it to put files onto tape and to get them back from tape. • This is called an "archival residency" because it should only be used for archiving. • Fileservers stage files from archival residencies onto random-access residencies (local disk or shared) before delivering them to the clients.

  21. HSM systems used with MR-AFS
      HSM system       OS        AFS cells
      ADSM             AIX       cpc.engin.umich.edu
      DMF              UNICOS    psc.edu, ipp-garching.mpg.de
      DMF              IRIX      ipp-garching.mpg.de
      EMASS FileServ   IRIX      federation.atd.net
      Epoch            ?         ?
      SamFS            Solaris   tu-chemnitz.de
      Unitree          Solaris   rrz.uni-koeln.de, urz.uni-magdeburg.de

  22. How files get copies in the HSM system • For archival residencies a minimum migration age is specified in the database. • After a file has reached this age, the fileserver automatically creates a copy in the archival residency. • The HSM system then migrates that copy off to tape.
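
A minimal sketch of that age check, with placeholder names (needs_archival_copy) that are not part of MR-AFS:

    /* A file gets a copy in the archival residency once it is older than
     * the residency's minimum migration age.  Illustrative sketch only. */
    #include <time.h>

    static int needs_archival_copy(time_t file_mtime, time_t now,
                                   unsigned int min_migration_age,
                                   int already_has_archival_copy)
    {
        if (already_has_archival_copy)
            return 0;
        return difftime(now, file_mtime) >= (double)min_migration_age;
    }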

  23.-27. Read from an Archival Residency • Client requests a file which is on an archival residency. • The fileserver calls the remioserver to trigger the HSM system. • The HSM system copies the file from tape back to its disk. • The fileserver copies the file to its local disk. • The fileserver delivers the file to the client's cache.
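
The whole sequence can be summarized in a short sketch; every type and function name below is a placeholder for illustration, not the MR-AFS API.

    /* Staged read path for a file whose only on-line copy is archival. */
    struct wiped_file;
    struct afs_client;

    void remio_request_stage(struct wiped_file *f);        /* ask remioserver to trigger the HSM system */
    void wait_until_on_staging_disk(struct wiped_file *f); /* HSM system copies the file from tape */
    void copy_to_local_residency(struct wiped_file *f);    /* fileserver copies it to its local disk */
    int  deliver_to_client(struct wiped_file *f, struct afs_client *c);

    int serve_wiped_file(struct wiped_file *f, struct afs_client *c)
    {
        remio_request_stage(f);
        wait_until_on_staging_disk(f);
        copy_to_local_residency(f);
        return deliver_to_client(f, c);   /* finally lands in the client's cache */
    }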

  28. Data Migration in MR-AFS: Wiping • Shared residencies and local disk partitions can be declared wipable. • All files having copies on other residencies can be wiped. • If disk usage exceeds the high-water mark, wiping of files starts until the low-water mark is reached. • The order in which files are wiped is calculated from each file's access history and size and the weight factors from the database. • Wiped files remain visible in the directory; only the access takes a long time.
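
A minimal sketch of a wipe-order score under these assumptions is shown below; the exact formula MR-AFS uses is not given in the talk, so the one here is purely illustrative.

    /* Files that have copies on other residencies are candidates for wiping.
     * When usage exceeds the high-water mark, candidates are wiped in order
     * of a score built from access history, size and the per-residency
     * weight factors, until usage falls below the low-water mark. */
    #include <time.h>

    struct wipe_candidate {
        time_t    last_access;
        long long size_bytes;
        int       has_copy_elsewhere;   /* only such files may be wiped */
    };

    static double wipe_score(const struct wipe_candidate *f, time_t now,
                             double weight_age, double weight_size)
    {
        /* Higher score = wiped earlier: old and large files go first. */
        return weight_age  * difftime(now, f->last_access)
             + weight_size * (double)f->size_bytes;
    }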

  29. What do wiped files look like? • The Unix "ls" command doesn't know about wiping or data migration, so it cannot show which files are wiped. • The new subcommand "fs ls" shows an "m" in the mode bits for migrated files (columns: mode bits, size, modification date, name):
      > fs ls
      drwxrwxrwx        4096 Jan 12 09:20 .
      drwxrwxrwx        2048 Jan 25 11:03 ..
      mr--r--r--    51718400 Jan 22 03:42 A
      mr--r--r--    60780800 Jan 22 12:13 B
      -r--r--r--    53523200 Jan 24 07:14 C
      -r--r--r--   218091520 Jan 25 11:11 D
      >

  30. How to get a file back from the HSM system • A file comes back from the HSM system onto a random-access disk partition when the user opens it for reading. This, of course, may take a while. • The user can prefetch files from the HSM system with the new subcommand "fs prefetch <file names>". This command returns control immediately, and the files come back asynchronously. • The advantage of prefetching is that files which are stored on the same tape in the HSM system can be read with a single tape mount. Prefetching therefore increases the throughput of the HSM system.

  31. The Fetch Queue • The servers running the HSM system maintain a queue for the fetch requests of all users. • The priority of a new fetch request depends on the number of requests the user already has in the queue. • This allows users to prefetch hundreds of files at a time without blocking the queue for others. • A new "fs fetchqueue" command shows the queue:
      > fs fe
      Fetch Queue for residency stk_tape (16) is empty.
      Fetch Queue for residency d3_tape (64):
      Pos. Requestor FileId                TimeStamp     Rank Cmd.  State
      1    reblinsk  536896166.2128.16606  Jan 27 12:00  0    pref  xfer to server
      2    hwr       536879945.290.46282   Jan 27 12:05  0    open  waiting for tape
      3    reblinsk  536896166.3942.16734  Jan 27 12:02  1    pref  waiting for tape
      >
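
The ranking rule can be sketched as follows; the queue layout and the function name are assumptions for illustration, not MR-AFS internals.

    /* The rank of a new fetch request equals the number of requests the
     * same user already has queued, so bulk prefetches do not starve
     * requests from other users.  Illustrative sketch only. */
    #include <string.h>

    struct fetch_request {
        char user[16];
        /* ... FileId, timestamp, command, state ... */
    };

    static int rank_of_new_request(const struct fetch_request *queue, int n,
                                   const char *user)
    {
        int pending = 0;
        for (int i = 0; i < n; i++)
            if (strcmp(queue[i].user, user) == 0)
                pending++;
        return pending;   /* 0 = the user's first request, served earliest */
    }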

  32. Other extensions to the fs-command • For platforms without AFS-client (Cray T3E), all AFS commands (fs, pts, vos, klog, unlog, kpasswd, ...) may be built. • For those systems the fs command gets some additional subcommands: • fs cd <directory> • fs pwd • fs read <AFS-file> <local file> • fs write <local file> <AFS file> • fs rm <AFS file> • fs mkdir <AFS directory> • “fs read” and “fs write” can also be used on systems with AFS client. They are faster because they bypass the AFS-cache.

  33. Dumping MR-AFS volumes • The files on archival residencies can be considered to be safe if • the HSM system writes the files immediately onto tape • and creates more than one tape copy of the files. • Therefore only files with a single non-archival residency need to be dumped. • A new “vos selectivedump” command allows one to specify all residencies to be included in the dump. • A “dumptool” command (written by Ken Hornstein from nrl.navy.mil) can be used to analyze the dump and to restore single files without restoring the whole volume.

  34. Mass storage used at RZG for MR-AFS • Two StorageTek silos. • For files < 8 MB: SGI Origin 2000, IRIX 6.4, DMF 2.6, 7 Timberline tape drives, 1 - 2 GB / tape. • For files >= 8 MB: SGI Origin 2000, IRIX 6.4, DMF 2.6, 8 Redwood tape drives, 50 - 70 GB / tape.

  35. What data are stored in MR-AFS? • The common mass storage system for general users in the IPP and the Max-Planck institutes, • satellite gamma-ray astronomy (MPI for extraterrestrial physics), • the plasma physics experiment Asdex-Upgrade (IPP), • the plasma physics experiment W7-AS.

  36. RZG's AFS cell: ipp-garching.mpg.de • 22 fileservers, all running MR-AFS binaries, but only 7 use the MR-AFS features (multiple residencies, wiping); 700 GB of local disk partitions. • 3 archival residencies where files on wiping fileservers get copies: • backup: a 9.6 GB RAID partition for files < 64 KB. • stk_tape: files are migrated onto Timberline (3490) tapes; for files between 64 KB and 8 MB; 977,000 files totalling 1.2 TB. • d3_tape: files are migrated onto Redwood (D3) tapes; for files >= 8 MB; 137,000 files totalling 7 TB. • 8 non-archival shared residencies totalling 420 GB, used as staging space on the wiping fileservers. • More than 600 AFS clients with about 1200 active users.

  37. RZG's Shared Residencies
      > res listresidencies
      Local_disk     (id: 1)
      backup         (id: 4)       9.6 GB, used 73.4 %  archival
      stk_tape       (id: 16)     12.1 GB, used 27.9 %  archival
      d3_tape        (id: 64)     17.6 GB, used 11.4 %  archival
      temp           (id: 128)    67.7 GB, used 93.3 %
      w7-shots       (id: 512)    33.8 GB, used 85.7 %
      aug-shots      (id: 1024)  127.7 GB, used 88.0 %
      comptel        (id: 2048)   29.2 GB, used 87.9 %
      soho           (id: 4096)   14.8 GB, used  9.8 %
      m-tree         (id: 8192)   46.0 GB, used 86.4 %
      aug-bigshots   (id: 16384)  63.2 GB, used 88.7 %
      aug-smallshots (id: 65536)  42.2 GB, used 94.1 %
      >

  38. AFS volumes can become big! (743 GB)
      > vos ex muser.jgc
      muser.jgc                      536882889 RW  743580408 K  On-line
          afs-serv7.ipp-garching.mpg.de /vicepz
          3416 files
          RWrite  536882889  ROnly  0  Backup  536882891
          MaxQuota  900000000 K,  8000 files
          Creation     Mon Aug 28 16:32:15 1995
          Last Update  Thu Jan 28 10:44:50 1999
          12 accesses in the past day (i.e., vnode references)
          Desired residency mask = 0, undesired residency mask = 512

          RWrite: 536882889    Backup: 536882891
          number of sites -> 1
             server afs-serv7.ipp-garching.mpg.de partition /vicepz RW Site
      >

  39. Data growth over time • We have seen strong data growth over the last years; on average the data double every 14 months. • Therefore it is very important to have a scalable system to which additional disk space and HSM systems can be added when necessary.

  40. Files and data per file-size range • Our cell 'ipp-garching.mpg.de' has 8 million files totalling 8.6 TB of data. • More than 50 % of the files are < 4 KB (first column of the histogram). • 89 % of the files are on disk, but only 8.3 % of the data.

  41. Data transfer between tape and disk • On average 470 files are staged from tape to disk per day: 340 from Timberline and 130 from Redwood. Peaks of 5000 files per day have been seen. • The corresponding data volume is 13 GB/day, with peaks of 50 GB/day. • The files remain on disk for some days before they are wiped again. • The time a user has to wait for a file to come on-line is strongly non-linear in the number of requests pending at the time; wait times can vary between one minute and, in the worst case, some hours. • Data transfer from disk to tape corresponds to the data growth (about 10 GB per day). • Data transfer from tape to tape (to defragment tapes and to guarantee unlimited file lifetime) is much higher.

  42. Wait times for staging files from tape • Staging requests from Timberline tapes are generally satisfied in < 100 seconds. • A batch of bad Redwood tapes (bad media) still causes problems, resulting in much longer average wait times; we are presently replacing the bad tapes.

  43. What we expect most from Transarc • Support of files > 2GB. • RZG’s Cray T3E has a main memory > 100 GB, so users easily create files > 2 GB. • MR-AFS server code is already prepared for files > 2 GB. • Client and client/server interface should be provided by Transarc.

  44. Conclusions • AFS can make use of mass storage systems using: the normal AFS client (kernel extension and afsd), some extensions to the AFS commands (fs and vos), and MR-AFS on the server side. • MR-AFS has proven to be mature, very stable, and scalable, and to support all kinds of Unix-based HSM systems. • Very important for infinite file lifetime: the underlying mass storage systems can be exchanged without affecting the user's view of the data. That gives you time to migrate the files from the old to the new mass storage system without interrupting the service to the users.
