
Mass Storage @ RHIC Computing Facility

Mass Storage @ RHIC Computing Facility. Razvan Popescu - Brookhaven National Laboratory. Overview. Data Types: Raw: very large volume (xPB), average bandwidth (50MB/s). DST: average volume (x00TB), large bandwidth (x00MB/s). mDST: low volume (x0TB), large bandwidth (x00MB/s).


Presentation Transcript


  1. Mass Storage @ RHIC Computing Facility Razvan Popescu - Brookhaven National Laboratory

  2. Overview
     • Data Types:
        • Raw: very large volume (xPB), average bandwidth (50MB/s).
        • DST: average volume (x00TB), large bandwidth (x00MB/s).
        • mDST: low volume (x0TB), large bandwidth (x00MB/s).
     Mass Storage @ RCF
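
The raw-data figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 10^6 MB) and, as a simplifying assumption not stated on the slide, a stream that runs continuously at the quoted average rate; the function name and the `duty_cycle` parameter are illustrative only.

```python
SECONDS_PER_DAY = 86_400


def volume_tb(rate_mb_per_s, days, duty_cycle=1.0):
    """Data volume in TB (decimal units) written at a sustained rate.

    duty_cycle is the assumed fraction of the period the stream is active.
    """
    megabytes = rate_mb_per_s * SECONDS_PER_DAY * days * duty_cycle
    return megabytes / 1_000_000  # 1 TB = 10^6 MB


# Raw stream at the 50 MB/s average quoted on the slide, for one year.
raw_per_year_tb = volume_tb(50, days=365)
print(f"raw data per year: about {raw_per_year_tb:.0f} TB")
```

Under these assumptions a year of raw data comes to roughly 1.6 PB, which is in line with the "xPB" order of magnitude given for the raw volume.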

  3. Data Flow (generic)
     [Diagram. Nodes: RHIC, Reconstruction Farm (Linux), Archive (HPSS), File Servers (DST/mDST), Analysis Farm (Linux). Link labels, in the order extracted: raw 35MB/s, DST, raw 10MB/s, 50MB/s, DST 200MB/s, mDST, mDST 400MB/s, 10MB/s.]

  4. Mass Storage @ RCF

  5. Present resources
     • Tape Storage:
        • (1) STK Powderhorn silo (6000 cart.)
        • (11) SD-3 (Redwood) drives.
        • (10) 9840 (Eagle) drives.
     • Disk Storage:
        • ~8TB of RAID disk.
           • 1TB for HPSS cache.
           • 7TB Unix workspace.
     • Servers:
        • (5) RS/6000 H50/70 for HPSS.
        • (6) E450 & E4000 for file serving and data mining.

  6. The HPSS Archive
     • Constraints - large capacity & high bandwidth:
        • Two types of tape technology: SD-3 (best $/GB) & 9840 (best $/MB/s).
        • Two tape layer hierarchies. Easy management of the migration.
     • Reliable and fast disk storage:
        • FC attached RAID disk.
     • Platform compatible with HPSS:
        • IBM, SUN, SGI.

  7. HPSS Structure
     • (1) Core Server:
        • RS/6000 Model H50
        • 4x CPU
        • 2GB RAM
        • Fast Ethernet (control)
        • Hardware RAID (metadata storage)

  8. HPSS Structure
     • (3) Movers:
        • RS/6000 Model H70
        • 4x CPU
        • 1GB RAM
        • Fast Ethernet (control)
        • Gigabit Ethernet (data) (1500 & 9000 MTU)
        • 2x FC attached RAID - 300GB - disk cache
        • (3-4) SD-3 “Redwood” tape transports
        • (3-4) 9840 “Eagle” tape transports

  9. HPSS Structure
     • Guarantee availability of resources for a specific user group -> separate resources -> separate PVRs & movers.
     • One mover per user group -> total exposure to single-machine failure.
     • Guarantee availability of resources for Data Acquisition stream -> separate hierarchies.
     • Result: 2 PVRs, 2 COS & 1 mover per group.

  10. HPSS topology
     [Diagram. Components: Client, STK, Core, movers M1, M2, M3 (routing), N x PVR, pftpd. Networks: Net 1 - Data (1000baseSX), Net 2 - Control (100baseT), plus a 10baseT link.]

  11. HPSS Performance
     • 80 MB/sec for the disk subsystem.
     • 1 CPU per 40MB/sec for TCP/IP (Gbit) traffic (1500 MTU).
     • ~8MB/sec per SD-3 transport.
     • ? per 9840 transport.
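
To make the per-unit figures above concrete, the following sketch estimates how many SD-3 transports and how many mover CPUs a given aggregate rate would need. It is a back-of-the-envelope calculation only: it assumes throughput scales linearly with the number of units and ignores mount, seek, and contention overheads. The target rates are taken from the earlier slides (50MB/s to the archive, 400MB/s mDST); `units_needed` is a hypothetical helper name.

```python
import math

SD3_MB_S = 8           # ~8 MB/s per SD-3 transport (from the slide)
TCP_MB_S_PER_CPU = 40  # 1 CPU per 40 MB/s of TCP/IP traffic at 1500 MTU


def units_needed(target_mb_s, per_unit_mb_s):
    """Smallest whole number of units that covers the target rate,
    assuming ideal linear scaling."""
    return math.ceil(target_mb_s / per_unit_mb_s)


# Sustaining the 50 MB/s archive stream on SD-3 tape alone.
print("SD-3 transports for 50 MB/s:", units_needed(50, SD3_MB_S))
# CPUs consumed just by TCP/IP handling for the same stream.
print("CPUs for 50 MB/s of TCP/IP:", units_needed(50, TCP_MB_S_PER_CPU))
# CPUs consumed if the 400 MB/s mDST traffic crossed the network stack.
print("CPUs for 400 MB/s of TCP/IP:", units_needed(400, TCP_MB_S_PER_CPU))
```

With four CPUs per mover, the last figure already exceeds what two movers could spend on network handling alone under these assumptions.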

  12. I/O intensive systems
     • Mining and Analysis systems.
     • High I/O & moderate CPU usage.
     • To avoid large network traffic, merge file servers with HPSS movers:
        • Major problem with HPSS support on non-AIX platforms.
        • Several (Sun) SMP machines or Large (SGI) Modular System.

  13. I/O intensive systems
     • (6) NFS file servers for workareas
        • (5) x E450 + (1) x E4000
        • 4(6) x CPU; 2GB RAM; Fast/Gbit Ethernet.
        • 2 x FC attached hardware RAID - 1.5TB
     • (1) NFS Home directory server (E450).
     • (3+3) AFS Servers (code dev. & home dirs.)
        • RS/6000 model E30 and 43P
        • (NFS to AFS migration)

  14. Problems
     • Short lifecycle of the SD-3 heads.
        • ~ 500 hours < 2 months @ average usage. (6 of 10 drives in 10 months)
     • Low throughput interface (F/W) for SD-3 -> high slot consumption.
     • 9840 ???
     • HPSS: tape cartridge closure @ transport error.
        • Built a monitoring tool to try to predict transport failure (based on soft error frequency).
     • SFS response when heavily loaded - no graceful failure (timeouts & lost connections).
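
The slide gives no details of the monitoring tool, so the following is only an illustration of the general idea of watching soft error frequency: count soft errors per transport inside a sliding time window and flag a drive once the count exceeds a threshold. The class name, window length, threshold, and drive label are all hypothetical and do not describe the actual tool.

```python
from collections import deque


class SoftErrorMonitor:
    """Flags a transport whose recent soft-error count exceeds a threshold."""

    def __init__(self, window_hours=24.0, max_errors=5):
        self.window_hours = window_hours  # length of the sliding window
        self.max_errors = max_errors      # errors tolerated inside the window
        self._events = {}                 # drive name -> deque of error times

    def record(self, drive, hour):
        """Record one soft error at time `hour` (hours since some epoch).

        Returns True if the drive now looks suspect.
        """
        events = self._events.setdefault(drive, deque())
        events.append(hour)
        # Drop errors that have fallen out of the sliding window.
        while events and hour - events[0] > self.window_hours:
            events.popleft()
        return len(events) > self.max_errors


monitor = SoftErrorMonitor(window_hours=10, max_errors=2)
for t in (0, 1, 2, 20):
    suspect = monitor.record("sd3-1", t)  # "sd3-1" is a made-up drive label
    print(f"t={t:>2}h suspect={suspect}")
```

A real predictor would need thresholds calibrated against observed failures; the fixed values here are placeholders.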

  15. Issues
     • Partially tested two tape layer hierarchies:
        • Cartridge based migration.
        • Manually scheduled reclaim.
     • Integration of file server and mover functions on the same node:
        • Solaris mover port.
        • Not an objective anymore.

  16. Issues
     • Guarantee availability of resources for specific user groups:
        • Separate PVRs & movers.
        • Total exposure to single-machine failure!
     • Reliability:
        • Distribute resources across movers -> share movers (acceptable?).
     • Inter-mover traffic:
        • 1 CPU per 40MB/sec TCP/IP per adapter: Expensive!!!
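
The trade-off between dedicated and shared movers can be illustrated with a small model of what a single mover failure costs each group. The group names and the transport counts below are invented for the example; only the mover labels M1 to M3 and the two layouts (one mover per group versus resources spread across movers) come from the slides.

```python
def capacity_lost(layout, failed_mover):
    """Fraction of each group's tape transports lost when one mover fails.

    layout maps group -> {mover: number of transports owned on that mover}.
    """
    lost = {}
    for group, per_mover in layout.items():
        total = sum(per_mover.values())
        lost[group] = per_mover.get(failed_mover, 0) / total
    return lost


groups = ("groupA", "groupB", "groupC")  # hypothetical user groups

# One dedicated mover per group (the current layout).
dedicated = {g: {m: 6} for g, m in zip(groups, ("M1", "M2", "M3"))}
# Each group's transports spread evenly over all three movers.
shared = {g: {"M1": 2, "M2": 2, "M3": 2} for g in groups}

print("dedicated:", capacity_lost(dedicated, "M1"))
print("shared:   ", capacity_lost(shared, "M1"))
```

In the dedicated layout one group loses everything while the others are untouched; in the shared layout every group loses a third. The model says nothing about the extra inter-mover traffic that sharing introduces, which is the cost raised in the last bullet.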

  17. Inter-mover traffic - solutions
     • Affinity.
        • Limited applicability.
     • Diskless hierarchies.
        • Not for SD-3. Not tested on 9840.
     • High performance networking: SP switch.
        • IBM only.
     • Lighter protocol: HIPPI.
        • Expensive hardware.
     • Multiply attached storage (SAN).
        • Requires HPSS modifications.

  18. Multiply Attached Storage (SAN)
     [Diagram. Components: Client, Mover 1, Mover 2 (!); paths labeled 1 and 2.]

  19. Summary
     • Problems with divergent requirements:
        • Cost effective archive capacity and bandwidth.
           • Two tape hierarchies: SD-3 & 9840.
           • Test the configuration.
        • Availability and reliability of HPSS resources.
           • Separated COS and shared movers.
           • Inter-mover traffic ?!?
     • Merger of file servers and HPSS movers?
