This paper discusses the design and implementation of SAM, the fully distributed data access system for the D0 experiment at FNAL. SAM manages access to 500 TB/year of data, serving 550 scientists in 65 institutions. It uses a multi-tiered architecture with distributed caching and global file routing, and aims for reliability, scalability, and flexibility in support of data-intensive applications. SAM is a real system serving over 100 registered users and enabling efficient data movement and analysis. The paper highlights SAM's key features, challenges, and future directions.
SAM for D0 - a Fully Distributed Data Access System
I. Terekhov, FNAL
For the SAM Project: L. Lueking, V. White, L. Carpenter, H. Schellman, I. Terekhov, J. Trumbo, M. Vranicar, S. Veseli, S. White
Introduction
• SAM: Sequential Access Model
• Data access for the D0 Run II experiment at FNAL
• 500 TB/year, about 1 PB in total
• Raw detector data at 250 KB/event in 1 GB files, plus processed data
• 550 scientists in 65 institutions, and growing
• Data (I/O) intensive applications
The Distributed Nature of SAM
• All the data access entities (files, events, …, resource usage rules) are described as metadata in a relational database.
• The metadata is served via CORBA IDL interfaces:
  • the Database Server is implemented in Python
  • universally defined structures and exceptions
  • possibility of alternate implementations (online system, remote installations)
• Multi-tiered architecture:
  • a hierarchical collection of servers exposing IDL interfaces
  • pure clients at the user end (a minimal client-side sketch follows below)
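The following is a hypothetical Python sketch of the pure-client idea: the client holds no state of its own and obtains IDL-style structures and exceptions from a metadata server. The names (FileMetadata, DbServerStub, get_file_metadata) are invented for illustration and are not the actual SAM IDL; a real client would talk to the Database Server through a CORBA object reference rather than an in-process stub.

```python
# Hypothetical sketch, not the actual SAM interfaces: a pure client asking a
# metadata "Database Server" for file attributes through an IDL-like API.
from dataclasses import dataclass

@dataclass
class FileMetadata:            # stands in for an IDL-defined structure
    name: str
    size_bytes: int
    event_count: int
    locations: list

class MetadataNotFound(Exception):   # stands in for an IDL-defined exception
    pass

class DbServerStub:
    """Plays the role of the Database Server; here it is just a local dict."""
    def __init__(self):
        self._files = {}

    def declare_file(self, meta: FileMetadata):
        self._files[meta.name] = meta

    def get_file_metadata(self, name: str) -> FileMetadata:
        try:
            return self._files[name]
        except KeyError:
            raise MetadataNotFound(name)

# Pure-client usage: no local state, everything comes from the server.
server = DbServerStub()
server.declare_file(FileMetadata("raw_run42.dat", 10**9, 4000, ["fnal-enstore"]))
print(server.get_file_metadata("raw_run42.dat").locations)
```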
Distributed Caching
• The user application always reads from and writes to local disk; SAM takes care of the rest.
• A user pushes a file into (or pulls it from) SAM and does not know how or when the system relocates the file.
• Disk is allocated en route to and from the mass storage system (MSS).
• Every transfer requires authorization from the resource manager: network contention, MSS bandwidth, etc. (a minimal authorization sketch follows below).
• SAM cache managers (rather than physical machines) form a network.
• Global file routing and replication.
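As a rough illustration of the "authorize before transfer" step, here is a hedged Python sketch in which a station's cache manager asks a resource manager for disk space and a transfer slot before staging a file. The names, policies, and numbers are assumptions for illustration, not SAM's actual interfaces.

```python
# Illustrative only: a cache manager that must get authorization from a
# resource manager before it moves a file into the local cache.
class ResourceManager:
    def __init__(self, disk_bytes, transfer_slots):
        self.free_disk = disk_bytes
        self.transfer_slots = transfer_slots

    def authorize(self, size_bytes):
        """Grant the transfer only if disk and a transfer slot are available."""
        if size_bytes <= self.free_disk and self.transfer_slots > 0:
            self.free_disk -= size_bytes
            self.transfer_slots -= 1
            return True
        return False

    def release_slot(self):
        self.transfer_slots += 1

class CacheManager:
    def __init__(self, resource_manager):
        self.rm = resource_manager
        self.cache = set()

    def fetch(self, filename, size_bytes):
        if filename in self.cache:
            return "already cached"
        if not self.rm.authorize(size_bytes):
            return "queued: no resources"      # retried later in a real system
        # ... copy the file from the MSS or a peer station here ...
        self.cache.add(filename)
        self.rm.release_slot()
        return "cached"

station = CacheManager(ResourceManager(disk_bytes=30 * 10**12, transfer_slots=4))
print(station.fetch("raw_run42.dat", 10**9))   # -> cached
```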
[Diagram: global SAM station topology. Fermilab SAM station (6-30 TB disk plus tape store), NIKHEF SAM station (~300 GB disk), Lyon computer-center SAM station (~5 TB disk), central analysis servers, analysis tapes, Monte Carlo production, and analysis desktops.]
Distributed Caching: File Retrieval in SAM
• A Station is a collection of resources (CPU, disk, network connections), possibly a cluster.
• The Station Master (SM) is a distributed cache manager; the SMs form a global network.
• The SM also runs Projects: the activity of processing a dataset, related to, but not the same as, a user job in the batch system.
• Projects coordinate multiple consumers, each having multiple processes (threads of execution), over the local network or farm (see the sketch below).
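A minimal sketch of the Project idea, assuming nothing about SAM's real interfaces: a Project hands out the next file of a dataset to whichever consumer process asks, so several consumers share one dataset. All names here are illustrative.

```python
# Illustrative only: a Project serving files of a dataset to many consumers.
from collections import deque

class Project:
    def __init__(self, dataset_files):
        self._todo = deque(dataset_files)
        self._assigned = {}                  # file -> consumer id

    def get_next_file(self, consumer_id):
        """Each consumer process calls this until the dataset is exhausted."""
        if not self._todo:
            return None
        filename = self._todo.popleft()
        self._assigned[filename] = consumer_id
        return filename

project = Project(["f1.raw", "f2.raw", "f3.raw"])
for consumer in ("worker-0", "worker-1", "worker-0"):
    print(consumer, "->", project.get_next_file(consumer))
```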
Distributed Caching: File Storage in SAM
• A File Storage Server (FSS) is the part of the station responsible for importing data into SAM (online, Monte Carlo, processed data).
• The FSS accepts a user request and finds a route to the final destination; the FSSs form a global network (a minimal routing sketch follows below).
• Intermediate locations are in general used because:
  • the final destination is not directly accessible, or
  • it is desirable to keep a copy on nearby disk for subsequent retrieval.
• Each interim location is part of the cache (allocation is subject to local use policy, etc.)!
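The routing decision can be pictured as a path search over the station network. The sketch below is purely illustrative: the station names, link table, and breadth-first search are assumptions, not SAM's actual routing algorithm.

```python
# Illustrative only: find a chain of stations from a source to a destination
# when the destination is not directly reachable.
from collections import deque

LINKS = {                                 # which stations can transfer to which
    "desktop-lyon": ["lyon-station"],
    "lyon-station": ["fnal-central"],
    "fnal-central": ["fnal-enstore"],     # the mass storage system
}

def find_route(src, dst):
    """Breadth-first search for a path of stations from src to dst."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in LINKS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(find_route("desktop-lyon", "fnal-enstore"))
# ['desktop-lyon', 'lyon-station', 'fnal-central', 'fnal-enstore']
```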
Example of Global Data Movement
• IN2P3 collaborators in France import a file into the SAM MSS.
• The file was produced on a desktop PC.
• They also want to keep a copy locally at Lyon for analysis.
• They also want a copy at D0's central analysis station on the FNAL site.
• Two to three transfers are involved; robustness requires retrying failed steps (see the sketch below).
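A hedged sketch of the retry behaviour mentioned above: each hop of a multi-step transfer is attempted a few times with a back-off before the import is declared failed. The transfer() function is a stand-in for the real copy mechanism, and all names are illustrative.

```python
# Illustrative only: retry each hop of a multi-step file movement.
import random
import time

random.seed(0)                     # deterministic for this example

def transfer(src, dst):
    """Fake copy step that sometimes fails, to exercise the retry loop."""
    return random.random() > 0.3

def robust_route(path, max_attempts=3, backoff_seconds=1):
    for src, dst in zip(path, path[1:]):
        for attempt in range(1, max_attempts + 1):
            if transfer(src, dst):
                break
            time.sleep(backoff_seconds * attempt)   # simple linear back-off
        else:
            raise RuntimeError(f"giving up on {src} -> {dst}")
    return "file delivered"

print(robust_route(["desktop-lyon", "lyon-station", "fnal-central", "fnal-enstore"]))
```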
SAM is a Real System
• Over 100 registered D0 users.
• The online system stores calibration data.
• Terabytes of Monte Carlo data have been imported, and imports continue.
• Data has been reconstructed several times on the farms (with different code versions).
• Data is being analyzed; the cache contains hundreds of GB of files.
• By the time of detector commissioning, Spring 2001, (nearly) all the components must be functional.
Summary
• SAM is being developed as a fully distributed system for:
  • scalability
  • robustness
  • flexibility: the implementation order is driven by D0 priorities, but the design is general, Grid-compatible, and applicable beyond D0.
• Other actively worked issues, not covered here:
  • support for other MSSs (at remote institutions)
  • uniform error reporting in a distributed, heterogeneous system.