Introduction to Data Management
Scientific data management:
– Large data volumes (10s of PB)
– Distributed user base
– Need for high performance transfers
– Need for data security (or not)
– Scalability
Data in “the Grid”? [Diagram labels: “The Grid”, Data]
Data in “the Cloud”? [Diagram labels: “The Cloud”, Data]
Transfer Protocols
– GridFTP (http://www.ogf.org/documents/GFD.20.pdf), aka “gsiftp” (GSI = Globus (Grid) Security Infrastructure, cf. RFC 3820)
– HTTP(S)
– WebDAV (RFC 4918)
GridFTP – based on FTP
– Ancient protocol: RFCs 114 (1971), 141 (1971), 172 (1971), 265 (1971), 354 (1972), 542 (1973), 765 (1980), 959 (1985)
– Splits the control connection from the data connection
– Extensions: RFC 2228, 2773 (security), 2640 (internationalisation), 3659 (misc.), 2389 (FEAT), 5797 (command and extension registry)
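Control connection: port 21 (FTP), 2811 (GridFTP)
– The client opens the control connection to the server
– Data connections and firewalls: active mode (the server connects back to the client) vs passive mode (PASV: the client connects to a port announced by the server)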
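The passive-mode handshake can be illustrated with a minimal sketch. It assumes the common parenthesised form of the 227 reply described in RFC 959, where the server announces its data address as six decimal bytes (four for the host, two for the port). The function name and the sample address are illustrative only.

```python
import re
from typing import Tuple


def parse_pasv_reply(reply: str) -> Tuple[str, int]:
    """Extract (host, port) from an FTP '227 Entering Passive Mode' reply."""
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if match is None:
        raise ValueError(f"not a recognisable PASV reply: {reply!r}")
    fields = [int(n) for n in match.groups()]
    if any(n > 255 for n in fields):
        raise ValueError("each field must fit in one byte")
    # The first four bytes are the IPv4 address.
    host = ".".join(str(n) for n in fields[:4])
    # The last two bytes are the high and low byte of the data port.
    port = fields[4] * 256 + fields[5]
    return host, port


# Hypothetical reply using a documentation address (192.0.2.0/24).
print(parse_pasv_reply("227 Entering Passive Mode (192,0,2,10,195,80)"))
# → ('192.0.2.10', 50000)
```

Because the data port is chosen by the server at transfer time, a firewall has to allow a whole range of ports for passive mode, which is why data connections are the awkward part.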
GridFTP – extensions to FTP
– GSI security (later RFC 3820)
– Striping (and EBLOCK mode)
– TCP buffer size control/negotiation?
– Data channel authentication (DCAU)
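Why TCP buffer size matters on wide-area links can be shown with a short worked calculation. This is a sketch of the standard bandwidth-delay product rule of thumb, not of how any particular GridFTP server negotiates buffers; the function names and the 10 Gbit/s, 100 ms figures are assumptions chosen for illustration.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bytes that must be in flight to keep the link full for one round trip."""
    # bits/s * seconds / 8 = bytes; rtt is given in milliseconds.
    return bandwidth_bits_per_s * rtt_ms // (8 * 1000)


def per_stream_buffer_bytes(bandwidth_bits_per_s: int, rtt_ms: int, streams: int) -> int:
    """Buffer each stream needs if the load is shared evenly across parallel streams."""
    bdp = bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_ms)
    return -(-bdp // streams)  # ceiling division


# Illustrative link: 10 Gbit/s with a 100 ms round-trip time.
print(bandwidth_delay_product_bytes(10_000_000_000, 100))  # → 125000000
print(per_stream_buffer_bytes(10_000_000_000, 100, 4))     # → 31250000
```

A single stream with a default buffer of a few hundred kilobytes cannot fill such a link, which is one motivation for both buffer tuning and multiple parallel data connections.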
The Grid....
– Ad-hoc transfers between GridFTP endpoints
– Initial user ingest? scp?
– Hands on with GridFTP: uberftp (cf. ftp)
Moving data in (and to, and from) the Grid
– “Manually,” with GridFTP
– Portals, e.g. NGS portal
– GlobusOnline
– FTS (as of 3.0, tbc)
The gLite grid – daily TLA dose
– EMI – European Middleware Initiative
– UMD – Unified Middleware Distribution
– EGI – European Grid Infrastructure
– IGE – Initiative for Globus in Europe
– NGI – National Grid Initiative
The gLite grid – component TLAs
– SE – Storage Element
– SRM – Storage Resource Manager
– LFC – LCG File Catalogue
– FTS – File Transfer Service
– BDII – Berkeley Database Information Index (LDAP)
SRM (OGF GFD.129) – control interface
– support for “spaces” (reserved areas)
– retention policies (replica, output, custodial)
– access latencies (offline, nearline, online)
– storage “type”: permanent, volatile
Name resolution:
– LFN – Logical File Name (optional); resolved by LFC into
– GUID – Globally Unique Identifier; resolved by LFC into
– SURL – Storage URL (or Site URL); resolved by SE into
– TURL – Transfer URL (e.g. gsiftp)
[Diagram components: FTS, LFC, SRM, GridFTP, BDII, Storage Element]
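The LFN to GUID to SURL to TURL chain can be sketched with plain dictionaries standing in for the catalogue and the storage element. Everything here is hypothetical: the host names, the GUID value, the table contents and the function name are made up for illustration, and a real lookup goes through the LFC and SRM services rather than in-memory tables.

```python
# Toy stand-ins for the file catalogue (LFC) and the storage elements (SE).
LFC_LFN_TO_GUID = {"lfn:my_stuff": "guid:example-0001"}
LFC_GUID_TO_SURLS = {
    "guid:example-0001": [
        "srm://se1.example.org/dteam/my_stuff",
        "srm://se2.example.org/dteam/my_stuff",
    ]
}
SE_SURL_TO_TURL = {
    "srm://se1.example.org/dteam/my_stuff": "gsiftp://door1.example.org:2811/pool3/my_stuff",
    "srm://se2.example.org/dteam/my_stuff": "gsiftp://door2.example.org:2811/pool7/my_stuff",
}


def resolve_lfn(lfn: str) -> list:
    """Follow LFN -> GUID -> SURL(s) -> TURL(s); one TURL per replica."""
    guid = LFC_LFN_TO_GUID[lfn]               # catalogue: logical name to GUID
    surls = LFC_GUID_TO_SURLS[guid]           # catalogue: GUID to replica locations
    return [SE_SURL_TO_TURL[s] for s in surls]  # each SE hands out a transfer URL


for turl in resolve_lfn("lfn:my_stuff"):
    print(turl)
```

The point of the indirection is that one logical name can map to several replicas, and each storage element is free to choose the door and protocol it hands back at transfer time.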
gLite – Summary of basic data commands
– lcg-cp <srcfile> <dstfile>: copy to/from SE, or between SEs (no LFC)
– lcg-cr <srcfile> <dst>: copy file into SE, and register in LFC (guid)
– lcg-del <guid>: delete
– lcg-rep <src> <dst>: replicate
Exercises
– Lots of small files (10^5, 10^6)
– Large files (10^8-10^12)
– Migration
– Format migration, checksumming
– Who can copy data? Write/Modify?
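For the checksumming exercise, a minimal sketch of computing a checksum in fixed-size chunks, so that even very large files never have to fit in memory. Adler-32 is used here only because it ships with the Python standard library; which algorithm a given storage system expects is an assumption to check locally, and the function name is illustrative.

```python
import io
import zlib


def adler32_of_stream(stream, chunk_size: int = 1024 * 1024) -> str:
    """Return the Adler-32 checksum of a binary stream as 8 hex digits."""
    checksum = 1  # Adler-32 starts from 1, not 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Feed the running value back in so chunks combine correctly.
        checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"


# In-memory example; for a real file use open(path, "rb") instead.
print(adler32_of_stream(io.BytesIO(b"Wikipedia")))  # → 11e60398
```

Comparing the value computed before a copy with the value computed at the destination is the basic integrity check after a transfer or migration.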
Exercises
– How is scientific data management different?
• How do research disciplines differ?
• What are the interdisciplinary benefits?
– How do grids and clouds differ?
– Can we trust the grids/clouds?
– Who leads the way? HEP? Industry?
Storage Accounting – static
Ongoing work...
– Distributed storage systems
– Temporary file copies created
– Scheduled deletions
– Inaccessible free spaces, reserved space
– Filesystem/tape overheads
– Timeliness and accuracy
– Impact of compression
GridFTP today
– GridFTP: workhorse of WAN grid data (OGF standard)
– The need for GSI (non-TLS)
– Numerous LAN protocols...
– ... moving towards more common standards? (e.g. HTTP)
lcg-cr --vo dteam -l lfn:my_stuff -d srm-dteam.gridpp.rl.ac.uk file://`pwd`/foo.tmp
guid:921ac0b8-82aa-61dc-0192-6effece
Subsequent access and replication is by GUID
Data Security
• Data security is like data security everywhere...
• Except that the devil is in the detail
• And the details are always different...
Data Security – Confidentiality
• In flight, or at rest
• The performance issue
• And the time issue
• Who can “activate” it?
Data Security – Availability
– LOCKSS (Lots of Copies Keep Stuff Safe) again ... clouds are good at this.
– Somebody already thought about the difficult stuff...?
– Liability, SLAs, ...
[Diagram label: Data]
Data Security – Availability
DDoS
• Intentional
• Botnets
• Unintentional
Referencing Data
• DOIs for data
• DONA – Digital Object Numbering Authority
• Granularity?
• Licences, permissions
• Implementing data policies
Cloud Data – Cost
• Clouds are elastic
• Elasticity is good for (rapid) growth
• Or shrinkth
• Elasticity can be expensive, though
• Compared to “traditional” data centre
• Or in-house (but don’t underestimate this!)
• Different cost models (Hybrids!)
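The trade-off between the two cost models can be made concrete with a toy calculation. All numbers below are invented for illustration (prices in arbitrary units per TB-month, usage in TB) and the function names are hypothetical; the sketch only shows the shape of the comparison: elastic storage is billed for what is used each month, while fixed capacity must be provisioned for the peak.

```python
def elastic_cost(usage_tb_per_month, price_per_tb_month):
    """Pay only for the capacity actually used in each month."""
    return sum(usage_tb_per_month) * price_per_tb_month


def fixed_cost(usage_tb_per_month, price_per_tb_month):
    """Provision for the peak and pay for it every month."""
    return max(usage_tb_per_month) * len(usage_tb_per_month) * price_per_tb_month


ELASTIC_PRICE = 20  # hypothetical, arbitrary units
FIXED_PRICE = 12    # hypothetical, cheaper per TB but always paid in full

bursty = [100, 120, 400, 150]  # one short peak
steady = [380, 390, 400, 395]  # always close to the peak

print(elastic_cost(bursty, ELASTIC_PRICE), fixed_cost(bursty, FIXED_PRICE))  # → 15400 19200
print(elastic_cost(steady, ELASTIC_PRICE), fixed_cost(steady, FIXED_PRICE))  # → 31300 19200
```

With these made-up prices the bursty workload is cheaper on elastic storage and the steady one is cheaper on fixed capacity, which is the argument for hybrid models.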
Infrastructure Security
• End-to-end security
• Authentication and authorisation
• Developing a threat model
• Protecting credentials
• Usability of security
• Anonymised??
Infrastructure
• Federated identity and single sign-on
• Integration with existing infrastructures
• Accounting
• Securely...
• Anonymously?
• And billing
The Role of Standards
• Standards promote interoperation
• And maturity (sometimes)
• Interoperation solves problems
• Sometimes
• E.g. eggs and baskets
• Standards peer reviewed
Other Data Services
– iRODS – “data grid”
  Successor to SRB (Storage Resource Broker)
  Server-side workflows: rules, microservices
– Safety Deposit Box
  Commercial product from Tessella
  Data preservation
NGS data services
– NGS portal: https://portal.ngs.ac.uk/
– http://www.ngs.ac.uk/tools/vbrowser
– Databases: Oracle, MySQL
EU Funded Data Projects
– EUDAT (www.eudat.eu): collaborative iRODS-based infrastructure; multidisciplinary, scalable, long tail
– SCIDIP-ES (earth science): www.scidip-es.eu
– SCAPE (www.scape-project.eu)
– PANDATA (neutron/synchrotron): pan-data.eu
New Stuff?
– More mature approach to clouds?
– CCN – Content Centric Networking
– RAID --> ECC, “object” storage
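Reading “RAID --> ECC” as a move from RAID-style parity towards more general erasure coding (an interpretation, not something the slide spells out), the simplest case can be sketched with a single XOR parity block. It tolerates the loss of exactly one block; production systems typically use stronger codes such as Reed-Solomon. The function name and the sample blocks are illustrative.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(xor, column) for column in zip(*blocks))


data_blocks = [b"grid", b"data", b"file"]
parity = xor_blocks(data_blocks)  # stored alongside the data blocks

# Simulate losing the second block and rebuild it from the survivors.
survivors = [data_blocks[0], data_blocks[2], parity]
recovered = xor_blocks(survivors)
print(recovered)  # → b'data'
```

The same idea generalises: with k data blocks and m coded blocks, an erasure code lets any k of the k+m blocks rebuild the original, which is what makes it attractive for object stores spread over many servers.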
Exercises
– How is scientific data management different?
• How do research disciplines differ? How much can be shared?
• What are the interdisciplinary benefits?
– How do grids and clouds differ?
– Can we trust the grids/clouds?
– Who leads the way? HEP? Industry?
References
– www.ngs.ac.uk
– www.ogf.org
– UMD user guide https://edms.cern.ch/document/722398/
– GridPP storage and data management group
• http://www.gridpp.ac.uk/wiki/Grid_Storage