
EUDAT

EUDAT: Towards a pan-European Collaborative Data Infrastructure. Mark van de Sanden, SURFsara (Dutch National HPC center), The Netherlands. Workshop on the Future of Big Data Management, Imperial College, London, UK, 27-28 June 2013.


Presentation Transcript


  1. EUDAT: Towards a pan-European Collaborative Data Infrastructure. Mark van de Sanden, SURFsara (Dutch National HPC center), The Netherlands. Workshop on the Future of Big Data Management, Imperial College, London, UK, 27-28 June 2013

  2. Outline
     • Setting the Scene
     • Collaborative Data Infrastructure
     • EUDAT project
     • CDI Building Blocks

  3. Setting the Scene
     • Users: 1.3M researchers, 15M students, 500M citizens
     • Outlook for 2016-2020: EB/year, PB/day
     • [Chart: repository volume against number of repositories and variety, from large-volume repositories to the long tail of small data. Labels as extracted: 200 PB, ~25 PB/year, 10k, EB, ~10 PB/year, ~2-3 PB/year, 100 PB, 80 TB, ~TB/year, 20 TB, ~TB/year, 7 repositories, 1 TB, 30 repositories, 5 repositories.]

  4. Doing some math
     • 4,000 institutes * 10 repositories/institute * 5 TB/repository = 200 PB
     • 1.3M researchers sharing 50 GB each = 65 PB
     • 15M students sharing 1 GB each = 15 PB
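The estimates on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 PB = 1,000 TB = 1,000,000 GB), which is what the slide's totals imply; the function and variable names are illustrative and do not come from the presentation.

```python
# Decimal storage units (assumption: this is what the slide's totals imply).
TB_PER_PB = 1_000
GB_PER_PB = 1_000_000


def institutes_volume_pb(institutes, repos_per_institute, tb_per_repo):
    """Total volume in PB held by institutional repositories."""
    return institutes * repos_per_institute * tb_per_repo / TB_PER_PB


def shared_volume_pb(people, gb_per_person):
    """Total volume in PB when each person shares a fixed amount in GB."""
    return people * gb_per_person / GB_PER_PB


estimates = {
    "institutes": institutes_volume_pb(4_000, 10, 5),
    "researchers": shared_volume_pb(1_300_000, 50),
    "students": shared_volume_pb(15_000_000, 1),
}

for name, petabytes in estimates.items():
    print(f"{name}: {petabytes:g} PB")
```

The three results reproduce the 200 PB, 65 PB and 15 PB figures quoted on the slide.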

  5. Data trends: exponential growth (gigabytes, terabytes, petabytes, exabytes, zettabytes) combined with increasing complexity and variety
     • Where to store it?
     • How to find it?
     • How to make the most of it?

  6. Collaborative Data Infrastructure: a framework for the future?
     • Data generators and users: user functionalities, data capture & transfer, virtual research environments
     • Community Support Services: data discovery & navigation, workflow generation, annotation, interpretability
     • Common Data Services: persistent storage, identification, authenticity, workflow execution, mining
     • [Diagram labels alongside the layers: Trust, Data Curation]

  7. Six research communities on board
     • EPOS: European Plate Observing System
     • CLARIN: Common Language Resources and Technology Infrastructure
     • ENES: Service for Climate Modelling in Europe
     • LifeWatch: Biodiversity Data and Observatories
     • VPH: The Virtual Physiological Human
     • INCF: International Neuroinformatics
     All share common challenges:
     • Reference models and architectures
     • Persistent data identifiers
     • Metadata management
     • Distributed data sources
     • Data interoperability

  8. Data Centers and Communities

  9. Building Blocks of the CDI
     • Metadata Catalogue: aggregated EUDAT metadata domain; data inventory
     • AAI: network of trust among authentication and authorization actors
     • Data Staging: dynamic replication to HPC workspace for processing
     • Safe Replication: data curation and access optimization
     • Simple Store: researcher data store (simple upload, share and access)

  10. Where to store it? Safe Replication
     • Robust, safe and highly available data replication service for small- and medium-sized repositories
     • To guard against data loss in long-term archiving and preservation
     • To optimize access for users from different regions
     • To bring data closer to powerful computers for compute-intensive analysis
     • [Diagram: a community repository replicating to the stores of several data centers inside the EUDAT CDI domain of registered data, governed by PIDs and policy rules]

  11. Safe Replication
     • [Diagram: the iRODS rule DoReplication() runs between iRODS instances and calls msiDataObjRsync(), triggerCreatePID() and updateMonitor(); PID labels appear on both sides; storage back ends shown are GPFS, dCache, SAMQFS, HPSS and DMF]
     • iRODS rule shown on the slide:
       doReplication(*pid,*source,*destination,*status) {
           msiDataObjRsync(*source, "IRODS_TO_IRODS", "null", *destination, *rsyncStatus);
           triggerCreatePID("*collectionPath*child.pid.create",*pid,*destination);
           updateMonitor("*collectionPath*filepathslash.pid.update");
       }
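The rule on this slide performs three steps in order: synchronise the object to the destination, trigger PID creation for the replica, and update a monitor. The Python sketch below mirrors that sequence on local files only. It is not iRODS code and uses no iRODS API: `do_replication`, the dictionary used as a PID registry, the list used as a monitor log and the `hypothetical-prefix/` PID format are all assumptions made for illustration, and the checksum comparison is an added safeguard that does not appear in the rule.

```python
import hashlib
import shutil
import tempfile
import uuid
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def do_replication(source: Path, destination: Path,
                   pid_registry: dict, monitor_log: list) -> str:
    """Copy source to destination, register a PID for the replica, log it."""
    # Step 1: copy the object (stands in for msiDataObjRsync).
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)
    checksum = sha256_of(destination)
    if checksum != sha256_of(source):  # added check, not part of the slide's rule
        raise IOError(f"replica of {source} does not match the original")

    # Step 2: register an identifier for the replica (stands in for triggerCreatePID).
    pid = f"hypothetical-prefix/{uuid.uuid4()}"
    pid_registry[pid] = {"location": str(destination), "checksum": checksum}

    # Step 3: record the event (stands in for updateMonitor).
    monitor_log.append({"event": "replicated", "source": str(source), "pid": pid})
    return pid


# Usage: replicate one file between two directories standing in for two sites.
registry, log = {}, []
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "site_a" / "dataset.dat"
    original.parent.mkdir()
    original.write_bytes(b"example payload")
    replica = Path(tmp) / "site_b" / "dataset.dat"
    replica_pid = do_replication(original, replica, registry, log)
    replica_exists = replica.exists()
```

Keeping the PID registration after the verified copy means an identifier is never issued for a replica that failed to transfer, which is one plausible reason for the ordering used in the rule.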

  12. How to make the most of it? Data Staging
     • Support researchers in transferring large data collections from EUDAT storage to HPC facilities
     • Reliable, efficient, and easy-to-use tools to manage data transfers
     • Provide the means to re-ingest computational results back into the EUDAT infrastructure
     • [Diagram: data center stores in the EUDAT CDI domain of registered data connected to PRACE HPC systems]

  13. Data Staging
     • [Diagram: from the community portal the user starts a workflow, which calls DataStaging() with a PID; GO starts the transfers as GridFTP third-party transfers involving iRODS; the user can monitor the data flow]
     • Command-line usage shown on the slide:
       datastager.py [-h] [-d] [-p PATH] [-u USER] [-y YEAR] [-n NETWORK] [-c CHANNEL] [-s STATION] [-P PID] [-PF PIDFILE] [-U URL] [-UF URLFILE] [-t TASKID] [-pF PATHFILE] [--ss SRC_SITE] [--ds DST_SITE] [--sd SRC_DIR] [--dd DST_DIR] {in,out} {seed,pid,url,taskid}
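To make the usage line on this slide easier to read, the sketch below rebuilds an equivalent command-line interface with Python's standard `argparse` module. It is a reconstruction from the usage string only, not the actual `datastager.py` source: the `dest` names, the names `direction` and `mode` for the two positional arguments, and the placeholder values in the example call are assumptions, and the meaning of each flag is deliberately left undocumented because the slide does not state it.

```python
import argparse

# (flag, metavar) pairs copied from the usage line on the slide.
VALUE_OPTIONS = [
    ("-p", "PATH"), ("-u", "USER"), ("-y", "YEAR"), ("-n", "NETWORK"),
    ("-c", "CHANNEL"), ("-s", "STATION"), ("-P", "PID"), ("-PF", "PIDFILE"),
    ("-U", "URL"), ("-UF", "URLFILE"), ("-t", "TASKID"), ("-pF", "PATHFILE"),
    ("--ss", "SRC_SITE"), ("--ds", "DST_SITE"),
    ("--sd", "SRC_DIR"), ("--dd", "DST_DIR"),
]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the options listed in the usage string."""
    parser = argparse.ArgumentParser(prog="datastager.py")
    # -d takes no value in the usage line; its meaning is not given on the slide.
    parser.add_argument("-d", action="store_true", dest="d")
    for flag, metavar in VALUE_OPTIONS:
        # dest names are an assumption: the lower-cased metavar.
        parser.add_argument(flag, metavar=metavar, dest=metavar.lower())
    # Names of the two positionals are assumptions; only the choices are given.
    parser.add_argument("direction", choices=["in", "out"])
    parser.add_argument("mode", choices=["seed", "pid", "url", "taskid"])
    return parser


# Example invocation with placeholder values (not real PIDs or site names).
args = build_parser().parse_args(
    ["-P", "PLACEHOLDER/pid", "--ss", "site-a", "--ds", "site-b", "in", "pid"]
)
print(args.direction, args.mode, args.pid, args.src_site, args.dst_site)
```

The second positional argument appears to select which kind of input identifies the data (a seed, a PID, a URL or a task id), matching the option groups in the usage line; that reading is an inference from the usage string rather than something the slide states.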

  14. How to find it? Joint Metadata Service
     • Easily find collections of scientific data, generated either by various communities or via EUDAT services
     • Access those data collections through the given references in the metadata to the relevant data stores
     • A "Europeana of scientific data"
     • [Diagram: a metadata portal on top of community repositories and data center stores in the EUDAT CDI domain of registered data]

  15. Joint Metadata Service
     • [Architecture diagram with numbered flows 1-9. Components shown: community OAI metadata provider A and community non-OAI metadata provider B (communities named as examples: ENES, CLARIN, EPOS), both supplying XML metadata, provider B via ftp or another protocol; an OAI harvester; adapters for schema A and schema B; a raw metadata store; an indexer with Lucene/SOLR for full metadata content search; CKAN with PostgreSQL for browsing a limited set of (10?) facets; WWW access]
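The harvesting and adapter stages in this diagram can be illustrated with a small, network-free sketch. It assumes a provider exposing OAI-PMH with the standard `oai_dc` (Dublin Core) metadata format and shows how a per-schema adapter might reduce a record to a few facets for indexing. The endpoint URL, the sample record, the function names and the choice of facets are all hypothetical; the slide does not specify which facets or schemas EUDAT uses.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Dublin Core element namespace used inside oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL for a provider endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def adapt_dublin_core(xml_text: str, community: str) -> dict:
    """Map one Dublin Core record to a flat dictionary of facets."""
    root = ET.fromstring(xml_text.strip())

    def values(tag: str) -> list:
        return [el.text.strip() for el in root.iter(DC + tag) if el.text]

    titles = values("title")
    return {
        "community": community,
        "title": titles[0] if titles else None,
        "creator": values("creator"),
        "subject": values("subject"),
    }


# Hypothetical endpoint and record, for illustration only.
url = list_records_url("https://repository.example.org/oai")

SAMPLE_RECORD = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example model output</dc:title>
  <dc:creator>Example Author</dc:creator>
  <dc:subject>climate</dc:subject>
  <dc:subject>simulation</dc:subject>
</oai_dc:dc>
"""

facets = adapt_dublin_core(SAMPLE_RECORD, community="example-community")
print(url)
print(facets)
```

Using one adapter function per source schema keeps the indexer and the facet browser independent of each community's metadata format, which is the role the "Adapter schema A" and "Adapter schema B" boxes play in the diagram.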

  16. What about homeless and citizen scientists? Simple Store
     • Allow registered users to upload "long tail" data into the EUDAT store
     • Enable sharing objects and collections with other researchers
     • Utilize other EUDAT services to provide reliability and data retention
     • [Diagram: a Simple Store portal offering simple upload, simple metadata and PID registration, backed by data center stores in the EUDAT CDI domain of registered data]

  17. Simple Store
     • Create a user profile
     • Deposit a data object
     • Select a science domain
     • Fill in basic metadata on the basis of the science domain
     • A PID is created
     • [Diagram labels: Invenio, Replicate, PID]

  18. Contact: eudat-info@postit.csc.fi, mark.vandesanden@surfsara.nl
