1. EPP Grid Activities AusEHEP Wollongong
Nov 2004
2. Grid Anatomy What are the essential components
CPU Resources + Middleware (software – common interface)
Data Resources + Middleware
replica catalogues; unifying many data sources
Authentication Mechanism
Certificates (Globus GSI), Certificate Authorities – example below
Virtual Organisation Information Services
a Grid consists of VOs: users + resources participating in a VO
Who is a part of what research/effort/group
Authorisation for resource use
Job Scheduling, Dispatch, and Information Services
Collaborative Information Sharing Services
Documentation & Discussion (web, wiki,…)
Meetings & Conferences (video conf., AccessGrid…)
Code & Software (CVS, CMT, PacMan…)
Data Information (Meta Data systems)
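To make the authentication layer concrete: with Globus GSI a user generates a short-lived proxy from a CA-signed certificate, and each resource maps trusted certificate subjects to local accounts. A minimal sketch (the subject DN and account name below are made up):

$ grid-proxy-init
Your identity: /C=AU/O=APACGrid/OU=EPP/CN=Jane Physicist
Enter GRID pass phrase for this identity:
Creating proxy .................... Done
# on each resource, /etc/grid-security/grid-mapfile maps subjects to accounts:
"/C=AU/O=APACGrid/OU=EPP/CN=Jane Physicist" jphys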
3. 2nd Generation Accessible resources for Belle/ATLAS
We have access to ~120 CPUs (each over 2 GHz)
APAC, AC3, VPAC, ARC
currently 50% Grid accessible
Continuing to encourage HPC facilities to install middleware
We have access to the ANUSF petabyte storage facility
Will request ~100 TB for Belle data.
SRB (Storage Resource Broker)
Replica catalogue federating KEK/Belle, ANUSF, Melbourne EPP data storage
Used to participate in Belle’s 4×10⁹ event MC production during 2004
4. 2nd Generation SRB (Storage Resource Broker)
Globally accessible virtual file system
Domains of storage resources
eg. ANUSF domain contains the ANU petabyte storage facility and disk on Roberts in Melbourne
Federations of Domains
eg. ANUSF and KEK are federated
$ Scd /anusf/home/ljw563.anusf
$ Sls -l
$ Sget datafile.mdst
$ Scd /bcs20zone/home/srb.KEK-B
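Read in order: Scd changes the current SRB collection to the user’s ANUSF home, Sls -l lists its contents, Sget copies datafile.mdst to local disk, and the final Scd crosses into the federated KEK zone – the same commands work regardless of which physical store actually holds the files.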
5. Grid Anatomy What are the essential components
CPU Resources + Middleware
Data Resources + Middleware
replica catalogues; unifying many data sources
Authentication Mechanism
Globus GSI, Certificate Authorities
Virtual Organisation Information Services
a Grid consists of VOs: users + resources participating in a VO
Who is a part of what research/effort/group
Authorisation for resource use
Job Scheduling, Dispatch, and Information Services
Collaborative Information Sharing Services
Documentation & Discussion (web, wiki,…)
Meetings & Conferences (AccessGrid…)
Code & Software (CVS, CMT, PacMan…)
Data Information (Meta Data systems)
6. 3rd Generation Solutions NorduGrid -> ARC (Advanced Resource Connector)
Nordic countries plus others, such as Australia
We’ve used this for ATLAS DC2
Globus 2.4 based middleware
Stable, patched, and redesigned collection of existing middleware (Globus, EDG)
Grid 3 Middleware -> VDT
US based coordination between iVDGL, GriPhyN, PPDG
Globus 2.4 based middleware
LHC Computing Grid (LCG) <- EDG -> EGEE
Multiple Tiers: CERN T0, Japan/Taiwan T1, Australia T2 ?
Regional Operations Centre in Taiwan
Substantial recent development – needs to be looked at once again!
7. 3rd Generation Solutions Still a lot of development going on.
data aware job scheduling is still developing
VO systems are starting to emerge
meta-data infrastructure is basic
Deployment is still a difficult task.
runs only on prescribed systems/OS versions
8. Grid Anatomy What are the essential components
CPU Resources + Middleware
Data Resources + Middleware
replica catalogues; unifying many data sources
Authentication Mechanism
Globus GSI, Certificate Authorities
Virtual Organisation Information Services
a Grid consists of VOs: users + resources participating in a VO
Who is a part of what research/effort/group
Authorisation for resource use
Job Scheduling, Dispatch, and Information Services
Collaborative Information Sharing Services
Documentation & Discussion (web, wiki,…)
Meetings & Conferences (AccessGrid…)
Code & Software (CVS, CMT, PacMan…)
Data Information (Meta Data systems)
9. Virtual Organisation Systems Now there are 3 systems available
EDG/NorduGrid LDAP-based VO (sketch after this slide)
VOMS (VO Membership Service) from LCG
CAS (Community Authorisation Service) from Globus
In 2003 we modified NorduGrid VO software for use with the Belle Demo Testbed, SC2003 HPC Challenge (world’s largest testbed)
More useful for rapid Grid deployment than the above systems.
Accommodates Resource Owners security policies
resource organisations are part of the community
their internal security policies are frequently ignored/by-passed
Takes CAs into account
certificate authorities are a part of the community
a VO should be able to list the CAs it trusts to sign certificates
Compatible with existing Globus
Might be of use/interest to the Australian Grid community?
GridMgr (Grid Manager)
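To make the LDAP-based approach concrete, a rough sketch of pulling a VO membership list from a directory and feeding it into local authorisation; the server, branch, and attribute names are invented for illustration, not the actual GridMgr schema:

$ ldapsearch -x -H ldap://vo.example.org \
    -b "ou=people,o=belle,dc=vo" "(objectClass=voMember)" subjectDN
# each entry carries a certificate subject; a periodic job can append
# them to the resource's grid-mapfile, e.g.
"/C=AU/O=APACGrid/OU=EPP/CN=Jane Physicist" belle001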
10. Virtual Organisation Systems How do VOs manage internal priorities?
This problem has not yet become apparent!
This has been left up to local resource settings.
For non-VO resources, changes would require renegotiating allocations or configuration.
CAS is the only VO middleware to address this
done by VOs specifying policies allowing/denying access to resources
local resource priorities are not taken into account
difficult to predict the effect
VO-managed job queue (sketch below)
centrally managed VO priorities, independent of
locally managed resource priorities
resource job consumers “pull” jobs from the queue
the VO decides and can change which jobs are run first
prototype testing suggests a “fair-share” system could be used
users/groups are allocated a target fraction of all resources
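A sketch of the pull model above: the VO server orders the queue centrally (e.g. by fair-share, nudging each user/group towards its target fraction) while resources simply consume from the front. The vo-queue and run-job commands here are hypothetical:

# hypothetical resource-side agent; the loop ends when the queue is empty
$ while job=`vo-queue next --server vo.example.org`; do
>     run-job "$job"    # hand off to the local batch system
> done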
11. Grid Anatomy What are the essential components
CPU Resources + Middleware
Data Resources + Middleware
replica catalogues; unifying many data sources
Authentication Mechanism
Globus GSI, Certificate Authorities
Virtual Organisation Information Services
a Grid consists of VOs: users + resources participating in a VO
Who is a part of what research/effort/group
Authorisation for resource use
Job Scheduling, Dispatch, and Information Services
Collaborative Information Sharing Services
Documentation & Discussion (web, wiki,…)
Meetings & Conferences (AccessGrid…)
Code & Software (CVS, CMT, PacMan…)
Data Information (Meta Data systems)
12. Data Grid Scheduling Task -> Job1, Job2 ...
Job1 -> input replica 1, input replica 2 ...
Job1 + Input -> CPU resource 1 ...
How do you determine what+where is best?
13. Data Grid Scheduling What’s the problem?
Try to schedule wisely
free resources, close to input data, fewer failures
Some resources are inappropriate
need to parse and check job requirements and resource info (RSL – Resource Specification Language; sketch after this slide)
Job failure is common
error reporting is minimal
need multiple retries for each operation
need to try other resources in case of resource failure
eventually we stop and mark a job as BAD
What about firewalls
some resources have CPUs which cannot access data
Schedulers
Nimrod/G (parameter sweep, not Data Grid)
GridBus Scheduler (during 2003–2004 we aided them towards SRB support)
GQSched (prototype developed in 2002, used in 2003 demos)
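For reference, job requirements in Globus RSL look roughly like the sketch below; the scheduler must parse clauses like these and merge them with per-resource requirements before dispatch (the attribute values are illustrative):

&(executable="/home/jphys/bin/run-sim.sh")
 (arguments="run042.conf")
 (count=1)
 (maxWallTime=720)
 (stdout="run042.out")
 (stderr="run042.err")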
14. Data Grid Scheduling GQSched (Grid Quick Scheduler)
Idea is based around the Nimrod model (user driven parameter sweep dispatcher)
Addition of sweeps over data files and collections
Built in 2002 as a demonstration to computer scientists of simple data grid scheduling
Simple tool familiar to Physicists
Shell script, Environment parameters
“Data Grid Enabled”
Seamless access to data catalogues and Grid storage systems
Protocols – GSIFTP, GASS (and non-Grid protocols also HTTP, HTTPS, FTP)
Catalogues – GTK2 Replica Catalog, SRB (currently testing)
Scheduling based on metrics for CPU Resource – Data Resource combinations
previous failures of job on resource
“nearness” of physical file locations (replicas)
resource availability
Extra features
Pre- and Post-processing for preparation/collation of data and job status checks
Creation and clean-up of unique job execution area
Private network “friendly” staging of files for specific resources (3-stage jobs)
Automatic retry and resubmit of jobs
Reporting of file access errors and job errors
Merging of RSL requirements for Resources and Jobs
Automatic checking and creation of Grid proxy
File globbing for Globus file staging, GSIFTP, GTK2 Replica Catalog, SRB
Future features
Scheduling based on optimised output storage location
Dynamic parameter sweep and RSL specification (eg. splitting file processing via meta-data)
Job profile reporting (in XML format, partially tested)
Automatic proxy renewal or notification of expiry (for long term jobs)
“Job Submission Service mode” for ongoing tasks such as Belle MC production?
15. Grid Scheduling
$ gqsched myresources myscript.csh
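What might myscript.csh contain? A hedged sketch, assuming GQSched hands each sweep point to the job through environment variables – the GQ_* names below are invented for illustration, not GQSched’s documented interface:

#!/bin/csh
# one invocation per (input file, parameter) combination in the sweep
echo "processing $GQ_INPUT_FILE with seed $GQ_SEED"
basf < analysis.basf          # schematic BASF run over the staged input
# output left in the unique job area is collected by the dispatcher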
16. Grid Anatomy What are the essential components
CPU Resources + Middleware
Data Resources + Middleware
replica catalogues; unifying many data sources
Authentication Mechanism
Globus GSI, Certificate Authorities
Virtual Organisation Information Services
a Grid consists of VOs: users + resources participating in a VO
Who is a part of what research/effort/group
Authorisation for resource use
Job Scheduling, Dispatch, and Information Services
Collaborative Information Sharing Services
Documentation & Discussion (web, wiki,…)
Meetings & Conferences (AccessGrid…)
Code & Software (CVS, CMT, PacMan…)
Data Information (Meta Data systems)
17. Meta-Data System Advanced Meta-Data Repository
Advanced = Above and beyond file/collection-oriented meta-data
Data oriented queries…
List the files resulting from task X.
Retrieve the list of all simulation data of event type X.
How can file X be regenerated? (if lost or expired)
Other queries we can imagine…
What is the status of job X ?
What analyses similar to X have been undertaken?
What tools are being used for X analysis?
Who else is doing analysis X or using tool Y ?
What are the typical parameters used for tool X ? And for analysis Y ?
Search for data skims (filtered sets) that are supersets of my analysis criteria.
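To ground these queries: “list the files resulting from task X” only works if every file record points back to its producing task and parameters. A hypothetical XML record (the schema is invented for illustration):

<task name="mc04-evtgen" type="simulation">
  <parameter name="eventType" value="B0B0bar"/>
  <file lfn="/belle/mc04/evtgen-0042.mdst" type="mdst"
        producedBy="mc04.basf (seed 42)"/>
</task>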
18. Meta-Data System XML
some great advantages
natural tree structure
strict schema, data can be validated
powerful query language (XPath)
format is very portable
information readily transformable (XSLT)
some real disadvantages
XML databases are still developing, not scalable
XML DBs are based on lots of documents of the same type
would need to break the tree into domains, making queries difficult
LDAP
compromise
natural tree structure
loose schema but well defined
reasonable query features, not as good as XML’s
very scalable (easily distributed and mirrored)
information can be converted to XML with little effort if necessary
structure/schema is easily accessible and describes itself!
might be a way to deal with schema migration (meta-data structure is dynamic; need to preserve old information)
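The trade-off shows up directly in query syntax. Asking “which mdst files came from task mc04-evtgen?” against the hypothetical record sketched two slides back, in both systems (the LDAP layout is equally invented):

# XPath, over the XML tree:
#   /task[@name='mc04-evtgen']/file[@type='mdst']/@lfn
# LDAP equivalent:
$ ldapsearch -x -H ldap://mds.example.org \
    -b "taskName=mc04-evtgen,ou=metadata,o=belle" "(fileType=mdst)" lfn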
19. Meta-Data System Components
Navigation, Search, Management of MD
Task/Job/Application generated MD
Merging and Uploading MD
20. Meta-Data System Navigation and Creation via Web
Search is coming
21. How to use it all together? Getting set up
Certificate from a recognised CA (VPAC)
Accounts on each CPU/storage resource
ANUSF storage, VPAC, ARC (UniMelb), APAC
Install required software on resources (e.g. BASF)
Your certificate in the VO system
Running jobs
Find SRB input files, set up output collection
Convert your scripts to GQSched scripts
Run GQSched to execute jobs
Meta Data
Find/Create a context for your tasks (what you are currently doing)
Submit this with your job, or store with output
Merge context + output meta-data, then upload
NOT COMPLETE – need auto-generated MD from BASF/jobs
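Putting the steps above together, a complete session might look like this sketch (collection and file names are placeholders):

$ grid-proxy-init                        # authenticate via GSI
$ Sinit                                  # connect to SRB
$ Sls /belle/mc04/input                  # find input files
$ Smkdir /belle/mc04/output/myrun        # set up the output collection
$ gqsched myresources myscript.csh       # dispatch the jobs
$ Sput run-meta.xml /belle/mc04/output/myrun   # upload merged meta-data
$ Sexit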