Environmental eScience 2 Martin Dove, Martin Keegan, Stuart Ballard and Mark Calleja, National Institute for Environmental eScience and Department of Earth Sciences, University of Cambridge
Elements of eScience: a recap • Access to, and exploitation of, distributed computing resources: “grid computing” • Seamless, secure access to resources • Location-independent access to data • Well-described information about the data (metadata) • Secure access • Cross-institute collaborative environment: the concept of the “virtual organisation”
Grid computing, distributed data, and the virtual organisation
Flow of talks: where we are heading The aim is to create a virtual organisation with access to shared computing and data resources: • Security (certificates) • Tools for pooling local shared resources (Condor) • Middleware for grid computing (Globus) • Portals for data and computing • Distributed data (storage resource broker) • XML • Collaborative tools (Access Grid)
Computational grids • High-throughput vs high-performance computing: • Large single calculations, or many repeat calculations? • Parallelise the calculation or the study? • Do you have large memory requirements? • Do you need fast connections between processors? • Do you need all your results? Computational grids provide new opportunities for high-throughput calculations
Development of computational grids • Several original ideas: • Linking supercomputers to share large calculations • Using spare computer cycles to significantly increase the amount of useful computer time • Sharing resources leads to the virtual organisation
Condor http://www.cs.wisc.edu/condor The idea was developed from 1988 at the University of Wisconsin, building on the earlier “Remote Unix” project Condor arose from the transition from “mainframe computing” to “workstation computing” to “desktop computing”
Condor technologies Mature-ish technology to build small or large distributed computing systems from standard desktop computers • “Grabbing extra computer power from idle processors” • Ideal for “high throughput computing” rather than “high performance computing” • Can be used to control dedicated clusters as well as idle machines • Will handle heterogeneous systems
Condor technology: accessing idle computer power • Master node: handles job submission and returned results • Slave nodes: run the jobs
What does Condor offer? • The usual batch queuing, scheduling, resource management etc., for both serial and parallel tasks • Matches resources to requirements automatically • Handles transfer of jobs between machines (whether planned or forced by failure) using checkpointing • Users do not need individual login identification • Can be used for purpose-built clusters, or for office/lab resources • Recognises distributed ownership constraints
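To make the submission process concrete, here is a minimal sketch of a Condor submit description file for a serial parameter sweep. It is illustrative only: the executable md_calcite and the input/output file names are hypothetical, and a real pool may need extra requirements or file-transfer settings.

  # run_calcite.sub - minimal Condor submit description (illustrative sketch)
  universe    = vanilla              # ordinary serial job
  executable  = md_calcite           # hypothetical simulation binary
  arguments   = input_$(Process).dat
  output      = run_$(Process).out
  error       = run_$(Process).err
  log         = calcite.log          # Condor records job lifecycle events here
  should_transfer_files   = YES      # ship files to and from the execute machine
  when_to_transfer_output = ON_EXIT
  queue 20                           # 20 independent runs, $(Process) = 0..19

Submitting this with condor_submit run_calcite.sub queues 20 runs, and condor_q shows their progress: exactly the kind of high-throughput parameter sweep described above.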
Access to data on condor pool • A request is issued, which activates the relevant CGI script on the webserver • Queries for pool information are handled by the agent on the central manager • If files being produced by a job are required, a password-protected page gives access to the agent on the relevant machine; output is returned to the client’s web browser, or can be saved straight to disk [Diagram: external user → webserver (rock) → central manager]
UCL Windows Condor pool ~900 1 GHz P4 Windows machines in teaching clusters – mostly underused
UCL Windows Condor pool • Runs Windows Terminal Server • All more than 90% underutilised, and running 24/7… • We are building this into a large Condor pool with the UCL Information Systems group, to use as a massive distributed computing system • In 6 months we extracted 73 processor-years of computation • This has attracted interest from other universities: expect to see this model used on many campuses over the next few years
Example of Condor-based study Calcite undergoes an order–disorder phase transition at high temperature, involving rotations of the carbonate molecular ions. We have studied this in detail as a function of temperature over a range of pressures, using molecular dynamics simulations, and we used Condor on the UCL cluster to generate data for many temperatures
Example of grid-based study Calcite, CaCO3
Condor pools A Condor pool consists of compute resources on a single network, managed by a single central manager It is possible to link Condor pools together: “flocking” But in some senses Condor technologies capture only part of the idea of grid computing
The Globus project • A wider grid will include: • Resource sharing between institutes • Issues of security • Resource discovery • Tools for handling data as well as computations The Globus project grew out of the I-WAY demonstration of linking US supercomputers in 1995. However, it is not yet as mature as Condor
The Globus toolkit • The Globus toolkit provides a secure access point (using encryption and X.509 certificates) for underlying resources. • For example, in Cambridge one computer runs our Globus gatekeeper, and outsiders who want to use our facilities (e.g. the Condor pool) must submit jobs to it. • Globus authenticates both the user and the machine they are coming from. • It then passes the request on to the relevant local job scheduler. • Very simple to administer, though tricky to install. • Flaky in places, and lacks some useful functionality – it is still a tool in development
A simple Globus job • First start a proxy; this will service challenges to my identity:
tempo 1% grid-proxy-init
Your identity: /C=UK/O=eScience/OU=Cambridge/L=UCS/CN=martin dove
Enter GRID pass phrase for this identity:
Creating proxy ................................... Done
Your proxy is valid until: Sat Feb 14 03:10:40 2004
• Now run a command on a remote gatekeeper:
tempo 2% globus-job-run silica.esc.cam.ac.uk/jobmanager /bin/date
Fri Feb 13 15:14:16 GMT 2004
• Note that I didn’t have to specify my identity – that’s the proxy’s job
Limitations and solutions • Globus client commands are clunky for anything but basic requests. • Ideally we’d like to wrap them in a nice web interface (e.g. a portal, of which more later). • Condor comes with a client tool for submitting jobs to remote Globus gatekeepers, called Condor-G, which has many benefits, including job handling (e.g. failure recovery) • One Globus limitation is that when the remote job is meant for a Condor pool, you can’t get all your output back! • There are hacks around these limitations
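As a sketch of the Condor-G route just mentioned, the submit description below sends a job through a remote Globus gatekeeper. It assumes the “globus” universe syntax that Condor-G used at the time; the gatekeeper address is simply the one from the earlier example, and the output file name is hypothetical.

  # date_remote.sub - Condor-G submit sketch (illustrative)
  universe        = globus
  globusscheduler = silica.esc.cam.ac.uk/jobmanager   # remote Globus gatekeeper
  executable      = /bin/date
  transfer_executable = false      # run the binary already present on the remote side
  output          = date_remote.out
  log             = date_remote.log
  queue

Submitted with condor_submit, Condor-G then manages the Globus interaction, including recovery from failures, which is the job-handling benefit referred to above.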
Workflows with DAGMan • OK, so we can submit jobs anywhere on our grid. Can we make workflows out of them? • An example of a workflow would be: run jobs A and B, and use the results of these jobs in order to run job C • Hence my task is made up of many smaller, inter-dependent jobs • DAGMan (Directed Acyclic Graph Manager) is a meta-scheduler for Condor. It allows workflows to be submitted using the usual Condor/Condor-G client tools.
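For the A/B/C example above, a DAGMan input file might look like the following sketch; the submit-file names are hypothetical, but JOB and PARENT/CHILD are the standard DAGMan keywords.

  # abc.dag - run A and B first, then C on their combined results
  JOB  A  jobA.sub
  JOB  B  jobB.sub
  JOB  C  jobC.sub
  PARENT A B CHILD C

The workflow is submitted with condor_submit_dag abc.dag; DAGMan itself runs as a Condor job and releases C only once A and B have both completed successfully.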
eMinerals minigrid [Diagram: hierarchy of Grid Information Index Services (GIIS) and Grid Information Resource Services (GRIS) linking the UK e-Science Grid and the e-Minerals minigrid, with the UK e-Science GIIS above the e-Minerals GIIS and site-level servers at UCL, Cambridge and CCLRC]
XML for exchange of data Exchange of data between programs is a major problem Example: search in program output for values of temperature at each time step in a simulation, where the number of lines between each time step output may not be constant Parsing is one solution, but is very messy XML (eXtensible Markup Language) is one strong emerging solution
XML for exchange of data The idea of XML is to use tags to describe the data – elements with attributes. Example:
<lecture_list>
  <Friday>
    <lecturer name="Martin Dove"/>
  </Friday>
</lecture_list>
<?xml version="1.0" standalone="yes"?>
<cml>
  <metadataList>
    <metadata name="version" value="SIESTA 1.3 -- [Release] (30 Jul 2003)"/>
    <metadata name="Arch" value="intel-nolibs"/>
    <metadata name="Flags" value="ifc -tpp5 -O2 -w -mp -Vaxlib -O"/>
  </metadataList>
  <step type="CG">
    <lattice dictRef="siesta:lattice" spaceType="real">
      <latticeVector units="bohr" dictRef="cml:latticeVector">75.589 0.000 0.000</latticeVector>
      <latticeVector units="bohr" dictRef="cml:latticeVector">0.000 75.589 0.000</latticeVector>
      <latticeVector units="bohr" dictRef="cml:latticeVector">0.000 0.000 75.589</latticeVector>
    </lattice>
    <molecule>
      <atomArray>
        <atom elementType="C" id="a1" x3="9.14000000" y3="4.13600000" z3="0.00000000"/>
        <atom elementType="C" id="a2" x3="10.41000000" y3="3.40700000" z3="0.00000000"/>
        <atom elementType="C" id="a3" x3="10.41000000" y3="2.07300000" z3="0.00000000"/>
        .....
        <atom elementType="Cl" id="a21" x3="11.90200000" y3="1.20900000" z3="0.00000000"/>
        <atom elementType="Cl" id="a22" x3="11.90200000" y3="4.27000000" z3="0.00100000"/>
      </atomArray>
    </molecule>
    <propertyList>
      <property dictRef="siesta:Eions"> <scalar units="eV">7375.317349</scalar> </property>
      <property dictRef="siesta:Ena"> <scalar units="eV">1654.382113</scalar> </property>
      <property dictRef="siesta:Ekin"> <scalar units="eV">2401.682384</scalar> </property>
      <property dictRef="siesta:Enl"> <scalar units="eV">10.825807</scalar> </property>
      <property dictRef="siesta:DEna"> <scalar units="eV">0.000009</scalar> </property>
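As a simple illustration of why tagged output is easier to consume than free-format text, the sketch below uses Python’s standard xml.etree.ElementTree to pull every named property out of a file calcite.xml, assumed to hold a complete, well-formed version of the CML fragment above; no assumptions are needed about line positions or spacing.

  import xml.etree.ElementTree as ET

  # calcite.xml is a hypothetical file containing a complete version of the CML above
  tree = ET.parse("calcite.xml")
  root = tree.getroot()

  # Walk every <property> element and report its dictRef label, value and units
  for prop in root.iter("property"):
      scalar = prop.find("scalar")
      if scalar is not None:
          print(prop.get("dictRef"), scalar.text, scalar.get("units"))

Contrast this with the search-and-count-lines approach described earlier, which breaks as soon as the program changes its output layout.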
Why XML? • XML is free, open and standard • XML is simple and extensible • XML files can be validated • Check that a file contains the correct information • Specify that parameter values are “sensible”, or provide default values • XML can be transformed into other formats (e.g. HTML) • XML is modular (namespaces) • Integrate XHTML, MathML, anyML, SVG seamlessly in the same document, without breaking any software.