Experiences with a distributed patch certification
Presenter: John Walsh
Location: PIC, Barcelona, ES
Motivations
• My view of Testing and Certification
  • Should adhere to general scientific principles and methods
  • ‘Deployment and Testing’ is an ‘experiment’
  • Must be independently repeatable
  • Results must be independently reproducible
EGEE Testbed types
• Testbed Types
  • Multi Site TB
    • Wide area network
    • Medium to Large Scale deployments
    • Must be highly coordinated
    • “Controlled” environment difficult
    • A single service may make the whole TB unusable for periods
  • Single Isolated TB
    • Generally small scale, limited external access
    • Can replicate components / simulate conditions of the “real world”
    • Does not reproduce all conditions of the “real world”
    • Need not reproduce complete infrastructure
    • Highly controlled, fewer variables
    • Single tester can control environment
    • SAM integration difficult, but SAM standalone possible
TCD Testbeds
• TCD runs a medium scale set of Isolated TBs
  • Xen extensively used
  • Non-trivial setup
• Isolated ELgrid, e-Learning grid:
  • Replicates core Grid-Ireland infrastructure
  • 18 sites with 4 WNs each, and national services
  • Look and feel without impacting on production services
• Isolated TestGrid, allows multiple Grid Infrastructures:
  • Certification infrastructure:
    • 18 Grid-Ireland sites, 4 WNs each, and national services
    • Tests Quattor profile changes
    • Quality control before deployment on production
  • R-GMA testing infrastructure: WMS, R-GMA, 1 site, 4 WNs
  • Experimental and Porting infrastructure: >150 nodes, multiple sites
  • TestGrid allows mixed public and private network address spaces
TCD infrastructure
[Figure: R-GMA TB and Certification TB]
R-GMA testbed
[Figure: Example R-GMA Certification TB. R-GMA registry, R-GMA MON, and a Xen hypervisor hosting VM1: WMS, VM2: CE, VM3: UI, VM4: SE, VM5-n: WNs]
• Implements core set of service nodes
  • Top level
    • R-GMA registry/browser/schema
    • gLite WMS (Xen)
  • Site
    • R-GMA site mon
    • gLite UI (Xen)
    • gLite CE + site BDII + torque (Xen)
    • gLite Classic SE (Xen)
    • >2 WN (Xen)
• Installation via YAIM
• Quattor in catchup mode (even on Production)
• 5 TB Fileserver for image backups
Simple Install Procedure
• Xen Nodes
  • Basic SL3 image (copied from repository)
    • Java 1.4.2
    • NTP
    • Minimal network settings
    • APT
    • Basic SL repository
• For each node
  • Install latest (certified) YAIM
  • Central YAIM configuration
    • Defines Basic Site Configuration
    • 3 way diff can check for changes in configurations
  • Each node configured as per type
    • WMS requires extra Condor repository
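The 3 way diff of configurations mentioned above could be sketched roughly as follows. This is only an illustration, not the tooling used on the testbed: it assumes the YAIM configuration is a flat file of KEY=value lines, and the function names `parse_site_info` and `three_way_diff` are hypothetical.

```python
def parse_site_info(text):
    """Parse KEY=value lines, skipping blank lines and # comments.

    Assumption: the configuration is flat KEY=value text; shell logic
    inside a real configuration file is not handled.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def three_way_diff(base, ours, theirs):
    """Classify every key by which side changed it relative to base."""
    report = {"ours_only": {}, "theirs_only": {}, "both_same": {}, "conflict": {}}
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            # Both sides agree; record it only if they differ from base.
            if o != b:
                report["both_same"][key] = o
        elif o == b:
            report["theirs_only"][key] = t
        elif t == b:
            report["ours_only"][key] = o
        else:
            # Both sides changed the key, and in different ways.
            report["conflict"][key] = (b, o, t)
    return report
```

Here `base` would be the last certified configuration, `ours` the central testbed copy, and `theirs` the configuration shipped with the new release; only the `conflict` entries need a manual decision.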
Simple Upgrade Procedure
• A nice feature is the way images can be used
  • Each node image should be copied to the backup server
    • Known (good) state
    • Rollback possible
  • Images can then be used to instantiate nodes very quickly
• Can prepare siteInfo.def off-line and copy it to the node
• Do YAIM install
  • Fixes up repos in /etc/apt/sources.list.d/lcg.list
  • Problems?
    • Raise the problem in the Savannah patch discussion
• Do YAIM configure
  • Problems?
    • As above
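The store and rollback step for node images could look something like the sketch below. It is an assumption-laden illustration, not the procedure used at TCD: it treats each node image as a single file, uses a local directory in place of the backup server, and adds a SHA-256 sidecar file so that a corrupted backup is detected before rollback. The names `backup_image` and `rollback_image` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_image(image, backup_dir):
    """Copy a known-good image to backup_dir and record its checksum."""
    image = Path(image)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / image.name
    shutil.copy2(image, target)
    checksum_file = backup_dir / (image.name + ".sha256")
    checksum_file.write_text(sha256_of(target) + "\n")
    return target


def rollback_image(image, backup_dir):
    """Restore an image from backup_dir after verifying its checksum."""
    image = Path(image)
    backup_dir = Path(backup_dir)
    saved = backup_dir / image.name
    expected = (backup_dir / (image.name + ".sha256")).read_text().strip()
    if sha256_of(saved) != expected:
        raise ValueError(f"backup of {image.name} failed its checksum")
    shutil.copy2(saved, image)
    return image
```

The node should be shut down before its image is copied in either direction; the sketch does not check for that.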
R-GMA certification
• Hey presto, now have a TB
• R-GMA testbed can only be used for testing:
  • Correct behaviour of YAIM
  • That a patch fixes its target problem
  • Basic R-GMA components:
    • rgma-client-check OK
    • rgma-server-check (mon and reg) OK
    • Daemon startup scripts OK
    • Basic R-GMA testsuite OK
  • That the R-GMA daemons are stable(?)
  • Whether there are any new tests that can be added to TestSuite
  • A new SAM test(?)
Stability
• R-GMA stability
  • Tomcat daemon can take days to become unstable
  • How stable are the components R-GMA depends on?
    • MySQL, Java JDBC connectors, etc.
  • Is the default configuration OK?
    • Can it be improved?
• Stress testing is vital
• Should attempt to keep stats on system and component behaviour
  • Memory usage (any leaks?)
  • Disk usage, number of files, etc.
  • File descriptors (any descriptors leaking?)
  • Log files OK?
    • Rotation policy OK?
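One possible way to keep such stats is sketched below. It is a minimal illustration under stated assumptions: it reads the Linux /proc filesystem, so it only works on Linux and only for processes the caller is allowed to inspect, and the leak heuristic in `steadily_growing` is a deliberately crude, hypothetical rule rather than an established test.

```python
import os


def sample_process(pid):
    """Return (resident memory in kB, open file descriptor count) for pid.

    Assumption: Linux /proc layout. rss_kb is None when the process has
    no VmRSS entry (for example, a kernel thread).
    """
    rss_kb = None
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                rss_kb = int(line.split()[1])
                break
    fd_count = len(os.listdir(f"/proc/{pid}/fd"))
    return rss_kb, fd_count


def steadily_growing(samples, min_growth):
    """Flag a series that never decreases and grows by at least min_growth.

    A value that only ever climbs over days of sampling (memory or file
    descriptors) is a hint of a leak, not proof of one.
    """
    if len(samples) < 2:
        return False
    never_drops = all(later >= earlier
                      for earlier, later in zip(samples, samples[1:]))
    return never_drops and (samples[-1] - samples[0]) >= min_growth
```

Since the Tomcat daemon can take days to become unstable, samples would need to be collected at intervals (for instance from cron) over a comparable period and stored for later comparison.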
Patch Problems
• Patch may introduce a new problem
  • Important to discuss with the developers and within SA3
• Issues involved
  • Evaluation of problem
  • Will applying this patch cause more problems than it solves?
  • Will it become a showstopper?
Summary
• Isolated testbed experience has been positive
  • Xen lessens hardware costs
  • Can create custom TBs on demand
  • Large range of testing scenarios possible
  • Extra layer of quality control
• Non-trivial setup
  • But once completed it becomes a good scientific testbed
  • Requires extra infrastructure nodes to be installed
• Simple store/test/rollback procedure
• Isolated testbed does not capture all scenarios
  • Scaling of tests may not always be possible
  • In future intend to add network emulation to help
• PPS plays critical post-certification role
TestGrid Simple CA
• Many nodes are (re)installed repeatedly
  • Host certificates must be securely saved
  • Copy to chosen media and store safely
• TestGrid now uses a simpleCA for private network nodes
  • Allows greater flexibility in generating certs
  • CA controlled by small team of administrators
  • Does not require standard cert issuing procedure
  • Faster turnaround on cert generation
• Certificates cannot be used outside of the Testbed environment
  • Namespace is disjoint from the EUGridPMA namespace
• Initial overhead in setting up simpleCA
  • Learning curve
  • Best set up with a local CA expert
  • Extra RPM for the simpleCA deployed on required nodes