Database Operations
Elizabeth Gallas - Oxford
ADC Weekly, September 13, 2011
Overview
• Brief notes
• Oracle 11g validation
• ATLR
• Replication
• User incidents (since S&C Week)
• Frontier
• ADCR
Elizabeth Gallas - Databases
Brief Notes
• LFC migration
  • See Graeme’s talks …
• ATLARC / TAG Services
  • Popular: Event Picking & other TAG Services/Reports
  • Increasing requests for queries/cross checks using TAG DB
• AMI Database Master Server: issues at Lyon late in July; full recovery, no data loss (early August)
• DBA issue help: DQ2, Panda, DDM, AKTR, AGIS …
  • Indexing
  • Query optimization
  • Development improvements
• AGIS Schema
  • Running in production mode on integration (INTR) server; needs to move to production ASAP
• Oracle 11g testing
Oracle 11g Validation
• All production DBs will upgrade to Oracle 11g
  • Scheduled: very early January 2012
• Testing reduces risks!
  • Participation of developers is essential
  • DBAs & resources ready to help (platforms available since late May)
• DBAs initiated validation campaign in August
  • As announced in Roman’s talk (S&C Week, July)
• ATLARC may upgrade to 11g in October 2011
  • Take early advantage: features, performance improvements
• Latest was summarized yesterday in Gancho’s talk at the ADC Development meeting: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/DBOpsValidation11g
ATLR Status
… August: no holiday … DB usage is “evolving” (growing) …
• Developers finding increased utility for Conditions data
  • We have powerful tools to access this data
  • People using it in new ways, a great thing!
• Release 17: increased DB access
  • Studying logs to quantify differences
• Tier-0: increased capacity … other bottlenecks loosened (file staging) …
  Database access now limiting Tier-0 job throughput
  Recent Technical Stop used for testing Frontier usage by Tier-0 (coordinated with Frontier experts)
  • No problems using CERN Frontier; improved DB access time
  • BUT: some jobs had more DB retrievals for MUONALIGN
  • (See Hans’ talk in ADC Development meeting yesterday)
• Trigger Reprocessing:
  • Early August: bug (improper disconnects) caused problems: fixed
  • Currently: Trigger experts speeding up validation cycle
  • Use OFFSITE resources (Tier-1s): timescale: ASAP
  • Development effort to later (also) use Frontier: test “in the next month”
Oracle Streams
• Recent request to run Trigger Reprocessing at BNL
  • Need to export ATLAS_CONF_TRIGGER_REPR to BNL
• Decided to add to Oracle Streams
  • By default, it will go to all Tier-1s
  • Added benefit … available if/when these jobs use Frontier
• Steps: adding this Schema to Oracle Streams
  • Must ensure stability of all schemas under replication https://twiki.cern.ch/twiki/bin/view/Atlas/DatabaseSchemasUnderReplication
  • This Schema: 200 MB (not a volume issue)
  • Owner account locking
  • Trigger expert (Joerg) working with DBAs:
    • Small schema changes required to meet requirements
• If all goes according to plan, intervention this week to add this Schema to the replication to all Tier-1s
  • Wednesday 10:00 – 12:30
  • Requires replication to be stopped during intervention
Incidents: User Access to Conditions
2 Frontier crashes at CERN Frontier site in 1 week
• Follow up: Users – working independently on different projects
  • Developer: looking into SCT noise
  • Developer: adding info to Lumi Data Summary Metadata Reports
• Why did Frontier crash? Under investigation (memory issue?)
Frontier “load” last week: “intense queries” from L1 Calo studies
• Query time usually <2 sec, these were 20-30 seconds
• Follow up with developer
  • Query is a reasonable request
  • Executed in reasonable time given nature of request
  • Look for ways to improve queries
Raise number of Frontier DB connections from 10 to 20
Additional Notes:
Incidents: reasoning behind dedicated Frontier launchpad for Tier-0
• Incidents NOT a problem on Oracle side, just for Frontier
• Tracking down these issues reflects a lot of improvements in Frontier monitoring and understanding of Frontier logging
  • An ongoing effort
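The contrast above between usual query times (under 2 seconds) and the 20-30 second queries could be checked with a small filter over timing records. The following is only a minimal sketch: the record layout (a label plus elapsed seconds), the sample values, and the use of 2 seconds as a cutoff are all assumptions for illustration, not the Frontier log format or monitoring tooling.

```python
from typing import Iterable, List, Tuple

# Cutoff taken from the slide's "usually <2 sec"; using it as a hard
# threshold is an assumption made for this illustration.
USUAL_MAX_SECONDS = 2.0


def flag_slow_queries(
    timings: Iterable[Tuple[str, float]],
    threshold: float = USUAL_MAX_SECONDS,
) -> List[Tuple[str, float]]:
    """Return (label, seconds) records at or above the threshold, slowest first."""
    slow = [(label, secs) for label, secs in timings if secs >= threshold]
    return sorted(slow, key=lambda record: record[1], reverse=True)


# Hypothetical sample records, invented purely for illustration.
sample = [
    ("conditions_lookup", 0.4),
    ("l1calo_study_a", 24.0),
    ("l1calo_study_b", 30.0),
    ("lumi_report", 1.5),
]
slow = flag_slow_queries(sample)
```

Sorting slowest first puts the queries most worth following up with developers at the top of the list.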
Tier-1s / Frontier Status
• Oracle+Frontier servers:
  • RAL, Lyon, KIT, BNL, TRIUMF and CERN
• Frontier Meetings: Aug 11, Aug 25, Sep 9 https://www.racf.bnl.gov/docs/services/frontier/meetings/minutes
  • Skipping weeks with Tier-1 Service Coordination meetings
• Current failover strategy:
  • Some Frontier launchpads still not open (as recommended)
  • Frontier fail-over only to sites with open access configuration and resilient server deployment
• Need updated Frontier https://savannah.cern.ch/bugs/index.php?86408
  • Needed for failover to work
  • Was thought NOT to be urgent … changed our minds … when specific sites had issues / hurricanes … raise urgency
  • To be included in LCG 60(d)
• Improving Frontier Monitoring and follow up on frequent/intense queries
  • Still work and investigations to be done; takes time
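The failover rule stated above (fail over only to sites with open access configuration and resilient server deployment) amounts to a simple filter. Here is a minimal sketch of that rule, assuming a hypothetical `Launchpad` record with two boolean flags; the site names are placeholders and say nothing about the actual configuration of any real Tier-1.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Launchpad:
    """Hypothetical description of one Frontier launchpad site."""
    site: str
    open_access: bool           # launchpad reachable from other sites
    resilient_deployment: bool  # e.g. more than one server behind it


def failover_candidates(launchpads: List[Launchpad], home_site: str) -> List[str]:
    """Sites other than home_site that satisfy both failover conditions."""
    return [
        lp.site
        for lp in launchpads
        if lp.site != home_site and lp.open_access and lp.resilient_deployment
    ]


# Placeholder sites, invented for illustration only.
launchpads = [
    Launchpad("SITE_A", open_access=True, resilient_deployment=True),
    Launchpad("SITE_B", open_access=False, resilient_deployment=True),
    Launchpad("SITE_C", open_access=True, resilient_deployment=False),
    Launchpad("SITE_D", open_access=True, resilient_deployment=True),
]
candidates = failover_candidates(launchpads, home_site="SITE_A")
```

Both conditions are required, so a site that is open but has a single server (SITE_C here) is excluded just like a closed one.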
ADCR Status
• ADCR Database
  • Early August:
    • Alerts of storage and Oracle ASM problems.
    • Made controlled switch to standby hardware.
  • Added to standby for robustness, capacity:
    • 2 storage arrays
    • 3rd node
  • Current status:
    • SR open to Oracle on primary hardware - in progress.
From Gancho: ADCR on standby hardware … performing better …
Doubling of buffer pool cache (now 13 GB) thus fewer IOPS …
Adding 2 storage arrays: ADCR has 72 disks (instead of 4 arrays = 48 disks)
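The hardware figures quoted from Gancho are mutually consistent, which a few lines of arithmetic can confirm. This sketch assumes every storage array holds the same number of disks (the slide does not say so explicitly) and reads "doubling" of the buffer pool cache as exactly a factor of two.

```python
# Figures from the slide.
arrays_before = 4
disks_before = 48
arrays_added = 2
buffer_pool_now_gb = 13.0

# Assumption: all arrays hold the same number of disks.
disks_per_array = disks_before // arrays_before      # 48 / 4 = 12
arrays_after = arrays_before + arrays_added          # 4 + 2 = 6
disks_after = arrays_after * disks_per_array         # 6 * 12 = 72

# Assumption: "doubling" means exactly a factor of two.
buffer_pool_before_gb = buffer_pool_now_gb / 2       # 6.5 GB

print(f"{arrays_after} arrays x {disks_per_array} disks = {disks_after} disks")
print(f"buffer pool: {buffer_pool_before_gb} GB -> {buffer_pool_now_gb} GB")
```

Under these assumptions the result matches the 72 disks stated on the slide, a 50% increase in spindles over the original 48.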