PS1 Prototype Systems Design
Jan Vandenberg, JHU
Early PS1 Prototype
Engineering Systems to Support the Database Design • Raw data size • Index size • Most end-user operations are I/O-bound • Loading/ingest is more CPU-bound, though we still need solid write performance • Time to do full table scans • Time to do index scans • Need to do most work where the data is; can’t sling TBs over the network quickly • …though we can brute-force past 1 Gbit Ethernet if necessary
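The scan-time bullets above are driven by simple arithmetic: raw table size divided by aggregate sequential-read bandwidth. A minimal sketch of that back-of-envelope estimate, using illustrative numbers rather than actual PS1 figures:

```python
# Back-of-envelope full-table-scan time: table size over aggregate
# sequential-read bandwidth. Numbers below are illustrative only.

def scan_time_hours(table_tb: float, agg_read_mb_s: float) -> float:
    """Hours for one full sequential pass over a table."""
    table_mb = table_tb * 1024 * 1024
    return table_mb / agg_read_mb_s / 3600

# e.g. a hypothetical 10 TB table at 1 GB/s aggregate read bandwidth:
print(f"{scan_time_hours(10, 1024):.1f} h")  # ~2.8 h
```

This is also why most work has to happen where the data is: the same 10 TB pushed over 1 Gbit Ethernet (~120 MB/s) takes roughly a day per pass.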
Fibre Channel, SAN • Expensive but not-so-fast physical links (4 Gbit, 10 Gbit) • Expensive switch • Potentially very flexible • Industrial strength manageability • Little control over RAID controller bottlenecks
SATA • Fast • Cheap • Ugly, spooky • <cabling pic> • Tough to manage • <dlmsdb/sdssdb drive bay map>
SAS • For our purposes, it’s SATA without the ugliness • Fast: 12 Gbit/s FD building blocks • Cheap: PS1 prototype MD1000 pricing versus Newegg media costs • Not Ugly: IB cables versus rats’ nest • Industrial strength manageability: pretty blinking lights and mgmt apps versus downtime plus white knuckles • <cabling pic>
SAS Performance, Gory Details • SAS v. SATA differences
Per-Controller Performance • Luckily, one controller is fast enough for one SATA disk box • <performance chart>
Resulting PS1 Prototype I/O Topology • <topo diagram> • <aggregate performance chart>
RAID-5 v. RAID-10? • Primer, anyone? • RAID-5 probably feasible with contemporary controller… • …though tough to predict real-world effects of latency… • …and not a ton of redundancy • But after we add enough disks to meet performance goals, we have enough storage to run RAID-10 anyway! • Remember sub-Newegg media costs
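As a primer for the capacity trade-off above: generic RAID arithmetic for an n-disk array of equal-size drives (nothing controller-specific is assumed here).

```python
# Usable capacity for RAID-5 vs RAID-10 on n equal-size disks.

def raid5_usable_tb(n_disks: int, disk_tb: float) -> float:
    # One disk's worth of parity spread across the set;
    # survives any single disk failure.
    return (n_disks - 1) * disk_tb

def raid10_usable_tb(n_disks: int, disk_tb: float) -> float:
    # Striped mirror pairs: half the raw capacity;
    # survives one failure per mirror pair.
    return (n_disks // 2) * disk_tb

# e.g. 12 x 1 TB drives:
print(raid5_usable_tb(12, 1.0), raid10_usable_tb(12, 1.0))  # 11.0 6.0
```

The slide's point is that once the disk count is sized for spindle throughput rather than capacity, the roughly 2x capacity penalty of RAID-10 stops mattering.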
RAID-10 Performance • Executive summary: half of RAID-0 performance for single-threaded reads, full RAID-0 performance for 2-user/2-thread workloads, half of RAID-0 performance for writes
PS1 Prototype Servers • <diagram of server roles plus storage and network interconnects>
PS1 Prototype Servers • <iron photo (w/Will?)>
Projected PS1 Systems Design • <diagram of 8-slice triply-replicated systems> • <plus geoplex?>
Backup/Recovery/Replication Strategies • No formal backup • …except maybe for MyDBs, depending on cost and policy • 3-way replication • Replication != backup • Little or no history • Replicas can be a bit too cozy: must notice badness before replication propagates it • Replicas provide redundancy and load balancing… • Fully online: zero time to recover • Replicas needed for happy production performance plus ingest anyway • Off-site geoplex • Provides continuity if we lose HI (local or trans-Pacific network outage, facilities outage) • <lava pic?> • Could help balance trans-Pacific bandwidth needs (serve continental traffic locally)
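The "replicas can be a bit too cozy" bullet is a timing constraint: a bad load reaches every copy unless it is detected within the replication delay. A toy check of that window (delays in hours; these are assumptions, not measured PS1 values):

```python
# Does a mistake get caught before replication propagates it?
# All numbers are hypothetical.

def badness_contained(detection_h: float, replication_delay_h: float) -> bool:
    """True if the damage is noticed before it replicates."""
    return detection_h < replication_delay_h

print(badness_contained(2, 24))   # caught inside the window
print(badness_contained(48, 24))  # already on every replica
```

This is also why replication cannot be too aggressive: shrinking the delay shrinks the detection window.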
Why No Traditional Backups? • Not super pricey… • …but not very useful relative to a replica for our purposes • Time to recover • Money no object… do traditional backups too!!! • Synergy, economy of scale with other collaboration needs (IPP?)… do traditional backups too!!!
Failure Scenarios • Easy, zero-downtime: • Disks • Power supplies • Fans • Not so spooky, maybe some downtime and manual replica cutover: • System board (rare) • Memory (rare and usually proactively detected and handled via scheduled maintenance) • Disk controller (rare, potentially minimal downtime via cold-spare controller) • CPU (not utterly uncommon, can be tough and time-consuming to diagnose correctly) • More spooky: • Database mangling by human or pipeline error • Gotta catch this before replication propagates it everywhere • Can’t replicate too aggressively • (and so off-the-shelf near-realtime replication tools don’t help us) • Catastrophic loss of datacenter • Have the geoplex • …but we’re dangling by a single copy until recovery is complete • …but are we still screwed? Depending on colo scenarios, did we also lose the IPP and flatfile archive? • Terrifying: • Unrecoverable badness fully replicated before detection • Catastrophic loss of datacenter without geoplex • Can we ever catch back up with the data rate if we need to start over? • At some point in the survey, the answer likely becomes “no”.
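The "can we ever catch back up?" question reduces to rates: if the survey produces data at rate r and we can reload at rate l, clearing a backlog of B takes B / (l - r), and becomes impossible once l <= r. A sketch with hypothetical numbers:

```python
# Catch-up feasibility after starting over. All rates hypothetical.

def catchup_days(backlog_tb: float, load_tb_day: float,
                 survey_tb_day: float):
    """Days to clear the backlog, or None if we can never catch up."""
    net = load_tb_day - survey_tb_day
    if net <= 0:
        return None  # loading no faster than the survey writes
    return backlog_tb / net

print(catchup_days(100, 5, 1))  # 25.0 days
print(catchup_days(100, 1, 1))  # None: the answer has become "no"
```

As the accumulated backlog B grows over the survey, even a healthy load-rate margin eventually implies an unacceptably long recovery, which is the slide's closing point.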
State Diagram for Replicas? • Loading • Replicating • Load balancing • Failing • Recovering • Possibly repeat-loading
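The replica life cycle above can be sketched as a small state machine. The state names come from the slide; the transitions are assumptions for illustration:

```python
# Replica life-cycle state machine. States from the slide;
# allowed transitions are assumed, not specified by the deck.

ALLOWED = {
    "loading":        {"replicating", "failing"},
    "replicating":    {"load-balancing", "failing"},
    "load-balancing": {"failing"},
    "failing":        {"recovering"},
    "recovering":     {"loading"},  # possibly repeat-loading
}

class Replica:
    def __init__(self):
        self.state = "loading"

    def transition(self, new_state: str) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state} -> {new_state} not allowed")
        self.state = new_state

r = Replica()
r.transition("replicating")
r.transition("load-balancing")
print(r.state)  # load-balancing
```

Making the transitions explicit is what lets an operator (or monitoring) reason about which replicas are safe to serve queries from at any moment.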
Operating Systems, DBMS? • SQL Server 2005 EE x64 • Why? • Why not DB2, Oracle RAC, PostgreSQL, MySQL, <insert your favorite>? • (Windows Server 2003 EE x64) • <Why EE?> • Platform rant from JVV available over beers • <JVV/beer graphic?>
Systems/Database Management • Active Directory infrastructure • Windows patching tools, methodology • Linux patching tools, methodology • Monitoring • Staffing requirements
Facilities/Infrastructure Projections for PS1 • Cooling • Rack space • Network ports • (plus AD/WSUS/monitoring infrastructure above)