SunFire 15K Enterprise-Grade Server: Implementation Review, March 14, 2003
Overview
• Introduction of SunFire 15K architecture and concepts
• Hardware RAS features
• Dynamic Domains
• System requirements
  • Number of domains
  • Resources for each domain
  • Expansion
• Process RAS features
• Risk factors and risk mitigation
• Current Status
• Schedule
Sun Fire 15K: "Highly redundant, symmetric multi-processing server with a shared memory architecture"
Features:
• 1 to 18 CPU/Memory boards
• 4 CPUs per board, at 1050 MHz
• 1 to 8 GB per CPU, up to 576 GB of memory
• 1 to 18 I/O boards, 4 PCI slots per board
• I/O board slots can be traded for extra CPUs (MaxCPU boards)
• Can be partitioned into 1 to 18 dynamic domains
SunFire 15K: High RAS Features
• Reliability:
  • Fully redundant CPU/Memory and I/O boards, PCI cards
  • Dual System Controllers
  • Dual System Clock
  • Dual grid power, redundant power supplies
  • Redundant fans
  • Environmental monitoring
• Serviceability:
  • Hot-swap CPU/Memory boards
  • Hot-swap I/O boards, PCI components
  • Hot-swap System Controller
  • Hot-swap power supplies
  • Hot-swap fans
  • Full remote diagnostics
Dynamic Domains
• CPU/Memory boards and I/O boards can be re-assigned on the fly to other domains when needed, e.g.:
  $ moveboard SB2 -d BESSIEB
• By hand (see the command sketch below):
  • Reallocate resources from the operational and development domains to the I&T domain for full-load performance testing
  • Take boards off-line for maintenance (hot swap)
• Automatically, by programs or monitoring software:
  • At times of peak load, reallocate resources from the Development and I&T domains to the operational domain
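As a concrete illustration of the by-hand case, a minimal sketch of the sequence on the System Controller follows. The board ID SB6 and domain ID A are placeholders rather than our actual configuration, and the showboards listing command is assumed to report board-to-domain assignments as described in the SMS documentation.

  # On the System Controller, as an SMS administrator:
  $ showboards                 # list all boards and their current domain assignments
  $ moveboard SB6 -d A         # move spare CPU/Memory board SB6 into operational domain A
  $ showboards                 # confirm the new assignment

The automated case would simply wrap the same moveboard call in whatever monitoring or scheduling tool triggers the reallocation.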
Data Processing Software Systems
• Pre-archive processing & Ingest:
  • Science data receipt and processing: science pipelines (OPUS)
  • Engineering data receipt and processing (EDPS)
  • Archive Ingest
• Distribution:
  • Archive distribution (DADS)
  • On-the-fly reprocessing (OTFR)
• Calibration:
  • Calibration pipeline and database
• Database servers:
  • Pipeline processing, Ingest/Distribution DB "CATLOG"
  • Archive Catalog Browsing DB "ZEPPO"
• Will not support user interfaces: StarView, Web, APT
Number of Domains
• High-level requirements:
  • Separate Development, Integration & Test (I&T), and Operational environments
  • Protect Ingest from Distribution
  • Respond to the user community
• Other requirements:
  • Separate pipeline computing from database servers
  • Separate the DB for external users (ZEPPO) from the internal operational DB (CATLOG)
  • Isolate OS and COTS testing and patching
• Maximum number of domains: 3 × (2 + 2) + 1 = 13 (three environments × two pipeline domains plus two database domains, plus one OS/COTS test domain)
• BUT: more domains = more fragmentation = less flexibility
• Must balance flexibility with the need for isolation
Number of Domains (cont.)
• For databases, combine the Development and Integration & Test domains for CATLOG and ZEPPO (saves 3 domains)
• Combine Pre-archive processing & Ingest with Distribution (saves 3 domains):
  • They use similar processing pipelines
  • Protect Ingest from Distribution by dynamically adding resources when needed
  • Protect Ingest from Distribution by binding Ingest processes to dedicated CPU and memory resources (see the binding sketch below)
  • Protect Ingest from Distribution using new features in DADS 10.*
  • Closely monitor performance and fall back to 2 separate domains as a contingency plan
• Number of domains = 13 - 3 - 3 = 7
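A minimal sketch of the CPU-binding idea, assuming standard Solaris processor sets (psrset); the processor IDs, the resulting set ID, and the Ingest command path are hypothetical placeholders, and memory isolation is a separate mechanism not shown here.

  $ psrset -c 0 1 2 3                          # create a processor set from four CPUs; psrset prints the new set ID (assume 1)
  $ psrset -e 1 /opus/bin/ingest_pipeline &    # start an Ingest process confined to that set (hypothetical path)
  $ psrset -b 1 12345                          # or bind an already-running Ingest process by PID

Processes outside the set, such as Distribution jobs, cannot run on those CPUs, so Ingest keeps guaranteed headroom even within a shared domain.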
The 7 Domains
Domain Resources
• CPU/Memory boards:
  • 4 CPUs per board
  • CPUs run at 1.05 GHz
  • 1-8 GB of memory per CPU = 4-32 GB per board
• I/O boards:
  • Provide external connections to SAN, network, disks
  • 4 PCI slots per board
    • 2 slots @ 33 MHz
    • 2 slots @ 66 MHz
1: Development domain
• Supports development teams for:
  • OPUS (EDPS, OTFR, etc.)
  • DADS 10.*
  • IRAF/STSDAS
  • Calibration pipelines
  • Calibration reference data
• Today (combined with Testing), excluding desktops:
  • Tru64, Solaris: ~9-13 CPUs, <1 GB/CPU, 500 MHz
• Domain requirements:
  • 2 CPU boards, 8 CPUs, 4 GB/CPU
  • 1 I/O board (not mission critical)
2: DADS/OPUS/OTFR domain
• Compare the Pre-archive pipeline, Ingest, and Distribution performance requirements with current performance
• Use the outcome to scale current resources to domain requirements, accounting for faster CPUs and the new architecture
• Account for the new software architecture of DADS 10.*
• Account for the lack of detailed modeling by adding a safety margin
• Account for projected growth:
  • Short term: Distribution (ACS)
  • Intermediate: new algorithms
  • Longer term: Pre-archive pipelines, Ingest (SM4)
  • Overall increase in use of 20% per year
2: DADS/OPUS/OTFR domain (cont.)
• Today:
  • Baseline Pre-archive processing and Ingest performance is within requirements
    • Remember: failures are addressed by the architecture
  • Baseline Distribution & OTFR performance is barely within requirements
    • Current systems are maxed out
2: DADS/OPUS/OTFR domain (cont.)
• Today:
  • Tru64 cluster: 12 CPUs @ 500 MHz, 1 GB/CPU
  • 1 Sun 280R: 2 CPUs @ 750 MHz (EDPS)
  • 3 OpenVMS systems: 1 CPU @ 250 MHz, 0.5-1.5 GB
• Domain CPU/Memory requirement:
  • 9 CPUs @ 1 GHz, 4 GB/CPU
• New software architecture requirements (DADS 10.*):
  • 6 CPUs @ 1 GHz, 4 GB/CPU
• Short-term growth, ACS + 20%:
  • 3 CPUs @ 1 GHz, 4 GB/CPU
• Margin:
  • 2 CPUs @ 1 GHz, 4 GB/CPU
2: DADS/OPUS/OTFR domain (cont.)
• Total domain CPU/Memory requirements:
  • 5 CPU boards, 20 CPUs @ 1 GHz, 4 GB/CPU
• Total domain I/O requirements:
  • Operational, so redundant: 2 I/O boards
  • Can be multiplexed if necessary for performance
• Remember: dynamic domains
  • We can re-assign resources on the fly, especially from the I&T domain, to handle peak loads and longer-term fluctuations
3: Integration & Test domain
• Realistic end-to-end load and performance testing
• Identical to the operational DADS/OPUS/OTFR domain
• Today: non-existent
• Domain requirements:
  • 5 CPU boards, 20 CPUs @ 1 GHz, 4 GB/CPU
• Remember: dynamic domains
  • Full-load performance tests happen regularly, but not daily
  • Full-load performance tests are highly controlled, discrete, and scheduled events
  • I&T resources can be re-assigned, e.g. to the DADS/OPUS/OTFR domain, when not needed (see the scheduling sketch below)
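Because the full-load tests are scheduled events, the board lending could itself be scheduled. The fragment below is a hypothetical crontab sketch for the System Controller; the wrapper script names, board IDs, and time windows are illustrative placeholders, not a designed schedule, and each wrapper would simply issue moveboard commands like the ones shown earlier.

  # Lend I&T boards to the operational domain on weekday evenings and
  # reclaim them before a Saturday-morning full-load test window.
  # minute hour day month weekday  command
  0 20 * * 1-5  /opt/local/bin/lend_boards_to_ops       # hypothetical wrapper around "moveboard SBn -d A"
  0 6  * * 6    /opt/local/bin/reclaim_boards_for_test  # hypothetical wrapper around "moveboard SBn -d C"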
4, 5, 6: Database domains (more details in the afternoon "Databases" presentation)
• Operational DB, CATLOG
  • Today: 4 CPUs @ 300 MHz, 0.5 GB total
  • Anticipate increased load because of faster pipelines and new instruments
  • Domain requirements:
    • 2 CPU boards, 8 CPUs @ 1 GHz, 2 GB/CPU
    • 2 I/O boards (redundancy)
• Archive Catalog Browsing DB, ZEPPO
  • Today: 2 CPUs @ 300 MHz, 0.6 GB total
  • Domain requirements:
    • 1 CPU board, 4 CPUs @ 1 GHz, 2 GB/CPU
    • 2 I/O boards (redundancy)
• Development and test
  • Today: 2 × 2 CPUs @ 200 MHz, 1 GB total
  • Domain requirements:
    • 1 CPU board, 4 CPUs @ 1 GHz, 2 GB/CPU
    • 1 I/O board
7: OS & COTS testing, patches
• Test the next version of the OS
• Test patches, COTS upgrades, and system procedures
• Today: not available, or scattered across systems
• Domain requirements:
  • 1 CPU board, 4 CPUs @ 1 GHz, 4 GB/CPU
  • 1 I/O board (not mission critical)
• Remember: dynamic domains
  • This domain can be shut down when not needed
  • Its resources can be reassigned, e.g. to the DADS/OPUS/OTFR domain (see the sketch below)
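A minimal sketch of that shutdown-and-reassign sequence on the System Controller; the domain ID G and board ID SB17 are placeholders, and the setkeyswitch/moveboard usage is assumed from the SMS documentation rather than taken from our procedures.

  $ setkeyswitch -d G off      # power the OS/COTS test domain down ("G" is hypothetical)
  $ moveboard SB17 -d A        # hand its CPU/Memory board to the operational domain
  # later, to restore the test domain:
  $ moveboard SB17 -d G        # reverse the move
  $ setkeyswitch -d G on       # bring the test domain back up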
SunFire 15K Nominal Domain Layout
SunFire 15K Peak-Load Domain Layout
Future growth
• Today (contingencies):
  • Add 1 CPU/Memory board (4 CPUs)
  • Add 8 I/O boards or 16 "MaxCPU" CPUs
  • Add 300 GB of RAM
  • Upgrade to 1.2 GHz CPUs
• One to two years:
  • Double the number of CPUs: 8 CPUs per board
  • Increased CPU clock speed
• All within the existing chassis
Process RAS Features
• STScI administration and software configuration RAS features
• Sun Management Center: ease of management, monitoring, and capturing of system performance metrics
• Use dynamic server domains to keep the science flowing
• Ability to prioritize processing in the event of a problem
Risk factors and mitigation
• Schedule-slippage risk mitigation:
  • Contract imposes a penalty for late delivery
  • Decouple database migration from milestones
  • Can keep old equipment past the end of the project
  • Loaner system to get a head start
• Technical risk mitigation:
  • Use the loaner to detect issues and find solutions early
  • Extensive staff training included in the contract to mitigate new-technology risks
• Operational risks and mitigations are discussed in later presentations
Current Status
• Order placed February 3; expected time of arrival March 4
• Loaner up and running with two domains
• Training started
• Completed site survey and preparation (power, floor, environment)
• Started interviewing operations staff, engineers, support staff, and scientists to refine the use model (later presentation)
High-level Schedule
• Initial domain design: March 12
• System setup & integration: March 24
  • Physical setup, power
  • Network
  • Sun's Application Readiness Process
  • System benchmarks
• Domain configuration: May 1
  • OS install, patch, institutionalize
  • Test backup/recovery, SMC, basic reporting
  • 3rd-party software
  • Documentation, review
  • Clone other domains
High-level Schedule (cont.)
• Full system tests
  • Run benchmarks to establish the system baseline
• Develop procedures
  • System and account management
  • Backup/restore, disaster recovery
  • Train support staff
• Hand over the 3 Development, I&T, and Operational domains to ESS: May 22
• Database domain configuration
  • Customize OS, 3rd-party applications
• Hand over the 3 DB domains to ESS: June 6
Schedule