Explore the GriPhyN Project's use of Grid technology to manage massive data collections, coordinating computation and network resources for data-intensive scientific discovery.
The GriPhyN Project (Grid Physics Network)
Paul Avery, University of Florida
http://www.phys.ufl.edu/~avery/ / avery@phys.ufl.edu
Internet2 Workshop, Atlanta, Nov. 1, 2000
Motivation: Data-Intensive Science
• Scientific discovery increasingly driven by IT
  • Computationally intensive analyses
  • Massive data collections
  • Rapid access to large subsets
  • Data distributed across networks of varying capability
• Dominant factor: data growth
  • ~0.5 Petabyte in 2000 (BaBar)
  • ~10 Petabytes by 2005
  • ~100 Petabytes by 2010
  • ~1000 Petabytes by 2015?
• Robust IT infrastructure essential for science
  • Provide rapid turnaround
  • Coordinate and manage limited computing, data-handling, and network resources effectively
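As a rough illustration of what the Petabyte milestones above imply (the arithmetic is mine, not from the slide), growing from ~0.5 PB in 2000 to ~1000 PB in 2015 means data volume doubles roughly every 16-17 months:

```python
import math

# Petabyte milestones quoted on the slide
start_pb, end_pb = 0.5, 1000.0   # year 2000 -> 2015 (projected)
years = 15

growth_factor = end_pb / start_pb            # 2000x overall
annual_rate = growth_factor ** (1 / years)   # ~1.66x per year
doubling_years = math.log(2) / math.log(annual_rate)

print(f"growth factor: {growth_factor:.0f}x")      # 2000x
print(f"annual growth: {annual_rate:.2f}x")        # ~1.66x
print(f"doubling time: {doubling_years:.2f} yrs")  # ~1.37 years
```

A doubling time faster than that of CPU or network capacity in the same era is exactly why the slide calls data growth the "dominant factor".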
Grids as IT Infrastructure
• Grid: geographically distributed IT resources configured to allow coordinated use
• Physical resources & networks provide raw capability
• "Middleware" services tie it together
Data Grid Hierarchy (LHC Example)
• Tier 0: CERN
• Tier 1: national lab
• Tier 2: regional center at a university
• Tier 3: university workgroup
• Tier 4: workstation
• GriPhyN: R&D, Tier2 centers, unify all IT resources
Why a Data Grid: Physical
• Unified system: all computing resources part of grid
  • Efficient resource use (manage scarcity)
  • Resource discovery / scheduling / coordination truly possible
  • "The whole is greater than the sum of its parts"
• Optimal data distribution and proximity
  • Labs are close to the (raw) data they need
  • Users are close to the (subset) data they need
  • Minimize bottlenecks
• Efficient network use
  • local > regional > national > oceanic
  • No choke points
• Scalable growth
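The "local > regional > national > oceanic" preference can be caricatured as a replica-selection rule: given several copies of a dataset, pick the one reachable at the cheapest network scope. This is a toy sketch, not any actual GriPhyN or Globus service; all site names and costs are invented.

```python
# Illustrative relative costs for each network scope (invented numbers).
SCOPE_COST = {"local": 1, "regional": 10, "national": 100, "oceanic": 1000}

def best_replica(replicas):
    """replicas: list of (site, scope) pairs; return the cheapest pair."""
    return min(replicas, key=lambda r: SCOPE_COST[r[1]])

# A hypothetical dataset replicated at three sites:
replicas = [("cern", "oceanic"), ("fnal", "national"), ("ufl-tier2", "regional")]
print(best_replica(replicas))  # the closest available copy wins
```

In a real grid the cost function would fold in live bandwidth measurements and policy, but the ordering principle is the same.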
Why a Data Grid: Demographic
• Central lab cannot manage / help 1000s of users
  • Easier to leverage resources, maintain control, assert priorities at the regional / local level
• Cleanly separates functionality
  • Different resource types in different tiers
  • Organization vs. flexibility
  • Funding complementarity (NSF vs. DOE), targeted initiatives
• New IT resources can be added "naturally"
  • Matching resources at Tier 2 universities
  • Larger institutes can join, bringing their own resources
  • Tap into new resources opened by the IT "revolution"
• Broaden community of scientists and students
  • Training and education
  • Vitality of the field depends on university / lab partnership
GriPhyN = Applications + CS + Grids
• Several scientific disciplines
  • US-CMS (high-energy physics)
  • US-ATLAS (high-energy physics)
  • LIGO/LSC (gravity-wave research)
  • SDSS (Sloan Digital Sky Survey)
• Strong partnership with computer scientists
• Design and implement production-scale grids
  • Maximize effectiveness of large, disparate resources
  • Develop common infrastructure, tools, and services
  • Build on foundations: Globus, PPDG, MONARC, Condor, …
  • Integrate and extend existing facilities
• $70M total cost (NSF?)
  • $12M: R&D
  • $39M: Tier 2 center hardware, personnel, operations
  • $19M?: networking
Particle Physics Data Grid (PPDG)
• ANL, BNL, Caltech, FNAL, JLAB, LBNL, SDSC, SLAC, U.Wisc/CS
• Site-to-site data replication service (100 Mbytes/sec): primary site (data acquisition, CPU, disk, tape robot) to secondary site (CPU, disk, tape robot)
• Multi-site cached file access service: primary site (DAQ, tape, CPU, disk, robot), satellite sites (tape, CPU, disk, robot), universities (CPU, disk, users)
• First-round goal: optimized cached read access to 10-100 Gbytes drawn from a total data set of 0.1 to ~1 Petabyte
• Matchmaking, co-scheduling: SRB, Condor, Globus services; HRM, NWS
Grid Data Management Prototype (GDMP)
• GDMP V1.0: Caltech + EU DataGrid WP2; for CMS "HLT" production in October
• Distributed job execution and data handling goals: transparency, performance, security, fault tolerance, automation
• Jobs are executed locally or remotely (e.g., Site A submits a job that runs at Site B)
• Data is always written locally
• Data is replicated to remote sites (e.g., from Site B back to Sites A and C)
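The execution model described above (run anywhere, always write output locally, then replicate) can be sketched in a few lines. This is an illustration of the flow, not GDMP's actual API; the site names, catalogs, and functions are invented.

```python
# Per-site file catalogs for three hypothetical grid sites.
stores = {"A": set(), "B": set(), "C": set()}

def run_job(site, output_file):
    """Execute a job at `site`; data is always written locally first."""
    stores[site].add(output_file)
    return output_file

def replicate(source_site, filename):
    """Push copies of a locally written file to all other sites."""
    for site, store in stores.items():
        if site != source_site:
            store.add(filename)

# Site A submits a job that executes at Site B:
produced = run_job("B", "hlt_production.dat")
replicate("B", produced)

# Every site now holds a copy of the output.
assert all("hlt_production.dat" in s for s in stores.values())
```

Writing locally first keeps the job fast and fault-tolerant; replication then happens asynchronously, which is the separation of concerns the slide's goals (performance, fault tolerance, automation) point at.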
GriPhyN R&D Funded!
• NSF/ITR results announced Sep. 13
  • $11.9M from the Information Technology Research program
  • $1.4M in matching funds from universities
  • Largest of all ITR awards
  • Excellent reviews emphasizing importance of the work
  • Joint NSF oversight from CISE and MPS
• Scope of ITR funding
  • Major costs for people, esp. students and postdocs
  • No hardware or professional staff for operations!
  • 2/3 CS + 1/3 application science
• Industry partnerships being developed
  • Microsoft, Intel, IBM, Sun, HP, SGI, Compaq, Cisco
• Still require funding for implementation and operation of Tier 2 centers
GriPhyN Institutions
• U Florida, U Chicago, Caltech, U Wisconsin (Madison), USC/ISI, Harvard, Indiana, Johns Hopkins, Northwestern, Stanford, Boston U, U Illinois at Chicago, U Penn, U Texas (Brownsville), U Wisconsin (Milwaukee), UC Berkeley, UC San Diego
• San Diego Supercomputer Center, Lawrence Berkeley Lab, Argonne, Fermilab, Brookhaven
Fundamental IT Challenge
"Scientific communities of thousands, distributed globally, and served by networks with bandwidths varying by orders of magnitude, need to extract small signals from enormous backgrounds via computationally demanding (Teraflops-Petaflops) analysis of datasets that will grow by at least 3 orders of magnitude over the next decade: from the 100 Terabyte to the 100 Petabyte scale."
GriPhyN Research Agenda
• Virtual data technologies
  • Derived data, calculable via algorithm (e.g., 90% of HEP data)
  • Instantiated 0, 1, or many times
  • Fetch vs. execute algorithm
  • Very complex (versions, consistency, cost calculation, etc.)
• Planning and scheduling
  • User requirements (time vs. cost)
  • Global and local policies + resource availability
  • Complexity of scheduling in a dynamic environment (hierarchy)
  • Optimization and ordering of multiple scenarios
  • Requires simulation tools, e.g. MONARC
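The "fetch vs. execute" choice at the heart of virtual data can be caricatured as a cost comparison: re-run the transform that derives the product, or fetch an existing instantiation, whichever is cheaper. The cost model below is a made-up stand-in for the much harder real calculation (versions, consistency, policies) the slide alludes to; the catalog format and names are invented.

```python
def plan(product, catalog, transfer_cost_per_gb=2.0):
    """Return ('fetch', site) or ('execute', None) for a derived product.

    product: {'name': ..., 'cpu_hours': cost to re-derive it}
    catalog: maps product name -> list of existing replicas (may be empty:
             a virtual data product is instantiated 0, 1, or many times).
    """
    compute_cost = product["cpu_hours"]
    fetch_options = [
        (replica["size_gb"] * transfer_cost_per_gb, replica["site"])
        for replica in catalog.get(product["name"], [])
    ]
    if fetch_options and min(fetch_options)[0] < compute_cost:
        return ("fetch", min(fetch_options)[1])
    return ("execute", None)  # no copy exists, or recomputing is cheaper

catalog = {"muon_summary_v3": [{"site": "tier2-ufl", "size_gb": 5}]}
print(plan({"name": "muon_summary_v3", "cpu_hours": 40}, catalog))  # fetch
print(plan({"name": "rare_skim_v1", "cpu_hours": 3}, catalog))      # execute
```

The first request fetches the 5 GB replica (cost 10 < 40 CPU hours); the second has no instantiation, so the algorithm must be executed.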
Virtual Data in Action
• A data request may
  • Compute locally
  • Compute remotely
  • Access local data
  • Access remote data
• Scheduling based on
  • Local policies
  • Global policies
  • Local autonomy
Research Agenda (cont.)
• Execution management
  • Co-allocation of resources (CPU, storage, network transfers)
  • Fault tolerance, error reporting
  • Agents (co-allocation, execution)
  • Reliable event service across the Grid
  • Interaction, feedback to planning
• Performance analysis (new)
  • Instrumentation and measurement of all grid components
  • Understand and optimize grid performance
• Virtual Data Toolkit (VDT)
  • VDT = virtual data services + virtual data tools
  • One of the primary deliverables of the R&D effort
  • Ongoing activity + feedback from experiments (5-year plan)
  • Technology transfer mechanism to other scientific domains
GriPhyN: PetaScale Virtual Data Grids
• Users: production teams, individual investigators, workgroups (interactive user tools)
• Tool layer: virtual data tools, request planning & scheduling tools, request execution & management tools
• Service layer: resource management services, security and policy services, other grid services
• Resource layer: distributed resources (code, storage, computers, and network), transforms, raw data sources
LHC Vision (e.g., CMS Hierarchy): GriPhyN Focus on University-Based Tier2 Centers
• Experiment online system: ~PBytes/sec; bunch crossings every 25 nsec, 100 triggers per second, each event ~1 MByte
• ~100 MBytes/sec to the Tier 0+1 offline farm at the CERN computer center (> 20 TIPS), with HPSS storage
• Tier 1 (~0.6-2.5 Gbits/sec + air freight): FNAL, Italy, UK, and France centers, each with HPSS
• Tier 2 (~2.4 Gbits/sec): Tier2 centers with HPSS
• Tier 3 (~622 Mbits/sec): institutes (~0.25 TIPS each) with physics data caches; physicists work on analysis "channels", with ~10 physicists per institute working on one or more channels
• Tier 4 (100-1000 Mbits/sec): workstations and other portals
SDSS Vision
Three main functions:
• Main data processing (FNAL)
  • Processing of raw data on a grid
  • Rapid turnaround with multiple-TB data
  • Accessible storage of all imaging data
• Fast science analysis environment (JHU)
  • Combined data access and analysis of calibrated data
  • Shared by the whole collaboration
  • Distributed I/O layer and processing layer
  • Connected via redundant network paths
• Public data access
  • Provide the SDSS data for the NVO (National Virtual Observatory)
  • Complex query engine for the public
  • SDSS data browsing for astronomers, and outreach
LIGO Vision
• Network: Hanford and Livingston observatories (OC3) connected via Internet2/Abilene (OC12, OC48) to the Caltech Tier1, MIT, and LSC Tier2 sites
• LIGO I science run: 2002-2004+; LIGO II upgrade: 2005-20xx (MRE to NSF 10/2000)
Principal areas of GriPhyN applicability:
• Main data processing (Caltech/CACR)
  • Enable computationally limited searches (periodic sources)
  • Access to LIGO deep archive
  • Access to observatories
• Science analysis environment for the LSC (LIGO Scientific Collaboration)
  • Tier2 centers: shared LSC resource
  • Exploratory algorithm and astrophysics research with LIGO reduced data sets
  • Distributed I/O layer and processing layer builds on existing APIs
  • Data mining of LIGO (event) metadatabases
  • LIGO data browsing for LSC members, outreach
GriPhyN Cost Breakdown (May 31)
Number of Tier2 Sites vs. Time (May 31)
LHC Tier2 Architecture and Cost
• Linux farm of 128 nodes (256 CPUs + disk): $350K
• Data server with RAID array: $150K
• Tape library: $50K
• Tape media and consumables: $40K
• LAN switch: $60K
• Collaborative tools & infrastructure: $50K
• Installation & infrastructure: $50K
• Net connect to WAN (Abilene): $300K
• Staff (ops and system support): $200K
• Total estimated cost (first year): $1,250K
• Average yearly cost, including evolution, upgrade, and operations: $750K
• 1.5-2 FTE support required per Tier2; assumes 3-year hardware replacement
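The itemized first-year costs above do sum to the quoted total, which is easy to check:

```python
# First-year Tier2 cost items from the slide, in $K.
items = {
    "Linux farm (128 nodes / 256 CPUs + disk)": 350,
    "Data server with RAID array": 150,
    "Tape library": 50,
    "Tape media and consumables": 40,
    "LAN switch": 60,
    "Collaborative tools & infrastructure": 50,
    "Installation & infrastructure": 50,
    "Net connect to WAN (Abilene)": 300,
    "Staff (ops and system support)": 200,
}
total = sum(items.values())
print(f"total: ${total}K")  # matches the slide's $1,250K first-year total
```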
Tier 2 Evolution (LHC Example)
                                  2001                     2006
• Linux farm                      5,000 SI95               50,000 SI95
• Disks on CPUs                   4 TB                     40 TB
• RAID array                      2 TB                     20 TB
• Tape library                    4 TB                     50-100 TB
• LAN speed                       0.1-1 Gbps               10-100 Gbps
• WAN speed                       155-622 Mbps             2.5-10 Gbps
• Collaborative infrastructure    MPEG2 VGA (1.5-4 Mbps)   Realtime HDTV (10-20 Mbps)
• RAID disk used for "higher availability" data
• Reflects lower Tier2 component costs due to less demanding usage, e.g. simulation
Current Grid Developments
• EU "DataGrid" initiative
  • Approved by the EU in August (3 years, $9M)
  • Exploits GriPhyN and related (Globus, PPDG) R&D
  • Collaboration with GriPhyN (tools, boards, interoperability, some common infrastructure)
  • http://grid.web.cern.ch/grid/
• Rapidly increasing interest in Grids
  • Nuclear physics
  • Advanced Photon Source (APS)
  • Earthquake simulations (http://www.neesgrid.org/)
  • Biology (genomics, proteomics, brain scans, medicine)
  • "Virtual" observatories (NVO, GVO, …)
  • Simulations of epidemics (Global Virtual Population Lab)
• GriPhyN continues to seek funds to implement its vision
The management problem
(Diagram: PPDG amid many overlapping efforts and communities)
• Experiment data management efforts: BaBar, D0, CDF, Atlas, CMS, Nuclear Physics, HENP GC
• Middleware teams: SRB, Condor, Globus (and PPDG itself)
• User communities: BaBar, HENP GC, D0, Condor, SRB, Globus users
Philosophical Issues
• Recognition that the IT industry is not standing still
  • Can we use tools that industry will develop?
  • Does industry care about what we are doing?
  • Many funding possibilities
• Time scale
  • 5 years is looooonng in internet time
  • Exponential growth is not uniform; relative costs will change
  • Cheap networking? (10 GigE standards)
Impact of New Technologies
• Network technologies
  • 10 Gigabit Ethernet: 10GigE, DWDM wavelengths (OC-192)
  • Optical switching networks
  • Wireless broadband (from ca. 2003)
• Internet information software technologies
  • Global information "broadcast" architecture
    • E.g., Multipoint Information Distribution Protocol (Tie.Liao@inria.fr)
  • Programmable coordinated agent architectures
    • E.g., Mobile Agent Reactive Spaces (MARS), Cabri et al.
• Interactive monitoring and control of Grid resources
  • By authorized groups and individuals
  • By autonomous agents
• Use of shared resources
  • E.g., CPU, storage, bandwidth on demand
Grids in 2000
• Grids will change the way we do science and engineering
  • Computation, large-scale data
  • Distributed communities
• The present
  • Key services and concepts have been identified
  • Development has started
  • Transition of services and applications to production use
• The future
  • Sophisticated integrated services and toolsets (Inter- and IntraGrids+) could drive advances in many fields of science & engineering
  • Major IT challenges remain
  • An opportunity & obligation for collaboration