An Integrated Instrumentation Architecture for NGI Applications. Ian Foster, Darcy Quesnel, Steven Tuecke Argonne National Laboratory The University of Chicago. DOE NGI Instrumentation Project.
An Integrated Instrumentation Architecture for NGI Applications Ian Foster, Darcy Quesnel, Steven Tuecke Argonne National Laboratory The University of Chicago
DOE NGI Instrumentation Project “A Uniform Instrumentation, Event, and Adaptation Framework for Network-Aware Middleware and Advanced Network Applications” • With UIUC (Dan Reed, Ruth Aydt) • “Produce uniform notification and adaptation mechanisms, with the goal of catalyzing the development of both network-aware middleware and sophisticated network-aware applications”
Motivation • Environment incorporates multiple sensors • Sources of events relating to behavior of resources, middleware, and applications • Significant advantages to having uniform mechanisms for publishing/discovering sensors and for accessing sensor data • E.g., find all sensors for path A->B • Including historical data • Enables end-to-end, top-to-bottom, past-to-present analysis
Examples of Sensors • Network devices • E.g., routers • End system devices • E.g., computers, storage systems • Grid services • E.g., Globus HBM, Network Weather Service • Libraries • E.g., CAVERNsoft, MPI • Applications
For Example ... [Figure: sensors (S) attached at every layer of the stack. App layer: applications. Libs layer: MPICH, globus-io, CAVERNsoft, DPSS. Sys layer: GRAM, HBM, NWS, ..., with netstat and SNMP sensors. H/W layer: nodes labeled H and R.]
Three Project Components 1. Mechanisms for creating, publishing, discovering, and accessing sensors 2. Synthesis and analysis techniques for identifying qualitative behavior and trends in sensor data 3. Adaptation techniques that exploit sensor data to adjust middleware and application configurations to improve performance Argonne focus: (1) and (3); UIUC: (2), (3)
Current Approach • Use a directory service (LDAP) to register and publish event sources • Publish: source, type, contact [online, archive] • Discover: “find all event sources of type X” • Use NetLogger format for data • Develop sensor manager to handle publish, subscribe, archiving • Use SQL database as archive • Initial sensor set based on Globus libraries, applications, NetLogger-accessible devices
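The publish and discover steps above can be sketched as follows. This is a minimal illustration, not the project's code: the `Directory` class is a hypothetical in-memory stand-in for the LDAP directory, and the `EventSource` fields simply mirror the "source, type, contact" triple named on the slide. The contact strings are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventSource:
    source: str       # what is observed, e.g. a host or a path
    sensor_type: str  # kind of sensor, e.g. "netstat"
    contact: str      # where the stream can be reached (online or archive)


class Directory:
    """Hypothetical in-memory stand-in for the LDAP directory."""

    def __init__(self):
        self._entries = []

    def publish(self, entry):
        # Register an event source so that clients can find it later.
        self._entries.append(entry)

    def discover(self, sensor_type):
        # "Find all event sources of type X".
        return [e for e in self._entries if e.sensor_type == sensor_type]


directory = Directory()
directory.publish(EventSource("host A", "netstat", "manager-x:online"))
directory.publish(EventSource("host B", "netstat", "manager-x:archive"))
directory.publish(EventSource("router R", "snmp", "manager-y:online"))

netstat_sources = directory.discover("netstat")
```

A real deployment would issue the same query as an LDAP search filter; the sketch only shows the shape of the registered record and of the lookup.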
Initial Instrumentation Architecture [Figure: sensors and an application connected to a sensor manager. Labels: Discover ("what event sources for route A to B?"); Subscribe; Events in NetLogger format; Publish ("netstat, host A, time T, contact X"); LDAP; Archive (Netarchive, SQL, MySQL, File).]
Sensor Manager • We are building a program that: • Archives sensor event streams • Redirects sensor event streams to clients using a publish/subscribe interface • Generates sensor event streams from the archive, based on a query language • Publishes interfaces and index to LDAP • Relation to other work • Superset of Netlogd (simple archiver) • Might exploit Netarchiver (MySQL indexing)
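The three stream-handling roles listed above (archive, redirect to subscribers, regenerate from the archive) can be sketched in a few lines. This is an illustration under stated assumptions: the class, its method names, and the stream name `cpu@hostA` are hypothetical, a Python predicate stands in for the query language, and the LDAP publishing step is left out.

```python
from collections import defaultdict


class SensorManager:
    """Hypothetical sketch of the archive and publish/subscribe roles."""

    def __init__(self):
        self.archive = []                     # every (stream, event) received
        self.subscribers = defaultdict(list)  # stream name -> callbacks

    def subscribe(self, stream, callback):
        self.subscribers[stream].append(callback)

    def receive(self, stream, event):
        # Archive first, then redirect the event to current subscribers.
        self.archive.append((stream, event))
        for callback in self.subscribers[stream]:
            callback(event)

    def replay(self, stream, predicate=lambda event: True):
        # Regenerate a stream from the archive; the predicate plays the
        # part of the query.
        return [e for s, e in self.archive if s == stream and predicate(e)]


manager = SensorManager()
live = []
manager.receive("cpu@hostA", {"t": 1, "load": 0.40})  # archived, no subscriber yet
manager.subscribe("cpu@hostA", live.append)
manager.receive("cpu@hostA", {"t": 2, "load": 0.80})  # archived and delivered
history = manager.replay("cpu@hostA", lambda e: e["load"] > 0.5)
```

A client that subscribes late still reaches earlier events through `replay`, which is the reason for keeping the archive and the live path in one program.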
Archiving Events • How to archive sensor event streams? • SQL: Save each event as a record in an SQL database • Advantage: Rich query support • Netarchive: Save each event into a file. Use an SQL database to build an index of file contents • Advantage: Performance and scale? • We will explore the use of SQL databases • Premise: Most sensors will not produce high-volume event streams; hence optimize for simplicity and rich query support
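The "one event, one record" option can be illustrated with a small sketch. Assumptions: Python's built-in `sqlite3` module stands in for the MySQL archive, and the table layout, column names, and sample values are invented for the example.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the SQL archive.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts REAL, host TEXT, sensor TEXT, name TEXT, value REAL)"
)

# Each sensor event is stored as one row (hypothetical sample data).
rows = [
    (100.0, "hostA", "netstat", "retransmits", 3.0),
    (101.0, "hostA", "cpu", "load", 0.80),
    (102.0, "hostB", "cpu", "load", 0.41),
    (103.0, "hostA", "cpu", "load", 0.95),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)

# Rich query support: high CPU load readings on one host, in time order.
busy = conn.execute(
    "SELECT ts, value FROM events "
    "WHERE host = ? AND sensor = 'cpu' AND value > ? ORDER BY ts",
    ("hostA", 0.5),
).fetchall()
```

The query combines host, sensor type, and a value threshold in one statement, which is the kind of selection a file-plus-index design would have to reimplement.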
Applying Info Infrastructure to Instrumentation [Figure with panels labeled: NCSA Origin Nodes; Bandwidth/Latency ANL-NASA Ames; ANL CPU Load; Bandwidth/Latency ANL-Indiana.]
Publishing & Discovering Sensors • Globus LDAP-based Metacomputing Directory Service (MDS) provides scalable, global infrastructure for publishing and discovering sensor managers • Sensors stream events to a sensor manager • Sensor manager publishes availability of streams into LDAP • Clients discover sensor managers from LDAP, and can subscribe to either current or archived sensor event streams directly from sensor managers
Initial Applications • Replica creation in “Data Grid” applications • Online and historical instrumentation for large data transfers (app, lib, network) • Involves DPSS, globus-io • Also application-level selection of replicas, based on sensor information • MPI-based video streaming (Karonis, Papka)
Security • Grid Security Infrastructure (GSI) will be used throughout, so it is possible to state policies such as: • “Manager M accepts only streams from sensors of user U” • “Manager N only publishes streams to clients of users A, B, C” • As a first step, we have augmented the NetLogger C client with GSI
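The two example policies can be written down as simple allow-lists. This sketch does not use any GSI API: it assumes authentication has already produced a user identity, represented here as a plain string, and the `ManagerPolicy` class is hypothetical.

```python
class ManagerPolicy:
    """Hypothetical per-manager policy keyed on authenticated user names."""

    def __init__(self, accepted_sensor_users, allowed_client_users):
        self.accepted_sensor_users = set(accepted_sensor_users)
        self.allowed_client_users = set(allowed_client_users)

    def accepts_stream_from(self, user):
        # "Manager M accepts only streams from sensors of user U"
        return user in self.accepted_sensor_users

    def publishes_to(self, user):
        # "Manager N only publishes streams to clients of users A, B, C"
        return user in self.allowed_client_users


policy = ManagerPolicy(accepted_sensor_users=["U"],
                       allowed_client_users=["A", "B", "C"])
```

In a real system the identity would come from the authenticated connection rather than from a caller-supplied string.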
Instrumentation Architecture Showing Actuators [Figure: Monitor, Sensors, and an Actuator connected to a Sensor Manager, LDAP, and an archive (Netarchive, SQL, File, MySQL). Arrow labels: Subscribe, Discover, Events, Publish.]
Future Directions • XML • NetLogger is an ASCII-based format • If you are using ASCII, why not use XML? • XML database could be used for archive • Events • Performance-related events should be just one part of a larger, integrated event system • Typing • NetLogger is weakly typed • Various advantages to strongly typed events
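The contrast between weakly typed ASCII events, strongly typed events, and XML can be made concrete with a short sketch. The input line below is an assumed whitespace-separated `KEY=value` record with invented field names; it is not taken from the NetLogger specification, and the `CpuLoadEvent` class and the XML element names are likewise hypothetical.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

# Assumed ASCII event line (field names are illustrative only).
line = "HOST=hostA PROG=netstat EVNT=cpu_load VALUE=0.802"

# Weakly typed view: every value is a string.
fields = dict(pair.split("=", 1) for pair in line.split())


@dataclass
class CpuLoadEvent:
    """Strongly typed view: the load is a float, not text."""
    host: str
    prog: str
    value: float


event = CpuLoadEvent(host=fields["HOST"], prog=fields["PROG"],
                     value=float(fields["VALUE"]))

# The same event rendered as XML.
elem = ET.Element("event", {"type": fields["EVNT"]})
ET.SubElement(elem, "host").text = event.host
ET.SubElement(elem, "value").text = repr(event.value)
xml_text = ET.tostring(elem, encoding="unicode")
```

With the typed form, a malformed `VALUE` fails at conversion time instead of surfacing later in an analysis tool.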
Future Directions (2): Publish/Subscribe for Sensors • In first version: • NetLogger-based sensors stream events to manager • Manager publishes sensor availability to LDAP • Clients subscribe to sensor manager for events • In later version: • Sensor can publish existence to LDAP • Client can subscribe directly to sensor for events
Network Weather Service (R. Wolski et al., U. Tenn) • Scalable, fault-tolerant system for • Real-time performance measurements • Predictions of future state • When installed on N hosts, delivers: • Network performance (<= N² via netperf) • Host CPU-load measurements (N) • We (USC/ISI crew) are working to integrate this into MDS; hopefully will eventually be consistent with approach described here (to be discussed)
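The entry counts above follow from simple counting, shown here for a small example. The sketch assumes one network measurement per directed (source, destination) pair with no host measuring itself; the slide only states the N² upper bound, and the host names are made up.

```python
from itertools import permutations

hosts = ["a", "b", "c", "d"]  # hypothetical host names
n = len(hosts)

# One CPU-load entry per host.
cpu_entries = n

# One network entry per directed (source, destination) pair.
network_pairs = list(permutations(hosts, 2))

# N*(N-1) directed pairs, which stays within the N^2 bound.
assert len(network_pairs) == n * (n - 1) <= n ** 2
```

For four hosts this gives 4 CPU entries and 12 network entries, so the directory load is dominated by the pairwise measurements as N grows.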
Structure of NWS data in MDS (old)
Directory tree nodes: c=US, o=Globus, o=ISI, nn=the Internet, ...
• N sets of CPU info for N hosts, e.g.
hn=source.isi.edu
  current_cpu: 0.802
  current_cpu_prediction: 0.802
  current_cpu_MSE: 0.000
  weighted_cpu: 0.414
  weighted_cpu_prediction: 0.414
  weighted_cpu_MSE: 0.000
• N² network performance entries for N hosts, e.g.
hs=source.isi.edu to destination.anl.gov
  source: hn=source.isi.edu, o=ISI, c=US
  destination: hn=destination.anl.edu, o=ANL, c=US
  serviceProvider: NWS
  throughput: 1.903
  throughput_prediction: 1.709
  throughput_MSE: 0.95
  latency: 5.3
  latency_throughput: 6.1
  latency_MSE: 0.04