ISTORE Software Runtime Architecture
Aaron Brown, David Oppenheimer, Kimberly Keeton, Randi Thomas, Jim Beck, John Kubiatowicz, and David Patterson
http://iram.cs.berkeley.edu/istore
1999 Winter IRAM Retreat
ISTORE Runtime Software Architecture
• Runtime system goals for the ISTORE meta-appliance
  (1) Provide mechanisms that allow network service applications to exploit introspection (monitor + adapt)
  (2) Allow appliance designer to tailor runtime system policies and interfaces
• How the goals are achieved
  (1) Introspection: layered local and global runtime system libraries that manipulate and react to monitoring data
  (2) Specialization: runtime system is extensible using domain-specific languages (DSLs)
Roadmap • Layered software structure • Example of introspection • Runtime system extensibility using DSLs • Conclusion
Layered software structure
[Figure: front-end node(s) and device nodes connected through a switch, with a LAN/WAN link. Layers labeled on the nodes: HW device (a NIC on the front-end node), device interface & RTOS, local runtime, distributed global runtime, parallel application worker code. Application front-end code is labeled on the front-end node(s).]
Device interface layer
[Figure: the node diagram (front-end node(s), switch, LAN/WAN, device nodes) showing only the HW devices and the device interface & RTOS layer.]
Device interface layer
• Microkernel OS modules
  • Traditional OS services
    • Networking, memory management, process scheduling, threads, …
  • Device-specific monitoring
    • Raw access patterns
    • Utilization statistics
    • Environmental parameters
    • Indications of impending failure
• Self-characterization of performance, functional capabilities
Local runtime layer
[Figure: the node diagram (front-end node(s), switch, LAN/WAN, device nodes) with the local runtime layer added on top of device interface & RTOS.]
Local runtime layer
• Non-distributed mechanisms needed by network service applications
• Feeds information to global layer or performs local operations on behalf of global layer
• Example mechanisms
  • Application-specific filtering/aggregation of device monitoring data
    • Example: OLTP server vs. DSS server
  • Data layout and naming
    • Example: record-based interface for DB, file-based for web server
  • Device scheduling
    • Example: maximize TPS vs. maximize disk bandwidth utilization
  • Caching
    • Coherence essential vs. coherence unnecessary
    • More efficient caching implementation possible in second case
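As a rough illustration of application-specific filtering and aggregation, the Python sketch below reduces the same raw disk samples to a different statistic for an OLTP server and for a DSS server. The sample format, the function names, and the choice of statistics (mean latency for OLTP, bandwidth for DSS) are assumptions made for illustration; they are not part of ISTORE.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DiskSample:
    """One raw monitoring sample from the device interface layer (hypothetical format)."""
    bytes_transferred: int
    service_time_ms: float


def oltp_summary(samples: List[DiskSample]) -> Dict[str, float]:
    # Assumption: an OLTP server mostly cares about per-request latency.
    total_ms = sum(s.service_time_ms for s in samples)
    return {"mean_latency_ms": total_ms / len(samples)}


def dss_summary(samples: List[DiskSample]) -> Dict[str, float]:
    # Assumption: a DSS server mostly cares about sustained bandwidth.
    total_bytes = sum(s.bytes_transferred for s in samples)
    total_ms = sum(s.service_time_ms for s in samples)
    return {"bandwidth_mb_per_s": (total_bytes / 1e6) / (total_ms / 1e3)}


# The appliance designer selects which aggregation the local runtime applies.
FILTERS: Dict[str, Callable[[List[DiskSample]], Dict[str, float]]] = {
    "oltp": oltp_summary,
    "dss": dss_summary,
}

samples = [DiskSample(8192, 4.0), DiskSample(8192, 6.0)]
oltp_view = FILTERS["oltp"](samples)
dss_view = FILTERS["dss"](samples)
print(oltp_view, dss_view)
```

Both summaries assume a non-empty sample list; a real local runtime would also have to decide what to report for an idle device.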
Global runtime layer
[Figure: the node diagram (front-end node(s), switch, LAN/WAN, device nodes) with the distributed global runtime layer added on top of the local runtime.]
Global runtime layer
• Aggregate, process, react to monitoring data
  • Relies on local per-device runtime mechanisms to provide monitoring data, implement control actions
• Provides application interface that hides distributed implementation of runtime services
• Example services
  • High-level services
    • Load balancing: replicate and/or migrate heavily used data objects when a disk becomes over-utilized
    • Availability: replicate data from failed or failing component to restore required redundancy
    • Plug-and-play: integrate new devices into the system
  • Low-level services used to implement high-level global services
    • Distributed directory tracks data and metadata objects
    • Migration, replication, caching
    • Inter-brick communication
    • Distributed transactions
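The load-balancing service could be sketched along the following lines: when some disk's utilization exceeds a threshold, move its most heavily used object to the least-utilized disk. The function name, the 0.8 threshold, and the data shapes are hypothetical; the slides do not specify how ISTORE chooses objects or targets.

```python
from typing import Dict, List, Optional, Tuple


def plan_migration(
    utilization: Dict[str, float],
    hot_objects: Dict[str, List[Tuple[str, int]]],
    threshold: float = 0.8,
) -> Optional[Tuple[str, str, str]]:
    """Return (object, source disk, target disk), or None if nothing needs to move."""
    source = max(utilization, key=utilization.get)
    if utilization[source] <= threshold:
        return None  # no disk is over-utilized
    if not hot_objects.get(source):
        return None  # nothing known to move off the busy disk
    target = min(utilization, key=utilization.get)
    # Pick the most heavily accessed object on the overloaded disk.
    obj, _accesses = max(hot_objects[source], key=lambda pair: pair[1])
    return obj, source, target


utilization = {"disk0": 0.95, "disk1": 0.40, "disk2": 0.55}
hot_objects = {"disk0": [("orders", 900), ("users", 120)], "disk1": [], "disk2": []}
plan = plan_migration(utilization, hot_objects)
print(plan)
```

In the layered design, the utilization figures would come from the local runtime on each device, and the returned plan would be carried out by the low-level migration and directory services.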
Distributed application worker code
[Figure: the node diagram (front-end node(s), switch, LAN/WAN, device nodes) with parallel application worker code added alongside the distributed global runtime.]
Distributed application worker code
• Runs on top of global runtime system
• Written by appliance designer
• Application-specific
  • Database
    • scan, sort, join, aggregate, update record, delete record, ...
  • Transformational web proxy
    • fetch web page (from disk or remote site), apply transformation filter, update user preferences database, ...
• System administration tools implemented at this level
  • Customized runtime system defines administrative interface tailored to application
Application front-end code
[Figure: the complete node diagram (front-end node(s), switch, LAN/WAN, device nodes) with application front-end code added on the front-end node(s).]
Application front-end code
• Runs on front-end interface bricks
• Accepts requests from LAN/WAN connection
  • Incoming requests made using standard high-level protocols
    • HTTP, NFS, SQL, ODBC, …
• Invokes and coordinates appropriate worker code components that execute on internal blocks
  • Takes into account locality and load balancing
  • Database: front-end performs SQL query optimization, invokes distributed relational operators on data storage devices
  • Transformational proxy: front-end invokes distiller thread on appropriate device brick
    • if data is cached, invoke on disk node
    • otherwise, fetch data from web and invoke on compute node or disk node
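The proxy dispatch rule can be written down as a small function. This is only a sketch: `choose_node`, the cache map, and the load figures are hypothetical stand-ins for whatever the global runtime's directory and monitoring services would actually expose.

```python
from typing import Dict


def choose_node(url: str, cache_location: Dict[str, str], load: Dict[str, float]) -> str:
    """Pick the brick that should run the distiller thread for `url`."""
    if url in cache_location:
        # Data is cached: run the distiller on the disk node holding it (locality).
        return cache_location[url]
    # Not cached: the page is fetched from the web, so either a compute node or
    # a disk node will do; pick the least-loaded one (load balancing).
    return min(load, key=load.get)


cache_location = {"http://example.com/a": "disk3"}
load = {"disk3": 0.9, "compute1": 0.2}
cached_choice = choose_node("http://example.com/a", cache_location, load)
uncached_choice = choose_node("http://example.com/b", cache_location, load)
print(cached_choice, uncached_choice)
```

Note that the cached case deliberately ignores load; a fuller policy might fall back to another replica when the node holding the data is saturated.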
Roadmap • Layered software structure • Example of introspection • Runtime system extensibility using DSLs • Conclusion
From introspection to adaptation
• Example: slowly-failing data disk in large DB system
  (1) Detect problem
  (2) Repair problem while continuing to handle incoming requests
  (3) Return to normal system operation
[Figure labels: intelligent HW components; continuous monitoring; extensible, application-tailored runtime system; adaptive, self-maintaining appliance]
Failing disk: detection
• Microkernel monitoring module continuously monitors the disk’s health and detects exceptional conditions, e.g.
  • ECC failures
  • Media errors
  • Increased rates of ECC retries
• Notifies global fault handling mechanism
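One way such a monitoring module might flag "increased rates of ECC retries" is to compare each interval's retry count with a recent average. The class below is a minimal sketch under that assumption; the window size, the factor of 3, and the `notify` callback are all hypothetical and do not come from the ISTORE design.

```python
from collections import deque
from typing import Callable, Deque


class DiskHealthMonitor:
    """Flags a disk on any media error or on a sharp rise in ECC retries."""

    def __init__(self, notify: Callable[[str, str], None], disk_id: str,
                 window: int = 5, factor: float = 3.0) -> None:
        self.notify = notify            # stands in for the global fault handler
        self.disk_id = disk_id
        self.history: Deque[int] = deque(maxlen=window)
        self.factor = factor

    def record_interval(self, ecc_retries: int, media_errors: int) -> None:
        if media_errors > 0:
            self.notify(self.disk_id, "media error")
        elif len(self.history) == self.history.maxlen:
            # Only judge the retry rate once a full window of history exists.
            baseline = sum(self.history) / len(self.history)
            if ecc_retries > self.factor * max(baseline, 1.0):
                self.notify(self.disk_id, "increased ECC retry rate")
        self.history.append(ecc_retries)


alerts = []
monitor = DiskHealthMonitor(lambda disk, why: alerts.append((disk, why)), "disk7")
for retries in [2, 3, 2, 3, 2, 40]:
    monitor.record_interval(retries, media_errors=0)
print(alerts)
```

The callback keeps detection separate from reaction, mirroring the split between the device-level monitoring module and the global fault handling mechanism.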
Failing disk: reaction
• Global fault handling mechanism…
  • Prevents system from sending more work to failed device
    • Modifies global directory to remove entries corresponding to failed component’s data
  • Invokes application-specific response to impending failure
    • Transactional system: discard work currently in progress on failing device, reissue to another data replica
    • Non-transactional system w/o coherent replicas: checkpoint computation, restore on another data replica
    • Transformational web proxy: do nothing
  • Instructs disk runtime system to shut disk down
    • Disk device is considered failed
Failing disk: return to normal operation
• Global fault handling mechanism...
  • Rebuilds data redundancy
    • By allocating space for a new replica on a functioning disk and copying data to it from existing replicas
    • Using an application-specific data replication mechanism
      • Where to allocate new replicas, how to copy data, how to lay out data for new replicas, how to update global directory
      • Example in upcoming slide
• Life returns to normal
  • Degree of fault-tolerance has been restored
  • Failed component can be replaced during regularly-scheduled maintenance
Roadmap • Layered software structure • Example of introspection • Runtime system extensibility using DSLs • Conclusion
Runtime system extensibility
• Two ways of looking at system
  • Partitioned on functional/mechanism boundaries
    • Collection of libraries: failure detection, transactions, ...
    • Mechanisms are isolated
  • Partitioned on global system properties
    • This is how the programmer thinks about the system (high-level)
    • e.g. application-specific data availability policy
      • Failure detection (which devices to monitor, …)
      • Replication (used to restore redundancy)
      • Transactions (how to restart work in progress)
      • Caching (how to handle dirty cached objects during failure)
[Figure labels: application, OS, libfail, librepl, libtrxn, libcache; policy, compiler, customized runtime system library]
Extensibility using DSLs
• DSLs are languages specialized for a particular task
• Each ISTORE DSL
  • Encapsulates high-level semantics of one system behavior
  • Allows declarative specification of
    • Behavior of one aspect of the system (a “policy”)
    • Interfaces to coordinated mechanisms that implement the policy
  • Is compiled into an implementation that might coordinate several local and/or global base runtime system mechanisms
    • May be implemented as background and/or foreground tasks
• Analysis tools can potentially infer unspecified emergent system behaviors from the specifications
  • e.g. what impact will a new redundancy policy have on transaction commit time
• Extensions compiled together with local and global base mechanisms form the distributed runtime system
Extensibility using DSLs: Example

    Avail::FailureDetected(Device d) {
      Object o; ObjList objs;
      Transaction t; TxnList txns;
      Replica x, c, r;

      Directory::MarkDeviceDisabled(d);
      Admin::AlertFailure(d);
      objs = Directory::GetObjects(d);         // objs stored on failed device
      foreach o (objs) {
        x = Directory::GetReplica(o, d);       // find o's replica on d
        Directory::DeleteReplica(x);           // delete from global directory
        txns = Txn::GetActiveTxns(x);
        foreach t (txns) {
          Txn::AbortTxn(t);                    // abort pending txns for o on d
        }
        c = Directory::GetReplica(o);          // find still-accessible copy
        r = LoadBalancer::AllocateReplica(o);  // get space for new replica
        LocalRuntime::CopyObject(c, c->device, r, r->device);  // copy it
        Directory::AddReplica(r, r->device);   // update directory
        foreach t (txns) {
          Txn::IssueTxn(t, r);                 // reissue txns on new replica
        }
      }
    }
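To make the control flow of the specification concrete, here is a toy, single-process Python model of the same steps. It is not the ISTORE runtime or the output of its DSL compiler: the `Cluster` class, its dictionaries, and the "least-populated device" allocation rule are assumptions, and the handling of objects with no surviving replica or no free device is an addition that the slide does not cover.

```python
from typing import Dict, List, Set, Tuple


class Cluster:
    """Toy in-memory stand-in for the directory, transaction, and copy mechanisms."""

    def __init__(self, replicas: Dict[str, Set[str]], devices: Set[str]) -> None:
        self.replicas = replicas        # object name -> devices holding a replica
        self.devices = set(devices)     # devices currently enabled
        self.active_txns: Dict[Tuple[str, str], List[str]] = {}  # (object, device) -> txns
        self.log: List[str] = []

    def _replica_count(self, device: str) -> int:
        return sum(device in holders for holders in self.replicas.values())

    def failure_detected(self, d: str) -> None:
        self.devices.discard(d)                        # MarkDeviceDisabled
        self.log.append(f"alert: {d} failed")          # AlertFailure
        for obj, holders in self.replicas.items():
            if d not in holders:
                continue
            holders.discard(d)                         # DeleteReplica
            txns = self.active_txns.pop((obj, d), [])  # GetActiveTxns + AbortTxn
            if not holders:
                # Added case: nothing left to copy from.
                self.log.append(f"lost: {obj} has no surviving replica")
                continue
            source = sorted(holders)[0]                # still-accessible copy
            candidates = sorted(self.devices - holders)
            if candidates:
                # AllocateReplica (assumed rule): least-populated enabled device.
                target = min(candidates, key=self._replica_count)
                self.log.append(f"copy {obj}: {source} -> {target}")  # CopyObject
                holders.add(target)                    # AddReplica
            else:
                # Added case: no spare device, so reissue on the surviving copy.
                target = source
            if txns:
                self.active_txns.setdefault((obj, target), []).extend(txns)  # IssueTxn


cluster = Cluster(
    replicas={"orders": {"disk0", "disk1"}, "users": {"disk1", "disk2"}},
    devices={"disk0", "disk1", "disk2", "disk3"},
)
cluster.active_txns[("orders", "disk1")] = ["t1"]
cluster.failure_detected("disk1")
print(cluster.replicas, cluster.active_txns)
```

Even this toy version has to deal with bookkeeping that the declarative specification leaves to the base mechanisms, which is the division of labor the DSL approach aims for.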
Extensibility using DSLs (cont.)
• Similar specification written for each extension to base library
• Other examples of extensible system behaviors
  • Transaction response time requirements
  • Prioritizing operations based on type of data processed
  • Resource allocation
  • Backup policy
  • Exported administrative interface
Why use DSLs?
• Possible choices
  • Each appliance designer writes runtime system from scratch
    • Similar to exokernel operating systems
  • All designers use single parameterized runtime system library
    • Similar to tunable kernel parameters in modern OSs
  • Designer writes high-level specification of system behavior
    • DSL compiler automatically translates specification into runtime system extensions that coordinate base mechanisms
• Advantages include
  • Programmability
  • Performance
  • Reliability, verifiability, safety
  • Artificial diversity
DSL advantages (cont.)
• Programmability
  • High-level specification close to designer’s abstraction level
  • Easier to write, reason about, maintain, modify runtime system code
  • Simple enough to allow site-specific customization at installation time
• Performance
  • Aggressive DSL compiler can take advantage of high-level semantics of specification language
  • Base library mechanisms can be highly optimized; optimization complexity hidden from appliance designer
  • Web example: infer that TCP checksums should be stored with web pages
DSL advantages (cont.)
• Reliability
  • Automatically generate code that’s easy to forget or get wrong
  • Example: synchronization operations to serialize accesses to distributed data structure
• Verifiability
  • Of input code (DSL specification)
    • More abstract form of semantic checking
    • e.g. DSL supports types natural to behavior being specified => type-checking verifies some semantic constraints
    • e.g. “ensure no unencrypted objects are written to disk”
  • Of output code (coordinated use of base mechanisms)
    • Once the DSL compiler writer is satisfied that the DSL compiler is correct, the appliance designer inherits that verification effort
• Safety (prevent runtime errors)
  • Whole classes of general programming errors not possible
  • DSLs hide details: runtime memory management, IPC, ...
  • Compiler automatically adds code: synchronization, ...
DSL advantages (cont.)
• Artificial diversity
  • Potentially allow system to continue operation in face of internal bugs or malicious attack
    • Multiple implementations of component run simultaneously on different data replicas
    • Continuously check each other with respect to high-level behavior
    • Non-traditional fault-tolerance, but related to process pairs
  • Potentially usable to enhance performance
    • Select best-performing implementation(s) for future use; periodically reevaluate choice
  • Examples of possible implementation differences
    • Low-level: runtime memory layout, code ordering and layout
    • High-level: system resource usage (recompute vs. use stored data, general space/time/bandwidth tradeoffs)
[Figure: a specification is fed to the DSL compiler, which produces Implementation 1, Implementation 2, Implementation 3, ...]
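The idea of implementations checking each other can be illustrated with a simple majority vote. In this sketch each implementation gets its own copy of the input (standing in for a separate data replica) and the results are compared; the voting rule, the function names, and the deliberately buggy third implementation are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def run_diverse(implementations: Sequence[Callable[[List[int]], List[int]]],
                data: List[int]) -> Tuple[List[int], List[int]]:
    """Run every implementation on its own copy of the data and vote on the result.

    Returns the majority result and the indices of implementations that disagreed.
    """
    results = [tuple(impl(list(data))) for impl in implementations]
    winner, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority among implementations")
    dissenters = [i for i, result in enumerate(results) if result != winner]
    return list(winner), dissenters


def sort_builtin(xs: List[int]) -> List[int]:
    return sorted(xs)


def sort_insertion(xs: List[int]) -> List[int]:
    out: List[int] = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def sort_buggy(xs: List[int]) -> List[int]:
    return xs  # deliberately wrong: returns the input unsorted


answer, dissenters = run_diverse([sort_builtin, sort_insertion, sort_buggy], [3, 1, 2])
print(answer, dissenters)
```

The list of dissenters is what a runtime could use to retire a faulty implementation, while a timing comparison of the agreeing ones would serve the performance use mentioned on the slide.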
ISTORE software summary
• ISTORE software architecture provides an extensible runtime environment for distributed network service application code
  • Layered local and global mechanism libraries provide introspection and self-maintenance
  • Mechanisms can be customized using DSL-based specifications of application policy
  • DSL code coordinates base mechanisms to implement application semantics and interfaces
• DSL-based extension offers significant advantages in programmability, performance, reliability, safety, diversity
ISTORE summary
• Network services are increasing in importance
  • Self-maintaining scalable storage appliances match the needs of these services
• ISTORE provides a flexible architecture for implementing storage-based network service apps
  • Modular, intelligent, fault-tolerant hardware platform is easy to configure, scale, and administer
  • Runtime system allows applications to leverage intelligent hardware, achieve introspection, and provide self-maintenance through
    • Layered runtime software structure
    • DSL-based extensibility that allows easy application-specific customization
Agenda • Overview of ISTORE: Motivation and Architecture • Hardware Details and Prototype Plans • Software Architecture • Discussion and Feedback
What ISTORE is not
• An extensible operating system
  • Use commodity OS, only add hardware monitoring module
    • Monitoring module could just be a device driver => no need for microkernel OS
  • ISTORE could be built on top of an extensible operating system for even greater flexibility
• An attempt to make commodity OSs extensible
  • Extensible runtime system allows designer to customize higher-level operations than OS extensions do
  • Closest to an extensible distributed operating system built on top of a commodity single-node operating system
• A multiple-protection-domain system
  • Assumes non-malicious programmer
  • If user-downloaded code permitted, sandbox must be implemented as part of (trusted) application
  • DSLs specify resource allocation/scheduling policies; appliance designer responsible for ensuring fairness
• A framework for building generic servers
ISTORE boot process
(1) Initially, undifferentiated ISTORE system
(2) On boot, each device block contacts system boot server
(3) Device blocks download customized runtime system and application worker code
  • Front-end blocks also download application front-end code
  • Runtime system libraries structured as shared libraries => hot upgrade
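The role-dependent download in step (3) might look like the following from the boot server's side. This is a sketch only: the `BootServer` class, the role names, and the file names are invented for illustration, and the slides do not describe the boot protocol itself.

```python
from typing import List


class BootServer:
    """Hypothetical boot server that maps a block's role to what it downloads."""

    def __init__(self, runtime: str, worker: str, front_end: str) -> None:
        self.runtime = runtime      # customized runtime system
        self.worker = worker        # application worker code
        self.front_end = front_end  # application front-end code

    def components_for(self, role: str) -> List[str]:
        if role not in ("device", "front_end"):
            raise ValueError(f"unknown role: {role}")
        # Every block gets the runtime system and the worker code.
        parts = [self.runtime, self.worker]
        if role == "front_end":
            # Front-end blocks also get the front-end code.
            parts.append(self.front_end)
        return parts


server = BootServer("libruntime.so", "worker.bin", "frontend.bin")
device_parts = server.components_for("device")
front_end_parts = server.components_for("front_end")
print(device_parts, front_end_parts)
```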
Example Appliances • E-commerce • Web search engine • Transformational web/PDA proxy • Election server • Mail server • News server • NFS server • Database server: OLTP, DSS, mixed OLTP-DSS • Video server