Inter-Transactional Parallelism for Persistent Distributed Shared Virtual Memory - Implementation and Performance - 金 泰勇, Department of Intelligent Systems, Graduate School of Information Science and Electrical Engineering, Kyushu University
Outline • Introduction • Overview of WAKASHI • Network Of Workstations • Persistent Distributed Shared Virtual Memory • Generalized Distributed Lock Protocol • Algorithm • Related Work • Evaluation (M-OO7 benchmark) • Cost-based Distributed Transaction Coordinator • Architecture and Algorithm • Related Work • Evaluation (TPC-C benchmark) • Conclusion and Future Work
Introduction (1) • ShusseUo - an Object Database Management Group (ODMG) compliant object database system, built in layers: • WARASA: OQL Compiler and ODL Pre-Processor • INADA: ODMG Object Model and Persistent Object Manipulation Language (C++ Binding) • WAKASHI: Persistent Distributed Shared Virtual Memory and Transaction Management • Operating System
Introduction (2) • Network Of Workstations (NOW) • [Figure: several workstations, each with its own CPU, memory, and disk, connected by a Local Area Network]
Introduction (3) • Characteristics of NOW [Berkeley NOW group] • Better performance for sequential applications than an individual workstation: most sequential applications can be divided into several independent parts, and these parts can be executed in parallel on a NOW • Better price/performance for parallel applications than Massively Parallel Processors (MPP) • We use a NOW as the hardware environment of the database server
Introduction (4) • Two transactional parallelisms on a NOW • Inter-Transactional Parallelism: independent transactions (e.g. T1, T2) run in parallel on different workstations • Intra-Transactional Parallelism: a single transaction is split into sub-transactions (e.g. T1 into T1a, T1b, T1c; T2 into T2a, T2b) that run in parallel
Introduction (5) • Distributed Shared Virtual Memory (DSVM) [Kai Li 1989, Princeton University] • [Figure: the memories of the individual workstations are combined into one shared DSVM address space] • Hardware-level DSVM [DASH, KSR1] • Software-level DSVM [Munin, TreadMarks, WAKASHI]
Introduction (6) • Persistent Distributed Shared Virtual Memory (PDSVM) • Transaction-integrated DSVM: all DSVM accesses take place inside transactions, and data in the DSVM space is kept persistent • The problems PDSVM has to face: utilize resources efficiently, and reduce the cost of communication among the sites • Two main factors determine the communication cost: message size and the number of messages • Cost(n KB message) < n × Cost(1 KB message), so one large message is cheaper than many small ones
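The cost inequality above can be made concrete with a simple linear message-cost model. This is an illustrative sketch only, not measurements from WAKASHI; the constants `FIXED_OVERHEAD` and `PER_KB` are assumed values:

```python
# Sketch: linear message-cost model showing why
# Cost(n KB message) < n * Cost(1 KB message).
# Constants are assumed for illustration, not measured.

FIXED_OVERHEAD = 0.5  # assumed per-message overhead (e.g. headers, syscalls)
PER_KB = 0.1          # assumed per-kilobyte transfer cost

def msg_cost(size_kb: float) -> float:
    """Cost of sending one message of size_kb kilobytes."""
    return FIXED_OVERHEAD + size_kb * PER_KB

def batched_cost(n: int) -> float:
    """Send n KB of data as a single message."""
    return msg_cost(n)

def separate_cost(n: int) -> float:
    """Send the same n KB as n separate 1 KB messages."""
    return n * msg_cost(1)

if __name__ == "__main__":
    for n in (2, 8, 64):
        print(n, batched_cost(n), separate_cost(n))
```

The gap between the two grows linearly with n, since each extra message pays the fixed overhead again; this is why a PDSVM benefits from batching data into fewer, larger messages.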
Introduction (7) • PDSVM as implemented in WAKASHI • [Figure: each heap is stored on disk at one primary site (disk mapping); all sites, primary and mirror, map it into the DSVM space (DSVM mapping), and mirror sites access it over the network]
Introduction (8) • PDSVM data access patterns and cost at the primary site • [Figure: Read(p) swaps p into memory (primary read); Write(p) swaps p into memory and later out to disk (primary write) - no network messages are needed]
Introduction (9) • PDSVM data access patterns at a mirror site • [Figure: Read(p) and Write(p) at a mirror site additionally require a remote_page_lock message to the primary site and a page_transfer, besides swapping p between memory and disk]
Generalized Distributed Lock Protocol of WAKASHI • Lock Release • [Figure: under lock release, the mirror site re-acquires the page lock and re-fetches the page for r(p) in every transaction, and all page locks are released at transaction end] • Message types: Remote Page Lock, Page Transfer, All Page Lock Release, Remote Page Lock Forward
Generalized Distributed Lock Protocol of WAKASHI • Lock Retain • [Figure: under lock retain, the mirror site keeps the page lock across transaction boundaries, so subsequent r(p) in later transactions need no remote messages until a conflicting w(p) at another site reclaims the lock]
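The benefit of lock retain over lock release can be sketched by counting remote messages when a mirror site re-reads the same page in k successive transactions. This is an illustration of the idea, not the WAKASHI implementation, and it assumes no conflicting write invalidates the retained lock:

```python
# Sketch: remote messages paid by a mirror site that reads the same
# page in k successive transactions, under the two lock modes.

def messages_lock_release(k: int) -> int:
    # Every transaction re-acquires the lock and re-fetches the page:
    # one remote_page_lock + one page_transfer each time.
    return 2 * k

def messages_lock_retain(k: int) -> int:
    # The lock (and the page) are retained across transaction
    # boundaries, so only the first transaction pays the
    # remote_page_lock + page_transfer cost.
    return 2 if k > 0 else 0

if __name__ == "__main__":
    for k in (1, 5, 100):
        print(k, messages_lock_release(k), messages_lock_retain(k))
```

Under a conflicting write the retained lock would be reclaimed (via a remote page lock forward) and the saving shrinks, which matches the update-heavy results later in the evaluation.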
Generalized Distributed Lock Protocol of WAKASHI • Retain Mode (a pair of a Commit Mode and an Abort Mode) • At transaction commit or abort, each lock is handled in one of three ways: Lock Release (LRL), Read Lock Retain (RLRT), or Write Lock Retain (WLRT) • Supported retain modes (commit mode_abort mode): LRL_LRL, RLRT_LRL, WLRT_LRL, LRL_RLRT, RLRT_RLRT, WLRT_RLRT
Generalized Distributed Lock Protocol of WAKASHI • Attaching retain modes to a transaction:
Transaction_Begin( <h1, mode_1>, <h2, mode_2>, … );
…
READ(h1, p1);
…
READ(h2, p2);
…
WRITE(h3, p3);
Transaction_Commit();
• Retain modes are decided when a transaction begins • A <HID, RETAIN_MODE> pair attaches a retain mode to a heap • Attached retain modes are only valid on the pages accessed during the transaction
Generalized Distributed Lock Protocol of WAKASHI • Related Work • Lazy Release Consistency (LRC) Protocol [Rice University, 1992] • Locks are managed by a lock manager • Client programs handle locks through two primitives: Acquire and Release • A lock is not released immediately when the client program releases it • Their measurements show that LRC performs better than the common release consistency protocol on some applications • In LRC, the lock primitives are set explicitly by the client programmer • LRC is designed for distributed parallel computing applications
Generalized Distributed Lock Protocol of WAKASHI • Related Work • Cache consistency protocols in client-server database architectures • Caching 2-Phase Locking (C2PL) [Franklin 1992]: a lock is granted when a cached page is about to be accessed and released when the transaction ends • Callback (CB) [Wisconsin Univ. 1992, 1997]: all caches remain valid until a callback message arrives • Callback-Read: when a cache is to be updated, callbacks are sent to the other clients where the cache is read • Callback-All: when a cache is to be accessed, callbacks are sent to the other clients holding caches that conflict with the access • Differences from GDL: the architectures differ, and GDL supports more lock-processing modes
Evaluation of GDL • Multi-User OO7 benchmark • [Figure: each user owns one private module, and all users share one shared module; a module is a 7-level assembly hierarchy of sub-modules whose leaves are composite parts built from atomic parts]
Evaluation of GDL • Transaction types • Read Only: traverse a module without any update • Update: traverse and update each atomic part • Operation Configuration Vector (OCV): <Pr, Pw, Sr, Sw> • Pr/Pw is the probability of a read/write operation on a private module • Sr/Sw is the probability of a read/write operation on the shared module • OCV types: Read Only <50, 0, 50, 0>, 10% Update <45, 5, 45, 5>, 50% Update <25, 25, 25, 25>
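A sketch of how an OCV drives the generated workload (an illustration, not the actual benchmark driver; the operation tuples and function names are mine):

```python
# Sketch: drawing operations according to an Operation Configuration
# Vector <Pr, Pw, Sr, Sw>, whose four entries are percentages
# summing to 100.
import random

OCVS = {
    "read_only":  (50, 0, 50, 0),
    "10%_update": (45, 5, 45, 5),
    "50%_update": (25, 25, 25, 25),
}

# Operations in the same order as the OCV entries.
OPS = [("private", "read"), ("private", "write"),
       ("shared", "read"), ("shared", "write")]

def draw_operation(ocv, rng=random):
    """Pick a (module_kind, op_kind) pair with OCV probabilities."""
    return rng.choices(OPS, weights=ocv, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(42)
    print([draw_operation(OCVS["10%_update"], rng) for _ in range(5)])
```

With the Read Only vector the write weights are zero, so the generated trace never contains a write, matching the benchmark's Read Only transaction type.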
Evaluation of GDL • Testbed: 7 Sun Ultra5 workstations (Super-Sparc, 400 MHz; 128 MB main memory; IBM DJNA 22 GB disk), connected by 100 Mbit Ethernet
Evaluation of GDL • Non-Clustering Plan • All modules are located in one heap • The heap is located at one site • [Figure: six Ultra5 client sites access the database held at a single site]
Evaluation Result • Read Only
Evaluation Result • 10% Update
Evaluation Result • 50% Update
Evaluation of GDL • Clustering Plan • Each module is located in 2 heaps (ReadOnly, Update) • Private modules are distributed across all of the sites • The shared module is allocated at one site • [Figure: one Ultra5 site holds the shared module; each of the six remaining Ultra5 sites holds one private module]
Evaluation Result • Read Only
Evaluation Result • Number of the messages at Read Only
Evaluation Result • 10% Update
Evaluation Result • Number of the messages at 10% Update
Evaluation Result • Number of the remote page lock messages at 10% Update
Evaluation Result • Number of the remote page lock forward messages at 10% Update
Evaluation Result • 50% Update
Evaluation Result • Number of the messages at 50% Update
Evaluation Result • Number of the remote page lock messages at 50% Update
Evaluation Result • Number of the remote page lock forward messages at 50% Update
Cost-based Distributed Transaction Coordinator • Transaction Coordinator goals • Utilize all of the workstations efficiently • Execute the transactions at the lowest cost • The cost of a transaction is determined by • the type of the transaction • the site where the transaction runs • [Figure: submitted transactions flow through the Transaction Coordinator, which distributes them to the workstations]
Architecture • Cost-based Transaction Coordinator components: Transaction Pool, Transaction Scheduler, Database Distribution Manager, Load Information Manager, Execute Element Manager, Adapter • Each site runs Execute Elements with Dispatchers • Functionalities of an Execute Element: execute the coordinated transactions; collect the load information of the executed transactions; feed that load information back to the Transaction Coordinator • Transaction Placement Policy: decides how to coordinate a transaction when it is submitted to the TC • Transaction Scheduling Policy: decides which blocked transaction in the Transaction Pool is executed at which site when a transaction finishes at an Execute Element
Cost-based Approaches • Cost-based Approach 1: Static approach (CTC-Static) • A Static Coordinator Description File (SCDF) holds pairs <T, S> • T is the type ID of a transaction • S is the IP address of the host where transactions of type T are executed • Transaction Placement Policy: select an idle EE for the submitted transaction according to the SCDF • Transaction Scheduling Policy: select the next blocked transaction, also according to the SCDF
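The SCDF lookup can be sketched as a fixed table from transaction type IDs to host addresses. This is an illustration of CTC-Static's placement rule, not the actual coordinator code; the type IDs and IP addresses below are hypothetical:

```python
# Sketch: CTC-Static placement via a Static Coordinator Description
# File (SCDF), mapping transaction type IDs to host IP addresses.
# All entries are hypothetical examples.

SCDF = {
    1: "192.168.0.11",
    2: "192.168.0.12",
    3: "192.168.0.13",
}

def place(txn_type: int, idle_sites: set):
    """Return the SCDF-designated site for this transaction type if it
    currently has an idle Execute Element; otherwise the transaction
    stays blocked in the Transaction Pool (returns None)."""
    site = SCDF.get(txn_type)
    return site if site in idle_sites else None
```

The rigidity visible here (a transaction type can only ever run at its SCDF site, idle or not) is exactly what the final evaluation slide blames for CTC-Static's unbalanced distribution of active EEs.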
Cost-based Approaches • Cost-based Approach 2: Transaction-Priority-Oriented approach (CTC-TPOA) • Transaction Placement Policy: look through all of the EEs to find an idle EE to execute the submitted transaction • Transaction Scheduling Policy: look through all of the EEs to find an idle EE to execute the blocked transaction with the earliest arrival time
Cost-based Approaches • Cost-based Approach 3: Low-Cost-Oriented approach (CTC-LCOA) • Priority Value (PV): PV(t, s) = Cost(t, s) − PreemptionFactor(t) • Cost(t, s) is the cost of executing t at host s • If a transaction that arrived later than t is coordinated before t, the preemption factor of t is increased by k • Transaction Placement Policy: the same as CTC-TPOA • Transaction Scheduling Policy: for the site where an EE has just finished a transaction, look through all of the blocked transactions and coordinate the one with the lowest PV to that site • The distribution of active EEs is fixed in CTC-LCOA
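The CTC-LCOA scheduling policy can be sketched as follows. This is an illustrative simplification, not the WAKASHI scheduler: cost values and the increment `K` are assumed, and the `Txn` class is mine:

```python
# Sketch: CTC-LCOA scheduling. When a site's EE frees up, run the
# blocked transaction with the lowest PV(t, s) = Cost(t, s) -
# PreemptionFactor(t); transactions passed over by a later arrival
# get their preemption factor raised by K (starvation avoidance).

K = 1.0  # assumed increment of the preemption factor

class Txn:
    def __init__(self, name, arrival, costs):
        self.name = name
        self.arrival = arrival    # arrival time at the coordinator
        self.costs = costs        # site -> estimated execution cost
        self.preemption = 0.0     # raised when a later txn jumps ahead

def pick_next(blocked, site):
    """Pick the blocked transaction with the lowest PV at `site`;
    every earlier-arriving transaction it preempts gets its
    preemption factor raised by K."""
    chosen = min(blocked, key=lambda t: t.costs[site] - t.preemption)
    for t in blocked:
        if t is not chosen and t.arrival < chosen.arrival:
            t.preemption += K
    blocked.remove(chosen)
    return chosen
```

Cheap transactions are favored, but each time an old transaction is preempted its PV drops by K, so it is eventually scheduled rather than starved.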
Related Work • Degree of Multi-Programming (DMP) based algorithm [ObjectStore, 1991]: limit the multi-programming level (the number of concurrent transactions) • Feedback based algorithms • Throughput feedback based algorithm [VLDB, 1991]: a resource-contention aided algorithm • Conflict-ratio feedback based adaptive transaction scheduling algorithm [VLDB, 1992]: a data-contention aided algorithm • Resource contention: the currently available resources do not satisfy the required resources • Data contention: excessive lock conflicts degrade performance significantly
Evaluation (1) • TPC-C benchmark model • TPC-C is an Online Transaction Processing benchmark • Database schema: each Warehouse has 10 Districts and 100,000 Stocks (of 100,000 Items); each District has 3,000 Customers; a Customer has 1+ History records and 1+ Orders; each Order has 5-15 Order-Lines and 0-1 New-Order entry
Evaluation (2) • Transaction types: New-Order (n/a), Payment (43%), Order-Status (4%), Delivery (4%), Stock-Level (4%) • The measured throughput of New-Order transactions (MQTH) is reported as the performance result
Evaluation (3) • Testbed: 1 coordinator site and 16 Execute Element sites, each a Sun Ultra5 (Super-Sparc, 400 MHz; 128 MB main memory; IBM DJNA 22 GB disk), connected by 100 Mbit Ethernet
Evaluation (4) • MQTH Result
Evaluation (5) • Rate of Primary Accessed Pages
Evaluation (6) • Distribution of Active EEs (MPL=32)
Evaluation (7) • Why is the distribution of active Execute Elements unbalanced in CTC-Static? • [Figure: with SCDF entries T1→S1, T2→S2, T3→S3 and MPL=3, the arrival sequence T1, T2, T2, T2, T2, T3 forces every T2 onto site S2, so transactions queue at S2 while the Execute Elements at the other sites stay idle]