Automatic Data Partitioning in Software Transactional Memories
Torvald Riegel, Christof Fetzer, Pascal Felber (TU Dresden, Germany / Uni Neuchatel, Switzerland)
No one-size-fits-all TM!
• STMs:
  • Design:
    • Invisible vs. visible reads
    • Object-based vs. word-based
  • Parameters:
    • Lock-based: #locks, address-to-lock mapping
• HTMs:
  • Different interfaces (e.g., Rock vs. AMD’s ASF)
  • Resource bounds
• Heterogeneous workloads: global tuning does not help
Divide and conquer!?
How to divide
• User-driven? Hmm, rather not …
• Temporally
  • Runtime tuning can handle phases
  • … but only if the whole workload has the same phases
• Memory
  • “Word-based”: mapping function is difficult
    • Runtime overheads
    • Mapping needs to be stable
    • Memory allocator affects mapping heavily (see false conflicts)
  • “Object-based”: still need mapping or per-object data
• Code
  • Problem: same function might operate on different data
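The difficulty with a word-based mapping function can be illustrated with a small sketch. It assumes a single shared lock array indexed by shifted address bits, which is one common scheme and not necessarily the one used by any particular STM; `NUM_LOCKS`, `WORD_SHIFT`, `lock_index` and `false_conflict` are hypothetical names chosen for this example.

```python
NUM_LOCKS = 2 ** 4   # deliberately tiny so collisions are easy to provoke
WORD_SHIFT = 3       # assumption: 8-byte words, so drop the low 3 address bits


def lock_index(address, num_locks=NUM_LOCKS, shift=WORD_SHIFT):
    """Map a word address to a slot in one shared lock array."""
    return (address >> shift) % num_locks


def false_conflict(addr_a, addr_b):
    """Two distinct words conflict falsely if they hash to the same lock."""
    return addr_a != addr_b and lock_index(addr_a) == lock_index(addr_b)


# Two unrelated words that are NUM_LOCKS words apart share a lock slot.
a = 0x1000
b = a + NUM_LOCKS * (1 << WORD_SHIFT)
print(lock_index(a), lock_index(b), false_conflict(a, b))
```

Where the allocator places objects decides which addresses collide, which is why the allocator influences false conflicts so strongly.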
How to conquer?
• Tune concurrency control mechanisms
  • Use different STM implementations
  • Use HTM only where applicable/necessary
  • Tune TM parameters per partition
  • Challenge: Threads must agree on which mechanisms to use for each item/location!
  • Two-phase commit or similar is necessary when using several independent TM mechanisms
• Improve mapping/partitioning at other levels
  • E.g., location-to-lock mapping
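Why several independent TM mechanisms need something like two-phase commit can be sketched as follows. This is a generic illustration of the protocol, not the implementation from the talk; `Participant` and `two_phase_commit` are hypothetical names, and each participant stands in for one TM mechanism that a transaction has touched.

```python
class Participant:
    """Hypothetical stand-in for one TM mechanism taking part in a commit."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: validate (and lock); report whether a commit is possible.
        if self.can_commit:
            self.state = "prepared"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit on every participant or on none of them."""
    for p in participants:
        if not p.prepare():
            # One mechanism voted no: the whole transaction aborts everywhere.
            for q in participants:
                q.abort()
            return False
    # Phase 2: everyone voted yes, so the commit is made visible everywhere.
    for p in participants:
        p.commit()
    return True
```

Without the prepare phase, one mechanism could commit its part of a transaction while another aborts its part, leaving the transaction half applied.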
Data Partitioning
• Partition memory automatically
• We use Pool Allocation (Lattner et al., PLDI 05)
• Mixed compile-time/runtime technique:
  • Based on pointer analysis for C/C++
  • Nodes in points-to graph become partitions
  • Partitions are instantiated dynamically at runtime and supplied to called functions that use these partitions
  • Memory allocator is not affected
• Implementation extends Tanger (STM compiler)
  • STM load/store functions get pointer to partition
Example: Points-to graph for STAMP’s Vacation
[Figure: partial, simplified DS graph for main(). Annotations: “Type, if known”; “struct has 4 fields, 2 are pointers”; “A Red-Black Tree instance”; “A second Red-Black Tree instance”.]
Conquering …
• Partition types determine STM implementation used per partition (TinySTM):
  • Multiple Locks (general purpose)
  • Single Shared Lock (infrequently updated partitions)
  • Single Exclusive Lock (low concurrency partitions)
  • Read-Only (no concurrency control necessary)
  • Thread-local, transaction-local
• Loads/stores dispatched to type-specific STM functions on each call
• Partition types and parameters can be tuned
  • E.g., read-only partitions get tuned on first write
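The per-call dispatch on partition type can be sketched as a table lookup. This is a simplified model under stated assumptions: all names (`Partition`, `stm_load`, `stm_store`, the dispatch tables) are hypothetical, the actual concurrency control is omitted, and the choice to re-tune a read-only partition to Multiple Locks on its first write is only one possible policy.

```python
from enum import Enum


class PartitionType(Enum):
    MULTIPLE_LOCKS = 1
    SINGLE_SHARED_LOCK = 2
    SINGLE_EXCLUSIVE_LOCK = 3
    READ_ONLY = 4


class Partition:
    def __init__(self, ptype):
        self.ptype = ptype
        self.memory = {}   # address -> value, stands in for real memory
        self.trace = []    # which type-specific path handled each access


def _plain_load(part, addr):
    # Placeholder: a real barrier would validate or lock here.
    return part.memory.get(addr, 0)


def _plain_store(part, addr, value):
    # Placeholder: a real barrier would acquire a lock and log the write.
    part.memory[addr] = value


def _read_only_store(part, addr, value):
    # First write to a read-only partition: re-tune it, then retry the store.
    part.ptype = PartitionType.MULTIPLE_LOCKS
    stm_store(part, addr, value)


LOAD = {t: _plain_load for t in PartitionType}
STORE = {t: _plain_store for t in PartitionType}
STORE[PartitionType.READ_ONLY] = _read_only_store


def stm_load(part, addr):
    part.trace.append(("load", part.ptype.name))
    return LOAD[part.ptype](part, addr)


def stm_store(part, addr, value):
    part.trace.append(("store", part.ptype.name))
    STORE[part.ptype](part, addr, value)
```

The lookup on every load and store is the runtime overhead that the dispatch introduces.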
Performance
[Figure: performance plots. Series: TinySTM w/o partitioning support, 2^20 / 2^24 locks; TinySTM with partitioning, 4 different tuning heuristics.]
• Partitioning decreases false conflicts in lock array. Lock hash function gets a 2nd level at compile time.
• Exclusive Lock is faster than general purpose STM
• Partitioning adds runtime overhead
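One way to read the "2nd level" of the lock hash function is that the partition selects a lock array and the address selects a slot within it. The sketch below models that reading; it is an interpretation, not the TinySTM code, and `lock_for`, `WORD_SHIFT` and the array size are hypothetical.

```python
WORD_SHIFT = 3  # assumption: 8-byte words


def lock_for(partition_id, address, locks_per_partition=16):
    """Two-level mapping: the partition picks the array, the address the slot."""
    slot = (address >> WORD_SHIFT) % locks_per_partition
    return (partition_id, slot)


a, b = 0x1000, 0x1080   # both hash to slot 0 of a 16-entry array

# In the same partition the two words still share a lock ...
same_partition_clash = lock_for(0, a) == lock_for(0, b)
# ... but in different partitions they can never conflict falsely.
cross_partition_clash = lock_for(0, a) == lock_for(1, b)
print(same_partition_clash, cross_partition_clash)
```

Because the partition is known statically at each access site, the first level costs no hashing at runtime in this model.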
Performance (2)
[Figure: performance plots. Annotations:]
• Read-Only partitions during first phase of benchmark
• 2^26 locks! (2^24: livelocks due to false conflicts)
• 5 x 256K locks
Challenges
• Analysis: Calls to libraries?
  • Points-to graphs can probably be attached to libs (local per-function analysis + callgraph)
  • Analysis is bottom-up on call-graph
• TM implementations that don’t support two-phase commit
• Dispatch: Runtime overheads
  • JIT?
  • Size of binaries
• Tuning partitions and partitioning
  • No direct feedback, partitioning results in even more parameters to be tuned
  • Partition selection / merging at compile-time/runtime
Questions?
Tanger + TinySTM + …: http://tinystm.org (send email for version with partitioning support)
Partition Type Performance & Tuning Strategies
• Tuning strategy:
  • Start with read-only type
  • On reaching a certain number of aborts, switch to:
    • Single Exclusive Lock
    • Single Shared Lock
    • Multiple Locks
• Part-1: switch directly to Multiple Locks; Part-4: try other types first (single locks, fewer multiple locks)
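The abort-driven tuning strategy can be sketched as a small state machine per partition. The escalation order, the threshold value, and the names `PartitionTuner`, `DIRECT` and `GRADUAL` are assumptions made for illustration; the slide only states that a partition starts as read-only and switches type after a certain number of aborts.

```python
# Hypothetical escalation ladders, loosely modeled on the two heuristics named.
DIRECT = ["read-only", "multiple-locks"]
GRADUAL = ["read-only", "single-exclusive-lock", "single-shared-lock",
           "multiple-locks"]


class PartitionTuner:
    """Tracks aborts for one partition and escalates its type when needed."""

    def __init__(self, ladder, abort_threshold=3):
        self.ladder = list(ladder)
        self.abort_threshold = abort_threshold
        self.level = 0     # index into the ladder
        self.aborts = 0    # aborts seen since the last type switch

    @property
    def current(self):
        return self.ladder[self.level]

    def record_abort(self):
        """Count one abort; move to the next type once the threshold is hit."""
        self.aborts += 1
        at_top = self.level == len(self.ladder) - 1
        if self.aborts >= self.abort_threshold and not at_top:
            self.level += 1
            self.aborts = 0
        return self.current


tuner = PartitionTuner(GRADUAL, abort_threshold=2)
for _ in range(4):
    tuner.record_abort()
print(tuner.current)
```

A real tuner would also need a way to move back down the ladder, since aborts give no direct feedback on whether the new type is actually faster.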
Analysis
• We use Data Structure Analysis (DSA [1]):
  • Pointer analysis for LLVM compiler framework
  • Creates a points-to graph with Data Structure (DS) nodes
• Context-sensitive:
  • Data structures distinguished based on call graphs
• Field-sensitive:
  • Distinguishes between DS fields
• Unification-based:
  • Pointers target a single node in the points-to graph
  • Information about pointers from different places gets merged
  • If incompatible information, node is collapsed (= “nothing known”)
• Can safely analyze incomplete programs:
  • Calls to external / not analyzed functions have an effect only on the data that escapes into / from these functions (gets marked “External”)
  • Analyzing more code increases analysis precision
[1] Chris Lattner, PhD thesis, 2005
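The unification-based part can be illustrated with a union-find structure: when two pointers may refer to the same memory, their nodes are merged, and a node is collapsed when the merged information disagrees. This is a strongly simplified sketch, not DSA itself; `DSNode`, `find` and `unify` are hypothetical names, and a single type name stands in for the per-field information a real DS node carries.

```python
class DSNode:
    """Simplified points-to graph node with a single type annotation."""

    def __init__(self, type_name):
        self.parent = self        # union-find parent pointer
        self.type_name = type_name
        self.collapsed = False    # True means "nothing known" about layout


def find(node):
    """Return the representative of the node's equivalence class."""
    while node.parent is not node:
        node.parent = node.parent.parent   # path halving
        node = node.parent
    return node


def unify(a, b):
    """Merge two nodes; collapse the result if their information conflicts."""
    ra, rb = find(a), find(b)
    if ra is rb:
        return ra
    rb.parent = ra
    if ra.collapsed or rb.collapsed or ra.type_name != rb.type_name:
        ra.collapsed = True
        ra.type_name = None
    return ra


tree_a = DSNode("rbtree")
tree_b = DSNode("rbtree")
other = DSNode("list")
merged = unify(tree_a, tree_b)   # compatible: stays precise
clash = unify(tree_b, other)     # incompatible: node is collapsed
print(merged.type_name, clash.collapsed)
```

Merging keeps the graph small and the analysis fast, at the cost of precision whenever unrelated structures get unified.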
Analysis (2)
Integration into Tanger compilation process:
• Compile and link program parts into LLVM intermediate representation module
• Analyze module using DSA
  • Local intra-function analysis: per-function DS graph
  • Merge DS graphs bottom-up in callgraph (put callees’ information into callers)
  • Merge DS graphs top-down in callgraph (vice versa)
• Transactify module
  • Use DSA information to decide between object-based / word-based
  • Requirement: If memory chunk (DS node) is object-based, then it must be safe for object-based everywhere in the program
  • DSA can give us this guarantee
• Link in STM library and generate native code
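The bottom-up merge step can be sketched on a toy call graph. This is a rough model under simplifying assumptions: each function's DS graph is reduced to a set of labels, the call graph is assumed to be acyclic (recursion is not handled), and `bottom_up` and the example function names are hypothetical.

```python
def bottom_up(callgraph, local_facts):
    """Fold each callee's summary into its callers.

    callgraph:   function name -> list of callee names (assumed acyclic)
    local_facts: function name -> set of labels from the local analysis
    """
    summary = {}

    def visit(fn):
        if fn in summary:
            return summary[fn]
        facts = set(local_facts.get(fn, ()))
        summary[fn] = facts   # registered early; filled in while visiting
        for callee in callgraph.get(fn, ()):
            facts |= visit(callee)   # callee information flows to the caller
        return facts

    for fn in callgraph:
        visit(fn)
    return summary


callgraph = {
    "main": ["insert", "lookup"],
    "insert": ["rotate"],
    "lookup": [],
    "rotate": [],
}
local_facts = {
    "main": {"manager"},
    "insert": {"tree"},
    "lookup": {"tree"},
    "rotate": {"node"},
}
result = bottom_up(callgraph, local_facts)
print(sorted(result["main"]))
```

The top-down pass would run in the opposite direction, pushing information from callers into callees.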