350 likes | 470 Views
How In-Memory Affects Database Design. Louis Davidson Certified Nerd. Who am I?. Been in IT for over 19 years Microsoft MVP For 10 Years Corporate Data Architect Written five books on database design
E N D
How In-Memory Affects Database Design Louis Davidson Certified Nerd
Who am I? • Been in IT for over 19 years • Microsoft MVP For 10 Years • Corporate Data Architect • Written five books on database design • Ok, so they were all versions of the same book. They at least had slightly different titles each time • The book doesn’t yet cover this topic
Attention: There Is Homework(lots of it) • I can’t teach you everything about In-Memory in 1 hour • The code will be available, but it is still very rudimentary • It will get you started, but is only just the tip of the iceberg • Do lots of thinkin’ and testin’ before divin’ in
Introduction: What exactly is In-Memory OLTP in SQL Server 2014? • A totally new, revamped engine for data storage, co-located in the same database with the existing engine • Obviously Enterprise Only… • Purpose built for certain scenarios • Terminology can be confusing • Existing tables: Home - On-Disk, but ideally cached In-Memory • In-Memory tables: Home - In-Memory: but backed up by On-Disk Structures • If you have enough RAM, On-Disk tables are also in memory • But the implementation is very very different • In-Memory is both very easy, and very difficult to use
Design Basics (And no, I am not stalling for time due to lack of material) • Designing and Coding is Like the Chicken and the Egg • Design is what you do before coding • Coding patterns can greatly affect design • Engine implementation can greatly affect design and coding patterns • We will discuss how In-Memory technologies affect the entire development lifecycle I was first As if… Children Relics
Design Basics - Separate your design mind into three phases • Logical (Overall data requirements in a data model format) • Physical ImplementationChoice(Indexes, Physical Structures, etc) • Physical (Relational Code) • Before the engine choice I always suggested 3 before 2 • We will look at each of these phases and how in-mem may affect your design
Logical Design (Though Not Everyone’s Is) • This is the easiest part of the presentation • You still need to model • Entities and Attributes • Uniqueness Conditions • General Predicates • As I see it, nothing changes…
Physical Implementation (Or DBA stuff that I only slightly care about) • Everything is different, and I am not here to cover these details… • In-Mem data structures coexist in the database alongside On-Disk ones • Data is housed in RAM, and backed up in Delta Files and Transaction Logs • Delta files are stored as filestream storage • The transaction log is the same one as you are used to • Tables and Indexes are extremely coupled • MVCC (Multi-Valued Concurrency Control) used for all isolation
Physical Implementation Overview Client App TDS Handler and Session Management SQL Server.exe Key No improvements in communication stack, parameter passing, result set generation Existing SQL Component Proc/Plan cache for ad-hoc T-SQL and SPs Parser, Catalog, Algebrizer, Optimizer Hekaton Compiler Hekaton Component Natively Compiled SPs and Schema Interpreter for TSQL, query plans, expressions 10-30x more efficient Generated .dll Access Methods Hekaton Engine for Memory_optimized Tables & Indexes Buffer Pool for Tables & Indexes Reduced log bandwidth & contention. Log latency remains Query Interop Memory-optimized Table Filegroup Transaction Log Data Filegroup Checkpoints are background sequential IO
Physical Design (No, let’s not get physical) • Your physical design will almost certainly need to be affected • So much changes, even just changing the table structure • In this section, we will discuss: • Creating storage objects • Table creation • Index creation (which is technically part of the table creation) • Altering a Table’s Structure • Accessing (Modifying/Creating) data • No Locks, No Latches, No Waiting • Using Normal T-SQL (Interop) • Using Compiled Code (Native)
Creating Storage Objects - Tables • The syntax is the same as on-disk, with a few additional settings • You have a durability choices • In-Mem Table: Schema_Only or Schema_and_Data • Database level for transactions: Delayed (also for on-disk tables) • Aaron Bertrand has a great article on this here: http://sqlperformance.com/2014/04/io-subsystem/delayed-durability-in-sql-server-2014 • You also have less to work with... • Rowsize limited to 8060 bytes (Enforced at Create Time) • Not all datatypes allowed (LOB types,CLR,sql_variant, datetimeoffset, rowversion) • No foreign keys • No check constraints • Limited unique constraints (just one unique index per table) • Every durable (Schema_and_Data) table must have a primary key • Note: There are memory optimized temporary tables too: See Kendra Little’s article here: http://www.brentozar.com/archive/2014/04/table-variables-good-temp-tables-sql-2014/
Dealing with Un-Supported Datatypes… • Say you have a table with 10 columns, but 1 is not allowed in a In-Memory table • First: Ask yourself if the table really fits the criteria we aren’t done covering • Second: If so, consider vertically partitioning • CREATE TABLE In_Mem (KeyValue, Column1, Column2, Column3)CREATE TABLE On_Disk (KeyValue, Column4) • It is likely that uses of disallowed types wouldn’t be good for the OLTP aspects of the table in any case.
Creating Storage Objects - Index creation • Indexes are linked directly to the table • 8 indexes max per table due to internals • Only one unique index allowed • Indexes are never persisted, but are rebuilt on restart • Two Index Types • Hash • Ideal for single row lookups • Fixed size, you choose the number of hash buckets (approx 1-2 * # of unique values http://msdn.microsoft.com/en-us/library/dn494956.aspx) • Bw Tree • Best for range searches • Very similar to a BTree index as you (hopefully) know it, but optimized for MVCC and pointer connection to table • String index columns must be a binary collation (case AND access sensitive) • For more in-depth coverage • check Kalen Delaney's white paper ... http://t.co/T6zToWc6y6 • Or for an even deeper (nerdier?) versions: “Hekaton: SQL Server’s Memory-Optimized OLTP Engine” http://research.microsoft.com/apps/pubs/default.aspx?id=193594 or The Bw-Tree: A B-tree for New Hardware Platforms (http://research.microsoft.com/pubs/178758/bw-tree-icde2013-final.pdf)
A Taste of the Physical Structures • A table with two hash indexes • From Kalen’s Whitepaper: http://t.co/T6zToWc6y6
Creating Storage Objects - Altering a Table • The is the second easiest slide in the deck • No alterations allowed - Strictly Drop and Recreate • You can rename a table, which makes this at east easier ALTER
Accessing the Data - No Locks, No Latches, No Waiting • On-Disk Structures use Latches and Locks to implement isolation • In-Mem Use Optimistic-MVCC • You have 3 Isolation Levels: • SNAPSHOT, REPEATABLE READ, SERIALIZABLE • Evaluated before, or when the transaction is committed • This makes data integrity checking "interesting" • Essential difference, your code now must handle errors
Concurrency is the #1 difference you will deal with • Scenario1: 2 Connections - Update Every Row In 1 Million Rows • Any Isolation Level • On-Disk • Either: 1 connection blocks the other • Or: Deadlock • In-Mem • One connection will fail, saying: “the row you are trying to update has been updated since this transaction started” EVEN if it never commits.
Another slide on Concurrency(Because if I had presented it concurrently with the other one, you wouldn’t have liked that) • Scenario2: 1 Connection Updates All Rows, Another Reads All Rows (In an explicit transaction) • On-Disk • Either: 1 connection blocks the other • Or: Deadlock • In-Mem • Both Queries Execute Immediately • In SNAPSHOT ISOLATION the reader will always succeed • In REPEATABLE READ or SERIALIZABLE • Commits transaction BEFORE updater commits: Success • Commits transaction AFTER updater commits: Fails
Accessing the Data - Using Normal T-SQL (Interop) • Using typical interpreted T-SQL • Most T-SQL will work with no change (you may need to add isolation level hints) • A few Exceptions • TRUNCATE TABLE - This one is really annoying :) • MERGE (In-Mem table cannot be the target) • Cross Database Transactions (other than tempdb) • Locking Hints
Accessing the Data using Compiled Code (Native) • Instead of being interpreted, the stored procedure is compiled to machine code • Limited synax (Like programming with both hands tied behind your back) • Allowed syntax is listed in what is available, not what isn't • http://msdn.microsoft.com/en-us/library/dn452279.aspx • Some really extremely annoying ones: • SUBSTRING supported; LEFT, RIGHT, not so much • No Subqueries • OR, NOT, IN, not supported in WHERE clause • So you may have to write some "interesting" code
The Difficulty of Data Integrity • With on-disk structures, we used constraints for most issues (Uniqueness, Foreign Key, Simple Predicates) • With in-memory code, we have to implement in stored procedure • Uniqueness on > 1 column set suffers from timing (If N connections are inserting the same data...MVCC will let them) • Foreign Key can't reliably be done because: • In Snapshot Isolation Level, the row may have been deleted while you check • In Higher Levels, the transaction will fail if the row has been updated • Check constraint style work can be done in stored procedures for the most part.
Problem: How to Implement Uniqueness on > 1 Column Set: INDEXED VIEW? • CREATE VIEW Customers.Customers$UniquenessEnforcementWITH SCHEMABINDINGASSELECT customerId, emailAddress, customerNumberFROM customers.CustomerGO • CREATE UNIQUE CLUSTERED INDEX emailAddress ON Customers.Customers$UniquenessEnforcement (emailAddress)GO • Msg 10794, Level 16, State 12, Line 8The operation 'CREATE INDEX' is not supported with memory optimized tables.
Problem: How to Implement Uniqueness on > 1 Column Set: Multiple Tables? • Wow, that seems messy… And what about duplicate customerId values in the two subordinate tables?
Problem: How to Implement Uniqueness on > 1 Column Set: Simple code • You can’t…exactly. But what if EVERY caller has to go through the following block: • DECLARE @CustomerIdINT SELECT @CustomerId = CustomerIdFROM Customers.CustomerWHERE EmailAddress = @EmailAddressIF @customerId is null… Do your insert • This will stop MOST duplication, but not all. Two inserters can check at the same time, and with no blocks, app locks, or constraints even available, you may get duplicates. • Remember the term: Optimistic Concurrency Control
When Should You Make Tables In-Memory - Microsoft's Advice • From http://msdn.microsoft.com/en-us/library/dn133186.aspx
When Should You Make Tables In-Memory Louis's Advice • More or less the same as Microsoft's really (duh!) • Things to factor in • High concurrency needs/Low chance of collisions • Minimal uniqueness protection requirements • Minimal data integrity concerns (minimal key update/deletes) • Limited searching of data (binary comparisons only) • Limited need for transaction isolation/Short transactions • Basically, the “hot” tables in a strict OLTP workloads...
The Choices I made Louis has improved his methods for estimating performace, but your mileage will still vary. Louis’ tests are designed to reflect only one certain usage conditions and user behavior, but several factors may affect your mileage significantly: How & Where You Put Your Logs Computer Condition & Maintenance CPU Variations Programmer Coding Variations Hard Disk Break In Therefore, Louis’ performance ratings are a minimally useful tool for comparing the performance of different strategies but may not accurately predict the average performance youwill get. I seriously suggest you test the heck out of the technologies yourself using my code, your code, and anyone else’s code you can to make sure you are getting the best performance possible.
The Grand Illusion (So you think your life is complete confusion) • Performance gains are not exactly what you may expect, even when they are massive • In my examples (which you have seen), I discovered when loading 20000 rows (10 connections of 2000 each) • (Captured using Adam Machanic'shttp://www.datamanipulation.net/SQLQueryStress/ tool) • On-Disk Tables with FK, Instead Of Trigger - 0.0641 seconds per row - Total Time – 3:55 • On-Disk Tables withOUT FK, Instead Of Trigger - 0.0131 seconds per row - Total Time – 2:44 • In-Mem Tables using Interop code - 0.0091 seconds per row - Total Time 2:31 • In-Mem Tables with Native Code - 0.0035 second per row - Total Time – 1:23 • In-Mem Tables, Native Code, SCHEMA_ONLY – 0.0003 seconds per row - Total Time – 1:00 • But should it be a lot better? Don't forget the overhead... (And SQLQueryStress has extra for gathering stats)
Demo Code Review As We have time