Databases vs. Programmers Dmitri Gavrilov, Ph.D. Principal Engineering Manager, Exchange, Microsoft
Why do I need to know this? • Would you ever need to code a B-tree? • If you do, you are wasting your employer’s money • Unless you work for Oracle, IBM (DB2), SAP (Sybase) or Microsoft (SQL Server) • Why is it then important to know how B-trees work? • I’ll try to show you.
You are Hired! • You've just been hired as a database developer by the UofC IT department (ACS? UCS?) • Your first project is fixing the student database. • What's wrong with the current implementation? • It's too slow!
What have you got? • Self-service website for students to enter their personal data: • Name, email, phone number • ID, year, department • Admin website: • Create/edit/delete students • Bulk upload data • Generate reports • The backend DB is MS Access 2007
What happened? • Over the summer, a local IT shop named QuickySoft was hired to develop the site. • The project went well, stayed almost on schedule, tests went fine, everybody was happy. • On Sept 6, the site melted down. • It took over a minute for a student to log on to the site • When students attempted to update their data, the page timed out • The admin could not upload new data (timeouts) • Project manager got fired, and you got hired.
What do you do? • Well, that's clear. The database was crappy. • You replace MS Access with a real database • Say, MS SQL Server. • Your first task? Database schema design!
DB schema design • That's what DBAs get paid huge money for • Table: normalized dataset • ID, name, email, phone, department, year • Primary key: defines the physical ordering of the dataset • ID seems to be a good primary key • We'll discuss how to select a primary key later • Index: optimizes searching and sorting • Name, email, more? • Tables and indexes are B-trees
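A minimal sketch of this schema, using Python's built-in sqlite3 in place of SQL Server (the table and index names are illustrative, not from the original project):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE students (
        id         INTEGER PRIMARY KEY,  -- clustering key: rows stored in id order
        name       TEXT NOT NULL,
        email      TEXT,
        phone      TEXT,
        department TEXT,
        year       INTEGER
    )
""")
# Secondary indexes (separate B-trees) to speed up lookups and sorted scans:
conn.execute("CREATE INDEX idx_students_name  ON students (name)")
conn.execute("CREATE INDEX idx_students_email ON students (email)")
```

Each index is its own B-tree keyed on the indexed column, pointing back into the table, which is itself a B-tree keyed on the primary key.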
What makes a real database slow? • Lack of a required index • QuickySoft gurus forgot to add an index on name. • Every lookup by name ended up scanning the whole table of students. • That was too easy… But the DB is still slow. • IOPS is king • Disk is the slowest computer component • Query speed is proportional to the number of IO operations it involves • Caching DB pages in memory helps, but not for large DBs
Improving IOPS • Avoid going to disk, use the DB cache. Buy more RAM. • Too easy. Try harder. • Reduce the number of times we have to move the HD head • Reduce the distance by which the HD head is moved • Improve locality of reference!
Locality of Reference • QuickySoft guys wanted to make the DB more compact (for better caching!) • There's a lot of repetition among email suffixes, so they decided to normalize them out. • Email suffixes are stored in a separate table: gmail.com, ucalgary.ca, yahoo.com, hotmail.com, etc… • For each user, store just the username part (marina), plus the ID of the suffix (25 = cpsc.ucalgary.ca) • Now, each READ operation needs to get data from two tables (students + suffixes). • Two disk reads instead of one.
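The extra hop is easy to see in miniature (sqlite3 again; suffix ID 25 and "marina" come from the slide, everything else is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE suffixes (id INTEGER PRIMARY KEY, suffix TEXT);
    CREATE TABLE students (id INTEGER PRIMARY KEY, username TEXT, suffix_id INTEGER);
    INSERT INTO suffixes VALUES (25, 'cpsc.ucalgary.ca');
    INSERT INTO students VALUES (1, 'marina', 25);
""")
# Reconstructing the full email now touches two B-trees
# (two disk reads in the worst case) instead of one:
row = conn.execute("""
    SELECT s.username || '@' || x.suffix
    FROM students s JOIN suffixes x ON s.suffix_id = x.id
    WHERE s.id = 1
""").fetchone()
print(row[0])   # marina@cpsc.ucalgary.ca
```

The space saving is real, but every student read now pays for a second lookup in an unrelated part of the disk.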
Locality of Reference • Site admin complained that generating reports is super slow. Why? • Her reports are sorted by name. • We have an index on name, so we just scan the index and output the records. All’s well? • Data in the table is ordered by ID (our primary key) • When generating the report, we incur N disk seeks (N is the number of students) • Let’s select name as the primary key! • Reports are now lightning fast, since we are just reading the file from start to end!
Oops? • You get a call at 11pm. • Admin is unable to upload a new batch of students. It's just too slow, the page is timing out. • Why? The input data arrives in random name order. You lost locality of reference. You have to insert data into random places in the table, causing repeated page splits and B-tree rebalancing. Which means many random disk seeks. • In the past, you were always appending data to the end of the table, since you assigned auto-incremented IDs to new students.
Trade-off Time • There's no perfect software. • You will always have to trade off: • Space vs. speed • Users vs. admins • Flexibility vs. simplicity (power user vs. your mom) • Features vs. deadlines • Security vs. usability • Quality vs. deadlines • Should we optimize reporting or upload? • Reporting usually loses • Let's reinstate ID as our primary key.
It's Still Broken • You get a call at 1am two weeks later • We are running out of disk space! • But weren't your projections right? • You measured the average size of a student record: ~200 KB (including photo) • 1000 students should grow the DB by 200 MB • In reality, 1000 students grow the DB by 400 MB. • We'll run out of disk space tomorrow if you don't do something!
What’s going on? • DB is growing because we are inserting new student records • We always append to the end of the table (in ID order) • The last B-tree page fills up • We split it in half • We continue adding to the last page • All pages are half-full!
What do you do? • You have to change the page-split strategy to something like 80/20. • In real DBs, that tunable is called "index density" or "fill factor" • You optimized for sequential inserts, but you hurt random inserts. Another trade-off. Ah well.
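A toy model makes the half-full-pages problem concrete. The page capacity and split ratios below are assumptions for illustration, not real engine parameters:

```python
PAGE_CAP = 100   # entries per B-tree leaf page (assumed)

def avg_fill(n_inserts, keep_ratio):
    """Append n_inserts keys in increasing order; when the last page
    overflows, keep keep_ratio of its entries and move the rest to a
    fresh page. Returns the average page fill factor."""
    pages = [0]
    for _ in range(n_inserts):
        if pages[-1] == PAGE_CAP:                      # last page full: split it
            moved = PAGE_CAP - int(PAGE_CAP * keep_ratio)
            pages[-1] -= moved
            pages.append(moved)
        pages[-1] += 1
    return sum(pages) / (len(pages) * PAGE_CAP)

print(avg_fill(10_000, 0.5))   # 50/50 splits: pages end up roughly half full
print(avg_fill(10_000, 0.8))   # 80/20 splits: roughly 80% full
```

With 50/50 splits, every page that fills up during a sequential load is immediately cut in half and never touched again; keeping 80% per split nearly doubles the space efficiency, at the cost of more frequent splits when later inserts land on those fuller pages.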
Data Migration • Ok, you've developed your new database and you are proud of it. • You need to migrate the existing data from the old Access DB to your shiny new SQL DB. • You want to minimize downtime • You'll be taking the site down to migrate the data
How can you migrate faster? • You have 17 indexes on your table • Name, phone, year, dept, … • Should you create the indexes before populating the table, or after? • A good DB engine can usually sort N elements faster than it can do N inserts. • But creating 17 indexes afterwards means 17 scans over the table. • Does the DB engine support batch index creation (a single table scan)? Some do. • Do you trust the input data? Do you need to verify key uniqueness? If so, you need the indexes in place before inserting data. • Trade-offs again…
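A quick way to feel the index-before vs. index-after trade-off, again with sqlite3 (row counts and names are arbitrary; real engines differ, but the shape of the comparison is the same):

```python
import random
import sqlite3
import string
import time

random.seed(42)
rows = [(i, "".join(random.choices(string.ascii_lowercase, k=10)))
        for i in range(50_000)]

def load(index_first):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
    if index_first:
        # every insert also updates the name B-tree, in random key order
        conn.execute("CREATE INDEX idx_name ON students (name)")
    start = time.perf_counter()
    conn.executemany("INSERT INTO students VALUES (?, ?)", rows)
    if not index_first:
        # one sorting pass over the finished table instead of 50k B-tree inserts
        conn.execute("CREATE INDEX idx_name ON students (name)")
    conn.commit()
    return time.perf_counter() - start

print("index before load:", load(True))
print("index after load :", load(False))
```

With 17 indexes instead of one, the difference compounds: either 17 B-tree updates per inserted row, or 17 sort-and-build passes at the end (or one combined pass, if the engine supports batch index creation).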
Many more interesting topics • Transactions support • Multithreading/Locking/Serialization • Query optimization • Crash resilience/data consistency • DB cache strategies • Distributed databases • …