Learn how to identify slow queries, optimize indexing techniques, and design proper schemas for improved MySQL performance. Discover tips on using EXPLAIN, rewriting subqueries, and avoiding common mistakes.
Sarah Sproehnle • Cloudera, Inc • sarah@cloudera.com
Outline • Identifying slow queries • Indexing techniques • Proper schema design • EXPLAIN • Rewriting subqueries • Common mistakes
Identifying slow queries
• Use the slow query log! --log-slow-queries, --long-query-time, --log-queries-not-using-indexes
• 5.1.21+ allows microsecond granularity
• In 5.1 you can enable it at runtime: SET GLOBAL slow_query_log = 1;
• Do not use log_output = TABLE
• mysqldumpslow summarizes the log
• Monitor SHOW PROCESSLIST
• mytop, innotop, Enterprise Monitor (the query analyzer is great, but adds latency; use it selectively)
• Ask around
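A minimal sketch of enabling the slow query log at runtime on 5.1+; the file path and the one-second threshold are illustrative choices, not from the slides:

  -- Log statements slower than 1 second, without a server restart
  SET GLOBAL slow_query_log = 1;
  SET GLOBAL long_query_time = 1;
  SET GLOBAL slow_query_log_file = '/var/log/mysql/mysql-slow.log';
  -- Summarize the log from the shell, slowest queries first:
  --   mysqldumpslow -s t /var/log/mysql/mysql-slow.log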
Indexing
• Most indexes are B-trees
• A B-tree index is great for:
  WHERE col = x
  WHERE col > x
  WHERE col IS NULL
  WHERE col LIKE 'foo%'
  ORDER BY col [DESC] LIMIT n
• Not useful for:
  WHERE function(col) = x
  WHERE col LIKE '%foo'
  ORDER BY col  -- without a LIMIT
• Covering indexes
  Query: SELECT a FROM t WHERE b = 'foo' ORDER BY c;
  Index: KEY(b, c, a)
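A sketch of the covering-index case above; only the names t, a, b, c come from the slide, while the column types and engine are illustrative:

  -- KEY(b, c, a) lets this query be answered entirely from the index:
  -- b filters, c satisfies the ORDER BY, and a is the selected column,
  -- so no lookups into the table rows are needed ("Using index" in EXPLAIN).
  CREATE TABLE t (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    a INT NOT NULL,
    b VARCHAR(20) NOT NULL,
    c INT NOT NULL,
    KEY covering_idx (b, c, a)
  ) ENGINE=InnoDB;

  SELECT a FROM t WHERE b = 'foo' ORDER BY c;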
Indexing continued
• Creating the index can be painful (consider the InnoDB plugin's fast index creation, or building it on a replica and switching over)
• The order of columns in a composite index matters! KEY (a, b) will be used for:
  WHERE a = x
  WHERE a = x AND b = x
  WHERE a = x ORDER BY b
  SELECT b ... WHERE a = x (covering)
  but not for:
  WHERE b = x
• Hash indexes for Memory, NDB Cluster and InnoDB (adaptive): fast and compact, but only useful for equality lookups
• For InnoDB, do not append the PK to a secondary index; it is already stored there implicitly
• Watch out for duplicate indexes (mk-duplicate-key-checker)
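A quick way to see the leftmost-prefix rule in action, on a hypothetical table t2 (the name, types, and engine are illustrative, not from the slides):

  CREATE TABLE t2 (
    a INT NOT NULL,
    b INT NOT NULL,
    c VARCHAR(20),
    KEY ab_idx (a, b)
  ) ENGINE=InnoDB;

  -- These can use ab_idx, because the leading column a is constrained:
  EXPLAIN SELECT * FROM t2 WHERE a = 1 AND b = 2;
  EXPLAIN SELECT * FROM t2 WHERE a = 1 ORDER BY b;
  -- This cannot use ab_idx (a is not constrained), so expect a full scan
  -- unless a separate index on b exists:
  EXPLAIN SELECT * FROM t2 WHERE b = 2;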
Schema design
• Normalize or denormalize?
• Choose NOT NULL if possible
• Choose good primary keys (keep them small!) They are often used as foreign keys, and InnoDB uses the primary key as a row id
• Keep your data small. Use the right data types:
  SMALLINT vs. INT vs. BIGINT
  CHAR vs. VARCHAR
  TIMESTAMP (4 bytes) vs. DATETIME (8 bytes)
  Store IP addresses as INT UNSIGNED (INET_ATON)
  Use PROCEDURE ANALYSE()
• Index a prefix of a string column: KEY(col(5))
• Use RANGE partitioning, but be careful which functions you use in the partitioning expression
• Use replication to split reads from writes
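A sketch of the IP-address tip, using the built-in INET_ATON()/INET_NTOA() functions; the table itself is illustrative, not from the slides:

  -- 4 bytes per address instead of a VARCHAR, and still indexable
  CREATE TABLE access_log (
    ip INT UNSIGNED NOT NULL,
    hit_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    KEY (ip)
  ) ENGINE=InnoDB;

  INSERT INTO access_log (ip) VALUES (INET_ATON('192.168.1.10'));
  SELECT INET_NTOA(ip), hit_time FROM access_log
  WHERE ip = INET_ATON('192.168.1.10');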
EXPLAIN
• EXPLAIN SELECT ... unfortunately does not work on UPDATE/DELETE
• It is important to run EXPLAIN against your true data set; the optimizer's choices depend on the data
• Useful for verifying:
  Is an appropriate index being used?
  What order are the tables joined in? This is critical given MySQL's nested-loop join algorithm.
  Is a temporary table required? ("Using temporary")
  Is the index covering, or are seeks to the row(s) needed? ("Using index")
• Example: even a query that looks innocuous, such as SELECT ... WHERE idx IN (42, 101, 1024), can reveal interesting results under EXPLAIN!
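As a small illustration, here is how one might check the covering-index query from the earlier sketch; the values in the comments are what typical output would look like, not captured from a real run:

  EXPLAIN SELECT a FROM t WHERE b = 'foo' ORDER BY c\G
  -- Expect roughly: key: covering_idx, Extra: Using where; Using index
  -- If key is NULL, or Extra shows "Using filesort" or "Using temporary",
  -- the index or the query likely needs rework.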
Rewriting subqueries
• mysql> EXPLAIN SELECT name FROM Country WHERE code IN (SELECT countrycode FROM City WHERE population > 9000000)\G
  *************************** 1. row ***************************
    select_type: PRIMARY
          table: Country
           type: ALL
  possible_keys: NULL
            key: NULL
        key_len: NULL
            ref: NULL
           rows: 239
          Extra: Using where
  *************************** 2. row ***************************
    select_type: DEPENDENT SUBQUERY   (this query should not be correlated!)
          table: City
           type: ALL
  possible_keys: NULL
            key: NULL
        key_len: NULL
            ref: NULL
           rows: 4079
          Extra: Using where
Rewriting subqueries
• Instead of "WHERE col IN (SELECT ...)", write a JOIN:
  SELECT DISTINCT Country.name FROM Country JOIN City ON code = countrycode WHERE City.population > 9000000;
• In general:
  SELECT col FROM t1 WHERE id_col IN (SELECT id_col2 FROM t2 WHERE condition);
  can be rewritten as follows:
  SELECT DISTINCT col FROM t1, t2 WHERE t1.id_col = t2.id_col2 AND condition;
• Fixed in 5.4!
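To confirm the rewrite helps, one might EXPLAIN the join form against the standard world sample database (the capitalized column names follow that schema; the comment describes the expected plan, not captured output):

  EXPLAIN SELECT DISTINCT Country.Name
  FROM Country
  JOIN City ON City.CountryCode = Country.Code
  WHERE City.Population > 9000000\G
  -- The join can use the usual index on City.CountryCode instead of
  -- re-running a correlated subquery for every Country row.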
Common mistakes / General advice
• Need a random row? Do not use ORDER BY RAND() LIMIT 1! Consider generating a random number in the application and doing a lookup by the auto_increment column.
• Avoid hints (STRAIGHT_JOIN, FORCE INDEX), but use them when necessary, especially pre-5.1
• Don't set sort_buffer_size extremely large
• Query cache: 256MB max. Be careful when benchmarking (use SQL_NO_CACHE)
• memcached
• Move long-running queries (e.g., reporting queries) to a slave
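A random-row sketch along those lines; the table name is illustrative, and a reasonably dense auto_increment id column is assumed:

  -- Avoids ORDER BY RAND(), which sorts the entire table to return one row
  SELECT @max_id := MAX(id) FROM products;
  SET @rand_id = FLOOR(1 + RAND() * @max_id);
  SELECT * FROM products WHERE id >= @rand_id ORDER BY id LIMIT 1;
  -- The >= plus LIMIT 1 tolerates gaps in the id sequence, at the cost of
  -- slightly favoring rows that follow a gap.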
More general advice
• Consider other storage engines
• Materialize data into a Memory or MyISAM table and keep it fresh with triggers or an event; use --init-file to repopulate Memory tables at startup
• LIMIT for paging is not always suitable: app servers generally do not scale with large result sets, but evaluating LIMIT n,m over and over is painful
• For batch processing, consider Hadoop:
  It can handle very large datasets
  Sqoop moves data from MySQL to Hadoop (and back)
  Built-in reliability and scalability
  Don't think in MapReduce? Use Hive
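A sketch of the materialize-with-an-event idea, against the world sample schema used earlier; the summary table, refresh interval, and engine are illustrative, and the event scheduler must be enabled (event_scheduler = ON):

  CREATE TABLE city_counts (
    countrycode CHAR(3) NOT NULL PRIMARY KEY,
    num_cities INT NOT NULL
  ) ENGINE=MyISAM;

  -- Refresh the summary every 10 minutes instead of aggregating per query
  CREATE EVENT refresh_city_counts
    ON SCHEDULE EVERY 10 MINUTE
    DO
      REPLACE INTO city_counts
      SELECT CountryCode, COUNT(*) FROM City GROUP BY CountryCode;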
Thanks for attending! • Sarah Sproehnle • Cloudera, Inc • sarah@cloudera.com