370 likes | 380 Views
Understand the impact of hardware on query performance and learn about SQL optimization techniques. Find out how database design influences query execution in PostgreSQL. Improve performance with efficient querying strategies.
E N D
Database Systems – SQL SQL OPTIMIZATION Writing efficient queries requires understanding what effects the performance of your query. In general hardware can have a profound effect on your performance. There are some basic physical constraints that databases must work around. Especially with non-solid state memory. Solid state memory changes the stats below and are constantly rewriting the specifications. Disk Seeks – Average 10 ms, aka 100 seeks a second. Only real optimization is to distribute data across multiple disks. Seek time per table is hard to improve. Disk Reading & Writing – 10 to 20 MB per second, best way to optimize is to distribute data across multiple disks Disk Spindles – The greater the number of spindles the greater the opportunity for a database to read/write in parallel. CPU Cycles – Data must be processed. Smaller data fits in memory and thus faster to process.
Database Systems – SQL SQL OPTIMIZATION A bigger issue is your data base design and how your query executes. Let’s start by understanding how Postgres executes a query. Transmitting the SQL string to the database backend. Parsing the query string Planning of the query to optimize the retrieval of data Retrieval of data from the hardware Transmission of the data to the client. http://www.revsys.com/writings/postgresql-performance.html, Note there are additional comments here about general DB tuning that are skipped as we do not have control over our database environment.
Database Systems – SQL SQL OPTIMIZATION Transmitting the SQL string to the database backend. This is not a particularly lengthy step. It requires the actual characters you type to be transferred to the database. If a query is exceptionally long it may be placed in a stored procedure to save the transmission time. Parsing the query string The text you typed must be broken into tokens. Again, if this takes longer than desired, a stored procedure may be used to save the parsing time. Planning of the query to optimize the retrieval of data If your query is already prepared it can save a lot of time. Otherwise it must determine the best way to execute the query. Should it use an index (s) or could a hash join be more efficient.
Database Systems – SQL SQL OPTIMIZATION Retrieval of data from the hardware There isn’t much you can do here other than improving your hardware. Transmission of the data to the client. There isn’t much you can do here other than to minimize the number of columns returned in the result set as well as the number of rows.
Database Systems – SQL OPTIMIZATION The more extensive your permissions are, the less optimized your database will be. Table level, column level permissions, resource counting, etc can be problematic if you have a large number of statements being executed. However, if you have a few very queries that execute over a large amount of data, this factor may not be as significant. USING EXPLAIN Using EXPLAIN helps you understand how your query executes. It informs you what order tables are joined and what indexes are used to join them. If you notice joins on unindexed fields, you can index them to improve performance. To execute an EXPLAIN simply type: EXPLAIN SqlQuery;
Database Systems – SQL OPTIMIZATION Suppose that you have the SELECT statement shown here and that you plan to examine it using EXPLAIN ANALYZE: CREATE TABLE authors ( id int4 PRIMARY KEY, name varchar ); CREATE TABLE books ( id int4 PRIMARY KEY, author_id int4, title varchar );
Database Systems – SQL OPTIMIZATION Try analyzing the following query: EXPLAIN ANALYZE SELECT authors.name, books.title FROM books, authors WHERE books.author_id=16 and authors.id = books.author_id ORDER BY books.title; We get: QUERY PLAN ------------------------------------------------------------------------------------------------- Sort (cost=29.71..29.73 rows=6 width=64) (actual time=0.189..16.233 rows=7 loops=1) Sort Key: books.title -> Nested Loop (cost=0.00..29.63 rows=6 width=64) (actual time=0.068..0.129 rows=7 loops=1) -> Index Scan using authors_pkey on authors (cost=0.00..5.82 rows=1 width=36) (actual time=0.029..0.033 rows=1 loops=1) Index Cond: (id = 16) -> Seq Scan on books (cost=0.00..23.75 rows=6 width=36) (actual time=0.026..0.052 rows=7 loops=1) Filter: (author_id = 16) Total runtime: 16.386 ms
Database Systems – SQL OPTIMIZATION Read from the bottom up We first see the complete query time: Total runtime: 16.386 ms Postgres then performs a sequential scan on the books table filtering the rows that have an author_id of 16. -> Seq Scan on books (cost=0.00..23.75 rows=6 width=36) (actual time=0.026..0.052 rows=7 loops=1) Filter: (author_id = 16) Although no explicit index is placed on the authors table, an implicit one exists due to the primary key. Therefore postgres utilizes it to select the authors whose keys are equal to 16 -> Index Scan using authors_pkey on authors (cost=0.00..5.82 rows=1 width=36) (actual time=0.029..0.033 rows=1 loops=1) Index Cond: (id = 16)
Database Systems – SQL OPTIMIZATION The final results are sorted by the book’s title: Sort (cost=29.71..29.73 rows=6 width=64) (actual time=0.189..16.233 rows=7 loops=1) Sort Key: books.title -> Nested Loop (cost=0.00..29.63 rows=6 width=64) (actual time=0.068..0.129 rows=7 loops=1) Note the actual and estimated times are listing in parenthesis. Let’s add an index CREATE INDEX books_idx1 on books(author_id); If you rerun the EXPLAIN query would you expect the performance to increase?
Database Systems – SQL OPTIMIZATION The final results are sorted by the book’s title: Sort (cost=29.71..29.73 rows=6 width=64) (actual time=0.189..16.233 rows=7 loops=1) Sort Key: books.title -> Nested Loop (cost=0.00..29.63 rows=6 width=64) (actual time=0.068..0.129 rows=7 loops=1) Note the actual and estimated times are listing in parenthesis. Let’s add an index CREATE INDEX books_idx1 on books(author_id); If you rerun the EXPLAIN query would you expect the performance to increase? It will not, until you run: ANALYZE books; However, you are still not ensured the index will be used. If there are a small number of records in books, it still may perform a sequential scan.
Database Systems – SQL OPTIMIZATION What are other considerations why an index might not be used in Postgres? Problem: The planner has decided its faster to do a table scan than an index scan : This can happen if a) your table is relatively small, or the field you are indexing has a lot of duplicates. Solution: Case in point, boolean fields are not terribly useful to index since 50% of your data is one thing and 50% is another. However they are good candidates to use for Partial indexes e.g. to only index data that is active. Problem: You set up an index that is incompatible with how you are actually filtering a field. There are a couple of variants of this situation. The old LIKE '%me' will never use an index, but LIKE 'me%' can possibly use an index. The upper lower trap - you defined your index like: CREATE INDEX idx_faults_name ON faults USING btree(fault_name);, But you are running a query like this: SELECT * FROM faults where UPPER(fault_name) LIKE 'CAR%' Possible fix: CREATE INDEX idx_faults_name ON faults USING btree(upper(fault_name)); http://www.postgresonline.com/journal/archives/78-Why-is-my-index-not-being-used.html
Database Systems – SQL OPTIMIZATION For this example (run in mySQL), make the following assumptions: The columns being compared have been declared as follows: The tables have the following indexes:
Database Systems – SQL OPTIMIZATION / mySQL The tt.ActualPC values are not evenly distributed. Initially, before any optimizations have been performed, the EXPLAIN statement produces the following information: table type possible_keys key key_len ref rows Extra et ALL PRIMARY NULL NULLNULL 74 do ALL PRIMARY NULL NULLNULL 2135 et_1 ALL PRIMARY NULL NULLNULL 74 tt ALL AssignedPC, NULL NULLNULL 3872 ClientID, ActualPC range checked for each record (key map: 35) Because type is ALL for each table, this output indicates that MySQL is generating a Cartesian product of all the tables. For the case at hand, this product is 74 × 2135 × 74 × 3872 = 45,268,558,720 rows.
Database Systems – SQL OPTIMIZATION / mySQL One problem here is that MySQL can use indexes on columns more efficiently if they are declared as the same type and size. In this context, VARCHAR and CHAR are considered the same if they are declared as the same size. tt.ActualPC is declared as CHAR(10) and et.EMPLOYID is CHAR(15), so there is a length mismatch. To fix this disparity between column lengths, use ALTER TABLE to lengthen ActualPC from 10 characters to 15 characters. ALTER TABLE tt MODIFY ActualPC VARCHAR(15);
Database Systems – SQL OPTIMIZATION / mySQL Now tt.ActualPC and et.EMPLOYID are both VARCHAR(15). Executing the EXPLAIN statement again produces this result: table type possible_keys key key_len ref rows Extra tt ALL AssignedPC, NULL NULLNULL 3872 Using ClientID, where ActualPC do ALL PRIMARY NULL NULLNULL 2135 range checked for each record (key map: 1) et_1 ALL PRIMARY NULL NULLNULL 74 range checked for each record (key map: 1) et eq_ref PRIMARY PRIMARY 15 tt.ActualPC 1 This is not perfect, but is much better: The product of the rows values is less by a factor of 74. This version executes in a couple of seconds.
Database Systems – SQL OPTIMIZATION / mySQL A second alteration can be made to eliminate the column length mismatches for the tt.AssignedPC = et_1.EMPLOYID and tt.ClientID = do.CUSTNNBR comparisons: ALTER TABLE tt MODIFY AssignedPC VARCHAR(15), -> MODIFY ClientID VARCHAR(15); After that modification, EXPLAIN produces the output shown here: table type possible_keys key key_len ref rows Extra et ALL PRIMARY NULL NULLNULL 74 tt ref AssignedPC, ActualPC 15 et.EMPLOYID 52 Using ClientID, where ActualPC et_1 eq_ref PRIMARY PRIMARY 15 tt.AssignedPC 1 do eq_ref PRIMARY PRIMARY 15 tt.ClientID 1
Database Systems – SQL OPTIMIZATION / mySQL At this point, the query is optimized almost as well as possible. The remaining problem is that, by default, MySQL assumes that values in the tt.ActualPC column are evenly distributed, and that is not the case for the tt table. Fortunately, it is easy to tell MySQL to analyze the key distribution: ANALYZE TABLE tt; With the additional index information, the join is perfect and EXPLAIN produces this result:
Database Systems – SQL OPTIMIZATION / mySQL table type possible_keys key key_len ref rows Extra tt ALL AssignedPC NULL NULLNULL 3872 Using ClientID, where ActualPC et eq_ref PRIMARY PRIMARY 15 tt.ActualPC 1 et_1 eq_ref PRIMARY PRIMARY 15 tt.AssignedPC 1 do eq_ref PRIMARY PRIMARY 15 tt.ClientID 1 Note that the rows column in the output from EXPLAIN is an educated guess from the MySQL join optimizer. You should check whether the numbers are even close to the truth by comparing the rows product with the actual number of rows that the query returns. If the numbers are quite different, you might get better performance by using STRAIGHT_JOIN in your SELECT statement and trying to list the tables in a different order in the FROM clause.
Database Systems – SQL OPTIMIZATION / Postgres These issues do not seem to exist in Postgres. EXPLAIN SELECT tt.ActualPC, et.EmployID, tt.ClientID FROM tt, et, et AS et_1, doo WHERE tt.ActualPC = et.EMPLOYID AND tt.AssignedPC = et_1.EMPLOYID AND tt.ClientID = doo.CUSTNMBR; QUERY PLAN ---------------------------------------------------------------------------- Hash Join (cost=65.49..160.23 rows=1495 width=38) Hash Cond: (tt.clientid = doo.custnmbr) -> Hash Join (cost=5.29..77.60 rows=1495 width=38) Hash Cond: (tt.assignedpc = et_1.employid) -> Hash Join (cost=2.64..54.40 rows=1495 width=49) Hash Cond: (tt.actualpc = et.employid) -> Seq Scan on tt (cost=0.00..30.59 rows=1659 width=33) -> Hash (cost=1.73..1.73 rows=73 width=16) -> Seq Scan on et (cost=0.00..1.73 rows=73 width=16) -> Hash (cost=1.73..1.73 rows=73 width=16) -> Seq Scan on et et_1 (cost=0.00..1.73 rows=73 width=16) -> Hash (cost=33.98..33.98 rows=2098 width=16) -> Seq Scan on doo (cost=0.00..33.98 rows=2098 width=16)
Database Systems – SQL OPTIMIZATION / Postgres Why so many sequential scans?
Database Systems – SQL OPTIMIZATION / Postgres Why so many sequential scans? In reality a few thousand records isn’t many. The query optimizer decided the overhead for indexes or hash joins wasn’t worth it.
Database Systems – SQL SQL OPTIMIZATION SPEEDING UP SELECTS When the data stored in a database changes, the statistics used to optimize queries are not updated automatically. Therefore, use the ANALYZE command on each table to speed up results. WHERE CLAUSE OPTIMIZATION First note that any optimizations for the WHERE clause of a SELECT query also work for the WHERE clauses of DELETE and UPDATE queries.
Database Systems – SQL SQL OPTIMIZATION Examples of very fast queries Some examples of queries that are very fast: SELECT COUNT(*) FROM tbl_name; SELECT MIN(key_part1),MAX(key_part1) FROM tbl_name; SELECT MAX(key_part2) FROM tbl_name WHERE key_part1=constant; SELECT ... FROM tbl_name ORDER BY key_part1,key_part2,... LIMIT 10; SELECT ... FROM tbl_name ORDER BY key_part1 DESC, key_part2 DESC, ... LIMIT 10; These were actually statements about mySQL, but this should be the same in most modern relational databases.
Database Systems – SQL SQL OPTIMIZATION INSERT The time required for inserting a row is determined by the following factors, where the numbers indicate approximate proportions: Connecting: (3) Sending query to server: (2) Parsing query: (2) Inserting row: (1 × size of row) Inserting indexes: (1 × number of indexes) Closing: (1)
Database Systems – SQL SQL OPTIMIZATION INSERT If you are inserting many rows from the same client at the same time, use INSERT statements with multiple VALUES lists to insert several rows at a time. This is considerably faster (many times faster in some cases) than using separate single-row INSERT statements. When loading a table from a text file, use COPY FROM. This is usually significantly faster than using INSERT statements. See .http://www.postgresql.org/docs/8.1/static/sql-copy.html While outside of the scope of what I wish to cover, if you are loading a large amount of data in proportion to the current size of the table, it may be more efficient to drop the indexes, load the file, and then reinstate the indexes.
Database Systems – SQL SQL OPTIMIZATION INSERT To speed up INSERT operations that are performed with multiple statements for non-transactional tables, lock your tables: BEGIN WORK; LOCK TABLES IN ACCESS EXCLUSIVE; INSERT INTO a VALUES (1,23),(2,34),(4,33); INSERT INTO a VALUES (8,26),(6,29); COMMIT WORK; http://www.postgresql.org/docs/8.1/static/sql-lock.html There is no LOCK TABLE in the SQL standard, which instead uses SET TRANSACTION to specify concurrency levels on transactions. PostgreSQL supports that too; see SET TRANSACTION for details. Copying a file to a table is still faster than the method demonstrated above.
Database Systems – SQL SQL OPTIMIZATION GENERAL TIPS Use persistent connections to the database to avoid connection overhead. Run ANALYZE AND VACUUM update the statistics and reclaim deleted space after a lot of rows are removed from a table. One recommendation I do not agree with: Try to keep column names simple. For example, in a table named customer, use a column name of name instead of customer_name. I DISAGREE! To make your names portable to other SQL servers, you should keep them shorter than 18 characters.
Database Systems – SQL SQL TRANSACTIONS It is often important to ensure a series of statements occur as an atomic unit or do not occur at all. For example, if you wanted to transfer money from one account to another, you would not want the removal of the funds from one account to occur without the depositing of those funds in the second account. If something happened to prevent the depositing of the funds, then you would want the withdrawal cancelled. This is accomplished through the use of transactions. In PostgreSQL, a transaction is set up by surrounding the SQL commands of the transaction with BEGIN and COMMIT commands. So our banking transaction would actually look like: The syntax is simple: BEGIN; Any SQL commands you wish to execute atomically COMMIT; http://www.postgresql.org/docs/8.3/static/tutorial-transactions.html
Database Systems – SQL SQL TRANSACTIONS Note that some statements can not be rolled back. These are typically ones that alter the definition of the database/table structure. If statements like these are included within a transaction, then if another statement fails within the transaction, then a full rollback to the beginning of the transaction can not occur. Transactions can be broken up so that you can rollback to a specific point within the transaction using the SAVEPOINT command. SAVEPOINT identifier ROLLBACK TO SAVEPOINT identifier RELEASE SAVEPOINT identifier http://www.postgresql.org/docs/8.1/static/sql-savepoint.html http://developer.postgresql.org/pgdocs/postgres/sql-rollback-to.html http://www.postgresql.org/docs/8.1/static/sql-release-savepoint.html
Database Systems – SQL VIEWS A view in SQL is a great way to present information to a user in another way than the logical table structure. You might do this to limit the access of certain fields, i.e. Social Security Number or Date of birth. You might wish to derive a field. i.e age from date of birth. You might wish to denormalize a series of tables and remove the ID fields so they do not confuse someone generating reports from the data.
Database Systems – SQL You could great a view that would show the following: VIEWS The actual syntax has more options than this, but a simple form for creating a view is as follows: CREATE VIEW ViewName AS SelectStatement; So if you had the following table: BaseballPlayers view BaseballPlayers table The BaseballPlayers2 view can be created with the following statement: CREATE VIEW BaseballPlayers2 AS SELECT LastName, FirstName, Year(DOB) AS Age FROM BaseballPlayers; http://www.postgresql.org/docs/8.1/static/sql-createview.html
Database Systems – SQL VIEWS A View name can not be the same name as a table that already exists. Views must have unique column names. Columns selected can be column names or expressions. If you create an expression, name it! A view can be created from many kinds of SELECT statements. It can refer to base tables or other views. It can use joins, UNION, and subqueries. Currently, views are read only: the system will not allow an insert, update, or delete on a view. You can get the effect of an updatable view by creating rules that rewrite inserts, etc. on the view into appropriate actions on other tables. For more information see CREATE RULE. http://www.postgresql.org/docs/8.1/static/sql-createview.html
Database Systems – SQL VIEWS Use the DROP VIEW statement to drop views. Be careful that the names and types of the view's columns will be assigned the way you want. For example, CREATE VIEW vista AS SELECT 'Hello World'; is bad form in two ways: the column name defaults to ?column?, and the column data type defaults to unknown. If you want a string literal in a view's result, use something like CREATE VIEW vista AS SELECT text 'Hello World' AS hello;
Database Systems – SQL DATE/TIME WORKING with DATETIME, DATE, and INTERVAL VALUES Types DATETIME or TIMESTAMP Structured "real" date and time values, containing year, month, day, hour, minute, second and millisecond for all useful date & time values (4713 BC to over 100,000 AD). DATE Simplified integer-based representation of a date defining only year, month, and day. INTERVAL Structured value showing a period of time, including any/all of years, months, weeks, days, hours, minutes, seconds, and milliseconds. "1 day", "42 minutes 10 seconds", and "2 years" are all INTERVAL values.
Database Systems – SQL DATE/TIME WORKING with DATETIME, DATE, and INTERVAL VALUES Which do I want to use: DATE or TIMESTAMP? I don't need minutes or hours in my value That depends. DATE is easier to work with for arithmetic (e.g. something reoccurring at a random interval of days), takes less storage space, and doesn't trail "00:00:00" strings you don't need when printed. However, TIMESTAMP is far better for real calendar calculations (e.g. something that happens on the 15th of each month or the 2nd Thursday of leap years). More below. 1. The difference between two TIMESTAMPs is always an INTERVAL TIMESTAMP '1999-12-30' - TIMESTAMP '1999-12-11' = INTERVAL '19 days' 2. You may add or subtract an INTERVAL to a TIMESTAMP to produce another TIMESTAMP TIMESTAMP '1999-12-11' + INTERVAL '19 days' = TIMESTAMP '1999-12-30' 3. You may add or subtract two INTERVALS INTERVAL '1 month' + INTERVAL '1 month 3 days' = INTERVAL '2 months 3 days'
Database Systems – SQL DATE/TIME 1. The difference between two DATES is always an INTEGER, representing the number of DAYS difference DATE '1999-12-30' - DATE '1999-12-11' = INTEGER 19 You may add or subtract an INTEGER to a DATE to produce another DATE DATE '1999-12-11' + INTEGER 19 = DATE '1999-12-30'
Database Systems – SQL DATE/TIME