Select Operation – Disk Access and Indexing (*some info on slides from Dr. S. Son, U. Va.)
Disk access • DBs traditionally stored on disk • Cheaper to store on disk than in memory • Costs for: • Seek time, latency, data transfer time • Disk access is page (block) oriented • 2 - 4 KB page size
Access time • Access time is the time to randomly access a page • System initially determines if page in memory buffer (page tables, etc.) • Large disparity between disk access and memory access
Select operation using table scan • If the entire table is read for a select – a table scan • Improvements to a table scan of disk: • Parallel access • Sequential prefetch
Parallel access • Linear search - all data rows read in from disk • I/O parallelism can be used (RAID) • multiple I/O read requests satisfied at the same time • stripe the data across different disks • Problems with parallelism? • must balance disk arm load to gain maximum parallelism • requires the same total number of random I/Os, but uses the devices for a shorter time
Sequential prefetch I/O • Retrieve one disk page after another (on same track) – (32 in DB2, varies in Oracle) • Seek time no longer a problem • Must know in advance to read 32 successive pages • Speed up of I/O by a factor of ≈10 (500 I/O's per second vs. 70)
Access time • Seek time – as low as 4 ms on a server disk • Latency time – as low as 1 ms or less • Data transfer time – .4-2 ms • Solid state disks – up to 100,000 I/Os per sec., but still expensive
Access time for fast I/O (seconds)
                                  RIO (1 page)   Seq. Prefetch (32 pages)
  Seek - disk arm to cylinder       .004            .004
  Latency - platter to sector       .001            .001
  Data transfer                     .0005           .016
  Total per I/O                     .0055           .021
  32 pages                          .176*           .021
* .0055 x 32 = .176 seconds for 32 pages of RIO vs. .021 seconds for 32 pages of Seq. Prefetch
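These totals also back up the ≈10x speed-up claimed earlier: for the same 32 pages, .176 / .021 ≈ 8.4, i.e. close to an order of magnitude in favor of sequential prefetch.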
Organizing disk space • How should data be stored to minimize access time when the entire table is read?
Disk allocation • Disk resource allocation for databases (the DBA has control) • Goal – contiguous sectors on disk - want data as close together as possible to minimize seek time • No standard SQL approach, but there is a general way to deal with allocation • Some OSs allow specification of file size and disk device
Types of Files • Heap files (unordered – sequential) • Sorted files (ordered – sort key) • Hash files (hash key, hash function) • B+-trees • Storage Area Networks (SANs) – used for ERP (enterprise resource planning) and DW (data warehouses) • Storage devices configured as nodes in a network – can be attached/detached
Tablespace Tablespace is: • The allocation medium for tables and indexes in ORACLE, DB2, etc. • Can put more than 1 table in a tablespace if they are accessed together • A tablespace corresponds to 1 or more OS files and can span disk devices • Usually relations cannot span disk devices
DB storage structures [Figure: storage hierarchy for the Company database – the database contains tablespaces (tspace 1, system space); each tablespace maps to OS files (fname1, fname2, fname3); the tables (Empl, Dept, Proj, Dep) and the EmpIndx index are stored as data and index segments; each segment is made up of extents]
Tablespace • ORACLE DB's contain several tablespaces, including one called system - data description + indexes + user-defined tables • default tablespace given to each user • if multiple tablespaces - better control over load balancing • can take some disk space off-line
Extent • A relation is composed of 1 or more extents • Extent - contiguous storage on disk • when a data segment or index segment is first created, it is given an initial extent from the tablespace, e.g. 10KB (5 pages) • if more space is needed, it is given the next contiguous extent
Extent • Can increase the size by a positive % (cannot decrease) • initial n - size of initial extent • next n - size of next • max extents - maximum number of extents • min extents - number of extents initially allocated • pct increase n - % by which next extent grows over previous one
Oracle create tablespace • http://www.adp-gmbh.ch/ora/sql/create_tablespace.html
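To make the extent parameters above concrete, here is a minimal sketch of an Oracle tablespace definition using the classic dictionary-managed STORAGE clause; the tablespace name, file path, and sizes are illustrative assumptions, and locally managed tablespaces in newer Oracle releases ignore some of these settings:

  CREATE TABLESPACE app_data
    DATAFILE '/u01/oradata/app_data01.dbf' SIZE 100M
    DEFAULT STORAGE (
      INITIAL     10K      -- size of the initial extent
      NEXT        10K      -- size of the next extent
      MINEXTENTS  1        -- extents allocated when a segment is created
      MAXEXTENTS  100      -- maximum number of extents
      PCTINCREASE 50       -- each new extent 50% larger than the previous one
    );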
Create table • The create table statement can specify the tablespace and the number of extents • When the initial extent is full, a new extent is allocated • pctfree - determines how much of a page can be used for inserts of new rows • if pctfree = 10%, inserts stop when the page is 90% full • further inserts use another page • pctused – determines when new inserts start again on the page • if usage falls below a certain percentage of the total; default pctused = 40% • pctfree + pctused < 100
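A sketch of a create table statement that ties these options together in Oracle-style syntax; the table, column, and tablespace names are illustrative assumptions:

  CREATE TABLE Employee (
    ssn     CHAR(9) PRIMARY KEY,
    lname   VARCHAR2(30),
    salary  NUMBER(9,2)
  )
  TABLESPACE app_data
  PCTFREE 10            -- stop inserting into a page once it is 90% full
  PCTUSED 40            -- resume inserts when the page falls below 40% used
  STORAGE (INITIAL 10K NEXT 10K PCTINCREASE 50);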
Rows • Row layout on each disk page: header info, then the row directory (slots 1, 2, 3 … N), then free space, then the data rows stored from the end of the page (Row N, Row N-1, … Row 1) • Header – page-level control information • Row directory – row number and page byte offset • Row number is the row's number within the page – also called the slot # • Page byte offset – needed because with varchar, row size is not constant • To identify a particular row use the RID (RowID) – page #, slot # [file #]; the slot # is the number in the row directory (a logical #)
Differences in DBMSs re: rows • ROWID can be retrieved in ORACLE but not DB2 (violates relational model rule) • ORACLE • rows can be split between pages (row record fragmentation) • Can have rows from multiple tables on same page, more info • DB2, no splitting, entire row moved to new page, need forwarding pointer
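As an illustration of the Oracle side of this, the row identifier is exposed as the ROWID pseudocolumn and can simply be selected; a sketch using the Employee/ssn example from later slides:

  SELECT ROWID, e.*
  FROM   Employee e
  WHERE  e.ssn = '333445555';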
Select operation using Indexes • Alternative to table scan
Why use an index? • If a select (or join) on the same attribute is used frequently • we want a way to improve performance - use indexes • For example: Select * from Employee where ssn = 333445555
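A minimal sketch of an index that could speed up this query; the index name is an illustrative assumption, and UNIQUE is used because ssn is a key:

  CREATE UNIQUE INDEX emp_ssn_idx ON Employee (ssn);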
B+-tree • Most commonly used index structure type in DBs today • Based on the B-tree • Good for equality and range searches • B+-tree: dynamic, adjusts gracefully under inserts and deletes • Used to minimize disk I/O • available in DB2 and ORACLE; ORACLE also has hash clusters, Ingres has a heap structure, B-tree, and ISAM (chains together new nodes)
Structure of B+ Trees • the leaf level ("sequence set") holds the data entries – pointers to data (RIDs) • the remaining nodes are directory (index) entries used for direct search – they point to other index nodes
Example of B+-Tree [Figure: root node with keys 10, 20, 40; leaf nodes contain the entries 1 2 3 | 10 12 | 20 35 | 40 42 50, and each leaf entry points to the data]
Characteristics of B+ Tree • Order of tree (fan out) – max number of child nodes • Minimum 50% occupancy (except for the root): each node contains d/2 <= m <= d-1 entries, where the parameter d is the order of the tree • Insert/delete at log_F N cost; the tree stays height-balanced (F = fanout, N = # leaf pages) • Supports equality and range-searches efficiently
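As a quick illustration of the log_F N cost (numbers assumed for the example): with a fanout of F = 133 and N = 1,000,000 leaf pages, the height is about log_133 1,000,000 ≈ 2.8, so a search touches roughly 3 index levels before reaching a leaf.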
Cost of I/O for B+-tree • One index node is one page • If the tree has a depth of 3, it takes 3 I/Os to get the pointer to the data • An index node that has been read in can remain in memory • likely, since there is frequent access to the upper-level nodes of actively used B+-trees
B+ Trees in Practice • Typical order: between 100-200 children • Typical fill-factor: 2/3 full (66.6%) • average fanout = 133 (if 200 children) • Typical capacities: • Height 4: 133^4 ≈ 312,900,700 records • Height 3: 133^3 = 2,352,637 records • Can often hold the top levels in the buffer pool: • Level 1 = 1 page = 8 Kbytes • Level 2 = 133 pages = 1 Mbyte • Level 3 = 17,689 pages = 133 MBytes
Why B+-tree • Directory structure - retrieve range of values efficiently • search for leftmost index entry Si such that X <= Si • Index entries always in sequence by value - can use sequential prefetch on index • Index entries shorter than data rows - less I/O
B+-tree • Balancing of B+-trees - insert, delete • Nodes usually not full • Utilities to reorganize to lower disk I/O • Most systems allow nodes to become depopulated- no automatic algorithm to balance • Average node below root level 71% full in active growing B+-trees
Duplicate key values • Duplicate key values in index • leaf nodes have sibling pointers • but a delete of a row that has a heavily duplicated key entails a long search through the leaf-level of the B+-tree • Index compression - with multiple duplicates | header info | PrX keyval RID RID ... RID | PrX keyval RID…RID| where PrX is count of RID values
Create Index Options: • multiple columns • tablespace • storage – initial extents, etc. • percent free – default = 10% of each index page left unfilled (at creation) • free page – 1 free page for every n index pages (during creation)
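A sketch combining several of these options in Oracle-style syntax; the index name, column list, and tablespace name are illustrative assumptions (the free-page option is a DB2 notion, expressed there with its own FREEPAGE clause rather than anything shown here):

  CREATE INDEX emp_name_ix
    ON Employee (lname, fname)
    TABLESPACE idx_space
    PCTFREE 10                        -- leave 10% of each index page unfilled at creation
    STORAGE (INITIAL 10K NEXT 10K);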
Types of indexes (textbook) • Primary index - key field is a candidate key (must be unique) – data file ordered by key field • Clustering index - key field is not unique, data file is ordered – all records with same values on same pages • Secondary index - non-clustering index – data file not ordered • First record in the data page (or block) is called the anchor record • Non-dense index - pointer in index entry points to anchor • Dense index - pointer to every record in the file
Clustering • Efficiency advantage: read in a page and get all of the rows with the same value • clustering is useful for range queries, e.g. between keyval1 and keyval2
Clustering • Can only cluster table by 1 clustering index at a time • In SQL server • creates clustered index on PK automatically if no other clustered index on table and PK nonclustered index not specified • In DB2 – • if the table is empty, rows sorted as placed on disk • subsequent insertions not clustered, must use REORG • In Oracle- • Cluster index – now available for PK in 10g • Define a cluster to create cluster index for 2 tables
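For the SQL Server case above, a clustered index can also be created explicitly rather than relying on the PK default; a minimal sketch (the table, column, and index names are illustrative and anticipate the Customers/city example used later):

  CREATE CLUSTERED INDEX cust_city_cix ON Customers (city);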
Indexes vs. table scan • To illustrate the difference between a table scan, a secondary index (non-clustered) and a clustered index • Assume 10M customers, 200 cities, 2KB/page, row = 100 bytes, 20 rows/page • Select * From Customers Where city = Birmingham • If we assume selectivity = 1/200: 1/200 * 10M = 50,000 customers in a city
Rules of Thumb for I/O • Assume slightly slower times than before: • Random I/O – 160 pages/second (.00625 sec/page) • Sequential prefetch I/O – 1600 pages/second (.000625 sec/page) Will discuss later: • List prefetch I/O – 400 pages/second (.0025 sec/page)
Table Scan Table Scan – read the entire table • If we used random I/O (RIO) – which one would never do: 10,000,000 / 20 = 500,000 pages; 500,000 * RIO = 500,000 * .00625 = 3,125 seconds • Instead, it makes more sense to use sequential prefetch (SP), reading 32 pages at a time: 500,000 * SP = 500,000 * .000625 ≈ 312 seconds
Clustering Index Clustering Index – • All entries for B'ham clustered on the same pages • 50,000/20 = 2,500 data pages (with 20 rows per page) • Assume 3 upper (non-leaf) levels of the tree • Assume 1,000 index entries per leaf node, so read 50,000/1,000 = 50 index leaf pages • Number of pages to access = 3 + 50,000/1,000 + 50,000/20 = 3 + 50 + 2,500 • If the top 3 levels of the tree are in memory, count their access time as 0 • Access time: (3*0) + (50*SP) + (2,500*SP) = 2,550 * .000625 ≈ 1.6 seconds
Secondary Index • In the worst case, 1 entry for B'ham per data page • 50,000 data pages (10M/200) • 3 + 50 + 50,000 = 50,053 pages to access • Access time: (3*0) + (50*SP) + (50,000*RIO) ≈ 312.5 seconds • REALLY slow – see the next slide for a better solution! Use List Prefetch instead of RIO
List Prefetch – Better solution • Create a list of data pages to access • Pages not necessarily in contiguous sequential order • System orders the pages to minimize disk I/O (e.g. elevator algorithm for disk request scheduling) • Using list prefetch (LP): 0 + (50*SP) + (50,000*LP) = 125.03 seconds access time
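Pulling the numbers from the last few slides together for the same query (50,000 matching rows out of 10M): table scan with sequential prefetch ≈ 312 seconds, secondary index with random I/O ≈ 312.5 seconds, secondary index with list prefetch ≈ 125 seconds, clustered index ≈ 1.6 seconds.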
% Free • Redo the previous calculations assuming relations created with 50% free option specified.
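One possible setup, assuming the 50% free option means only half of each page holds entries at creation (10 data rows and roughly 500 index entries per page instead of 20 and 1,000): the clustered-index case becomes 50,000/500 = 100 index leaf pages and 50,000/10 = 5,000 data pages, so (3*0) + (100*SP) + (5,000*SP) = 5,100 * .000625 ≈ 3.2 seconds, roughly double the earlier 1.6; the table scan similarly doubles to about 1,000,000 pages * SP ≈ 625 seconds.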
Creating Indexes • When determining what indexes to create consider: • workload - mix of queries and frequencies of requests • 20% of requests are updates, etc. • can create lots of indexes but: • cost to create • insertions • initial load time high if a large table • index entries can become longer and longer as multiple columns included
Multiple Indexes • More than one index on a relation • e.g. age – one index, class - one index, gender - one index
Composite Index • One index based on more than one attribute Create Index index_name on Table (col1, col2,... coln) • Composite index entry - values for each attribute age, class, gender entry in index is: C1, C2, C3, RID
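A sketch of the corresponding create index statement for the age, class, gender example; the table and index names are illustrative assumptions:

  CREATE INDEX stud_acg_ix ON Student (age, class, gender);

Each entry in this index then holds the three column values (C1, C2, C3) followed by the RID of the row.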
Using Indexes • System must decide if to use index • What if more than one index, which one? • What if composite index?