210 likes | 334 Views
Statistics Profile For Query Optimization. WENYI NI. Introduction. What is statistics profile?. Every object has its own status. In order to know its status, we need statistics. The relation between Statistics profile and statistics. Cost Model. From M.Tamer Oszu.
E N D
Statistics Profile For Query Optimization WENYI NI Spring 2004, CSE8330 Presentition
Introduction What is statistics profile? • Every object has its own status. • In order to know its status, we need statistics. • The relation between Statistics profile and statistics. Spring 2004, CSE8330 Presentition
Cost Model From M.Tamer Oszu When DBMS use statistics profile? Spring 2004, CSE8330 Presentition
What does statistics profile collect? • The central tendency of the data • The range of the data • The size of the data • The distribution of the data Spring 2004, CSE8330 Presentition
Common types of statistics profile • Table profile • Attribute profile • Index profile Spring 2004, CSE8330 Presentition
Typical profiles Spring 2004, CSE8330 Presentition
Three ways to collect statistics • Exhaustive accumulation • Sampling • Piggyback Spring 2004, CSE8330 Presentition
Exhaustive accumulation • Calculate every statistics describer through scanning the related object exhaustively • Advantage Most Accurate • Disadvantage Heavy system load Spring 2004, CSE8330 Presentition
Sampling • Scan part of the related object. Estimate statistics through sample data • Advantage Low system overhead • Disadvantage Still have overhead. Statistics is not 100% accurate. Spring 2004, CSE8330 Presentition
Piggyback • Collect statistics through data in memory. Slightly change SQL statement to make full use of these data. • Types of piggyback • Vertical piggyback • Horizontal piggyback • Mixed piggyback Spring 2004, CSE8330 Presentition
Vertical piggyback • Include extra columns during query processing Example: Select student.name from student; rewrite to: Select student.name,student.age from student; Spring 2004, CSE8330 Presentition
No extra I/O, but extra cpu load. Solution: set piggyback level • AC1 = { x| x is a column in Table Ri referenced by Query Q} • AC2 = { x| x is an index column in Table Ri } – AC1 • AC3 = { x| x is a column in Table Ri and x is a part of the primary key or foreign key or referenced by a foreign key}-AC2 • AC4 = { x| x is a column in Table Ri }-AC3 Advantage: Choose your piggyback level according to the CPU load Spring 2004, CSE8330 Presentition
Horizontal piggyback • Include extra rows during query process Example: Select student.name, student.score From student where score >60; Rewrite to: Select student.name, student.score From student where score >60 or student.pid In (Select student.pid for student Where score>60); • Advantage Spring 2004, CSE8330 Presentition
Mixed piggyback • Use both vertical and horizontal piggyback method • Advantage Spring 2004, CSE8330 Presentition
Value distribution • Why we need it? Example: Select * from Student Where score>60; Size?? Spring 2004, CSE8330 Presentition
Answer: Size = 500*0.81*30 = 121.5 Where 500 is the cardinality of the student table. 30 is the size of each record Spring 2004, CSE8330 Presentition
How to get distribution table? • Histogram • Equal width • Equal height Spring 2004, CSE8330 Presentition
Bucket number • 1+ logn [rule of sturge 1927] Example: student table ( 500 records) 1+log500 = 10 • For equal width, put each value into the proper buckets • For equal height, make an order to the value, if the sampling size is m, decide the height k = m/(bucket number), and put the value in bucket in order Spring 2004, CSE8330 Presentition
Sampling • How many sample do we need? A sample size of 1064 can give a less than 10% error rate with 99% probability (mannino1988) To gain same error rate for varies size of table, Sample rate drops when size of table grows. Drop rate: log(n)/n Example: 20 sample with 2%error rate on table with 100 records We need 1000*0.2*(1-log(1000)/1000) samples to reach 2% error rate on table with 1000 records Spring 2004, CSE8330 Presentition
Summery & Future work • Low overhead • Low error rate, still have room to improve • The way to estimate the size of project and join operations with statistics still need be improved. Spring 2004, CSE8330 Presentition
The end Spring 2004, CSE8330 Presentition