APPROXIMATE QUERY PROCESSING

APPROXIMATE QUERY PROCESSING BY KAVYA REDDY MUSANI

Types of RDBMS There are two types of RDBMS • Operational/Transactional Databases: ---Used for day to day operations. ---It changes rapidly. • DataWarehousing: ---It is useful for decision support and data analysis. ---Not useful for day to day operations. ---It doesn’t change rapidly.

Analysis of DataWarehousing • Decision support • OLAP tools • Statistics • Data Mining

Ways to improve Query Processing • Indexes • Indexes are used to speed up querying on huge databases. • The different kinds of indexes are B+ trees, Hash indexes, Bitmap indexes, etc. • Materialized Views • Materialized views improve query performance by precalculating expensive join and aggregation operations on the databases prior to execution time and storing these results in the database.

Example of Approximation • Suppose we have “Sales” table showing the sales of some products with attributes Product_ID, Product_Name, Price, Quantity and State. • Now we want to know the sales of a product in each state of a country. • Then we write a query which outputs the total sales of that product in each state. • Example of query: Select State as State, Sum(Price) as total_price from Sales group by State. Contd…

Contd… • We can approximate the total_price to an integer so that we can see the behaviour of the product sales in different states clearly. • For example if for the state of Texas the total_price is 56788.9866 dollars then we can approximate it to 60000. • Now we need to indicate the error in the corresponding column. • This error is called Confidence Interval because it shows the amount by which we have approximated the value.

Synopsis Method • Compress the data into a smaller representation (known as “synopsis”). • Execute query against synopsis and produce answers.

Methods of Data Compression • Sampling. • Building models using statistics. • Curve Fitting.

Types of Synopsis • Random Sampling (lossy compression). • Histograms. • Wavelets. • Learning Joint Distribution.

Random Sampling • A sampling procedure that assures that each element in the population has an equal chance of being selected is referred to as simple random sampling. • There are different random sampling techniques like Simple random sampling, Systematic sampling, Stratified sampling, Cluster sampling, multi-stage sampling, etc.

Role of Random sampling in Approximate Query Processing • Suppose we want to find the average of salaries of employees of a big company for a particular year. • Involving every employee’s salary in the average salary slows down the processing. • So, we take a small percentage of random sample (synopsis) of the huge table and find the average salary. Contd…

Contd… • This improves the speed of processing the query. • However, care should be taken while extracting random sample. • If the random sample drawn is assured to be good, then we can query the random sample (synopsis) for further references which reduces the processing time considerably and now we project the result to the actual table by multiplying with sampling fraction.

APPROXIMATE QUERY PROCESSING