Preferential top-k search over local data
Dissertation thesis
RNDr. Martin Šumák
Supervisor: doc. RNDr. Stanislav Krajči, PhD.
Consultant: RNDr. Peter Gurský, PhD.

Outline
• Top-k search
  • motivation and example
  • restrictions and assumptions
• R-tree-based solution
  • normalization of data
  • R++-tree
• Grid file-based solution
• Experiments
• Comparison with the B+-tree-based solution, table scan, etc.

Top-k search
• Example
  • Find the top 20 apartments with 3 or 4 rooms, not on the first floor, with a price of about 60,000 euro and not exceeding 70,000 euro.
  • Moreover, price is the most important attribute and floor is the least important one.

Top-k query
• k = 20
• preferences for attribute values: fuzzy functions
• importance of attributes: weights $w_{price} = 3$, $w_{rooms} = 2$, $w_{floor} = 1$

Top-k query
• The overall value of an object $O$ is $3 \cdot f_{price}(O_{price}) + 2 \cdot f_{rooms}(O_{rooms}) + 1 \cdot f_{floor}(O_{floor})$
• In general: $c(f_{price}(O_{price}), f_{rooms}(O_{rooms}), f_{floor}(O_{floor}))$
• The function $c$ has to be monotone!

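To make the weighted overall value concrete, here is a minimal Python sketch for the apartment example. The shapes of the fuzzy functions are assumptions made for illustration only (a triangle peaking at 60,000 and reaching zero at 50,000 and 70,000 for price, crisp 0/1 functions for rooms and floor); the slides do not specify them. The names `triangle`, `PREFERENCES`, `WEIGHTS`, and `overall_value` are hypothetical.

```python
def triangle(left, peak, right):
    """Return a fuzzy function that is 1 at `peak` and 0 outside (left, right)."""
    def f(x):
        if x <= left or x >= right:
            return 0.0
        if x <= peak:
            return (x - left) / (peak - left)
        return (right - x) / (right - peak)
    return f

# Assumed preference (fuzzy) functions, one per attribute.
PREFERENCES = {
    "price": triangle(50000, 60000, 70000),            # "about 60000, not exceeding 70000"
    "rooms": lambda r: 1.0 if r in (3, 4) else 0.0,    # "3 or 4 rooms"
    "floor": lambda fl: 0.0 if fl == 1 else 1.0,       # "not on the first floor"
}

# Weights from the slide: price is the most important, floor the least.
WEIGHTS = {"price": 3, "rooms": 2, "floor": 1}

def overall_value(obj):
    """Weighted sum of preference degrees; monotone because all weights are non-negative."""
    return sum(WEIGHTS[a] * PREFERENCES[a](obj[a]) for a in WEIGHTS)

apartment = {"price": 65000, "rooms": 3, "floor": 2}
print(overall_value(apartment))  # 3*0.5 + 2*1 + 1*1 = 4.5
```

A weighted sum with non-negative weights is one monotone choice of $c$: raising any single preference degree can never lower the overall value.
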
The goal of top-k search
• to find the top-k objects efficiently
  • by processing the minimum amount of data
• restrictions and assumptions
  • all the data is accessible locally
  • all attributes are numerical

R-tree-based solution
• object
  • a vector of n numbers
  • a point in n-dimensional space
• R-tree, R*-tree, R+-tree, R++-tree

From kNN to top-k search
• k nearest neighbours (kNN)
  • a known incremental algorithm
  • the distance from the "query point Z" is the measure of "closeness"

From kNN to top-k search
• top-k search
  • the overall value (h) is the measure of "goodness"
  • by replacing distance with overall value and reversing the order, we change the result from kNN to top-k

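The switch from kNN to top-k can be sketched as a best-first traversal in which the priority queue is ordered by the largest possible overall value instead of the smallest possible distance. The code below is an illustrative sketch, not the thesis implementation. It assumes a hypothetical `Node` class holding a bounding rectangle, and it assumes that point coordinates are already preference degrees, so a monotone overall value reaches its maximum over a rectangle at the upper corner.

```python
import heapq
from itertools import count

class Node:
    """Hypothetical tree node: a bounding rectangle with child nodes or data points."""
    def __init__(self, lo, hi, children=None, points=None):
        self.lo, self.hi = lo, hi            # lower and upper corner of the rectangle
        self.children = children or []
        self.points = points or []

def top_k(root, k, value, upper_bound):
    """Return the k points with the highest overall value, best first.

    value(point)        -> overall value of a point
    upper_bound(lo, hi) -> a value that no point inside the rectangle can exceed
    """
    tie = count()                            # tie-breaker so nodes are never compared
    heap = [(-upper_bound(root.lo, root.hi), next(tie), False, root)]
    result = []
    while heap and len(result) < k:
        neg, _, is_point, item = heapq.heappop(heap)
        if is_point:                         # nothing left in the queue can beat it
            result.append((item, -neg))
            continue
        for child in item.children:
            heapq.heappush(heap, (-upper_bound(child.lo, child.hi), next(tie), False, child))
        for p in item.points:
            heapq.heappush(heap, (-value(p), next(tie), True, p))
    return result

# Assumed monotone overall value on two preference degrees.
value = lambda p: 3 * p[0] + 2 * p[1]
bound = lambda lo, hi: value(hi)             # maximum is at the upper corner

leaf_a = Node((0.0, 0.0), (0.4, 0.5), points=[(0.1, 0.5), (0.4, 0.2)])
leaf_b = Node((0.6, 0.3), (0.9, 0.8), points=[(0.9, 0.3), (0.6, 0.8)])
root = Node((0.0, 0.0), (0.9, 0.8), children=[leaf_a, leaf_b])

best = [p for p, _ in top_k(root, 2, value, bound)]
print(best)
```

In this small example the two best points both lie in `leaf_b`, and `leaf_a` is never expanded because its upper bound is lower than the values already found. This is the sense in which the search processes only part of the data.
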
Analogy of kNN and top-k search
• Correctness (kNN vs. top-k)
• Efficiency (kNN vs. top-k)

Disproportion of attribute values
• floor, area, price: very different ranges
• solution: normalization, i.e. a linear transformation of attribute values to the interval [0; 1]
• another disproportion comes from the weights

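The linear transformation to [0; 1] mentioned above can be sketched as ordinary min-max scaling. The function name `normalize` and the handling of a constant attribute (mapped to 0.0) are assumptions of this sketch, not details taken from the thesis.

```python
def normalize(values):
    """Linearly map a list of numbers onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: assumed to map to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

prices = [40000, 60000, 80000]
floors = [1, 3, 5]
print(normalize(prices))  # [0.0, 0.5, 1.0]
print(normalize(floors))  # [0.0, 0.5, 1.0]
```

After scaling, price and floor span the same range, so neither attribute dominates the geometry of the index purely because of its units.
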
Normalization applicability
• Useful for
  • R*-tree
• Meaningless for
  • R-tree (proven for the quadratic split method)
  • R+-tree, R++-tree
  • Grid file

Why the R++-tree
• Requiring both zero overlap and minimum bounding rectangles may cause a problem when adding a new object
• The R+-tree avoids overlaps at the expense of rectangle size

The R++-tree idea
• Requiring both zero overlap and minimum bounding rectangles may cause a problem when adding a new object
• The R++-tree keeps two rectangles for each node: the minimum one and the parent covering one

The R++-tree properties
• Height-balanced
• Zero overlaps
• Overflow nodes at leaf level only
• Minimum node occupancy is 1
• For top-k search purposes, attribute values can be strings or any other comparable values (not just numbers)

Top-k search over Grid file
• The Grid file is a spatial index for point data
• We used a static Grid file without an extra directory

Top-k search over Grid file
• We have proven correctness and efficiency here as well