Forms of Retrieval • Sequential Retrieval • Two-Step Retrieval • Retrieval with Indexed Cases
Retrieval with Indexed Cases • Sources: • Textbook, Chapter 7 • Davenport & Prusack’s book on Advanced Data Structures • Samet’s book on Data Structures
(Figure: a diagnosis example — “Red light on? Yes. Beeping? Yes. … Transistor burned!” — retrieved via range search over the space of known problems.)
k-d Trees • Idea: partition the case base into smaller fragments • Represents a k-dimensional space as a binary tree • Similar to a decision tree: comparisons are made at the nodes • During retrieval: • Search descends to a leaf, but • Unlike decision trees, backtracking may occur
Definition: k-d Trees • Given: • k types T1, …, Tk for the attributes A1, …, Ak • A case base CB containing cases in T1 × … × Tk • A parameter b (the bucket size) • A k-d tree T(CB) for a case base CB is a binary tree defined as follows: • If |CB| &lt; b, then T(CB) is a leaf node (a bucket) • Else T(CB) defines a tree such that: • The root is marked with an attribute Ai and a value v in Ti, and • The two k-d trees T({c ∈ CB : c.i-attribute &lt; v}) and T({c ∈ CB : c.i-attribute ≥ v}) are the left and right subtrees of the root
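The recursive definition above can be sketched directly in code. This is a minimal illustration, not the textbook's implementation: cases are tuples of k numbers, the split attribute cycles with depth, and the median is used as the split value v (the definition itself does not prescribe how Ai and v are chosen).

```python
BUCKET_SIZE = 2  # the parameter b from the definition (assumed value)

def build_kd_tree(cases, depth=0):
    """Build a k-d tree over a list of cases (tuples of k numbers).

    Mirrors the definition: if |CB| < b the node is a leaf bucket;
    otherwise split on attribute A_i at value v, with cases whose
    i-attribute is < v going left and >= v going right."""
    if len(cases) < BUCKET_SIZE:
        return {"bucket": cases}               # leaf node: a bucket
    k = len(cases[0])
    i = depth % k                              # split attribute A_i (cycled by depth)
    values = sorted(c[i] for c in cases)
    v = values[len(values) // 2]               # median as split value (a choice, not mandated)
    left = [c for c in cases if c[i] < v]
    right = [c for c in cases if c[i] >= v]
    if not left or not right:                  # degenerate split: stop recursing
        return {"bucket": cases}
    return {"attr": i, "value": v,
            "left": build_kd_tree(left, depth + 1),
            "right": build_kd_tree(right, depth + 1)}
```

Every case ends up in exactly one bucket, so the tree is a partition of the case base, as the Idea bullet above describes.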
BWB-Check • Ball-Within-Bounds check: • Suppose the algorithm reaches a leaf node M (with at most b cases) while searching for the most similar case to P • Let c be a case in M such that dist(c,P) is smallest • Then c is a candidate NN for P • If dist(P,B) &gt; dist(c,P) for every boundary B of M, then c is the NN • But if dist(P,B) &lt; dist(c,P) for some boundary B of M, then the algorithm needs to backtrack and check whether the regions beyond B contain a better candidate • For computing distances, simply let f⁻¹ be the inverse of the distance-similarity compatible function f: • distance(P,C) = f⁻¹(sim(P,C))
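The slide leaves the distance-similarity compatible function f unspecified; as an assumed illustration, take f(d) = 1/(1+d), which is monotone decreasing and maps distance 0 to similarity 1. Its inverse then recovers a distance from a similarity value, as the last bullet requires.

```python
def f(d):
    """An example distance-similarity compatible function (assumed, not
    from the slides): monotone decreasing, f(0) = 1."""
    return 1.0 / (1.0 + d)

def f_inv(s):
    """Inverse of f: recovers the distance from a similarity value,
    so distance(P, C) = f_inv(sim(P, C))."""
    return 1.0 / s - 1.0
```

Any strictly decreasing f with a computable inverse would do; the BWB/BOB checks only need distances that order cases consistently with the similarity measure.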
BOB-Check • Ball-Out-of-Bounds check: • Used during backtracking • Checks whether, for the boundary B defined at the node, dist(P,B) &lt; dist(c,P) • where c is the current candidate for best case (e.g., the closest case to P in the initial bucket) • If the condition is true, the algorithm needs to check whether the regions beyond that boundary contain a better candidate
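Putting the pieces together, nearest-neighbor retrieval with backtracking can be sketched as follows. This is a self-contained illustration under assumed conventions (cases as numeric tuples, Euclidean distance, median splits), not the textbook's algorithm: the search descends to the bucket that would contain P, takes the closest case there as candidate c, and on the way back up applies the BOB check — the far subtree is visited only if the ball around P with radius dist(P, c) crosses the node's splitting boundary.

```python
import math

def build(cases, b=2, depth=0):
    """Minimal k-d tree: bucket when fewer than b cases remain."""
    if len(cases) < b:
        return ("leaf", cases)
    i = depth % len(cases[0])
    v = sorted(c[i] for c in cases)[len(cases) // 2]
    left = [c for c in cases if c[i] < v]
    right = [c for c in cases if c[i] >= v]
    if not left or not right:
        return ("leaf", cases)
    return ("node", i, v, build(left, b, depth + 1), build(right, b, depth + 1))

def nn_search(node, p, best=None):
    """Descend to a bucket, take the closest case as candidate, then
    backtrack; the BOB check decides whether the far subtree must
    also be searched."""
    if node[0] == "leaf":
        for c in node[1]:
            if best is None or math.dist(p, c) < math.dist(p, best):
                best = c
        return best
    _, i, v, left, right = node
    near, far = (left, right) if p[i] < v else (right, left)
    best = nn_search(near, p, best)
    # BOB check: the boundary is the hyperplane A_i = v, so the distance
    # from p to it is |p[i] - v|; backtrack into the far side only if
    # that is smaller than the distance to the current candidate.
    if abs(p[i] - v) < math.dist(p, best):
        best = nn_search(far, p, best)
    return best
```

Because the far subtree is skipped whenever the BOB check fails, the search returns the exact nearest neighbor while usually visiting only a few buckets.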
Example (Figure: a k-d tree over city cases in a (0,0)–(100,100) plane — Toronto (60,75), Buffalo (80,65), Denver (5,45), Chicago (35,40), Omaha (25,35), Mobile (50,10), Atlanta (85,15), Miami (90,5) — split on A1 at 35, 60, and 85 and on A2 at 40, with query point P(32,45).) • Notes: • Priority lists are used for computing kNN
Variant: InReCA Tree — Using Decision Trees as Index (Figure: a standard decision tree branches on attribute Ai with values v1, v2, …, vn plus an “unknown” branch; the InReCA tree adds branches &gt;v1, …, &gt;vn, so it can be combined with numeric attributes.) • Notes: • Supports the Hamming distance • May require backtracking (using the BOB-check) • Operates in a similar fashion to k-d trees • Priority lists are used for computing kNN
Properties of Retrieval with Indexed Cases • Advantages: • Efficient retrieval • Incremental: the index does not need to be rebuilt every time a new case is entered • The α-error does not occur • Disadvantages: • High construction cost • Only works for monotonic similarity relations