MindReader: Querying databases through multiple examples
Yoshiharu Ishikawa (Nara Institute of Science and Technology, Japan)
Ravishankar Subramanya (Pittsburgh Supercomputing Center)
Christos Faloutsos (Carnegie Mellon University)
Outline
• Background & Introduction
  • Query by Example
  • Our Approach
• Relevance Feedback
  • What’s New in MindReader?
• Proposed Method
  • Problem Formulation
  • Theorems
• Experimental Results
• Discussion & Conclusion
Query-by-Example: an example
Searching for “mildly overweight” patients
• The doctor selects examples by browsing the patient database
• The examples have an “oblique” correlation
• We can “guess” the implied query q
(figure: Weight vs. Height scatter plot; examples marked “good” and “very good”)
Query-by-Example: the question
Assume that
• the user gives multiple examples
• the user optionally assigns scores to the examples
• the samples have spatial correlation
How can we “guess” the implied query?
Our Approach
• Automatically derive the distance measure from the given examples
• Two important notions:
  1. diagonal query: the isosurfaces of queries are ellipsoids
  2. multiple-level scores: the user can specify “goodness scores” on the samples
Isosurfaces of Distance Functions
(figure: isosurfaces around the query point q — a circle for the Euclidean distance, an axis-aligned ellipse for the weighted Euclidean distance, and an arbitrarily oriented ellipse for the generalized ellipsoid distance)
Distance Function Formulas
• Euclidean: D(x, q) = (x − q)ᵀ(x − q) = Σi (xi − qi)²
• Weighted Euclidean: D(x, q) = Σi mi (xi − qi)²
• Generalized ellipsoid distance: D(x, q) = (x − q)ᵀ M (x − q)
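The three formulas can be checked with a few lines of NumPy (a sketch; the function and variable names are ours, not the paper’s):

```python
import numpy as np

# Sketch of the three distance functions from the slide above.
def euclidean(x, q):
    d = x - q
    return float(d @ d)                 # (x - q)^T (x - q)

def weighted_euclidean(x, q, m):
    d = x - q
    return float((m * d) @ d)           # sum_i m_i (x_i - q_i)^2

def ellipsoid(x, q, M):
    d = x - q
    return float(d @ M @ d)             # (x - q)^T M (x - q)

x = np.array([1.0, 2.0])
q = np.array([0.0, 0.0])
m = np.array([2.0, 0.5])
M = np.array([[2.0, 0.5],
              [0.5, 1.0]])
```

Note that the ellipsoid distance generalizes the other two: with M equal to the identity it reduces to the Euclidean distance, and with M = diag(m) to the weighted Euclidean distance.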
Relevance Feedback
• A popular method in information retrieval (IR)
• The query is modified based on relevance judgments from the user
• Two major approaches:
  1. query-point movement
  2. re-weighting
Relevance Feedback: Query-point Movement
• The query point is moved towards “good” examples (Rocchio’s formula in IR)
(figure: Q0 is the initial query point; relevance judgments on the retrieved data move it to the new query point Q1)
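The query-point movement above can be sketched with the standard Rocchio update (the α/β/γ weights are the usual Rocchio parameters; the values chosen here are illustrative):

```python
import numpy as np

# Sketch of a Rocchio-style update: move the query point toward the mean
# of the relevant examples and away from the mean of the non-relevant ones.
def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    q1 = alpha * q0
    if len(relevant):
        q1 = q1 + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q1 = q1 - gamma * np.mean(nonrelevant, axis=0)
    return q1

q0 = np.array([0.0, 0.0])                     # Q0: initial query point
rel = np.array([[2.0, 2.0], [4.0, 2.0]])      # judged relevant
non = np.array([[-4.0, 0.0]])                 # judged non-relevant
q1 = rocchio(q0, rel, non)                    # Q1: moved toward the "good" side
```

Note that the update only translates the query point; the distance function itself stays fixed, which is exactly the limitation MindReader addresses.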
Relevance Feedback: Re-weighting
• The standard deviation method in the MARS (UIUC) image retrieval system
• Assumption: if the deviation of a feature is high, the feature is not important
• Each feature i is assigned the weight wi = 1/si
• MARS didn’t provide any justification for this formula
(figure: feature f2 has low spread among the good examples and is a “good” feature; f1 has high spread and is a “bad” feature; together they give the implied query)
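The standard-deviation heuristic is a one-liner in NumPy (a sketch; the sample values are made up to show one tight and one spread-out feature):

```python
import numpy as np

# Sketch of the MARS standard-deviation re-weighting: compute the spread
# s_i of each feature over the "good" examples and set w_i = 1 / s_i.
good = np.array([[1.0,  0.0],
                 [1.1,  5.0],
                 [0.9, -5.0]])          # feature 0 is tight, feature 1 is spread out
s = good.std(axis=0, ddof=1)            # per-feature standard deviation s_i
w = 1.0 / s                             # w_i = 1 / s_i
```

The tight feature ends up with the large weight, matching the assumption that low-spread features matter more.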
What’s New in MindReader?
MindReader
• does not use ad hoc heuristics (cf. Rocchio’s formula, re-weighting in MARS)
• can handle multiple levels of scores
• can derive a generalized ellipsoid distance
What’s New in MindReader?
MindReader can derive generalized ellipsoid distances
(figure: an arbitrarily oriented ellipsoid isosurface around the query point q)
Isosurfaces of Distance Functions
• Euclidean: Rocchio
• weighted Euclidean: MARS
• generalized ellipsoid distance: MindReader
(figure: the three isosurface shapes around the query point q)
Method: distance function
Generalized ellipsoid distance function
• D(x, q) = (x − q)ᵀ M (x − q), or
• D(x, q) = Σj Σk mjk (xj − qj)(xk − qk)
where
• q: query point vector
• x: data point vector
• M = [mjk]: symmetric distance matrix
Method: definitions
• N: number of samples
• n: number of dimensions (features)
• xi: n-d sample data vectors, xi = [xi1, …, xin]ᵀ
• X: N×n sample data matrix, X = [x1, …, xN]ᵀ
• v: N-d score vector, v = [v1, …, vN]
Method: problem formulation
Given
• N sample n-d vectors
• multiple-level scores (optional)
Estimate
• the optimal distance matrix M
• the optimal new query point q
Method: optimality
• How do we measure “optimality”? Minimization of a “penalty”
• What is the “penalty”? The score-weighted sum of distances between the query point and the sample vectors
• Therefore: minimize Σi vi (xi − q)ᵀ M (xi − q) under the constraint det(M) = 1
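Written out, the constrained problem and its Lagrangian look as follows (a sketch in the slide’s notation; λ is the multiplier for the determinant constraint, which rules out the trivial solution M = 0):

```latex
\min_{\mathbf{q},\,M}\ \sum_{i=1}^{N} v_i\,(\mathbf{x}_i-\mathbf{q})^{\mathsf T} M\,(\mathbf{x}_i-\mathbf{q})
\quad \text{subject to} \quad \det(M)=1,
\qquad
\mathcal{L} \;=\; \sum_{i=1}^{N} v_i\,(\mathbf{x}_i-\mathbf{q})^{\mathsf T} M\,(\mathbf{x}_i-\mathbf{q})
\;-\; \lambda\,\bigl(\det(M)-1\bigr).
```

Setting the derivatives of L with respect to q and M to zero yields the closed-form solutions stated as Theorems 1 and 2.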
Theorems: theorem 1
• Solved with Lagrange multipliers
• Theorem 1 (optimal query point): q = x̄ = [x̄1, …, x̄n]ᵀ = Xᵀv / Σi vi
• The optimal query point is the weighted average of the sample data vectors
Theorems: theorems 2 & 3
• Theorem 2 (optimal distance matrix): M = (det(C))^(1/n) C⁻¹, where C = [cjk] is the weighted covariance matrix with cjk = Σi vi (xij − x̄j)(xik − x̄k)
• Theorem 3: if we restrict M to a diagonal matrix, our method equals the standard deviation method
• MindReader includes MARS!
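Theorems 1 and 2 translate directly into a few lines of NumPy (a sketch; the function name and the toy data are ours):

```python
import numpy as np

# Sketch of Theorems 1 & 2: given the N x n sample matrix X and the
# score vector v, compute the weighted-average query point q and the
# optimal distance matrix M.
def mindreader_estimate(X, v):
    v = np.asarray(v, dtype=float)
    q = (X.T @ v) / v.sum()                   # Theorem 1: q = X^T v / sum_i v_i
    D = X - q
    C = (v[:, None] * D).T @ D                # c_jk = sum_i v_i (x_ij - q_j)(x_ik - q_k)
    n = X.shape[1]
    M = np.linalg.det(C) ** (1.0 / n) * np.linalg.inv(C)   # Theorem 2
    return q, M

# Toy example: four samples stretched along the main diagonal.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.1], [3.0, 2.9]])
v = np.ones(4)
q, M = mindreader_estimate(X, v)
```

By construction det(M) = 1 (the constraint from the problem formulation), and for these diagonally stretched samples M has nonzero off-diagonal entries, i.e. a “diagonal query” that no axis-aligned weighting could express.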
Experiments
1. Estimation of the optimal distance function
  • Can MindReader estimate the hidden target distance matrix Mhidden appropriately?
  • Based on synthetic data
  • Comparison with the standard deviation method
2. Query-point movement
3. Application to real data sets
  • GIS data
Experiment 1: target data Two-dimensional normal distribution
Experiment 1: idea
• Assume that the user has a “hidden” distance Mhidden in his mind
• Simulate iterative query refinement
• Q: How fast can we discover the “hidden” distance?
• The query point is fixed at (0, 0)
Experiment 1: iteration steps
1. Make initial samples: compute the k-NNs with the Euclidean distance
2. For each object x, calculate its score reflecting the hidden distance Mhidden
3. MindReader estimates the matrix M
4. Retrieve the k-NNs with the derived matrix M
5. If the result has improved, go to step 2
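The iteration above can be sketched as a small simulation (all names, the score function 1/(1+d), and the data sizes are our illustrative choices; only the loop structure follows the slide):

```python
import numpy as np

# Sketch of the simulated refinement loop (steps 1-5 above).
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))              # synthetic 2-d data set
M_hidden = np.array([[2.0, 1.0], [1.0, 2.0]]) # the user's "hidden" distance
q = np.zeros(2)                               # query point fixed at (0, 0)

def dists(X, M):
    D = X - q
    return np.einsum('ij,jk,ik->i', D, M, D)  # (x - q)^T M (x - q) per row

M_est = np.eye(2)                             # step 1: start with Euclidean
for _ in range(5):
    knn = data[np.argsort(dists(data, M_est))[:50]]  # retrieve the 50-NNs
    v = 1.0 / (1.0 + dists(knn, M_hidden))    # step 2: scores reflect M_hidden
    D = knn - q                               # deviations from the fixed q
    C = (v[:, None] * D).T @ D                # weighted covariance matrix
    M_est = np.linalg.det(C) ** 0.5 * np.linalg.inv(C)  # step 3 (n = 2)
```

Each pass re-estimates M from the scored neighbors and re-retrieves with it, so the estimated ellipsoid drifts toward the hidden one over the iterations.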