Search for Approximate Matches in Large Databases. Eugene Fink, Jaime Carbonell, Aaron Goldstein, Philip Hayes
Motivation: Fast identification of approximate matches in large sets of records. Applications: • Medical databases • Customer records • National security
Outline: • Records and queries • Search for matches • Experimental results
Table of records: We specify a table of records by a list of attributes. Example: We can describe patients in a hospital by their sex, age, and diagnosis.
Records and queries: A record includes a specific value for each attribute. A query may include lists of values and numeric ranges. Example: Record (Sex: female; Age: 30; Dx: asthma). Query (Sex: male, female; Age: 20..40; Dx: asthma, flu).
Query types: A point query includes a specific value for each attribute. A region query includes lists of values or numeric ranges. Example: Point query (Sex: female; Age: 30; Dx: asthma). Region query (Sex: male, female; Age: 20..40; Dx: asthma, flu).
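The record and query examples above can be written down as plain data. The following is a minimal sketch, not the authors' implementation: the dictionary layout, the use of sets for value lists and (low, high) tuples for numeric ranges, and the helper name is_point_query are all assumptions made for illustration.

```python
# Hypothetical representation: a record maps each attribute to one value.
record = {"sex": "female", "age": 30, "dx": "asthma"}

# A query maps each attribute to a constraint: a set of allowed values,
# or an inclusive numeric range written as a (low, high) tuple.
region_query = {
    "sex": {"male", "female"},
    "age": (20, 40),
    "dx": {"asthma", "flu"},
}

point_query = {
    "sex": {"female"},
    "age": (30, 30),
    "dx": {"asthma"},
}


def is_point_query(query):
    """Return True when every constraint admits exactly one value."""
    for constraint in query.values():
        if isinstance(constraint, tuple):
            low, high = constraint
            if low != high:
                return False
        elif len(constraint) != 1:
            return False
    return True
```

In this layout a point query is just a region query whose every constraint has collapsed to a single value, so one search routine can serve both.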
Exact matches: A record is an exact match for a query if every value in the record belongs to the respective range in the query. [Figure: a record and a query region shown on Sex, Age, and Dx axes]
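The exact-match definition translates directly into a membership test per attribute. This is a minimal sketch under assumed conventions: value lists are Python sets, numeric ranges are inclusive (low, high) tuples, and the function names are hypothetical.

```python
def in_constraint(value, constraint):
    """Check one attribute value against a value set or an inclusive range."""
    if isinstance(constraint, tuple):
        low, high = constraint
        return low <= value <= high
    return value in constraint


def is_exact_match(record, query):
    """A record matches exactly when every attribute satisfies its constraint."""
    return all(in_constraint(record[attr], c) for attr, c in query.items())


# The record and region query from the earlier example slide.
record = {"sex": "female", "age": 30, "dx": "asthma"}
query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
print(is_exact_match(record, query))
```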
Approximate matches: A record is an approximate match for a query if it is “close” to the query region. [Figure: a record near a query region shown on Sex, Age, and Dx axes]
Approximate queries: An approximate query includes: • Point or region • Distance function • Number of matches • Distance limit
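The four parts of an approximate query can be illustrated with a brute-force scan. The slides do not say which distance function is used, so the one below is an assumption: zero inside the constraint, the numeric gap for ranges, and 1 for a mismatched nominal value, summed over attributes. All names are hypothetical.

```python
import heapq


def attribute_distance(value, constraint):
    """Assumed per-attribute distance to a value set or inclusive range."""
    if isinstance(constraint, tuple):
        low, high = constraint
        if value < low:
            return low - value
        if value > high:
            return value - high
        return 0
    return 0 if value in constraint else 1


def distance(record, query):
    """Assumed distance function: sum of per-attribute distances."""
    return sum(attribute_distance(record[a], c) for a, c in query.items())


def approximate_matches(records, query, num_matches, distance_limit):
    """Return up to num_matches (distance, record) pairs within the limit,
    closest first, by scanning every record."""
    scored = [(distance(r, query), i) for i, r in enumerate(records)]
    within = [(d, i) for d, i in scored if d <= distance_limit]
    return [(d, records[i]) for d, i in heapq.nsmallest(num_matches, within)]


records = [
    {"sex": "female", "age": 30, "dx": "fracture"},
    {"sex": "female", "age": 50, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "ulcer"},
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 40, "dx": "flu"},
]
point = {"sex": {"female"}, "age": (30, 30), "dx": {"asthma"}}
matches = approximate_matches(records, point, num_matches=3, distance_limit=5)
```

A linear scan like this touches every record; it only serves to pin down what the query asks for.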
Outline: • Records and queries • Search for matches • Experimental results
Indexing structure: • Maintain a PATRICIA tree of records • Group nodes into fixed-size disk blocks [Figure: tree with levels sex, age, and diagnosis, indexing six records: (female, 30, asthma), (female, 30, ulcer), (female, 30, fracture), (female, 50, flu), (male, 30, asthma), (male, 40, flu)]
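The tree in the figure branches on one attribute per level. The sketch below builds that shape as nested dictionaries. It is a simplification, not the authors' structure: it is a plain trie with no PATRICIA path compression and no grouping of nodes into disk blocks, and the names ATTRIBUTES and build_index are hypothetical.

```python
# Attribute order used for the tree levels, as in the figure.
ATTRIBUTES = ("sex", "age", "dx")


def build_index(records):
    """Build nested dicts, one level per attribute; leaves are record lists."""
    root = {}
    for record in records:
        node = root
        for attr in ATTRIBUTES[:-1]:
            node = node.setdefault(record[attr], {})
        node.setdefault(record[ATTRIBUTES[-1]], []).append(record)
    return root


records = [
    {"sex": "female", "age": 30, "dx": "fracture"},
    {"sex": "female", "age": 50, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "ulcer"},
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 40, "dx": "flu"},
]
index = build_index(records)
```

Records that share a prefix of attribute values share a path, which is what lets a search discard a whole subtree after one comparison.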
Search for matches: • Depth-first search for exact matches • Best-first search for approximate matches [Figure: the same tree with levels sex, age, and diagnosis over the six example records]
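Both search strategies can be sketched on a small attribute tree. This is an illustrative sketch only: the tree is a plain nested-dictionary trie rather than a blocked PATRICIA tree, records are (sex, age, dx) tuples, and the distance (numeric gap for ranges, 1 for a nominal mismatch, summed) is an assumption, as are all function names.

```python
import heapq
from itertools import count


def build_index(records):
    """Nested dicts, one level per attribute; leaves are lists of records."""
    root = {}
    for rec in records:
        node = root
        for value in rec[:-1]:
            node = node.setdefault(value, {})
        node.setdefault(rec[-1], []).append(rec)
    return root


def gap(value, constraint):
    """Assumed per-attribute distance to a value set or inclusive range."""
    if isinstance(constraint, tuple):
        low, high = constraint
        return max(low - value, value - high, 0)
    return 0 if value in constraint else 1


def exact_search(node, query, depth=0):
    """Depth-first: descend only into branches that satisfy the query."""
    if depth == len(query):
        return list(node)
    found = []
    for value, child in node.items():
        if gap(value, query[depth]) == 0:
            found.extend(exact_search(child, query, depth + 1))
    return found


def approximate_search(root, query, num_matches, distance_limit):
    """Best-first: always expand the node with the smallest distance so far."""
    tie = count()  # tie-breaker so the heap never has to compare nodes
    frontier = [(0, next(tie), 0, root)]
    results = []
    while frontier and len(results) < num_matches:
        dist, _, depth, node = heapq.heappop(frontier)
        if depth == len(query):  # leaf: dist is the full record distance
            results.extend((dist, rec) for rec in node[: num_matches - len(results)])
            continue
        for value, child in node.items():
            d = dist + gap(value, query[depth])
            if d <= distance_limit:  # prune branches already past the limit
                heapq.heappush(frontier, (d, next(tie), depth + 1, child))
    return results


records = [
    ("female", 30, "fracture"), ("female", 50, "flu"), ("female", 30, "ulcer"),
    ("female", 30, "asthma"), ("male", 30, "asthma"), ("male", 40, "flu"),
]
index = build_index(records)
region = ({"male", "female"}, (20, 40), {"asthma", "flu"})
point = ({"female"}, (30, 30), {"asthma"})
```

Because each attribute adds a non-negative amount, the distance accumulated along a path never exceeds the distance of any record below it, so leaves come off the queue in non-decreasing order of distance and the search can stop as soon as enough matches are found.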
Outline: • Records and queries • Search for matches • Experimental results
Performance: Experiments with a database of all patients admitted to Massachusetts hospitals from October 2000 to September 2002: • Twenty-one attributes • 1.6 million records. Use of a Pentium computer: • 2.4 GHz CPU • 1 GByte memory • 400 MHz bus
Variables Control variables: • Number of records • Memory size • Query type Measurements: • Retrieval time
Small memory: Number of records: 100 to 1,672,016. Memory size: 4 MByte. [Plot: retrieval time (msec, 1 to 100) versus number of records (10^2 to 10^6); curves for approximate queries, range queries, and exact queries; growth-rate annotations n^0.5, n^0.15, and lg n; a marker for available memory]
Large memory: Number of records: 1,672,016. Memory size: 64 to 1,024 MByte. [Plot: retrieval time (msec, 1 to 10,000) versus memory size (64 to 1,024 MBytes); curves for approximate queries, range queries, and exact queries]
Summary: • Retrieval time grows as a fractional power (about 0.5) of the database size. • If we extrapolate this growth rate, retrieval times are reasonable for very large databases.
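The extrapolation can be made concrete with a one-line calculation. The sketch below assumes retrieval time is proportional to n raised to the reported exponent of about 0.5 and computes only the relative slowdown; it does not reproduce any measured times, and the function name is hypothetical.

```python
GROWTH_EXPONENT = 0.5  # approximate exponent reported in the summary


def time_scale(size_ratio, exponent=GROWTH_EXPONENT):
    """Factor by which retrieval time grows when the database grows by
    size_ratio, assuming time is proportional to n ** exponent."""
    return size_ratio ** exponent


# Under this assumption, a database 100 times larger takes about
# 10 times longer per query, rather than 100 times longer.
print(time_scale(100))
```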