Data Frame Augmentation of Free Form Queries for Constraint Based Document Filtering
Andrew Zitzelberger
Queries
Test Queries
1) Find me a Wii game.
2) Find me a Honda for under 15 thousand dollars.
3) Roller Coaster more than 150 feet high
4) mountains at least 15K feet
5) games under $25
6) mountains less than 4 km
7) ps games < $40
8) coasters longer than 1000 feet
9) car for under 5 grand newer than 1990 with less than 115K miles
10) more than 15K miles under 5 grand newer than 2004
Keywords + Semantics
• Semantic queries are computationally expensive
• Keyword queries are fast and simple
• People are used to keyword queries
• Synergistic solution:
  • extract numerical constraints from the query
  • use keywords to quickly narrow the search space
  • use constraints as a filter
Data Frames
Price
  internal representation: Double
  external representation: \$[1-9]\d{0,2}(,\d{3})*|...
  ...
  right units: (K)?\s*(cents|dollars|[Gg]rand|...)
  canonicalization method: toUSDollars
  comparison methods:
    LessThan(p1: Price, p2: Price) returns (Boolean)
    external representation: (less than|<|under|...)\s*{p2}|...
  ...
end
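To make the Price data frame concrete, here is a minimal Python sketch of its three parts: a value recognizer, a canonicalization step, and a LessThan recognizer. It is an illustration under assumptions, not the presenter's implementation. The names `to_us_dollars` and `price_upper_bounds` are hypothetical, the regular expressions are simplified, and only the alternatives spelled out on the slide are covered (the elided "..." alternatives are left out).

```python
import re

# Simplified, hypothetical Price recognizer: a "$" amount, or a number followed
# by an optional K and one of the units named on the slide.
AMOUNT = r"\d+(?:,\d{3})*(?:\.\d+)?"
PRICE = rf"(?:\${AMOUNT}|{AMOUNT}\s*K?\s*(?:cents|dollars|[Gg]rand))"

# LessThan external representation: an operator phrase followed by a price.
LESS_THAN = re.compile(rf"(?:less than|<|under)\s*({PRICE})")

# Splits a recognized price into amount, optional K, and optional unit.
PARTS = re.compile(rf"\$?({AMOUNT})\s*(K)?\s*(cents|dollars|[Gg]rand)?")


def to_us_dollars(text):
    """Canonicalize a recognized price string to a float number of dollars."""
    m = PARTS.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"not a recognized price: {text!r}")
    amount = float(m.group(1).replace(",", ""))
    if m.group(2):                      # "K" suffix means thousands
        amount *= 1000
    unit = (m.group(3) or "dollars").lower()
    if unit == "grand":
        amount *= 1000
    elif unit == "cents":
        amount /= 100
    return amount


def price_upper_bounds(query):
    """Return the canonical p2 value of every LessThan(p1, p2) phrase found."""
    return [to_us_dollars(m.group(1)) for m in LESS_THAN.finditer(query)]


print(price_upper_bounds("car under 6 grand newer than 1990 with less than 115K miles"))
```

Because the price pattern requires either a "$" or a price unit, "less than 115K miles" is not mistaken for a price constraint in this sketch.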
Free Form Query
• Car under 6 grand newer than 1990 with less than 115K miles
Step 1: Condition Extraction
• Car under 6 grand newer than 1990 with less than 115K miles
• Extracted Conditions
  • (Price < 6000)
  • (Year > 1990)
  • (Distance < 115000)
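The extraction step can be sketched in Python as a small table of rules, one per data frame. This is a hedged illustration only: the rule table `RULES`, the unit factors, and the function `extract_conditions` are assumptions made for this example, and the patterns cover just enough phrasing to handle the query on this slide.

```python
import re

# Hypothetical, minimal recognizers; real data frames would carry many more
# alternatives for values, units, and operator phrases.
NUM = r"(?P<num>\d+(?:,\d{3})*(?:\.\d+)?)\s*(?P<k>K)?"
LESS = r"(?:less than|under|<)"

RULES = [
    # (attribute, operator, pattern)
    ("Price", "<", rf"{LESS}\s*{NUM}\s*(?P<unit>grand|dollars)"),
    ("Year", ">", r"newer than\s*(?P<num>\d{4})\b"),
    ("Distance", "<", rf"{LESS}\s*{NUM}\s*(?P<unit>miles)"),
]

# Multiplier that converts each unit to the canonical one (dollars, miles).
UNIT_FACTOR = {"grand": 1000.0, "dollars": 1.0, "miles": 1.0}


def extract_conditions(query):
    """Return (attribute, operator, canonical value) for every rule match."""
    conditions = []
    for attribute, op, pattern in RULES:
        for m in re.finditer(pattern, query, flags=re.IGNORECASE):
            groups = m.groupdict()
            value = float(groups["num"].replace(",", ""))
            if groups.get("k"):         # "K" suffix means thousands
                value *= 1000
            unit = (groups.get("unit") or "").lower()
            value *= UNIT_FACTOR.get(unit, 1.0)
            conditions.append((attribute, op, value))
    return conditions


query = "Car under 6 grand newer than 1990 with less than 115K miles"
for condition in extract_conditions(query):
    print(condition)
```

For the example query this yields Price < 6000, Year > 1990, and Distance < 115000, matching the conditions listed on the slide.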
Step 2: Remove Condition Values
• Car under newer than with less than
Step 3: Remove Stopwords
• Car
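Steps 2 and 3 together reduce the free form query to its keywords. A minimal Python sketch follows. It assumes the condition values ("6 grand", "1990", "115K miles") have already been identified in Step 1, and it uses a small hypothetical stopword list, since the slides do not say which list was used.

```python
# Hypothetical stopword list, chosen only to cover the example queries.
STOPWORDS = {
    "a", "at", "find", "for", "least", "less", "me", "more",
    "newer", "than", "under", "with",
}


def reduce_query(query, condition_values):
    """Drop the condition values, then drop stopwords.

    Returns both intermediate forms so each step can be inspected.
    """
    # Step 2: remove each condition value once, left to right.
    for value in condition_values:
        query = query.replace(value, " ", 1)
    without_values = " ".join(query.split())

    # Step 3: remove stopwords (case-insensitive), keeping word order.
    keywords = " ".join(
        word for word in without_values.split() if word.lower() not in STOPWORDS
    )
    return without_values, keywords


without_values, keywords = reduce_query(
    "Car under 6 grand newer than 1990 with less than 115K miles",
    ["6 grand", "1990", "115K miles"],
)
print(without_values)  # Car under newer than with less than
print(keywords)        # Car
```

The operator phrases ("under", "newer than", "less than") survive Step 2 and are only removed in Step 3 because they appear in the stopword list.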
Step 5: Filter Document on Constraints
• Keep page if every constraint is satisfied by at least one extracted value
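The filtering rule can be written as an all/any test. The Python sketch below is an assumption about representation, not the presenter's code: constraints are (attribute, operator, bound) tuples, and each page is a hypothetical dictionary mapping an attribute to the values extracted from that page.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def keep_page(constraints, extracted):
    """Keep the page if every constraint is met by at least one extracted value."""
    return all(
        any(OPS[op](value, bound) for value in extracted.get(attribute, []))
        for attribute, op, bound in constraints
    )


constraints = [("Price", "<", 6000), ("Year", ">", 1990), ("Distance", "<", 115000)]

# Hypothetical extracted values for two pages.
page_a = {"Price": [5500], "Year": [1995], "Distance": [98000]}
page_b = {"Price": [7000, 5000], "Year": [1985], "Distance": [120000]}

print(keep_page(constraints, page_a))  # True
print(keep_page(constraints, page_b))  # False
```

Because a single matching value is enough, any unrelated number on a page that happens to satisfy a constraint will let the page through, and a page with no extracted value for an attribute is always rejected.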
Experimental Setup
• 300 web documents
  • 100 car+trucks pages from http://provo.craigslist.org
  • 100 video gaming pages from http://provo.craigslist.org
  • 50 mountain pages from http://en.wikipedia.org
  • 50 roller coaster pages from http://en.wikipedia.org
• 10 queries
  • 8 with usable conditions
• 2 data sets
  • test-development
  • blind test
Results Summary
Precision@3 by query type:
Query Set | Keyword Queries | Reduced Queries | Data Frame Augmented Queries
Dev-Test Queries | 33% | 40% | 60%
Blind-Test Queries | 50% | 46% | 63%
Overall | 42% | 43% | 62%
• Precision increase for 56% of queries
  • 75% for test-dev, 50% for blind-test
• Precision never worse than keyword query
• Most effective for short, focused documents
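For reference, Precision@3 is the fraction of the top three returned documents that are relevant, which is why per-query scores take the values 0.00, 0.33, 0.67, and 1.00. The Python sketch below shows the calculation and averages the Data Frame Augmentation scores from the test-dev results table. The function name `precision_at_k` is an assumption for this example.

```python
def precision_at_k(relevance, k=3):
    """Fraction of the top-k results judged relevant.

    `relevance` is a ranked list of booleans, best-ranked result first.
    """
    return sum(1 for is_relevant in relevance[:k] if is_relevant) / k


# Two of the top three results relevant.
print(round(precision_at_k([True, False, True]), 2))  # 0.67

# Per-query Data Frame Augmentation scores from the test-dev results table.
dfa_test_dev = [0.33, 1.00, 0.67, 1.00, 0.67, 0.33, 0.33, 1.00, 0.67, 0.00]
mean_precision = sum(dfa_test_dev) / len(dfa_test_dev)
print(round(mean_precision, 2))  # 0.6
```

The mean agrees with the 60% reported for data frame augmented queries on the dev-test set.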
Discussion
• Issues:
  • inadequate narrowing or ranking of search space
  • noise caused by other numbers
    • Distance < 115000
Future Work
• Scalability
• Indexing data frame extracted terms
• Precision vs Recall trade-offs
• Pay-as-you-go search construction
Related Work
• Question-Answering Systems
• Keyword search over databases and semantic stores
Results (Test-Dev Set)
Query | Keyword | Condition Removed Keyword | Data Frame Augmentation
Find me a Wii game. | 0.33 | 0.33 | 0.33
Find me a Honda for under 15 thousand dollars. | 0.67 | 1.00 | 1.00
roller coaster more than 150ft high | 0.33 | 0.33 | 0.67
mountains at least 15K ft | 1.00 | 0.67 | 1.00
games under $25 | 0.00 | 0.33 | 0.67
mountains less than 4 km | 0.00 | 0.00 | 0.33
ps games < 40 bucks | 0.33 | 0.00 | 0.33
coasters longer than 1000 feet | 0.33 | 1.00 | 1.00
car for under 6 grand newer than 1990 with less than 115K miles | 0.33 | 0.33 | 0.67
more than 15K miles under 10 grand newer than 2000 | 0.00 | 0.00 | 0.00
Results (Blind Test Set)
Query | Keyword | Condition Removed Keyword | Data Frame Augmentation
Find me a Wii game. | 0.67 | 0.67 | 0.67
Find me a Honda for under 15 thousand dollars. | 0.67 | 1.00 | 1.00
roller coaster more than 150ft high | 0.67 | 0.67 | 0.67
mountains at least 5K ft | 0.33 | 0.33 | 0.67
games under $25 | 0.67 | 0.67 | 1.00
mountains less than 4 km | 0.00 | 0.00 | 0.00
ps games < 40 bucks | 0.33 | 0.33 | 0.33
coasters longer than 1000 feet | 0.67 | 0.67 | 0.67
car for under 6 grand newer than 1990 with less than 115K miles | 0.67 | 0.00 | 1.00
more than 15K miles under 10 grand newer than 2000 | 0.33 | 0.33 | 0.33