How to Make Manual Conjunctive Normal Form Queries Work in Patent Search
Le Zhao and Jamie Callan
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Technology Survey Task @ Chem
• Document Collection
  • 1.3 million patents + 0.18 million scientific articles
  • Documents tend to be long and have XML field structure
• Topics
  • 6 topics (last year only 2 groups submitted runs; not reusable)
  • About use/detection of chemicals (in certain applications)
  • Similar to ad hoc retrieval queries
Example Topic: TS-20
<title>tests for HCG hormone</title>
<narrative>The hormone Human Chorionic Gonadotrophin (HCG) is produced when a women becomes pregnant. Tests are usually carried out by analysing blood or urine. We are looking for articles and patents on these pregnancy test kits or the chemical tests used to produce them.</narrative>
<details>
  <chemicals>Human Chorionic Gonadotrophin OR HCG</chemicals>
  <condition>pregnancy</condition>
  <target>Human Chorionic Gonadotrophin OR HCG</target>
</details>
Our Runs
• Automatic Queries
  • Unweighted bag-of-words baseline
  • Weighting and combining words from different query fields
• Manual Queries
  • Interactive search using Boolean CNF queries
  • (test OR check OR detection OR detect) AND (HCG OR “Human Chorionic Gonadotrophin” OR “Chorionic Gonadotropin” OR Choriogonadotropin OR Choriogonin)
  • Effective; used by lawyers, librarians, and medical searchers
  • Thesaurus & interaction: identify synonyms with MeSH etc. thesauri; check top ranked results in the Lemur CGI
  • 0.5 hours per topic
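To make the Boolean CNF semantics concrete, here is a minimal Python sketch of how such a query can be matched against a document: a document passes only if every AND-ed clause has at least one OR-ed alternative present. The function names (`tokenize`, `contains_phrase`, `cnf_matches`) and the two sample documents are hypothetical illustrations, not part of the Lemur toolkit or of the runs described here, and the tokenization is a deliberately simple assumption.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(tokens, phrase_tokens):
    """Return True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1))

def cnf_matches(cnf_query, text):
    """A document matches when every clause (AND) has at least one alternative (OR) present."""
    tokens = tokenize(text)
    return all(
        any(contains_phrase(tokens, tokenize(alt)) for alt in clause)
        for clause in cnf_query
    )

# The TS-20 manual query from the slide: one inner list per AND-ed clause.
QUERY = [
    ["test", "check", "detection", "detect"],
    ["HCG", "Human Chorionic Gonadotrophin", "Chorionic Gonadotropin",
     "Choriogonadotropin", "Choriogonin"],
]

# Hypothetical documents, invented for illustration only.
doc_a = "A urine test strip for human chorionic gonadotrophin."
doc_b = "Synthesis of chorionic gonadotropin analogues."

print(cnf_matches(QUERY, doc_a))  # both clauses satisfied
print(cnf_matches(QUERY, doc_b))  # no term from the first clause
```

The second document mentions the hormone but no testing term, so the conjunction rejects it. A real retrieval system would also rank the matching documents; this sketch only shows the Boolean filter.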
Results at Large (xinfAP)
• Not much difference on average
• Worst manual queries have reasonable AP
• Manual queries lower some high AP topics slightly
Figure credit: Mihai Lupu
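For readers less familiar with AP, the following sketch computes plain (uninterpolated) average precision for one topic. It is not the xinfAP measure reported above, which is understood here as an estimate of AP from sampled relevance judgments; this version assumes complete judgments and a ranking without duplicate documents. The function name and the document ids are hypothetical.

```python
def average_precision(ranking, relevant):
    """Plain AP: mean of precision-at-rank over the ranks where relevant documents appear.

    Relevant documents that are never retrieved contribute zero.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

# Hypothetical ranking: relevant documents at ranks 1 and 3.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
print(round(ap, 4))  # (1/1 + 2/3) / 2 = 0.8333
```

Because every missed relevant document adds a zero to the mean, a query that loses recall at lower ranks can show a lower AP even when its top results look good.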
Observations
• Weighting different query fields helped.
• Boolean CNF query (manual interaction)
  • Good
    • Expressive
    • Helps a lot for hard (low AP) queries
  • Bad
    • Takes time & care to create & interact
    • Manual errors in formulating those queries
    • Phrase or window restrictions improve top precision, but destroy lower-level recall/precision
    • This damage is difficult to identify from the top ranks; new tools needed
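The trade-off with window restrictions can be illustrated with a small sketch: a plain AND only requires both terms somewhere in the document, while an unordered window also requires them to be close together, so it rejects documents that a plain AND would keep. The helper names, the window width of 5, and the two sample sentences are assumptions made for illustration; they do not reproduce the operators of any particular search engine.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def both_present(tokens, a, b):
    """Plain Boolean AND: both terms occur anywhere in the document."""
    return a in tokens and b in tokens

def within_window(tokens, a, b, width):
    """Unordered window: some occurrence of a and of b fit inside `width` consecutive tokens."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) < width for i in pos_a for j in pos_b)

# Hypothetical documents, invented for illustration only.
near = tokenize("A test for HCG in urine samples.")
far = tokenize("The test kit is described in section two and is sold "
               "to clinics where HCG levels are monitored.")

print(both_present(near, "test", "hcg"), within_window(near, "test", "hcg", 5))
print(both_present(far, "test", "hcg"), within_window(far, "test", "hcg", 5))
```

The second document satisfies the plain AND but fails the window. If such documents are relevant, the restriction removes them from the result list entirely, which is the kind of recall loss that inspecting only the top ranks will not reveal.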
Comparisons with Best Runs
• Fraunhofer-SCAI
  • Semantic search (similar to our CNF queries)
  • IPC classification filtering
  • Document-field-based term weighting
• Topics where our manual queries did better
  • TS-22: detect => detection, test, predict, check, determine, determination
  • TS-29: minimum inhibitory concentration => …
  • Expanded all terms, but not all resulted in
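As a sketch of how a term expansion such as the TS-22 one turns into a CNF clause, the snippet below joins a term and its expansions with OR and combines clauses with AND. The helper names `to_clause` and `to_cnf` are hypothetical, the quoting rule for multi-word alternatives is an assumption, and the output is a generic Boolean string rather than the syntax of a specific query language.

```python
def to_clause(term, expansions):
    """Build one OR-clause from a query term and its manually chosen expansions."""
    alternatives = [term] + [e for e in expansions if e != term]
    # Assumption: multi-word alternatives are quoted so they read as phrases.
    quoted = [f'"{a}"' if " " in a else a for a in alternatives]
    return "(" + " OR ".join(quoted) + ")"

def to_cnf(clauses):
    """Join OR-clauses into a conjunction."""
    return " AND ".join(clauses)

# Expansion of "detect" as listed for TS-22 on the slide.
ts22_detect = to_clause(
    "detect",
    ["detection", "test", "predict", "check", "determine", "determination"],
)
print(ts22_detect)
# (detect OR detection OR test OR predict OR check OR determine OR determination)
```

Each additional clause produced this way narrows the result set, while each additional alternative inside a clause widens it.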
Thanks
• Thanks to track organizers
• NSF grant IIS-1018317
• Questions?