Using Physical Quantities to Find Similar Documents John Tait johntait.net Ltd. john@johntait.net
Overview • The Problem • Finding relevant physical quantities in documents • Some Solutions • Concluding Remarks
The Problem • Many searches in Business Intelligence areas like Technology Landscaping, or Techno-Legal Areas like Freedom-to-Operate searches, involve searching documents which mention physical quantities expressed in units like metres, kilograms and degrees centigrade • Document Similarity Searching (including Boolean) based on bag-of-words models works badly for this sort of query
Example Search 1 • Are there any enforceable patents related to manufacturing processes using trifluoromethanesulfonate as a reagent at approximately 22°C? • Relevant documents include those with temperatures expressed in K and °F • Note also the implied range
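A minimal sketch of what this example asks of a search system: temperatures in °C, °F and K are converted to one common unit (kelvin here) and "approximately" is treated as a tolerance band. The function names and the 5 K default tolerance are assumptions made for illustration, not part of any system mentioned in these slides.

```python
def to_kelvin(value, unit):
    """Convert a temperature in 'C', 'F' or 'K' to kelvin."""
    if unit == "C":
        return value + 273.15
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0 + 273.15
    if unit == "K":
        return float(value)
    raise ValueError(f"unsupported temperature unit: {unit}")


def matches_approximately(query, candidate, tolerance_k=5.0):
    """Return True if two (value, unit) temperatures lie within tolerance_k kelvin.

    tolerance_k is an assumed width for the implied range; a real searcher
    would choose it per query.
    """
    return abs(to_kelvin(*query) - to_kelvin(*candidate)) <= tolerance_k


query = (22, "C")
print(matches_approximately(query, (295.15, "K")))  # same temperature in kelvin
print(matches_approximately(query, (71.6, "F")))    # same temperature in Fahrenheit
print(matches_approximately(query, (100, "C")))     # well outside the band
```

Comparing in a single base unit is what lets a document reporting 295 K or 72 °F match a query phrased in °C, which a plain keyword match on "22°C" cannot do.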
Example Search 2 • Have we any internal test documents which report a torque in excess of 1000 ft lbf for an electric motor suitable for installation in a car? • Relevant documents include documents reporting N m, possibly also bhp, kW, etc.
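The unit arithmetic behind this example can be sketched as follows. Torque in ft lbf converts to N m by a fixed factor. Power figures (kW, bhp) are a different physical quantity and relate to torque only through rotational speed (power = torque × angular velocity), so a document that quotes power alone would also need an rpm value before it can be compared. The function names are hypothetical.

```python
import math

# One foot (0.3048 m) times one pound-force (4.4482216152605 N).
FT_LBF_TO_NM = 0.3048 * 4.4482216152605


def torque_to_nm(value, unit):
    """Convert a torque in 'ft lbf' or 'N m' to newton metres."""
    if unit == "N m":
        return float(value)
    if unit == "ft lbf":
        return value * FT_LBF_TO_NM
    raise ValueError(f"unsupported torque unit: {unit}")


def torque_from_power_nm(power_kw, rpm):
    """Torque in N m implied by a power in kW at a given rotational speed."""
    omega = rpm * 2.0 * math.pi / 60.0  # angular velocity in rad/s
    return power_kw * 1000.0 / omega


threshold_nm = torque_to_nm(1000, "ft lbf")
print(round(threshold_nm, 1))                       # query threshold in N m
print(torque_from_power_nm(150, 1000) > threshold_nm)
```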
Four Solutions • Based on a survey in the LinkedIn Group “Information Access and Search Professional” • Thanks to: Mathew Kesler, Helmut Berger, Seth Grimes, Kevin Watters, Gerard DuPont, Christopher Frenz, Robert Peterson, Marat Shaidulatov
Solution 1: Synonym Query Expansion • Use a comprehensive list of units (e.g. http://www.unc.edu/~rowlett/units/index.html ) to identify synonyms for the units in the search specification • Then use e.g. Boolean searching and manual result set refinement to obtain a suitable result set • Probably effective, but heavy on searcher effort
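As a rough illustration of synonym query expansion, the sketch below builds a Boolean query string from a small hand-made synonym table. The table entries, the `expand_units` name and the query syntax are assumptions for the example; in practice the synonyms would come from a comprehensive unit list and the syntax from the search system in use.

```python
# Tiny illustrative synonym table; a real one would be far larger.
UNIT_SYNONYMS = {
    "newton metre": ["N m", "Nm", "newton metre", "newton meter"],
    "foot pound-force": ["ft lbf", "ft-lb", "foot-pound"],
    "celsius": ["°C", "deg C", "degrees Celsius", "degrees centigrade"],
}


def expand_units(term, unit_key):
    """Build a Boolean query: the term AND any written variant of the unit."""
    variants = UNIT_SYNONYMS[unit_key]
    alternatives = " OR ".join(f'"{v}"' for v in variants)
    return f"{term} AND ({alternatives})"


print(expand_units("torque", "newton metre"))
```

This expands unit names only. Numeric values are not converted, so the searcher still has to translate the value into each unit system and refine the result set by hand, which is where the effort goes.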
Solution 2: System with faceting • Use physical quantities as a facet • Ensure documents contain suitable metadata to facilitate the search • Endeca looks good here: http://www.endeca.com/en/home.html although it remains to be seen what will happen under Oracle ownership • Requires good metadata, which can be hard to arrange for large existing collections
Solution 3: Normalise input documents • Use a text annotation system like GATE Mimir (http://gate.ac.uk/family/mimir.html ) combined with the Tagger_Measurements plugin (http://gate.ac.uk/gate/doc/plugins.html#Tagger_Measurements ) and possibly machine learning to annotate the documents with normalised measurements • Use a standard search system (e.g. Lucene/Solr) to do the searching • Requires a dedicated implementation project for each application
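The idea of annotating documents with normalised measurements can be illustrated with a deliberately simple regular-expression stand-in. This is not the GATE API or the behaviour of its measurements tagger; the pattern, the `annotate` function and the annotation fields are assumptions, and only single temperature values are handled.

```python
import re

# Matches a number followed by °C, °F or K. Ranges such as "20-25°C" and
# spelled-out units are not handled in this sketch.
_TEMPERATURE = re.compile(r"(-?\d+(?:\.\d+)?)\s*(°C|°F|K)\b")


def annotate(text):
    """Return one annotation per temperature found, normalised to kelvin."""
    annotations = []
    for match in _TEMPERATURE.finditer(text):
        value = float(match.group(1))
        unit = match.group(2)
        if unit == "°C":
            kelvin = value + 273.15
        elif unit == "°F":
            kelvin = (value - 32.0) * 5.0 / 9.0 + 273.15
        else:
            kelvin = value
        annotations.append({
            "start": match.start(),
            "end": match.end(),
            "text": match.group(0),
            "kelvin": round(kelvin, 2),
        })
    return annotations


for ann in annotate("The mixture was stirred at 22°C, then cooled to 50 °F."):
    print(ann)
```

Once each document carries a normalised numeric value, that value can be stored in a numeric field of a standard search index and queried by range, regardless of the unit used in the original text.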
Solution 4: Specialised Search system • Use a system with built-in knowledge of ranges, units and physical quantities on both query and indexing sides • E.g. Max.recall’s Quantalyze (https://www.quantalyze.com/en/ )
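To make "knowledge of ranges on both query and indexing sides" concrete, here is a generic interval-overlap sketch. It does not describe how Quantalyze or any other product works; the `Range` class and `at_least` helper are hypothetical, and all values are assumed to be already converted to one base unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A closed interval of values in a single base unit (e.g. N m)."""
    low: float
    high: float

    def overlaps(self, other):
        """True if the two intervals share at least one value."""
        return self.low <= other.high and other.low <= self.high


def at_least(threshold):
    """Range for queries such as 'in excess of X' (open-ended above)."""
    return Range(threshold, float("inf"))


query = at_least(1355.8)                      # roughly 1000 ft lbf in N m
print(query.overlaps(Range(1400.0, 1500.0)))  # document reports 1400 to 1500 N m
print(query.overlaps(Range(900.0, 1000.0)))   # document reports 900 to 1000 N m
```

Treating both the query and the indexed quantity as intervals lets a document that reports a range match a query that states a threshold or an approximate value.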
Conclusions • Searching for physical quantities is a real and pressing problem for many professional searchers • Effective solutions now exist for both one-off requirements and long-term needs
Acknowledgements • Francisco De Sousa Webber, CEO of the IRF, who originally introduced me to the problem • Mike Baycroft, CEO of Fairview Research and IFI Claims, for many stimulating discussions