
Using Physical Quantities to Find Similar Documents

John Tait, johntait.net Ltd., john@johntait.net



  1. Using Physical Quantities to Find Similar Documents John Tait johntait.net Ltd. john@johntait.net

  2. Overview • The Problem • Finding relevant physical quantities in documents • Some Solutions • Concluding Remarks

  3. The Problem • Many searches in Business Intelligence areas like Technology Landscaping, or Techno-Legal areas like Freedom-to-Operate searches, involve searching documents which mention physical quantities expressed in units like metres, kilograms and degrees centigrade • Document Similarity Searching (including Boolean) based on bag-of-words models works badly for these sorts of queries

  4. Example Search 1 • Are there any enforceable patents related to a manufacturing process using trifluoromethanesulfonate as a reagent at approximately 22°C? • Relevant documents include those with temperatures expressed in K and °F • Note also the implied range
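A minimal sketch of what this example search implies: temperatures in °C, °F and K are converted to kelvin so that they can be compared, and "approximately" is treated as a tolerance band. The function names and the 2 K tolerance are assumptions chosen for illustration; they do not come from the presentation.

```python
def to_kelvin(value, unit):
    """Convert a temperature given in C, F or K to kelvin."""
    if unit == "C":
        return value + 273.15
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0 + 273.15
    if unit == "K":
        return float(value)
    raise ValueError(f"unsupported temperature unit: {unit!r}")


def matches_approximately(query, candidate, tolerance_k=2.0):
    """Return True if two (value, unit) temperatures lie within tolerance_k kelvin.

    tolerance_k is a hypothetical stand-in for the range implied by
    'approximately'; a real search would let the searcher set it.
    """
    return abs(to_kelvin(*query) - to_kelvin(*candidate)) <= tolerance_k


# A document reporting 295 K or 71.6 °F should match a query for about 22 °C.
print(matches_approximately((22, "C"), (295, "K")))
print(matches_approximately((22, "C"), (71.6, "F")))
```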

  5. Example Search 2 • Have we any internal test documents which report a torque in excess of 1000 ft lbf for an electric motor suitable for installation in a car? • Relevant documents include documents reporting N m, possibly also bhp, kW etc.
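The torque example can be sketched in the same way. The conversion factor follows from the definitions of the foot and the pound-force. Power figures (bhp, kW) are not torques, so they can only be related to a torque when a rotational speed is also reported, using torque = power / angular velocity. The function names below are hypothetical and serve only as an illustration.

```python
import math

# 1 ft = 0.3048 m and 1 lbf = 4.4482216152605 N, so 1 ft lbf is about 1.3558 N m.
FT_LBF_TO_NM = 0.3048 * 4.4482216152605

_TORQUE_FACTORS = {"N m": 1.0, "ft lbf": FT_LBF_TO_NM}


def torque_to_nm(value, unit):
    """Convert a torque in 'N m' or 'ft lbf' to newton metres."""
    return value * _TORQUE_FACTORS[unit]


def torque_from_power(power_kw, rpm):
    """Derive torque in N m from power in kW and rotational speed in rev/min."""
    omega = 2.0 * math.pi * rpm / 60.0  # angular velocity in rad/s
    return power_kw * 1000.0 / omega


def exceeds_threshold(reported, threshold=(1000.0, "ft lbf")):
    """Return True if the reported (value, unit) torque is above the threshold."""
    return torque_to_nm(*reported) > torque_to_nm(*threshold)


# 1000 ft lbf is roughly 1356 N m, so 1400 N m qualifies and 1300 N m does not.
print(exceeds_threshold((1400.0, "N m")))
print(exceeds_threshold((1300.0, "N m")))
```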

  6. Four Solutions • Based on a survey in the LinkedIn Group “Information Access and Search Professional” • Thanks to: Mathew Kesler, Helmut Berger, Seth Grimes, Kevin Watters, Gerard DuPont, Christopher Frenz, Robert Peterson, Marat Shaidulatov

  7. Solution 1: Synonym Query Expansion • Use a comprehensive list of units (e.g. http://www.unc.edu/~rowlett/units/index.html ) to identify synonyms for the units in the search specification, then use e.g. Boolean searching and manual result set refinement to obtain a suitable result set • Probably effective but heavy on searcher effort
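As a rough illustration of synonym query expansion, the sketch below builds a Boolean query string from a small hand-written synonym table. The table entries, the function names and the generic AND/OR syntax are all assumptions; a real search would draw its synonyms from a comprehensive unit list and use the query syntax of the target search system.

```python
# Hypothetical, deliberately tiny synonym table keyed by a canonical unit name.
UNIT_SYNONYMS = {
    "newton metre": ["N m", "Nm", "newton metre", "newton meter"],
    "foot pound-force": ["ft lbf", "ft-lbf", "foot-pound"],
}


def expand_units(unit_keys):
    """Return an OR-group covering every surface form of the given units."""
    forms = [form for key in unit_keys for form in UNIT_SYNONYMS[key]]
    return "(" + " OR ".join(f'"{form}"' for form in forms) + ")"


def build_query(terms, unit_keys):
    """Combine the topical terms with the expanded unit group using AND."""
    quoted_terms = [f'"{term}"' for term in terms]
    return " AND ".join(quoted_terms + [expand_units(unit_keys)])


print(build_query(["electric motor", "torque"], ["newton metre", "foot pound-force"]))
```

Expansion of this kind only widens the unit vocabulary. The numeric values still differ between units, which is part of why the result set needs manual refinement.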

  8. Solution 2: System with faceting • Use physical quantities as a facet • Ensure documents contain suitable metadata to facilitate the search • Endeca looks good here: http://www.endeca.com/en/home.html although it remains to be seen what will happen under Oracle ownership • Requires good metadata, which can be hard to arrange for large existing collections

  9. Solution 3: Normalise input documents • Use a text annotation system like GATE Mímir (http://gate.ac.uk/family/mimir.html) combined with the Tagger Measurements plugin (http://gate.ac.uk/gate/doc/plugins.html#Tagger_Measurements) and possibly machine learning to annotate the documents with normalised measurements • Use a standard search system (e.g. Lucene/Solr) to do the searching • Requires a project for your application
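The normalisation idea can be sketched with a regular expression standing in for a real annotation pipeline. This is not how GATE or its measurements tagger works internally; it is a simplified stand-in that only recognises temperatures written as a number followed by °C, °F or K, and the names `annotate` and `in_range` are hypothetical.

```python
import re

# Converters from each recognised unit to kelvin, the normalised form.
_TO_KELVIN = {
    "°C": lambda v: v + 273.15,
    "°F": lambda v: (v - 32.0) * 5.0 / 9.0 + 273.15,
    "K": lambda v: v,
}

# A number, optional whitespace, then one of the recognised unit symbols.
_PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)\s*(°C|°F|K)\b")


def annotate(text):
    """Return one annotation per temperature mention, with a kelvin value."""
    annotations = []
    for match in _PATTERN.finditer(text):
        value, unit = float(match.group(1)), match.group(2)
        annotations.append({
            "text": match.group(0),
            "start": match.start(),
            "end": match.end(),
            "kelvin": round(_TO_KELVIN[unit](value), 2),
        })
    return annotations


def in_range(annotations, low_k, high_k):
    """Mimic a numeric range query over the normalised values."""
    return [a for a in annotations if low_k <= a["kelvin"] <= high_k]


doc = "The mixture was held at 22°C, then cooled to 50 °F and stored at 295 K."
for hit in in_range(annotate(doc), 293.15, 297.15):
    print(hit["text"], hit["kelvin"])
```

In a real deployment the normalised values would be written to a numeric field in the index, so that the search system itself can answer range queries.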

  10. Solution 4: Specialised Search system • Use a system with in-built knowledge of ranges, units and physical quantities on both query and indexing sides • E.g. Max.recall’s Quantalyze (https://www.quantalyze.com/en/)

  11. Conclusions • Searching for physical quantities is a real and pressing problem for many professional searchers • Effective solutions now exist for both one-off requirements and long-term needs

  12. Acknowledgements • Francisco De Sousa Webber, CEO of the IRF, who originally introduced me to the problem • Mike Baycroft, CEO of Fairview Research and IFI Claims for many stimulating discussions

  13. For more information • Email john@johntait.net
