310 likes | 427 Views
Participating institution: Humboldt Universität zu Berlin - IDSL. Multiple Retrieval Models and Regression Models for Prior Art Search. Patrice Lopez also at EPO Berlin, Germany. Laurent Romary INRIA Gemo - Saclay, France HUB-IDSL – Berlin, Germany. Plan.
E N D
Participating institution: Humboldt Universität zu Berlin - IDSL Multiple Retrieval Models and Regression Modelsfor Prior Art Search Patrice Lopezalso at EPO Berlin, Germany Laurent Romary INRIA Gemo - Saclay, France HUB-IDSL – Berlin, Germany
Plan • Searching Scientific and Technical Documents • Issues related to Prior Art Search • Overview of PATATRAS • Patent Document processing • Combining metadata & text in four steps • Results • Future work
PATATRAS !! • PATent and Article Tracking, Retrieval and AccesS addresses Scientific and Technical Publications in general: • Scientific and Technical Publications have 5 dimensions: • metadata • document structure • textual content • supporting content • experimental data • How is this instantiated in patent publications?
Patent Publications • Metadata encode procedure-related data: • Date, applicant, inventors, language(s) • Classification: hierarchy of technical fields IPC, ECLA (+ICO) G06F17/30T2P2X • Citations Information retrieval Query expansion
EC classified Pat: 95% NPL: 24% JP: 17% 1% EPO Citation Statistics EPO Search Reports produced in the last 5 years (tot. 775.000)
Patent Publications • Metadata encode procedure-related data: • Date, applicant, inventors, language(s) • Classification: hierarchy of technical fields IPC, ECLA (+ICO) G06F17/30T2P2X • Citations • Patent Document Structure: Title, Abstract, Claims, Description (description of prior art, "subjective" technical problem, description of embodiments) • Strong interrelations between these structures • Each of these structures serves different goals Information retrieval Query expansion
Patent Publications • Textual Content of Patent: • Attornish, multilinguality • Supporting content: • tables, mathematical and chemical formulas, citations, technical drawing, etc. • Experimental data: absent
PATATRAS !! • Scientific and Technical Publications have 5 dimensions: • metadata • document structure • textual content • supporting content • experimental data What are the known practices in prior art search ?
Prior art search The patent examiner’s real life CLEF-IP biases Topic patents are granted patent publication (richer documents then applications; ECLA classes, extra citations; final claims) Patent application • Motivation and approach: • Non exhaustivity • Recall-oriented search is a myth (titles and abstracts; lack of elaborate tool) • Usage of classification for search (a priori restriction of the result set) • Usage of meta-data (thickets; patents are continuation of previous applications) Result set extended to all EPO documents introduced during the prosecution the application Search report All documents without EPO counterpart (via patent family) are discarded (patent applications never filed at the EPO and non-patent literature omitted) Prior art Prior art Prior art Prior art Prior art
PATATRAS !! • Scientific and Technical Publications have 5 dimensions: • metadata • document structure • textual content • supporting content • experimental data • We investigated only 1 and 3 in CLEF IP 2009 • However... • how to combine metadata-based and text-content retrieval? • How to combine results in different languages? • How to combine different retrieval approaches?
Index Lemma en Index Lemma fr Index Lemma de Index Phrase en Index Concept Patent Collection Ranked Results (10) Initial Working Set Lemur 4.9 - KL divergence - Okapi BM25 Ranked Merged Results Init Merging Query Lemma en Tokenization Post-Ranking Patent Topic POS Tagging Query Lemma fr Phrase Extraction Query Lemma de Final Ranked Results Concept Tagging Query Phrase en Query Concept Overview of PATATRAS
Overview of PATATRAS Lemur 4.9 - KL divergence - Okapi BM25 Init Merging Post-Ranking
Overview of PATATRAS Lemur 4.9 - KL divergence - Okapi BM25 Post-Ranking Merging Init
Overview of PATATRAS Lemur 4.9 - KL divergence - Okapi BM25 Post-Ranking Merging Init
Patent Document Processing: Text Indexing • Sound linguistic processing as groundwork: • No stemming: POS tagging & lemmatisation • No stop words: Only open grammatical categories are considered (N, V, Adj., Adv., numbers) • A total of 5 indexes: • One word form* (lemma) index per language (en, fr, de) • English phrase indexing (Dice coefficient) • Conceptual indexing • *ISO/DIS 24611, Language resource management — Morpho-syntactic annotation framework
Conceptual indexing • Creation of a multilingual terminological database base based on a conceptual model* covering scientific & technical fields • Sources: MeSH, UMLS, Gene Ontology, SUMO, WordNet/WordNet-Domains/WOLF, Wikipedia en/fr/de • Merging on concept based on: • Domain matches (manual mappings between sources) • Term matches • Represent terms/term variants/synonyms/acronyms and multilingual correspondences • Term disambiguation based on IPC class • 2,6 millions terms for en, 190.000 for de, 140.000 for fr • 1,4 millions concepts (71.000 realized in de, 65.000 in fr) • *ISO 16642:2003, Computer applications in terminology — Terminological markup framework
Limitations of text-only retrieval • Queries are based on all the textual content of the topic patent documents Model Index Language base with citation text KL lemma en0.10680.1083 KL lemma fr 0.0611 0.0612 KL lemma de 0.0627 0.0634 KL phrase en 0.0717 0.0720 KL concept all 0.0671 0.0680 Okapi lemma en 0.0806 0.0813 Okapi lemma fr 0.0301 0.0303 Okapi lemma de 0.0598 0.0612 Okapi phrase en 0.0328 0.0330 Okapi concept all 0.0510 0.0516
Overview of PATATRAS Lemur 4.9 - KL divergence - Okapi BM25 Post-Ranking Merging Init
Overview of PATATRAS Lemur 4.9 - KL divergence - Okapi BM25 Post-Ranking Merging Init
Patent Document Processing: Metadata • Additional extraction of cited patents in the descriptions (regular expressions) • 7960 additional cited EP doc. found in XL set • Metadata representation: basic normalization (author, applicant), • Storage in a MySQL database (total 2,48 Go for the collection)
Prior working sets • Goal: For a given patent topic, create the smallest set of patents containing the relevant documents • Iterative expansion from a core list of documents based on metadata: citation tree, common applicant/author, patent family relation, classifications → patent examiner's strategies • Result: micro-recall of 0.7303, approx. 2600 doc. per patent topic (415 results per topic after final cutoff) • Significant improvement of MAP results: Model Index Language with cit. text with prior sets KL lemma en0.1083 0.1516 (+40%) KL lemma de 0.0634 0.1145 (+81%) KL phrase en 0.0720 0.1268 (+76%) Okapi lemma en 0.0813 0.1365 (+68%)
Overview of PATATRAS Lemur 4.9 - KL divergence - Okapi BM25 Post-Ranking Merging Init
Overview of PATATRAS Lemur 4.9 - KL divergence - Okapi BM25 Post-Ranking Merging Init
Merging of results • Strong complementarities between the results sets • So many examples ! → fully supervised ML • Regression model for estimating for each patent topic the pertinence of a result set • Features: language, query size, init. working set size, max./min. & range of retrieval scores, IPC main & class, average phrase length • Training set: 500 + Addition of 4131 patents of the collection • Linear combination of weights:
Merging of results • Feat. LeastMedSq MP SMO ν-SVM • f1 0.1681 (+5.8%) 0.1711 (+7.7%) 0.1706 (7.4) 0.1691 (+6.4%) • f1-6 0.1689 (+6.3%) 0.1797 (+13.1%) 0.1807 (+13.7) 0.1976 (+24.3%) • all 0.1786 (+12.4) 0.1898 (+19.4%) 0.2016 (+26.9%) 0.2281 (+43.5%) • f1 language • f2-6 related to the retrieval score • f7-8 IPC (domains) • f9 av. phrase length
Overview of PATATRAS Lemur 4.9 - KL divergence - Okapi BM25 Post-Ranking Merging Init
Overview of PATATRAS Lemur 4.9 - KL divergence - Okapi BM25 Post-Ranking Merging Init
Post-ranking • Regression model for estimating the pertinence of a patent in the result set for a given patent topic • Features: citation, # of common IPC & ECLA classes, prob. of citation, same applicant & inventors • Training set: 500 + Addition of 4131 patents of the collection
Final Results • Measures S M XL en-XL fr-XL de-XL • MAP 0.2714 0.2783 0.2802 0.2358 0.1787 0.2092 • Prec. at 5 0.2780 0.2766 0.2768 0.2365 0.1855 0.2122 • Prec. at 10 0.1768 0.1748 0.1776 0.1575 0.1338 0.1467 • In average approx. 43s per topic • Final runs (10.000 patent topics) for all, en, fr, de took 5 days on 4 machines
Conclusion • We have proposed an architecture for retrieving Scientific and Technical Publications • We have adapted the architecture to patent search practices • Need • improve terminological representations • address document structures • refine query representations • Full text available in HAL: http://hal.archives-ouvertes.fr/hal-00411835/fr/