This report discusses the implementation and advancements of the IPCCAT-Neural system, which allows for automatic prediction of IPC symbols based on text input. It covers the training collection, IPC coverage, precision, and potential future developments.
9.a Report on IPC-related IT systems IPC Committee of Experts 50 Patrick Fiévet Head of IT Systems Section International Classifications and Standards Division Geneva February 08, 2018
Agenda • Artificial Intelligence: IPC Text categorization in the IPC i.e. IPCCAT-Neural • What is it / what is new ? • Demonstration • What it could be for • What comes next in the short term • What comes next in the longer term
IPCCAT-neural text categorization in the IPC • What is it about? • Automatic prediction (guess) of the most appropriate IPC symbols on the basis of a text input (e.g. a patent abstract), i.e. 3 guesses among N categories, each with an associated level of confidence • Implementation based on several neural networks • The added value of IP knowledge and the technology used in processing the training collection is as important as the technology used in the classifier
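The "3 guesses among N categories with a confidence level" idea can be sketched as a simple ranking step. This is only an illustration, not the IPCCAT implementation: the function name `top_guesses`, the symbols, and the confidence values are all hypothetical.

```python
from typing import Dict, List, Tuple


def top_guesses(confidences: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k IPC symbols with the highest confidence, best first."""
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical confidence scores for one abstract (illustrative values only).
scores = {
    "G06F 17/30": 0.61,
    "G06N 3/08": 0.22,
    "H04L 29/06": 0.09,
    "A61K 31/00": 0.01,
}
guesses = top_guesses(scores)
for symbol, confidence in guesses:
    print(f"{symbol}: {confidence:.2f}")
```

Whatever the underlying networks produce, the user-facing result is the same shape: a short ranked list of symbols, each paired with a confidence.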
IPCCAT-neural text categorization in the IPC at subgroup level! • IPCCAT neural 2016 at IPC main group level: • Number of categories: 7,374 • Precision (three guesses): 80% • Number of neural networks: ~700 • IPCCAT neural 2018 at subgroup level: • Number of categories: 72,137 • Precision (three guesses) based on 1.5 million test cases: 82% • Number of neural networks: ~8,000
IPCCAT-neural text categorization in the IPC at subgroup level • Why was it actually doable? • Recent evolution of the IPCCAT classifier (available on demand as open source from the Olanto foundation) • Added value in data processing: • Training based on patent documents computed from DOCDB XML excerpts • Computation of both IPC and CPC classifications • Progress in computing power opens new R&D horizons, e.g. GPU, text processing, …
Evolution of IPCCAT R&D over the years • 2003-2008: IPC Main Group level (~7,000 categories) • 2017 • 2018: IPC Group level (~73,000 categories)
IPCCAT-neural 2018: text categorization in the IPC at subgroup level • Training collection, IPC coverage and precision: • Training collection: 27.7 million documents in EN and 4.4 million in FR • Coverage of the IPC (using IPC and CPC through concordance): • 99% at subgroup level (EN) • 91% at subgroup level (FR) • Precision (three guesses): • 82.5% at subgroup level (EN)!! • 72% at subgroup level (FR)
IPCCAT-neural 2018: text categorization in the IPC at subgroup level • Training collection, IPC coverage and precision: • Side effects of n-gram improvements on precision at IPC Main Group level (three guesses): • 89% at Main Group level (EN) • 83% at Main Group level (FR)
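The slides do not define "precision (three guesses)". A minimal sketch, assuming it means the share of test documents whose reference symbol appears among the three predicted symbols, could look like this. The function name and the toy data are hypothetical.

```python
from typing import Iterable, Sequence


def three_guess_precision(
    predictions: Iterable[Sequence[str]],
    references: Iterable[str],
) -> float:
    """Share of test cases whose reference symbol is among the first three guesses.

    Assumption: this is how the "three guesses" figure is scored; the source
    slides do not spell out the definition.
    """
    hits = 0
    total = 0
    for guesses, reference in zip(predictions, references):
        total += 1
        if reference in guesses[:3]:
            hits += 1
    return hits / total if total else 0.0


# Toy test set with placeholder category labels (not real IPC symbols).
preds = [["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"], ["J", "K", "L"]]
refs = ["B", "X", "G", "L"]
precision = three_guess_precision(preds, refs)
print(f"{precision:.0%}")
```

Under this reading, the 82% subgroup-level figure would mean that for roughly 82 of every 100 test documents, the reference subgroup was one of the three symbols proposed.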
IPCCAT-neural text categorization in the IPC at subgroup level • Demonstration • http://icwscommonacc.wipo.int/classifications/ipc/ipcpub/?searchmode=ipccat
Artificial Intelligence / IPCCAT-neural: on the way to assist IPC reclassification • Chronology: (Still a long way to go) • Evidence that text categorization works at IPC subgroup level with acceptable precision: Done • Integration of IPCCAT neural at subgroup level into IPCPUB v 7.5 (February 2018) • Confirmation that cross-lingual text categorization can assist in languages other than EN, even in the absence of large training collections: to be prototyped based on a commercial CAT tool and limited testing (for cost containment reasons)
Artificial Intelligence / IPCCAT-neural: on the way to assist IPC reclassification • Chronology: (Still a long way to go) • Incentives for R&D in automated text categorization: WIPO DELTA training collection (bilateral discussion EPO-WIPO in progress) Q2 2018? • Propose alternatives to Default Transfer, e.g. more than one symbol based on IPCCAT guesses and confidence levels • CE decisions, WIPO resource planning, etc. (2019) • Development of the production-scale solution integrating neural cross-lingual text categorization (based on IPCCAT neural and WIPO Translate?) (202x) • Integration into IPCWLMS for Stage 3 reclassification (202x)
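One way to picture "more than one symbol based on IPCCAT guesses and confidence levels" as an alternative to Default Transfer is a thresholding rule. The sketch below is purely illustrative: the function name, the 0.2 threshold, the cap of three symbols, and the fallback to the default symbol are all assumptions, not decisions recorded in the slides.

```python
from typing import List, Tuple

Guess = Tuple[str, float]  # (IPC symbol, confidence)


def propose_symbols(
    guesses: List[Guess],
    default_symbol: str,
    min_confidence: float = 0.2,  # hypothetical cut-off
    max_symbols: int = 3,         # hypothetical cap
) -> List[str]:
    """Keep confident guesses (best first); fall back to the default transfer symbol."""
    ranked = sorted(guesses, key=lambda guess: guess[1], reverse=True)
    kept = [symbol for symbol, confidence in ranked if confidence >= min_confidence]
    return kept[:max_symbols] or [default_symbol]


# Placeholder labels and confidences for illustration only.
proposal = propose_symbols([("A", 0.5), ("C", 0.1), ("B", 0.3)], default_symbol="Z")
fallback = propose_symbols([("A", 0.05)], default_symbol="Z")
print(proposal, fallback)
```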
Incentive to R&D in text categorization: WIPO-Alpha training collection
Incentive to R&D in text categorization: WIPO-Delta training collection • Short-term perspective: • Further AI incentives for research and development institutes interested in automatic text categorization, e.g. in patent classification • Fully specified XML format (DONE) • Complement the public WIPO-ALPHA training collection (see http://www.wipo.int/classifications/ipc/en/ITsupport/Categorization/dataset/index.html) with a WIPO-DELTA XML collection from IPCWLMS? (upload into a database for R&D purposes and export as an XML training collection)
Text categorization in the IPC • Other 2018 perspectives: • Cross-lingual text categorization in the IPC at subgroup level • Confirmation of expectations through prototyping of ES, FR, EN, DE, RU support using automatic translation by a commercial product (bound by budget limitations), e.g. DE text translated into EN and submitted to IPCCAT neural trained with EN documents • Available through IPCPUB interface or web service (Q2 2018) • IPCCAT retraining based on IPC 2018.01 (Q3 2018)
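The cross-lingual approach described above (translate into EN, then submit to the EN-trained classifier) can be sketched as a two-step pipeline. Everything named here is hypothetical: `translate` and `classify_en` stand in for the commercial translation product and IPCCAT neural, whose real interfaces are not given in the slides, and the stub functions exist only so the sketch runs.

```python
from typing import Callable, List, Tuple

Guess = Tuple[str, float]  # (IPC symbol, confidence)


def categorize_cross_lingual(
    text: str,
    source_lang: str,
    translate: Callable[[str, str, str], str],
    classify_en: Callable[[str], List[Guess]],
) -> List[Guess]:
    """Translate non-English text into EN, then classify it with the EN-trained model."""
    if source_lang.upper() == "EN":
        english_text = text  # already English: no translation step
    else:
        english_text = translate(text, source_lang, "EN")
    return classify_en(english_text)


# Stand-in stubs; these are NOT real translation or IPCCAT calls.
def fake_translate(text: str, src: str, dst: str) -> str:
    return f"[{src}->{dst}] {text}"


def fake_classify_en(text: str) -> List[Guess]:
    # Illustrative symbols and confidences only.
    return [("H04L 9/00", 0.5), ("G06F 21/00", 0.3), ("H04W 12/00", 0.1)]


result = categorize_cross_lingual(
    "Verfahren zur Verschlüsselung", "DE", fake_translate, fake_classify_en
)
print(result)
```

Passing the translator and classifier in as parameters keeps the sketch independent of any particular product, which matches the slide's point that the translation component is still to be prototyped.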
Thank you for your attention! • QUESTIONS? contact WIPO at ipc.mail@wipo.int