Disambiguating Japanese Inventors


Presentation Transcript


  1. Disambiguating Japanese Inventors. Yusuke Naito, Naotoshi Tsukada. ESF-APE-INV 3rd "Name Game" workshop.

  2. Agenda • Motivation • Issues and Topics • Outline of Program • Data • Details • Data acquisition via advanced questionnaire system • Result • Future work

  3. Motivation • Innovation research using patents • Trace individual inventions • Calculate statistics for inventors • The "Name Game" is a common theme among international researchers in scientometrics • Solutions to the same-name problem have generally taken a scoring approach [Aizawa2005] • The NBER working papers are the pioneering work [Trajtenberg 2006] [Kim2006] • The items and methodology depend on the research purpose and the country • Ex. To measure inventor mobility, one cannot use the affiliation item. [Kim2006] In Korea, the same names are shared by a large part of the population. • No disambiguated list of Japanese inventors has been published.

  4. Issues • The items usable for identifying an inventor are limited. • Identification is subject to both type 1 and type 2 errors. • If only we could use a birth date or social insurance number … • Large companies produce many employee inventions, so the same-name problem cannot be ignored. • The inventor's address field is not restricted to either home or company. Even a company address is hard to identify: there is no common rule, so it may be the HQ or a divisional address. • Even for a single inventor, the address is written differently across inventions. • It is hard to identify the same person after a change of affiliation.

  5. Topics • Algorithms depend on language characteristics • Natural language processing • Use all usable data • GIS • Phone book • Patent database • Data acquisition from NEDO project • Questionnaire system developed from scratch

  6. Outline of Program • Classify the detailed attributes described in the patent document • Score each item as a normalized value and weight it • Compare the total score between two inventors against a threshold
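
The three steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' program: the item names, weights, and threshold are hypothetical, and the weighted average is only one plausible way to combine normalized item scores.

```python
from dataclasses import dataclass


@dataclass
class ItemScore:
    name: str      # hypothetical item label, e.g. "name" or "address"
    score: float   # item score normalized to the range [0, 1]
    weight: float  # relative importance of the item (assumed, not from the slides)


def total_score(items):
    """Weighted average of the normalized item scores."""
    total_weight = sum(item.weight for item in items)
    if total_weight == 0:
        return 0.0
    return sum(item.score * item.weight for item in items) / total_weight


def same_inventor(items, threshold):
    """Treat the candidate pair as one person when the total reaches the threshold."""
    return total_score(items) >= threshold


pair = [
    ItemScore("name", 1.0, 2.0),
    ItemScore("address", 0.5, 1.0),
    ItemScore("ipc", 0.0, 1.0),
]
print(total_score(pair), same_inventor(pair, threshold=0.6))
```

Raising the threshold trades type 1 errors (wrong merges) for type 2 errors (missed merges), which is the tension described on the Issues slide.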

  7. Data • Input (targets) • Inventor descriptions in patents • Excluding one-time inventors • where no similar string and no similar Yomi exists • Excluding records with the same name and same address on nearby application dates • where the inventor's company is no larger than mid-sized and the technical field differs • Output (results) • For public inventor identification • Groups of records belonging to the same person • For investigating the proper inventor(s) • The group containing the target with the most members, or the group with the highest average score • As related data, the other groups, which do not contain the target

  8. Details • Items • Issues in items and their solutions • Scoring • Evaluation Functions • Machine Learning

  9. Items (1) name and address • Name • Levenshtein distance on Kanji • Normalize Kanji variants that can be treated as the same, such as old and current forms (斉藤 and 齊藤, 嶋田 and 島田, etc.), and wrong characters with the same Yomi, i.e. Japanese Kana pronunciation (二郎 and 次郎, 祐介 and 裕介, etc.), using the Name Yomi Dictionary. This yields rough candidate pairs, which are scored afterwards. • Contributes to recall • Specificity is decided according to patent frequency 3.5 • Address • Resolve historical changes such as municipal mergers using the Address History Dictionary • Convert to latitude and longitude data at the smallest area level and calculate the distance • Contributes to recall • Specificity is decided according to patent frequency 4046
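
A minimal sketch of the two measurements named on this slide: edit distance between Kanji names and distance between geocoded addresses. The tiny `VARIANTS` table is a hypothetical stand-in for the Name Yomi Dictionary, and the haversine formula is an assumption, since the slides do not say how the distance is computed.

```python
import math

# Hypothetical stand-in for the Name Yomi Dictionary: old form -> current form.
VARIANTS = str.maketrans({"齊": "斉", "嶋": "島"})


def normalize_name(name):
    """Map old Kanji forms to their current forms before comparing."""
    return name.translate(VARIANTS)


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


print(levenshtein("斉藤", "齊藤"))
print(levenshtein(normalize_name("斉藤"), normalize_name("齊藤")))
```

Normalizing first removes the variant-character difference, so the remaining edit distance reflects only genuine spelling differences.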

  10. Name distributions

  11. Geographic distributions

  12. Items (2) Network • Co-inventor • Candidate pairs are identified as the same when linked by a path of length 2 or 3 in the co-inventor network • Citation • Network of citations made by the inventor (not the examiner) • Candidate pairs are identified as the same when linked by a path of length less than 4 in the network • Network patterns in citation • Path of length 1 (1 pattern) • Path of length 2 (3 patterns) • Path of length 3 (4 patterns) [Figure: citing/cited diagrams of the path patterns]
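
The path-length test on the co-inventor network can be sketched with a breadth-first search. The graph layout is an assumption: the two candidate records carry distinct hypothetical labels ("Sato#1", "Sato#2"), co-inventors are keyed by name, and an edge joins every pair of inventors listed on the same patent.

```python
from collections import deque
from itertools import combinations


def build_coinventor_graph(patents):
    """patents: list of inventor-label lists, one list per patent."""
    graph = {}
    for inventors in patents:
        for a, b in combinations(inventors, 2):
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def path_length(graph, start, goal, max_len=3):
    """Shortest path length from start to goal, or None if longer than max_len."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_len:
            continue  # do not expand beyond the allowed path length
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


graph = build_coinventor_graph([
    ["Sato#1", "Tanaka"],
    ["Sato#2", "Tanaka"],
    ["Sato#3", "Ito"],
])
print(path_length(graph, "Sato#1", "Sato#2"))
print(path_length(graph, "Sato#1", "Sato#3"))
```

Here "Sato#1" and "Sato#2" share the co-inventor "Tanaka", giving a path of length 2, while "Sato#3" is unreachable. The same search would apply to a citation network if edges were built from inventor citations instead.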

  13. Items (3) affiliation and applicant • Affiliation • Score depends on the inverse of the size of the organization named in the inventor's address • Distinguish divisional names from company names (referring to the applicant name) • Applicant • Used when the inventor address names no organization and the candidate pair has the same applicant • Score depends on the inverse of the size of the applicant
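
A rough sketch of the affiliation/applicant fallback described above. The record fields, the `org_sizes` table, and the plain `1 / size` score are all assumptions made for illustration; the slides only say that the score depends on the inverse of the organization's size. The case where only one record names an organization is not covered by the slide and is scored 0 here.

```python
def inverse_size_score(size):
    """Smaller organizations give stronger evidence, so score by 1 / size."""
    if size < 1:
        raise ValueError("organization size must be at least 1")
    return 1.0 / size


def organization_score(rec_a, rec_b, org_sizes):
    """Score a candidate pair by shared affiliation, falling back to the applicant."""
    aff_a, aff_b = rec_a.get("affiliation"), rec_b.get("affiliation")
    if aff_a and aff_b:
        # Both addresses name an organization: compare affiliations directly.
        return inverse_size_score(org_sizes[aff_a]) if aff_a == aff_b else 0.0
    if not aff_a and not aff_b and rec_a["applicant"] == rec_b["applicant"]:
        # Neither address names an organization: use the shared applicant.
        return inverse_size_score(org_sizes[rec_a["applicant"]])
    return 0.0  # assumed: mixed or non-matching cases contribute nothing


sizes = {"BigCo": 1000, "SmallLab": 4}  # hypothetical inventor counts
a = {"affiliation": "SmallLab", "applicant": "BigCo"}
b = {"affiliation": "SmallLab", "applicant": "BigCo"}
print(organization_score(a, b, sizes))
```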

  14. Items (4) application date and IPC • Application date • Score depends on the inverse of the period between the candidate pair's application dates • 1000 days as the maximum period • IPC • Score depends on the matching rate of publication IPCs • FI (search IPC) • Assigned to all patents in common; expanded from IPC ver. 4 • Easy to compare
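
One possible reading of these two item scores, as a sketch. The linear decay to zero at 1000 days interprets "inverse of the period" loosely as a score that falls as the gap grows, and the set-overlap (Jaccard) ratio is an assumed definition of "matching rate"; neither formula is given in the slides.

```python
from datetime import date

MAX_DAYS = 1000  # maximum period stated on the slide


def date_score(d1, d2, max_days=MAX_DAYS):
    """1.0 for the same day, falling linearly to 0.0 at max_days apart."""
    gap = abs((d1 - d2).days)
    if gap >= max_days:
        return 0.0
    return 1.0 - gap / max_days


def ipc_match_rate(codes_a, codes_b):
    """Share of classification codes common to both patents (Jaccard ratio)."""
    a, b = set(codes_a), set(codes_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


print(date_score(date(2000, 1, 1), date(2001, 5, 15)))
print(ipc_match_rate({"G06F", "H04L"}, {"G06F", "G06N"}))
```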

  15. Issues in items and their solutions • Issue: time cost • Deriving candidates from the matrix of all inventors • Text calculation • Method 1: gain speed with indexed tables • Use exactly indexed tables in every relational (join) process • Method 2: the matrix is sparse • Calculate pairwise • Method 3: embed user-defined functions • Use programs compiled from C code outside the join process • Numeric calculation of distance or similarity • Improvement: 100 times faster • 30 targets in 20 days → 300 targets in 2 days • Take the time to create indexes once and reuse them afterwards
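
The idea behind Methods 1 and 2 can be illustrated in miniature: instead of filling a full inventor-by-inventor matrix, build an index once and compare only the records that share a key. The sketch below uses an in-memory Python dictionary as a stand-in for an indexed database table, with a hypothetical `yomi` field as the key; the authors' system uses relational joins and compiled C functions, which are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key):
    """Return only the record pairs that share an index key (a sparse pair list)."""
    index = defaultdict(list)
    for rec_id, rec in records.items():
        index[key(rec)].append(rec_id)  # build the index once, reuse afterwards
    pairs = []
    for ids in index.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


records = {
    1: {"yomi": "サイトウ"},
    2: {"yomi": "サイトウ"},
    3: {"yomi": "シマダ"},
    4: {"yomi": "サイトウ"},
}
print(candidate_pairs(records, key=lambda rec: rec["yomi"]))
```

With four records the full matrix has six pairs, but only the three pairs sharing a reading are generated, and the saving grows quickly with the number of inventors.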

  16. Scoring • Simple method • Sum of each item's score • Parameters: score allotment per item • Weighted method • Normalized allotment per item • Parameters: item weights • Tuning parameters • Manually • Machine learning

  17. Machine Learning (not completed) • Training data • Manually disambiguated for the top 30 inventors in NEDO patents • Confirmed that the result depends on name frequency • Reinforcement learning (Q-learning) • Weighting depends on sensitivity • Genetic Algorithm (Classifier System) • Genotype: weights; Phenotype: total score • Converges to near-optimal • Support Vector Machine (SVM) • Maximize margins evaluated with a kernel function (polynomial) • Tentative result • High weights on high-performance items • Comparable with conventional methods (like "hill climbing")

  18. Evaluation Function • Recall • Precision • F-measure • Dividing rate • For items • N: true positive (no error) • V: false positive (type 1 error) • M: false negative (type 2 error) • For groups • A correct set wrongly divided • A: sum of the sizes of correct sets • B: number of divisions • D as penalty
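
For reference, the textbook definitions of the first three measures in terms of N, V, and M are sketched below. The slides do not spell out the exact formulas the authors used, and the dividing rate D is not defined precisely enough to reproduce, so this is an assumption and need not match the reported figures exactly. The counts in the example are made up.

```python
def precision(n, v):
    """Share of merged pairs that are correct: N / (N + V)."""
    return n / (n + v) if (n + v) else 0.0


def recall(n, m):
    """Share of true pairs that were found: N / (N + M)."""
    return n / (n + m) if (n + m) else 0.0


def f_measure(n, v, m):
    """Harmonic mean of precision and recall."""
    p, r = precision(n, v), recall(n, m)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 8 correct, 2 type 1 errors, 2 type 2 errors.
print(precision(8, 2), recall(8, 2), f_measure(8, 2, 2))
```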

  19. Final result from clustering • The process above yields a matrix of values between candidate pairs • Clustering derives the disambiguated sets from the matrix • Transitive rule • A score of 0.9 between candidates A and B, 0.9 between B and C, and 0.1 between A and C means that the target's situation changed at B
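
One way to realize the transitive rule is to merge every pair whose score reaches a threshold and take the connected components, sketched here with a union-find structure. The threshold value and the reading of the rule (A and C end up together through B despite their low direct score) are assumptions; the slides do not name the clustering algorithm.

```python
def cluster(scores, threshold=0.5):
    """Group candidates by the transitive closure of pairs scoring >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)  # registers both candidates
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(group) for group in groups.values())


scores = {("A", "B"): 0.9, ("B", "C"): 0.9, ("A", "C"): 0.1, ("D", "A"): 0.2}
print(cluster(scores))
```

In the slide's example, A and C are grouped through B even though their direct score is only 0.1, which fits a person whose address or affiliation changed between the earliest and latest patents.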

  20. Data acquisition via advanced questionnaire system • Questionnaire system connected to a database • Generates questions when the respondent matches an inventor in the patent database • Answered by the inventor himself/herself • Issues • When an inventor has many patents, many patents must be selected • Targeting by address • A probability of wrong answers remains because of restrictions on the respondent • Answers are recorded in the database • Easy to calculate statistics • Generates request and reminder e-mails • Detects questions skipped by mistake • Questions are enabled/disabled automatically from dependencies declared in the questionnaire

  21. Validation by NEDO questionnaire • Evaluation • Number of inventors corresponding to NEDO patents in the database W=848 • Number of inventors corresponding to questionnaire answers Q=854 • Results of program execution (manual → ML → manual improvement) • Correct results N=412→532→654 • Type 1 errors V=128→168→305 • Type 2 errors M=442→314→200 • Singleton inventors with no candidate L=36 • Evaluated values • Recall R=0.50→0.61→0.75 • Precision P=0.75→0.75→0.67 • Dividing rate D=0.61→0.31→0.28

  22. Remaining work • Utilizing name frequency • Phone book vs. inventors • Frequency of affiliation or applicant • Maintaining the name (Yomi) dictionary • Some names are hard to read (1%) • Low frequency and easy to miswrite • Performance tuning • A further 10-fold speedup • Prevent small program changes from increasing time cost • Compare ML variants (parameters or kernel function) • Use all inventor attribute items • Attorney • Feature words [Slide examples of names with their readings: 山本示 ヤマモトシメス, 前田維 マエダユイ, 高橋召 タカハシミコト]

  23. Items in near future • Attorney • The same inventor may apply via the same attorney • Feature words • Hypothesis • An inventor uses the same words across multiple patents • Word vectors calculated from TF・IDF • TF: Term Frequency (of word) • IDF: Inverse Document Frequency (of word) • Conventional approach in retrieval systems • Similarity by inner product between texts
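
The feature-word idea can be sketched with a small TF-IDF implementation. The tokenized documents are invented for illustration, and the inner product is length-normalized here (cosine similarity), which is a common variant rather than something the slides specify. Real Japanese patent text would first need word segmentation, which is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Inner product of two sparse vectors, normalized by their lengths."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    ["laser", "diode", "lens"],
    ["laser", "diode", "mirror"],
    ["enzyme", "protein", "assay"],
]
v0, v1, v2 = tfidf_vectors(docs)
print(cosine(v0, v1) > cosine(v0, v2))
```

Two patents sharing rare technical words score higher than patents with no words in common, which is the signal the hypothesis on this slide relies on.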

  24. References • [Trajtenberg 2006] Manuel Trajtenberg, Gil Shiff, Ran Melamed, The "Names Game": Harnessing Inventors' Patent Data for Economic Research, NBER Working Paper 12479, 2006 • [Kim2006] Jinyoung Kim, Sangjoon John Lee, Gerald Marschke, International Knowledge Flows: Evidence from an Inventor-Firm Matched Data Set, NBER Working Paper 12692, 2006 • [Aizawa2005] Akiko Aizawa, Keizo Oyama, Atsuhiro Takasu, Jun Adachi, Research Issues and Current Solution for Identification of Records, IEICE Journal, Vol. J88-D-I, No. 3, 2005
