Disambiguating Patent Inventors: A Non-Name-Matching Approach

Presenter: Hsini Huang
Co-authors: Li Tang and John P. Walsh
Georgia Institute of Technology
ESF-APE-INV 2nd “Name Game” workshop, Dec 9, 2010, Madrid, Spain

A challenge to undertake
• Authorship identification has been the Achilles' heel of bibliometric analyses at the individual level, e.g. citation impact analysis (Tang and Walsh, 2010).
• Raffo and Lhuillery (2009) also warned that the reliability of statistical results on patenting inventors depends heavily on the accuracy of the matching heuristic.

Why solve the “John Smith” problem differently?
• Several reasons why name-matching is probably not a good idea:
  - Cleaning typos in names (inventor, assignee, etc.) is a difficult task
  - The matching criteria are often also used as dependent variables, e.g. co-authorship, knowledge flows and geographical spillovers (Singh, 2004)
  - “Name plus affiliation plus address” can be effective, but only if inventors are not mobile

ASE Method: Key Concepts
• Cognitive map
  - The series of psychological transformations by which an individual acquires, codes, stores, recalls, and decodes information in a spatial/information environment
• Structural equivalence
  - In a single-relation network, two actors are structurally equivalent if they have identical ties to and from all the other actors
• Approximate Structural Equivalence (A.S.E.)
  - Actors within a structurally equivalent cluster are more similar to each other than to those outside the cluster
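The strict definition of structural equivalence can be checked directly on an adjacency matrix. The sketch below is only an illustration of that definition, not code from the presentation; the function name `structurally_equivalent` and the toy network are assumptions made for the example.

```python
def structurally_equivalent(adj, a, b):
    """Return True if actors a and b have identical ties to and from
    every other actor in a single-relation network.

    adj is a square list of lists; adj[i][j] == 1 means a tie from i to j.
    (Hypothetical helper, written only to illustrate the definition.)
    """
    others = [k for k in range(len(adj)) if k not in (a, b)]
    same_out = all(adj[a][k] == adj[b][k] for k in others)  # ties sent
    same_in = all(adj[k][a] == adj[k][b] for k in others)   # ties received
    return same_out and same_in


# Toy directed network with four actors (illustrative data only).
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

print(structurally_equivalent(adj, 0, 1))  # True: same ties to actors 2 and 3
print(structurally_equivalent(adj, 0, 2))  # False: actor 2 receives a tie from 1, actor 0 does not
```

Exact equivalence like this is rare in real citation data, which is why the method relaxes it to an approximate, similarity-based version.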

ASE Method: Intuition
• The references in a publication or patent should reflect the cognitive map of the author or inventor
• If two documents share one or more references, they are more likely to be by the same creator
  - This is especially true if they share a rare reference
• Therefore, ASE of reference networks should partition documents by creator, especially if we weight the matrix by how rare the references are and by how many references each document contains
• Validated on publication data (70-80% accuracy) (Tang and Walsh, 2010)

Graphically, Approximate Structural Equivalence (A.S.E.) is: [figure not reproduced in this transcript] Source: Tang and Walsh (2010)

The measurement of cognitive homogeneity
• w1 and w2 are two weights
  - w1 = popularity of the cited references
  - w2 = number of references in a patent document
• D[i, j] is the patent-reference matrix, defined as [inventor IDs × cited references]
• Mathematically, the similarity score between authors is calculated as: [equation not reproduced in this transcript]
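The exact formula did not survive in this transcript, so the following is only a minimal sketch of the idea described on the slide: shared references count for more when they are rare (w1) and when the documents cite few references overall (w2). The specific weighting used here (inverse popularity, divided by the geometric mean of the two reference counts) is an assumption for illustration and should not be read as the formula from Tang and Walsh (2010). The function name and the toy data are hypothetical.

```python
from collections import Counter
from math import sqrt


def similarity_scores(refs_by_patent):
    """Score every pair of patents that share at least one reference.

    refs_by_patent: dict mapping a patent id to the set of references it cites.
    Assumed weighting (NOT the original formula):
      w1 = sum over shared references of 1 / popularity(reference)
      w2 = sqrt(number of refs in patent a * number of refs in patent b)
      score = w1 / w2
    """
    # Popularity = number of patents in the sample citing each reference.
    popularity = Counter(r for refs in refs_by_patent.values() for r in refs)
    ids = sorted(refs_by_patent)
    scores = {}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            shared = refs_by_patent[a] & refs_by_patent[b]
            if not shared:
                continue  # no common reference, no link
            w1 = sum(1.0 / popularity[r] for r in shared)
            w2 = sqrt(len(refs_by_patent[a]) * len(refs_by_patent[b]))
            scores[(a, b)] = w1 / w2
    return scores


# Illustrative data only.
patents = {
    "P1": {"R1", "R2"},
    "P2": {"R1", "R2"},
    "P3": {"R2", "R3", "R4", "R5"},
    "P4": {"R6"},
}
scores = similarity_scores(patents)
for pair, s in sorted(scores.items()):
    print(pair, round(s, 3))
```

In this toy sample, P1 and P2 share two references, one of them fairly rare, and both have short reference lists, so they score higher than P1 and P3, which share only the most popular reference. P4 shares nothing and gets no score at all.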

A comparison of citation governance at the EPO and the USPTO
• At the EPO, patent references are added by patent examiners. The aim of citation is to indicate the most technically relevant information with a “minimum” of references
• At the USPTO, inventors or applicants should provide a complete list of all prior art they are aware of
• Thus, USPTO data should more accurately reflect the cognitive maps of inventors
H: The A.S.E. algorithm performs better on US patents than on EPO patents
  - In fact, it should perform poorly in the EPO case

Data and experiment strategy
• The gold-standard dataset: the French Benchmark Dataset (APE-INV project, Lissoni et al., 2009)
• Exp 1 & 2: EPO citations vs. USPTO citations
  - We retrieved reference data from PATSTAT
• Exp 3: A.S.E. vs. multi-stage matching method
  - Thanks to the open-access dataset provided by Lai and his colleagues (2009), the “careers and co-authorship networks of U.S. patent-holders since 1975”

The flow chart of our experiment

Calculation of the accuracy rate [figure not reproduced in this transcript; labelled error types: misclassified as a singleton, over-clustering]
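Since the original figure is missing, the sketch below shows one plausible way to compute a record-level accuracy rate that distinguishes the two error types named on the slide. The exact definition used in the presentation is not recoverable from the transcript, so the rules here are assumptions: a record counts as correct only when its predicted cluster contains exactly the records of its true inventor. The function name, the extra "split" category, and the data are hypothetical.

```python
from collections import Counter, defaultdict


def classify_records(true_labels, pred_labels):
    """Label each record as correct or as one of several error types.

    true_labels: dict record id -> true inventor id (benchmark).
    pred_labels: dict record id -> predicted cluster id.
    Assumed rules (illustration only, not the authors' definition):
      correct                 - predicted cluster == true inventor's records
      over_clustered          - predicted cluster includes another inventor's record
      misclassified_singleton - record left alone although the inventor has more records
      split                   - any other incomplete, non-singleton cluster
    """
    true_groups = defaultdict(set)
    pred_groups = defaultdict(set)
    for rec, inventor in true_labels.items():
        true_groups[inventor].add(rec)
        pred_groups[pred_labels[rec]].add(rec)

    outcome = {}
    for rec, inventor in true_labels.items():
        truth = true_groups[inventor]
        pred = pred_groups[pred_labels[rec]]
        if pred == truth:
            outcome[rec] = "correct"
        elif not pred <= truth:
            outcome[rec] = "over_clustered"
        elif len(pred) == 1:
            outcome[rec] = "misclassified_singleton"
        else:
            outcome[rec] = "split"
    return outcome


# Illustrative data only: inventors A, B, C and five patent records.
true_labels = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "p5": "C"}
pred_labels = {"p1": 1, "p2": 1, "p3": 2, "p4": 3, "p5": 3}

outcome = classify_records(true_labels, pred_labels)
counts = Counter(outcome.values())
accuracy = counts["correct"] / len(outcome)
print(dict(counts))
print(accuracy)  # 0.4: only p1 and p2 are correctly grouped
```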

Experiment 1: For all the records
• Among all 1850 patents in the French Benchmark dataset (incl. patents with no cites):
  - Using EPO reference data, the A.S.E. method reaches 77% accuracy
  - Using USPTO reference data, the A.S.E. method reaches 78% accuracy

Experiment 2: Patents with at least one patent reference
• Among all the patents with at least one patent reference:
  - Using EPO reference data, the A.S.E. method reaches 79% accuracy (N=1051)
  - Using USPTO reference data, the A.S.E. method reaches 82% accuracy (N=361)

Experiment 3: A.S.E. vs. multi-stage method
• Among the 361 US patents, 299 records were found in Lai, D’Amour and Fleming’s inventor dataset
• The A.S.E. method reaches 80% accuracy (on either EPO or USPTO data)
• The multi-stage name-matching method reaches 61% accuracy

Sensitivity analysis: Accuracy by Threshold, EPO vs. USPTO

Summary of results
• The findings do not completely support our hypothesis: the A.S.E. method performs only slightly better for US patents than for European patents
• The French Benchmark dataset has many singletons
• Did the EPO examiners do a very good job of reviewing each inventor's prior work?
• The A.S.E. method reaches a higher accuracy rate than the more elaborate multi-stage method
• Thus, our method works, but perhaps not for the reasons we think. Company benchmark data should be used to double-check the method in the future

Discussion
Advantages:
• Researchers using the A.S.E. method need worry less about mobility, because the algorithm is insensitive to changes of address and/or affiliation. The only thing A.S.E. captures is the trajectory of the knowledge footprint
• It is less time-consuming and needs fewer computational resources. The A.S.E. method requires only a few pieces of information, i.e. patent number, patent references and the popularity of the cited references
• A.S.E. does not use affiliation or co-inventors in the disambiguation, so these can be used to track mobility or collaboration

Discussion (cont.)
Negatives:
• The A.S.E. method can only be applied if an inventor's patent has at least one linkage with the rest of his or her patents. Patents with no references are automatically treated as singletons
• EPO examiners cite fewer references. Around 50% of the EPO patents in this study are singletons (vs. 5% in the USPTO)
  - In this experiment, even when these are included, the result still reaches nearly 80% accuracy, since many are in fact singletons in the French scientist data

Discussion (cont.)
Limitations:
• The A.S.E. method may not be able to link an inventor's patents if the inventor radically changes projects from one technical field to another (although if they shift over time, the method will capture this through a transitivity rule)
• Although the A.S.E. method requires fewer parameters, it might be hard to apply to an X million by X million matrix. Some level of simple classification could help
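The transitivity rule mentioned above can be sketched as follows: if patent P1 is linked to P2 and P2 to P3, all three go into one cluster even when P1 and P3 share no references. The slides do not say how the rule is implemented; the union-find approach, the function name `cluster_by_threshold`, and the scores below are assumptions chosen for illustration. Varying `threshold` also gives a rough feel for the kind of sensitivity analysis reported earlier.

```python
def cluster_by_threshold(ids, scores, threshold):
    """Group ids into clusters by linking every pair whose similarity
    score is at least `threshold`, then closing the links transitively.

    ids: list of patent ids.
    scores: dict mapping (id_a, id_b) -> similarity score.
    Uses a simple union-find; this is an assumed implementation of the
    transitivity rule, not the one from the presentation.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), s in scores.items():
        if s >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), set()).add(i)
    return sorted(sorted(c) for c in clusters.values())


# Illustrative scores only: P1 and P3 are never directly similar.
ids = ["P1", "P2", "P3", "P4"]
scores = {
    ("P1", "P2"): 0.6,
    ("P2", "P3"): 0.5,
    ("P1", "P3"): 0.0,
    ("P3", "P4"): 0.1,
}

print(cluster_by_threshold(ids, scores, 0.4))   # [['P1', 'P2', 'P3'], ['P4']]
print(cluster_by_threshold(ids, scores, 0.55))  # [['P1', 'P2'], ['P3'], ['P4']]
```

With the lower threshold, P3 joins P1 through P2, which is the gradual-shift case described in the first limitation. A radical change of field would leave no chain of links to follow, so the patents would stay in separate clusters.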

Thanks for your attention. Comments or suggestions?
