Swedish inventors - matching to registers and descriptive data. Presentation at APE-INV, Brussels, September 5th 2011. Lina Ahlin and Olof Ejermo, lina.ahlin@circle.lu.se, olof.ejermo@circle.lu.se.
Swedish inventors - matching to registers and descriptive data
Presentation at APE-INV, Brussels, September 5th 2011
Lina Ahlin and Olof Ejermo
lina.ahlin@circle.lu.se, olof.ejermo@circle.lu.se
CIRCLE, Centre for Innovation, Research and Competence in the Learning Economy
Lund University, P.O. Box 117, SE-221 00 Lund, Sweden
On the agenda
• What is so special about Swedish data?
• 1st matching
• 2nd matching
• Future: how to reach a 100% match rate?
• (Results)
Linking inventors to registers
• EPO patent applications 1978-2009 for inventors with addresses in Sweden
• Matching done on name-home address combinations
• Problem 1: different inventors may have the same name
• Problem 2: addresses may be old
• How to verify person identity and connect to Swedish register data?
Swedish data
Q: What makes Swedish data so exciting (and why do we want a high match rate)?
A: Through Statistics Sweden it is possible to connect individuals to register data, which connects several levels of information relevant for innovation studies:
• Individual level: field/level of education, age, income, gender, workplace
• Regions: workplace, home municipality
• Sectoral level: sectors, firm size, level of R&D
... can give a multifaceted view of innovation, but a personal identifier, the ”personnummer”, is needed to do this, e.g. 19500131-3422: birth date Jan 31st, 1950; even number (second-to-last digit) = female.
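As a minimal sketch of how the identifier in the example decomposes, the snippet below splits a personnummer into birth date and gender. The function name `parse_personnummer` is hypothetical, and the sketch assumes the 12-digit YYYYMMDD-NNNN form shown on the slide. It does not validate the check digit.

```python
from datetime import date

def parse_personnummer(pnr: str) -> dict:
    """Split a personnummer written as YYYYMMDD-NNNN into its parts."""
    digits = pnr.replace("-", "")
    if len(digits) != 12 or not digits.isdigit():
        raise ValueError(f"expected YYYYMMDD-NNNN, got {pnr!r}")
    birth = date(int(digits[0:4]), int(digits[4:6]), int(digits[6:8]))
    # The second-to-last digit encodes gender: even = female, odd = male.
    gender = "female" if int(digits[10]) % 2 == 0 else "male"
    return {"birth_date": birth, "gender": gender}

info = parse_personnummer("19500131-3422")
print(info["birth_date"].isoformat(), info["gender"])  # 1950-01-31 female
```

Only the birth-date part is needed for the later birth-date matching steps; the last four digits are what turns a birth date into a unique identifier.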
1st matching (Oct-Dec 2010)
• All Swedes (incl. personnummer) are listed in the address register ”SPAR”
• Matching of addresses through InfoTorg, which stores addresses/address changes for the latest 3 years; addition of personnummer
• Individuals under 16 not matched
• Old patents added under the assumption that an identical name and address in the register and on the patent identify the same person:
Sven Ivar Johanson, Storgatan 1, 111 00 Stockholm (register 2008-2010) = Sven Ivar Johanson, Storgatan 1, 111 00 Stockholm (patent applied for in 1992)
Match rate: 64% of inventor-patent pairs, ranging from a low of 23% in 1978 to a peak of 93% in 2008. The variation is due to the mobility of inventors: addresses on older patents are less likely to still appear in the register.
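The exact-match assumption and the match rate over inventor-patent pairs can be illustrated with a small sketch. The records and the helper names (`key`, `match_rate_by_year`) are invented for illustration; this is not the InfoTorg procedure itself.

```python
from collections import defaultdict

def key(name: str, address: str) -> str:
    """Normalise case and whitespace so trivially different spellings compare equal."""
    return " ".join(f"{name} {address}".lower().split())

def match_rate_by_year(pairs, register):
    """pairs: (year, name, address) per inventor-patent pair.
    register: dict mapping name-address key -> personal identifier."""
    matched, total = defaultdict(int), defaultdict(int)
    for year, name, address in pairs:
        total[year] += 1
        if key(name, address) in register:
            matched[year] += 1
    return {year: matched[year] / total[year] for year in total}

# Hypothetical data: one register entry, three inventor-patent pairs.
register = {key("Sven Ivar Johanson", "Storgatan 1, 111 00 Stockholm"): "id-A"}
pairs = [
    (1992, "Sven Ivar Johanson", "Storgatan 1,  111 00 Stockholm"),
    (1992, "Anna Berg", "Lillgatan 2, 222 22 Lund"),
    (2008, "Sven Ivar Johanson", "Storgatan 1, 111 00 Stockholm"),
]
print(match_rate_by_year(pairs, register))  # {1992: 0.5, 2008: 1.0}
```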
InfoTorg returned a 56% match rate
• Manual check (visual, no robot): +8%
64% match rate
1985-2005: present access to individual registers at Statistics Sweden
2006-2009: additions as of Sep. 30th 2011
2nd matching (April-Sep 2011)
• Use public access to registers (Swedish genealogical association)
• CDs of the Swedish population (1980)/1990, published by the association, with old addresses and birth dates
• CD ”Book of the dead” 1901-2009: address at death + personnummer
• Match birth date + name to personnummer using a service by InfoTorg or online sources
Methodology
• Extract data from the Swedish death book and the Swedish genealogy records for 1990 (to some extent also 1980) on all individuals in the population, by letter
• Generate a variable containing name, address and postal address for all individuals in the population as well as for inventors who are not fully matched
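A minimal sketch of the second step, assuming each record carries a name, a street address and a postal address, and assuming that ”by letter” means grouping by the initial letter of the name (the slides do not spell this out). The field and function names are hypothetical.

```python
from collections import defaultdict

def name_address(record: dict) -> str:
    """Concatenate name, address and postal address into one comparison string."""
    return ", ".join([record["name"], record["address"], record["postal"]])

def block_by_letter(records):
    """Group comparison strings by the first letter of the name (assumed blocking rule)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["name"][0].upper()].append(name_address(rec))
    return dict(blocks)

# Hypothetical records, for illustration only.
population = [
    {"name": "Sven Ifwar Johanson", "address": "Storgatan 1", "postal": "111 00 Stockholm"},
    {"name": "Anna Berg", "address": "Lillgatan 2", "postal": "222 22 Lund"},
]
print(block_by_letter(population)["S"])
# ['Sven Ifwar Johanson, Storgatan 1, 111 00 Stockholm']
```

Grouping of this kind keeps the number of pairwise string comparisons manageable, since inventors are only compared with population records in the same group.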
Normalized Levenshtein (”strgroup”) in Stata
• An example of the ”name-address” string: ”Sven Ivar Johanson, Storgatan 1, 111 00 Stockholm” (from EPO) = ”Sven Ifwar Johanson, Storgatan 1, 111 00 Stockholm” (from Swedish population 1990)
• Replace/insert 3 letters to make the strings equal
• Divide by the length of the shortest string (48): 3/48 = 0.0625 (= a good hit)
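The normalization can be sketched in a few lines of Python. This is a stand-in for illustration, not a reimplementation of Stata's strgroup; the function names are hypothetical. Applied to the two strings exactly as typed on the slide, a plain character-level count gives 2 edits over 49 characters (about 0.04), slightly below the slide's 3/48, whose figures may rest on slightly different underlying strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the length of the shorter string."""
    shorter = max(1, min(len(a), len(b)))  # guard against empty strings
    return levenshtein(a, b) / shorter

epo = "Sven Ivar Johanson, Storgatan 1, 111 00 Stockholm"
pop = "Sven Ifwar Johanson, Storgatan 1, 111 00 Stockholm"
print(levenshtein(epo, pop), round(normalized_levenshtein(epo, pop), 4))
```

Values close to 0 indicate near-identical strings; the cut-off for a ”good hit” is a judgment call and is not fixed by this sketch.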
Adding date of birth
• 1990 Levenshtein names & addresses
• 1990 Levenshtein unique names
• Levenshtein from CD dead 1901-2009: names and addresses
• Strgroup: similarity on name-address hits 1-3
• Some manual additions and minor changes
• 1980 Levenshtein names and addresses (letters D&H)
Methodology: continued
• Manually examine each match to see whether the Levenshtein command has matched correctly
• Some hits discarded, incl. ambiguous name match hits
Adding personnummer (ongoing)
New match rate 80%, but not full personnummer. What to do?
1. Use the date-of-birth part of the personal number for fully matched inventors
2. Join all possible combinations of birth dates for those fully matched and those with only birth dates
3. Run Levenshtein distance on inventor names
4. Small Levenshtein distance: accept that the inventors are the same, since name and birth date match
5. Large Levenshtein distance: reject
6. Further, manually check remaining inventors. Look at addresses for further confirmation if uncertain.
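The join-and-compare logic above can be sketched as follows. The thresholds `accept` and `reject`, the record layout and the identifiers are assumptions made for illustration; the slides do not state which cut-offs were used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def classify(unmatched, matched, accept=0.15, reject=0.30):
    """Pair records sharing a birth date, then accept, reject or flag for manual check.
    unmatched: (name, birth_date); matched: (name, birth_date, identifier)."""
    decisions = []
    for name_u, born_u in unmatched:
        for name_m, born_m, ident in matched:
            if born_u != born_m:
                continue  # only join records with the same birth date
            dist = levenshtein(name_u.lower(), name_m.lower())
            score = dist / max(1, min(len(name_u), len(name_m)))
            if score <= accept:
                verdict = "accept"
            elif score >= reject:
                verdict = "reject"
            else:
                verdict = "manual check"
            decisions.append((name_u, name_m, ident, verdict))
    return decisions

# Hypothetical records; "id-A" and "id-B" stand in for full personal numbers.
matched = [("Sven Ivar Johanson", "1950-01-31", "id-A"),
           ("Karin Lindqvist", "1950-01-31", "id-B")]
unmatched = [("Sven Ifwar Johanson", "1950-01-31")]
for row in classify(unmatched, matched):
    print(row)
```

Leaving a band between the two thresholds mirrors the last step of the list: pairs that are neither clearly the same nor clearly different go to manual inspection.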
Adding personnummer ctd.
• Use the death book for the years 1975-2009. Use the date-of-birth part of personal numbers
• Re-run steps 2-6 on the previous slide
Adding personnummer ctd.
Problem: not all inventors were previously identified, so the 4 last digits are missing
Two options to get full personal numbers from birth dates:
• Use InfoTorg again with name + added parameter ”birthdate”
• Manually add the four last digits by using an internet service (www.upplysning.se)
Some matching problems
• Difficult to match individuals who change last names (mainly women), or who have common names and move a lot.
• Two people with the same name can live at the same address (e.g. a father names his son after himself), so the wrong person may be matched. If detected, the oldest person is chosen.
• For inventors affiliated with some firms (e.g. AstraZeneca), the company address is given
Towards 100%
• Idea: scoring methods based on identified inventors
  • Name
  • Identified co-inventors
  • Technology class
  • City
  • Postal code
• Which algorithm?
• Statistics Sweden for validating the parent/child name similarity problem?
• Use the 1980 population CD?
• Strategy of focusing on highly productive unmatched inventors?
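One possible shape of such a scoring method is a weighted sum over the listed attributes. The sketch below is only an illustration: the weights, the field names and the use of Jaccard overlap are all assumptions, and the slides leave the choice of algorithm open.

```python
# Hypothetical weights; not taken from the presentation.
WEIGHTS = {"name": 0.40, "co_inventors": 0.25, "tech_class": 0.15,
           "city": 0.10, "postal_code": 0.10}

def jaccard(a: set, b: set) -> float:
    """Share of common elements; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def score(candidate: dict, identified: dict) -> float:
    """Weighted similarity between an unmatched record and an identified inventor."""
    parts = {
        "name": float(candidate["name"].lower() == identified["name"].lower()),
        "co_inventors": jaccard(candidate["co_inventors"], identified["co_inventors"]),
        "tech_class": jaccard(candidate["tech_classes"], identified["tech_classes"]),
        "city": float(candidate["city"] == identified["city"]),
        "postal_code": float(candidate["postal_code"] == identified["postal_code"]),
    }
    return sum(WEIGHTS[k] * v for k, v in parts.items())

# Hypothetical records, for illustration only.
identified = {"name": "Sven Ivar Johanson", "co_inventors": {"A", "B"},
              "tech_classes": {"H04L"}, "city": "Stockholm", "postal_code": "111 00"}
candidate = {"name": "Sven Ivar Johanson", "co_inventors": {"A"},
             "tech_classes": {"H04L"}, "city": "Stockholm", "postal_code": "112 00"}
print(score(candidate, identified))
```

In practice the weights would more plausibly be estimated from the already identified inventors than set by hand, which is presumably what ”based on identified inventors” points to.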
Patent distribution in manufacturing (share of total patenting)
Sectors, SNI92 codes, # inventors, contribution 2004-2005.
* ”Contribution” counts patent fractions, which adjusts for co-inventorship.
** ”Academia” can also in a few cases be found in the sectors R&D in technical and natural sciences (73101-73104) and technical testing and analysis (74300).
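Fractional counting of this kind can be sketched as below, assuming each patent counts as one unit shared equally among its listed inventors (the usual convention; the slides do not state the exact rule). The function name and the data are hypothetical.

```python
from collections import defaultdict

def fractional_contribution(patents):
    """patents: iterable of inventor lists, one list per patent.
    Each patent counts as 1, shared equally among its inventors."""
    contribution = defaultdict(float)
    for inventors in patents:
        for inventor in inventors:
            contribution[inventor] += 1 / len(inventors)
    return dict(contribution)

# Hypothetical data: three patents with two, one and four inventors.
patents = [["A", "B"], ["A"], ["A", "B", "C", "D"]]
print(fractional_contribution(patents))
# {'A': 1.75, 'B': 0.75, 'C': 0.25, 'D': 0.25}
```

Summing these fractions over the inventors in a sector gives a sector total in which each patent is counted once, however many co-inventors it has.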
The most important patenting academic institutions 2004-2005