Untangling Names: Lessons learned (so far) from the linking of IPNI and TROPICOS
Julius Welby, RBG Kew, j.welby@kew.org
TROPICOS + IPNI
Why match?
Why is this difficult?
Variation • Calophyllum kiong K.Schum. & Lauterb. Fl. Deutsch. Sudsee, 450. • Calophyllum kiong Lauterb. & K.Schum. Die Flora der Deutschen Schutzgebiete in der Sudsee 1900
Duplication • Poa annua L. -- Sp. Pl. 68. 1753 (GCI) • Poa annua L. -- Species Plantarum 2 1753 (APNI) • Poa annua L. -- Sp. Pl. 68. (IK)
Duplication • Calophyllum microphyllum Scheff. in Tijdschr. Nederl. Ind. xxxii. (1871) 406. (IK) • Calophyllum microphyllum Planch. & Triana in Ann. Sc. Nat. Ser. IV. xv. (1861) 282. (IK) • Calophyllum microphyllum T.Anders. Fl. Brit. Ind. (J. D. Hooker). i. 272. (IK)
Fields
1  Calophyllum          Calophyllum
2  kiong                kiong
3  K.Schum. & Lauterb.  Lauterb. & K.Schum.
•  Fl. Deutsch. Sudsee  Die Flora der Deutschen…
•  450.                 1900
Lesson 1 Speed matters
Speed matters • 2,500 by 2,000 by 4 fields • 20,000,000 comparisons • ~5.5 hours at 1 ms per comparison
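The arithmetic behind this slide can be checked with a few lines of Python. The figures are the ones on the slide; the variable names are illustrative only.

```python
# Figures taken from the slide; names are illustrative, not from the framework.
names_a = 2_500           # records in the first dataset
names_b = 2_000           # records in the second dataset
fields = 4                # fields compared per candidate pair
ms_per_comparison = 1     # assumed cost of a single field comparison

comparisons = names_a * names_b * fields
hours = comparisons * ms_per_comparison / 1000 / 3600

print(f"{comparisons:,} comparisons, about {hours:.2f} hours")
```

Even two small datasets cost several hours when every field of every pair is compared, which is why the following slides concentrate on avoiding comparisons.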
Be lazy • Do as little as possible • Do easy things if possible • Do hard things only if necessary • Only expend effort when it’s worth it
Be lazy • Do as little as possible • Specify fields as ‘must match’ • If a ‘must match’ field fails • Mark the match as failed • Stop comparing fields
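A minimal sketch of the 'must match' idea, assuming records are plain dictionaries and fields are compared by strict equality. The function name `compare` and the record layout are assumptions made for illustration, not the framework's actual interface.

```python
def compare(rec_a, rec_b, fields, must_match):
    """Compare two records field by field.

    Returns (matched, fields_compared). As soon as a 'must match' field
    differs, the pair is marked as failed and no further fields are compared.
    """
    matched = True
    compared = 0
    for field in fields:
        compared += 1
        if rec_a[field] != rec_b[field]:
            matched = False
            if field in must_match:
                break  # stop comparing fields for this pair
    return matched, compared


kiong = {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."}
annua = {"genus": "Poa", "species": "annua", "authors": "L."}

result = compare(kiong, annua, ["genus", "species", "authors"], {"genus", "species"})
print(result)
```

Here the pair fails on the first field, so only one of the three comparisons is actually made.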
Parameterised matching • species • infragenus • infraspecies • authors • rank • …
Optimising • The order of field matching is important • Choose suitable fields to match first • Aim to fail matches early • Significant speed-up
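The effect of field order can be illustrated by counting comparisons on a toy sample built from the names on the earlier slides. This is a sketch under simple assumptions (strict equality, stop at the first failing field); `count_comparisons` is a hypothetical helper, and the best order for real data depends on how discriminating and how expensive each field is.

```python
NAMES = [
    {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."},
    {"genus": "Calophyllum", "species": "microphyllum", "authors": "Scheff."},
    {"genus": "Calophyllum", "species": "microphyllum", "authors": "Planch. & Triana"},
    {"genus": "Calophyllum", "species": "microphyllum", "authors": "T.Anders."},
    {"genus": "Poa", "species": "annua", "authors": "L."},
]


def count_comparisons(records_a, records_b, order):
    """Count field comparisons when each pair stops at its first failure."""
    count = 0
    for a in records_a:
        for b in records_b:
            for field in order:
                count += 1
                if a[field] != b[field]:
                    break  # fail early, skip the remaining fields
    return count


genus_first = count_comparisons(NAMES, NAMES, ["genus", "species", "authors"])
authors_first = count_comparisons(NAMES, NAMES, ["authors", "species", "genus"])
print(genus_first, authors_first)
```

In this sample most records share a genus, so testing genus first rarely fails early; starting with the field that differs most often needs fewer comparisons.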
Also, for speed • Do as little as possible • Do escaping or standardisation once • Done on import for each dataset • Keep field matching functions clean
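One way to read this slide: clean each value once when a dataset is loaded, so the matching functions never repeat that work inside the pairwise loop. The sketch below assumes dictionary records; the clean-up rules in `standardise` (trim, collapse whitespace, lower-case) are placeholders, not the rules the framework applies.

```python
import re


def standardise(value):
    """Illustrative clean-up only: trim, collapse whitespace, lower-case."""
    return re.sub(r"\s+", " ", value.strip()).lower()


def import_dataset(rows):
    """Standardise every field once, at import time."""
    return [{key: standardise(val) for key, val in row.items()} for row in rows]


def fields_equal(a, b):
    """The matching function stays trivial because its inputs are already clean."""
    return a == b


ipni = import_dataset([{"genus": " Calophyllum ", "authors": "K.Schum.  &  Lauterb."}])
tropicos = import_dataset([{"genus": "calophyllum", "authors": "K.Schum. & Lauterb."}])

print(fields_equal(ipni[0]["genus"], tropicos[0]["genus"]))
print(fields_equal(ipni[0]["authors"], tropicos[0]["authors"]))
```

With n and m records, the clean-up runs n + m times per field instead of up to n × m times.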
More speed optimisation • Do easy things if possible • Define cascading tests • Do easy tests first, if practical • Length comparisons • Composition comparisons
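A cascading test might look like the sketch below: exact equality, then a length comparison, then a character-composition comparison, and only then a more expensive similarity score. The thresholds and the use of `difflib.SequenceMatcher` for the final step are assumptions chosen for illustration; the slides do not say which fuzzy measure the framework uses.

```python
from collections import Counter
from difflib import SequenceMatcher


def cascade_match(a, b, max_len_diff=3, max_char_diff=4, threshold=0.8):
    """Return (matched, stage), where stage names the test that decided."""
    if a == b:
        return True, "exact"
    # Cheap test 1: strings of very different length cannot be close matches.
    if abs(len(a) - len(b)) > max_len_diff:
        return False, "length"
    # Cheap test 2: count the characters the two strings do not share.
    count_a, count_b = Counter(a), Counter(b)
    unshared = sum((count_a - count_b).values()) + sum((count_b - count_a).values())
    if unshared > max_char_diff:
        return False, "composition"
    # Expensive test, reached only by pairs that survived the cheap ones.
    score = SequenceMatcher(None, a, b).ratio()
    return score >= threshold, "similarity"


print(cascade_match("Fl. Deutsch. Sudsee", "Die Flora der Deutschen Schutzgebiete in der Sudsee"))
print(cascade_match("annua", "kiong"))
print(cascade_match("microphyllum", "microphylla"))
```

The first two pairs are rejected by the cheap tests, so the similarity score is computed only for the third.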
Speed Lessons • Speed matters • Minimise comparisons made • ‘Must match’ parameters • Match fields in an efficient order • Do data cleaning once, up front • Look for ways to fail matches cheaply
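Accuracy • False negatives (F-) • OK • False positives (F+)
Strict match • Yields false negatives (F-) and correct matches (OK)
Fuzzy match • Yields correct matches (OK) and false positives (F+)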
One approach • Currently, to get best results: • Tend towards strictness • Handle false negatives
One approach • Currently, best results come from: • Tending towards strictness • Handling false negatives • Failures on ‘rightmost’ fields can be written to a report • Checked and fed back in as escapes • Rerun
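The report-and-rerun loop could be sketched as follows. This assumes that an 'escape' is a human-confirmed equivalence between two field values, stored as a `(field, value_a, value_b)` tuple; that reading, and the function name `match_with_report`, are assumptions rather than details given in the slides.

```python
def match_with_report(rec_a, rec_b, fields, escapes):
    """Strict match in field order.

    Returns (matched, report_entry). A pair that fails only on the last
    ('rightmost') field is a near miss and is returned for manual checking.
    """
    last = len(fields) - 1
    for i, field in enumerate(fields):
        a, b = rec_a[field], rec_b[field]
        if a == b or (field, a, b) in escapes:
            continue
        return False, ((field, a, b) if i == last else None)
    return True, None


fields = ["genus", "species", "authors"]
ipni = {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."}
tropicos = {"genus": "Calophyllum", "species": "kiong", "authors": "Lauterb. & K.Schum."}

escapes = set()
first_run = match_with_report(ipni, tropicos, fields, escapes)

# After a person confirms the reported pair, feed it back in and rerun.
escapes.add(first_run[1])
second_run = match_with_report(ipni, tropicos, fields, escapes)
print(first_run, second_run)
```

Strict matching keeps false positives out, while the report turns near-miss false negatives into a short list that can be reviewed once and reused on later runs.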
Predictable variation • Gendered endings • Common alternatives • Endings: • ii, i • iae, ae • Dataset-specific quirks: • &, &
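The two ending pairs on this slide can be handled by normalising epithets before comparison. The sketch below covers only those two rules; the rule table, the function name and the sample epithets are illustrative, and gendered endings or dataset-specific quirks would need rules of their own.

```python
# (suffix, replacement) pairs; the longer suffix is tested first.
ENDING_RULES = [("iae", "ae"), ("ii", "i")]


def normalise_ending(epithet):
    """Map -ii to -i and -iae to -ae so that both spellings compare equal."""
    for suffix, replacement in ENDING_RULES:
        if epithet.endswith(suffix):
            return epithet[: -len(suffix)] + replacement
    return epithet


# Sample epithets chosen for illustration only.
for epithet in ["wilsonii", "wilsoni", "wilsoniae", "wilsonae", "kiong"]:
    print(epithet, "->", normalise_ending(epithet))
```

Applied once on import, this lets a strict comparison treat both spellings as equal. The trade-off is that any two epithets differing only in these endings will now match, so the rule list should stay short and deliberate.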
The framework • Python • Psyco • Modular • Extensible • In progress • More details will be available on the TDWG website • Source code availability
The framework • Some results (HTML)
Thanks to • Bob Magill • Sally Hinchcliffe • The Moore Foundation • Contact: • j.welby@kew.org • or after Jan 2007: julius.welby@gmail.com