
Untangling Names

Lessons learned (so far) from the linking of IPNI and TROPICOS. Julius Welby, RBG Kew, j.welby@kew.org.

Presentation Transcript


  1. Untangling Names: Lessons learned (so far) from the linking of IPNI and TROPICOS • Julius Welby, RBG Kew • j.welby@kew.org

  2. TROPICOS + IPNI

  3. Why match?

  4. Why is this difficult?

  5. Variation • Calophyllum kiong K.Schum. & Lauterb. -- Fl. Deutsch. Sudsee, 450. • Calophyllum kiong Lauterb. & K.Schum. -- Die Flora der Deutschen Schutzgebiete in der Sudsee 1900

  6. Duplication • Poa annua L. -- Sp. Pl. 68. 1753 (GCI) • Poa annua L. -- Species Plantarum 2 1753 (APNI) • Poa annua L. -- Sp. Pl. 68. (IK)

  7. Duplication • Calophyllum microphyllum Scheff in Tijdschr. Nederl. Ind. xxxii. (1871) 406. (IK) • Calophyllum microphyllum Planch. & Triana in Ann. Sc. Nat. Ser. IV. xv. (1861) 282. (IK) • Calophyllum microphyllum T.Anders. Fl. Brit. Ind. (J. D. Hooker). i. 272. (IK)

  8. Matching

  9. Fields (record A | record B) • 1: Calophyllum | Calophyllum • 2: kiong | kiong • 3: K.Schum. & Lauterb. | Lauterb. & K.Schum. • 4: Fl. Deutsch. Sudsee | Die Flora der Deutschen… • 5: 450. | 1900

  10. Lesson 1 Speed matters

  11. Speed matters • 2,500 by 2,000 by 4 fields • 20,000,000 comparisons • ~5.5 hours at 1 ms per comparison
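
The arithmetic on this slide can be reproduced with a short sketch. The variable names are illustrative and not taken from the talk; the figures are the ones quoted on the slide.

```python
# Back-of-envelope cost of naive all-against-all matching.
records_a = 2_500               # names in one dataset
records_b = 2_000               # names in the other dataset
fields = 4                      # fields compared for every pair
seconds_per_comparison = 0.001  # 1 ms per field comparison

comparisons = records_a * records_b * fields
hours = comparisons * seconds_per_comparison / 3600

print(f"{comparisons:,} comparisons, roughly {hours:.1f} hours")
```

Even for datasets this small the naive approach takes hours, which is why the following slides concentrate on avoiding comparisons altogether.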

  12. Be lazy • Do as little as possible • Do easy things if possible • Do hard things only if necessary • Only expend effort when it’s worth it

  13. Be lazy • Do as little as possible • Specify fields as ‘must match’ • If a ‘must match’ field fails • Mark the match as failed • Stop comparing fields
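
A minimal sketch of the 'must match' idea, assuming records are plain dictionaries and that fields are compared by simple equality. The function name `compare_records` and the field names are hypothetical and are not taken from the framework described in the talk.

```python
def compare_records(a, b, fields, must_match):
    """Compare two records field by field.

    Returns (ok, results). `ok` is False as soon as a 'must match' field
    differs, and the remaining fields are never compared. `ok` being True
    only means that no 'must match' field failed; `results` records which
    of the compared fields were equal.
    """
    results = {}
    for field in fields:
        same = a[field] == b[field]
        results[field] = same
        if not same and field in must_match:
            return False, results  # mark as failed and stop comparing
    return True, results


fields = ["genus", "species", "authors"]
must_match = {"genus", "species"}

kiong_a = {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."}
kiong_b = {"genus": "Calophyllum", "species": "kiong", "authors": "Lauterb. & K.Schum."}
poa = {"genus": "Poa", "species": "annua", "authors": "L."}

ok_far, results_far = compare_records(kiong_a, poa, fields, must_match)
ok_near, results_near = compare_records(kiong_a, kiong_b, fields, must_match)
```

For the Calophyllum/Poa pair only the genus is ever compared. For the two Calophyllum kiong records all three fields are compared, and the differing author strings are left for a later, more careful test.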

  14. Parameterised matching • species • infragenus • infraspecies • authors • rank • …

  15. How lazy?

  16. Optimising • The order of field matching is important • Choose suitable fields to match first • Aim to fail matches early • Significant speed-up
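
To illustrate why field order matters, the sketch below counts field comparisons for the same tiny dataset under two orderings, stopping each pair at its first mismatch. The records, the `rank` values, and the function name `count_comparisons` are made up for illustration.

```python
def count_comparisons(left, right, field_order):
    """Count field comparisons when every field must match and each
    pair of records is abandoned at its first mismatching field."""
    count = 0
    for a in left:
        for b in right:
            for field in field_order:
                count += 1
                if a[field] != b[field]:
                    break  # fail early: skip the remaining fields
    return count


left = [
    {"rank": "sp.", "genus": "Calophyllum", "species": "kiong"},
    {"rank": "sp.", "genus": "Poa", "species": "annua"},
]
right = [
    {"rank": "sp.", "genus": "Calophyllum", "species": "microphyllum"},
    {"rank": "sp.", "genus": "Poa", "species": "annua"},
]

# 'rank' is identical everywhere here, so testing it first never rejects a pair.
rank_first = count_comparisons(left, right, ["rank", "genus", "species"])
# 'species' differs for most pairs, so testing it first rejects them immediately.
species_first = count_comparisons(left, right, ["species", "genus", "rank"])
```

On this toy data the rank-first ordering needs 10 field comparisons and the species-first ordering needs 6. The gap grows with the size of the datasets and with how rarely the leading field matches.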

  17. Also, for speed • Do as little as possible • Do escaping or standardisation once • Done on import for each dataset • Keep field matching functions clean
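
One way to read this slide is sketched below: every value is standardised once when a dataset is imported, so the per-pair matching function can stay a bare comparison. The particular clean-up rules (trimming, collapsing whitespace, lower-casing) are assumptions chosen for illustration, not the rules used in the actual framework.

```python
def standardise(value):
    """Illustrative clean-up: trim, collapse internal whitespace, lower-case."""
    return " ".join(value.split()).lower()


def import_dataset(rows, fields):
    """Standardise every field once, at import time."""
    return [{field: standardise(row[field]) for field in fields} for row in rows]


def fields_equal(a, b, field):
    """The matching function stays trivial because the data is already clean."""
    return a[field] == b[field]


raw = [{"genus": " Calophyllum ", "species": "Kiong", "authors": "K.Schum.  &  Lauterb."}]
imported = import_dataset(raw, ["genus", "species", "authors"])
```

With 2,500 by 2,000 records, cleaning inside the matching function would repeat the same work millions of times; cleaning on import does it once per value.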

  18. More speed optimisation • Do easy things if possible • Define cascading tests • Do easy tests first, if practical • Length comparisons • Composition comparisons
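
A possible cascade is sketched below, assuming the 'hard' test is a Levenshtein edit distance with a small threshold. The talk does not say which fuzzy measure is used, so `edit_distance`, `fuzzy_equal`, and `max_edits` are hypothetical. The length and composition checks are cheap tests that can only reject pairs the expensive test would also reject.

```python
from collections import Counter


def edit_distance(s, t):
    """Plain Levenshtein distance (the expensive test)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_equal(s, t, max_edits=1):
    """Cascade: cheap rejections first, edit distance only if still undecided."""
    if s == t:
        return True
    # Length test: each edit changes the length by at most one character.
    if abs(len(s) - len(t)) > max_edits:
        return False
    # Composition test: each edit changes the multiset of characters by at
    # most two (one character removed, one added).
    cs, ct = Counter(s), Counter(t)
    if sum(((cs - ct) + (ct - cs)).values()) > 2 * max_edits:
        return False
    return edit_distance(s, t) <= max_edits
```

Here `fuzzy_equal("kiong", "microphyllum")` is rejected by the length test and `fuzzy_equal("kiong", "annua")` by the composition test, so the edit distance is computed only for strings that are already plausibly close.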

  19. Speed Lessons • Speed matters • Minimise comparisons made • ‘Must match’ parameters • Match fields in an efficient order • Do data cleaning once, up front • Look for ways to fail matches cheaply

  20. Accuracy

  21. Accuracy • False negatives (F-) • OK • False positives (F+)

  22. Strict match • F- • OK

  23. Fuzzy match • OK • F+

  24. Doughnut of uncertainty

  25. Lesson 2: Look at near misses

  26. Near misses are checkable

  27. One approach • Currently, to get best results: • Tend towards strictness • Handle false negatives

  28. One approach • Currently, best results from: • Tend towards strictness • Handle false negatives • Failures on ‘rightmost’ fields can be written to a report • Checked and fed back in as escapes • Rerun
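
The report-and-rerun loop might look like the sketch below. It assumes that a 'rightmost' failure means every earlier field matched and only the last field in the matching order differed, and that an 'escape' is a pair of values a human checker has confirmed as equivalent. Both readings, and the names `classify` and `escapes`, are assumptions rather than details given in the talk.

```python
def classify(a, b, fields, escapes):
    """Return 'match', 'near miss' or 'fail' for a pair of records.

    `escapes` maps a field name to a set of (value_a, value_b) pairs that
    have already been checked by hand and accepted as equivalent.
    """
    for i, field in enumerate(fields):
        pair = (a[field], b[field])
        if pair[0] == pair[1] or pair in escapes.get(field, set()):
            continue
        # Only a failure on the last (rightmost) field goes to the report.
        return "near miss" if i == len(fields) - 1 else "fail"
    return "match"


fields = ["genus", "species", "authors"]
a = {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."}
b = {"genus": "Calophyllum", "species": "kiong", "authors": "Lauterb. & K.Schum."}

first_run = classify(a, b, fields, escapes={})

# Suppose a checker reviews the report and accepts the two author strings
# as the same; the pair is fed back in as an escape and the match is rerun.
escapes = {"authors": {("K.Schum. & Lauterb.", "Lauterb. & K.Schum.")}}
second_run = classify(a, b, fields, escapes)
```

The strict comparison stays strict; the only thing that changes between runs is the list of hand-checked exceptions.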

  29. Lesson 3: Remove predictable variation

  30. Predictable variation • Gendered endings • Common alternatives • Endings: • ii, i • iae, ae • Dataset specific quirks: • &, &
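
The two ending pairs listed on this slide can be collapsed with a small normalising function, sketched below. The function name and the example epithets are made up for illustration, and gendered endings and dataset-specific quirks are not handled here.

```python
def normalise_epithet(epithet):
    """Collapse the ending pairs from the slide: ii/i and iae/ae."""
    if epithet.endswith("iae"):
        return epithet[:-3] + "ae"
    if epithet.endswith("ii"):
        return epithet[:-2] + "i"
    return epithet


# Spelling variants of the same (hypothetical) epithet normalise to one form.
variants = ["hookerii", "hookeri"]
normalised = {normalise_epithet(v) for v in variants}
```

Applied once on import, a rule like this lets a strict equality test treat both spellings as the same epithet without resorting to fuzzy matching.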

  31. The framework • Python • Psyco • Modular • Extensible • In progress • More details will be available on the TDWG website • Source code availability

  32. The framework • Some results (HTML)

  33. Thanks to • Bob Magill • Sally Hinchcliffe • The Moore Foundation • Contact: • j.welby@kew.org • or after Jan 2007: julius.welby@gmail.com
