Enhancing Catalog Understanding with Authify at ADLUG Annual Conference

Andrea Gazzarini, Software Architect at @Cult, presented how Authify improves catalog understanding at the 36th ADLUG Annual Conference.

Presentation Transcript


  1. Hey you! What are you doing in my records? Using @Cult Authify for a better understanding of your catalog. Andrea Gazzarini, Software Architect, @Cult. 36th ADLUG International Annual Conference, Fundación San Pablo Andalucía, 27 – 29 September 2017

  2. Andrea Gazzarini, Software Architect, @Cult
     https://www.linkedin.com/in/andreagazzarini
     https://twitter.com/agazzarini
     http://www.atcult.it
     http://andreagazzarini.blogspot.it
     https://github.com/agazzarini
     http://www.slideshare.net/AndreaGazzarini
     http://people.apache.org/map.html?person=agazzarini
     https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-essentials

  3. What is Authify?
     • Authify is a RESTful module that provides search and detection services. It is not a frontend tool; it is a service.
     • The project originally started as a way to overcome some limitations of the public VIAF Web API.
     • The VIAF APIs, being part of a public, free-of-charge service, are not meant to be invoked massively; for use cases that need that volume, the VIAF project provides a download of the whole dataset.
     • On top of that, we were sometimes unhappy with the VIAF results: what we were looking for was not returned in the first position among the matches.
     • The project we were working on had to manage millions of records: even assuming an optimistic average of 2 entities per record (one name and one title), that would result in millions of invocations.
     • That was the main reason we started implementing Authify: download, index and store the VIAF clusters dataset and provide, on top of it, powerful full-text and bibliographic search services.

  4. VIAF API: FAO / Mozart

  5. Authify: FAO / Mozart

  6. Authify: cluster search service
     • The Authify cluster search service provides, as the name suggests, a full-text search service over name and work clusters.
     • A cluster is a group of variant forms associated with a given entity. For example, a name cluster is a group that contains all available name headings for a given “name” entity (e.g. a person or a corporate body).
     • Technically, Authify is composed of two main parts: a Solr infrastructure, which indexes the datasets and provides search services on top of them, and a logic layer, which orchestrates those search services in order to (try to) find a match, as precise as possible, within the clusters.
     • The so-called invisible queries approach makes everything transparent to the caller: for a single search request, the system executes a chain of different search strategies, each with a different priority; the first strategy that produces a result populates the returned response.
     • What is the goal of all this? Starting from a bibliographic record, a system can detect name and work headings and then ask Authify: “can you tell me if a cluster exists for this heading?”
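The invisible queries approach described on this slide can be illustrated with a minimal Python sketch. It is not Authify's implementation: the in-memory `CLUSTERS` dictionary stands in for the Solr index, the two strategy functions and the `"full-text"` label are hypothetical, and only the `name::headings-exact-match` label and the cluster data come from the slides.

```python
# Hypothetical in-memory stand-in for the Solr index: cluster id -> headings.
CLUSTERS = {
    "51714577": ["Meyer, Bertrand, 1950-....", "Bertrand Meyer", "Meyer, Bertrand"],
}


def _words(text):
    """Lower-case the text and split it into words, ignoring commas."""
    return set(text.lower().replace(",", " ").split())


def exact_match(query):
    """Highest priority: the query equals one of the cluster headings."""
    return [cid for cid, headings in CLUSTERS.items() if query in headings]


def token_match(query):
    """Lower priority: every query word appears in some heading, in any order."""
    wanted = _words(query)
    return [
        cid
        for cid, headings in CLUSTERS.items()
        if any(wanted <= _words(heading) for heading in headings)
    ]


# Strategies in priority order; the second label is an assumption.
CHAIN = [
    ("name::headings-exact-match", exact_match),
    ("full-text", token_match),
]


def search(query):
    """Run the chain; the first strategy with a result fills the response."""
    for label, strategy in CHAIN:
        matches = strategy(query)
        if matches:
            return {"matching-strategy": label, "matches": matches}
    return {"matching-strategy": None, "matches": []}
```

The caller issues one `search(...)` call and never sees which strategies were tried, only the label of the one that matched.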

  7. Authify: Request / Response
     Request
       http://labs.atcult.it/authify/names?q=Bertrand Meyer
     Response
       "responseHeader" : { "matching-strategy" : "name::headings-exact-match" },
       "response" : {
         "matches" : [ {
           "id" : "51714577",
           "type" : "Personal",
           "uri" : "http://viaf.org/viaf/51714577/",
           "headings" : [ "Meyer, Bertrand, 1950-....", "Bertrand Meyer", "Meyer, Bertrand" ],
           "sources" : [ "BNF|12079479", "DNB|112127843", …
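As a rough illustration of how a client might use this endpoint, the Python sketch below builds the request URL and reads the fields shown on the slide. It makes no network call. The `SAMPLE_RESPONSE` string is an abbreviated stand-in: it contains only fields visible on the slide, the enclosing braces are assumed, and the truncated `sources` list is left out. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Endpoint as shown on the slide.
BASE_URL = "http://labs.atcult.it/authify/names"

# Abbreviated stand-in for a response body (see the assumptions above).
SAMPLE_RESPONSE = """
{
  "responseHeader": {"matching-strategy": "name::headings-exact-match"},
  "response": {
    "matches": [
      {
        "id": "51714577",
        "type": "Personal",
        "uri": "http://viaf.org/viaf/51714577/",
        "headings": ["Meyer, Bertrand, 1950-....", "Bertrand Meyer", "Meyer, Bertrand"]
      }
    ]
  }
}
"""


def build_request_url(heading):
    """URL-encode the heading so that spaces are safe in the query string."""
    return f"{BASE_URL}?{urlencode({'q': heading})}"


def summarize(payload):
    """Return (cluster id, VIAF URI, matching strategy) for each match."""
    doc = json.loads(payload)
    strategy = doc["responseHeader"]["matching-strategy"]
    return [(m["id"], m["uri"], strategy) for m in doc["response"]["matches"]]
```

The `matching-strategy` value tells the caller which step of the search chain produced the match, which is useful when deciding how much to trust it.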

  8. Clusters search service: why?
     Record 1: 100 $aA.B. Normal$cMr. → “Mr. A.B. Normal” → AUTHIFY → Cluster ID: #992
     Record 2: 100 $aABNormal$cMr. → “Mr. ABNormal” → AUTHIFY → Cluster ID: #992
     ...so Mr. A.B. Normal and Mr. ABNormal seem to be the same person!

  9. Authify: cluster search service
     • Authify provides a different search logic for each managed entity. The system has been built with extensibility in mind, so the chain described below is fully configurable.
     • Subfields matching (names only): the Authify query language allows the caller to decompose the input heading into subfields, which is the actual structure of the heading in the source records.
     • Heading exact match (names and works): the system gives priority to exact heading matches.
     • Full-text search (names and works): a regular full-text search, which takes into account proximity search for names (e.g. Bertrand Meyer = Meyer Bertrand) and special detection for some entities (e.g. birth and death dates).
     • Initials (names only): as a last resort, the system executes a search by “initials”, in order to find a valid match in those cases where the input string (or the indexed heading) contains the name in its short form. As with the previous point, this could lead to a response with lower precision.
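Two of the steps above, order-insensitive name matching and the “initials” fallback, can be sketched in a few lines of Python. This is only an illustration under simple assumptions (names are compared as bags of letter-only tokens, and initials are matched greedily); it is not how Authify or its Solr configuration implements them, and both function names are hypothetical.

```python
import re


def tokens(heading):
    """Letter-only, lower-cased tokens; punctuation and dates are dropped."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", heading)]


def same_tokens(a, b):
    """Order-insensitive comparison: 'Bertrand Meyer' equals 'Meyer Bertrand'."""
    return sorted(tokens(a)) == sorted(tokens(b))


def initials_match(short, full):
    """Each token of `short` must pair with a distinct token of `full`,
    either as the whole word or, for one-letter tokens, as its initial."""
    remaining = tokens(full)
    for tok in tokens(short):
        hit = next(
            (r for r in remaining
             if r == tok or (len(tok) == 1 and r.startswith(tok))),
            None,
        )
        if hit is None:
            return False
        remaining.remove(hit)
    return True
```

The lower precision mentioned on the slide shows up directly here: `initials_match("B. Meyer", "Meyer, Bruno")` is just as true as `initials_match("B. Meyer", "Meyer, Bertrand, 1950-....")`, which is why this step belongs at the end of the chain.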

  10. Authify: relator terms detection
      • Goal: starting from a MARC record, Authify analyses the tags that contain a name and, for each of them, tries to figure out the role of that entity within the work represented by the record.
      • Laziness: if the record / tag already provides a role for a given entity (e.g. $eaut), Authify just skips the entity, assuming that the role has been authoritatively assigned.
      • No AI involved (not yet): the whole detection process is executed using plain text matching algorithms (full-text, token, shingle and n-gram analysis), which operate on top of a set of rules coming from our functional experts.
      • Source data: the detection process uses the information found within the tag, specifically the name, and the statement of responsibility.
      • It is using Authify itself! Otherwise, how could we reconcile the different forms of the same name in the statement of responsibility?
      • No NER (again): we will introduce it later as the last step in the processing chain.
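A small Python sketch can make the laziness rule and the shingle analysis more concrete. The field representation (a dict with a `tag` and a list of `(code, value)` subfields) and both function names are assumptions for illustration, not Authify's data model; the phrase used in the usage note is likewise an invented example.

```python
def needs_detection(field):
    """Laziness rule: skip a name field that already carries a role in $e."""
    return not any(
        code == "e" and value.strip() for code, value in field["subfields"]
    )


def shingles(text, size=2):
    """Word shingles: every run of `size` consecutive lower-cased words."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


# A name field with a role already assigned, and one without.
with_role = {"tag": "700", "subfields": [("a", "Pastuszka, Stefan Józef."), ("e", "aut")]}
without_role = {"tag": "700", "subfields": [("a", "Pastuszka, Stefan Józef.")]}
```

With these definitions, `needs_detection(with_role)` is false and `needs_detection(without_role)` is true. Shingles such as those from `shingles("edited by Stefan Pastuszka")` are the kind of units that expert-written rules could be matched against.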

  11. Authify...uses Authify?
      =100 1 $aAdamczyk, Mieczysław.
      =245 10 $aKonstytucje polskie w rozwoju dziejowym, 1791-1982 /$cMieczysław Adamczyk, Stefan Pastuszka.
      =700 1 $aPastuszka, Stefan Józef.
      [Diagram: Authify /detect and Authify /names]
      (from 100$a) Adamczyk, Mieczyslaw / (from 245$c) Mieczyslaw, Adamczyk → “Same person here!”
      (from 700) Pastuszka, Stefan Josef. / (from 245) Stefan Pastuszka → “Same person here!”
      MUMBLE, MUMBLE
      Adamczyk, Mieczyslaw and Stefan Pastuszka are co-authors of “Konstytucje polskie w rozwoju dziejowym”
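The reconciliation shown on this slide, pairing the name headings in 100/700 with the names in the 245$c statement of responsibility, can be approximated locally with the Python sketch below. In Authify the comparison goes through its own name search; here a plain token comparison stands in for that lookup, and all function names are hypothetical. The letter “ł” has no Unicode decomposition, so it is folded with an explicit mapping.

```python
import re
import unicodedata

# "ł" does not decompose under NFKD, so it is mapped explicitly.
EXTRA_FOLDS = str.maketrans({"ł": "l", "Ł": "L"})


def fold(text):
    """Strip diacritics and lower-case: 'Mieczysław' becomes 'mieczyslaw'."""
    decomposed = unicodedata.normalize("NFKD", text.translate(EXTRA_FOLDS))
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def name_tokens(name):
    """Set of folded, letter-only tokens of a name."""
    return set(re.findall(r"[a-z]+", fold(name)))


def split_responsibility(statement):
    """Split a 245$c statement into candidate names on commas and semicolons."""
    parts = (p.strip(" .") for p in re.split(r"[,;]", statement))
    return [p for p in parts if p]


def reconcile(headings, statement):
    """Map each heading to the statement name whose tokens it fully contains."""
    pairs = {}
    for heading in headings:
        heading_tokens = name_tokens(heading)
        for candidate in split_responsibility(statement):
            candidate_tokens = name_tokens(candidate)
            if candidate_tokens and candidate_tokens <= heading_tokens:
                pairs[heading] = candidate
    return pairs
```

Applied to the record on the slide, `reconcile(["Adamczyk, Mieczysław.", "Pastuszka, Stefan Józef."], "Mieczysław Adamczyk, Stefan Pastuszka.")` pairs each heading with its form in the statement of responsibility, even though the 700 heading carries an extra forename.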

  12. RT Detection service: Swagger UI

  13. RT Detection service: Swagger UI

  14. Hey you! What are you doing in my records? Using @Cult Authify for a better understanding of your catalog. Thank you!
