Andrea Gazzarini, Software Architect at @Cult, presented how Authify improves catalog understanding at the 36th ADLUG Annual Conference.
Hey you! What are you doing in my records?
Using @Cult Authify for a better understanding of your catalog
Andrea Gazzarini, Software Architect, @Cult
36th ADLUG International Annual Conference, Fundación San Pablo Andalucía, 27 – 29 September 2017
Andrea Gazzarini
https://www.linkedin.com/in/andreagazzarini
https://twitter.com/agazzarini
http://www.atcult.it
http://andreagazzarini.blogspot.it
https://github.com/agazzarini
http://www.slideshare.net/AndreaGazzarini
http://people.apache.org/map.html?person=agazzarini
https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-essentials
What is Authify?
• Authify is a RESTful module that provides search and detection services. It is not a frontend tool; it is a service.
• The project was started to overcome some limitations of the public VIAF Web API.
• VIAF APIs, being part of a public and free-of-charge service, aren't meant to be invoked massively: for use cases that need that volume, the VIAF project provides a download of the whole dataset.
• On top of that, we weren't always happy with VIAF results: what we were looking for wasn't returned in the first position among the matches.
• The project we were working on had to manage millions of records: even assuming an optimistic average of 2 entities per record (one name and one title), that would result in millions of invocations.
• That was the main reason we started implementing Authify: download, index and store the VIAF clusters dataset and provide, on top of that, powerful full-text and bibliographic search services.
Authify: cluster search service
• The Authify cluster search service provides, as the name suggests, a full-text search service over name and work clusters.
• A cluster is a group of variant forms associated with a given entity. For example, a name cluster is a group that contains all available name headings for a given "name" entity (e.g. a person or a corporate body).
• Technically, Authify is composed of two main parts: a Solr infrastructure, which indexes the datasets and provides search services on top of them, and a logic layer, which orchestrates those search services in order to (try to) find a match, as precise as possible, within the clusters.
• The so-called invisible queries approach makes everything transparent to the caller: for a single search request, the system executes a chain of search strategies, each with a different priority; the first strategy that produces a result populates the returned response.
• What is the goal of all this? Starting from a bibliographic record, a system can detect name and work headings and then ask Authify: "can you tell me if a cluster exists for this heading?"
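The invisible queries idea described above can be sketched in a few lines. This is only an illustration, not Authify's code: the in-memory `CLUSTERS` dictionary stands in for the Solr index, and the strategy names and functions (`exact_match`, `substring_match`, `search`) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the Solr index: cluster id -> variant headings.
CLUSTERS: Dict[str, List[str]] = {
    "51714577": ["Meyer, Bertrand, 1950-....", "Bertrand Meyer", "Meyer, Bertrand"],
}

Strategy = Callable[[str], List[str]]


def exact_match(query: str) -> List[str]:
    """Return ids of clusters that contain the query as an exact heading."""
    return [cid for cid, headings in CLUSTERS.items() if query in headings]


def substring_match(query: str) -> List[str]:
    """Looser fallback: case-insensitive substring search over headings."""
    q = query.lower()
    return [cid for cid, headings in CLUSTERS.items()
            if any(q in h.lower() for h in headings)]


# Strategies in priority order (names are illustrative only).
CHAIN: List[Tuple[str, Strategy]] = [
    ("exact", exact_match),
    ("substring", substring_match),
]


def search(query: str) -> dict:
    """Run the chain; the first strategy with results fills the response."""
    for name, strategy in CHAIN:
        matches = strategy(query)
        if matches:
            return {"matching-strategy": name, "matches": matches}
    return {"matching-strategy": None, "matches": []}
```

The caller issues a single `search(...)` call and never sees which strategies were tried and failed; only the name of the winning strategy is reported back.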
Authify: Request / Response
Request
http://labs.atcult.it/authify/names?q=Bertrand Meyer
Response
"responseHeader" : {
  "matching-strategy" : "name::headings-exact-match"
},
"response" : {
  "matches" : [ {
    "id" : "51714577",
    "type" : "Personal",
    "uri" : "http://viaf.org/viaf/51714577/",
    "headings" : [ "Meyer, Bertrand, 1950-....", "Bertrand Meyer", "Meyer, Bertrand" ],
    "sources" : [ "BNF|12079479", "DNB|112127843", …
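A caller might build the request and read the response roughly as follows. This is a sketch under assumptions: the endpoint URL and the `q` parameter come from the slide, but the slide shows only a fragment of the response, so the enclosing JSON object and the `SAMPLE` payload (limited to the fields visible on the slide, with only the two visible sources) are assumed. The helper names are hypothetical and no network call is made.

```python
import json
from urllib.parse import quote, urlencode

BASE_URL = "http://labs.atcult.it/authify/names"  # endpoint shown on the slide


def build_request_url(heading: str) -> str:
    """Percent-encode the heading into the q parameter."""
    return BASE_URL + "?" + urlencode({"q": heading}, quote_via=quote)


# Assumed payload shape: the slide fragment wrapped in a JSON object.
# The sources list is truncated on the slide; only visible entries are kept.
SAMPLE = """
{
  "responseHeader": {"matching-strategy": "name::headings-exact-match"},
  "response": {
    "matches": [{
      "id": "51714577",
      "type": "Personal",
      "uri": "http://viaf.org/viaf/51714577/",
      "headings": ["Meyer, Bertrand, 1950-....", "Bertrand Meyer", "Meyer, Bertrand"],
      "sources": ["BNF|12079479", "DNB|112127843"]
    }]
  }
}
"""


def summarize(payload: dict) -> dict:
    """Extract the winning strategy and the matched cluster URIs."""
    return {
        "strategy": payload["responseHeader"]["matching-strategy"],
        "uris": [m["uri"] for m in payload["response"]["matches"]],
    }


summary = summarize(json.loads(SAMPLE))
```

Encoding the space as `%20` keeps the URL valid for any HTTP client; the `matching-strategy` header tells the caller how precise the match is likely to be.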
Clusters search service: why?
[Diagram: two records carry different forms of the same name. One record has 100 $aA.B. Normal$cMr. ("Mr. A.B. Normal"), the other has 100 $aABNormal$cMr. ("Mr. ABNormal"). AUTHIFY returns Cluster ID #992 for both.]
...so Mr. A.B. Normal and Mr. ABNormal seem to be the same person!
Authify: cluster search service
• Authify provides a different search logic for each managed entity. The system has been built with extensibility in mind, so the chain described below is fully configurable.
• Subfields matching (names only): the Authify query language allows the caller to decompose the input heading into subfields, which mirrors the structure of the heading in the source records.
• Heading exact match (names and works): the system gives priority to exact heading matches.
• Full-text search (names and works): a regular full-text search, which takes into account proximity search for names (e.g. Bertrand Meyer = Meyer Bertrand) and special detection for some entities (e.g. birth and death dates).
• Initials (names only): as a last resort, the system executes a search by "initials", in order to find a valid match when the input string (or the indexed heading) contains the name in its short form. As with the previous point, this could lead to a response with lower precision.
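Two of the looser steps in this chain, date detection and the search by initials, can be illustrated with plain string handling. The sketch below is an assumption about how such checks could look, not Authify's implementation (which runs on Solr); `extract_dates` and `matches_by_initials` are hypothetical names.

```python
import re
from typing import List, Optional, Tuple

# A four-digit year, a dash, then an optional second year or "....".
DATE_RE = re.compile(r"\b(\d{4})-(\d{4}|\.{4})?")


def extract_dates(heading: str) -> Tuple[Optional[int], Optional[int]]:
    """Pull (birth, death) years out of a heading such as 'Meyer, Bertrand, 1950-....'."""
    m = DATE_RE.search(heading)
    if not m:
        return (None, None)
    death = m.group(2)
    return (int(m.group(1)), int(death) if death and death.isdigit() else None)


def _tokens(name: str) -> List[str]:
    """Lowercased alphabetic tokens; digits and punctuation are dropped."""
    return re.findall(r"[^\W\d_]+", name.lower())


def matches_by_initials(a: str, b: str) -> bool:
    """Positional comparison where a one-letter token matches any token it starts."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or len(ta) != len(tb):
        return False
    for x, y in zip(ta, tb):
        short, long_ = (x, y) if len(x) <= len(y) else (y, x)
        if x != y and not (len(short) == 1 and long_.startswith(short)):
            return False
    return True
```

Because a single letter matches many names, this check accepts false positives, which is why it makes sense to keep it last in the chain.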
Authify: relator terms detection
• Goal: starting from a MARC record, Authify analyses the tags that contain a name and, for each of them, tries to figure out the role of that entity within the work represented by the record.
• Laziness: if the record / tag already provides a role for a given entity (e.g. $eaut), Authify just skips the entity, assuming that the role has been authoritatively assigned.
• No AI involved (not yet): the whole detection process is executed using plain text matching algorithms (full-text, token, shingle and n-gram analysis) which operate on top of a set of rules coming from our functional experts.
• Source data: the detection process uses the information found within the tag, specifically the name and the statement of responsibility.
• It uses Authify itself! Otherwise, how could we reconcile the different forms of the same name in the statement of responsibility?
• No NER (again): we will introduce it later as the last step in the processing chain.
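The "laziness" rule and the shingle analysis mentioned above can be sketched as follows. Everything here is illustrative: the record is modelled as a list of `(tag, subfields)` pairs, only tags 100 and 700 are treated as name tags, and the `$e` value on the 700 field is invented for the example.

```python
import re
from typing import Dict, List, Tuple

Field = Tuple[str, Dict[str, str]]

NAME_TAGS = {"100", "700"}  # assumption: only the name tags used in this sketch


def names_needing_detection(fields: List[Field]) -> List[str]:
    """Keep names that have no relator term ($e); those with a role are skipped."""
    return [subs["a"] for tag, subs in fields
            if tag in NAME_TAGS and "a" in subs and "e" not in subs]


def word_shingles(text: str, size: int = 2) -> List[str]:
    """Overlapping runs of `size` consecutive words, punctuation removed."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


record: List[Field] = [
    ("100", {"a": "Adamczyk, Mieczysław."}),
    ("245", {"c": "Mieczysław Adamczyk, Stefan Pastuszka."}),
    ("700", {"a": "Pastuszka, Stefan Józef.", "e": "aut"}),  # $e added for illustration
]

pending = names_needing_detection(record)
shingles = word_shingles(record[1][1]["c"])
```

Two-word shingles from the statement of responsibility give candidate name strings ("mieczysław adamczyk", "stefan pastuszka") that can then be compared against the headings still awaiting a role.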
Authify...uses Authify?
=100 1 $aAdamczyk, Mieczysław.
=245 10 $aKonstytucje polskie w rozwoju dziejowym, 1791-1982 /$cMieczysław Adamczyk, Stefan Pastuszka.
=700 1 $aPastuszka, Stefan Józef.
[Diagram, steps 1 to 7, between Authify /detect and Authify /names:]
• (from 100$a) Adamczyk, Mieczyslaw and (from 245$c) Mieczyslaw, Adamczyk: "Same person here!"
• (from 700) Pastuszka, Stefan Josef. and (from 245) Stefan Pastuszka: "Same person here!"
• MUMBLE, MUMBLE
• Adamczyk, Mieczyslaw and Stefan Pastuszka are co-authors of "Konstytucje polskie w rozwoju dziejowym"
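In the slide, the reconciliation of name forms is delegated to the Authify /names service. As a rough local approximation of what "same person here" means for this record, one can fold diacritics and compare token sets. This is a simplified stand-in, not the service's logic; `fold`, `name_tokens` and `same_person_candidate` are hypothetical helpers.

```python
import re
import unicodedata
from typing import Set

# "ł" has no Unicode decomposition, so NFKD alone would leave it untouched.
EXTRA_FOLDS = str.maketrans({"ł": "l", "Ł": "L"})


def fold(text: str) -> str:
    """Lowercase and strip diacritics (plus the explicit folds above)."""
    decomposed = unicodedata.normalize("NFKD", text.translate(EXTRA_FOLDS))
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def name_tokens(text: str) -> Set[str]:
    """Order-independent set of folded alphabetic tokens."""
    return set(re.findall(r"[a-z]+", fold(text)))


def same_person_candidate(heading: str, statement_form: str) -> bool:
    """True when one form's tokens are fully contained in the other's."""
    a, b = name_tokens(heading), name_tokens(statement_form)
    return bool(a) and bool(b) and (a <= b or b <= a)
```

With this check, "Adamczyk, Mieczysław." pairs with "Mieczysław Adamczyk" and "Pastuszka, Stefan Józef." pairs with "Stefan Pastuszka", while the two authors are not confused with each other. Subset matching is deliberately permissive, so a cluster lookup would still be needed to confirm identity.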
Thank you!