Ranking by Relevance and/or Similarity in HunCRIS

Ádám Tichy-Rács, BME OMIKK, head of HunCRIS unit

Contents • Introduction to HunCRIS • Building, storing, reusing and generating queries • Sets of found entities • Search methods • Boolean search • Relevance and similarity based ranking/sorting • Combination of ranking and Boolean filter • Structure of the Thesaurus in HunCRIS • Plans for the near future • Live demonstration of HunCRIS operation

About HunCRIS • HunCRIS is a CERIF/CRIS platform that integrates research information from different databases • Our tasks are • to support the research workflow • to support international cooperation • projects: CISTRANA, IST-World, Biodiversa, Aspera, etc. • to develop experimental solutions, within our autonomy, that improve usability • The next slides explain our achievements in the knowledge layer

Building query • Using free text or list elements • Two-level Boolean expressions • Storing and reusing query • Advantage: • Boolean query with multi level logical structure • Reusing complex queries • Comment: available only for registered users • Example: „Nanotechnology” – structured Boolean query with eighty elements

Sets of hits • Default setting: projects that meet the Boolean conditions • Options: OrgUnits or Persons in projects that meet the Boolean conditions • Example: OrgUnits in the projects of the „Budapest University of Technology and Economics” (BME) are departments of BME and their partners in research projects • Clicking an element in the hit list automatically initializes and executes a new Boolean query (dynamic hyperlink) • The new query is hidden, but available for editing

Searching structures • Boolean search using keywords • Projects that are linked to the selected keywords • Boolean search using Thesaurus • Projects that are characterized by the selected expressions or by their subordinates • Problem: the elements of hit list are sorted by non-essential properties • Alphabetic order, starting date, finishing date, financial support, etc

Example: binary thesaurus with three roots and four levels

„A” project

„B” project

Elements of query „Q”

„Q” query elements with their „subterms” along tree of Thesaurus

„A” and „Q” Result: „A” is selected by „Q”

„B” and „Q” Result: „B” is not selected by „Q”

„Search optimized” „O” project

„O” and „Q” Result: „O” will be selected by any „Q”
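The Boolean thesaurus search illustrated on the preceding slides can be sketched as follows. This is a minimal illustration, not the HunCRIS implementation: the toy thesaurus, the term names, and the helper names `subterms` and `selected` are all hypothetical, and OR semantics between query elements is an assumption.

```python
# Hypothetical toy thesaurus: child -> parent (single-parent tree).
PARENT = {
    "physics": "science",
    "optics": "physics",
    "lasers": "optics",
    "chemistry": "science",
    "history": "humanities",
}

def subterms(term, parent=PARENT):
    """Return the term together with all of its descendants in the tree."""
    result = {term}
    changed = True
    while changed:
        changed = False
        for child, par in parent.items():
            if par in result and child not in result:
                result.add(child)
                changed = True
    return result

def selected(project_terms, query_terms, parent=PARENT):
    """True if the project carries a query term or one of its subterms
    (OR semantics between query elements is assumed here)."""
    expanded = set()
    for term in query_terms:
        expanded |= subterms(term, parent)
    return bool(expanded & set(project_terms))

project_a = {"lasers"}    # lies below "physics"
project_b = {"history"}   # lies in another branch
query = {"physics"}

print(selected(project_a, query))  # True
print(selected(project_b, query))  # False
```

A project tagged with every term of the thesaurus would pass this test for any query, which is the „search optimized” case shown on the slide.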

Relevance based ranking (Warning! Mathematics!) • Projects should be ranked by the percentage of correlation between two sets of expressions • Set {A}: expressions linked directly to project „A” and their superterms up to the root • Set {Q}: expressions inserted in query „Q” and their superterms up to the root • n({X}) is the number of elements in set {X} • Definition of relevance: r(A,Q)=n({A}∩{Q})/[n({A})*n({Q})]^(1/2) • Property of relevance: 0≤r(A,Q)≤1
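The definition above can be sketched in a few lines. This is an illustration under stated assumptions: the toy thesaurus, its term names, and the function names `with_superterms` and `relevance` are hypothetical and not taken from HunCRIS. The formula itself is the one on the slide; it coincides with the cosine similarity of the two sets' indicator vectors.

```python
import math

# Hypothetical toy thesaurus: child -> parent; root terms have no entry.
PARENT = {
    "physics": "science",
    "optics": "physics",
    "lasers": "optics",
    "chemistry": "science",
}

def with_superterms(terms, parent=PARENT):
    """Close a set of terms upward: add every superterm up to the root."""
    closed = set()
    for term in terms:
        while term is not None and term not in closed:
            closed.add(term)
            term = parent.get(term)
    return closed

def relevance(project_terms, query_terms, parent=PARENT):
    """r(A,Q) = n({A} ∩ {Q}) / sqrt(n({A}) * n({Q})), on the closed sets."""
    a = with_superterms(project_terms, parent)
    q = with_superterms(query_terms, parent)
    if not a or not q:
        return 0.0
    return len(a & q) / math.sqrt(len(a) * len(q))

# {A} = {lasers, optics, physics, science}, {Q} = {chemistry, science}
print(relevance({"lasers"}, {"chemistry"}))  # 1/sqrt(4*2), about 0.354
```

Because both sets are closed upward, two terms in different branches of the same root still share that root, so the relevance is small but not zero.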

„A” project with all superterms in the Thesaurus n({A})=13

„B” project with all superterms in the Thesaurus n({B})=12

„C” project with all superterms in the Thesaurus n({C})=5

„Q” query with all superterms in the Thesaurus n({Q})=5

„Q” and „A” n({Q}∩{A})=2 r(A,Q)=2/(5*13)^(1/2)=24,81%

„Q” and „B” n({Q}∩{B})=2 r(B,Q)=2/(5*12)^(1/2)=25,82%

„Q” and „C” n({Q}∩{C})=2 r(C,Q)=2/(5*5)^(1/2)=40,00%

„Search optimized” „O” project with all superterms in the Thesaurus n({O})=45

„Q” and „O” n({Q}∩{O})=n({Q})=5 r(O,Q)=5/(5*45)^(1/2)=[n({Q})/n({O})]^(1/2)=33,33% The larger the Thesaurus, the smaller the relevance of „O”

Projects ranked by relevance to „Q”
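The ranking can be recomputed from the element counts quoted on the preceding slides. The sketch below only redoes that arithmetic; the names `COUNTS`, `N_Q` and `relevance_from_counts` are illustrative and not part of HunCRIS.

```python
import math

N_Q = 5  # n({Q}), as given on the slides

# project -> (n({P} ∩ {Q}), n({P})), as given on the slides
COUNTS = {"A": (2, 13), "B": (2, 12), "C": (2, 5), "O": (5, 45)}

def relevance_from_counts(n_common, n_project, n_query):
    """r = n_common / sqrt(n_project * n_query)."""
    return n_common / math.sqrt(n_project * n_query)

scores = {
    name: relevance_from_counts(common, n_project, N_Q)
    for name, (common, n_project) in COUNTS.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.2%}")
# C: 40.00%
# O: 33.33%
# B: 25.82%
# A: 24.81%
```

The „search optimized” project „O” contains the whole query, yet it is outranked by the small, focused project „C”, because n({O}) appears in the denominator.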

Ranking projects by similarity to each other • Query „Q” could be generated from the expressions describing project „D”: r(A,Q(D))=s(A,D) • Properties: 0≤s(A,D)≤1 s(A,D)=s(D,A) s(A,B)+s(B,C)≥s(A,B) s(A,A)=s(D,D)=1 • Consequences • if s(A,D)=0, then „A” is orthogonal to „D” • any query is a would-be project and vice versa
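A small sketch of the similarity measure, treating one project's expressions as the query for another. The toy thesaurus and the names `with_superterms` and `similarity` are hypothetical; only the formula comes from the slides.

```python
import math

# Hypothetical toy thesaurus: child -> parent; root terms have no entry.
PARENT = {
    "physics": "science",
    "optics": "physics",
    "chemistry": "science",
    "history": "humanities",
}

def with_superterms(terms, parent=PARENT):
    """Close a set of terms upward: add every superterm up to the root."""
    closed = set()
    for term in terms:
        while term is not None and term not in closed:
            closed.add(term)
            term = parent.get(term)
    return closed

def similarity(terms_a, terms_d, parent=PARENT):
    """s(A,D) = r(A,Q(D)): project D's expressions are used as the query."""
    a = with_superterms(terms_a, parent)
    d = with_superterms(terms_d, parent)
    if not a or not d:
        return 0.0
    return len(a & d) / math.sqrt(len(a) * len(d))

proj_a = {"optics"}     # closed set: optics, physics, science
proj_d = {"chemistry"}  # closed set: chemistry, science
proj_h = {"history"}    # closed set: history, humanities

print(similarity(proj_a, proj_d) == similarity(proj_d, proj_a))  # True (symmetry)
print(similarity(proj_a, proj_a))  # 1.0
print(similarity(proj_a, proj_h))  # 0.0 (no shared root: orthogonal)
```

In this sketch two projects are orthogonal exactly when their closed sets share no term, which for a tree means they sit under different roots.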

Similarity of „A” and „B” n({A} ∩{B})=6 s(A,B)=6/(13*12)^(1/2)=48,04%

Similarity of „A” and „C” n({A}∩{C})=2 s(A,C)=2/(13*5)^(1/2)=24,81%

Similarity of „B” and „C” n({B}∩{C})=1 s(B,C)=1/(12*5)^(1/2)=12,91%

Advantages of relevance based method • Symmetry in processing queries and projects • The list is sorted by essential property • Conceptual generalization of expressions is carried out automatically along the semantic structure of the Thesaurus • „Optimized” projects that are included in all hit lists are ranked low • No search optimization of projects will work • General query is relevant only to general research topics • Very specific query is not too relevant to general research topics

Combination of ranking and Boolean query • Only those projects that meet the conditions of the Boolean query are sorted by relevance • „B” is not included in the ranked hit list!
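One way to picture the combination is as a filter followed by a sort. The sketch below is an assumption-laden illustration: the thesaurus, the project names P1 to P3, and the helpers `is_under`, `matches`, `relevance` and `ranked_hits` are hypothetical, and OR semantics is assumed for the Boolean part.

```python
import math

# Hypothetical toy thesaurus: child -> parent; root terms have no entry.
PARENT = {"physics": "science", "optics": "physics", "chemistry": "science"}

def with_superterms(terms):
    """Close a set of terms upward: add every superterm up to the root."""
    closed = set()
    for term in terms:
        while term is not None and term not in closed:
            closed.add(term)
            term = PARENT.get(term)
    return closed

def is_under(term, ancestor):
    """True if `term` equals `ancestor` or lies below it in the tree."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False

def matches(project_terms, query_terms):
    """Boolean filter: some project term is a query term or one of its subterms."""
    return any(is_under(t, q) for t in project_terms for q in query_terms)

def relevance(project_terms, query_terms):
    a, q = with_superterms(project_terms), with_superterms(query_terms)
    return len(a & q) / math.sqrt(len(a) * len(q)) if a and q else 0.0

def ranked_hits(projects, query_terms):
    """Keep only the projects passing the Boolean filter, then sort by relevance."""
    hits = [(name, relevance(terms, query_terms))
            for name, terms in projects.items()
            if matches(terms, query_terms)]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

projects = {"P1": {"optics"}, "P2": {"physics"}, "P3": {"chemistry"}}
for name, score in ranked_hits(projects, {"physics"}):
    print(name, f"{score:.2%}")
# P2 100.00%
# P1 81.65%
```

P3 has a nonzero relevance (50%) to the query through the shared root, but it fails the Boolean condition and so never reaches the ranked list, mirroring the fate of „B” on the slide.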

Thesaurus with more than one superterm • Classification systems allow only a single-parent structure • To keep the card index linear • Cyclic reference is not allowed • Some topics should be found from multiple directions • Use index terms to redirect the user • The structure can be crawled by computer from an arbitrary starting point • Terms shown in the diagram: Humanities, Science, History, Physics, Science history, History of physics

Thesaurus of HunCRIS • Consists of • Nearly 19,600 elements, and still growing • Six top level expressions • Inhomogeneous in depth • 5-10 levels • n-to-m type relation of elements

Modified Thesaurus with more than one parent

New items in {B}! „B” project with all parents along the tree of the modified Thesaurus n({B})=14
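When a term may have several parents, the upward closure becomes a graph traversal instead of a single walk to the root. The sketch below shows how the closed set grows after an extra parent link is added. The parent links between the six example terms are an assumption made for illustration (the original diagram is not reproduced here), and `with_superterms` is a hypothetical helper name.

```python
# Hypothetical single-parent version: term -> list with one parent.
SINGLE_PARENT = {
    "History": ["Humanities"],
    "Physics": ["Science"],
    "Science history": ["History"],
    "History of physics": ["Science history"],
}

# Hypothetical modified version: some terms gain a second parent.
MULTI_PARENT = {
    "History": ["Humanities"],
    "Physics": ["Science"],
    "Science history": ["History", "Science"],
    "History of physics": ["Science history", "Physics"],
}

def with_superterms(terms, parents):
    """Collect the terms and every ancestor reachable via any parent link.
    The visited check also stops the walk if the data contains a cycle."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in closed:
            continue
        closed.add(term)
        stack.extend(parents.get(term, []))
    return closed

before = with_superterms({"History of physics"}, SINGLE_PARENT)
after = with_superterms({"History of physics"}, MULTI_PARENT)
print(len(before), len(after))     # 4 6
print(sorted(after - before))      # ['Physics', 'Science']
```

The extra parents add new superterms to the project's set, which is the same effect as the growth of n({B}) on the slide, and it changes every relevance and similarity value that involves the project.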

Similarity map of projects in HunCRIS I.

Similarity map of projects in HunCRIS II.

Plans for the near future • Tests of ranking method • Accuracy • Comparison of calculated and subjective similarities • Comparison of calculations using different thesauri • Scalability • Implementation of the ranking method into other, larger information systems • Library catalogues • Adding visualization tools to HunCRIS • Participation in international (euroCRIS led) projects

Live demonstration Compatibility with • MS Internet Explorer • Mozilla Firefox • Google Chrome • Opera • Apple Safari http://www.info.omikk.bme.hu https://nkr.info.omikk.bme.hu
