
Ranking by Relevance and/or Similarity in HunCRIS

Ranking by Relevance and/or Similarity in HunCRIS. Ádám Tichy-Rács BME OMIKK head of HunCRIS unit. Contents. Introduction to HunCRIS Building, storing, reusing and generating query Sets of found entities Search methods Boolean search relevance and similarity based ranking/sorting

Presentation Transcript


  1. Ranking by Relevance and/or Similarity in HunCRIS Ádám Tichy-Rács, BME OMIKK, head of HunCRIS unit

  2. Contents • Introduction to HunCRIS • Building, storing, reusing and generating queries • Sets of found entities • Search methods • Boolean search • Relevance and similarity based ranking/sorting • Combination of ranking and Boolean filter • Structure of the Thesaurus in HunCRIS • Plans for the near future • Live demonstration of HunCRIS operation

  3. About HunCRIS • HunCRIS is a CERIF/CRIS platform that integrates research information from different databases • Our tasks are • to support the research workflow • to support international cooperation • projects: CISTRANA, IST-World, Biodiversa, Aspera, etc. • to develop experimental solutions, within our own remit, that improve usability • The next slides explain our achievements in the knowledge layer

  4. Building queries • Using free text or list elements • Two-level Boolean expressions • Storing and reusing queries • Advantages: • Boolean queries with a multi-level logical structure • Reuse of complex queries • Comment: available only to registered users • Example: „Nanotechnology”, a structured Boolean query with eighty elements

  5. Sets of hits • Default setting: projects that meet the Boolean conditions • Options: OrgUnits or Persons in projects that meet the Boolean conditions • Example: the OrgUnits in the projects of the „Budapest University of Technology and Economics” (BME) are the departments of BME and their partners in research projects • Clicking an element in the hit list automatically builds and executes a new Boolean query (dynamic hyperlink) • The new query is hidden, but available for editing

  6. Searching structures • Boolean search using keywords • Finds projects that are linked to the selected keywords • Boolean search using the Thesaurus • Finds projects that are characterized by the selected expressions or by their subordinate terms • Problem: the elements of the hit list are sorted by non-essential properties • Alphabetical order, starting date, finishing date, financial support, etc.

  7. Example: binary thesaurus with three roots and four levels

  8. „A” project

  9. „B” project

  10. Elements of query „Q”

  11. „Q” query elements with their „subterms” along tree of Thesaurus

  12. „A” and „Q” Result: „A” is selected by „Q”

  13. „B” and „Q” Result: „B” is not selected by „Q”

  14. „Search optimized” „O” project

  15. „O” and „Q” Result: „O” will be selected by any „Q”
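
The selection logic of slides 7 to 15 can be sketched in Python. This is a minimal sketch, not HunCRIS code: the term names (R1, R1.0, ...), the helper names, the sample projects, and the rule that a single matching term is enough (an OR-style match) are all assumptions made for illustration.

```python
def build_binary_thesaurus(roots=3, levels=4):
    """Return {term: parent} for a forest of complete binary trees.

    Term names such as "R1", "R1.0", "R1.0.1" are hypothetical.
    """
    parent = {}
    for r in range(1, roots + 1):
        root = f"R{r}"
        parent[root] = None
        frontier = [root]
        for _ in range(levels - 1):
            next_frontier = []
            for term in frontier:
                for bit in "01":
                    child = f"{term}.{bit}"
                    parent[child] = term
                    next_frontier.append(child)
            frontier = next_frontier
    return parent


def is_under(term, ancestor, parent):
    """True if `term` equals `ancestor` or lies below it in the tree."""
    while term is not None:
        if term == ancestor:
            return True
        term = parent[term]
    return False


def boolean_select(project_terms, query_terms, parent):
    """Assumed OR semantics: one project term under one query term suffices."""
    return any(is_under(t, q, parent)
               for t in project_terms for q in query_terms)


parent = build_binary_thesaurus()          # 3 roots * (1 + 2 + 4 + 8) terms
leaves = {t for t in parent if t.count(".") == 3}

query_q = {"R1.0", "R2.1.1"}               # hypothetical query
project_a = {"R1.0.1.0", "R3.1"}           # has a term below R1.0
project_b = {"R1.1.0", "R3"}               # no term below any query term
project_o = leaves                         # "search optimized": linked to every leaf

selected_a = boolean_select(project_a, query_q, parent)
selected_b = boolean_select(project_b, query_q, parent)
selected_o = boolean_select(project_o, query_q, parent)
```

Because every term of the tree has at least one leaf below it, a project linked to all leaves matches any query under this rule, which is the weakness of pure Boolean search that slide 15 points out.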

  16. Relevance based ranking (Warning! Mathematics!) • Projects should be ranked by the degree of correlation between two sets of expressions • Set {A}: expressions linked directly to project „A” and their superterms up to the root • Set {Q}: expressions inserted in query „Q” and their superterms up to the root • n({X}) is the number of elements in set {X} • Definition of relevance: r(A,Q)=n({A}∩{Q})/[n({A})*n({Q})]^(1/2) • Property of relevance: 0≤r(A,Q)≤1
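
The definition on slide 16 translates almost directly into code. The sketch below assumes a single-parent thesaurus stored as a {term: parent} dictionary; the function names and the toy terms (Optics and Chemistry in particular) are hypothetical and not taken from HunCRIS.

```python
from math import sqrt


def with_superterms(terms, parent):
    """Return the given terms plus all of their superterms up to the root."""
    closed = set()
    for term in terms:
        while term is not None and term not in closed:
            closed.add(term)
            term = parent[term]
    return closed


def relevance(project_terms, query_terms, parent):
    """r(A,Q) = n({A} ∩ {Q}) / sqrt(n({A}) * n({Q}))."""
    a = with_superterms(project_terms, parent)
    q = with_superterms(query_terms, parent)
    if not a or not q:
        return 0.0  # assumption: an empty set is relevant to nothing
    return len(a & q) / sqrt(len(a) * len(q))


# Toy single-parent thesaurus (hypothetical).
parent = {
    "Science": None, "Physics": "Science", "Optics": "Physics",
    "Chemistry": "Science",
    "Humanities": None, "History": "Humanities",
}

r_related = relevance({"Optics"}, {"Chemistry"}, parent)   # share only "Science"
r_same = relevance({"Optics"}, {"Optics"}, parent)         # identical sets
r_disjoint = relevance({"History"}, {"Optics"}, parent)    # different roots
```

Here {A} = {Optics, Physics, Science} and {Q} = {Chemistry, Science}, so r = 1/(3*2)^(1/2); the two terms are related only through their common superterm, which is how the superterm closure carries out the conceptual generalization.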

  17. „A” project with all superterms in the Thesaurus n({A})=13

  18. „B” project with all superterms in the Thesaurus n({B})=12

  19. „C” project with all superterms in the Thesaurus n({C})=5

  20. „Q” query with all superterms in the Thesaurus n({Q})=5

  21. „Q” and „A” n({Q}∩{A})=2 r(A,Q)=2/(5*13)^(1/2)=24.81%

  22. „Q” and „B” n({Q}∩{B})=2 r(B,Q)=2/(5*12)^(1/2)=25.82%

  23. „Q” and „C” n({Q}∩{C})=2 r(C,Q)=2/(5*5)^(1/2)=40.00%

  24. „Search optimized” „O” project with all superterms in the Thesaurus n({O})=45

  25. „Q” and „O” n({Q}∩{O})=n({Q})=5 r(O,Q)=5/(5*45)^(1/2)=[n({Q})/n({O})]^(1/2)=33.33% The larger the Thesaurus, the smaller the relevance of „O”

  26. Projects ranked by relevance to „Q”
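
The figures on slides 17 to 25 can be re-derived from the set sizes alone. The short Python check below uses only the counts stated on those slides; the function and variable names are hypothetical, and the resulting order is simply what those numbers imply, since the ranking figure itself is not part of this transcript.

```python
from math import sqrt


def relevance_from_counts(n_common, n_project, n_query):
    """r = n({A} ∩ {Q}) / sqrt(n({A}) * n({Q})), computed from set sizes."""
    return n_common / sqrt(n_project * n_query)


N_Q = 5  # n({Q}) from slide 20

# project: (n({Q} ∩ {X}), n({X})) as stated on slides 17-25
counts = {"A": (2, 13), "B": (2, 12), "C": (2, 5), "O": (5, 45)}

scores = {name: relevance_from_counts(common, size, N_Q)
          for name, (common, size) in counts.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.2%}")
```

The small, focused project „C” comes first, and the „search optimized” project „O” is penalized by its own size even though it contains every query term.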

  27. Ranking projects by similarity to each other • Query „Q” can be generated from the expressions describing project „D”: s(A,D)=r(A,Q(D)) • Properties: 0≤s(A,D)≤1; s(A,D)=s(D,A); s(A,B)+s(B,C)≥s(A,B); s(A,A)=s(D,D)=1 • Consequences • if s(A,D)=0, then „A” is orthogonal to „D” • any query is a would-be project and vice versa
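
The word „orthogonal” on slide 27 can be read literally: if each set of expressions is written as a 0/1 vector over the thesaurus terms, s(A,D) is the cosine of the angle between the two vectors. The Python sketch below checks this on hypothetical term sets that are assumed to already include their superterms; all names in it are illustrative.

```python
from math import sqrt


def similarity(a, d):
    """s(A,D) = n({A} ∩ {D}) / sqrt(n({A}) * n({D})) on closed term sets."""
    return len(a & d) / sqrt(len(a) * len(d))


def cosine(a, d, vocabulary):
    """Cosine of the 0/1 indicator vectors of two term sets."""
    va = [1 if t in a else 0 for t in vocabulary]
    vd = [1 if t in d else 0 for t in vocabulary]
    dot = sum(x * y for x, y in zip(va, vd))
    return dot / (sqrt(sum(x * x for x in va)) * sqrt(sum(y * y for y in vd)))


# Hypothetical projects, each given with its superterms already added.
proj_a = {"Science", "Physics", "Optics"}
proj_d = {"Science", "Chemistry"}
proj_e = {"Humanities", "History"}
vocabulary = sorted(proj_a | proj_d | proj_e)

s_ad = similarity(proj_a, proj_d)
s_da = similarity(proj_d, proj_a)
s_aa = similarity(proj_a, proj_a)
s_ae = similarity(proj_a, proj_e)
```

Symmetry, s(A,A)=1 and the range from 0 to 1 all follow from this reading, and two projects with no common term, not even a shared root, get s=0.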

  28. Similarity of „A” and „B” n({A}∩{B})=6 s(A,B)=6/(13*12)^(1/2)=48.04%

  29. Similarity of „A” and „C” n({A}∩{C})=5 s(A,C)=2/(13*5)^(1/2)=24.81%

  30. Similarity of „B” and „C” n({B}∩{C})=1 s(B,C)=1/(12*5)^(1/2)=12.91%

  31. Advantages of relevance based method • Symmetry in processing queries and projects • The list is sorted by essential property • Conceptual generalization of expressions is carried out automatically along the semantic structure of the Thesaurus • „Optimized” projects that are included in all hit lists are ranked low • No search optimization of projects will work • General query is relevant only to general research topics • Very specific query is not too relevant to general research topics

  32. Combination of ranking and Boolean query • Only those projects that meet the conditions of the Boolean query are sorted by relevance • „B” is not included in the ranked hit list!
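
A filter-then-rank pipeline is one straightforward way to combine the two steps. The Python sketch below reuses the Boolean outcomes of slides 12 to 15 and the relevance values of slides 21 to 25 for projects „A”, „B” and „O”; the function name and the dictionary layout are assumptions, not the HunCRIS implementation.

```python
def ranked_hits(projects, boolean_match, relevance):
    """Keep the projects that pass the Boolean filter, then sort by relevance."""
    hits = [p for p in projects if boolean_match(p)]
    return sorted(hits, key=relevance, reverse=True)


# Boolean results as stated on slides 12, 13 and 15.
selected = {"A": True, "B": False, "O": True}
# Relevance to "Q" as stated on slides 21, 22 and 25.
score = {"A": 0.2481, "B": 0.2582, "O": 0.3333}

hit_list = ranked_hits(["A", "B", "O"], selected.get, score.get)
print(hit_list)
```

Although „B” is slightly more relevant to „Q” than „A”, it never reaches the sorting step because the Boolean query does not select it.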

  33. Thesaurus with more than one superterm per term • Classification systems allow only a single-parent structure • To keep the card index linear • Cyclic references are not allowed • Some topics should be reachable from multiple directions • Use index terms to redirect the user • The structure can be crawled by computer from an arbitrary starting point • Example terms: Humanities, Science, History, Physics, Science history, History of physics
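
With more than one parent per term, collecting superterms becomes a graph traversal rather than a walk along a single path. The Python sketch below uses the six example terms of slide 33; how they are linked to each other is an assumption, because the diagram itself is not part of this transcript, and the function name is hypothetical.

```python
def with_superterms(terms, parents):
    """Return the terms plus every ancestor reachable through any parent."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in closed:
            continue  # already visited; also guarantees termination
        closed.add(term)
        stack.extend(parents.get(term, []))
    return closed


# Assumed arrangement of the slide's example terms: {term: list of parents}.
parents = {
    "Humanities": [],
    "Science": [],
    "History": ["Humanities"],
    "Physics": ["Science"],
    "Science history": ["History", "Science"],
    "History of physics": ["Science history", "Physics"],
}

closure = with_superterms({"History of physics"}, parents)
print(sorted(closure))
```

In this arrangement a project tagged only with „History of physics” picks up all six terms, so it can be found from both the Humanities and the Science side, at the price of a larger set {A}.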

  34. Thesaurus of HunCRIS • Consists of • Nearly 19,600 elements, and still growing • Six top-level expressions • Non-uniform depth: 5-10 levels • n-to-m relations between elements

  35. Modified Thesaurus with more than one parent

  36. New items in {B} ! „B” project with all parents along tree of modified Thesaurus n({B})=14

  37. Similarity map of projects in HunCRIS I.

  38. Similarity map of projects in HunCRIS II.

  39. Plans for the near future • Tests of ranking method • Accuracy • Comparison of calculated and subjective similarities • Comparison of calculations using different thesauri • Scalability • Implementation of the ranking method into other, larger information systems • Library catalogues • Adding visualization tools to HunCRIS • Participation in international (euroCRIS led) projects

  40. Live demonstration Compatibility with • MS Internet Explorer • Mozilla Firefox • Google Chrome • Opera • Apple Safari http://www.info.omikk.bme.hu https://nkr.info.omikk.bme.hu
