
Vivien Petras UC Berkeley School of Information

Translating Dialects in Search: Mapping between Specialized Languages of Discourse and Documentary Languages. Vivien Petras UC Berkeley School of Information. Overcoming the Language Problem in Search.


Presentation Transcript


  1. Translating Dialects in Search: Mapping between Specialized Languages of Discourse and Documentary Languages Vivien Petras UC Berkeley School of Information

  2. Overcoming the Language Problem in Search How can someone searching for violins be made aware that there are also fiddles (and vice versa)?

  3. Outline • The Language Problem in Information Retrieval • Dialects & Contexts • The Search Term Recommender • 4 Research Questions • Exploratory Web Interface

  4. IR = Language Mapping Exercise [Diagram labels: Author, Text, Document, Concept Space; Searcher, Question, Search Statement, Concept Space; Match!] • Mapping between searcher and IR system • Mapping between author and IR system • Mapping between search statement and document

  5. IR = Language Mapping Exercise [Diagram labels: Searcher, Concept Space, Question, Search Statement, Document, Information Retrieval; Match!] • A search statement needs to describe both: • the searcher’s question (information need) • the documents that are relevant to that question

  6. The Language Problem • In semiotics: unlimited semiosis • In information science: inter-indexer inconsistency

  7. Dialects and Contexts How to alleviate language ambiguity for search term selection?

  8. Dialects and Contexts How to alleviate language ambiguity for search term selection? Wittgenstein Philosophy of language: Language is disambiguated within contexts and specialized dialects.

  9. Dialects and Contexts How to alleviate language ambiguity for search term selection? Wittgenstein Philosophy of language: Language is disambiguated within contexts and specialized dialects. Support search term selection:

  10. Dialects and Contexts How to alleviate language ambiguity for search term selection? Wittgenstein Philosophy of language: Language is disambiguated within contexts and specialized dialects. Support search term selection: • Within the dialect of a specialized community

  11. Dialects and Contexts How to alleviate language ambiguity for search term selection? Wittgenstein Philosophy of language: Language is disambiguated within contexts and specialized dialects. Support search term selection: • Within the dialect of a specialized community • In context

  12. Dialects and Contexts How to alleviate language ambiguity for search term selection? Wittgenstein Philosophy of language: Language is disambiguated within contexts and specialized dialects. Support search term selection: • Within the dialect of a specialized community • In context • Using the language of documents (for term matching)

  13. Search Term Recommender [Diagram labels: Search Statement, “Did you mean…”, Specialty, Specialty Term, Information Collection]

  14. Search Term Recommender

  15. The Search Term Recommender: Applications • Term selection support (query expansion & reformulation) • Automatic classification • Terminology mapping
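
The slides describe what the Search Term Recommender does but not how it computes its suggestions. The following is a minimal sketch of one possible approach, assuming that suggestions come from co-occurrence counts between words in a document's text and its assigned subject descriptors, kept separately per specialty. The function names, the scoring, and the toy records are all hypothetical illustrations, not the system presented here.

```python
from collections import Counter, defaultdict


def train(records):
    """Build per-specialty co-occurrence counts.

    records: iterable of (specialty, text_words, descriptors) tuples.
    Returns model[specialty][word] -> Counter of descriptors seen with word.
    """
    model = defaultdict(lambda: defaultdict(Counter))
    for specialty, words, descriptors in records:
        for word in set(words):  # count each word once per document
            model[specialty][word].update(descriptors)
    return model


def suggest(model, specialty, query_words, k=3):
    """Return up to k descriptors that co-occur most with the query words."""
    scores = Counter()
    for word in query_words:
        scores.update(model[specialty].get(word, Counter()))
    return [term for term, _ in scores.most_common(k)]


# Toy training data (invented for illustration only).
records = [
    ("classical", ["violin", "bow"], ["Violins", "String instruments"]),
    ("classical", ["violin", "concerto"], ["Violins", "Concertos"]),
    ("folk", ["fiddle", "bow"], ["Fiddles", "Folk music"]),
]
model = train(records)
classical_terms = suggest(model, "classical", ["violin"])
folk_terms = suggest(model, "folk", ["bow"], k=5)
```

In this sketch the same query word ("bow") leads to different suggestions depending on the specialty chosen, which is the behaviour the "Did you mean…" diagram points at.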

  16. The Search Term Recommender - Questions • How can specialties & specialty dialects be identified in an information collection?

  17. The Search Term Recommender - Questions • How can specialties & specialty dialects be identified in an information collection? • Do specialty dialects really differ?

  18. The Search Term Recommender - Questions • How can specialties & specialty dialects be identified in an information collection? • Do specialty dialects really differ? • Is performance improved when focusing on specialty dialects?

  19. The Search Term Recommender - Questions • How can specialties & specialty dialects be identified in an information collection? • Do specialty dialects really differ? • Is performance improved when focusing on specialty dialects? • How specific should specialties be?

  20. The Search Term Recommender - Questions • How can specialties & specialty dialects be identified in an information collection? • Do specialty dialects really differ? • Is performance improved when focusing on specialty dialects? • How specific should specialties be? • Tested on 2 bibliographic databases: • Inspec • Medline (Ohsumed collection)

  21. Inspec • Physics, Electrical and Electronic Engineering, Computers and Control • Document: author, title, source, publication year, abstract, Inspec thesaurus descriptors, Inspec classification codes • Test collection:

  22. Medline Ohsumed Collection • Biomedicine and Health • Document: author, title, source, publication year, publication type, abstract, MeSH headings • Test collection:

  23. The Search Term Recommender System - Questions • How can specialties be identified in an information collection? • Do specialty dialects really differ? • Is performance improved when focusing on specialty dialects? • How specific should specialties be? • Tested on 2 bibliographic collections: • Inspec • Medline (Ohsumed collection)

  24. Determine specialty documents in the collection: • Domain terminology

  25. Determine specialty documents in the collection: • Domain terminology • Publication source

  26. Determine specialty documents in the collection: • Domain terminology • Publication source • Bibliometric analysis

  27. Determine specialty documents in the collection: • Domain terminology • Publication source • Bibliometric analysis • Social network analysis

  28. Determine specialty documents in the collection: • Domain terminology • Publication source • Bibliometric analysis • Social network analysis • Subject-specific classification

  29. Identification of Specialties in an Information Collection Inspec test collection • by top-level categories in the Inspec classification • 3 specialties: Physics, Electrical & Electronic Engineering, Computers & Control

  30. Identification of Specialties in an Information Collection Inspec test collection • by top-level categories in the Inspec classification • 3 specialties: Physics, Electrical & Electronic Engineering, Computers & Control Ohsumed test collection • by journals grouped by subject • 33 specialties
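
The two grouping rules above can be sketched as follows. This is an illustration under stated assumptions: it assumes the top-level Inspec category can be read from the first letter of a classification code (A, B, C for the three specialties named on the slide), and that a journal-to-subject lookup table is available for Ohsumed. The record layout, the example codes, and the single journal mapping (taken from the orthopedics example later in the talk) are hypothetical.

```python
from collections import defaultdict

# Assumed mapping from the leading letter of an Inspec code to a specialty.
INSPEC_TOP_LEVEL = {
    "A": "Physics",
    "B": "Electrical & Electronic Engineering",
    "C": "Computers & Control",
}

# Hypothetical journal-to-subject table; the real grouping has 33 subjects.
JOURNAL_SUBJECT = {
    "Clinical Orthopaedics & Related Research": "Orthopedics",
}


def inspec_specialties(codes):
    """Specialties implied by a document's classification codes."""
    return {INSPEC_TOP_LEVEL[c[0]] for c in codes if c and c[0] in INSPEC_TOP_LEVEL}


def ohsumed_specialties(journal):
    """Specialty implied by the journal a document appeared in (if known)."""
    subject = JOURNAL_SUBJECT.get(journal)
    return {subject} if subject else set()


def group_by_specialty(docs, specialties_of):
    """Map each specialty to the ids of the documents assigned to it."""
    groups = defaultdict(list)
    for doc in docs:
        for specialty in specialties_of(doc):
            groups[specialty].append(doc["id"])
    return dict(groups)


# Invented records; the codes are placeholders, not real Inspec entries.
inspec_docs = [
    {"id": 1, "codes": ["A0001", "C0002"]},
    {"id": 2, "codes": ["B0003"]},
]
inspec_groups = group_by_specialty(
    inspec_docs, lambda d: inspec_specialties(d["codes"])
)
```

A document carrying codes from two top-level categories lands in both specialties here; whether the original study handled such documents this way is not stated in the slides.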

  31. The Search Term Recommender System - Questions • How can specialties be identified in an information collection? • Do specialty dialects really differ? • Is performance improved when focusing on specialty dialects? • How specific should specialties be? • Tested on 2 bibliographic collections: • Inspec • Medline (Ohsumed collection)

  32. Differences in Language • Differences in specialty dialects (specialty term overlap)

  33. Differences in Language • Differences in specialty dialects (specialty term overlap) • Differences in documentary languages (subject metadata term overlap)

  34. Differences in Language • Differences in specialty dialects (specialty term overlap) • Differences in documentary languages (subject metadata term overlap) • Differences in search term recommender suggestions (term suggestion overlap)

  35. Inspec Dialects (specialty term overlap) • Terms analyzed: 60,601 • Subject metadata term overlap: 87% • Suggested term overlap: 30%

  36. Ohsumed Dialects (specialty term overlap) • Terms analyzed: 11,663 • Subject metadata term overlap: 32% • Suggested term overlap: 30%
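
The slides report overlap percentages without defining the measure. As a hedged illustration, the sketch below treats overlap as the share of distinct terms two vocabularies have in common (intersection over union); this definition, the function name, and the toy vocabularies are assumptions, and the study may have computed overlap differently.

```python
def term_overlap(terms_a, terms_b):
    """Fraction of distinct terms shared by two vocabularies (0.0 to 1.0)."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy vocabularies (invented): two terms shared out of six distinct terms.
physics_terms = ["field", "energy", "particle", "model"]
computing_terms = ["model", "network", "field", "software"]
share = term_overlap(physics_terms, computing_terms)
```

Under this definition a low value means the two specialties mostly use different terms, which is the sense in which the slides ask whether dialects "really differ".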

  37. The Search Term Recommender System - Questions • How can specialties be identified in an information collection? • Do specialty dialects really differ? • Is performance improved when focusing on specialty dialects? • How specific should specialties be? • Tested on 2 bibliographic collections: • Inspec • Medline (Ohsumed collection)

  38. Automatic classification • Suggest subject metadata for documents • Comparison: specialty vs. general term suggestions

  39. Automatic Classification Title: “A search for clusters of protostars in Orion cloud cores”

  40. Automatic Classification Title: “A search for clusters of protostars in Orion cloud cores” Evaluation

  41. Performance of the STR (Search Term Recommender): Inspec • First 3 suggested: recall 13.6%, precision 11.2% • Test documents: 42,735 • Specialties: 3

  42. Performance of the STR (Search Term Recommender): Ohsumed • First 3 suggested: recall 26%, precision 25.6% • Test documents: 18,733 • Specialties: 33
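
For readers unfamiliar with the evaluation, the figures above are recall and precision over the first three suggestions. A minimal sketch of that calculation for a single document follows, assuming the suggestions are compared by exact match against the subject terms an indexer assigned; the function name and the example terms are invented, and how the study averaged across documents is not stated in the slides.

```python
def precision_recall_at_k(suggested, assigned, k=3):
    """Precision and recall of the first k suggestions for one document.

    suggested: ranked list of distinct suggested subject terms.
    assigned:  set of subject terms actually assigned to the document.
    """
    top = suggested[:k]
    hits = sum(1 for term in top if term in assigned)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(assigned) if assigned else 0.0
    return precision, recall


# Invented example: 2 of the first 3 suggestions match 4 assigned terms.
suggested = ["Star clusters", "Protostars", "Galaxies", "Nebulae"]
assigned = {"Protostars", "Star clusters", "Star formation", "Molecular clouds"}
precision, recall = precision_recall_at_k(suggested, assigned)
```

Here precision is 2/3 (two of three suggestions are correct) and recall is 2/4 (two of four assigned terms are found).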

  43. The Search Term Recommender System - Questions • How can specialties be identified in an information collection? • Do specialty dialects really differ? • Is performance improved when focusing on specialty dialects? • How specific should specialties be? • Tested on 2 bibliographic collections: • Inspec • Medline (Ohsumed collection)

  44. Specificity of Specialties • Language differences • Representative sample of specialty language for training

  45. Specificity of Specialties - Inspec Identifying subspecialties by classification hierarchy • e.g. Computers & Control -- Computer Hardware -- Circuits & Devices

  46. Specificity of Specialties - Inspec Identifying subspecialties by classification hierarchy • e.g. Computers & Control -- Computer Hardware -- Circuits & Devices • Test documents: 2,425 • Specialties: 3

  47. Specificity of Specialties - Ohsumed Identifying subspecialties by journal within subject • e.g. Orthopedics -- Clinical Orthopaedics & Related Research journal

  48. Specificity of Specialties - Ohsumed Identifying subspecialties by journal within subject • e.g. Orthopedics -- Clinical Orthopaedics & Related Research journal • Test documents: 745 • Specialties: 3

  49. Exploratory Web Interfaces • Inspec: http://metadata.sims.berkeley.edu/str/inspec/inspec.html • Ohsumed: http://metadata.sims.berkeley.edu/str/ohsumed/ohsumed.html

  50. Summary • How can specialties be identified in an information collection? • Inspec: subject-specific classification • Ohsumed: journal specialty area
