Translating Dialects in Search: Mapping between Specialized Languages of Discourse and Documentary Languages. Vivien Petras UC Berkeley School of Information. Overcoming the Language Problem in Search.
Translating Dialects in Search: Mapping between Specialized Languages of Discourse and Documentary Languages Vivien Petras UC Berkeley School of Information
Overcoming the Language Problem in Search How can someone searching for violins be made aware that there are also fiddles (and vice versa)?
Outline • The Language Problem in Information Retrieval • Dialects & Contexts • The Search Term Recommender • 4 Research Questions • Exploratory Web Interface
IR = Language Mapping Exercise [Diagram: Searcher, Concept Space, Question, Search Statement, Match!, Document, Text, Concept Space, Author] • Mapping between searcher and IR system • Mapping between author and IR system • Mapping between search statement and document
IR = Language Mapping Exercise [Diagram: Searcher, Concept Space, Question, Search Statement, Match!, Document] Information Retrieval: a search statement needs to describe • the searcher’s question (information need) • the documents that are relevant to the searcher’s question
The Language Problem • In semiotics: unlimited semiosis • In information science: inter-indexer inconsistency
Dialects and Contexts How to alleviate language ambiguity for search term selection? Wittgenstein's philosophy of language: language is disambiguated within contexts and specialized dialects. Support search term selection: • Within the dialect of a specialized community • In context • Using the language of documents (for term matching)
Search Term Recommender [Diagram: Search Statement, "Did you mean…", Specialty Terms, Specialties, Information Collection]
The Search Term Recommender: Applications • Term selection support (query expansion & reformulation) • Automatic classification • Terminology mapping
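The term selection support listed above can be illustrated with a small sketch. The slides do not state how search words are associated with subject metadata terms, so this example assumes plain co-occurrence counts between free-text words and assigned descriptors, kept separately per specialty. The function names `train` and `recommend`, the record layout, and the toy records are all hypothetical.

```python
from collections import Counter, defaultdict

def train(records):
    """Count how often each free-text word co-occurs with each descriptor.

    records: iterable of (specialty, text, descriptors) tuples.
    Returns {specialty: {word: Counter(descriptor -> count)}}.
    """
    model = defaultdict(lambda: defaultdict(Counter))
    for specialty, text, descriptors in records:
        # Each distinct word in a document counts once per descriptor.
        for word in set(text.lower().split()):
            model[specialty][word].update(descriptors)
    return model

def recommend(model, specialty, query, k=3):
    """Suggest up to k descriptors for a query within one specialty."""
    scores = Counter()
    for word in query.lower().split():
        scores.update(model.get(specialty, {}).get(word, {}))
    return [term for term, _ in scores.most_common(k)]

# Toy records, invented for illustration only.
records = [
    ("music", "violin concerto performance", ["Violins", "Concertos"]),
    ("music", "fiddle tunes and violin technique", ["Violins", "Folk music"]),
    ("music", "fiddle repertoire", ["Violins"]),
    ("biology", "fiddle crab behaviour", ["Crabs"]),
]
model = train(records)
music_terms = recommend(model, "music", "fiddle", k=1)
biology_terms = recommend(model, "biology", "fiddle", k=1)
```

In this toy data the same search word leads to different suggestions depending on the specialty, which is the behaviour the recommender is meant to support. A real system would likely replace raw counts with a statistical association measure.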
The Search Term Recommender - Questions • How can specialties & specialty dialects be identified in an information collection? • Do specialty dialects really differ? • Is performance improved when focusing on specialty dialects? • How specific should specialties be? • Tested on 2 bibliographic databases: • Inspec • Medline (Ohsumed collection)
Inspec • Physics, Electrical and Electronic Engineering, Computers and Control • Document: author, title, source, publication year, abstract, Inspec thesaurus descriptors, Inspec classification codes • Test collection:
Medline Ohsumed Collection • Biomedicine and Health • Document: author, title, source, publication year, publication type, abstract, MeSH headings • Test collection:
Determine specialty documents in the collection: • Domain terminology • Publication source • Bibliometric analysis • Social network analysis • Subject-specific classification
Identification of Specialties in an Information Collection Inspec test collection • by top-level categories in the Inspec classification • 3 specialties: Physics, Electrical & Electronic Engineering, Computers & Control Ohsumed test collection • by journals grouped by subject • 33 specialties
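Grouping documents by top-level classification category could look roughly like the sketch below. It assumes that each classification code starts with a letter naming its top-level section, and that a document with codes from several sections belongs to each of the corresponding specialties. Neither assumption is stated in the slides. The mapping `TOP_LEVEL`, the function names, and the sample codes are hypothetical.

```python
from collections import defaultdict

# Assumed mapping from the first character of a code to a specialty.
TOP_LEVEL = {
    "A": "Physics",
    "B": "Electrical & Electronic Engineering",
    "C": "Computers & Control",
}

def specialties_for(codes):
    """Return the set of specialties implied by a document's codes."""
    return {TOP_LEVEL[c[0]] for c in codes if c and c[0] in TOP_LEVEL}

def group_by_specialty(docs):
    """Map each specialty to the ids of the documents assigned to it.

    docs: {doc_id: [classification codes]}
    """
    groups = defaultdict(list)
    for doc_id, codes in docs.items():
        for specialty in sorted(specialties_for(codes)):
            groups[specialty].append(doc_id)
    return dict(groups)

# Made-up documents and codes, for illustration only.
docs = {
    "d1": ["A9710", "A9840"],
    "d2": ["B1210", "C5120"],
    "d3": ["C6130"],
}
groups = group_by_specialty(docs)
```

For the Ohsumed collection the same grouping would be driven by a journal-to-subject table instead of a code prefix.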
Differences in Language • Differences in specialty dialects (specialty term overlap) • Differences in documentary languages (subject metadata term overlap) • Differences in search term recommender suggestions (term suggestion overlap)
Inspec Dialects (specialty term overlap) • Terms analyzed: 60,601 • Subject metadata term overlap: 87% • Suggested term overlap: 30%
Ohsumed Dialects (specialty term overlap) • Terms analyzed: 11,663 • Subject metadata term overlap: 32% • Suggested term overlap: 30%
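The slides report overlap percentages without defining the measure. One plausible reading, used here purely as an assumption, is the share of distinct terms that occur in at least two specialties. The function name `overlap_share` and the sample vocabularies are hypothetical, and the sketch is not meant to reproduce the reported figures.

```python
from collections import Counter

def overlap_share(terms_by_specialty):
    """Fraction of distinct terms that occur in at least two specialties."""
    counts = Counter()
    for terms in terms_by_specialty.values():
        # Count each term once per specialty, however often it appears.
        counts.update(set(terms))
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n >= 2)
    return shared / len(counts)

# Made-up vocabularies, for illustration only.
example = {
    "Physics": {"plasma", "noise", "laser"},
    "Electrical & Electronic Engineering": {"noise", "laser", "antenna"},
    "Computers & Control": {"noise", "compiler"},
}
share = overlap_share(example)
```

Here 2 of the 5 distinct terms ("noise" and "laser") are shared, so the overlap is 40%. The same function could be applied to specialty terms, subject metadata terms, or suggested terms, matching the three comparisons listed on the previous slide.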
Automatic classification • Suggest subject metadata for documents • Comparison: specialty vs. general term suggestions
Automatic Classification Title: “A search for clusters of protostars in Orion cloud cores” Evaluation
Performance of the STR: Inspec • First 3 suggested terms: recall 13.6%, precision 11.2% • Test documents: 42,735 • Specialties: 3
Performance of the STR: Ohsumed • First 3 suggested terms: recall 26%, precision 25.6% • Test documents: 18,733 • Specialties: 33
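Recall and precision over the first three suggestions can be computed per document as in the sketch below. It assumes that a suggestion counts as correct when it exactly matches a subject metadata term assigned to the document. The slides do not say how scores were averaged across documents, so no averaging is shown. The function name and the sample terms are hypothetical.

```python
def precision_recall_at_k(suggested, assigned, k=3):
    """Precision and recall of the first k distinct suggested terms.

    suggested: ranked list of suggested terms.
    assigned: terms actually assigned to the document.
    """
    top = list(dict.fromkeys(suggested))[:k]  # drop repeats, keep rank order
    assigned = set(assigned)
    hits = sum(1 for term in top if term in assigned)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(assigned) if assigned else 0.0
    return precision, recall

# Made-up suggestions and assigned terms, for illustration only.
suggested = ["protostars", "star clusters", "galaxies", "nebulae"]
assigned = ["protostars", "star clusters", "infrared sources", "molecular clouds"]
precision, recall = precision_recall_at_k(suggested, assigned, k=3)
```

In this example two of the three top suggestions are correct (precision 2/3) and two of the four assigned terms are found (recall 1/2).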
Specificity of Specialties • Language differences • Representative sample of specialty language for training
Specificity of Specialties - Inspec Identifying subspecialties by classification hierarchy • e.g. Computers & Control -- Computer Hardware -- Circuits & Devices • Test documents: 2,425 • Specialties: 3
Specificity of Specialties - Ohsumed Identifying subspecialties by journal within subject • e.g. Orthopedics -- Clinical Orthopaedics & Related Research journal • Test documents: 745 • Specialties: 3
Exploratory Web Interfaces Inspec http://metadata.sims.berkeley.edu/str/inspec/inspec.html Ohsumed http://metadata.sims.berkeley.edu/str/ohsumed/ohsumed.html
Summary • How can specialties be identified in an information collection? • Inspec: subject-specific classification • Ohsumed: journal specialty area