Indexing Mathematical Abstracts by Metadata and Ontology
IMA Workshop, April 26-27, 2004
Su-Shing Chen, University of Florida, suchen@cise.ufl.edu
Abstract
• OAI extensions to federated search and other services for MathML-based metadata indexing and subject classification of mathematical abstracts.
• Construction of an ontology, or conceptual maps, of mathematics. Mathematical formulas are considered elements of the ontology.
• Ontology indexing by clustering mathematical abstracts or full papers into an information visualization interface, so that users may select items using the ontology as well as metadata.
A DL Server with OAI Extensions: Managing the Metadata Complexity
[Architecture diagram: a DL Server with a Harvest API and Harvester, Data Provider and Service Provider components, the OAI_DC and OAI_XXX metadata formats, and Federated Search and Data Mining services]
A DL Server with OAI Extensions: Managing the Metadata Complexity
Built-in capabilities:
• Harvester – harvests various OAI-compliant data providers
• Data provider – exposes harvested and existing metadata sets
• Service provider – offers federated search and data mining capabilities on metadata sets
Harvest API
• Harvester interface:
• URL to harvest
• Selective harvesting parameters
[Diagram: the Harvester in the DL Server sends harvest requests with parameters to Data Providers … and receives the harvested metadata]
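The harvester interface above takes a URL and selective harvesting parameters. A minimal sketch of how such a request could be assembled follows. It assumes the standard OAI-PMH `ListRecords` verb with its `metadataPrefix`, `from`, `until`, and `set` arguments; the function name, the base URL `http://example.org/oai`, and the set name `math` are hypothetical and are not taken from the DL Server itself.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords URL with optional selective-harvesting arguments."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # Only include the selective-harvesting arguments that were actually given.
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical data provider and set name, for illustration only.
url = build_list_records_url("http://example.org/oai",
                             from_date="2004-01-01", set_spec="math")
print(url)
```

The sketch only builds the request; fetching the response and following resumption tokens would be the harvester's next steps.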
Data Provider
• Expose single or combined harvested metadata sets to other harvesters
• Reformat metadata from different data providers so that it can be harvested by other service providers (e.g., metadata originally in Dublin Core is reformatted to MARC before being exposed)
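The Dublin Core to MARC reformatting mentioned above can be pictured as a field-by-field crosswalk. The sketch below is illustrative only: the mapping table is a simplified assumption loosely modeled on common unqualified Dublin Core to MARC crosswalks, and the record layout (a dict of element name to list of values) is hypothetical rather than the server's actual format.

```python
# Assumed, simplified crosswalk: DC element -> (MARC tag, subfield code).
DC_TO_MARC = {
    "title": ("245", "a"),
    "creator": ("720", "a"),
    "subject": ("653", "a"),
    "description": ("520", "a"),
}


def dc_to_marc(dc_record):
    """Convert {DC element: [values]} into a sorted list of (tag, subfield, value) triples."""
    fields = []
    for element, values in dc_record.items():
        if element not in DC_TO_MARC:
            continue  # elements without a mapping are dropped in this sketch
        tag, code = DC_TO_MARC[element]
        for value in values:
            fields.append((tag, code, value))
    return sorted(fields)


# Made-up record for illustration.
record = {
    "title": ["On Self-Organizing Maps"],
    "creator": ["A. Author", "B. Author"],
    "rights": ["unmapped in this sketch"],
}
marc_fields = dc_to_marc(record)
print(marc_fields)
```

A real data provider would also need to serialize the result (for example as MARCXML) and decide what to do with unmapped elements instead of silently dropping them.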
Service Provider: Federated Search
• Emulates a federated search service on existing and combined harvested metadata sets
• Potentially extends federated search across other search protocols
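One way to picture federated search over several harvested metadata sets is to run the same query against each set and merge the hits. The following is a minimal in-memory sketch under that assumption; the record fields (`id`, `title`, `description`), the collection names, and the simple all-terms matching rule are hypothetical and stand in for whatever search back ends a real service would call.

```python
def search_collection(records, query):
    """Return records whose title or description contains every query term."""
    terms = query.lower().split()
    hits = []
    for rec in records:
        text = (rec.get("title", "") + " " + rec.get("description", "")).lower()
        if all(term in text for term in terms):
            hits.append(rec)
    return hits


def federated_search(collections, query):
    """Query every collection and merge hits, de-duplicating by record id."""
    merged = {}
    for name, records in collections.items():
        for rec in search_collection(records, query):
            entry = merged.setdefault(rec["id"], {"record": rec, "sources": []})
            entry["sources"].append(name)
    return list(merged.values())


# Made-up metadata sets for illustration.
collections = {
    "set_a": [{"id": "oai:a:1", "title": "Harmonic analysis on groups"},
              {"id": "oai:a:2", "title": "Graph colouring"}],
    "set_b": [{"id": "oai:a:1", "title": "Harmonic analysis on groups"},
              {"id": "oai:b:7", "title": "Notes", "description": "harmonic maps"}],
}
results = federated_search(collections, "harmonic")
print(results)
```

Keeping the list of sources per hit lets the interface show where each record was found, which is useful once the same record is exposed by more than one provider.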
Service Provider: Data Mining
• Knowledge discovery on harvested metadata sets
• Metadata classification using the Self-Organizing Map (SOM) algorithm
• Improved retrieval effectiveness through concept browsing and search services
Self-Organizing Map Algorithm
• A competitive, unsupervised learning algorithm
• An artificial neural network algorithm for visualizing and interpreting complex data sets
• Provides a mapping from a high-dimensional input space to a two-dimensional output space
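The properties listed above (competitive, unsupervised, mapping high-dimensional input onto a two-dimensional grid) can be seen in a small training loop. This is a generic textbook-style SOM sketch in pure Python, not the SOM Categorizer described in these slides; the grid size, the linear decay schedules, the Gaussian neighborhood, and all parameter names are assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Competitive step: the grid node whose weight vector is closest to x wins."""
    return min(weights, key=lambda node: sum((w - v) ** 2
                                             for w, v in zip(weights[node], x)))


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols SOM on a list of equal-length vectors (no labels needed)."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # visit the data in random order
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                   # learning rate decays linearly
            sigma = max(sigma0 * (1.0 - frac), 0.5)   # neighborhood radius shrinks
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))  # Gaussian neighborhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])  # pull node toward the input
            step += 1
    return weights


# Made-up 3-dimensional inputs; each is mapped to a (row, col) cell of a 3 x 3 grid.
data = [[0.0, 0.0, 0.1], [0.1, 0.0, 0.0], [1.0, 0.9, 1.0], [0.9, 1.0, 1.0]]
som = train_som(data, rows=3, cols=3)
cells = [best_matching_unit(som, x) for x in data]
print(cells)
```

After training, each input vector is represented by the grid coordinates of its best matching unit, which is what makes the two-dimensional map usable for browsing.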
Data Mining Service Provider System Architecture
[Architecture diagram: browsers send concept browsing and concept search requests and receive responses; the components shown are the Concept Harvester, SOM Categorizer, Input Vector Generator, and Noun Phraser, which fetch metadata from and save SOMs to the Metadata Database]
Concept Harvester • Screenshot of the SOM Categorizer
Construction of Two-level Concept Hierarchy
• Constructing the SOM for each harvested metadata set
• SOMs of the lower layer are added to the upper-layer SOM.
[Figure label: VTETD]
MEDLINE Database
• Developed by the National Library of Medicine (NLM)
• Bibliographic citations and abstracts from more than 4,600 biomedical journals published in the United States and 70 other countries
• Covers the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences
• Over 12 million citations
• Searchable via PubMed or the NLM Gateway
MeSH (Medical Subject Headings)
• MEDLINE uses MeSH as its controlled vocabulary for indexing database articles
• Indexers scan an entire article and assign MeSH headings (or MeSH descriptors) to each article
• MeSH descriptors are arranged in both an alphabetic list and a hierarchical structure
• Updated annually to reflect changes in medicine and medical terminology
Our Experimentation
• Problems
• It is well known that searching by descriptors greatly improves search precision.
• However, it is very difficult for naïve users to know and use the exact MeSH descriptors when searching.
• In addition, as the MEDLINE database grows, information overload prevents users from finding the information relevant to their interests.
• Proposed Approach
• Categorization according to MeSH terms, MeSH major topics, and the co-occurrence of MeSH descriptors
• Clustering using the results of MeSH term categorization through the Knowledge Grid
• Visualization of categories and hierarchical clusters
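The co-occurrence of MeSH descriptors mentioned in the proposed approach can be illustrated by counting how often two descriptors are assigned to the same citation. This is a minimal sketch under that reading; the function name and the sample citations are made up, and the slides do not say how the actual system computes or uses these counts.

```python
from collections import Counter
from itertools import combinations


def descriptor_cooccurrence(citations):
    """Count how many citations each unordered pair of descriptors shares."""
    counts = Counter()
    for descriptors in citations:
        # Sort and de-duplicate so each pair is counted once per citation,
        # always in the same (alphabetical) order.
        for pair in combinations(sorted(set(descriptors)), 2):
            counts[pair] += 1
    return counts


# Made-up citations, each given as its list of assigned descriptors.
citations = [
    ["Neoplasms", "Humans", "Mice"],
    ["Humans", "Neoplasms"],
    ["Mice", "Humans"],
]
pairs = descriptor_cooccurrence(citations)
print(pairs.most_common())
```

Pairs with high counts are candidates for being grouped into the same category, which is one plausible input to the clustering and visualization steps.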
Data Access Services
[Screenshots: MeSH Major Topic Tree View; SOM Tree View]
Knowledge Grid
• Knowledge Grid architecture [Figure courtesy of Cannataro and Talia, "Knowledge Grid: An Architecture for Distributed Knowledge Discovery"]
Future Directions
• Develop a federated search service for OAI-compliant mathematical abstracts.
• Develop an ontology or conceptual maps for mathematics.
• Develop an ontology search service for mathematical abstracts and full papers.
• Develop an architecture that interoperates with other services, such as OCR of mathematical formulas.
Acknowledgement
• Many thanks to the NSF NSDL Program.
• Collaborators – Joe Futrelle (NCSA), Ed Fox (Virginia Tech)
• Student Team – Hyunki Kim, Chee Yoong Choo, Xiaoou Fu, Yu Chen