High-Performance Digital Library Classification Systems: From Information Retrieval to Knowledge Management
PI: Hsinchun Chen, The University of Arizona
DLI-2 All-Projects Meeting
Cornell, October 18-19, 1999
Research Plan
Research Goals:
• Automatic generation of large-scale classification systems (CL)
• Integration of system-generated and human-generated classification systems
• High-performance simulation and visualization of the Object Oriented Hierarchical Automatic Yellowpage (OOHAY)
Research Plan
Testbed:
• Geoscience: Georef and Petroleum Abstracts (800K) and Georef thesaurus (26K terms)
• Medicine: CancerLit (1M) and UMLS (250K concepts)
• The Web: Indexable pages (10M) and Yahoo directory (250K nodes)
Research Plan
Partners:
• Computing: PA
• Collections: Georef, Arizona Health Science Library, Arizona Cancer Center, Arizona Science and Engineering Library
• User Evaluation:
The Field
Knowledge Management/Knowledge Networking: Definition
“The Knowledge Networking (KN) initiative focuses on the integration of knowledge from different sources and domains across space and time... KN research aims to move beyond connectivity to achieve new levels of interactivity, increasing the semantic bandwidth, knowledge bandwidth, activity bandwidth, and cultural bandwidth among people, organizations, and communities.”
The Field
Knowledge Management Functionality (Source: GartnerGroup, 1998)
Concept: “Yellow Pages”
Value: “Recommendation”
Collaboration:
• Collaborative filters
• Communities
• Trusted advisor
• Expert identification
Semantic:
• Clustering/categorization: “table of contents”
• Semantic networks: “index”
• Dictionaries
• Thesauri
• Linguistic analysis
• Data extraction
Retrieved Knowledge
Techniques
Automatic Generation of CL: Foundation from NSF/DARPA/NASA Digital Library Initiative-1
Illinois DLI-1 project: “Federated Search of Scientific Literature”
Research goal: Semantic interoperability across subject domains
Technologies: Semantic retrieval and analysis technologies
Natural Language Processing
• Text tokenization
• Part-of-speech tagging
• Noun phrase generation
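The three natural language processing steps listed on this slide can be illustrated with a minimal Python sketch. It is not the project's noun phraser: the tiny `LEXICON`, the tag names, and the rule "a maximal run of adjectives and nouns that ends in a noun" are all assumptions made for illustration, and words missing from the lexicon are simply treated as nouns.

```python
import re

# Hypothetical toy lexicon (word -> part-of-speech tag). A real tagger would
# use a trained model or a large lexicon; unknown words default to NOUN here.
LEXICON = {
    "the": "DET", "a": "DET", "of": "PREP", "in": "PREP", "for": "PREP",
    "is": "VERB", "are": "VERB", "uses": "VERB", "supports": "VERB",
    "digital": "ADJ", "semantic": "ADJ", "large": "ADJ", "automatic": "ADJ",
}

def tokenize(text):
    """Step 1: split text into lowercase word tokens (punctuation is dropped)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def tag(tokens):
    """Step 2: attach a part-of-speech tag to every token via lexicon lookup."""
    return [(token, LEXICON.get(token, "NOUN")) for token in tokens]

def noun_phrases(tagged):
    """Step 3: collect maximal ADJ/NOUN runs, trimmed so they end in a noun."""
    phrases, run = [], []
    # The sentinel entry flushes the final run at the end of the input.
    for word, pos in tagged + [("", "END")]:
        if pos in ("ADJ", "NOUN"):
            run.append((word, pos))
            continue
        while run and run[-1][1] != "NOUN":
            run.pop()  # drop trailing adjectives that modify nothing
        if run:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases

sentence = "The digital library supports semantic retrieval of large collections."
print(noun_phrases(tag(tokenize(sentence))))
```

The extracted phrases would then serve as the index terms that the later co-occurrence and clustering steps operate on.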
Techniques
Automatic Generation of CL: Foundation from NSF/DARPA/NASA Digital Library Initiative-1
Illinois DLI project: “Federated Search of Scientific Literature”
Research goal: Semantic interoperability across subject domains
Technologies: Semantic retrieval and analysis technologies
Natural Language Processing
Co-occurrence analysis
• Heuristic term weighting
• Weighted co-occurrence analysis
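Weighted co-occurrence analysis can be sketched in a few lines of Python. The weighting below is an assumption chosen for illustration, not the project's formula: the link from term a to term b is the share of a's documents that also contain b, scaled by a logarithmic rarity factor for b. The function name and the sample documents are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def cooccurrence_weights(docs):
    """Return {(a, b): weight} for every ordered pair of co-occurring terms.

    docs is a list of term lists, one per document. Weights are asymmetric:
    the link a -> b generally differs from the link b -> a.
    """
    n = len(docs)
    doc_freq = Counter()   # number of documents containing each term
    pair_freq = Counter()  # number of documents containing both terms
    for terms in docs:
        unique = set(terms)
        doc_freq.update(unique)
        pair_freq.update(permutations(sorted(unique), 2))
    weights = {}
    for (a, b), both in pair_freq.items():
        share = both / doc_freq[a]              # how often b accompanies a
        rarity = math.log(1 + n / doc_freq[b])  # assumed heuristic: favor rarer b
        weights[(a, b)] = share * rarity
    return weights

docs = [
    ["oil", "reservoir", "seismic"],
    ["oil", "reservoir"],
    ["oil", "drilling"],
    ["seismic", "survey"],
]
weights = cooccurrence_weights(docs)
print(weights[("reservoir", "oil")], weights[("oil", "reservoir")])
```

Keeping the weights asymmetric lets a specific term point strongly at a general one without the reverse being true, which is useful when the links are later used to suggest related terms.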
Techniques
Automatic Generation of CL: Foundation from NSF/DARPA/NASA Digital Library Initiative-1
Illinois DLI project: “Federated Search of Scientific Literature”
Research goal: Semantic interoperability across subject domains
Technologies: Semantic retrieval and analysis technologies
Natural Language Processing
Co-occurrence analysis
Neural Network Analysis
• Document clustering
• Category labeling
• Optimization and parallelization
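Document clustering and category labeling with a neural network can be illustrated with a small self-organizing map (SOM), the map type named later in the deck. The sketch below is a plain-Python toy under stated assumptions: the grid size, learning rate, neighborhood schedule, and the rule "label a node with its highest-weighted term" are illustrative choices, not the project's parameters.

```python
import math
import random

def best_node(nodes, vec):
    """Grid position of the node whose weight vector is closest to vec."""
    return min(nodes, key=lambda pos: math.dist(nodes[pos], vec))

def train_som(vectors, rows, cols, epochs=50, seed=0):
    """Train a rows x cols map on document term vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    nodes = {(r, c): [rng.random() for _ in range(dim)]
             for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays from 1 toward 0
        rate = 0.5 * frac                    # assumed learning-rate schedule
        radius = max(rows, cols) / 2 * frac  # assumed neighborhood schedule
        for vec in vectors:
            winner = best_node(nodes, vec)
            for pos, weights in nodes.items():
                d = math.dist(pos, winner)
                if d > radius:
                    continue
                influence = math.exp(-(d * d) / (2 * radius * radius))
                for i in range(dim):
                    weights[i] += rate * influence * (vec[i] - weights[i])
    return nodes

def label_nodes(nodes, vocab):
    """Category labeling: name each node after its highest-weighted term."""
    return {pos: vocab[max(range(len(vocab)), key=weights.__getitem__)]
            for pos, weights in nodes.items()}

vocab = ["rock", "music", "cancer", "gene"]
docs = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
som = train_som(docs, rows=2, cols=2)
print(label_nodes(som, vocab))
print([best_node(som, d) for d in docs])
```

Each document is assigned to its best-matching node, so documents that land on the same or neighboring nodes form a cluster, and the node labels give the cluster a category name.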
Techniques
Automatic Generation of CL: Foundation from NSF/DARPA/NASA Digital Library Initiative-1
Illinois DLI project: “Federated Search of Scientific Literature”
Research goal: Semantic interoperability across subject domains
Technologies: Semantic retrieval and analysis technologies
Natural Language Processing
Co-occurrence analysis
Neural Network Analysis
Advanced Visualization
• 1D: alphabetic listing of categories
• 2D: semantic map listing of categories
• 3D: interactive, helicopter fly-through using VRML
Techniques
Automatic Generation of CL (continued):
• Entity extraction and co-reference based on TREC and MUC
• Text segmentation and summarization based on TextTiling and wavelets
• Visualization techniques based on Fisheye, Fractal, and Spotlight
Techniques
Integration of CL:
• Lexicon-enhanced indexing (e.g., UMLS Specialist Lexicon)
• Ontology-enhanced query expansion (e.g., WordNet, UMLS Metathesaurus)
• Ontology-enhanced semantic tagging (e.g., UMLS Semantic Nets)
• Spreading-activation based term suggestion (e.g., Hopfield net)
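Spreading-activation term suggestion, the last item above, can be sketched as a Hopfield-style iteration over a weighted term network. The sketch is illustrative only: the sigmoid threshold `theta`, the `slope`, the clamping of seed terms at full activation, and the sample links are assumptions, and the link weights could come from co-occurrence analysis or from a thesaurus.

```python
import math

def suggest_terms(links, seeds, top_k=3, theta=0.5, slope=0.1,
                  max_iter=20, eps=1e-4):
    """Rank terms by the activation that spreads out from the seed terms.

    links maps a source term to {target term: weight}. Seed terms are held
    at activation 1.0; every other term is recomputed on each pass as a
    sigmoid of its weighted incoming activation, until the total change
    falls below eps or max_iter passes have run.
    """
    terms = set(links) | {t for targets in links.values() for t in targets}
    act = {t: 1.0 if t in seeds else 0.0 for t in terms}
    for _ in range(max_iter):
        new = {}
        for term in terms:
            if term in seeds:
                new[term] = 1.0
                continue
            net = sum(targets[term] * act[src]
                      for src, targets in links.items() if term in targets)
            new[term] = 1.0 / (1.0 + math.exp(-(net - theta) / slope))
        change = sum(abs(new[t] - act[t]) for t in terms)
        act = new
        if change < eps:
            break
    candidates = [t for t in terms if t not in seeds]
    return sorted(candidates, key=lambda t: (-act[t], t))[:top_k]

# Hypothetical term network for illustration.
links = {
    "cancer": {"tumor": 0.9, "oncology": 0.6},
    "tumor": {"biopsy": 0.7},
    "oncology": {},
    "weather": {"rain": 0.9},
}
print(suggest_terms(links, {"cancer"}))
```

Because activation passes through intermediate terms, a term such as "biopsy" can be suggested even though it has no direct link to the seed, while terms in an unrelated part of the network stay near zero.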
Techniques
High-performance Simulation and Visualization:
• Algorithmic optimization and parallelization on NCSA supercomputers (time machine)
• Advanced, interactive 2D/3D visualization via Java, VRML, and OpenGL
Research Status
From YAHOO! to OOHAY?
Object Oriented Hierarchical Automatic Yellowpage
Research Status
OOHAY: Visualizing the Web
Arizona DLI-2 project: “From Interspace to OOHAY?”
Research goal: automatic and dynamic categorization and visualization of ALL the web pages in the US (and the world, later)
Technologies: OOHAY techniques
• Multi-threaded spiders for web page collection
• High-precision web page noun phrasing and entity identification
• Multi-layered, parallel, automatic web page topic directory/hierarchy generation
• Dynamic web search result summarization and visualization
• Adaptive, 3D web-based visualization
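A multi-threaded spider of the kind listed above can be sketched with Python's standard thread pool. To stay runnable without network access, the sketch crawls a hypothetical in-memory `FAKE_WEB` instead of real pages; the URLs, the `fetch` stub, and the depth limit are assumptions for illustration and do not describe the CI Spider, Meta Spider, or Med Spider implementations.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "example.org/a": ("rock music news", ["example.org/b", "example.org/c"]),
    "example.org/b": ("jazz reviews", ["example.org/d"]),
    "example.org/c": ("classic rock music", ["example.org/a"]),
    "example.org/d": ("rock music archive", []),
}

def fetch(url):
    """Stand-in for an HTTP request; returns (url, text, links)."""
    text, links = FAKE_WEB.get(url, ("", []))
    return url, text, links

def crawl(start_urls, key_phrase, max_depth=2, workers=4):
    """Breadth-first crawl that fetches each level of links in parallel."""
    seen = set(start_urls)
    frontier = list(start_urls)
    hits = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_depth + 1):
            if not frontier:
                break
            next_frontier = []
            # pool.map yields results in frontier order, so output is stable.
            for url, text, links in pool.map(fetch, frontier):
                if key_phrase.lower() in text.lower():
                    hits.append(url)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return hits

print(crawl(["example.org/a"], "rock music"))
```

Only the fetches run in worker threads; the bookkeeping of visited URLs stays in the main thread, which avoids the need for locks in this small sketch.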
Research Status
OOHAY: Visualizing the Web
[Figure: web visualization screenshot; visible category label: ROCK MUSIC …]
Research Status
OOHAY: CI Spider, Meta Spider, Med Spider
1. Enter starting URLs and key phrases to be searched.
2. Search results from spiders are displayed dynamically.
For project information and free download: http://ai.bpa.arizona.edu
Research Status
OOHAY: CI Spider, Meta Spider, Med Spider
3. Noun phrases are extracted from the web pages, and the user can select preferred phrases for further summarization.
4. An SOM is generated based on the phrases selected.
Steps 3 and 4 can be repeated to refine the results.
Research Status
Digital library research in the New York Times: cover article, Sep 30, 1999
Research Status
DL Special Issues and Activities:
• IEEE Computer, May 1996 (Schatz/Chen)
• IEEE Computer, February 1999 (Schatz/Chen)
• Second Asia DL Workshop, November 8-9, 1999, Taipei, Taiwan
• JASIS, 2000, forthcoming (Chen)
Berkeley (Wilensky), UCSB (Hill/Smith), Maryland (Greene/Shneiderman), Xerox PARC (Baldonado), IBM (Liu), Texas A&M (Shipman/Furuta), NASA (Kaplan)