Federating Repositories of Scientific Literature. www.canis.uiuc.edu. The Interspace Prototype (1997-2000) Digital Libraries Initiative (1994-1998) Worm Community System (1990-1993) Telesophy System (1984-1989).
Federating Repositories of Scientific Literature • The University of Illinois Digital Libraries Initiative (DLI) • Project Status & Retrospective • Bruce R. Schatz • dli@uiuc.edu • http://dli.grainger.uiuc.edu • AAAS-98, Digital Libraries Session • Philadelphia, February 1998
[Figure: Evolution of Information Retrieval across the Net. Time axis: 1960, 1970, 1980, 1990, 2000, 2010. Stages: Grand Visions, Text Search, Document Search, Concept Search. Levels: Syntax, Structure, Semantics.] From: Bruce R. Schatz, “Information Retrieval in Digital Libraries: Bringing Search to the Net,” cover article in Science, vol. 275, Jan 17, 1997, special issue on Bioinformatics
Illinois DLI Status • Production Testbed based in a Real Library • Document Search based on Structure • SGML Publisher Stream deployed at U of Illinois • Technology Research for Scalable Federation • Concept Search based on Semantics • Statistical Indexes across subjects and media
Production Testbed Status • Based in major Engineering Library • Production Stream - in testbed before on shelves • Full-text SGML -- Federated Structure Search • 5 publishers, 55 journals, 40,000 articles • Web version campus rollout October 1997 • integrated within library information services
Production Testbed Evaluation • 700 users, steadily increasing to max 1500 • used in intro Computer Science classes • developers and evaluators work closely • needs assessment and usability studies • careful multi-modal usage evaluation • session observations and transaction logs
Primary Partners • journal/magazine Publishers: • American Institute of Physics (AIP) • American Physical Society (APS) • American Astronomical Society (AAS) • American Society of Civil Engineers (ASCE) • American Society of Mechanical Engineers (ASME) • American Society of Agricultural Engineers (ASAE) • American Institute of Aeronautics & Astronautics (AIAA) • Institute of Electrical and Electronics Engineers (IEEE) • Institution of Electrical Engineers (IEE) • IEEE Computer Society (IEEE-CS) • testbed: SoftQuad, OpenText • infrastructure: Hewlett-Packard, Microsoft
Testbed Difficulties • Original plan was to modify Mosaic for search • Web became commercial -- we lost control of developers • Plan to use standard BRS as full-text backend • needed to use SGML-specific OpenText search engine • good-quality SGML simply not available • we had to train every publisher; nothing was ready • SGML interactive display not journal quality • physics requires equations -- hard to display well • Custom software hard to deploy widely • Web widespread but too low-end for professional search
Testbed Successes • Willing to build custom encoding procedures • so succeeded with SGML where Elsevier and OCLC failed • Canonical encoding for structure tags • so can federate across publishers and journals • Willing to build custom software for Search • so able to do multiple views, not a single stream like the Web • Production repositories for real Publishers • became R&D arm of major scientific publishers • Changing the nature of libraries with research • research prototype becomes standard service
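The point about canonical encoding for structure tags can be illustrated with a minimal sketch. It assumes each publisher uses its own tag names for the same structural elements; the publisher names, tag names, and records below are hypothetical and are not taken from the testbed. Once records are mapped to one canonical tag set, a single structure query can run across publishers.

```python
# Hypothetical per-publisher tag mappings (illustrative only, not the testbed's DTDs).
CANONICAL_TAGS = {
    "publisher_a": {"au": "author", "ti": "title", "abs": "abstract"},
    "publisher_b": {"authorname": "author", "arttitle": "title", "summary": "abstract"},
}


def to_canonical(publisher, record):
    """Rename a record's publisher-specific tags to canonical tag names."""
    mapping = CANONICAL_TAGS[publisher]
    # Tags without a mapping are kept unchanged.
    return {mapping.get(tag, tag): value for tag, value in record.items()}


def structure_search(records, tag, needle):
    """Return records whose canonical `tag` field contains `needle` (case-insensitive)."""
    needle = needle.lower()
    return [r for r in records if needle in r.get(tag, "").lower()]


# Two made-up records with different native tag names.
raw_records = [
    ("publisher_a", {"au": "A. Smith", "ti": "Pipe flow turbulence"}),
    ("publisher_b", {"authorname": "B. Smith", "arttitle": "SGML in practice"}),
    ("publisher_b", {"authorname": "C. Jones", "arttitle": "Smith charts"}),
]
federated = [to_canonical(pub, rec) for pub, rec in raw_records]

# One query on the canonical "author" tag reaches both publishers.
matches = structure_search(federated, "author", "smith")
print([m["title"] for m in matches])
```

The design choice shown here is to normalize at ingest time, so that the search layer only ever sees one tag vocabulary.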
Technology Transfer • Illinois DLI considered R&D arm of publishers • broad spectrum of major publishers in scientific literature • successful annual partners’ workshop plus high-level visits • Technology transferred to Publisher partners • contract with AIP to clone testbed software & processing • arrangements with ASCE for a second cloning • Testbed Continuance by University Library • industrial partners program between Library & Publishers • company formed to provide software and service
Technology Research • Scalable Semantics becoming feasible • statistical clustering proves useful interactively • concept spaces and category maps • Semantic indexes for large collections • 400K Inspec (1995) • 4M Compendex (1996) • Simulation of Community Repositories • 1000 collections across all of engineering • testbed for vocabulary switching (federation)
Vocabulary Switching • Grand Challenge of Digital Libraries • semantic interoperability across subject domains • vocabulary switching to suggest across domains • Generating 1000 community repositories • 600 categories across engineering (38 top-level) • 150 categories across EE, CS, physics • 3M raw abstracts, about 10M in community spaces • large-scale supercomputer simulation • 7 days of dedicated computation (10 days overall) • have space navigation; need space intersection
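One way to picture vocabulary switching is as an intersection of two concept spaces: terms related to the query in the source domain that also occur in the target domain act as bridges to target-domain terms. The sketch below is only an illustration under that assumption. The function name, the two toy spaces, and the product-of-weights scoring are hypothetical and are not the project's algorithm.

```python
def switch_vocabulary(term, source_space, target_space, top_k=3):
    """Suggest target-domain terms for `term` via terms shared by both spaces.

    Each space maps a term to a dict of {related term: weight}.
    """
    scores = {}
    for bridge, w_src in source_space.get(term, {}).items():
        if bridge not in target_space:
            continue  # this related term does not occur in the target domain
        for candidate, w_tgt in target_space[bridge].items():
            if candidate == term:
                continue
            # Assumed scoring: product of the two link weights, keeping the best path.
            score = w_src * w_tgt
            scores[candidate] = max(scores.get(candidate, 0.0), score)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Toy concept spaces with made-up weights.
civil_space = {"fluid dynamics": {"turbulence": 0.5, "pipe flow": 0.75}}
aero_space = {"turbulence": {"boundary layer": 0.5, "drag": 0.25}}

suggestions = switch_vocabulary("fluid dynamics", civil_space, aero_space)
print(suggestions)
```

Here "pipe flow" is ignored because it has no entry in the target space, while "turbulence" bridges the two domains.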
Multimedia Federation • Semantic Indexing within Media • Text, Image, Number • Semantic Interoperability across Media • Spatial Data (GIS) dataset intersection • Multi-site DLI Collaboration • U Illinois: systems and supercomputers • U Arizona: algorithms and experiments • UC Santa Barbara: collections and metadata
Semantic Analysis of Multimedia • Collections of Objects containing Units • Text: community repository (topic proximity); document abstracts containing noun phrases • Image: aerial photograph (spatial proximity); feature regions containing texture tiles • Units are media-dependent (statistical parsers) • Text: phrase segmentation (nouns on word parts of speech) • Image: texture segmentation (orientation on pixel densities) • Indexes are media-independent (statistical clusters) • Concept: co-occurrence similarity of units within objects • Category: self-organizing maps of objects within collections
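The "co-occurrence similarity of units within objects" idea can be sketched as follows. The slide does not give the weighting formula, so this example assumes a simple asymmetric weight: the fraction of objects containing unit A that also contain unit B. The function name and sample abstracts are hypothetical. Because the input is just sets of units, the same code would apply to texture tiles within image regions.

```python
from collections import Counter
from itertools import combinations


def build_concept_space(objects):
    """Build a concept space from objects, each given as a set of units.

    Returns {unit: [(related unit, weight), ...]} sorted by descending weight.
    """
    unit_count = Counter()   # number of objects containing each unit
    pair_count = Counter()   # number of objects containing both units of a pair
    for units in objects:
        units = set(units)
        unit_count.update(units)
        for a, b in combinations(sorted(units), 2):
            pair_count[(a, b)] += 1

    space = {u: [] for u in unit_count}
    for (a, b), n in pair_count.items():
        # Assumed weight: share of a's objects that also contain b, and vice versa.
        space[a].append((b, n / unit_count[a]))
        space[b].append((a, n / unit_count[b]))
    for related in space.values():
        related.sort(key=lambda item: (-item[1], item[0]))
    return space


# Made-up abstracts, already reduced to noun phrases.
abstracts = [
    {"digital library", "information retrieval"},
    {"digital library", "information retrieval", "sgml"},
    {"sgml", "markup"},
]
concept_space = build_concept_space(abstracts)
print(concept_space["digital library"])
```

The weight is deliberately asymmetric: "markup" always appears with "sgml" in this toy data, but "sgml" appears with "markup" only half the time.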
Media Interoperability Experiment • Feature regions containing texture tiles in aerial photos • 1M regions in 5K photos around southern California (GIS) • text concept space and category map in geoscience • 10M phrases in 500K abstracts from Georef and Petroleum Abstracts • image concept space and category map in aerial photos • tile similarity space and visual thesaurus maps (10M tiles) • numeric satellite sensor data • 1M NASA AVHRR temperature records, 2M GNIS feature names • spatial gazetteer as bridge image<=>text<=>number • images are labeled by GNIS gazetteer (feature names for text search)
Federated Search • Multiple Indexes in Distributed Repositories • text search: SGML for full-text articles (Testbed); bibliographic abstracts for full coverage (INSPEC) • term suggestion: thesaurus for taxonomy (INSPEC); concept spaces for term coverage (SGML) • Multiple View User Interface Client • uniform displays for multiple indexes • drag-and-drop between display views to mix-and-match • uniform search across multiple repositories • Multiple Protocol Stateful Gateway • single query stream analog to single user interface • will handle distributed repositories for federation, e.g. AAS • OpenText (socket), term-suggest (SQL), Ovid/DRA (Z39.50)
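The stateful multi-protocol gateway can be pictured roughly as below: one query enters, is fanned out to per-repository adapters, and the hits come back in one uniform shape. This is a sketch under stated assumptions, not the testbed's gateway. The backends are in-memory stand-ins rather than real socket, SQL, or Z39.50 clients, and the names `Gateway`, `Hit`, `fulltext_backend`, and `abstracts_backend` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical in-memory stand-ins for remote repositories.
FULLTEXT_ARTICLES = ["SGML markup for physics journals", "Turbulence in pipe flow"]
ABSTRACT_RECORDS = ["Survey of SGML publishing", "Digital library usability"]


def fulltext_backend(query: str) -> List[str]:
    """Stand-in for an adapter that would talk to a full-text engine."""
    return [t for t in FULLTEXT_ARTICLES if query.lower() in t.lower()]


def abstracts_backend(query: str) -> List[str]:
    """Stand-in for an adapter that would talk to a bibliographic service."""
    return [t for t in ABSTRACT_RECORDS if query.lower() in t.lower()]


@dataclass(frozen=True)
class Hit:
    repository: str
    title: str


@dataclass
class Gateway:
    """Fans a single query stream out to every registered backend."""
    backends: Dict[str, Callable[[str], List[str]]]
    history: List[str] = field(default_factory=list)  # per-session state

    def search(self, query: str) -> List[Hit]:
        self.history.append(query)
        hits: List[Hit] = []
        for name, backend in self.backends.items():
            # Each adapter hides its own protocol behind the same call shape.
            hits.extend(Hit(name, title) for title in backend(query))
        return hits


gateway = Gateway({"testbed": fulltext_backend, "inspec": abstracts_backend})
results = gateway.search("sgml")
for hit in results:
    print(f"{hit.repository}: {hit.title}")
```

Keeping the query history on the gateway object is one simple way to model the "stateful" part; a real gateway would also need to hold open connections and result sets per session.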
Building a New Community: starting the field of Digital Libraries • IEEE Computer DLI special issue May 1996 • Computer DLI retrospective planned for 1999 • Allerton workshops on DL Sociology • edited book planned on DL Evaluation • DLI National Coordination effort • Illinois DLI retrospective conference (Mar 98)
The 21st Century: Analysis • Beyond Search to Analysis • Cross-Correlating Information from many sources across the Net • The Net solves problems • Every community has its own special library • Every community and every person does indexing !! • The Internet evolves into the Interspace