The National Science Digital Library (NSDL) as an Example of Information Science Research

William Y. Arms, Cornell University, October 25, 2002

Some Light Reading
• William Y. Arms, "Economic models for open-access publishing." iMP, March 2000. http://www.cisp.org/imp/march_2000/03_00arms.htm
• William Y. Arms, "Automated digital libraries." D-Lib Magazine, July/August 2000. http://www.dlib.org/dlib/july20/07contents.html
• William Y. Arms, "What are the alternatives to peer review? Quality control in scholarly publishing on the web." Journal of Electronic Publishing, 8(1), August 2002. http://www.press.umich.edu/jep/08-01/arms.html
• William Y. Arms, et al., "A Spectrum of Interoperability: The Site for Science Prototype for the NSDL." D-Lib Magazine, 8(1), January 2002. http://www.dlib.org/dlib/january02/arms/01arms.html

A Scenario
A faculty member wished to find a paper for students to read in a class. He began by asking an expert. She suggested the original research paper as suitable. Later, he typed a few terms into Google, browsed the hits, selected one that led to ResearchIndex, found the paper, and downloaded a PDF version from the author's web site.

Viewpoints
[Diagram labels: Computer Science, Cognitive Studies, Society, HCI]

HCI: Eye Tracking

Information Science
[Diagram labels: Computer Science, Cognitive Studies, Applications, Society, HCI]

Open Access to Scientific, Scholarly and Professional Information

Before the Web
Access to Scientific, Medical, Legal Information
In the United States:
• excellent if you belonged to a rich organization (e.g., a major university)
• very poor otherwise (e.g., most K-12 schools)
In many countries of the world:
• very poor for everybody

Research Libraries are Expensive
• staff
• library materials
• buildings & facilities

Baumol's Cost Disease
[Figure: price against year, 1900 to 2050, with curves for labor-intensive services, a bundle of goods and services, and manufactured goods]

Baumol's Cost Disease
[Figure: price against year, 1900 to 2050, with curves for labor-intensive services, a bundle of goods and services, and manufactured goods, with Moore's Law added]

Brute Force Computing
Few people really understand Moore's Law:
• Computing power doubles every 18 months
• Increases 100 times in 10 years
• Increases 10,000 times in 20 years
Simple algorithms plus immense computing power can outperform human intelligence
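The growth figures on this slide follow from compounding the 18-month doubling period. A minimal sketch of the arithmetic, assuming steady doubling (the function name and its default are illustrative, not from the talk):

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Return how many times computing power multiplies over `years`,
    assuming it doubles once every `doubling_months` months."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


# The two horizons quoted on the slide.
for years in (10, 20):
    print(f"{years} years: about {growth_factor(years):,.0f} times")
```

Ten years is 120/18, or about 6.7 doublings, and twenty years about 13.3 doublings, which gives factors close to the rounded 100 and 10,000 on the slide.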

Example: Catalogs and Indexes
Cost disease: catalogs and indexes
• Catalog, index and abstracting records are very expensive when created by skilled professionals
Moore's Law: automatic indexing of full text
• Retrieval using automatic indexing can be at least as effective as manual indexing with controlled vocabularies (Cleverdon 1967, reporting on experiments by Salton)

Brute Force Computing: Substitutes for Human Intelligence
Automated algorithms for information discovery
Similarity of two documents
• Vector space and statistical methods (Salton, Spärck Jones, et al.)
Importance of digital object
• Rank importance of web pages by analysis of the graph of web links (Kleinberg, Page, et al.)
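To make "similarity of two documents" concrete, here is a minimal sketch of the vector space idea: each document becomes a vector of term counts, and similarity is the cosine of the angle between two vectors. The tokenizer, the raw-count weighting, and the sample sentences are simplifying assumptions for illustration; production systems typically add term weighting such as tf-idf.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Map each lowercase word in `text` to its number of occurrences."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents.
doc1 = term_vector("digital libraries expose metadata for harvesting")
doc2 = term_vector("harvesting metadata from digital libraries")
doc3 = term_vector("eye tracking in human computer interaction")

print(cosine_similarity(doc1, doc2))  # shares four terms with doc1
print(cosine_similarity(doc1, doc3))  # shares no terms with doc1
```

No human judgment enters the computation: the score depends only on word counts, which is why the method scales with computing power and not with staff.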

Information Discovery: 1992 and 2002

                     1992         2002
Content              print        digital
Computing            expensive    inexpensive
Choice of content    selective    comprehensive
Index creation       human        automatic
Frequency            one time     monthly
Vocabulary           controlled   not controlled
Query                Boolean      ranked retrieval
Users                trained      untrained

Brute Force Computing: Automated Metadata Extraction
Informedia (Carnegie Mellon)
Automatic processing of segments of video, e.g., television news. Algorithms for:
• dividing raw video into discrete items
• generating short summaries
• indexing the sound track using speech recognition
• recognizing faces
(Wactlar, et al.)

Brute Force Computing + Intelligence of the User
Simple algorithms plus immense computing power plus the intelligence of the user can replace labor-intensive services
[Diagram labels: Computer Science, HCI, Cognitive Studies]

The National Science Foundation's National Science Digital Library (NSDL)
http://www.nsdl.org

Scope
All digital information relevant to any level of education in any branch of science.
• Scientific and technical information
• Materials used in education
• Materials tailored to education

How Big might the NSDL be?
All branches of science, all levels of education, very broadly defined.
Five-year targets:
• 1,000,000 different users
• 10,000,000 digital objects
• 10,000 to 100,000 independent sites

The Integration Task
... to provide a coherent set of collections and services across great diversity

Resources
Integration team
  Budget       $4-6 million
  Staff        25-30
  Management   Diffuse
How can a small team, without direct management control, create a very large-scale digital library?

Philosophy
It is possible to build a very large digital library with a small staff. But ...
• Every aspect of the library must be planned with scalability in mind.
• Some compromises will be made.

Example 1: The Mortal behind the Portal
[This space left intentionally blank.]

Example 2: Interoperability
The Problem
• Conventional approaches require partners to support agreements (technical, content, and business)
• But NSDL needs thousands of very different partners ... most of whom are not directly part of the NSDL program
• The challenge is to create incentives for independent digital libraries to adopt agreements

Function Versus Cost of Acceptance
[Figure: axes are cost of acceptance and function; regions labeled "Few adopters" and "Many adopters"]

Example: Textual Mark-up
[Figure: axes are cost of acceptance and function; points labeled ASCII, HTML, XML, SGML]

The Spectrum of Interoperability

Level        Agreements                                      Example
Federation   Strict use of standards (syntax, semantic,      AACR, MARC, Z 39.50
             and business)
Harvesting   Digital libraries expose metadata; simple       Open Archives metadata harvesting
             protocol and registry
Gathering    Digital libraries do not cooperate; services    Web crawlers and search engines
             must seek out information

Example 3: Searching
Basic Assumptions
• The integration team will not manage any collections
• The integration team will not create any metadata

Effective Information Retrieval
• Comprehensive metadata with Boolean retrieval (e.g., monograph catalog). Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available.
• Full text indexing with ranked retrieval (e.g., news articles). Excellent for relatively homogeneous material, but requires available full text.
• Full text indexing with contextual information and ranked retrieval (e.g., Google). Excellent for mixed textual information with rich structure.
• Contextual information with non-textual materials and ranked retrieval (e.g., Google image retrieval). Promising, but still experimental.
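The difference between Boolean and ranked retrieval can be sketched with a small inverted index. The example below is illustrative only: the toy collection, the tokenizer, and the tf-idf style score are assumptions, not the NSDL search service.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy collection.
DOCS = {
    "d1": "full text indexing of news articles",
    "d2": "monograph catalog with controlled vocabulary",
    "d3": "ranked retrieval over full text news news",
}


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Inverted index: term -> Counter mapping doc id to term frequency."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] += 1
    return index


def boolean_and(index, query):
    """Boolean retrieval: unordered set of documents containing every term."""
    postings = [set(index.get(term, {})) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()


def ranked(index, query, n_docs):
    """Ranked retrieval: order documents by a summed tf * idf score."""
    scores = Counter()
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]


index = build_index(DOCS)
print(boolean_and(index, "news catalog"))
print(ranked(index, "news catalog", len(DOCS)))
```

For the query "news catalog" the Boolean form returns nothing, because no document contains both terms, while the ranked form still returns every partially matching document in order of score. That tolerance is one reason ranked retrieval suits untrained users.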

The NSDL Search Service
Full Text or Metadata?
• Full text indexing is excellent, but is not possible for all materials (non-textual, no access for indexing).
• Comprehensive metadata is available for very few of the materials.
What Architecture to Use?
• Few collections support an established search protocol (e.g., Z39.50).

Broadcast Searching does not Scale
[Diagram labels: User, User interface server, Collections]
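The slide does not spell out why broadcast searching fails to scale. One commonly cited reason is that a broadcast query depends on every collection answering. The sketch below illustrates that single factor under loudly stated assumptions: each collection is independently available with the same probability, and the 0.99 figure is invented for illustration, not a measurement.

```python
def all_respond_probability(availability: float, n_collections: int) -> float:
    """Probability that every one of `n_collections` answers a broadcast
    query, assuming independent and identical availability (an assumption)."""
    return availability ** n_collections


# Hypothetical availability of 99% per collection.
for n in (10, 100, 1000):
    p = all_respond_probability(0.99, n)
    print(f"{n} collections: {p:.4f}")
```

Under these assumptions the chance of a complete answer falls quickly as collections are added, which is consistent with the slide's point when the target is thousands of independent sites.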

The Metadata Repository
The metadata repository is a resource for service providers. It holds information about every collection and item known to the NSDL, including contextual information.
[Diagram labels: Users, Services, Metadata repository, Collections]

The Metadata Repository as a Resource
Support for Service Providers
• Records are exposed through the Open Archives Initiative protocol for metadata harvesting.
• The Core Integration team provides some services based on the metadata repository.
• The architecture encourages others to build services.
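As a rough illustration of what harvesting looks like to a service provider, the sketch below builds request URLs for the ListRecords verb of the Open Archives Initiative Protocol for Metadata Harvesting. The verb and the argument names (metadataPrefix, from, set, resumptionToken) come from the protocol; the base URL is a placeholder, not the NSDL repository's real address, and no network request is made.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption); substitute a real repository's base URL.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    A resumption token continues an earlier, partial response; the protocol
    treats it as an exclusive argument, so no other arguments are sent with it.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # harvest only records changed since
        if set_spec:
            params["set"] = set_spec    # restrict to one named set
    return f"{base_url}?{urlencode(params)}"


print(list_records_url(BASE_URL, from_date="2002-10-01"))
print(list_records_url(BASE_URL, resumption_token="token-from-last-response"))
```

A harvester would fetch the first URL, store the Dublin Core records in the XML response, and repeat with each resumption token until none is returned. The token value shown here is made up.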

Search Service
[Diagram labels: Metadata repository, OAI, SDLIP, Search and Discovery Services, http, Portals, Collections]
James Allan, Bruce Croft (University of Massachusetts, Amherst)

Where is the Center of the Universe?
[Diagram labels: Alexandria, Library of Congress, Elsevier, NSDL, Joe's Pictures, Informedia, Math DL]

Where is the Center of the Universe?
[Diagram labels: British Library, Internet Archive, Library of Congress, Elsevier, OCLC, Harvard, NSDL]

Where is the Center of the Universe?
[Diagram labels: Google, email, Office, Course web sites, Bill Arms, Directories, News and weather, NSDL, Technical documentation]

Acknowledgement
The NSDL is a program of the National Science Foundation's Directorate for Education and Human Resources, Division of Undergraduate Education. The NSDL Core Integration is a collaboration between the University Corporation for Atmospheric Research (Dave Fulker), Columbia University (Kate Wittenberg) and Cornell University (Bill Arms). The Technical Director is Carl Lagoze (Cornell University).
