Martin Grötschel

On the Road to Scientific Information Portals: Cooperative Digital Libraries. Remarks, Visions, Proposals. Martin Grötschel, IuK 2001, Universität Trier

Contents Introduction • All Information is Part of the Web Can we make this true? • The Visible Web and the Deep Web • There could be an Interconnected Network of Science • Integrating All Types of Resources • We should Organize the Cyber Space • To the Benefit of our Society

Personal Motivation • I have broad interests. • I (have to) search a lot. • I do find things I look for. • However, this process costs too much time and money. • The „scientific information system“ could be much better. • It seems that some scientists have to get involved. • The situation is similar with respect to communication.

Acting Forces • Science drives Technology • Technology drives Change • Change induces Pressure Some Consequences: • Higher Speed and Efficiency • Lower Costs • Universal Connectivity • More and Global Competition What does this imply for Science?

The World of Information • Tons of Printed Material • Zillions of Scientific Web Sites • of E-Journals, E-Prints • of Databases and CD-ROMs • of Multimedia Documents • of E-Mail • of Digital Photos and Videos • etc.

The Players • The Author • The Publisher • The Librarian • The Software Developer • The Service Provider • The Scientific Information Center • The Scientific Society • etc. the user

Some Unsolved Issues • Accessibility • Searchability • Stability • Compatibility • Pricing • Heterogeneity • Diversity and Complexity of Structures • Quality • Authenticity • etc.

Solution • Scientists have to get involved • Solution must be user driven • Cooperation of players • Consensus about structures Some Suggestions in this Talk

Contents • All Information is Part of the Web Can we make this true?

Current Mathematical Resources • Papers and Preprints • Journals and Books • Reviews and Abstracts • Software and Data Collections • Projects and Persons • Voice, Images, and Video Information • Links, Mail, and Virtual Libraries

Math Papers and Preprints • Preprints of the Math-Net • MPRESS (including ArXiv math,...) • EULER • Digital Library @ ACM

Math Journals and Books • SUB Göttingen („Sondersammelgebiet“) • TIB Hannover (Tech Information Library) • ELib @ Uni Osnabrück • EMIS • Springer LINK • DOCUMENTA MATHEMATICA • Lehmanns.de

Math Reviews and Abstracts • MATH @ Zentralblatt • MathSci @ AMS • MATHDI @ FIZ-Karlsruhe • Jahrbuch der Mathematik

Math Software and Data Collections • Netlib @ ANL • eLib @ ZIB • MuPad @ Uni Paderborn • Algebraic Groups • Cinderella • OpenMath

Projects and Persons • Web Sites of Math Research Institutes • Web Sites of Math Departments • BerNAM • Directory of Mathematicians @ ACM • Comb. Membership List AMS, SIAM, MAA • PERSONA MATHEMATICA @ math-net.de • SIGMA @ math-net.de

Voice, Images, and Video • Computer Museum • MSRI Video Server • Electronic Geometric Models Application Servers and Software • MATHEMATICA • Cinderella • Inverse Calculator

Links, Mail, and Virtual Libraries • mathematik.de • Math-Net.de • Mathematical Archives • Opt-Net @ ZIB • MathML

There are zillions of Math Resources in the Net.

The Situation is Similar in all other Sciences • How do you know that all this material exists and where it is? • Old Approach: Link Lists = WWW Virtual Libraries • But much more has come up in recent years!

Is Everything in the Web? • Printed Books • Printed Journals • CD-ROMs • Some Data Bases • Historic Archives • Catalog Cards • ... are not electronically available

Is Everything from the Web in the Web?

Contents • All Information is Part of the Web Can we make this true? • The Visible Web and the Deep Web

The Invisible / Deep Web A fundamental Problem with Search Engines: A Vast Amount of Information is Invisible • Surface Web / Web Robots Start at some „Hubs“ • Interlinked Web Pages • Deep Web • Isolated Web Sites • There are huge Isolated Islands in the Web • Information within Databases, behind CGI Interfaces • Information without Links (e.g. within OPACs of Libraries) • Protected Material, Excluded Explicitly
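The difference between the two layers can be illustrated with a toy link graph. This is a minimal sketch, not a real crawler: the page names and the `LINKS` table are invented for illustration. A robot that starts at a hub and follows links reaches only the interlinked pages; isolated sites and records behind query forms stay invisible.

```python
from collections import deque

# Hypothetical toy web: page -> pages it links to (all names invented).
LINKS = {
    "hub": ["dept-a", "dept-b"],
    "dept-a": ["preprints"],
    "dept-b": ["hub"],
    "preprints": [],
    "opac-record-42": [],            # reachable only through a query form
    "island-site": ["island-page"],  # no page links to this site
    "island-page": [],
}

def crawl(start, links):
    """Breadth-first traversal that follows links from a start page."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

visible = crawl("hub", LINKS)   # the surface web, as seen by the robot
deep = set(LINKS) - visible     # everything the robot never reaches
print(sorted(visible))
print(sorted(deep))
```

In this toy example the island site is internally linked, yet it stays in the deep set because no path leads to it from the hub.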

A Web Search Engine Collecting Visible Information From „The Deep Web: Surfacing Hidden Value; BrightPlanet.com, Jan. 2000“

A Direct Meta Search Engine Fishing for Invisible Information From „The Deep Web: Surfacing Hidden Value; BrightPlanet.com, Jan. 2000“

Characteristics of the Deep Web (in Comparison to the Visible Web) • Public information on the Deep Web is currently 400 to 500 times larger than the commonly defined World Wide Web • 7,500 terabytes of information (550 billion individual documents), compared to 19 terabytes (1 billion documents) From: The Deep Web: Surfacing Hidden Value; BrightPlanet.com, Jan. 2000

Characteristics of the Deep Web (in Comparison to the Visible Web) • More than 100,000 Deep Web sites currently exist • 60 of the largest Deep Web sites collectively contain about 750 terabytes of information (... narrower, with deeper content) • More than half of the Deep Web content resides in topic-specific databases (BrightPlanet concentrates on about 20,000 sites) • A full 95% of the Deep Web is publicly accessible information – not subject to fees or subscriptions • The Deep Web is the largest growing category of new information on the Internet. But the Deep Web is largely unknown. From: The Deep Web: Surfacing Hidden Value; BrightPlanet.com, Jan. 2000
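A quick check of the quoted figures: the sketch below only redoes the arithmetic on the numbers cited from BrightPlanet and adds no new data. The variable names are chosen here for illustration.

```python
# Figures as quoted on the slides (BrightPlanet).
deep_tb, surface_tb = 7_500, 19        # terabytes
deep_docs, surface_docs = 550e9, 1e9   # individual documents
top60_tb = 750                         # the 60 largest Deep Web sites

size_ratio = deep_tb / surface_tb      # ratio by volume
doc_ratio = deep_docs / surface_docs   # ratio by document count
top60_share = top60_tb / deep_tb       # share held by the 60 largest sites

print(f"size ratio: {size_ratio:.0f}x")
print(f"document ratio: {doc_ratio:.0f}x")
print(f"share of the 60 largest sites: {top60_share:.0%}")
```

By volume the ratio comes out just under 400, at the lower end of the quoted "400 to 500 times" range, while the 60 largest sites alone account for about a tenth of the total.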

Making the Deep Web Visible Technology: • Meta Search Engines • Bibliographic Meta Search Engines • Virtual Catalogs and Link Lists Organisational Issues: • Building Networks of Digital Libraries • Forming Library and other Cooperatives • Working on Standards and Formats (Common, Open, Metadata,...)

Categories of Information Systems • Web Sites – Collection, Query Interface • Publications – E-Journals, Preprints, ... • Regional/Nat. Collections – Harvesting Systems • Topical Databases – Subject Specific Aggregation • OPACs – Library Holdings • Journal Archives – Archive of Publishers • Software/Data Collection – Commercial / Public Archive • Compute Servers – Math. Calculations / Demos • Mailing Lists/Archive – Topical Communication Forum • Topical Portals – Wide Spectrum Information System

Problems: Wide Variety of Servers Problems with Search Engines (Web Robots) • Impose High Load on Servers and Networks • Misuse of Metadata • Robots can't see behind CGI Interfaces • Access Rights, Range of Licenses Problems with Cascading Search Engines • Diversity of data formats (MAB, MARC Formats, DC, ...) • Multitude of protocols (Z39.50, HTTP, proprietary) Specialized Repositories and Archives • Scientific Journals provided by Commercial Publishers • Document Delivery Systems and Specialized Historic Archives • Maps, Music, Photos, Videos, Multimedia
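One way a cascading search engine can cope with the diversity of record formats is to map each format onto a common schema before merging results. The sketch below maps a simplified MARC-like record to Dublin Core element names. The dict-based record layout, the four-entry field table, and the sample record are illustrative assumptions, not a complete crosswalk and not the method of any system named in this talk.

```python
# Simplified crosswalk: (MARC tag, subfield) -> Dublin Core element.
# Only a handful of common fields; a real crosswalk is far larger.
MARC_TO_DC = {
    ("245", "a"): "title",
    ("100", "a"): "creator",
    ("260", "b"): "publisher",
    ("260", "c"): "date",
}

def marc_to_dc(record):
    """Convert {(tag, subfield): value} into {dc_element: [values]}."""
    dc = {}
    for key, value in record.items():
        element = MARC_TO_DC.get(key)
        if element is not None:
            # Drop the trailing punctuation that cataloguing rules append.
            dc.setdefault(element, []).append(value.strip(" /:,."))
    return dc

# Hypothetical record, invented for illustration.
sample = {
    ("245", "a"): "Geometric algorithms /",
    ("100", "a"): "Doe, Jane.",
    ("260", "b"): "Example Press,",
    ("260", "c"): "1999.",
    ("300", "a"): "362 p.",  # no mapping in this sketch, so it is dropped
}
print(marc_to_dc(sample))
```

Fields without a mapping are silently dropped here; a production system would more likely log them so that the crosswalk can be extended.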

Contents • All Information is Part of the Web Can we make this true? • The Visible Web and the Deep Web • There could be an Interconnected Network of Science

Virtual/Digital Library • Virtual: Search index, Links, Metadata, OPAC catalog entries • Digital: Structured digital contents, Full texts, Databases

Towards a Scientific Portal to Interconnect the Digital World [Diagram labels: Virtual Library; Digital Library; Information Portal: Cooperative Virtual Digital Scientific Library] The Scientific Portal (Information Portal for the Sciences) is an Entry Point to all Types of Information Products from the Sciences. Behind the Scientific Portal is a Structured Network to be coordinated and organized by the Sciences in a cooperative way. A Task for the IuK Initiative?

Lots of Examples already exist

An Example in the Making Virtuelle Fachbibliothek Technik der TIB Hannover

Example: The DOE Information Bridge • Started in 1997 with 60,000 searchable full text reports online @ DOE Office of Scientific and Technical Information (OSTI) • Direct Search based on the Distributed Explorer developed by a small Internet Company: Innovative Web Application Ltd. (IWA) • A public version in partnership with the Government Printing Office (GPO) of the USA • Many other Federal Deep Web collections added to the DOE Virtual Library • PubScience • PubMed • NTIS Electronic Catalog (450,000 Titles) • NASA Technical Report Server • Energy Portal Search • Digitization efforts for Gray Literature (@ OSTI)

OSTI Virtual Library

PubScience

The GrayLit Information Network Graphic from „Searching The Deep Web; W.L. Warnick et al.“ D-Lib Magazine, Vol. 7, No. 1, January 2001; www.dlib.org

Preprint Network

DOE OSTI

Energy Portal Search

PubMed

NASA Image Exchange

Federal R & D Architecture Graphic from „Searching The Deep Web; W.L. Warnick et al.“ D-Lib Magazine, Vol. 7, No. 1, January 2001; www.dlib.org

An Observation The voluntary work contributed so far was important and will remain so. There will, however, be no satisfactory solution without substantial investment in personnel and funding. We need to become more professional; compare, e.g., Google with Math-Net.

Contents • All Information is Part of the Web Can we make this true? • The Visible Web and the Deep Web • There could be an Interconnected Network of Science • Integrating All Types of Resources

Distributed Meta Search Engines Exist What they do: • Query Search Engines, OPACs, Databases • Perform Distributed Searches in Parallel • Cascade Search to reach Large/Vast Amounts of Targets • Deliver Links, Metadata, and/or Full Texts • Handle a Diversity of Data Structures • Use a Multitude of Internet/Web Protocols • Structure Heterogeneous/Large Result Sets They Rely on a Series of Small Configuration Files
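The tasks listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the design of any engine named in this talk: the three target functions are local stand-ins for remote search engines, OPACs, and databases, and the `TARGETS` dict stands in for the small configuration files. The sketch queries all targets in parallel, tolerates a failing target, and merges hits that share a title.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for remote targets (hypothetical); each returns (title, source) hits.
def search_opac(query):
    return [(f"{query}: library holding", "opac")]

def search_preprints(query):
    return [(f"{query}: preprint", "preprints"),
            (f"{query}: library holding", "preprints")]

def search_failing(query):
    raise TimeoutError("target did not answer")

# The "small configuration file" is modeled as a plain dict here.
TARGETS = {
    "opac": search_opac,
    "preprints": search_preprints,
    "slow": search_failing,
}

def meta_search(query, targets):
    """Query all targets in parallel, skip failures, merge hits by title."""
    merged, failed = {}, []
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in targets.items()}
        for name, future in futures.items():
            try:
                hits = future.result(timeout=5)
            except Exception:
                failed.append(name)  # one broken target must not sink the search
                continue
            for title, source in hits:
                merged.setdefault(title, []).append(source)
    return merged, failed

results, failed = meta_search("polyhedra", TARGETS)
for title, sources in results.items():
    print(title, "<-", ", ".join(sources))
print("unreachable targets:", failed)
```

Merging by exact title is the crudest possible way to structure a heterogeneous result set; real systems would compare normalized metadata instead.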

Combination of Search Engines • Math-Net: Harvest+DC • KOBV Search Engine • Shared Index • Distributed Search • Shared Index • EULER and Dublin Core • DigiBib NRW As studied by J. Lügger in „Über Suchmaschinen, Verbünde und die Integration von Informationsangeboten“; ABI-Technik, June, 2000
