Introduction to Digital Libraries Thanks to Michael L. Nelson and Robert Allen
DLs vs archives vs repositories All deal with collections, objects, processes • DLs change over time • Archives are never touched • Unpublished works - grey literature • Repository - more general term • A mechanism for storing any information about the definition of a system at any point in its life-cycle. Repository services would typically be provided for extensibility, recovery, integrity, naming standards, and a wide variety of other management functions. • A repository is a network-accessible storage system in which digital objects may be stored for possible subsequent access or retrieval.
SOAP Model for collection management • Selection • Organization • Access • Persistence
What is a Library?
Main Entry: li·brary Pronunciation: 'lI-"brer-E; British usually and US sometimes -br&r-E; US sometimes -brE, ÷-"ber-E Function: noun Inflected Form(s): plural -brar·ies
Etymology: Middle English, from Medieval Latin librarium, from Latin, neuter of librarius of books, from libr-, liber inner bark, rind, book
1 a: a place in which literary, musical, artistic, or reference materials (as books, manuscripts, recordings, or films) are kept for use but not for sale b: a collection of such materials
2 a: a collection resembling or suggesting a library <a library of computer programs> <wine library> b: MORGUE 2
3 a: a series of related books issued by a publisher b: a collection of publications on the same subject
4: a collection of sequences of DNA and especially recombinant DNA that are maintained in a suitable cellular environment and that represent the genetic material of a particular organism or tissue
http://www.m-w.com/cgi-bin/dictionary?book=Dictionary&va=library&x=0&y=0
A Tool For Communicating With The Future… SCROLLS FROM THE DEAD SEA The Ancient Library of Qumran and Modern Scholarship http://www.ibiblio.org/expo/deadsea.scrolls.exhibit/intro.html
A History of Libraries in 1 Slide • Lyceum - Ancient Greece • http://en.wikipedia.org/wiki/Lyceum • Alexandria - Ancient Egypt • http://en.wikipedia.org/wiki/Library_of_Alexandria • (…skipping a bit…) • Boston Public Library - First US public lending library (1848) • http://en.wikipedia.org/wiki/Boston_Public_Library • http://www.bpl.org/ • “The commonwealth requires the education of the people as the safeguard of order and liberty” more info: http://www.dlib.org/dlib/january00/01levy.html
“Lone Scientist” Stereotypes Max Munk http://history.nasa.gov/SP-4103/ch4.htm H. J. E. Reid http://history.nasa.gov/SP-4103/ch4.htm Enrico Fermi http://www.anl.gov/Media_Center/logos20-1/fermi01.htm John Stack http://www.hq.nasa.gov/office/pao/History/x1/stack.html Albert Einstein http://www.artnet.com/artist/92724/Vishniac_Roman.htm
Vannevar Bush (1890-1974) • Director of the Office of Scientific Research and Development • led 6,000 scientists in R&D for WWII • previously, science lacked large-scale teams • also chairman of NACA (1939)! • Predicted many technological advances • the “memex” is one whose spirit we are implementing • the purpose was to give scientists the capability to exchange information and to have access to the totality of recorded information image from: http://www.ibiblio.org/pioneers/bush.html
Memex • Integrated computer, keyboard, and desk • “mechanized private file and library” • remove drudgery from information retrieval • suggested implementation was microfilm • various user operations are suggested • Associative indexing was the main purpose • “the process of tying two items together is the important thing” • prelude to hypertext... Image from: http://www.dynamicdiagrams.com/case_studies/mit_memex.html
Memex • Information could come pre-associatively indexed, but the key point was user customization • WWW still does not provide that today • Bush observes that tools change our way of doing things and expand the horizons before us • full impact of WWW and DLs still not known • Interesting: Bush’s Atlantic Monthly article did not predict free-text searching... • knowledge trails only; DMOZ w/o keyword searching
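Associative indexing and user-built trails can be sketched as a small data structure. This is only an illustration of the idea, not Bush's design: the class name `Memex`, its methods, and the item labels are all hypothetical.

```python
from collections import defaultdict


class Memex:
    """Toy model of associative indexing (all names here are hypothetical)."""

    def __init__(self):
        self.links = defaultdict(set)  # item -> items it is tied to
        self.trails = {}               # trail name -> ordered list of items

    def tie(self, a, b):
        """Tie two items together; the association works in both directions."""
        self.links[a].add(b)
        self.links[b].add(a)

    def build_trail(self, name, items):
        """Record a named, user-defined trail and tie each consecutive pair."""
        items = list(items)
        for a, b in zip(items, items[1:]):
            self.tie(a, b)
        self.trails[name] = items

    def associated(self, item):
        """Return the items tied to `item`, sorted for stable output."""
        return sorted(self.links.get(item, set()))


memex = Memex()
memex.build_trail("bows", ["encyclopedia entry", "history text", "physics text"])
print(memex.associated("history text"))
```

The trail is the user customization the slide refers to: the links exist because a particular reader chose to tie those items together, not because a publisher shipped them that way.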
digital preservation research www.digitalpreservation.gov “digital information lasts forever -- or 5 years, whichever comes first” -- Jeff Rothenberg from Lesk, http://www.lesk.com/mlesk/
from Lesk, http://www.lesk.com/mlesk/
What is a Digital Library (DL)? • “…a managed collection of information, with associated services, where the information is stored in digital formats and accessible over a network” (Arms) • there are any number of alternate definitions, but this seems fair enough • no mention of architecture, implementation, content, etc.
How is a DL different from a database? • A traditional SQL database has as its basic element data items in a relation:
    select name
    from employee, project
    where employee.deptnumber = '25' AND
          project.number = '100'
• databases exploit known structures and relations • DBMS retrieval is not probabilistic (Frakes, Baeza-Yates)
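The contrast between exact-match DBMS retrieval and ranked, probabilistic-style retrieval can be sketched as follows. This is a minimal illustration under stated assumptions: the table rows, the documents, and the term-overlap score are hypothetical, and the score is a simple overlap fraction rather than any particular retrieval model.

```python
import sqlite3


def exact_match(dept):
    """DBMS-style retrieval: a row either satisfies the predicate or it does not."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE employee (name TEXT, deptnumber TEXT)")
    con.executemany(
        "INSERT INTO employee VALUES (?, ?)",
        [("Ada", "25"), ("Grace", "25"), ("Alan", "30")],  # hypothetical rows
    )
    rows = con.execute(
        "SELECT name FROM employee WHERE deptnumber = ? ORDER BY name", (dept,)
    ).fetchall()
    con.close()
    return [name for (name,) in rows]


def ranked_match(query, docs):
    """IR-style retrieval: each document gets a score and results are ranked."""
    terms = set(query.lower().split())
    if not terms:
        return []
    scored = []
    for title, text in docs.items():
        overlap = terms & set(text.lower().split())
        if overlap:
            scored.append((len(overlap) / len(terms), title))
    # Best score first; ties broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(title, score) for score, title in scored]


# Hypothetical document collection.
DOCS = {
    "report-a": "digital library preservation of technical reports",
    "report-b": "wind tunnel data for swept wing designs",
    "report-c": "library services for digital collections",
}

print(exact_match("25"))
print(ranked_match("digital library preservation", DOCS))
```

The SQL query returns exactly the rows that satisfy the predicate, with no notion of "partly relevant"; the ranked search returns partial matches ordered by score.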
How is a DL different from the WWW? • The keyword is managed • The WWW is not managed • Some meta searchers (Yahoo, Lycos, DMOZ) attempt to add an organizational framework to their web holdings • However, most are focused on keyword searching (e.g., Google)
A Garden vs. Desert Scrub http://www.gingerbread-mansion.com/ourgarden.html http://www.filmdeserts.com/Open%20Desert%20Areas_1.htm
How is a DL different from the WWW? • Another key difference is who controls the input into the system • most web searchers hunt down their holdings • Lycos is short for Lycosidae lycosa (the “wolf spider”), which pursues its prey and does not build a web (Mauldin, IEEE Expert, 1/97) • some (DMOZ, Yahoo) have humans in the loop for review and classification • DLs are generally more tightly controlled, and have a targeted customer set
DL = Content + Services • “Why not just use the WWW?” • WWW by itself has poor archival & management characteristics • “Why not use an RDBMS?” • Just as a card catalog is not a traditional library (TL), an RDBMS is not a DL; it is a candidate technology for use in DLs • DL is the union of the content and services defined on the content
The Study of Digital Libraries is Multidisciplinary • computer science • tools, protocols, transport, indexing, ontology, dbs • information science • information access and storage, services • human factors • usability, adaptability • law • rights management, open access • economics • business models, services
How is a DL Different from a Traditional Library? • TL has as its focus physical objects • even if the card catalog (metadata) is electronic, the purpose is to point you to a physical location • trafficking in physical objects has both obvious and subtle implications • object can exist only in 1 place • if you have it, I can’t have it (zero-sum distribution) • I have to go to the object, or wait for it to come to me
TLs vs. DLs • DLs are clearly better than TLs at: • dissemination, storing a variety of information • TL objects are more survivable • Who will archive the research information? • the publishers? • the institutions? • the authors? • Will the average DL object still be accessible in 10 years? • Internet Archive image from: http://www.ancientegypt.co.uk/writing/rosetta.html
How is a DL Different from a Traditional Library? • Digital Library • removing the physical restriction has obvious benefits • multiple access, multiple listings, electronic transmission • also complicates many other issues... • intellectual property, terms and conditions, etc. • Note that a TL offers additional social and educational benefits • Most TLs offer hybrid services too.
from Lesk, http://www.lesk.com/mlesk/
TLs vs. DLs • Where does publishing stop, and where do libraries begin? • there have always been tensions between TLs and traditional publishers, but the roles were fairly well defined • DLs can muddle the separation of these responsibilities • result: conflict, and/or new models
Traditional Players (diagram; labels: book store, publisher, service, library, archive; axis: responsibility over time)
What is Scientific and Technical Information (STI)? • STI is the collection of materials, independent of format, used in research, development, and other technical activities • Papers, reports, data sets, images, videos, software, etc. • It is also the output of such R&D activities • STI includes both white and grey literature
White and Grey Literature • The line between the two is not always clear • GreyNet offers an admittedly obsolete definition of grey literature: • “Information produced on all levels of government, academics, business and industry in electronic and print formats not controlled by commercial publishing” • http://www.greynet.org/ • CiteSeer indexes the grey literature and counts citations
White and Grey Literature • Intuitively: • White: author and publisher are often different, the work has been independently reviewed, how to obtain the work is straightforward • Grey: may not be reviewed, often “published” from the source origin, may be difficult to obtain
Literature Examples • White • Journals, books, edited conference proceedings, etc. • Grey • technical reports, government reports, unedited proceedings, non-document STI, etc. • others?
So Why Worry About Grey Literature? • White is generally perceived as having a higher pedigree, being easier to obtain (in a sense), etc. • but it is generally less timely • and is often a summary or abstract of a larger body of work Figure: Pyramid of STI
History of STI Distribution • Originally, scientists published books to document their findings • but the delay was terribly long • Then, scientists exchanged personal letters among themselves for rapidity • but this is point-to-point communication, not broadcast
History of STI Distribution • The current system of journals evolved in the 17th century as the synthesis of both previous models • more timely than books, more available than letters • in fact, some journals with the emphasis on “speed” still have “Letters” in their title • historical information from (Odlyzko, 1995)
But Are Journals Still Relevant? • People still publish in them (tenure and promotions are still largely “count the journal publications” exercises) • But do people read them? • Journals are now used as: • “a medium for priority claiming, quality control, and archiving scientific work” (Bennion, 1994)
But Are Journals Still Relevant? • How important is refereeing anyway? • Most rejected papers end up published somewhere else (Lesk) • Referees have rejected many worthy papers, including some that are the most cited in their respective journals (Campanario, 1996)
But Are Journals Still Relevant? • Different disciplines have adapted: • physics - “the small amount of filtering provided by refereed journals plays no effective role in our research” (Ginsparg, 1994) • math - “it is rare for experts in any mathematical subject to learn of a major new development in their area through a journal publication” (Odlyzko, 1995)
But Are Journals Still Relevant? • computer science - • “in his area, journals have become irrelevant” (Odlyzko, quoting Rob Pike) • “if it did not happen at a conference, it didn’t happen” (Odlyzko, quoting Joan Feigenbaum) • “if I read it in a journal, I’m not in the loop” (Grycz, 1992)
Solutions by Discipline • Physics • pre-prints • arXiv • Mathematics • pre-prints • Computer Science • technical reports, conference proceedings • CiteSeer • Chemistry • still mainly journals, but review is cursory (Quinn, 1995) • Economics • working papers
Journal System - Economic Problems • 20,000 primary research journals (Bennion, 1994) • the number of scientific papers published annually doubles every 10-15 years (Price, 1956) • STI does not enjoy economies of scale • intended audiences are generally static; the content becomes more specialized (Odlyzko, 1995)
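A doubling time of T years corresponds to an annual growth rate of 2^(1/T) - 1, so doubling every 10-15 years means roughly 4.7% to 7.2% more papers each year. A minimal sketch of that arithmetic follows; the function names are illustrative, not from the source.

```python
def annual_growth_rate(doubling_years):
    """Annual growth rate implied by a given doubling time (in years)."""
    return 2 ** (1 / doubling_years) - 1


def growth_factor(doubling_years, span_years):
    """How many times larger the annual output is after `span_years`."""
    return 2 ** (span_years / doubling_years)


for t in (10, 15):
    print(f"doubling every {t} years: {annual_growth_rate(t):.1%} per year")
```

A library whose acquisition budget is flat therefore covers a shrinking share of the literature every year, even before prices rise.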
Journal System - Economic Problems • Because of academic pressures, journals tend to stay the same size, but the number of titles goes up (Quandt, 1996) • The acquisition budget of a library is constant (or decreasing), so it must be more selective in which titles it provides • If libraries cancel subscriptions, the cost to the remainder of the subscribers goes up
Journal System - Economic Problems • The rising cost causes other libraries to cancel subscriptions, causing the price to go up further... • Journals driving themselves out of business is a well studied problem - contact me for more information • Odlyzko estimates that: American universities spend as much buying mathematics journals as the NSF spends doing mathematical research
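The cancellation spiral can be sketched with a toy model. The assumptions are loud ones: the journal has a fixed production cost split evenly among subscribers, each library cancels once the price exceeds its own ceiling, and every number below is hypothetical.

```python
def cancellation_spiral(fixed_cost, ceilings):
    """Return the (price, subscriber_count) history as libraries drop out.

    fixed_cost: cost of producing the journal, shared equally by subscribers.
    ceilings:   the most each library is willing to pay (hypothetical values).
    """
    subscribers = sorted(ceilings)
    history = []
    while subscribers:
        price = fixed_cost / len(subscribers)
        history.append((price, len(subscribers)))
        remaining = [c for c in subscribers if c >= price]
        if len(remaining) == len(subscribers):
            break  # nobody cancelled at this price, so the spiral has stopped
        subscribers = remaining  # fewer subscribers means a higher price next round
    return history


ceilings = [90, 110, 130, 150, 200, 250, 300, 400, 500, 600]
for price, count in cancellation_spiral(1000, ceilings):
    print(f"{count:2d} subscribers at {price:.2f} each")
```

In this run two libraries drop out before the price settles. With lower ceilings, the same loop ends with every subscriber cancelling after the first price is quoted.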
DL Economic Drivers Google Scholar? M. Lesk
original data from the ALA; slide from http://lib-www.lanl.gov/~herbertv/presentations/vala-2004-hvds.pdf
Journal System - Economic Problems • Chemical Abstracts (Lesk) • first published in 1907; it used to cost dozens of dollars per year, and individual chemists subscribed • today, it costs $17,400 / year. • Okerson & Stubbs, 1992 • university book purchases down 15% 1986-1991 • journals/faculty 14 -> 12 in same period • by year 2017, libraries would buy nothing at all!
from Lesk, http://community.bellcore.com/lesk/columbia/session1/ figure 9.2 in text
Journal System - Coverage Problems • But journals only cover a fraction of available STI • approximately 100K domestic, unrestricted STI technical reports (grey literature) produced annually (Esler & Nelson, 1998) • Print journals, by definition, cannot provide access to non-report STI • software, datasets, etc.
Electronic Journals? • An experiment that most scholars agree is good, is the eventual path, and is a great idea for everyone else’s papers... • until tenure is given based on publications in electronic journals, they will not be fully accepted
Many DL Projects Are “Journal Centric” • Many DL projects (JSTOR, TULIP, etc.) are focused on automating the traditional journal methods • this is acceptable for archiving past issues, but seems unsatisfying for future STI
Prediction for Journals M. Nelson • Highly specialized titles will go completely electronic, driven by the rising cost and static readership • economics and academic acceptance will determine when this happens • “Popular” titles with broader appeal will exist in a hybrid format, with both paper and electronic versions • “subscribers” are likely to receive the value added material (soft copy, additional materials, etc.)