Thomas Krichel 2006-12-13. LIS510 lecture 12. Today: leftovers from last time. I discuss some elements of Bill Arms’ book on Digital Libraries. It’s an introductory book that is general, but smartly written. It is not a book to teach someone to become a digital librarian.
Thomas Krichel 2006-12-13 LIS510 lecture 12
today • Leftovers from last time. • I discuss some elements of Bill Arms’ book on Digital Libraries. • It’s an introductory book that is general, but smartly written. • It is not a book to teach someone to become a digital librarian. • LIS650 and LIS651 are for that. They really deal with the introduction to digital information. • I also talk generally about understanding some digital contents.
definition • An informal definition of a digital library is a “managed collection of information, with associated services, where the information is stored in digital formats and accessible over a network.” • “managed” is the key word here.
benefits of digital libraries • The digital library brings the library to the user. • Computer power is used for searching and browsing. • Information can be shared. • Information is easier to keep current. • The information is always available. • New forms of information become possible.
costs • Non-digital libraries are very expensive. • Digital libraries are also expensive. Many publishers charge more for online editions than for traditional print. • However the cost of the infrastructure is dropping. • And there is potential for changes in the way information is supplied in digital libraries.
technical change • Electronic storage is becoming cheaper than paper. • Personal computer displays are becoming more pleasant to use. • High-speed networks are becoming widespread. • Computers have become portable.
libraries adapt • Libraries get wired • They offer electronic access, even to the home user. • Other actions depend on the library type • Some shift from information access to community center. • Some adopt digital reference with 24/7 asynchronous help. • Some get involved in digital archiving of institutional assets.
digital library cost • The digital library material will cost more initially because publishers want to see a return on the extra functionality they have developed. • In the longer run, digital library costs may be lower than in print • lower storage cost • less risk to the items • fewer staff required (but differently trained)
classic roles for the library with digital material • Investigation of what to buy • Negotiation of the purchase • Acquisition of access to a service • Installation of access devices • Training of users • Maintenance: update, migrate, replace
beyond the library • The classic roles will be at best a stagnating, if not declining, source of work for information professionals. • The rise of open access will mean that fewer assets than before will have to be purchased. Today’s example http://dme.mozarteum.at • Training needs of users decline as digital media are getting easier to use.
new roles for information professionals • The information age does not happen without information professionals. • There is a huge demand for tech-savvy information professionals out there. Examples include • web site maintenance • digital archiving
impact of technology on staff • Information professionals that are technologically savvy will thrive better than those who are not. • Fortunately the Palmer School offers LIS508, LIS650, LIS651. • It still does not have a system administration class, but that may come as well.
impact of technology on staff • Constant computer use can cause serious health problems. • Problem areas are • bad posture at the desk • eye strain • Use of the mouse is particularly bad. Learn how to avoid using it. • Injuries take a long time to heal.
digital libraries are hard • In digital libraries terminology is a serious problem. Basic concepts are hard to define. • These definition problems also hurt efforts to build sophisticated information systems by semi-automated means. • We live in the age of the brute-force calculation, not the age of artificial intelligence.
data and metadata • Metadata is data about data. The distinction between data and metadata often depends on the context. • Metadata is often divided into • descriptive metadata • structural metadata • administrative metadata
what’s in the digital library? • Items ? • Material ? • Documents ? • Objects? • Digital Items ? • Digital Material ? • Digital Documents ? • Digital Objects ?
storage and dissemination • Items are stored in digital format in a way we can call the stored form of the item. • When the item is shown to the user, it is shown as a “presentation” or “dissemination”. This is the way the object leaves the server. • When it arrives at the users’ machines, they have to “render” the presentation.
users and clients • A user is someone who uses a digital library. Many times, the user is anonymous and cannot be identified. • A client is a piece of software that the user runs to use the digital library. Sometimes this is called a user agent. Most people refer to it as a browser.
work and contents • These are difficult things to discuss. Look at the example of the song “Der Lindenbaum”. It could mean • the song as sound and words • the score • a performance • a recording • an mp3 file containing the recording
repositories • This is a general term used to talk about a computer system whose primary function is storing contents. • When long-run storage is involved a repository becomes an archive. • A server is a computer that is switched on constantly to provide services to the public.
an example of terminology • “A data model is an abstraction (or an extra level of indirection) for digital objects such that each digital object can be seen as an instance of the class defined by the data model.” • “A surrogate is a transmittable serialization or representation of a digital object that can be passed back and forth so we can do things with it. Possible serialization techniques include XML and RDF/XML.”
a digital library from scratch • Much of the data that is stored in digital libraries is text. • Most other material that is not textual in nature, such as sound files and graphics, needs textual metadata in order to be found. Current technology is not able to find it otherwise.
Information • Information is best understood as “what it takes to answer a question”. • The simplest question has a “yes” or “no” answer. Therefore a bit is the natural measure of information. • Term first used by John Tukey in 1946. • Contraction of “binary digit”.
Usage of bits • Computers are sometimes classified by the number of bits they can process at one time, e.g. "32-bit processor". • Graphics are also often described by the number of bits used to represent each dot.
bits and bytes • A bit can take the values 0 or 1, thus it can describe 2 possibilities. • Two bits can take the values 00, 01, 10, 11, thus they can describe 2×2 = 4 possibilities. • n bits can encode 2 to the power n possibilities. • Early chips used to process 8 bits at a time. It became customary to refer to 8 bits as a byte. A byte can encode 2 to the power 8 = 256 possibilities. • We can use binary numbers just as decimal numbers.
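The counting argument on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `possibilities` is an assumption made for the example and does not come from the lecture.

```python
from itertools import product


def possibilities(n_bits: int) -> int:
    """Number of distinct values that n_bits bits can encode (2 to the power n)."""
    return 2 ** n_bits


# Enumerate every pattern of two bits: 00, 01, 10, 11.
two_bit_patterns = ["".join(bits) for bits in product("01", repeat=2)]

print(two_bit_patterns)    # the four 2-bit patterns
print(possibilities(8))    # number of values one byte can encode
print(int("1010", 2))      # a binary number read as a decimal number
print(bin(10))             # and the decimal number written in binary again
```

Binary numbers behave like decimal ones: `int("1010", 2)` reads the digits as powers of two instead of powers of ten.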
application of bytes • IP (Internet Protocol) numbers are used as the addresses of computers on the Internet. • In IP version 4 (the one that is most commonly used), each IP number has 4 bytes. • It is represented as x.x.x.x where x is a number between 0 and 255 (why?) • How many computers can there be on the Internet at any one time?
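One way to explore the two questions on this slide is with Python's standard `ipaddress` module. The address `192.0.2.1` is an assumption chosen for illustration (it lies in a range reserved for documentation); it is not from the lecture.

```python
import ipaddress

# Example address, assumed for illustration only.
addr = ipaddress.IPv4Address("192.0.2.1")

octets = list(addr.packed)   # the 4 bytes that make up the address
as_number = int(addr)        # the same address read as one 32-bit number

max_octet = 2 ** 8 - 1       # largest value a single byte can hold
address_space = 2 ** 32      # number of distinct 4-byte addresses

print(octets, as_number)
print(max_octet, address_space)
```

Each x in x.x.x.x is one byte, which is why it cannot exceed `max_octet`. The value of `address_space` is an upper bound on the number of addresses; the number usable by real machines is lower because some ranges are reserved.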
Many bytes • Larger units are • Kilobyte is 2 to the power 10 bytes (= 1024 bytes) • Megabyte is 2 to the power 20 bytes • Gigabyte is 2 to the power 30 bytes • Terabyte is 2 to the power 40 bytes • From ancient Greek words for "thousand", "large", "giant", and "monster", respectively. The prefix kilo dates back to the French Revolution; the others were adopted later.
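The unit sizes follow directly from the powers of two. A small sketch, where the dictionary name `UNIT_EXPONENTS` is just an illustrative choice:

```python
# Exponent of 2 for each unit, as given on the slide.
UNIT_EXPONENTS = {"kilobyte": 10, "megabyte": 20, "gigabyte": 30, "terabyte": 40}

# Size of each unit in bytes.
sizes = {name: 2 ** exponent for name, exponent in UNIT_EXPONENTS.items()}

for name, size in sizes.items():
    print(f"1 {name} = {size:,} bytes")
```

Each unit is 1024 times the previous one, so a terabyte is 1024 to the power 4 bytes.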
Hex numbers • A byte is often represented by two hex digits. • Each hex digit can encode 16 values. • Written 0 to 9, then A B C D E F. F is 15. • Conventionally prefixed with 0x. • Use the Microsoft calculator in scientific mode to convert.
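As an alternative to a calculator, Python can do the same conversions. This is a minimal sketch; the byte value `0xF5` is an arbitrary example.

```python
value = 0xF5                       # a hex literal, conventionally prefixed with 0x

as_hex = format(value, "02X")      # one byte written as two hex digits
as_binary = format(value, "08b")   # the same byte written as eight bits
high, low = divmod(value, 16)      # the two hex digits as numbers (F is 15)

print(value, as_hex, as_binary, high, low)
print(int("F", 16))                # converting a single hex digit to decimal
```

Because each hex digit stands for exactly four bits, two hex digits always describe one byte.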
applications of hex numbers • Media Access Control (MAC) addresses of hardware that allows access to computer networks. They are 6-byte numbers, each byte written as 2 hex digits, e.g. 00:60:08:F5:20:A9 • Character numbers that you see when you are inserting a special symbol in Microsoft software, e.g. PowerPoint. • Color codes on web pages use 6 hex digits. • 000000 is black • FFFFFF is white
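Both applications can be unpacked in Python. The sketch below uses the MAC address from the slide; the function name `parse_color` is hypothetical, and it relies on the usual web convention that the six digits are three bytes for red, green, and blue.

```python
mac = "00:60:08:F5:20:A9"   # example address from the slide

# Each pair of hex digits is one byte, so the address has 6 bytes.
mac_bytes = bytes(int(part, 16) for part in mac.split(":"))


def parse_color(code):
    """Split a 6-hex-digit color code into red, green, and blue byte values."""
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


print(list(mac_bytes))
print(parse_color("000000"))   # black: no red, green, or blue
print(parse_color("FFFFFF"))   # white: all three at their maximum
```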
Information in a computer file • A file is a piece of data stored on a computer. • Any file contains a sequence of 0s and 1s, like 1010100101010011110101010101… • For a computer to make sense of a file, it has to know what type of file it is.
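To see the 0s and 1s of a file, one can read it in binary mode and print each byte as eight bits. This is a sketch only: the helper `bits_of` and the three-byte temporary file are assumptions made for the example.

```python
import os
import tempfile


def bits_of(data: bytes) -> str:
    """Write out a sequence of bytes as a string of 0s and 1s."""
    return "".join(format(byte, "08b") for byte in data)


# Create a small temporary file so that the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"LIS")
    path = handle.name

# Read the file back as raw bytes, without assuming any file type.
with open(path, "rb") as handle:
    content = handle.read()
os.remove(path)

print(bits_of(content))
```

The bits alone do not say whether they are text, sound, or a picture. That is why the computer has to know the file type.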
executable files • Files that are executable are files that make the computer do something. For example the file starts a program, say PowerPoint. An executable on one computer may not run on another one. • Non-executable files hold data that is used by an executable file. We will call them data files. Example: a PowerPoint slides file.
Characters • Much of the information processed by computers is in the form of characters. • From Wikipedia: • A character is a unit of information that roughly corresponds to a grapheme, or written symbol, of a natural language, such as a letter, numeral, or punctuation mark. • A character is not a grapheme because there are ligatures.
control characters • The concept also includes control characters, which do not correspond to natural language symbols but to other bits of information used to process texts of the language, such as instructions to printers or other devices that display such texts. • An example of such a control character is the newline character.
text files • Many data files contain textual data. • Textual data is a sequence of characters. • A character is an elementary symbol that has some meaning • alphabet letter • hieroglyph • Example: email file • Text files can be read by many computer programs.
non-text files • Examples of non-text files are • graphics files • movie files • sound files • Non-text files are of minor significance in library settings. • There is no way to organize information retrieval for non-text files. They have to be retrieved using a textual surrogate. • Traditional library materials are textual. • We will talk about this later.
Representing characters • Computers don't understand text, they only understand numbers. For computers to be able to treat text, there must be a correspondence between numbers and text characters. Such a correspondence is called a character set. • Examples of characters are • a • c • ë • €
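The correspondence between characters and numbers can be inspected with Python's built-in `ord` and `chr`. Note the assumption: Python numbers characters according to Unicode, which is only one possible character set; the legacy sets may give ë or € a different number or none at all.

```python
# The four example characters from the slide.
examples = ["a", "c", "ë", "€"]

# ord() gives the number of a character, chr() goes the other way.
code_points = {character: ord(character) for character in examples}

for character, number in code_points.items():
    print(character, number, hex(number), chr(number) == character)
```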
Legacy character sets • In early days, computers were a lot less powerful than they are today. • They could only deal with the characters that are most commonly used. • Such sets are • ASCII • ISO-8859-1 • cp1252
ASCII • American Standard Code for Information Interchange • 7-bit character set. There is no such thing as 8-bit ASCII • 95 printable symbols • 33 control characters (0-31, 127) • http://www.ccmr.cornell.edu/helpful_data/ascii2.html has a list up to 127
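The counts on this slide can be reproduced from the number ranges alone. A minimal sketch; the helper `is_ascii` is an illustrative name, not part of any standard.

```python
# ASCII is a 7-bit set, so its character numbers run from 0 to 2**7 - 1.
ascii_numbers = range(2 ** 7)

# Control characters occupy 0-31 and 127, as listed on the slide.
control = [n for n in ascii_numbers if n <= 31 or n == 127]
printable = [n for n in ascii_numbers if n not in control]

print(len(control), len(printable))
print("".join(chr(n) for n in printable))


def is_ascii(text: str) -> bool:
    """True if every character of text has a number below 128."""
    return all(ord(ch) < 128 for ch in text)


print(is_ascii("LIS510"), is_ascii("ë"))
```

The printable characters include the space (number 32), which is how the count reaches 95.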
some ASCII control characters • CR (13, ^M) is the carriage return • LF (10, ^J) is the linefeed • FF (12, ^L) is the form feed (new page) • BS (8, ^H) is the backspace • DEL (127, ALT-127) is delete • ESC (27, ^[) escape
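The caret forms such as ^M follow a simple convention: the letter shown is the control character's number with one bit (value 64) flipped. The sketch below illustrates that convention; the names `CONTROL_CODES` and `caret` are hypothetical. Under the same convention DEL is written ^?, which is a different notation from the ALT-127 key combination given on the slide.

```python
# Control characters and their ASCII numbers, as listed on the slide.
CONTROL_CODES = {"CR": 13, "LF": 10, "FF": 12, "BS": 8, "DEL": 127, "ESC": 27}


def caret(code: int) -> str:
    """Caret notation for a control character: flip the bit worth 64."""
    return "^" + chr(code ^ 0x40)


for name, code in CONTROL_CODES.items():
    print(name, code, caret(code))
```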
ISO-8859-1 • ISO-8859-1, aka ISO-latin-1, extends ASCII with characters that are commonly used by the western European languages. • It is the default character set of HTML. • Positions 128 to 159 are not used. • Cp1252 fills these with graphic characters. It is a Microsoft character set.
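The difference between the two sets shows up when the same byte is decoded both ways. This sketch uses Python's built-in `latin-1` and `cp1252` codecs. One caveat: Python's `latin-1` codec maps positions 128 to 159 to invisible control codes rather than rejecting them, so "not used" here means "no graphic character".

```python
import unicodedata

byte = bytes([0x80])                    # position 128

latin1_char = byte.decode("latin-1")    # an invisible control code
cp1252_char = byte.decode("cp1252")     # a graphic character in cp1252

print(unicodedata.category(latin1_char))   # "Cc" marks a control character
print(cp1252_char)

# Outside positions 128 to 159 the two sets agree, e.g. for é.
print("é".encode("latin-1") == "é".encode("cp1252"))
```

This is why text saved as cp1252 but labelled ISO-8859-1 tends to lose its euro signs and curly quotes.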
This is not enough • There are around 6800 different languages around. • Some of these languages use character sets that are not finite, i.e. folks can make up new characters out of existing ones! • Setting up a character set for all languages is almost impossible.
ISO 10646-1 • Defines the Universal Character Set (UCS). • UCS contains the characters required to write many known languages, even the likes of Oriya, Telugu, Bopomofo, Runic. • ISO 10646 formally defines a 31-bit character set. Characters are represented as 32 bits, i.e. 4 bytes, or 8 hex digits. • Not finished.
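The 4-byte representation can be illustrated with Python's `utf-32-be` codec, used here as a stand-in for the fixed-width form described on the slide. The four sample characters are assumptions picked for the example, one from each of the scripts named above.

```python
import unicodedata

# One sample character per script, chosen for illustration only.
samples = {
    "Oriya": "\u0b05",
    "Telugu": "\u0c05",
    "Bopomofo": "\u3105",
    "Runic": "\u16a0",
}

# Encode each character as 4 bytes and show them as 8 hex digits.
encoded = {script: ch.encode("utf-32-be").hex() for script, ch in samples.items()}

for script, ch in samples.items():
    print(script, unicodedata.name(ch), encoded[script])
```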
Unicode • ISO is an international standards organization. Slow and bureaucratic. • Industry has come together to work on Unicode, originally a 2-byte character set. • With some minor exceptions, the Unicode characters are the same as the first 65536 characters in UCS. • Much better documented standard.
Unicode and legacy sets • The first 128 characters are identical to those in ASCII • The next 128 characters are identical to ISO 8859-1 (Latin-1). • Unicode is well documented and the Unicode book can be downloaded from the Internet. A must-have for the serious digital librarian.
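The overlap with the legacy sets can be verified by decoding every possible byte value and comparing the resulting character numbers. A short sketch using Python's built-in codecs, where `ord` returns the Unicode number of a character:

```python
# Decode all 128 ASCII byte values and all 256 Latin-1 byte values.
ascii_text = bytes(range(128)).decode("ascii")
latin1_text = bytes(range(256)).decode("latin-1")

# In both cases the Unicode number of each character equals its byte value.
ascii_matches = [ord(ch) for ch in ascii_text] == list(range(128))
latin1_matches = [ord(ch) for ch in latin1_text] == list(range(256))

print(ascii_matches, latin1_matches)
```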
Beyond characters • There is more to text than a string of characters. • There is layout • titles • abstracts • mathematical formula spacing
Layout • Layout can be conveyed by additional text that has special meaning. Examples • LaTeX • HTML • PostScript • Another way is to do non-textual layout by adding some other digital signals. Examples • DVI • MS Word • MS PowerPoint These cannot be shown in these slides!
Example: LaTeX
\bigskip\textbf{Class structure}
Classes will be held in the computer lab in the Palmer School between 18:15 and 20:45. An optional practice session will last until 21:15.
\begin{tabular}{@{}llll@{}}
0&2006--09--12&introduction to the course &\\
1&2006--09--19&libraries and food &\\
2&2006--09--26&introduction to shushing &\\
Example: HTML
<p><strong>Class structure</strong>
<p>Classes will be held in the computer lab in the Palmer School between 18:15 and 20:45. An optional practice session will last until 21:15.
<p>Class details:
<p><center><table width=100% border=1>
<tr><td align=left> 0 </td><td align=left> 2006–09–12 </td><td align=left><a href="lis510w06a-00.ppt">introduction to the course</a> </td></tr>
<tr><td align=left> 1 </td><td align=left> 2006–09–19 </td><td align=left><a href="lis510w06a-01.ppt">libraries and food</a> </td>
Example: PostScript Fc(Class)g(structur)o(e)-104 3956 y Fd(Classes)26b(will)g(be)e(held)g(in)h(the)f(computer)f(lab)i(in)f(the)h(P)o(almer)f(School)g(between)f(18:15)h(and)g(20:45.)36 b(An)25 b(optional)e(practice)h(session)-104 4055 y(will)d(last)g(until)f(21:15.)-104 4155 y(Class)i(details:)-104 4307 y(0)141 b(2003\22609\22623)94b(introduction)18 b(to)i(the)h(course)-104 4407 y(1)141 b(2002\22609\22630)94 b(bits)21 b(bytes)f(and)g(characters)-104 4507 y(2)141 b(2003\22610\22607)94 b(databases)20 b(and)g(markup)e(languages)-
DVI (rendition, "class structure")
1659: fntnum27 current font is ptmb8t
1660: setchar67 h:=-820459+473168=-347291, hh:=-22
1661: setchar108 h:=-347291+182183=-165108, hh:=-10
1662: setchar97 h:=-165108+327680=162572, hh:=11
1663: setchar115 h:=162572+254928=417500, hh:=27
1664: setchar115 h:=417500+254928=672428, hh:=43
1665: right3 163840 h:=672428+163840=836268, hh:=53
1669: setchar115 h:=836268+254928=1091196, hh:=69
1670: setchar116 h:=1091196+218232=1309428, hh:=83
1671: setchar114 h:=1309428+290976=1600404, hh:=101
1672: setchar117 h:=1600404+364376=1964780, hh:=124
1673: setchar99 h:=1964780+290976=2255756, hh:=142
1674: setchar116 h:=2255756+218232=2473988, hh:=156
1675: setchar117 h:=2473988+364376=2838364, hh:=179
1676: setchar114 h:=2838364+290976=3129340, hh:=197