Organisation of documents on Web
Introduction
• The prevailing carrier of information on the web is the document itself, not its bibliographic substitute.
• The Internet, and especially the web (WWW), enables us to access and use documents or parts of documents of various data types.
• The greatest problem for information tools on the web is the web's size (at least tens of billions of documents) and its dynamic nature: documents are created, deleted and changed all the time.
Alpe Adria Master Course :: Medical Informatics :: Dr. J. Dimec: Organisation of documents on Web.
Introduction
• The term "document" has become very vague. It can be:
  • text, or text with multimedia inclusions (article, monograph, homepage…),
  • an independent multimedia file (video clip, sound…),
  • a list of hypertext pointers (web directory, results of a search with a web search engine…)…
• A text with multimedia inclusions consists of at least two files. Each multimedia inclusion is referenced from a primary (usually textual) file.
• The files can come from very different locations; they are assembled into one document only on the user's screen.
Introduction
• Organising access to documents on the web does not normally mean collecting them in one place, such as a database.
• Usually it means collecting
  • pointers to documents, and
  • descriptions of documents (metadata).
• The pointers are collected in an information tool; the documents stay on the original servers where their authors put them.
Introduction
• The two most important information tools for organising documents on the web are
  • web directories, and
  • web search engines.
• Web directories are the oldest web information tools, born almost at the same time as the web itself.
• At the beginning there was a simple "What's new" list, and authors reported the existence of their documents to its editorial staff.
• Very soon this chronological principle of organisation became impossible to maintain.
Web directories
• In today's big directories (e.g. Yahoo) all important phases of construction are done automatically:
  • collecting data about documents, and
  • classifying documents.
• Big directories collect data on documents from all domains; entertainment is the prevailing subject.
• Smaller directories are either
  • not limited by domain but have a stricter collection policy, or
  • compiled by and for domain specialists.
Web directories
• Examples will be given during practical work.
Web databases and search engines
• The big search engines collect metadata and pointers for over a billion documents.
• The biggest, Google, claims to hold 3,083,324,652 web pages (summer 2003). The real number is approx. ⅓ smaller.
• The biggest and best are:
  • Google (http://www.google.com),
  • AltaVista (http://www.altavista.com),
  • Teoma (http://www.teoma.com),
  • AllTheWeb (http://www.alltheweb.com).
Web databases and search engines
Well done:
• collecting data with autonomous software agents (robots, spiders, crawlers, harvesters…),
• automatic indexing of documents,
• computing relevance.
Database construction with robots
[Figure: a robot reads document A and collects data on its subject for the database; documents A, B, C and D are connected by pointers (to B, to C, to B, to X).]
Database construction with robots
The robot
1. checks document x,
2. saves all pointers to other documents on a temporary list,
3. indexes document x if it is not indexed yet or has changed since the last visit,
4. downloads the next document from the temporary list and repeats steps 1-3.
• Many robots work for the same database.
• Because of the exponential growth of the web, it can never be entirely indexed.
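The robot's loop can be illustrated with a minimal Python sketch. It makes several assumptions that are not in the slides: the web is replaced by a small hypothetical in-memory dictionary (`TOY_WEB`), "indexing" just stores the document text, and a change is detected by comparing the stored text with the current one. A real robot would download pages over HTTP, respect robots.txt and run in many parallel copies.

```python
from collections import deque

# Hypothetical toy web: document name -> (text, pointers to other documents).
# "X" is deliberately missing to stand in for a dead link.
TOY_WEB = {
    "A": ("text of A", ["B", "C"]),
    "B": ("text of B", ["C"]),
    "C": ("text of C", ["B", "D"]),
    "D": ("text of D", ["X"]),
}


def crawl(web, start, index=None):
    """Follow pointers from `start` and (re)index every reachable document."""
    index = {} if index is None else index
    temporary_list = deque([start])   # documents waiting to be visited
    seen = {start}                    # avoid queueing a document twice
    while temporary_list:
        name = temporary_list.popleft()
        page = web.get(name)          # step 1: check the document
        if page is None:
            continue                  # dead link, nothing to index
        text, pointers = page
        for target in pointers:       # step 2: save pointers on the list
            if target not in seen:
                seen.add(target)
                temporary_list.append(target)
        if index.get(name) != text:   # step 3: index if new or changed
            index[name] = text
    return index                      # step 4 is the loop itself


index = crawl(TOY_WEB, "A")
print(sorted(index))
```

Passing an existing `index` to a later call of `crawl` re-indexes only the documents whose text has changed, which mirrors the "changed since the last visit" condition in step 3.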
Database construction with robots
• Besides the frequencies of stems, search engines use additional information to compute the relevance score of a document. Higher weights are given to
  • stems from the title,
  • stems from hypertext anchors,
  • stems from the top of the page,
  • stems set in bold or italic type…
• An especially effective additional factor in relevance computation is PageRank (Google).
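As a rough illustration of position-dependent weighting, the sketch below scores a document by counting query stems in each part of the page and multiplying the counts by a weight for that part. The weight values, the field names and the function `relevance` are assumptions made up for this example; the slides do not say which weights real search engines use.

```python
# Assumed, purely illustrative weights for the parts of a page.
FIELD_WEIGHTS = {
    "title": 3.0,
    "anchor": 2.5,
    "top": 1.5,
    "emphasis": 1.5,   # bold or italic text
    "body": 1.0,
}


def relevance(query_stems, doc_fields):
    """Sum of stem frequencies, each multiplied by the weight of its field."""
    score = 0.0
    for field, stems in doc_fields.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)   # unknown fields count as body
        for stem in query_stems:
            score += weight * stems.count(stem)
    return score


doc_1 = {"title": ["heart", "surgeri"], "body": ["patient", "recoveri"]}
doc_2 = {"title": ["patient", "recoveri"], "body": ["heart", "surgeri"]}

# The same stem counts more when it appears in the title than in the body.
print(relevance(["heart"], doc_1), relevance(["heart"], doc_2))
```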
Web databases and search engines
PageRank:
• If an author puts a pointer to another document into his/her own document, that usually means he/she considers it to be of some value.
• Documents with many pointers (citations) to them get a higher PageRank.
• The PageRank of a document is higher still if the citing documents have a high PageRank themselves.
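The idea can be sketched with the commonly published iterative formulation of PageRank. This is an illustration under assumptions, not Google's production algorithm: the damping factor of 0.85, the fixed number of iterations, the handling of documents without outgoing pointers and the function name `pagerank` are all choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a dict: document -> cited documents."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start with equal ranks
    for _ in range(iterations):
        # Rank held by documents without outgoing pointers is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {
            p: (1.0 - damping) / n + damping * dangling / n for p in pages
        }
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:            # each citation passes on rank
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: C is cited twice; D is cited by well-cited C,
# F is cited by uncited E.
links = {"A": ["C"], "B": ["C"], "C": ["D"], "E": ["F"]}
ranks = pagerank(links)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print(name, round(ranks[name], 3))
```

In this toy graph D and F each have exactly one citation, but D ends up with the higher rank because the document citing it is itself well cited.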
Web databases and search engines
Not so well done: search interfaces.
• The search interface encourages the user to enter short queries (1-3 words).
• The search interface encourages the user to use Boolean operators.
• Both are inappropriate for a non-Boolean search model, but they need less processing power.
• Remember: a non-Boolean search model performs best with long queries composed of many words and their synonyms.
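The difference between the two models can be shown with a toy sketch. It assumes a very crude ranking function (the number of query-word occurrences in a document) in place of a real non-Boolean model, and the three small documents are invented for the example.

```python
def boolean_and(query, docs):
    """Boolean model: a document matches only if it contains every query word."""
    return [name for name, words in docs.items()
            if all(term in words for term in query)]


def ranked(query, docs):
    """Crude non-Boolean model: order documents by how often query words occur."""
    scores = {name: sum(words.count(term) for term in query)
              for name, words in docs.items()}
    hits = [(score, name) for name, score in scores.items() if score > 0]
    return [name for score, name in sorted(hits, key=lambda h: (-h[0], h[1]))]


# Hypothetical documents that describe the same topic with different words.
docs = {
    "d1": ["heart", "attack", "treatment"],
    "d2": ["myocardial", "infarction", "therapy"],
    "d3": ["knee", "surgery"],
}

# A long query containing synonyms, as recommended for non-Boolean search.
long_query = ["heart", "attack", "myocardial", "infarction",
              "treatment", "therapy"]

print(boolean_and(long_query, docs))   # no document contains every word
print(ranked(long_query, docs))        # both relevant documents are found
```

With a strict AND, every added synonym makes a match less likely; with ranking, every added synonym gives relevant documents another chance to score.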
Web databases and search engines
• Examples will be given during practical work.
Usefulness of directories and search engines
Web directories:
+
• Pointers are ordered by some criterion, e.g. subject categories, which makes searching easier.
• They mostly contain non-trivial documents, with fewer duplicates.
-
• A relatively small number of documents.
• The directory creator's and the user's understanding of the categories can differ, which makes browsing difficult.
Usefulness of directories and search engines
General web directories:
• Useful for finding stable collective sources of information: e-journals, homepages of research groups or institutions. These sources should later be followed by other means. Make bookmarks!
• Less useful for finding individual documents.
Specialised web directories:
• Useful for an initial overview of a field.
• Useful for finding reference literature, standards, protocols…
• Sometimes useful for finding individual documents.
Usefulness of directories and search engines
General search engines:
• Useful for simple, well-defined queries: the name of a person, of a medical appliance or method, of an e-journal…
• Finding a particular article: compose the query from the most informative part of the title, enclosed in quotation marks (phrase search).
• A good property of search engines: the description of a document enters the database quickly (a matter of days, weeks at most).
• Great caution is obligatory regarding the quality of documents!
Digital libraries
• A digital library is both a collection of e-documents and the institution that collects them.
• Documents are used across the network without limits of time or place.
• The collection is usually limited by domain or geography, e.g. the output of a particular academic institution, or the same type of document across the institutions of a region (e.g. research reports).
• It is not normally limited with regard to data types.
• The Internet, or the web, is not a digital library.
Digital libraries
• A d-library often contains documents with less strict protection of authors' rights:
  • research reports funded by public money,
  • preprints,
  • master's theses and doctoral dissertations…
• or documents with limited access:
  • artefacts of cultural heritage,
  • objects in museums and galleries…
• Behind a d-library there is usually an institution with a good reputation, so we can trust the documents.
Historically important digital libraries
NCSTRL (http://www.ncstrl.org/):
• Networked Computer Science Technical Report Library, started in 1995.
• A d-library of technical and research reports.
• Started at 40 US universities with strong computer science departments.
• Today it has an international membership.
• The developers of NCSTRL are still at the forefront of research on public access to knowledge.
Historically important digital libraries
NCSTRL, moral of the story:
• Let the documents stay on the authors' institutional servers. The authors are more interested in preserving their documents than anybody else.
• Build a common user interface that hides the differing ways in which documents are organised locally.
• Each cooperating institution should contribute what it is capable of to the common, orchestrated effort.
• Documents in the d-library should exist in various standard forms: HTML, PDF, text, screen image…
Historically important digital libraries
NDLTD (http://www.ndltd.org)
• Networked Digital Library of Theses and Dissertations, started in 1996.
• Master's theses and doctoral dissertations (ETDs).
• At the beginning a few US universities, today a very international membership.
• Moral of the story:
  • Develop web interfaces for uploading files into the database and for entering metadata. Let the authors do as much of the work as possible. They are more motivated to succeed than anybody else.