
Document Databases for Information Management

Gregor Erbach (FTW, Wien; DFKI, Saarbrücken; ETL, Tsukuba), erbach@ftw.at


Presentation Transcript


  1. Document Databases for Information Management
     Gregor Erbach
     FTW, Wien; DFKI, Saarbrücken; ETL, Tsukuba
     erbach@ftw.at

  2. IM = DM?
     • Is Information Management the same as Document Management?
     • No, because the relevant information may be distributed across several documents, or may be only a small part of a document.
     • Then what is information management?
     • Extraction, storage, indexing and retrieval of information units contained in documents.

  3. IM Applications
     • Document Retrieval
     • Routing
     • Question Answering
     • Factual Database Construction
     • Summarisation

  4. Document Annotation
     • Document annotation adds information to documents
     • Annotation formats: SGML, XML, LaTeX, ...
     • Annotation standards: HTML, NITF, TEI, CES, GDA, Map Task, TreeBank, Dublin Core

  5. Formal Properties of XML
     • Tree structures
     • Nodes with attribute/value pairs
     • Node content is a string which can contain XML trees
     • Nodes can have identifiers
     • No type hierarchy
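The properties listed on this slide can be seen in a small example. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the sample markup and its tag and attribute names (`s`, `name`, `id`, `type`) are made up for illustration and do not come from the presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence; tag and attribute names are illustrative only.
SAMPLE = '<s id="s1" lang="en">The <name id="n1" type="person">Erbach</name> talk</s>'

root = ET.fromstring(SAMPLE)


def describe(node, depth=0):
    """Return one line per node: tag, attribute/value pairs, and leading text."""
    lines = ["  " * depth + f"{node.tag} {dict(node.attrib)} text={node.text!r}"]
    for child in node:
        lines.extend(describe(child, depth + 1))
    return lines


# Nodes can carry identifiers, so they can be looked up directly.
by_id = {n.get("id"): n for n in root.iter() if n.get("id") is not None}

# Node content is mixed: a string ("The ") followed by an embedded tree (<name>),
# whose trailing string (" talk") is stored on the child as .tail.
name = by_id["n1"]
```

Note that nothing in the markup itself says that `name` is a kind of anything else, which is the "no type hierarchy" point: any such relation would have to be stated outside the XML tree.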

  6. Language Technologies
     • Think of language technologies as processes that add annotations to documents, based on an analysis of the documents' linguistic content.
     • This point of view allows a uniform treatment of human-generated and LT-generated annotations.
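One way to picture this view is a document that holds a list of annotations, each recording who or what produced it. The sketch below is an assumption about how that could look, not the author's design: the `Annotation` and `Document` classes and the toy `capitalised_word_tagger` are hypothetical names invented for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start: int   # character offset where the span begins
    end: int     # character offset where the span ends (exclusive)
    label: str
    source: str  # "human" or the name of the LT component that added it


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def capitalised_word_tagger(doc):
    """Toy LT component: marks capitalised words that follow whitespace."""
    for m in re.finditer(r"(?<=\s)[A-Z][a-z]+", doc.text):
        doc.annotations.append(Annotation(m.start(), m.end(), "name?", "toy-tagger"))
    return doc


doc = Document("The talk was given by Gregor in Tsukuba.")
doc.annotations.append(Annotation(4, 8, "topic", "human"))  # human-generated
capitalised_word_tagger(doc)                                # LT-generated

# Both kinds of annotation live in the same list and are read the same way.
labels = [(doc.text[a.start:a.end], a.label, a.source) for a in doc.annotations]
```

Because the human annotation and the tagger output share one representation, any later component (an indexer, for example) can process both without knowing where they came from.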

  7. Document-Level LT
     • Language Identification
     • Categorisation
     • Summarisation
     All of these can also be applied to parts of documents.

  8. Collection-Level LT
     • Clustering
     • Topic detection and tracking
     • Multi-document summarisation

  9. Fine-Grained LT
     • Morphology
     • Part-of-speech tagging
     • (Shallow) parsing
     • Coreference resolution
     • Information extraction

  10. LT and Document Annotation
     [Diagram: (Annotated) Text Document → LT → Annotated Text Document]

  11. Information Retrieval
     • Retrieval of information units in response to an information need
     • How is the information need stated (keywords, questions, examples)?
     • How is the information need represented?
     • How are information units represented?
     • How are the representations matched?

  12. How are documents represented?
     • XML trees
     • Index of word/phrase occurrences
     • Index of relations (represented as feature structures)
     The word, phrase and relation indexes should have pointers to text locations.
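An index of word occurrences with pointers to text locations can be sketched as a positional inverted index. This is a minimal illustration under simple assumptions (whitespace-and-punctuation tokenisation, character offsets as pointers); the function name `build_index` and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased word to a list of (doc_id, char_offset) pointers."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for m in re.finditer(r"\w+", text):
            index[m.group().lower()].append((doc_id, m.start()))
    return index


# Illustrative documents only.
docs = {
    "d1": "Document databases store documents.",
    "d2": "Information management uses documents.",
}
index = build_index(docs)

# Each pointer leads back to the exact place in the source text.
doc_id, offset = index["documents"][0]
snippet = docs[doc_id][offset:offset + len("documents")]
```

A phrase or relation index could follow the same pattern, with the key being a phrase or a feature structure rather than a single word.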

  13. How are queries represented?
     • Words / phrases
     • Relations (expressed as feature structures)

  14. How are representations matched?
     • Unification
     • Apparent mismatches between query and representation can be resolved by relaxing the query.
     • Any inference that is needed is carried out by forward or backward chaining.
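Matching by unification, with relaxation as a fallback, can be sketched as follows. This is a simplified illustration, not the author's system: feature structures are modelled as nested Python dicts with atomic leaves, variables and structure sharing (re-entrancy) are not handled, and relaxation is assumed to mean dropping top-level features in a caller-supplied order. The names `unify` and `match_with_relaxation` and the sample data are hypothetical.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaves).

    Returns the merged structure, or None if the two are incompatible.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # conflicting values for the same feature
                result[key] = merged
            else:
                result[key] = b_val
        return result
    return a if a == b else None


def match_with_relaxation(query, fact, droppable):
    """Try the full query first, then drop features from `droppable` one by one."""
    relaxed = dict(query)
    dropped = []
    while True:
        result = unify(relaxed, fact)
        if result is not None:
            return result, dropped
        remaining = [f for f in droppable if f in relaxed]
        if not remaining:
            return None, dropped
        dropped.append(remaining[0])
        del relaxed[remaining[0]]


# Illustrative data: an indexed relation and a query that disagrees on one feature.
fact = {"pred": "acquire", "agent": {"name": "CompanyA"}, "object": "CompanyB"}
query = {"pred": "acquire", "agent": {"name": "CompanyA"}, "object": "CompanyC"}

strict = unify(query, fact)
relaxed_result, dropped = match_with_relaxation(query, fact, droppable=["object"])
```

Here the strict match fails because the `object` values conflict, while the relaxed query (without `object`) unifies with the fact. Forward or backward chaining would sit on top of this, deriving further facts to unify against, and is not shown.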

  15. Research Issues
     • Relevance ranking for feature-structure-based queries
     • Efficient indexing and matching of feature structures (fast unification)
     • Representation of information content (ontologies) in the formalism
