Document Databases for Information Management • Gregor Erbach • FTW, Wien; DFKI, Saarbrucken; ETL, Tsukuba • erbach@ftw.at
IM = DM? • Is Information Management the same as Document Management? • No, because the relevant information may be distributed across several documents, or may only be a small part of a document • Then what is information management? • Extraction, storage, indexing and retrieval of information units contained in documents.
IM Applications • Document Retrieval • Routing • Question Answering • Factual Database Construction • Summarisation
Document Annotation • Document Annotation adds information to documents • Annotation Formats: SGML, XML, LaTeX, ... • Annotation Standards: HTML, NITF, TEI, CES, GDA, Map Task, TreeBank, DublinCore
Formal Properties of XML • Tree structures • nodes with attribute/value pairs • node content is a string which can contain XML trees • nodes can have identifiers • no type hierarchy
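The properties listed above can be seen on a small example. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the sample markup and the element and attribute names (`s`, `name`, `type`, `lang`) are illustrative assumptions, not taken from any of the annotation standards mentioned earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence; tag and attribute names are illustrative only.
SAMPLE = '<s id="s1" lang="en">The <name id="n1" type="org">DFKI</name> is in Saarbrucken.</s>'


def describe(node, depth=0):
    """Return one indented line per node: tag, attribute/value pairs, leading text."""
    lines = ["  " * depth + f"{node.tag} {dict(node.attrib)} text={node.text!r}"]
    for child in node:  # node content can itself contain XML trees
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
tree_lines = describe(root)

# Nodes can carry identifiers, so they can be looked up directly.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

print("\n".join(tree_lines))
```

Note that `name` has the type `org` only as a plain attribute value: nothing in XML itself says that `org` is a subtype of anything else, which is the "no type hierarchy" point.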
Language Technologies • Think of language technologies as processes that add annotations to documents, based on an analysis of the documents' linguistic content. • This point of view allows a uniform treatment of human-generated and LT-generated annotations.
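The uniform treatment of human-generated and LT-generated annotations can be sketched as follows. This is a toy illustration under stated assumptions: the `Annotation` and `Document` classes, the `source` field, and the capitalised-word tagger are all hypothetical and stand in for a real language technology component.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start: int   # character offset where the annotated span begins
    end: int     # character offset where the annotated span ends
    label: str
    source: str  # "human" or the name of an LT component (hypothetical field)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def capitalised_word_tagger(doc):
    """Toy LT component: annotate every capitalised word as a name candidate."""
    for m in re.finditer(r"\b[A-Z][a-z]+\b", doc.text):
        doc.annotations.append(
            Annotation(m.start(), m.end(), "name-candidate", "capitalised_word_tagger")
        )
    return doc


doc = Document("Gregor works in Wien.")
# A human-generated annotation and LT-generated ones live in the same list.
doc.annotations.append(Annotation(16, 20, "city", "human"))
capitalised_word_tagger(doc)

for a in doc.annotations:
    print(a.source, a.label, repr(doc.text[a.start:a.end]))
```

Because both kinds of annotation share one representation, downstream indexing and retrieval do not need to know where an annotation came from.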
Document-Level LT • Language Identification • Categorisation • Summarisation • All of these can also be applied to parts of documents.
Collection-Level LT • Clustering • Topic detection and tracking • Multi-document summarisation
Fine-Grained LT • Morphology • Part-of-speech Tagging • (shallow) parsing • coreference resolution • information extraction
LT and Document Annotation • [Diagram: (Annotated) Text Document → LT → Annotated Text Document]
Information Retrieval • Retrieval of information units in response to an information need • How is the information need stated (keywords, questions, examples)? • How is the information need represented? • How are information units represented? • How are the representations matched?
How are documents represented? • XML trees • index of word/phrase occurrences • index of relations (represented as feature structures) • The word, phrase and relation indexes should have pointers to text locations.
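An index of word occurrences with pointers to text locations can be sketched as a positional inverted index. This is a minimal sketch, assuming documents are plain strings keyed by a document identifier; `build_word_index` and the sample texts are hypothetical, and a phrase or relation index would follow the same pattern with different keys.

```python
import re
from collections import defaultdict


def build_word_index(docs):
    """Map each lower-cased word to a list of (doc_id, character offset) pointers."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for m in re.finditer(r"\w+", text):
            index[m.group().lower()].append((doc_id, m.start()))
    return index


docs = {
    "d1": "XML trees store documents",
    "d2": "Index the XML documents",
}
index = build_word_index(docs)

# Each pointer leads back to the exact location in the source text.
for doc_id, offset in index["xml"]:
    print(doc_id, offset, docs[doc_id][offset:offset + 3])
```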
How are queries represented? • Words / phrases • relations (expressed as feature structures)
How are representations matched? • Unification • Apparent mismatches between query and representation can be resolved by relaxation of the query. • Inference by forward or backward chaining, as required.
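Matching by unification, with relaxation of the query on failure, can be sketched as below. This is a simplified illustration, not the method used in the original system: feature structures are modelled as nested Python dicts without reentrancy or types, `None` signals a clash, and relaxation simply drops caller-chosen top-level features one at a time. The function names and the sample relation are hypothetical.

```python
def unify(a, b):
    """Unify two feature structures given as nested dicts; return None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # conflicting values for the same feature
                result[key] = merged
            else:
                result[key] = b_val
        return result
    return a if a == b else None  # atomic values must be identical


def match_with_relaxation(query, fact, droppable):
    """Try the full query first, then retry after dropping droppable features in order."""
    result = unify(query, fact)
    relaxed = dict(query)
    dropped = []
    for feature in droppable:
        if result is not None:
            break
        relaxed.pop(feature, None)
        dropped.append(feature)
        result = unify(relaxed, fact)
    return result, dropped


# Hypothetical indexed relation and a query whose year does not match it.
fact = {"pred": "acquire", "agent": {"name": "A"}, "object": {"name": "B"}, "year": 1999}
query = {"pred": "acquire", "agent": {"name": "A"}, "year": 2000}

print(unify(query, fact))
print(match_with_relaxation(query, fact, droppable=["year"]))
```

The strict query fails because of the `year` clash; after the `year` constraint is relaxed, the query unifies with the stored relation and the list of dropped features records what was given up.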
Research Issues • Relevance ranking for feature-structure-based queries • Efficient indexing and matching of feature structures (fast unification) • Representing information content (ontologies) in the formalism