Teaching Session #9. CIS 702 Communication/Information Technologies (CIT). Chapter 6 Documents: Language & Properties. Philip Robbins – March 7, 2013. Dr. Luz Quiroga, Ph.D. Communication & Information Sciences Ph.D. Program, University of Hawai'i at Mānoa.
Chapter Contents • Metadata • Document Formats • Markup Languages • Text Properties • Document Preprocessing • Organizing Documents • Text Compression Documents: Language & Properties
Document • Denotes a single unit of information • Structure and a Syntax • Semantics, specified by the author • Presentation style Introduction
Document Syntax • Expresses structure, presentation style, semantics • Implicit in its content • Expressed in a simple declarative language • Expressed in a programming language Text • Can be written in natural language (Hard to process) Introduction
Document Style • How a document is visualized or printed • Can be embedded in the document, e.g., RTF files • Can be complemented by macros Introduction
Queries • Short pieces of text • Differ from normal text • Semantics often ambiguous due to polysemy • User intent behind a query is not easy to infer Introduction
Metadata • Data about data • Information on the organization of the data, various data domains, and their relationship • Metadata is associated with most documents Metadata
Descriptive Metadata • External to the meaning of the document; pertains more to how it was created. • Author of the text • Date of publication • Source of the publication • Document length Metadata
Semantic Metadata • Characterizes the subject matter within the document contents • Associated with a wide number of documents • Availability is increasing Metadata
Metadata Format • Machine Readable Cataloging Record (MARC) • Format used for most library records • Includes fields for distinct attributes of a bibliographic entry such as: title, author, publication venue. Metadata
Metadata in Web Documents • Increase in web data has led to adding metadata information to web pages. • Cataloging and content rating • Intellectual property rights and digital signatures • Electronic Commerce Metadata
Resource Description Framework (RDF) • New standard for Web metadata • Allows describing Web resources to facilitate automated processing. • Does not assume any particular application or semantic domain. • Consists of a description of nodes and attached attribute/value pairs. Metadata
Text • Computers represent characters in binary through coding schemes: • ASCII (7 bits, commonly stored in 8) • EBCDIC (8 bits) • Unicode (originally 16 bits) • IR systems should be able to retrieve information from many text formats (doc, pdf, html, txt) • IR systems have filters to handle most document formats (this might not be possible with proprietary formats) Text
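As a rough illustration of how the coding scheme changes the bytes stored for the same characters, the sketch below encodes one string with three of Python's built-in codecs. Using `cp500` as a representative EBCDIC code page and `utf-16-be` as a 16-bit Unicode encoding is an assumption made for this example, and `encode_all` is a hypothetical helper name.

```python
# Encode the same text under three coding schemes and show the raw bytes.
# Assumption: cp500 stands in for EBCDIC and utf-16-be for 16-bit Unicode.

def encode_all(text):
    """Return a dict mapping codec name to the hex form of the encoded bytes."""
    codecs = ["ascii", "cp500", "utf-16-be"]
    return {name: text.encode(name).hex() for name in codecs}


if __name__ == "__main__":
    for name, hex_bytes in encode_all("IR").items():
        print(f"{name:10s} {hex_bytes}")
```

The same two letters map to different byte values under ASCII and EBCDIC, and take two bytes each under the 16-bit encoding, which is why an IR system needs to know the coding scheme before it can index a document.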
Text Formats • For document exchange: Rich Text Format (RTF) • For printing and displaying: Portable Document Format (PDF) • For printing and displaying: PostScript (PS) Text
Interchange Formats • For encoding email: Multipurpose Internet Mail Extensions (MIME) • For compressing text: ZIP Text
Multimedia • For applications that handle different types of data: • Text • Sounds • Images • Video • Different types of formats are necessary for storing each media Multimedia
Image Formats • Simplest image formats are direct representations of a bit-mapped display: XBM, BMP, PCX • These formats have lots of redundancy and can be compressed efficiently: GIF Images
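To see why bit-mapped images compress well, consider a minimal run-length encoding sketch. It is only an illustration of exploiting redundancy: GIF itself uses LZW coding, not the scheme below, and the function names are hypothetical.

```python
# Run-length encoding of one row of a 1-bit bitmap.
# Assumption: RLE is used only as a simple illustration; GIF itself uses LZW.

def rle_encode(row):
    """Collapse a sequence of pixels into [value, run_length] pairs."""
    runs = []
    for pixel in row:
        if runs and runs[-1][0] == pixel:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([pixel, 1])  # start a new run
    return runs


def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original pixels."""
    row = []
    for value, length in runs:
        row.extend([value] * length)
    return row


if __name__ == "__main__":
    row = [0] * 12 + [1] * 3 + [0] * 9
    runs = rle_encode(row)
    print(len(row), "pixels stored as", len(runs), "runs:", runs)
    assert rle_decode(runs) == row  # lossless: decoding restores the row
```

Long runs of identical pixels shrink to a few pairs, and decoding returns exactly the original row, which is what distinguishes this kind of compression from the lossy kind.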
Lossy Compression • Used to improve compression ratios. • Uncompressing a compressed image does not yield exactly the original image. • Joint Photographic Experts Group (JPEG) • Eliminates parts of the image that have less impact on the human eye. • Parametric format: the loss can be tuned. Images
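The idea of tunable loss can be sketched with plain quantization of pixel intensities. This is an assumption-laden toy: real JPEG quantizes frequency coefficients rather than raw pixel values, and `step` is a hypothetical stand-in for a quality parameter.

```python
# Lossy coding by quantization: keep only coarse intensity levels.
# Assumption: `step` plays the role of a tunable quality parameter; real JPEG
# quantizes frequency coefficients, not raw pixel values.

def quantize(pixels, step):
    """Map each intensity to the index of its quantization bin."""
    return [p // step for p in pixels]


def dequantize(codes, step):
    """Reconstruct each value at the middle of its quantization bin."""
    return [c * step + step // 2 for c in codes]


def max_error(pixels, step):
    """Largest absolute difference between original and restored pixels."""
    restored = dequantize(quantize(pixels, step), step)
    return max(abs(a - b) for a, b in zip(pixels, restored))


if __name__ == "__main__":
    pixels = list(range(256))
    for step in (1, 4, 16):
        levels = len(set(quantize(pixels, step)))
        print(f"step={step:2d} levels={levels:3d} max_error={max_error(pixels, step)}")
```

A larger `step` leaves fewer distinct levels to store but a larger reconstruction error, so the restored image is no longer identical to the original.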
Interchange Formats for Images • Tagged Image File Format (TIFF) • Provides for metadata, compression, and varying number of colors. • De facto standard for images on the Web: • Portable Network Graphics (PNG) Images
Audio Formats • Audio must be digitized before it can be stored • MIDI is the standard format to interchange music between electronic instruments and computers. • AU, WAVE Audio
Movie Formats • Works by coding changes in consecutive frames • Takes advantage of temporal image redundancy • Includes audio signal associated with the video • Audio: MP3, Video: MP4 • AVI, FLI, QuickTime Movies
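Coding only the changes between consecutive frames can be sketched as follows. This is a simplified illustration under the assumption that frames are equal-length lists of pixel values; actual video codecs use motion compensation and transform coding, and the function names here are hypothetical.

```python
# Temporal redundancy: store the first frame, then only the pixels that change.
# Assumption: frames are equal-length lists of pixel values.

def encode_frames(frames):
    """Return (first_frame, deltas); each delta maps pixel index -> new value."""
    first = list(frames[0])
    deltas = []
    previous = first
    for frame in frames[1:]:
        changed = {i: v for i, (u, v) in enumerate(zip(previous, frame)) if u != v}
        deltas.append(changed)
        previous = frame
    return first, deltas


def decode_frames(first, deltas):
    """Rebuild every frame by applying each delta to the previous frame."""
    frames = [list(first)]
    for delta in deltas:
        frame = list(frames[-1])
        for index, value in delta.items():
            frame[index] = value
        frames.append(frame)
    return frames


if __name__ == "__main__":
    frames = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 1, 1, 0]]
    first, deltas = encode_frames(frames)
    print(first, deltas)
    assert decode_frames(first, deltas) == frames
```

When little changes from one frame to the next, each delta is far smaller than a full frame.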
Format for 3-D Graphics • Computer Graphics Metafile (CGM) • Virtual Reality Modeling Language (VRML) • VRML is the universal interchange format for 3-D graphics and multimedia. Graphics
Markup Languages • Defined as extra syntax used to describe formatting actions, structure information, text semantics, attributes • XML: eXtensible Markup Language • HTML: Hyper Text Markup Language • SGML: Standard Generalized Markup Language Markup
Standard Generalized Markup Language (SGML) • ISO 8879 • Meta-language for tagging text • Provides rules for defining a markup language based on tags • Includes a description of the document structure: “document type definition” • An SGML document is defined by a document type definition together with the text itself, marked with tags describing the structure Markup
SGML Document Type Definition • Describes the pieces that a document is composed of • Defines how those pieces relate to each other • Part of the definition can be specified by an SGML Document Type Declaration (DTD) • Other parts (e.g., semantics of elements & attributes) cannot be expressed formally in SGML Markup
SGML • Tags are denoted by angle brackets < > • Used to identify the beginning and ending of an element • Ending tags include a slash before the tag name • Attributes are specified inside the beginning tag Markup
SGML • Document description does not specify how a document is printed • Output specifications are added to SGML documents: • DSSSL: Document Style Semantics and Specification Language • FOSI: Formatting Output Specification Instance • These standards define mechanisms for associating style information with SGML document instances • They allow specifying that data identified by a tag should be typeset in a particular font Markup
HyperText Markup Language (HTML) • Instance of SGML • Created in 1992 • Latest Version is 4.0 (HTML5 under development) • Includes support for style sheets, frames, tables, forms, etc. • Backwards compatible • Most documents on the Web are stored and transmitted in HTML • HTML tags follow all SGML conventions and include formatting directives. Markup
HyperText Markup Language (HTML) • Can have media embedded within, such as images or audio • Has fields for metadata • Adding programs (e.g., JavaScript) inside a web page makes it dynamic (hence dynamic HTML). Markup
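The metadata fields of an HTML page can be read with a few lines of code. The sketch below uses Python's standard `html.parser` module; the sample page, the `MetaCollector` class, and the `extract_meta` helper are hypothetical and only meant to show where such fields live.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        if tag == "meta":
            fields = dict(attrs)
            if "name" in fields and "content" in fields:
                self.meta[fields["name"]] = fields["content"]


def extract_meta(html_text):
    """Return a dict of the named metadata fields found in the page."""
    parser = MetaCollector()
    parser.feed(html_text)
    parser.close()
    return parser.meta


# Hypothetical sample page, invented for this example.
PAGE = """<html><head>
<title>Chapter 6</title>
<meta name="author" content="A. Author">
<meta name="keywords" content="documents, metadata, markup">
</head><body><p>Text</p></body></html>"""

if __name__ == "__main__":
    print(extract_meta(PAGE))
```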
Cascading Style Sheets (CSS) • Because HTML does not fix a presentation style, CSS was introduced. • 1997 • Way for authors to improve the aesthetics of HTML pages • Information about presentation is separate from document content • Support for CSS in current browsers is still modest Markup
eXtensible Markup Language (XML) • Is a simplified subset of SGML • Not a markup language (like HTML) but a meta-language (like SGML) • Allows human-readable semantic markup, which is also machine-readable • Does not have the restrictions of HTML • Allows any user to define new tags • More rigid syntax: • Ending tags cannot be omitted • Distinguishes upper and lower case • Attribute values must be in quotes Markup
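The stricter XML syntax rules can be checked with any conforming parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample documents and the `is_well_formed` helper are hypothetical, invented for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples, one per rule listed on the slide.
SAMPLES = {
    "valid": '<note lang="en"><to>Reader</to></note>',
    "missing end tag": '<note lang="en"><to>Reader</note>',
    "case mismatch": '<note lang="en"><to>Reader</TO></note>',
    "unquoted attribute": '<note lang=en><to>Reader</to></note>',
}

if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(f"{label:20s} well-formed: {is_well_formed(text)}")
```

Only the first sample parses; each of the others violates one of the three rules and is rejected outright, whereas an HTML browser would typically tolerate the same mistakes.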
eXtensible Stylesheet Language (XSL) • The XML counterpart of Cascading Style Sheets (CSS) • Syntax based on XML • Designed to transform and style highly-structured, data-rich documents written in XML • e.g., with XSL it would be possible to automatically extract a table of contents from a document Markup
Hypermedia/Time-based Structuring Language • SGML architecture that specifies the generic hypermedia structure of documents • Includes complex locating of document objects • Includes relationships (hyperlinks) between document objects • Includes numeric, measured associations between document objects • Does not specify graphical interfaces, user navigation or user interaction. Markup
Information Theory • It is difficult to formally capture how much information there is in a given text • However, distribution of symbols is related to it • A text where one symbol appears almost all the time does not convey much information • Information Theory defines a special concept, entropy, to capture information content Theory
Entropy Theory
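In its usual Shannon form, the entropy of a text is the sum over symbols of p_i × log2(1/p_i), where p_i is the relative frequency of symbol i. The sketch below computes that quantity from the observed symbol counts; it follows the standard definition rather than any particular notation from the slides, and the function name is hypothetical.

```python
import math
from collections import Counter


def entropy(text):
    """Shannon entropy of the symbol distribution in `text`, in bits per symbol."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Sum p * log2(1/p) over the observed symbols.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    for sample in ("aaaaaaaa", "aaaaaaab", "abababab", "abcdefgh"):
        print(f"{sample}: {entropy(sample):.3f} bits/symbol")
```

A text made of a single repeated symbol has zero entropy, matching the observation that such a text conveys little information, while a text with evenly spread symbols has the highest entropy for its alphabet size.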
Modeling Natural Language • We can divide the symbols of a text into two disjoint subsets: • Symbols that separate words; • Symbols that belong to words; • Symbols are not uniformly distributed in a text • e.g., in English the vowels are usually more frequent than most consonants. Theory
Modeling Natural Language • A simple model to generate text is the binomial model • In natural language, however, the probability of a symbol depends on the previous symbols. • e.g., f cannot appear after a letter c • A finite-context or Markovian model can be used to reflect this dependency. • Second issue: how the different words are distributed inside each document. Theory
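A first-order Markovian model of the kind mentioned above can be sketched by counting which symbol follows which. This is a minimal illustration with a context of one symbol; the function names and the training string are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each symbol, how often each symbol follows it."""
    model = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        model[current][following] += 1
    return model


def generate(model, start, length, seed=0):
    """Generate text where each symbol is drawn given the previous one."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        followers = model.get(out[-1])
        if not followers:
            break  # the last symbol was never seen with a successor
        symbols = list(followers)
        weights = [followers[s] for s in symbols]
        out.append(rng.choices(symbols, weights=weights)[0])
    return "".join(out)


if __name__ == "__main__":
    model = train_bigram_model("the theory of the thing")
    print(dict(model["t"]))  # which symbols follow "t", and how often
    print(generate(model, "t", 20))
```

A symbol pair that never occurs in the training text gets zero probability, so the model will never generate it, which is how this kind of model captures the dependency on the previous symbol.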
Zipf’s Law Theory
Modeling Natural Language • Words arranged in decreasing order of their frequencies • Distribution of words is very skewed • Words that are too frequent (“stopwords”) can be disregarded. • A stopword is a word that does not carry meaning on its own in natural language • e.g., stopwords in English: a, the, by, and • Therefore, about half of the word occurrences in a text do not need to be considered Theory
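Zipf's law, in its usual statement, says that the i-th most frequent word occurs roughly 1/i^θ times as often as the most frequent one. The sketch below ranks words by frequency and measures the share of occurrences taken by stopwords. The tokenizer, the tiny stopword set (just the examples listed above), and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter

# Only the examples given on the slide; real stopword lists are much longer.
STOPWORDS = {"a", "the", "by", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def ranked_frequencies(text):
    """Return (word, count) pairs in decreasing order of frequency."""
    return Counter(tokenize(text)).most_common()


def stopword_share(text, stopwords=STOPWORDS):
    """Fraction of word occurrences that are stopwords."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(w in stopwords for w in words) / len(words)


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    for rank, (word, count) in enumerate(ranked_frequencies(sample), start=1):
        print(rank, word, count)
    print("stopword share:", stopword_share(sample))
```

Even in this toy sentence the top-ranked words are stopwords and they account for more than half of all occurrences, which is the skew the ranking is meant to expose.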
Modeling Natural Language • Third Issue: Distribution of words in the documents of a collection. • Simple Model: Consider that each word appears the same number of times in every document (Not True) • Better Model: Use a binomial distribution Theory
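Under a binomial model, a word that appears with probability p at each of the n word positions of a document occurs exactly k times with probability C(n, k) p^k (1 - p)^(n - k). The sketch below evaluates that formula; the values of n and p are illustrative assumptions, not measurements from any collection.

```python
import math


def binomial_pmf(k, n, p):
    """Probability that a word with per-position probability p
    occurs exactly k times among n word positions."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


if __name__ == "__main__":
    n, p = 1000, 0.002  # hypothetical document length and word probability
    for k in range(6):
        print(f"P(word occurs {k} times) = {binomial_pmf(k, n, p):.4f}")
```

Unlike the simple model, this assigns a different probability to each possible count, so documents in which the word is absent or unusually frequent are both accounted for.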
Heaps’ Law • Fourth Issue: Number of distinct words in a document (document vocabulary) • To predict the growth of vocabulary size in natural language text: Theory
Modeling Natural Language • Vocabulary size grows sub-linearly with text size Theory
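Heaps' law is commonly written V = K · n^β with 0 < β < 1, where n is the text size in words and V the vocabulary size. The sketch below measures vocabulary growth on a word sequence and evaluates the formula. The parameter names `k` and `beta` and the values used in the usage example are illustrative assumptions, not fitted to any collection.

```python
def vocabulary_growth(words):
    """Return the number of distinct words seen after each word of the text."""
    seen = set()
    sizes = []
    for word in words:
        seen.add(word)
        sizes.append(len(seen))
    return sizes


def heaps_estimate(n, k, beta):
    """Vocabulary size predicted for a text of n words, V = k * n**beta."""
    return k * n**beta


if __name__ == "__main__":
    words = "a b a c b d a b c e".split()
    print(vocabulary_growth(words))
    # Illustrative parameters only: doubling the text size less than
    # doubles the predicted vocabulary because beta < 1.
    for n in (10_000, 20_000, 40_000):
        print(n, round(heaps_estimate(n, k=10, beta=0.5)))
```

Because β is below 1, each doubling of the text adds proportionally fewer new words, which is the sub-linear growth noted above.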