CIS 702 Communication/Information Technologies (CIT)

Teaching Session #9. CIS 702 Communication/Information Technologies (CIT). Chapter 6 Documents: Language & Properties. Philip Robbins – March 7, 2013. Dr. Luz Quiroga, Ph.D. Communication & Information Sciences Ph.D. Program, University of Hawai'i at Mānoa.

Presentation Transcript


  1. Teaching Session #9 CIS 702 Communication/Information Technologies (CIT) Chapter 6 Documents: Language & Properties Philip Robbins – March 7, 2013 Dr. Luz Quiroga, Ph.D. Communication & Information Sciences Ph.D. Program, University of Hawai'i at Mānoa

  2. Chapter Contents • Metadata • Document Formats • Markup Languages • Text Properties • Document Preprocessing • Organizing Documents • Text Compression Documents: Language & Properties

  3. Document • Denotes a single unit of information • Structure and a Syntax • Semantics, specified by the author • Presentation style Introduction

  4. Introduction

  5. Document Syntax • Expresses structure, presentation style, semantics • Implicit in its content • Expressed in a simple declarative language • Expressed in a programming language Text • Can be written in natural language (Hard to process) Introduction

  6. Document Style • How a document is visualized or printed • Can be embedded in the document, e.g., RTF files • Can be complemented by macros Introduction

  7. Queries • Short pieces of text • Differ from normal text • Semantics often ambiguous due to polysemy • User intent behind a query is not easy to infer Introduction

  8. Metadata • Data about data • Information on the organization of the data, various data domains, and their relationship • Metadata is associated with most documents Metadata

  9. Descriptive Metadata • External to the meaning of the document; pertains more to how it was created. • Author of the text • Date of publication • Source of the publication • Document length Metadata

  10. Semantic Metadata • Characterizes the subject matter within the document contents • Associated with a wide number of documents • Availability is increasing Metadata

  11. Metadata Format • Machine Readable Cataloging Record (MARC) • Format used for most library records • Includes fields for distinct attributes of a bibliographic entry such as: title, author, publication venue. Metadata

  12. Metadata in Web Documents • Increase in web data has led to adding metadata information to web pages. • Cataloging and content rating • Intellectual property rights and digital signatures • Electronic Commerce Metadata

  13. Resource Description Framework (RDF) • New standard for Web metadata • Allows describing Web resources to facilitate automated processing. • Does not assume any particular application or semantic domain. • Consists of a description of nodes and attached attribute/value pairs. Metadata
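The "nodes and attached attribute/value pairs" idea on the RDF slide can be sketched with plain Python tuples. This is only an illustration of the data model, not RDF syntax or any RDF library; the URL, attribute names, and values are hypothetical.

```python
from typing import NamedTuple

class Statement(NamedTuple):
    resource: str   # the node being described (hypothetical URL below)
    attribute: str  # name of the property
    value: str      # value of the property

# Made-up example data, for illustration only.
statements = [
    Statement("http://example.org/doc1", "creator", "A. Author"),
    Statement("http://example.org/doc1", "language", "en"),
    Statement("http://example.org/doc2", "creator", "B. Writer"),
]

def describe(resource: str, stmts: list) -> dict:
    """Collect the attribute/value pairs attached to one node."""
    return {s.attribute: s.value for s in stmts if s.resource == resource}
```

Calling `describe("http://example.org/doc1", statements)` gathers every pair attached to that node, which is the kind of automated processing the slide refers to.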

  14. Text • Computers represent characters in binary, which is done through coding schemes: • ASCII (7 bits) • EBCDIC (8 bits) • UNICODE (originally 16 bits) • IR systems should be able to retrieve information from many text formats (doc, pdf, html, txt) • IR systems have filters to handle most documents (might not be possible with proprietary formats) Text
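A small sketch of how the choice of coding scheme changes the stored size of the same text. It uses Python's built-in codecs; the function name `encoded_sizes` is an assumption made for this example, and UTF-8 is included only for comparison since the slide does not mention it.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` under several encodings.

    An encoding that cannot represent the text maps to None.
    """
    sizes = {}
    for name in ("ascii", "utf-8", "utf-16-be"):
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None  # e.g. ASCII has no code for "ā"
    return sizes

plain = encoded_sizes("Manoa")     # only ASCII letters
accented = encoded_sizes("Mānoa")  # "ā" lies outside ASCII
```

For the accented spelling, ASCII fails outright, which is one reason an IR system needs to know which coding scheme a document uses before indexing it.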

  15. Text Formats • For document exchange: Rich Text Format (RTF) • For printing and displaying: Portable Document Format (PDF) • For printing and displaying: PostScript (PS) Text

  16. Interchange Formats • For encoding email: Multipurpose Internet Mail Extensions (MIME) • For compressing text: ZIP Text

  17. Multimedia • For applications that handle different types of data: • Text • Sounds • Images • Video • Different types of formats are necessary for storing each media Multimedia

  18. Image Formats • Simplest image formats are direct representations of a bit-mapped display: XBM, BMP, PCX • These formats have lots of redundancy and can be compressed efficiently: GIF Images
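To illustrate why a raw bitmap "can be compressed efficiently", here is a sketch that compresses a highly redundant bitmap losslessly. It uses `zlib` (DEFLATE) from the Python standard library as a stand-in; GIF itself uses a different algorithm (LZW), so this shows the principle rather than the GIF format. The bitmap dimensions are hypothetical.

```python
import zlib

def compression_ratio(raw: bytes) -> float:
    """Original size divided by zlib-compressed size."""
    return len(raw) / len(zlib.compress(raw))

# Hypothetical 64x64 one-bit-per-pixel bitmap: 8 bytes per row, all white
# except for one black row in the middle.
row_bytes, height = 8, 64
rows = [bytes(row_bytes)] * height
rows[32] = b"\xff" * row_bytes
bitmap = b"".join(rows)

ratio = compression_ratio(bitmap)
restored = zlib.decompress(zlib.compress(bitmap))  # lossless round trip
```

Because the compression is lossless, `restored` is byte-for-byte identical to `bitmap`, in contrast to the lossy schemes on the next slide.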

  19. Lossy Compression • To improve compression ratios. • Uncompressing a compressed image does not yield exactly the original image. • Joint Photographic Experts Group (JPEG) • Eliminates parts of the image that have less impact on the human eye. • Parametric format – loss can be tuned. Images

  20. Interchange Formats for Images • Tagged Image File Format (TIFF) • Provides for metadata, compression, and varying number of colors. • De facto standard for images on the Web: • Portable Network Graphics (PNG) Images

  21. Audio Formats • Audio is digitized • MIDI is the standard format to interchange music between electronic instruments and computers. • AU, WAVE Audio

  22. Movie Formats • Works by coding changes in consecutive frames • Takes advantage of temporal image redundancy • Includes audio signal associated with the video • Audio: MP3, Video: MP4 • AVI, FLI, QuickTime Movies

  23. Format for 3-D Graphics • Computer Graphics Metafile (CGM) • Virtual Reality Modeling Language (VRML) • VRML is the universal interchange format for 3-D graphics and multimedia. Graphics

  24. Markup Languages • Defined as extra syntax used to describe formatting actions, structure information, text semantics, attributes • XML: eXtensible Markup Language • HTML: Hyper Text Markup Language • SGML: Standard Generalized Markup Language Markup

  25. Standard Generalized Markup Language (SGML) • ISO 8879 • Meta-language for tagging text • Provides rules for defining a markup language based on tags • Includes a description of the document structure: “document type definition” • SGML document defined by: document type definition with the text itself marked with tags describing the structure Markup

  26. SGML Document Type Definition • Describes the pieces that a document is composed of • Defines how those pieces relate to each other • Part of the definition can be specified by an SGML Document Type Declaration (DTD) • Other parts (e.g., semantics of elements & attributes) cannot be expressed formally in SGML Markup

  27. SGML Document Type Definition Markup

  28. SGML Document Type Definition Markup

  29. SGML • Tags are denoted by angle brackets < > • Used to identify the beginning and ending of an element • Ending tags include a slash before the tag name • Attributes are specified inside the beginning tag Markup

  30. SGML • Document description does not specify how a document is printed • Output specifications are added to SGML documents: • DSSSL: Document Style Semantics and Specification Language • FOSI: Formatted Output Specification Instance • These standards define mechanisms for associating style information with SGML document instances • They allow specifying, for example, that data identified by a tag should be typeset in a particular font Markup

  31. HyperText Markup Language (HTML) • Instance of SGML • Created in 1992 • Latest Version is 4.0 (HTML5 under development) • Includes support for style sheets, frames, tables, forms, etc. • Backwards compatible • Most documents on the Web are stored and transmitted in HTML • HTML tags follow all SGML conventions and include formatting directives. Markup

  32. HyperText Markup Language (HTML) • Can have media embedded within, such as images or audio • Has fields for metadata • Adding programs (e.g., JavaScript) inside a webpage makes it dynamic (hence dynamic HTML). Markup

  33. HyperText Markup Language (HTML) Markup

  34. HyperText Markup Language (HTML) Markup

  35. Cascading Style Sheets (CSS) • Because HTML does not fix a presentation style, CSS was introduced. • 1997 • Way for authors to improve the aesthetics of HTML pages • Information about presentation is separate from document content • Support for CSS in current browsers is still modest Markup

  36. eXtensible Markup Language (XML) • Is a simplified subset of SGML • Not a markup language (like HTML) but a meta-language (like SGML) • Allows human-readable semantic markup, which is also machine-readable • Does not have the restrictions of HTML • Allows any user to define new tags • More rigid syntax: • Ending tags cannot be omitted • Distinguishes upper and lower case • Attribute values must be in quotes Markup
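The three syntax rules on the XML slide can be checked with a standard XML parser. The sketch below uses `xml.etree.ElementTree` from the Python standard library; the helper name `is_well_formed` and the sample documents (with user-defined tags `note` and `to`) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = '<note lang="en"><to>Reader</to></note>'
missing_end_tag = '<note lang="en"><to>Reader</note>'    # </to> omitted
case_mismatch = '<note><To>Reader</to></note>'           # <To> vs </to>
unquoted_attribute = '<note lang=en><to>Reader</to></note>'
```

Only `good` parses; each of the other three strings violates one of the rules listed on the slide and is rejected by the parser.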

  37. eXtensible Style Sheet Language (XSL) • The XML counterpart of Cascading Style Sheets (CSS) • Syntax based on XML • Designed to transform and style highly-structured, data-rich documents written in XML • e.g., with XSL it would be possible to automatically extract a table of contents from a document Markup

  38. Hypermedia/Time-based Structuring Language • SGML architecture that specifies the generic hypermedia structure of documents • Includes complex locating of document objects • Includes relationships (hyperlinks) between document objects • Includes numeric, measured associations between document objects • Does not specify graphical interfaces, user navigation or user interaction. Markup

  39. Information Theory • It is difficult to formally capture how much information there is in a given text • However, distribution of symbols is related to it • A text where one symbol appears almost all the time does not convey much information • Information Theory defines a special concept, entropy, to capture information content Theory

  40. Entropy Theory

  41. Entropy Theory
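The formula on the Entropy slides did not survive in this transcript. As a sketch, the code below assumes the standard Shannon definition over the symbols of a text, E = -Σ p_i log2(p_i), where p_i is the relative frequency of symbol i; the function name `entropy` is chosen for this example.

```python
import math
from collections import Counter

def entropy(text: str) -> float:
    """Shannon entropy of the symbol distribution in `text`, in bits per symbol."""
    counts = Counter(text)
    total = len(text)
    # p * log2(1/p) equals -p * log2(p) and avoids returning -0.0.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

one_symbol = entropy("aaaa")   # a single repeated symbol
two_symbols = entropy("abab")  # two equally likely symbols
four_symbols = entropy("abcd") # four equally likely symbols
```

The first case matches the earlier slide's remark: a text where one symbol appears all the time conveys no information, so its entropy is zero.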

  42. Modeling Natural Language • We can divide the symbols of a text into two disjoint subsets: • Symbols that separate words; • Symbols that belong to words; • Symbols are not uniformly distributed in a text • e.g., in English the vowels are usually more frequent than most consonants. Theory

  43. Modeling Natural Language • A simple model to generate text is the binomial model • However, the probability of a symbol depends on the previous symbols • e.g., in English, f does not normally appear after the letter c • A finite-context or Markovian model can be used to reflect this dependency. • Second issue: how the different words are distributed inside each document. Theory
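A minimal sketch of the finite-context idea, assuming a context of exactly one previous character (an order-1 Markov model). The function name `bigram_model` and the toy input are illustrative; a realistic model would be trained on a large corpus and usually uses longer contexts.

```python
from collections import Counter, defaultdict

def bigram_model(text: str) -> dict:
    """Estimate P(next symbol | previous symbol) from adjacent character pairs."""
    following = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        following[prev][cur] += 1
    return {
        prev: {sym: n / sum(counts.values()) for sym, n in counts.items()}
        for prev, counts in following.items()
    }

model = bigram_model("aab")
```

In this toy model, a symbol pair that never occurs in the training text (such as "c" followed by "f" in typical English) simply has no entry, which is how the dependency on the previous symbol shows up.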

  44. Zipf’s Law Theory

  45. Theory

  46. Modeling Natural Language • Words arranged in decreasing order of their frequencies Theory

  47. Modeling Natural Language • Words arranged in decreasing order of their frequencies • Distribution of words is very skewed • Words that are too frequent (“stopwords”) can be disregarded. • A stopword is a word that carries little meaning on its own in natural language • e.g., stopwords in English: a, the, by, and • Therefore, about half of the words appearing in a text do not need to be considered Theory
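The ranking and stopword steps above can be sketched as follows. The Zipf's Law slide lost its formula in this transcript, so `zipf_estimate` assumes the usual statement of the law: the word of rank i occurs about 1/i^θ times as often as the most frequent word, with θ an empirical exponent. The stopword set is the one listed on the slide; the sample sentence and function names are made up.

```python
import re
from collections import Counter

STOPWORDS = {"a", "the", "by", "and"}  # the examples given on the slide

def tokenize(text: str) -> list:
    return re.findall(r"[a-z']+", text.lower())

def rank_frequency(text: str) -> list:
    """Words paired with their counts, in decreasing order of frequency."""
    return Counter(tokenize(text)).most_common()

def content_words(text: str) -> list:
    """Tokens that remain after stopwords are disregarded."""
    return [w for w in tokenize(text) if w not in STOPWORDS]

def zipf_estimate(top_count: float, rank: int, theta: float = 1.0) -> float:
    """Frequency predicted for a given rank under the assumed Zipf form."""
    return top_count / rank ** theta

sample = "The cat and the dog and the bird"
ranked = rank_frequency(sample)
kept = content_words(sample)
```

In this tiny sample, five of the eight tokens are stopwords, which mirrors the skew the slide describes.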

  48. Modeling Natural Language • Third Issue: Distribution of words in the documents of a collection. • Simple Model: Consider that each word appears the same number of times in every document (Not True) • Better Model: Use a binomial distribution Theory
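As a sketch of the "better model", the code below assumes each of the n word positions in a document is independently the given word with probability p, so the number of occurrences k follows a binomial distribution: P(k) = C(n, k) p^k (1 - p)^(n - k). The function name and the parameter values are illustrative.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability that a word occurs exactly k times in a document of n words."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical values: a 10-word document and a word with probability 0.5.
never = binomial_pmf(0, 10, 0.5)
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
```

Unlike the simple model, this assigns a different probability to each possible count, including zero occurrences.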

  49. Heaps’ Law • Fourth Issue: Number of distinct words in a document (document vocabulary) • To predict the growth of vocabulary size in natural language text: Theory

  50. Modeling Natural Language • Vocabulary size grows sub-linearly with text size Theory
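The sub-linear growth can be sketched with the usual form of Heaps' law, V = K · n^β with 0 < β < 1, where n is the text size in words and V the vocabulary size. The slide's own formula is missing from this transcript, so this form and the constants used below are assumptions for illustration.

```python
def heaps_estimate(n: int, k: float, beta: float) -> float:
    """Vocabulary size predicted for a text of n words (assumed form V = K * n**beta)."""
    return k * n ** beta

def vocabulary_growth(words: list) -> list:
    """Measured number of distinct words after each word of the text."""
    seen = set()
    growth = []
    for w in words:
        seen.add(w)
        growth.append(len(seen))
    return growth

predicted = heaps_estimate(10_000, k=10, beta=0.5)   # hypothetical K and beta
measured = vocabulary_growth(["a", "b", "a", "c"])
```

Comparing `vocabulary_growth` on a real text against `heaps_estimate` is one way to fit K and β for a collection.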