
Content Types: Text and Metadata



Presentation Transcript


  1. Content Types: Text and Metadata

  2. Introduction • Text documents come in many forms • Article (news, conference, journal, etc.) • Email, memo, … • Book, manual, manuscript, transcript, … • Any part of one of the above • Syntax can express • Structure • Presentation style • Semantics (e.g. software code)

  3. Metadata • Metadata – data about data • Descriptive metadata • External to the meaning of the document • Author, publication date, document source, document length, document genre, file type, bits per second, frame rate, etc. • Semantic metadata • Characterizes the semantic content of the document • Library of Congress (LoC) subject headings, keywords, subject headings from ontologies (e.g. MeSH), etc.

  4. Metadata Formats • Machine Readable Cataloging Record (MARC) • Used by most libraries • Fields include title, author, etc. • Resource Description Framework (RDF) • Used for Web resources • Node and attribute / value pairs • Node ID is any Uniform Resource Identifier (URI), which could be a URL

  5. Metadata Sets • Dublin Core Metadata Elements • Contributor – entities contributing to the content • Coverage – extent or scope of content (spatial area, temporal period, …) • Creator – entity primarily responsible for making the content • Date – date associated with event (e.g. publication) for resource • Description – abstract, table of contents, … • Format – media (file) type, dimensions (size, duration), hardware needed • Identifier – unique identifier • Language – language of content • Publisher – entity responsible for making resource available • Relation – reference to related resource(s) • Rights – information about rights held in/over resource • Source – resource from which content is derived • Subject – keywords, key phrases, classification code, etc. • Title – name of the resource • Type – nature or genre of content

  6. Text Formats • Coding schemes • EBCDIC (8 bit, one of the first coding schemes) • ASCII (initially 7 bit, extended to 8 bit) • Unicode (originally 16 bit, for large alphabets) • Additional Formats • RTF (format-oriented document exchange) • PDF and PostScript (display-oriented representation) • Multipurpose Internet Mail Extensions (MIME) (multiple character sets, languages, media)
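
As a small illustration of how the coding schemes on this slide differ, the sketch below encodes the same text under three codecs. It assumes Python's built-in `cp500` codec as a representative EBCDIC code page and `utf-16-be` as a 16-bit Unicode encoding; the helper name `encode_all` is made up for this example.

```python
def encode_all(text):
    """Encode the same text under three coding schemes and return the raw bytes."""
    return {
        "ascii": text.encode("ascii"),       # 7-bit ASCII, stored one byte per character
        "ebcdic": text.encode("cp500"),      # cp500 is one EBCDIC code page (assumption)
        "utf-16": text.encode("utf-16-be"),  # two bytes per character for basic characters
    }

encoded = encode_all("A")
for scheme, raw in encoded.items():
    print(scheme, raw.hex())
```

The same letter maps to different byte values and lengths in each scheme, which is why a document's encoding has to be known (or declared, as MIME does) before its bytes can be read as text.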

  7. Information Theory • How can we predict the information value of components of a document? • Entropy – attempts to model information content (information uncertainty) • $E = -\sum_{i} p_i \log_2 p_i$, where the sum runs over all symbols $i$ in the alphabet • $p_i$ is the probability of symbol $i$ (its frequency divided by the total number of symbols in the text) • Need a text model for real language • Also important for compression, as E acts as a limit on how much a text can be compressed.
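
The entropy formula on this slide can be sketched in a few lines. This is a minimal example that estimates each symbol's probability from its relative frequency in the given text, which is the simplest possible text model rather than a model of real language; the function name `entropy` is chosen for illustration.

```python
import math
from collections import Counter

def entropy(text):
    """Entropy in bits per symbol, with probabilities estimated from symbol counts."""
    counts = Counter(text)
    total = len(text)
    # E = -sum(p_i * log2(p_i)) over the symbols that actually occur in the text
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy("abab"))  # two equally likely symbols
print(entropy("abcd"))  # four equally likely symbols
```

A text that repeats one symbol has entropy 0, while two equally likely symbols give 1 bit per symbol and four give 2 bits, matching the intuition that more uncertainty means more information per symbol.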

  8. Modeling Character Strings • Symbols in natural language (NL) are not evenly distributed • Some symbols are not part of words (often used for syntax) • Symbols in words are not evenly distributed • Models • Binomial model uses the distribution of symbols in the language • But previous symbols influence the probabilities of later symbols • (what letter will appear after a q?) • Finite-context or Markovian models are used for this dependency • k-order, where k is the number of previous characters taken into account by the model • Thus, the binomial model is a 0-order model
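
A k-order finite-context model of the kind described on this slide can be sketched by counting which character follows each k-character context. The function names and the sample sentence below are assumptions made for this example; with k = 0 the context is empty and the model reduces to plain symbol frequencies.

```python
from collections import Counter, defaultdict

def train_markov(text, k):
    """Count, for every k-character context, which character follows it."""
    model = defaultdict(Counter)
    for i in range(k, len(text)):
        context = text[i - k:i]  # empty string when k == 0
        model[context][text[i]] += 1
    return model

def next_char_probability(model, context, char):
    """Estimated probability that `char` follows `context` (0.0 if context unseen)."""
    counts = model.get(context)
    if not counts:
        return 0.0
    return counts[char] / sum(counts.values())

sample = "the quick queen quit quietly"
order0 = train_markov(sample, 0)
order1 = train_markov(sample, 1)
print(next_char_probability(order0, "", "u"))   # how common "u" is overall
print(next_char_probability(order1, "q", "u"))  # how likely "u" is right after "q"
```

In the sample sentence every "q" is followed by "u", so the 1-order estimate is far higher than the 0-order one, which is the dependency the slide's question about q points at.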

  9. Word Distribution in Documents • How frequent are words within documents? • Zipf’s Law • Frequency of the $i$-th most frequent word is $1/i^{\theta}$ times the frequency of the most frequent word • The value of $\theta$ depends on the text (a value of 1 gives a logarithmic distribution) • $\theta$ values of 1.5 to 2.0 best model real texts • In practice, a few hundred words make up 50% of most texts • Frequent words provide less information • Thus, many search strategies involve ignoring stopwords (a, an, the, is, of, by, …)
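
To make the Zipf relation concrete, the sketch below ranks words by frequency and computes the frequency that the law would predict for a given rank. The function names, the whitespace tokenization, and the toy sentence are assumptions for illustration; a real test of the law needs a much larger text.

```python
from collections import Counter

def rank_frequencies(text):
    """Return (word, count) pairs ordered from most to least frequent."""
    return Counter(text.lower().split()).most_common()

def zipf_predicted(rank, top_frequency, theta):
    """Frequency predicted for the word at `rank`: top_frequency / rank**theta."""
    return top_frequency / rank ** theta

ranked = rank_frequencies("the cat and the dog and the bird")
top_word, top_count = ranked[0]
for rank, (word, count) in enumerate(ranked, start=1):
    print(rank, word, count, round(zipf_predicted(rank, top_count, 1.5), 2))
```

Comparing the observed counts with the predicted column for different values of theta is one simple way to see which exponent fits a given text best.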

  10. Word Distribution in Collections • Simplest to assume uniform distribution of words in documents • But not true • Better models built on negative binomial distributions or Poisson distributions

  11. Vocabulary Size for Documents and Collections • Heaps’ Law • Vocabulary size (V) grows with the number of words (n) • $V = K n^{\beta}$ • Experimentally, • $K$ is between 10 and 100 • $\beta$ is between 0.4 and 0.6 • So vocabulary grows roughly in proportion to the square root of the size of the document or collection in words • Works best for large documents & collections
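
The relation between text size and vocabulary size can be sketched as follows. The values K = 50 and beta = 0.5 used in the example are picked from inside the ranges given on the slide purely for illustration, and the function names are hypothetical.

```python
def heaps_estimate(n, K, beta):
    """Vocabulary size predicted for a text of n words: V = K * n**beta."""
    return K * n ** beta

def vocabulary_growth(tokens):
    """Number of distinct words seen after each token, for comparison with the estimate."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth

print(heaps_estimate(1_000_000, K=50, beta=0.5))  # estimate for a million-word text
print(vocabulary_growth("a b a c b d".split()))
```

Plotting `vocabulary_growth` for a real collection against `heaps_estimate` is a simple way to fit K and beta empirically.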

  12. String Similarity Models • Similarity is measured by a distance function • Hamming distance – number of positions at which two strings of equal length differ • Levenshtein distance – minimum number of insertions, deletions, and substitutions needed to make strings equal • color to colour is 1 • survey to surgery is 2 • Can be extended to documents • UNIX diff treats each line as a character
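
Both distance functions on this slide can be sketched directly. The Levenshtein version below uses the standard dynamic-programming recurrence and works on any sequence, so passing lists of lines compares documents line by line. That only illustrates the "each line as a character" idea and is not a description of how UNIX diff is actually implemented.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(levenshtein("color", "colour"))
print(levenshtein("survey", "surgery"))
print(levenshtein(["a", "b", "c"], ["a", "x", "c"]))  # lines treated as symbols
```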

  13. Content Types: Markup and Multimedia

  14. Introduction • Markup languages use extra textual syntax to encode: • Formatting / display information • Structure information • Descriptive metadata • Semantic metadata • Marks are often called tags • The act of adding markup is called tagging • Most markup languages use initial and ending tags surrounding the marked text

  15. Standard Generalized Markup Language (SGML) • Metalanguage for markup. • Includes rules for defining markup languages • Use of SGML includes • Description of structure of markup • Text marked with tags • Document Type Definition (DTD) • Describes and names tags and how they are related • Comments used to express interpretation of tags (meaning, presentation, …)

  16. SGML DTD Example • <!-- SGML DTD for electronic messages --> • <!ELEMENT e-mail - - (prolog, contents)> • <!ELEMENT prolog - - (sender, address+, subject?, Cc*)> • <!ELEMENT (sender | address | subject | Cc) - O (#PCDATA)> • <!ELEMENT contents - - (par | image | audio)+> • <!ELEMENT par - O (ref | #PCDATA)+> • <!ELEMENT ref - O EMPTY> • <!ELEMENT (image | audio) - - (#NDATA)> • <!ATTLIST e-mail • id ID #REQUIRED • date_sent DATE #REQUIRED • status (secret | public) public> • <!ATTLIST ref • id IDREF #REQUIRED> • <!ATTLIST (image | audio) • id IDREF #REQUIRED>

  17. SGML Example • <!DOCTYPE e-mail SYSTEM "e-mail.dtd"> • <e-mail id=94108rby date_sent=02101998> • <prolog> • <sender> Pablo Neruda</sender> • <address> Federico Garcia Lorca</address> • <address> Ernest Hemingway</address> • <subject> Picture of my house in Isla • <Cc> Gabriel Garcia Marquez</Cc> • </prolog> • <contents> • <par> • Here are two photos. One is of the view (photo <ref idref=F2>). • </par> • <image id=F1> "photo1.gif" </image> • <image id=F2> "photo2.jpg" </image> • </contents> • </e-mail>

  18. SGML Characteristics • DTD provides the ability to determine whether a given document is valid, i.e. conforms to the declared structure. • SGML generally does not specify presentation/appearance. • Output specification standards: • DSSSL (Document Style Semantics and Specification Language) • FOSI (Formatting Output Specification Instance)

  19. HyperText Markup Language (HTML) • Based on SGML • HTML DTD not explicitly referenced by documents • HTML documents can have other content embedded within them • Images or audio • Frames with other HTML documents • When programs are included, it is referred to as Dynamic HTML • Strict HTML includes only non-presentational markup. • Cascading Style Sheets (CSS) used to define presentation • In reality, presentational and structural markup are blended by HTML authoring applications.

  20. (Original) HTML Limitations • In contrast to SGML: • Users cannot specify their own tags or attributes. • No support for nested structures that can represent database schemas or object-oriented hierarchies. • No support for validation of document by consuming applications.

  21. eXtensible Markup Language (XML) • XML is a simplified subset of SGML • XML is a meta-language • XML designed for semantic markup that is both human and machine readable • No DTD is required • All tags must be closed • Extensible Stylesheet Language (XSL) • XML equivalent of CSS • Can be used to convert XML into HTML and CSS
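
The points that no DTD is required and that all tags must be closed can be illustrated with Python's standard `xml.etree.ElementTree` parser. The message below is a hypothetical XML rendering of part of the earlier e-mail example, written for this sketch (quoted attribute values, every element closed); it is not taken from the slides.

```python
import xml.etree.ElementTree as ET

XML_MESSAGE = """<e-mail id="94108rby" status="public">
  <prolog>
    <sender>Pablo Neruda</sender>
    <address>Federico Garcia Lorca</address>
    <address>Ernest Hemingway</address>
  </prolog>
</e-mail>"""

def is_well_formed(text):
    """True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

root = ET.fromstring(XML_MESSAGE)
addresses = [element.text for element in root.iter("address")]
print(root.get("id"), addresses)
print(is_well_formed("<par>unclosed paragraph"))  # missing end tag
```

Unlike the SGML example, where end tags may be omitted when the DTD allows it, the parser here rejects the unclosed `<par>` outright, since it has no DTD to tell it where the element ends.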

  22. Multimedia • Lots of data file formats for non-textual data • Images • BMP, GIF, JPEG (JPG), TIFF • Audio • AU, MIDI, WAVE, MP3 • Video • MPEG, AVI, QuickTime • Graphics / Virtual Environments • CGM, VRML, OpenGL

  23. Audio and Video • Data files often have: • Header • Indicates time granularity, number of channels, bits per channel • Somewhat like a DTD • Data • The signal • Data may be compressed • Data may be in frequency domain rather than time domain • Data may be encoded as sequence of differences between consecutive time segments.
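
The header/data split described on this slide can be seen in a WAV file using Python's standard `wave` module. The sketch below builds a short silent clip in memory and reads its header fields back; the parameter values (one channel, 16-bit samples, 8000 frames per second) and the function names are arbitrary choices for illustration.

```python
import io
import wave

def make_silent_wav(channels=1, sample_width=2, frame_rate=8000, n_frames=8000):
    """Build an uncompressed WAV file in memory: header followed by the signal."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(sample_width)  # bytes per sample
        writer.setframerate(frame_rate)    # time granularity: frames per second
        writer.writeframes(b"\x00" * (channels * sample_width * n_frames))
    return buffer.getvalue()

def read_header(data):
    """Return the header fields that describe how to interpret the signal."""
    with wave.open(io.BytesIO(data), "rb") as reader:
        return {
            "channels": reader.getnchannels(),
            "bits_per_sample": reader.getsampwidth() * 8,
            "frame_rate": reader.getframerate(),
            "n_frames": reader.getnframes(),
        }

header = read_header(make_silent_wav())
print(header)
```

Without the header fields a reader could not tell how many bytes make up one sample or how fast to play them, which is the sense in which the header plays a role somewhat like a DTD.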
