Content Types: Text and Metadata
Introduction • Text documents come in many forms • Article (news, conference, journal, etc.) • Email, memo, … • Book, manual, manuscript, transcript, … • Any part of one of the above • Syntax can express • Structure • Presentation style • Semantics (e.g. software code)
Metadata • Metadata – data about data • Descriptive metadata • External to meaning of document • Author, publication date, document source, document length, document genre, file type, bits per second, frame rate, etc. • Semantic metadata • Characterizes semantic content of document • Library of Congress (LoC) subject headings, keywords, subject headings from ontologies (e.g. MeSH), etc.
Metadata Formats • Machine Readable Cataloging Record (MARC) • Used by most libraries • Fields include title, author, etc. • Resource Description Framework (RDF) • Used for Web resources • Node and attribute / value pairs • Node ID is any Uniform Resource Identifier (URI), which could be a URL
Metadata Sets • Dublin Core Metadata Elements • Contributor – entities contributing to the content • Coverage – extent or scope of content (spatial area, temporal period, …) • Creator – entity primarily responsible for making the content • Date – date associated with event (e.g. publication) for resource • Description – abstract, table of contents, … • Format – media (file) type, dimensions (size, duration), hardware needed • Identifier – unique identifier • Language – language of content • Publisher – entity responsible for making resource available • Relation – reference to related resource(s) • Rights – information about rights held in/over resource • Source – resource from which content is derived • Subject – keywords, key phrases, classification code, etc. • Title – name of the resource • Type – nature or genre of content
Text Formats • Coding schemes • EBCDIC (8 bit, one of the first coding schemes) • ASCII (initially 7 bit, extended to 8 bit) • Unicode (originally 16 bit, for large alphabets) • Additional Formats • RTF (format-oriented document exchange) • PDF and PostScript (display-oriented representation) • Multipurpose Internet Mail Extensions (MIME) (multiple character sets, languages, media)
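The difference between coding schemes can be seen by encoding the same character under each one. The sketch below uses Python's built-in codecs; `cp037` is assumed here as a representative EBCDIC code page, and the helper name `byte_values` is purely illustrative.

```python
def byte_values(text, codec):
    """Return the encoded bytes of `text` as two-digit hex strings."""
    return [f"{b:02x}" for b in text.encode(codec)]


# The same letter has different byte values under each coding scheme.
ascii_a = byte_values("A", "ascii")       # one byte, value 0x41
ebcdic_a = byte_values("A", "cp037")      # one byte, value 0xC1 (EBCDIC code page 037)
utf16_a = byte_values("A", "utf-16-be")   # two bytes in 16-bit Unicode

# Characters outside the 7-bit ASCII range cannot be encoded as ASCII.
try:
    "é".encode("ascii")
    ascii_can_encode_accent = True
except UnicodeEncodeError:
    ascii_can_encode_accent = False
```

A file's bytes are only meaningful once the coding scheme is known, which is one reason formats such as MIME carry the character set as metadata.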
Information Theory • How can we predict the information value of components of a document? • Entropy – attempts to model information content (information uncertainty) • $E = -\sum_{i} p_i \log_2 p_i$, summed over all symbols $i$ in the alphabet • $p_i$ is the probability of symbol $i$ (number of occurrences of the symbol divided by the total number of symbols in the text) • Need a text model for real language • Also important for compression, as $E$ acts as a limit on how much a text can be compressed.
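As a minimal sketch of the entropy formula, the function below estimates each symbol's probability from its relative frequency in the text itself (a zero-order estimate; a real language model would be needed for anything more precise). The function name `entropy` is an illustrative choice.

```python
import math
from collections import Counter


def entropy(text):
    """Entropy in bits per symbol, using relative symbol frequencies as p_i."""
    counts = Counter(text)
    total = len(text)
    # E = - sum over symbols of p_i * log2(p_i)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform_four = entropy("abcd")   # four equally likely symbols
two_symbols = entropy("abab")    # two equally likely symbols
one_symbol = entropy("aaaa")     # no uncertainty at all
```

Four equally likely symbols need 2 bits each, two need 1 bit, and a text made of a single repeated symbol carries no information under this model.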
Modeling Character Strings • Symbols in NL are not evenly distributed • Some symbols are not part of words (often used for syntax) • Symbols in words are not evenly distributed • Models • Binomial model uses distribution of symbols in language • But previous symbols influence probabilities of later symbols • (what letter will appear after a q?) • Finite context or Markovian models used for this dependency • k-order where k is the number of previous characters taken into account by the model • Thus, the binomial model is a 0-order model
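A finite-context model can be sketched by counting which symbol follows each context of length k. This is an illustrative, unsmoothed estimate; the names `train` and `probability` are assumptions made for the example, not part of any library.

```python
from collections import Counter, defaultdict


def train(text, k):
    """Count, for every context of k characters, the symbols that follow it."""
    model = defaultdict(Counter)
    for i in range(k, len(text)):
        context = text[i - k:i]      # the k previous characters ('' when k == 0)
        model[context][text[i]] += 1
    return model


def probability(model, context, symbol):
    """Relative frequency of `symbol` after `context` (0.0 if context unseen)."""
    counts = model.get(context)
    if not counts:
        return 0.0
    return counts[symbol] / sum(counts.values())


order1 = train("quiet queen quits", 1)
p_u_after_q = probability(order1, "q", "u")

order0 = train("aab", 0)             # k = 0: plain symbol frequencies
p_a = probability(order0, "", "a")
```

With k = 0 the context is always empty, so the model reduces to the plain symbol distribution described on the slide.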
Word Distribution in Documents • How frequent are words within documents? • Zipf’s Law • Frequency of the $i$th most frequent word is $1/i^{\theta}$ times the frequency of the most frequent word • The value of $\theta$ depends on the text (value of 1 is logarithmic distribution) • $\theta$ values of 1.5 to 2.0 best model real texts • In practice, a few hundred words make up 50% of most texts • Frequent words provide less information • Thus, many search strategies involve ignoring stopwords (a, an, the, is, of, by, …)
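Zipf's Law can be checked against a text by comparing observed rank frequencies with the predicted ones. The sketch below is illustrative only: the function names are assumptions, and the sample sentence is far too small to show a real Zipfian curve.

```python
from collections import Counter


def rank_frequencies(words):
    """Observed word counts, ordered from most to least frequent."""
    return [count for _, count in Counter(words).most_common()]


def zipf_prediction(top_frequency, rank, theta=1.0):
    """Predicted frequency of the word at `rank`: top_frequency / rank**theta."""
    return top_frequency / rank ** theta


observed = rank_frequencies("the cat and the dog and the bird".split())
second_word = zipf_prediction(1000, 2, theta=1.0)
tenth_word = zipf_prediction(1000, 10, theta=2.0)
```

Larger values of theta make the predicted frequencies fall off faster with rank, which is why the choice of theta matters when fitting real texts.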
Word Distribution in Collections • Simplest to assume uniform distribution of words in documents • But not true • Better models built on negative binomial distributions or Poisson distributions
Vocabulary Size for Documents and Collections • Heaps’ Law • Vocabulary size ($V$) grows with number of words ($n$) • $V = K n^{\beta}$ • Experimentally, • $K$ is between 10 and 100 • $\beta$ is between 0.4 and 0.6 • So vocabulary grows roughly in proportion to the square root of the size of the document or collection in words • Works best for large documents & collections
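A small sketch of the vocabulary-growth formula follows. The default values K = 50 and beta = 0.5 are assumptions picked from inside the ranges quoted above, not measured constants, and both function names are illustrative.

```python
def heaps_vocabulary(n, K=50.0, beta=0.5):
    """Predicted vocabulary size V = K * n**beta for a text of n words."""
    return K * n ** beta


def vocabulary_growth(words):
    """Observed number of distinct words after each word of the text."""
    seen = set()
    sizes = []
    for word in words:
        seen.add(word)
        sizes.append(len(seen))
    return sizes


predicted = heaps_vocabulary(1_000_000)             # one million words
observed = vocabulary_growth("a b a c b".split())
```

Fitting K and beta to the observed growth curve of a real collection is how the experimental ranges are obtained.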
String Similarity Models • Similarity is measured by a distance function • Hamming distance – number of positions at which two strings of equal length differ • Levenshtein distance – minimum number of insertions, deletions, and substitutions needed to make strings equal • color to colour is 1 • survey to surgery is 2 • Can be extended to documents • UNIX diff treats each line as a character
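Both distances can be sketched in a few lines. The Levenshtein function below uses the standard row-by-row dynamic programming formulation; because it only compares elements for equality, it also accepts lists of lines, which mirrors the line-as-character idea (the actual diff utility may use a different algorithm).

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, item_a in enumerate(a, 1):
        curr = [i]
        for j, item_b in enumerate(b, 1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


color_colour = levenshtein("color", "colour")
survey_surgery = levenshtein("survey", "surgery")
line_distance = levenshtein(["a", "b", "c"], ["a", "x", "c"])
```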
Introduction • Markup languages use extra textual syntax to encode: • Formatting / display information • Structure information • Descriptive metadata • Semantic metadata • Marks are often called tags • The act of adding markup is called tagging • Most markup languages use initial and ending tags surrounding the marked text
Standard Generalized Markup Language (SGML) • Metalanguage for markup. • Includes rules for defining a markup language • Use of SGML includes • Description of structure of markup • Text marked with tags • Document Type Definition (DTD) • Describes and names tags and how they are related • Comments used to express interpretation of tags (meaning, presentation, …)
SGML DTD Example
<!-- SGML DTD for electronic messages -->
<!ELEMENT e-mail - - (prolog, contents)>
<!ELEMENT prolog - - (sender, address+, subject?, Cc*)>
<!ELEMENT (sender | address | subject | Cc) - O (#PCDATA)>
<!ELEMENT contents - - (par | image | audio)+>
<!ELEMENT par - O (ref | #PCDATA)+>
<!ELEMENT ref - O EMPTY>
<!ELEMENT (image | audio) - - (#NDATA)>
<!ATTLIST e-mail
    id ID #REQUIRED
    date_sent DATE #REQUIRED
    status (secret | public) public>
<!ATTLIST ref
    id IDREF #REQUIRED>
<!ATTLIST (image | audio)
    id IDREF #REQUIRED>
SGML Example
<!DOCTYPE e-mail SYSTEM "e-mail.dtd">
<e-mail id=94108rby date_sent=02101998>
<prolog>
<sender> Pablo Neruda</sender>
<address> Federico Garcia Lorca</address>
<address> Ernest Hemingway</address>
<subject> Picture of my house in Isla
<Cc> Gabriel Garcia Marquez</Cc>
</prolog>
<contents>
<par>
Here are two photos. One is of the view (photo <ref idref=F2>).
</par>
<image id=F1> "photo1.gif" </image>
<image id=F2> "photo2.jpg" </image>
</contents>
</e-mail>
SGML Characteristics • DTD provides the ability to determine whether a given document is valid, i.e. conforms to the declared structure. • SGML generally does not specify presentation/appearance. • Output specification standards: • DSSSL (Document Style Semantics and Specification Language) • FOSI (Formatting Output Specification Instance)
HyperText Markup Language (HTML) • Based on SGML • HTML DTD not explicitly referenced by documents • HTML documents can have documents embedded within them • Images or audio • Frames with other HTML documents • When programs are included, it is referred to as Dynamic HTML • Strict HTML includes only non-presentational markup. • Cascading Style Sheets (CSS) used to define presentation • In reality, presentational and structural markup are blended by HTML authoring applications.
(Original) HTML Limitations • In contrast to SGML: • Users cannot specify their own tags or attributes. • No support for nested structures that can represent database schemas or object-oriented hierarchies. • No support for validation of document by consuming applications.
eXtensible Markup Language (XML) • XML is a simplified subset of SGML • XML is a meta-language • XML designed for semantic markup that is both human and machine readable • No DTD is required • All tags must be closed • Extensible Stylesheet Language (XSL) • XML equivalent of CSS • Can be used to convert XML into HTML and CSS
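The "all tags must be closed" rule can be illustrated by rewriting the e-mail message as XML and parsing it without any DTD. This is a sketch using Python's standard `xml.etree.ElementTree` module; the XML text is an adaptation made for this example (attribute values quoted, the empty `ref` element self-closed, the subject element left out), and `is_well_formed` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# XML adaptation of the e-mail message: every tag is closed and every
# attribute value is quoted. No DTD is referenced.
DOCUMENT = """<e-mail id="94108rby" date_sent="02101998">
  <prolog>
    <sender>Pablo Neruda</sender>
    <address>Federico Garcia Lorca</address>
    <address>Ernest Hemingway</address>
    <Cc>Gabriel Garcia Marquez</Cc>
  </prolog>
  <contents>
    <par>Here are two photos. One is of the view (photo <ref idref="F2"/>).</par>
    <image id="F1">photo1.gif</image>
    <image id="F2">photo2.jpg</image>
  </contents>
</e-mail>"""


def is_well_formed(xml_text):
    """True if the text parses as XML, i.e. tags are closed and properly nested."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(DOCUMENT)
recipients = [address.text for address in root.iter("address")]
image_ids = [image.get("id") for image in root.iter("image")]
```

Well-formedness is checked by the parser alone; checking the document against a declared structure would additionally need a DTD or schema and a validating parser.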
Multimedia • Lots of data file formats for non-textual data • Images • BMP, GIF, JPEG (JPG), TIFF • Audio • AU, MIDI, WAVE, MP3 • Video • MPEG, AVI, QuickTime • Graphics / Virtual Environments • CGM, VRML, OpenGL
Audio and Video • Data files often have: • Header • Indicates time granularity, number of channels, bits per channel • Somewhat like a DTD • Data • The signal • Data may be compressed • Data may be in frequency domain rather than time domain • Data may be encoded as sequence of differences between consecutive time segments.
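Encoding a signal as differences between consecutive samples can be sketched as follows. This is a generic delta-coding illustration with made-up sample values, not the scheme of any particular audio or video format.

```python
def delta_encode(samples):
    """Keep the first sample, then store each sample as a difference from the previous one."""
    if not samples:
        return []
    deltas = [samples[0]]
    for previous, current in zip(samples, samples[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas):
    """Rebuild the original samples by accumulating the differences."""
    samples = []
    total = 0
    for delta in deltas:
        total += delta
        samples.append(total)
    return samples


signal = [100, 102, 105, 105, 103]       # hypothetical sample values
encoded = delta_encode(signal)
decoded = delta_decode(encoded)
```

When consecutive samples are close in value, the differences are small numbers, which a later compression step can store in fewer bits than the raw samples.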