
Markup Languages and Complex Documents (MLCD)

Universität zu Köln, 10.12.2004. Claus Huitfeldt (University of Bergen) and C.M. Sperberg-McQueen (World Wide Web Consortium). http://teksttek.aksis.uib.no/projects/mlcd



  1. Markup Languages and Complex Documents (MLCD) Universität zu Köln, 10.12.2004 Claus Huitfeldt (University of Bergen) and C.M. Sperberg-McQueen (World Wide Web Consortium) http://teksttek.aksis.uib.no/projects/mlcd

  2. What MLCD is about
     Creating a markup
     • notation
     • data structure
     • grammar
     • semantics
     for “complex” documents, i.e. documents with overlapping, fragmented or disordered elements, multiple co-existing alternative structures etc.

  3. The structure of this talk
     • About document markup
     • The success of SGML/XML
     • The problems of SGML/XML
     • MLCD’s aims and organization etc.
     • Data structure
     • Notation
     • Prototype software
     • Grammars
     • Related work

  4. What Markup is
     (or, what we mean by "markup")
     • Markup is information added to the character stream of a text document, normally meta-information about the contents, intended interpretation, or processing of part of the character stream.
     • Markup is
       - embedded
       - separable

  5. Why Markup is Important
     Form of representation affects:
     • what computers can do with texts
     • the way we think about texts
     • designers of computer text systems
     • authors and readers
     Formal representation may serve as a model for theories of text, or as a tool in theorising about texts.
     Therefore: shortcomings and problems of current markup systems are also important.

  6. The basic elements of markup (the “Tripod”):
     • Notation (linearisation)
     • Data structure (graph representation)
     • Constraint language (grammar)

  7. Why SGML/XML (or is it HTML/PDF?) is such a success
     Tight integration of:
     • A simple notation (the angle brackets)
     • A straightforward data structure with a natural interpretation (document tree and attribution of properties to elements)
     • A powerful constraint language (the DTD, a context-free grammar)

  8. Why SGML/XML is still not perfect
     Problems representing:
     • overlapping elements
     • discontiguous elements
     • disordered elements
     • structural variation (alternate ordering)
     • micro-level variation
     • macro-level alternate ordering
     • fragmentation, transposition, disorder
     • coextensive elements?
     • context-sensitive constraints
     • attribute co-occurrence constraints
     i.e.: complex structures.

  9. Are complex structures important?
     Yes. An example: overlap
     • Overlap exists in real texts:
       - pages and paragraphs
       - physical and formal structure
       - details of inscriptions and other structures
       - verse lines and speeches in verse drama
       - direct discourse and verse lines or sentences
     • Overlap also exists in electronic texts:
       - Overlap is explicitly allowed, and used for the encoding of overlapping features, in certain non-SGML systems such as MECS, FFF (Folio Flat File), TACT/COCOA.
       - Such systems may also create "spurious" or "dumb" overlap.
     And some systems contain overlap although they are not supposed to: HTML!

  10. Doesn’t SGML/XML have solutions?
      Yes, but no: SGML/XML “solutions”, such as
      • milestones
      • fragmentation
      • virtual elements
      • stand-off markup
      • CONCUR
      are artificial and cumbersome, and are supported neither by SGML/XML as such nor by existing software.
      (From now on, “XML” stands for SGML/XML.)

  11. Why doesn’t XML support complex structures?
      • Non-hierarchical links are possible, but
      • XML element structure is supported by a context-free grammar
      • which requires hierarchical nesting of elements
      • What we have called “complex structures” are not supported by context-free grammars
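      The nesting requirement on slide 11 can be observed with any conforming XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample sentence and the milestone element names `bStart` and `bEnd` are assumptions made for illustration, not part of the talk. It shows an overlapping encoding being rejected and a milestone-style workaround (one of the "solutions" listed on slide 10) being accepted.

```python
import xml.etree.ElementTree as ET

# Overlap: <b> opens inside <a> but closes after it. Not well-formed XML.
overlapping = "<s><a>John <b>loves</a> Mary</b></s>"
try:
    ET.fromstring(overlapping)
    overlap_parsed = True
except ET.ParseError:
    overlap_parsed = False

# Milestone workaround: the second element is reduced to two empty
# marker elements (hypothetical names), so the tree stays hierarchical.
milestones = '<s><a>John <bStart id="b1"/>loves</a> Mary<bEnd ref="b1"/></s>'
root = ET.fromstring(milestones)
element_names = [el.tag for el in root.iter()]

print(overlap_parsed)
print(element_names)
```

      The milestone version parses, but the parser knows nothing about the relation between `bStart` and `bEnd`; an application has to reconstruct the `b` element itself, which is the sense in which such workarounds are unsupported.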

  12. Departure point for MLCD: XML and MECS
      MECS:
      • Simple notation for overlapping structures
      • Generic, like XML
      • Used in markup of 20,000 pages
      • Host of software
      But:
      • No data structure (just left-to-right scan)
      • No grammar (though well-formedness and GI vocabulary constraints defined)

  13. The idea behind MLCD
      Combine the best of both worlds (XML and MECS), i.e. create a markup language which
      • can handle complex structures (overlap and beyond),
      • is based on a markup tripod tightly integrated like that of XML (notation, data structure and grammar).

  14. Today [diagram labels: XML, MECS, SGML]

  15. Tomorrow [diagram labels: MLCD, XML, MECS, SGML]

  16. MLCD – project organization
      • Project period: 2001-07
      • Project partners
        - Host: Aksis (programmer, researcher, administration)
        - UoB (Philosophy, Linguistics, Humanities informatics…)
        - GSLIS (semantics): Renear, Dubin
        - Sperberg-McQueen (W3C)
        - Others…?
      • Achievements
        - Notation (TexMECS)
        - Data structure (GODDAG)
        - Experimental software (wff checker, loader/linearizer, visualizer, MOTS15, BECHAMEL)
      • Plans
        - Grammar
        - Prototype software
        - Further partners…?

  17. MLCD notation: TexMECS (“Trivially extended MECS”)
      Design goals:
      • Isomorphic to XML and MECS for relevant documents
      • Every TexMECS document corresponds to a GODDAG structure, and vice versa
      • Correct GODDAG construction without application-specific knowledge
      • Simplicity of parsing, minimal number of magic characters

  18. TexMECS elements
      Only two reserved characters: < and |
      empty:        <e att="val">
      with ID:      <e@foo att="val">
      contiguous:   <e|...|e>
      interrupted:  <e|...|-e> ... <+e|...|e>
      unordered:    <|e||...||e|>
      virtual:      <^e^foo att="val">
      self-overlap: <e~1|...<e~2|...|e~1>...|e~2>

  19. Other TexMECS mechanisms
      Internal entities:            <&eacute>
      Structured internal entities: <&dot.fullstop> vs. <&dot.decimal>
      External entities:            <<url>>, e.g. <<vw117-a>> or <<http://www.w3.org/XML>>
      Comments:                     <* ... *>. Note that comments can nest.
      CDATA sections:               <#CDATA< ... >>.

  20. TexMECS examples
      <s|<a| John <b| loves |a> Mary |b>|s>
      <sp who="HUGHIE"|
        <p|How did that translation go?|p>
        <lg type="haiku"|
          <l|da de dum de dum,|l>
          <l@frog|gets a new frog,|l>
          <l|...|l>|lg>
      |sp>
      <sp who="LOUIS"|
        <p|Er ...|p>
        <lg|
          <l@new|it's a new pond.|l>|lg>
      |sp>
      <sp who="DEWEY"|
        <p|Ah ...|p>
        <lg|
          <l@pond|When the old pond|l>|lg>
        <p|Right. That's it.|p>
      |sp>
      <lg|<^l^pond><^l^frog><^l^new>|lg>
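      To make the first example on slide 20 concrete, here is a minimal Python sketch of a left-to-right scan that reports overlapping element pairs. It is not the project's software: the regular expression covers only contiguous start-tags (`<e|`, `<e@id|`, `<e att="val"|`) and end-tags (`|e>`), and the function name `find_overlaps` is hypothetical. Interrupted, unordered, virtual and self-overlapping elements, entities, comments and CDATA sections are deliberately left out.

```python
import re

# Assumed subset of TexMECS: contiguous start-tags and end-tags only.
TAG = re.compile(
    r"<(?P<start>[A-Za-z][\w.-]*)(?:@\w+)?(?:\s+[^|<>]*)?\|"
    r"|\|(?P<end>[A-Za-z][\w.-]*)>"
)

def find_overlaps(text):
    """Return (closed, still_open) pairs where one element closes
    while an element opened after it is still open."""
    open_elems = []   # generic identifiers of currently open elements
    overlaps = []
    for m in TAG.finditer(text):
        if m.group("start"):
            open_elems.append(m.group("start"))
            continue
        name = m.group("end")
        if name not in open_elems:
            raise ValueError(f"end-tag |{name}> without a start-tag")
        # Index of the most recently opened element with this name.
        idx = len(open_elems) - 1 - open_elems[::-1].index(name)
        for later in open_elems[idx + 1:]:
            overlaps.append((name, later))
        del open_elems[idx]
    if open_elems:
        raise ValueError(f"unclosed elements: {open_elems}")
    return overlaps

print(find_overlaps("<s|<a| John <b| loves |a> Mary |b>|s>"))
```

      In an XML parser the same situation is a fatal error; here it is simply recorded, which matches the idea that overlap is an ordinary structural relation rather than a violation.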

  21. MLCD data structure: GODDAG (“generalized ordered-descendant directed acyclic graph”)
      Not: [diagram] But: [diagram]
      Overlap is simply multiple parentage.

  22. GODDAG – general description
      A GODDAG is a directed acyclic graph (DAG):
      • Every node is either a leaf or a non-terminal.
      • Each leaf is labeled with a string.
      • Each non-terminal is labeled with an identifier.
      • Directed arcs identify parent/child relation; paths identify ancestor/descendant relation.
      • Node n is a leaf node iff n is not a parent.
      • Node n is a non-terminal node iff n is not a leaf node.
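      The description on slide 22 can be sketched as a small data structure. The following Python is an illustration under assumptions, not the MLCD implementation: the class name `Node` and the helpers `frontier` and `parents_of` are hypothetical, and the graph is the "John loves Mary" example from slide 20, in which the leaf "loves" has two parents.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based equality: nodes are distinct objects
class Node:
    label: str                      # string for a leaf, identifier otherwise
    children: list = field(default_factory=list)  # ordered outgoing arcs

def frontier(root):
    """Leaf labels in left-to-right order, each shared leaf listed once."""
    seen, out = set(), []
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        if not node.children:       # a node is a leaf iff it is not a parent
            out.append(node.label)
        for child in node.children:
            visit(child)
    visit(root)
    return out

def parents_of(root, target):
    """Labels of all nodes that have an arc to `target`."""
    found, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        if any(child is target for child in node.children):
            found.append(node.label)
        for child in node.children:
            visit(child)
    visit(root)
    return found

john, loves, mary = Node("John"), Node("loves"), Node("Mary")
a = Node("a", [john, loves])
b = Node("b", [loves, mary])
s = Node("s", [a, b])

print(frontier(s))
print(parents_of(s, loves))
```

      The only difference from a tree is that `loves` appears in the child list of both `a` and `b`.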

  23. Restricted GODDAGs
      • Leaf nodes are ordered.
      • Each non-terminal dominates a contiguous subsequence of leaves.
      • No two nodes dominate the same subsequence of the frontier.
      Unrestricted GODDAGs
      • For each node n, arcs (n → x) are ordered.
      • Leaves need not have any ordering; no contiguity rule for non-terminals.
      • Two non-terminals may dominate same set of leaves.
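      The two conditions on restricted GODDAGs lend themselves to a mechanical check. The Python sketch below assumes a simplified representation that is not from the talk: non-terminals are keys of a dictionary `arcs` mapping to ordered child lists, leaves are unique strings listed in `leaf_order`, and the uniqueness rule is applied to non-terminals only. The function names are hypothetical.

```python
def leaf_span(node, arcs, leaf_order):
    """Set of frontier positions dominated by `node`."""
    if node not in arcs:                      # leaf
        return {leaf_order.index(node)}
    span = set()
    for child in arcs[node]:
        span |= leaf_span(child, arcs, leaf_order)
    return span

def is_restricted(arcs, leaf_order):
    """True if every non-terminal dominates a contiguous run of leaves
    and no two non-terminals dominate the same run."""
    seen = set()
    for node in arcs:
        span = leaf_span(node, arcs, leaf_order)
        lo, hi = min(span), max(span)
        if span != set(range(lo, hi + 1)):    # gap in the dominated leaves
            return False
        if (lo, hi) in seen:                  # same subsequence twice
            return False
        seen.add((lo, hi))
    return True

leaves = ["John", "loves", "Mary"]
overlap = {"s": ["a", "b"], "a": ["John", "loves"], "b": ["loves", "Mary"]}
discontiguous = dict(overlap, c=["John", "Mary"])
coextensive = dict(overlap, d=["a", "b"])

print(is_restricted(overlap, leaves))
print(is_restricted(discontiguous, leaves))
print(is_restricted(coextensive, leaves))
```

      Plain overlap passes the check, while a discontiguous element (`c`) or a second element with the same extent as `s` (`d`) requires the unrestricted form.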

  24. Features of GODDAGs
      Just like a tree:
      • simple inheritance
      • overriding
      • additive meaning
      • positional meaning

  25. (no text)

  26. GODDAG and spurious overlap
      <a|…<b|…|a>…|b>
      <a|<b|…|a>…|b>
      <a|…<b||a>…|b>
      <a|…<b|…|a>|b>
      However, GODDAGs can be “cleaned” by removing spurious overlap.
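      One possible reading of the four patterns on slide 26, assumed here rather than stated in the talk, is that overlap between `a` and `b` is spurious whenever one of the three text regions (inside `a` only, shared, inside `b` only) is empty, because the tags could then be reordered into a nested or adjacent arrangement without changing which text each element contains. The Python sketch below classifies strings of exactly that shape under this assumption; the function name `classify` is hypothetical.

```python
import re

# Matches only the fixed shape <a|X<b|Y|a>Z|b>, with plain text X, Y, Z.
SHAPE = re.compile(
    r"<a\|(?P<a_only>[^<|]*)<b\|(?P<shared>[^<|]*)\|a>(?P<b_only>[^<|]*)\|b>"
)

def classify(text):
    """Return 'genuine' or 'spurious' for an <a|..<b|..|a>..|b> string."""
    m = SHAPE.fullmatch(text)
    if m is None:
        raise ValueError("input does not have the shape <a|X<b|Y|a>Z|b>")
    # Assumption: an empty region means the overlap carries no information.
    has_empty_region = any(value == "" for value in m.groupdict().values())
    return "spurious" if has_empty_region else "genuine"

for sample in ["<a|x<b|y|a>z|b>", "<a|<b|y|a>z|b>",
               "<a|x<b||a>z|b>", "<a|x<b|y|a>|b>"]:
    print(sample, classify(sample))
```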

  27. Software
      • Well-formedness checker
      • Loader/linearizer for XML, MECS, TexMECS
      • Visualizer
      • Retrieval / concordance program
      Demo:
      • http://teksttek.aksis.uib.no/projects/mlcd
      • BECHAMEL

  28. MLCD’s open slot: A Grammar
      • Without a formal grammar, no constraint language.
      • Without a constraint language, no proper markup system.
      Validation requirements:
      • allow for overlap
      • allow validation of virtual elements
      • partial validation?
      • modular specification, operations on grammar fragments
        - union
        - intersection
        - difference

  29. The Chomsky hierarchy
      • regular grammars (regular expressions)
      • context-free grammars (BNF, ...)
      • context-sensitive (monotonic) grammars
      • unrestricted phrase-structure grammars
      What we want is either a little more powerful than context-free grammars, or else a little weaker.

  30. Formalisms to consider
      • attribute/affix grammars
        - attribute grammars (Knuth 1968)
        - affix grammars (descended from Van Wijngaarden two-level grammars used in Algol 68 Report)
        - extend context-free grammars in limited (tractable) ways by passing parameters on non-terminals
      • tree automata
      • parallel parsing (intersection of multiple grammars), cf. CONCUR
      • exotica (GPSG slash formalism? graph grammars? constraint grammars?)
      • standard context-free grammars plus ad hoc rules?

  31. Related work
      • Text Encoding Initiative SIG on overlapping markup
      • The ARCHway Project (Kentucky)
      • OSIS (Steve DeRose)
      • JITTs (Patrick Durusau and Brook O’Donnell)
      • LMNL project (Wendell Piez and Jeni Tennison)
      • Bielefeld Text Technology group?
      • Others ???

  32. Thank you
