Markup Languages and Complex Documents (MLCD) Universität zu Köln, 10.12.2004 Claus Huitfeldt (University of Bergen) and C.M. Sperberg-McQueen (World Wide Web Consortium) http://teksttek.aksis.uib.no/projects/mlcd
What MLCD is about Creating a markup • notation • data structure • grammar • semantics • for ”complex” documents, i.e. documents with overlapping, fragmented or disordered elements, multiple co-existing alternative structures etc.
The structure of this talk • About document markup • The success of SGML/XML • The problems of SGML/XML • MLCD’s aims and organization etc. • Data structure • Notation • Prototype software • Grammars • Related work
What Markup is • (or, what we mean by "markup") • Markup is information added to the character stream of a text document, normally meta-information about the contents, intended interpretation, or processing of part of the character stream. • Markup is • - embedded • - separable
Why Markup is Important • Form of representation affects: • what computers can do with texts • the way we think about texts • designers of computer text systems • authors and readers • Formal representation may serve as a model for theories of text, or as a tool in theorising about texts. • Therefore: Shortcomings and problems of current markup systems are also important
The basic elements of markup (The “Tripod”): • Notation (linearisation) • Data structure (graph representation) • Constraint language (grammar)
Why SGML/XML (or is it HTML/PDF?) is such a success Tight integration of: • A simple notation • (the angle brackets) • A straightforward data structure with a natural interpretation • (document tree and attribution of properties to elements) • A powerful constraint language • (the DTD, a context-free grammar)
Why SGML/XML is still not perfect Problems representing: • overlapping elements • discontiguous elements • disordered elements • structural variation (alternate ordering) • micro-level variation • macro-level alternate ordering • fragmentation, transposition, disorder • coextensive elements? • context-sensitive constraints • attribute co-occurrence constraints i.e.: complex structures.
Are complex structures important? Yes. An example: overlap • Overlap exists in real texts: • pages and paragraphs • physical and formal structure • details of inscriptions and other structures • verse lines and speeches in verse drama • direct discourse and verse lines or sentences • Overlap also exists in electronic texts: • Overlap is explicitly allowed, and used for the encoding of overlapping features, in certain non-SGML systems such as: • MECS, FFF (Folio Flat File), TACT/COCOA • Such systems may also create "spurious" or "dumb" overlap. And some systems contain overlap although they are not supposed to: HTML!
Doesn’t SGML/XML have solutions? • Yes, but no: SGML/XML ”solutions”, such as • milestones • fragmentation • virtual elements • stand-off markup • CONCUR • are artificial and cumbersome, and are supported neither by SGML/XML as such nor by existing software. • (From now on, "XML" stands for SGML/XML.)
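To make the milestone workaround concrete, here is a minimal Python sketch. It assumes a hypothetical pair of empty elements, bStart and bEnd (names invented for this illustration, not taken from any standard vocabulary), that mark where an overlapping element b would begin and end. The point is that b is no longer an element at all: its extent has to be recovered by application code.

```python
import xml.etree.ElementTree as ET

def milestone_span(xml_text, start_tag, end_tag):
    """Return (start, end) character offsets between two milestone elements.

    Offsets are relative to the concatenated text content of the document.
    The milestone tag names are supplied by the caller; they are an
    assumption of this sketch, not part of XML itself.
    """
    root = ET.fromstring(xml_text)
    pos = 0
    start = end = None

    def walk(el):
        nonlocal pos, start, end
        if el.tag == start_tag:
            start = pos
        if el.tag == end_tag:
            end = pos
        pos += len(el.text or "")
        for child in el:
            walk(child)
            pos += len(child.tail or "")

    walk(root)
    return start, end

# "a" is a real element; the overlapping "b" exists only as two milestones.
doc = "<s><a>John <bStart/>loves </a>Mary<bEnd/></s>"
text = "".join(ET.fromstring(doc).itertext())
span = milestone_span(doc, "bStart", "bEnd")
print(text[span[0]:span[1]])
```

A DTD can say where the milestones may occur, but not that every bStart is matched by a bEnd, which is one sense in which the workaround is unsupported by the grammar.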
Why doesn’t XML support complex structures? • Non-hierarchical links are possible, but • XML element structure is supported by a context-free grammar • which requires hierarchical nesting of elements • What we have called ”complex structures” are not supported by context-free grammars
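The nesting requirement is easy to observe with any conforming XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the sample sentences are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

nested = "<s><a>John <b>loves</b></a> Mary</s>"
overlapping = "<s><a>John <b>loves</a> Mary</b></s>"

print(is_well_formed_xml(nested))       # elements nest properly
print(is_well_formed_xml(overlapping))  # b starts inside a but ends outside it
```

The second document is rejected before any DTD is consulted: overlap violates well-formedness itself, not just validity.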
Departure point for MLCD: XML and MECS • MECS: • Simple notation for overlapping structures • Generic, like XML • Used in markup of 20,000 pages • Host of software • But: • No data structure • (Just left-to-right scan) • No grammar • (Though well-formedness and GI vocabulary constraints defined)
The idea behind MLCD • Combine the best of both worlds (XML and MECS), i.e. • create a markup language which • can handle complex structures (overlap and beyond), • is based on a markup tripod tightly integrated like that of XML (notation, data structure and grammar).
Today [diagram: XML, MECS, SGML]
Tomorrow [diagram: MLCD, XML, MECS, SGML]
MLCD – project organization • Project period: 2001-07 • Project partners • Host: Aksis • Programmer, researcher, administration • UoB • Philosophy, Linguistics, Humanities informatics… • GSLIS (semantics) • Renear, Dubin • Sperberg-McQueen (W3C) • Others…? • Achievements • Notation (TexMECS) • Data structure (GODDAG) • Experimental software (wff checker, loader/linearizer, visualizer, MOTS15, BECHAMEL) • Plans • Grammar • Prototype software • Further partners…?
MLCD notation: TexMECS (”Trivially extended MECS”) • Design goals: • Isomorphic to XML and MECS for relevant documents • Every TexMECS document corresponds to a GODDAG structure, and vice versa • Correct GODDAG construction without application-specific knowledge • Simplicity of parsing, minimal number of magic characters
TexMECS elements
Only two reserved characters: < and |
empty: <e att="val">
with ID: <e@foo att="val">
contiguous: <e|...|e>
interrupted: <e|...|-e> ... <+e|...|e>
unordered: <|e||...||e|>
virtual: <^e^foo att="val">
self-overlap: <e~1|...<e~2|...|e~1>...|e~2>
Other TexMECS mechanisms
Internal entities: <é>
Structured internal entities: <&dot.fullstop> vs. <&dot.decimal>
External entities: <<url>>, e.g. <<vw117-a>> or <<http://www.w3.org/XML>>
Comments: <* ... *>. Note that comments can nest.
CDATA sections: <#CDATA< ... >>.
TexMECS examples
<s|<a| John <b| loves |a> Mary |b>|s>
<sp who="HUGHIE"|
<p|How did that translation go?|p>
<lg type="haiku"|
<l|da de dum de dum,|l>
<l@frog|gets a new frog,|l>
<l|...|l>|lg>
|sp>
<sp who="LOUIS"|
<p|Er ...|p>
<lg| <l@new|it's a new pond.|l>|lg>
|sp>
<sp who="DEWEY"|
<p|Ah ...|p>
<lg| <l@pond|When the old pond|l>|lg>
<p|Right. That's it.|p>
|sp>
<lg|<^l^pond><^l^frog><^l^new>|lg>
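The first example can be scanned with very little machinery. The following Python sketch is not the MLCD software; it is a hypothetical tokenizer that handles only contiguous elements of the form <gi|...|gi>, with no attributes, IDs, suffixes, interrupted, unordered or virtual elements. It returns the text content together with the character span each element covers.

```python
import re

# Matches either a start-tag "<gi|" or an end-tag "|gi>" of the simple subset.
TOKEN = re.compile(r"<(?P<open>\w+)\||\|(?P<close>\w+)>")

def element_spans(doc):
    """Return (text, spans); spans is a list of (gi, start, end) offsets into text."""
    parts = []
    length = 0          # length of the text content seen so far
    open_at = {}        # gi -> stack of start offsets of still-open elements
    spans = []
    pos = 0
    for m in TOKEN.finditer(doc):
        chunk = doc[pos:m.start()]
        parts.append(chunk)
        length += len(chunk)
        pos = m.end()
        if m.group("open"):
            open_at.setdefault(m.group("open"), []).append(length)
        else:
            gi = m.group("close")
            if not open_at.get(gi):
                raise ValueError(f"end-tag |{gi}> has no matching start-tag")
            spans.append((gi, open_at[gi].pop(), length))
    parts.append(doc[pos:])
    if any(open_at.values()):
        raise ValueError("unclosed element")
    return "".join(parts), spans

text, spans = element_spans("<s|<a| John <b| loves |a> Mary |b>|s>")
for gi, start, end in spans:
    print(gi, repr(text[start:end]))
```

Unlike an XML parser, the scan never needs a stack of open elements that must close in reverse order; each end-tag is matched only against start-tags with the same generic identifier, which is why a and b may overlap.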
MLCD data structure: GODDAG (”generalized ordered-descendant directed acyclic graph”) Not: [figure] But: [figure] Overlap is simply multiple parentage.
GODDAG – general description A GODDAG is a directed acyclic graph (DAG): • Every node is either a leaf or a non-terminal. • Each leaf is labeled with a string. • Each non-terminal is labeled with an identifier. • Directed arcs identify parent/child relation; paths identify ancestor/descendant relation. • Node n is a leaf node iff n is not a parent. • Node n is a non-terminal node iff n is not a leaf node.
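The description above can be sketched in a few lines of Python. The class and function names (Node, frontier, parents_of) are invented for this illustration and are not the MLCD implementation; the graph built at the end is an assumed encoding of the "John loves Mary" example, in which the leaf "loves " has two parents.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based equality: two leaves may carry the same string
class Node:
    label: str                                     # string (leaf) or identifier (non-terminal)
    children: list = field(default_factory=list)   # ordered outgoing arcs

    @property
    def is_leaf(self):
        return not self.children                   # a leaf is a node that is not a parent

def frontier(root):
    """Leaf labels reachable from root, left to right, each leaf visited once."""
    out, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        if n.is_leaf:
            out.append(n.label)
        for child in n.children:
            visit(child)
    visit(root)
    return out

def parents_of(root, target):
    """Labels of all nodes below root that have target as a child."""
    found, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        if any(child is target for child in n.children):
            found.append(n.label)
        for child in n.children:
            visit(child)
    visit(root)
    return found

john, loves, mary = Node("John "), Node("loves "), Node("Mary")
a = Node("a", [john, loves])
b = Node("b", [loves, mary])
s = Node("s", [a, b])
print(frontier(s), parents_of(s, loves))
```

In a tree every node has at most one parent; here the only change is that the shared leaf is reachable from both a and b.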
Restricted GODDAGs • Leaf nodes are ordered. • Each non-terminal dominates a contiguous subsequence of leaves. • No two nodes dominate the same subsequence of the frontier.
Unrestricted GODDAGs • For each node n, arcs (n → x) are ordered. • Leaves need not have any ordering; no contiguity rule for non-terminals. • Two non-terminals may dominate the same set of leaves.
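The restricted-GODDAG conditions can be checked mechanically. The sketch below assumes a deliberately simple representation chosen for this illustration: a dict mapping each non-terminal name to its ordered children, with leaves identified by unique names that do not appear as keys. It tests contiguity and uniqueness of the dominated leaf sequences.

```python
def leaves_under(graph, node):
    """Leaves dominated by node, each listed once, in traversal order."""
    out, seen = [], set()
    def visit(n):
        if n not in graph:          # not a key, so it is a leaf
            if n not in seen:
                seen.add(n)
                out.append(n)
            return
        for child in graph[n]:
            visit(child)
    visit(node)
    return out

def is_restricted(graph, leaf_order):
    """Check contiguity and uniqueness of dominated leaf subsequences."""
    index = {leaf: i for i, leaf in enumerate(leaf_order)}
    dominated = set()
    for node in graph:
        idx = sorted(index[leaf] for leaf in leaves_under(graph, node))
        if not idx:
            return False            # a non-terminal must be a parent of something
        if idx != list(range(idx[0], idx[-1] + 1)):
            return False            # dominated leaves are not contiguous
        if tuple(idx) in dominated:
            return False            # another node already dominates this subsequence
        dominated.add(tuple(idx))
    return True

overlap = {"s": ["a", "b"], "a": ["John", "loves"], "b": ["loves", "Mary"]}
interrupted = {"s": ["q", "x"], "q": ["t1", "t3"], "x": ["t2"]}
print(is_restricted(overlap, ["John", "loves", "Mary"]))
print(is_restricted(interrupted, ["t1", "t2", "t3"]))
```

Plain overlap passes the check, while an element whose content is interrupted by foreign material (q above) needs the unrestricted form.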
Features of GODDAGs Just like a tree: • simple inheritance • overriding • additive meaning • positional meaning
GODDAG and spurious overlap
<a|…<b|…|a>…|b>
<a|<b|…|a>…|b>
<a|…<b||a>…|b>
<a|…<b|…|a>|b>
However, GODDAGs can be ”cleaned” by removing spurious overlap.
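One way to read the four patterns is in terms of the leaves each element covers. The Python sketch below rests on an assumption of this illustration, not a definition from the project: each "…" counts as one leaf, an element is a (start, end) span over leaf positions, and two elements genuinely overlap only if each contains a leaf the other lacks while they also share at least one.

```python
def genuinely_overlap(a, b):
    """True if the two (start, end) spans partially overlap in their content."""
    (a0, a1), (b0, b1) = sorted([a, b])
    return a0 < b0 < a1 < b1

# Spans for a and b in the four patterns, counting each "…" as one leaf.
patterns = {
    "<a|…<b|…|a>…|b>": ((0, 2), (1, 3)),   # a and b share the middle leaf
    "<a|<b|…|a>…|b>":  ((0, 1), (0, 2)),   # a lies entirely inside b
    "<a|…<b||a>…|b>":  ((0, 1), (1, 2)),   # a and b share nothing
    "<a|…<b|…|a>|b>":  ((0, 2), (1, 2)),   # b lies entirely inside a
}
for pattern, (a, b) in patterns.items():
    print(pattern, genuinely_overlap(a, b))
```

Under this reading only the first pattern is real overlap; the other three interleave their tags but describe nested or adjacent content, so their tags could be reordered without loss.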
Software • Well-formedness checker • Loader/linearizer for XML, MECS, TexMECS • Visualizer • Retrieval / concordance program Demo: • http://teksttek.aksis.uib.no/projects/mlcd • BECHAMEL
MLCD’s open slot: A Grammar • Without a formal grammar, no constraint language. • Without a constraint language, no proper markup system. • Validation Requirements • allow for overlap • allow validation of virtual elements • partial validation? • modular specification, operations on grammar fragments • union • intersection • difference
The Chomsky hierarchy • regular grammars (regular expressions) • context-free grammars (BNF, ...) • context-sensitive (monotonic) grammars • unrestricted phrase-structure grammars • What we want is either a little more powerful than context-free grammars, or else a little weaker.
Formalisms to consider • attribute/affix grammars • attribute grammars (Knuth 1968) • affix grammars (descended from Van Wijngaarden two-level grammars used in Algol 68 Report) • extend context-free grammars in limited (tractable) ways by passing parameters on non-terminals • tree automata • parallel parsing (intersection of multiple grammars), cf. CONCUR • exotica (GPSG slash formalism? graph grammars? constraint grammars?) • standard context-free grammars plus ad hoc rules?
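The parallel-parsing idea can be illustrated in a much reduced form. The Python sketch below does not validate against grammars at all; it only checks the precondition that each hierarchy, taken on its own, is properly nested, so that an ordinary context-free validator could then be run once per hierarchy. The hierarchy names and spans are invented for the example.

```python
def well_nested(spans):
    """True if no two (start, end) spans partially overlap."""
    ordered = sorted(spans, key=lambda s: (s[0], -s[1]))  # outer spans first
    open_ends = []                      # end offsets of currently open spans
    for start, end in ordered:
        while open_ends and open_ends[-1] <= start:
            open_ends.pop()             # that span closed before this one starts
        if open_ends and end > open_ends[-1]:
            return False                # runs past the end of its enclosing span
        open_ends.append(end)
    return True

def each_view_nests(views):
    """views: dict mapping a hierarchy name to its list of spans."""
    return all(well_nested(spans) for spans in views.values())

# Verse drama: two verse lines and two speeches over the same 20 characters.
views = {
    "verse":  [(0, 10), (10, 20)],
    "speech": [(0, 15), (15, 20)],
}
combined = views["verse"] + views["speech"]
print(each_view_nests(views), well_nested(combined))
```

Each view is a tree by itself; only their combination overlaps, which is the situation the intersection approach is meant to exploit.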
Related work • Text Encoding Initiative SIG on overlapping markup • The ARCHway Project (Kentucky) • OSIS (Steve DeRose) • JITTs (Patrick Durusau and Brook O’Donnell) • LMNL project (Wendell Piez and Jeni Tennison) • Bielefeld Text Technology group? • Others ???
Thank you http://teksttek.aksis.uib.no/projects/mlcd