Usage • To run the examples, download bin.zip from this directory and extract the files. • The presentation must be in the same directory as the XML files, program files, and batch files. • Once the files have been extracted, open bin/LT XML Presenation.ppt • Note: the online version will not run the examples.
Why Markup Text? • All language processing applications (machine translation, information retrieval and extraction, text summarization, user/machine dialog systems, speech understanding and synthesis) manipulate language and data stored in an electronic format.
Why Markup Text? • For example, in textual data, markup for logical structure (section, paragraph, sentence, …) provides essential information for any language processing task. • Markup for words, dates, and names can be used for information retrieval, and markup for titles and footnotes can limit the data to be searched.
Why XML for annotation? • Growing industry standard • Support and Maintenance • Availability of Tools • Maintains SGML functionality. • Representation scheme (hierarchical structure) forces order, consistency in annotations
Why XML for annotation? • The cost of creating annotated data may also be very high. • Funders expect the cost to be amortized over several research and development projects. • So the industry needs a standard, usable encoding format that allows human readability – Hence, XML
Elements and Attributes • Example: <word POS="NN1">imagination</word> • Diagram labels: Start Tag, Attribute, Value, End Tag, Element • POS Tags
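As a rough illustration of the parts labeled on this slide, the sketch below parses the same element with Python's standard xml.etree.ElementTree module. Python is an assumption made for illustration here; LT XML itself is a C toolkit and is not used in this sketch.

```python
import xml.etree.ElementTree as ET

# Parse the single element from the slide (straight quotes, as XML requires).
elem = ET.fromstring('<word POS="NN1">imagination</word>')

print(elem.tag)         # the element name
print(elem.attrib)      # all attributes as a dictionary
print(elem.get("POS"))  # the value of the POS attribute
print(elem.text)        # the text content between start tag and end tag
```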
LT XML Version 1.2
What is it? • set of XML tools and a developer's tool-kit • C-based API • intended to process all XML documents which are well formed • Suite of tools which can be pipelined together and which communicate using the LT XML API • All use the same query language to access and manipulate subparts of XML documents • Allows simple tools to be composed together into complex applications
Who created LT XML? • Created by the Language Technology Group • Edinburgh • The LTG is a research and development group working in the area of natural language engineering. • Interesting spin-off company: Rhetorical Systems Ltd
LTG, why use XML? • LTG use XML in the context of collecting, standardizing, distributing and using very large text collections (10s and in some cases 100s of millions of words in size) • The corpora LTG works with are large and they have a very high density of markup (often each word has associated markup)
LTG, why use XML? • Take, for example, the task (common in linguistic applications) of tokenizing a corpus, that is, segmenting out the words, and then looking the results up in a lexicon. This task becomes more complex for SGML marked-up corpora (as for any marked-up corpus)
LTG, why use XML? • Parsing SGML is very hard and slow if you handle the full range of constructions, validate as you go, and provide reasonable error messages and/or error recovery • Parsing XML is relatively easy in all cases • The basic architecture underlying their approach is one in which they use a simplified form of SGML, i.e. XML
LT XML Architecture • Data Architecture: How is all the information included in an XML coded corpus organized and stored? • System Architecture: Organization of the software components which implement the LT XML API and previous LT NSL API
Data Architecture: Storage • Files: “storage units” • XML Documents: composed of a number of files by external entity referencing • Hyper-Documents: linked documents
So, what’s the point? • The implication of this is that corpus components can be hyper-documents, with low-density annotation being expressed in terms of links.
XML Links • Recommended practice in encoding annotated corpora is to maintain all or most annotations in separate documents • LTG is currently developing this idea into a shared database
Storage Example • A simple base file: <w id="w12">I</w> <w id="w13">need</w> <w id="w14">a</w> . . . <w id="w28">vacation</w> <c id="c4">.</c>
Storage Example • Standoff markup: <s xml-type="link" show="include" href="&f;#id(w12)..id(c4)"> </s>
Storage Example • What the application really sees: <s> <w id="w12">I</w> <w id="w13">need</w> <w id="w14">a</w> . . . <w id="w28">vacation</w> <c id="c4">.</c> </s> <s> . . . </s>
Storage Example • The original data may contain no markup at all. All markup can be retained in separate documents with links into the original based on offsets. • Known as “Standoff Markup” • Separating annotation from the material being annotated • Base material may be read-only • Material to be annotated may be large • Markup may involve multiple overlapping hierarchies
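The Storage Example slides can be approximated with a short Python sketch: a link element names a range of IDs in a base document, and the application sees the link with that range included. This is not how LT XML implements it. The function names (resolve_range, expand_link), the file name base.xml in the href, and the five-token base document (the slide elides the words between w14 and w28) are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Simplified base document; the slide elides the tokens between w14 and w28.
BASE = ('<doc><w id="w12">I</w><w id="w13">need</w><w id="w14">a</w>'
        '<w id="w28">vacation</w><c id="c4">.</c></doc>')

# Assumed href shape, modelled on the slide: "...#id(START)..id(END)"
HREF_RANGE = re.compile(r"#id\((\w+)\)\.\.id\((\w+)\)$")

def resolve_range(base_root, href):
    """Return the child elements of base_root from START to END inclusive."""
    match = HREF_RANGE.search(href)
    if match is None:
        raise ValueError(f"unsupported href: {href!r}")
    start_id, end_id = match.groups()
    children = list(base_root)
    ids = [child.get("id") for child in children]
    return children[ids.index(start_id):ids.index(end_id) + 1]

def expand_link(link, base_root):
    """Build a new element with the link's tag and the linked range inside."""
    expanded = ET.Element(link.tag)
    expanded.extend(resolve_range(base_root, link.get("href")))
    return expanded

base_root = ET.fromstring(BASE)
link = ET.fromstring('<s show="include" href="base.xml#id(w12)..id(c4)"/>')
sentence = expand_link(link, base_root)
print(ET.tostring(sentence, encoding="unicode"))
```

The base document is never modified, which matches the point that the base material may be read-only.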
System Architecture • XML applications • LT XML API layer • XML parser
XML Applications • Applications are designed to cover some commonly occurring needs
• sggrep: works like the grep program in searching a file for regular expressions
• sgmltrans: translate XML files into another format
• sgrpg: systematically transform an input document into a changed output document
• sgcount: count elements in an XML file
• knit: process compound documents using hyperlinks
• unknit: create hyperlinked files from XML files
• sgmltoken: text tokenization
• sgmlseg: simple segmenter
• sgmlsb: sentence boundary marker
• pesis: trivial version of James Clark's sgmls
• xmlnorm: XML normalizer
• textonly: strip out markup
• simpleq: example program
• simple: example program
• sgsort: sort XML elements
LT XML API • This is a collection of C functions and types which form a framework for generic SGML and XML processing tasks • This interface was designed before XML existed – originally the LT NSL program
XML Parser • Called RXP, which is also available as a standalone component
Query Language • NSL queries are a way of specifying particular nodes in the SGML document structure. Queries are strings which give a (partial) description of a path from the root of the SGML document to the desired SGML element(s). • A query is a sequence of terms separated by /, where each term describes an SGML element. • XML documents are basically tree structured • Queries specify a path from the root with selection restrictions based on attributes
Simple Query Example • ".*/TEXT/.*/P" • Describes any <P> element which occurs anywhere (at any level of nesting) inside a <TEXT> element which, in turn, can occur anywhere inside the top-level document element.
Visualizing Queries • The query CORPUS/DOC/TITLE/s means all s elements directly under TITLE elements that are directly under DOC
Visualizing Queries • The query CORPUS/DOC/./s means all s elements directly under any element that is directly under DOC
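The path queries shown so far can be mimicked with a small Python matcher. This is a sketch of the idea only, not the LT XML query engine: it handles element names, "." (any one element) and ".*", and it leaves out indices and attribute conditions. Treating ".*" as "zero or more levels of any elements" is an assumption based on the wording of the Simple Query Example, and the sample document and function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

def path_matches(terms, path):
    """True if the list of tag names `path` (root first) satisfies `terms`."""
    if not terms:
        return not path
    head, rest = terms[0], terms[1:]
    if head == ".*":
        # Assumption: ".*" stands for zero or more levels of any elements.
        return any(path_matches(rest, path[i:]) for i in range(len(path) + 1))
    if not path:
        return False
    if head == "." or head == path[0]:
        return path_matches(rest, path[1:])
    return False

def select(root, query):
    """Yield every element whose path from the root matches the query."""
    terms = query.split("/")

    def walk(elem, path):
        path = path + [elem.tag]
        if path_matches(terms, path):
            yield elem
        for child in elem:
            yield from walk(child, path)

    yield from walk(root, [])

# Hypothetical sample document, not taken from the presentation's files.
DOC = ET.fromstring(
    "<CORPUS><DOC>"
    "<TITLE><s>t1</s></TITLE>"
    "<TEXT><DIV><P>p1</P></DIV><P>p2</P></TEXT>"
    "</DOC></CORPUS>"
)

print([e.text for e in select(DOC, ".*/TEXT/.*/P")])
print([e.text for e in select(DOC, "CORPUS/DOC/TITLE/s")])
print([e.text for e in select(DOC, "CORPUS/DOC/./s")])
```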
Visualizing Queries • The query ./.[1]/.[2]/.[0]
textonly • Strip out markup • Outputs the text but not the markup from the input XML file • Example files: session.xml, ot.xml, textonlyexample1.bat, textonlyexample2.bat
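The effect of stripping markup can be sketched in a few lines of Python. This is not the textonly program itself, only an illustration of what it produces; the function name text_only and the sample string are assumptions.

```python
import xml.etree.ElementTree as ET

def text_only(xml_source):
    """Return the character data of an XML string with all tags removed."""
    root = ET.fromstring(xml_source)
    return "".join(root.itertext())

# Hypothetical input, modelled on the storage example earlier in the deck.
sample = ('<s><w id="w12">I</w> <w id="w13">need</w> <w id="w14">a</w> '
          '<w id="w28">vacation</w><c id="c4">.</c></s>')
print(text_only(sample))
```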
sggrep • Searches a file for regular expressions. • Supports two different command syntaxes: brevity or explicitness. The brief syntax is helpful when used in pipelines. • Example file: ot.xml (3,406 KB) • Find anywhere in OT where the name Mizraim occurs • Find anywhere in OT where the names Mizraim and Ludim occur
sgcount • Count elements in an XML file • This form is useful for running after sggrep, to see how many matching elements have been found. • Count all elements in OT • Find the number of occurrences of God in OT
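The search-then-count workflow of sggrep followed by sgcount can be imitated in Python as follows. The sketch does not use the real command-line syntax of either tool. The markup of ot.xml is not shown in the presentation, so the book and verse element names, the sample text, and the function names are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

def count_elements(root):
    """Count how often each element name occurs, in the spirit of sgcount."""
    return Counter(elem.tag for elem in root.iter())

def grep_elements(root, tag, pattern):
    """Return the `tag` elements whose text matches the regular expression."""
    regex = re.compile(pattern)
    return [e for e in root.iter(tag) if regex.search("".join(e.itertext()))]

# Hypothetical stand-in for ot.xml.
SAMPLE = ET.fromstring(
    "<book>"
    "<verse>And Mizraim begat Ludim</verse>"
    "<verse>In the beginning</verse>"
    "<verse>The sons of Ham: Cush and Mizraim</verse>"
    "</book>"
)

print(count_elements(SAMPLE))
print(len(grep_elements(SAMPLE, "verse", r"Mizraim")))
print(len(grep_elements(SAMPLE, "verse", r"Mizraim.*Ludim")))
```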
sgmltrans • Translate XML files into some other format • Based on other SGML programs, in that one specifies actions to do at SGML start tags, end tags and text content • Actions are specified using a rule file
sgmltrans: rule file • A rule consists of an LT XML query which describes the elements to which the rule will apply; and a pair of format strings, which specify the strings that will be printed when it encounters a start tag for a matching element and an end tag.
sgmltrans: simple rule file • Given the rule: .*/W "" "/$TAG\n" • Input: <W TAG="A">The</W> <W TAG="B">cat</W> • Output: The/A cat/B • Default rule: .* "" ""
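The rule on this slide can be imitated with a small Python sketch. It is not the sgmltrans implementation: a rule here is looked up by element name only rather than by a full LT XML query, the enclosing S element is added so that the input is a single well-formed document, and the names RULES, fill and translate are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Each rule maps an element name to (start format, end format).
# Simplification: the slide's rule query ".*/W" is reduced to the name "W".
RULES = {"W": ("", "/$TAG\n")}
DEFAULT_RULE = ("", "")  # mirrors the slide's default rule: print nothing

def fill(fmt, elem):
    """Replace each $NAME in fmt with the value of the element's NAME attribute."""
    return re.sub(r"\$(\w+)", lambda m: elem.get(m.group(1), ""), fmt)

def translate(elem, out):
    """Append start string, text content, children, and end string to out."""
    start, end = RULES.get(elem.tag, DEFAULT_RULE)
    out.append(fill(start, elem))
    out.append(elem.text or "")
    for child in elem:
        translate(child, out)
        out.append(child.tail or "")
    out.append(fill(end, elem))

root = ET.fromstring('<S><W TAG="A">The</W> <W TAG="B">cat</W></S>')
pieces = []
translate(root, pieces)
result = "".join(pieces)
print(result)
```

Each W element is printed as its text followed by a slash and the value of its TAG attribute, which gives the tokens The/A and cat/B shown on the slide.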
sgmltrans • A rule query which ends in # matches text content. These rules are called data rules. • Instead of a pair of start/end format strings, data rules contain a set of text transformations. • Does not yet do: "search string" "new string" • Example files: ot.xml, session.xml, rule.rules, sgmltransexample1.bat, sgmltransexample2.bat • Change Rules file? sgmltransexample3.bat
knit • Process compound documents using hyperlinks • Combining hyperlinked files into a single stream is a daily occurrence in LTG’s work on multimedia corpora • The following example is a cut-down version of a need which arose in LTG and Centre for Speech Technology Research's SOLE (Spoken Intelligent Labelling Explorer).
knit • File of tokenised words, words.xml • This is the target file • Corresponding file marked up with information about the information status of the terms, sem-elem.xml • This is the input file • The input file has less dense markup and is easier on the eye than the target file.
knit • Note that the href attribute of the sem-elem element in the input file is "&w;#id(w410w414w418)", which refers to the element with ID w410w414w418 in the file words.xml. • This ID is present in the target file. When knit processes this specification, it obtains the corresponding element from the target file.
knit • The href attribute of the eraseable element is "&w;#id(w422)..id(w470w474)", which refers to the range from w422 to w470w474 in words.xml. When knit processes this specification, it obtains all the elements in this range. In this example, these elements from the target completely replace the corresponding element from the input file.
knit • The DTD, solexml.dtd, specifies the actions which knit will perform by defining attributes on the sem-elem and eraseable elements. Here we are asking it to replace the contents when it sees sem-elem, but to replace the element itself when it sees eraseable.
knit • With all that said, an example … • Note that elements from the target file have been incorporated in the output, and that the sem-elem is still present in the output file, while the eraseable is absent • Example files: words.xml, sem-elem.xml, solexml.dtd, knitexample.bat, knitoutput.xml
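The two behaviours described for knit (replace the contents of sem-elem, replace the eraseable element itself) can be sketched in Python. This is an illustration of the idea, not the knit program. The ACTIONS table stands in for the attributes that solexml.dtd defines, the href syntax is shortened to "START..END", and the documents, IDs and function names are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical target file (tokenised words) and input file (sparse markup).
WORDS = ET.fromstring(
    '<words><w id="w1">the</w><w id="w2">old</w>'
    '<w id="w3">vase</w><w id="w4">broke</w></words>'
)
INPUT = ET.fromstring(
    '<doc><sem-elem href="w1..w3"/><eraseable href="w4..w4"/></doc>'
)

# Stand-in for the DTD-defined actions: keep the element and fill in its
# contents, or drop the element and put the linked elements in its place.
ACTIONS = {"sem-elem": "contents", "eraseable": "element"}

def target_range(href):
    """Return the target elements from START to END inclusive."""
    start_id, end_id = href.split("..")
    children = list(WORDS)
    ids = [child.get("id") for child in children]
    return children[ids.index(start_id):ids.index(end_id) + 1]

def knit(parent):
    """Return a copy of parent with every linking element resolved."""
    result = ET.Element(parent.tag, dict(parent.attrib))
    result.text = parent.text
    for child in parent:
        action = ACTIONS.get(child.tag)
        if action == "contents":
            kept = ET.SubElement(result, child.tag)
            kept.extend(target_range(child.get("href")))
        elif action == "element":
            result.extend(target_range(child.get("href")))
        else:
            result.append(knit(child))
    return result

output = knit(INPUT)
print(ET.tostring(output, encoding="unicode"))
```

In the printed result the sem-elem element is still present and wraps the linked words, while the eraseable element is gone and only its linked word remains.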
unknit • Create hyperlinked files from XML files • Not included with this version of LT XML • Suppose w.xml is an XML file which contains <w> markup around words; s.xml is an XML file which contains <s> markup around sentences consisting of a sequence of <w> elements. Running the command: C:>unknit w.xml w s <s.xml > out.xml will create the XML file out.xml which contains the <s> markup from s.xml, but with all <w> elements replaced by hyperlink(s) back to w.xml
pesis • Takes XML input and produces output in the form that nsgmls does (ESIS format). • ESIS: ISO 8879 Element Structure Information Set • Example files: session.xml, pesisexample.bat
xmlnorm • XML normalizer • Apparently trivial program which takes XML input and outputs the same. • By default, entities will be expanded and such validation as LT XML usually performs will occur. • Example files: lt.xml, xmlnormexample.bat
sgmltoken • Works by identifying upper case and lower case stretches of text and different forms of punctuation, and then tokenizing over that • All text inside <TEXT> elements is tokenized - split into tokens and marked up with <C> elements • “We make no claim that sgmltoken is a general useful tokenizer, it can function as a placeholder for a high-quality tokenizer, such as those used by LTCHUNK and LTPOS”
sgmltoken • Input: <TEXT> <BODY> <W TYPE="red">This</W> <W TYPE="blue">is</W> <W TYPE="green">it.</W> </BODY> </TEXT> • Output: <TEXT> <BODY> <W TYPE="red"> <C ID='C2.T1'>This</C> </W><W TYPE="blue"> <C ID='C4.T1'>is</C> </W><W TYPE="green"> <C ID='C6.T1'>it.</C></W> </BODY> </TEXT>
sgmlsb • Sentence boundary marker • Adds S elements to a file which has already been tokenized with sgmltoken • Output: <TEXT> <BODY> <S> <W TYPE="red"> <C ID='C2.T1'>This</C> </W><W TYPE="blue"> <C ID='C4.T1'>is</C> </W><W TYPE="green"> <C ID='C6.T1'>it.</C></W> </S> </BODY> </TEXT>