Flexible Interfaces in the Application of Language Technology to an eScience Corpus

C.J. Rupp, Ann Copestake, Simone Teufel & Benjamin Waldron, Computer Laboratory, University of Cambridge

Outline • Two key interfaces: • SciXML: XML markup for the logical structure of research papers • SAF: Standoff Annotation Formalism for diverse linguistic information • Both are coded in XML and designed for flexibility, • but what that means is distinct in the two cases.

SciBorg Architecture [Diagram labels: RSC papers, Nature papers, IUCr papers, Biology and CL (pdf), SciXML, OSCAR, RASP, ERG/PET, POS tagging, WSD, RMRS merge, anaphora, rhetorical analysis, tasks, standoff annotation]

SciBorg Corpus • A corpus of Chemistry research papers from 3 publishers: • The Royal Society of Chemistry (RSC), • The Nature Publishing Group (NPG), and • The International Union of Crystallography (IUCr). • Provided in the publishers’ XML markup, but with distinct markup schemes.

Conversion to SciXML [Diagram labels: RSC papers, PLOS Biology papers, Nature papers, IUCr papers, Biology and CL (pdf), SciXML]

SciXML Interface Requirements • Extensible • So we can add additional publications • Neutral • So as not to raise any IP issues • Compatible with existing software • Expressive enough • For adequate rendering in applications

Rendering Issues • We assume an application will display the paper • Probably in Hypertext • We must retain enough information to do this effectively • Previous versions of SciXML have focused on the logical structure of scientific papers.

The Development of SciXML • Developed for a medical corpus (2000) • Extracted from HTML web pages • Extended for a Computational Linguistics corpus • First from LaTeX • Then from PDF via OCR • Now defined as a RELAX NG schema

Legacy Issues • The original SciXML schema had to interpret formatting. • Lacking any organisation by function • Dictating a flat paragraph structure • Collecting all floats and notes in end lists • But excluding text formatting

Adapted from Publishers’ Markup • List and Table formats • Inline text formatting • Functional paragraph types (e.g. Theorem) • Position markers for floats

Conversion by XSLT
• Most constructs can be handled quite simply
  <xsl:template match="sec">
    <DIV DEPTH="{@level}">
      <xsl:apply-templates/>
    </DIV>
  </xsl:template>
• Making the script virtually a stylesheet
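For readers without an XSLT processor to hand, the same renaming can be sketched in Python. This is a minimal illustration, not the project's converter: the `sec` element, its `level` attribute and the `DIV`/`DEPTH` output come from the slide, while the function name `sec_to_div` and the sample input are hypothetical. Unlike the single template shown, the sketch copies every other element unchanged.

```python
import xml.etree.ElementTree as ET

def sec_to_div(elem):
    """Recursively copy elem, renaming <sec level="n"> to <DIV DEPTH="n">."""
    if elem.tag == "sec":
        out = ET.Element("DIV", {"DEPTH": elem.get("level", "")})
    else:
        # Assumption of this sketch: all other constructs are copied as they are.
        out = ET.Element(elem.tag, dict(elem.attrib))
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(sec_to_div(child))
    return out

# Hypothetical publisher markup with two nested sections.
publisher_xml = '<sec level="1"><title>Results</title><sec level="2"><p>Text</p></sec></sec>'
converted = ET.tostring(sec_to_div(ET.fromstring(publisher_xml)), encoding="unicode")
print(converted)
```

Because the recursion mirrors `<xsl:apply-templates/>`, nested sections become nested `DIV` elements with no extra code.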

Schema Development • Both the XSLT stylesheet and RNG Schema have been developed on a naïve basis. • Coding conversion for constructs that occur in the corpus • Eventually we have a big enough bag of tricks to make extension quite painless.

SciXML Constructs • Paper Identifiers • Unique identifiers, titles and authors • Sections • Divisions embed recursively with headers • Inline text markup • Font settings and LaTeX inclusion • Paragraph structure • Paragraph elements and sub paragraph boundaries in lists, abstracts, captions, etc.

SciXML Constructs • Citations and Cross References • Citations are significant, but we also need textual cross references, compound references, footnote markers, float markers. • Equations and examples • (Linguistic) examples and equation environments • Lists, tables and figures • Lists, including definition lists, tables, figures, and various other sections for (external) data. • Bibliography • The bibliography section is important for citation tracking.

RNG Schema (Fragment)
<define name="PAPER.ELEMENT">
  <element name="PAPER">
    <ref name="METADATA.ELEMENT" />
    <optional><ref name="PAGE.ELEMENT" /></optional>
    <ref name="TITLE.ELEMENT" />
    <optional><ref name="AUTHORLIST.ELEMENT" /></optional>
    <optional><ref name="ABSTRACT.ELEMENT" /></optional>
    <element name="BODY">
      <zeroOrMore><ref name="DIV.ELEMENT" /></zeroOrMore>
    </element>
    <optional>
      <element name="ACKNOWLEDGMENTS">
        <zeroOrMore>
          <choice>
            <ref name="REF.ELEMENT" />
            <ref name="INLINE.ELEMENT" />
          </choice>
        </zeroOrMore>
      </element>
    </optional>
    <optional><ref name="REFERENCELIST.ELEMENT" /></optional>
    <optional><ref name="AUTHORNOTELIST.ELEMENT" /></optional>
    <optional><ref name="FOOTNOTELIST.ELEMENT" /></optional>
    <optional><ref name="FIGURELIST.ELEMENT" /></optional>
    <optional><ref name="TABLELIST.ELEMENT" /></optional>
  </element>
</define>
<define name="REFERENCELIST.ELEMENT">
  <element name="REFERENCELIST">
    <zeroOrMore><ref name="REFERENCE.ELEMENT" /></zeroOrMore>
  </element>
</define>
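The content model of `PAPER` in this fragment is a fixed sequence with optional members, which can be read as a regular expression over child element names. The sketch below checks only that top-level order and is not a RELAX NG validator. Only `PAPER`, `BODY` and `ACKNOWLEDGMENTS` are named as elements in the part of the fragment used here; every other element name is guessed from its pattern name and is an assumption, as is the function name `paper_children_ok`.

```python
import re
import xml.etree.ElementTree as ET

# Child order of PAPER as in the fragment; "?" marks the optional children.
# Names other than PAPER, BODY and ACKNOWLEDGMENTS are assumed to match the
# pattern names with ".ELEMENT" removed.
CONTENT_MODEL = re.compile(
    r"METADATA (PAGE )?TITLE (AUTHORLIST )?(ABSTRACT )?BODY "
    r"(ACKNOWLEDGMENTS )?(REFERENCELIST )?(AUTHORNOTELIST )?"
    r"(FOOTNOTELIST )?(FIGURELIST )?(TABLELIST )?"
)

def paper_children_ok(paper):
    """Check that the direct children of PAPER appear in the schema's order."""
    sequence = "".join(child.tag + " " for child in paper)
    return paper.tag == "PAPER" and CONTENT_MODEL.fullmatch(sequence) is not None

good = ET.fromstring("<PAPER><METADATA/><TITLE/><ABSTRACT/><BODY/></PAPER>")
bad = ET.fromstring("<PAPER><TITLE/><METADATA/><BODY/></PAPER>")
print(paper_children_ok(good), paper_children_ok(bad))
```

A real deployment would validate against the schema itself with a RELAX NG validator; the regular expression is only a way of reading the sequence.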

Language Technology in SciBorg • The goal is Information Extraction from Chemistry research papers. • Various analysis components must interface • Different levels of analysis • Different analysis methods • Specialised and General analysers • But a common semantic representation: RMRS (Robust Minimal Recursion Semantics) • And a common interface structure: SAF

Multiple Analysis Components • PET/ERG: “deep” analysis using detailed (HPSG) grammars and lexicons • RASP: Robust shallow parsing with a statistically trained grammar • Each strand has a tokeniser, tagger and parser • OSCAR-3 analyses Chemistry terms and notation

Getting the Text out of SciXML • Only some spans of marked up text contain linguistic text. • Using SciXML we can divide elements into: • Text (<P>), Markup (<IT>), Non-Text elements (<SUP>). • The analysers process, ignore and skip these, respectively. • We also use OSCAR-3 to detect data sections without significant text portions.
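The three-way division can be sketched as a small recursive extractor. This is an illustration under stated assumptions rather than the SciBorg implementation: the slide names only one example element per class, so the sets `TEXT` and `NON_TEXT`, the helper names and the sample paragraph are hypothetical, and every tag not listed as non-text is treated as ignorable markup.

```python
import xml.etree.ElementTree as ET

TEXT = {"P"}        # elements whose content is linguistic text (processed)
NON_TEXT = {"SUP"}  # elements skipped together with their content
# Any other tag (for example IT) is treated as markup: the tag is ignored
# and the enclosed text is kept.

def linguistic_text(elem):
    """Return the text under elem, ignoring markup tags and skipping non-text."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in NON_TEXT:
            parts.append(linguistic_text(child))
        # Tail text follows the child but belongs to the parent, so keep it.
        parts.append(child.tail or "")
    return "".join(parts)

def paragraphs(root):
    """Collect the linguistic text of every text element below root."""
    return [linguistic_text(e) for e in root.iter() if e.tag in TEXT]

sample = "<BODY><P>The yield was <IT>calculated</IT> for the product<SUP>12</SUP>.</P></BODY>"
print(paragraphs(ET.fromstring(sample)))
```

Keeping the tail text of a skipped element matters: the full stop after the superscript belongs to the sentence even though the superscript itself is dropped.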

SciBorg Parsing Architecture [Diagram labels: SciXML, sentence splitter, OSCAR, tokeniser for RASP, POS tagging, RASP parser, tokeniser for ERG, PET parser, SAF lattice]

SAF Interface Requirements • Support results from different analysis components. • Allow the combination of complementary results • But they will assign conflicting structures • Ambiguity is common • Analyses will form a graph or lattice (cf. chart parsing and word lattices)
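A lattice of this kind can be pictured as a set of edges between numbered nodes, where each path from the first node to the last is one consistent analysis. The sketch below is illustrative only: the node names, the second tokenisation of the formula and the function name `paths` are hypothetical, and the lattice is assumed to be acyclic.

```python
from collections import defaultdict

# Each edge is (source node, target node, value). The split of the formula
# into "C11H18" and "O3" is an invented competing tokenisation.
edges = [
    ("v0", "v1", "calculated"),
    ("v1", "v2", "for"),
    ("v2", "v4", "C11H18O3"),
    ("v2", "v3", "C11H18"),
    ("v3", "v4", "O3"),
]

def paths(edges, start, end):
    """Enumerate every sequence of edge values leading from start to end."""
    outgoing = defaultdict(list)
    for source, target, value in edges:
        outgoing[source].append((target, value))

    def walk(node):
        if node == end:
            yield []
            return
        for target, value in outgoing[node]:
            for rest in walk(target):
                yield [value] + rest

    return list(walk(start))

all_paths = paths(edges, "v0", "v4")
for path in all_paths:
    print(path)
```

Both tokenisations coexist in the same structure, so a later component can pick either one without the earlier component having to resolve the conflict.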

Motivating Standoff
• XML can only combine linguistic and formatting markup if they share the same tree structure
• Rendered text: calculated for C11H18O3
• Formatting markup: <IT>calculated for</IT> C<SB>11</SB>H<SB>18</SB>O<SB>3</SB>
• Linguistic markup: <v>calculated</v> <pp>for <ne>C11H18O3</ne></pp>
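The conflict in this example can be made concrete by treating each element as a character span over the plain text and testing whether two spans cross, that is, overlap without one containing the other. The offsets below are computed for this sketch over the string "calculated for C11H18O3" and are assumptions, as are the function name `crosses` and the choice to list only the first `SB` element.

```python
def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (a0, a1), (b0, b1) = a, b
    return a0 < b0 < a1 < b1 or b0 < a0 < b1 < a1

text = "calculated for C11H18O3"
# Spans are (start, end) character offsets with an exclusive end.
formatting = {"IT": (0, 14), "SB": (16, 18)}
linguistic = {"v": (0, 10), "pp": (11, 23), "ne": (15, 23)}

conflicts = [
    (f_name, l_name)
    for f_name, f_span in formatting.items()
    for l_name, l_span in linguistic.items()
    if crosses(f_span, l_span)
]
print(conflicts)
```

The italic span ends inside the prepositional phrase, so no single XML tree can hold both elements; the subscript, by contrast, nests cleanly inside the named entity.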

Standoff Annotation • A common solution is to separate the flow of text from the annotations representing its analysis • The connection is formed by indexing at some consistent common level • SAF supports character offset indexing and XPoint indexing

Character Offset Indexing
Formatted text: Come here!
Raw text: "<p>Come <i>here</i>!</p>"
Unicode character points (each number marks the point before the character above it; point 24 follows the final ">"):
  <  p  >  C  o  m  e     <  i  >  h  e  r  e  <  /  i  >  !  <  /  p  >
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Tokens:
<token from='3' to='7' value='Come'/>
<token from='11' to='15' value='here'/>
<token from='19' to='20' value='!'/>

XPoint Indexing
Root (/)
  'P' (/1)
    text (/1/1): "Come "
    'I' (/1/2)
      text (/1/2/1): "here"
    text (/1/3): "!"

Index Conversion • We currently use both character offset and XPoint indexing. • The choice is influenced by the XML parser. • This implies maintaining a conversion table for a (SciXML) file. • /1/3/0 <-> 19
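A conversion table of this kind can be sketched by scanning the raw markup once and recording, for each text node, its XPoint-style path and the character point where it starts. This is a toy scanner under stated assumptions, not the SciBorg code: it handles only simple tags without attributes, comments or entities, numbers children from 1 counting both elements and text nodes, and the names `conversion_table` and `to_offset` are hypothetical.

```python
import re

def conversion_table(raw):
    """Map XPoint-style paths of text nodes to their start character points."""
    table = {}
    path = []        # 1-based child indices of the currently open elements
    counters = [0]   # number of children seen so far at each open level
    for m in re.finditer(r"<(/?)(\w+)>|([^<]+)", raw):
        closing, tag, text = m.groups()
        if text is not None:          # a text node
            counters[-1] += 1
            node = "/" + "/".join(map(str, path + [counters[-1]]))
            table[node] = m.start()
        elif closing:                 # a closing tag
            path.pop()
            counters.pop()
        else:                         # an opening tag
            counters[-1] += 1
            path.append(counters[-1])
            counters.append(0)
    return table

def to_offset(table, node, point):
    """Character point reached `point` characters into text node `node`."""
    return table[node] + point

raw = "<p>Come <i>here</i>!</p>"
table = conversion_table(raw)
print(table)
```

Converting in the other direction is a lookup for the text node whose span contains the character point, followed by a subtraction of that node's start.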

Standards for Standoff Annotation • MAF: ISO standard for morphological annotation • SMAF: an emergent standard extending this to sentence, e.g. for parser input • SAF: includes all annotations for a paper in one file

Types of SAF Annotation
• Sentence segments
<annot type='sentence' id='s133' from='42065' source='v4987' target='v5154' to='43039' value='…calculated for C11H18O3….'/>
• Tokens
<annot type='token' id='t5151' from='42988' to='43030' deps='s133' source='v5150' target='v5151' value='calculated'/>
<annot type='token' id='t5152' from='43031' to='43034' deps='s133' source='v5151' target='v5152' value='for'/>
<annot type='token' id='t5153' from='43035' to='43043' deps='s133' source='v5152' target='v5153' value='C11H18O3'/>

Types of SAF Annotation
• Part of Speech (POS) Tags
<annot type='pos' id='p5151' deps='t5151' source='v5150' target='v5151' value='VVN'/>
<annot type='pos' id='p5152' deps='t5152' source='v5151' target='v5152' value='IF'/>
<annot type='pos' id='p5153' deps='t5153' source='v5152' target='v5153' value='NP1'/>
• OSCAR (NER) mark up
<annot from="/1/5/6/27/51/2/83.1" to="/1/5/6/27/51/2/88/1.1" type="oscar" id="o554"><slot name="type">compound</slot><slot name="surface">C11H18O3</slot><slot name="provenance">formulaRegex</slot></annot>
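The `deps` attribute is what links one level of annotation to another. As a minimal sketch, the token and POS annotations from the slides can be loaded and joined through it. The `<saf>` wrapper element is a hypothetical container added so that the snippet parses, and the code assumes each `deps` value names a single annotation, as in these examples.

```python
import xml.etree.ElementTree as ET

# Annotations copied from the slides; the <saf> root is a hypothetical wrapper.
saf = """<saf>
<annot type='token' id='t5151' from='42988' to='43030' deps='s133' source='v5150' target='v5151' value='calculated'/>
<annot type='token' id='t5152' from='43031' to='43034' deps='s133' source='v5151' target='v5152' value='for'/>
<annot type='token' id='t5153' from='43035' to='43043' deps='s133' source='v5152' target='v5153' value='C11H18O3'/>
<annot type='pos' id='p5151' deps='t5151' source='v5150' target='v5151' value='VVN'/>
<annot type='pos' id='p5152' deps='t5152' source='v5151' target='v5152' value='IF'/>
<annot type='pos' id='p5153' deps='t5153' source='v5152' target='v5153' value='NP1'/>
</saf>"""

# Index every annotation by its id, in document order.
annots = {a.get("id"): a for a in ET.fromstring(saf).iter("annot")}

# Pair each POS tag with the value of the token it depends on.
tagged = [
    (annots[a.get("deps")].get("value"), a.get("value"))
    for a in annots.values()
    if a.get("type") == "pos"
]
print(tagged)
```

Because the POS annotations carry no offsets of their own, the token they depend on is the only route back to the text.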

Types of SAF Annotation
• RMRS analyses:
<rmrs cfrom='42329' cto='43303'>
<label vid='420'/>
…
<ep cfrom='43258' cto='43288'><gpred>proper_q_rel</gpred><label vid='409'/><var sort='x' vid='410'/></ep>
<ep cfrom='43258' cto='43288'><gpred>named_rel</gpred><label vid='411'/><var sort='x' vid='410'/></ep>
…
<rarg><rargname>RSTR</rargname><label vid='409'/><var sort='h' vid='412'/></rarg>
<rarg><rargname>BODY</rargname><label vid='409'/><var sort='h' vid='413'/></rarg>
<rarg><rargname>CARG</rargname><label vid='411'/><constant>c11h18o3</constant></rarg>
…
<hcons hreln='qeq'><hi><var sort='h' vid='412'/></hi><lo><label vid='411'/></lo></hcons>
</rmrs>
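The `cfrom` and `cto` attributes tie each elementary predication back to a character span, and a `CARG` argument attaches a constant to the predication that shares its label. The sketch below reads those links from a reduced copy of the fragment. The elided parts of the slide are simply left out, so this is not a complete RMRS, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A reduced copy of the slide's fragment; the elided material is omitted.
rmrs = """<rmrs cfrom='42329' cto='43303'>
<ep cfrom='43258' cto='43288'><gpred>proper_q_rel</gpred><label vid='409'/><var sort='x' vid='410'/></ep>
<ep cfrom='43258' cto='43288'><gpred>named_rel</gpred><label vid='411'/><var sort='x' vid='410'/></ep>
<rarg><rargname>CARG</rargname><label vid='411'/><constant>c11h18o3</constant></rarg>
</rmrs>"""

root = ET.fromstring(rmrs)

# Index constant arguments by the label of the predication they attach to.
constants = {
    rarg.find("label").get("vid"): rarg.findtext("constant")
    for rarg in root.iter("rarg")
    if rarg.find("constant") is not None
}

# For each predication: predicate name, character span, attached constant.
eps = [
    (
        ep.findtext("gpred"),
        int(ep.get("cfrom")),
        int(ep.get("cto")),
        constants.get(ep.find("label").get("vid")),
    )
    for ep in root.iter("ep")
]
print(eps)
```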

SAF Flexibility • The standoff supports a variety of annotation types • Which communicate between different levels of analysis • And between different analysis paths • Hence it is also the main route for communication in the architecture

SciXML Flexibility • A common representation for the logical structure and essential formatting of research papers • Conversion from various publishers’ markup schemes • And, also, from HTML, LaTeX and PDF • Applied to several disciplines
