Semantic Knowledge about XML Document Structures Henning Lobin Petra Saskia Bayerl Hagen Langer Harald Lüngen Georg Rehm Justus-Liebig-Universität Gießen www.text-technology.de
Overview • Introduction • Project approach • Corpora • Annotation levels • Analysis • Topic occurrences • Topics and structural positions • Rhetorical functions of structural elements • n-gram statistics • Automated classification experiment • Conclusions and Future Work
Introduction: Semantics of Generic Document Structures Generic Document Structure • Text type • Conventional written form • Structurally coherent • E.g., scientific articles in different disciplines Properties of content • Lexical properties: e.g., frequencies, collocations, ... • Syntactic properties: e.g., phrases • Semantic properties: e.g., thematic structure • Pragmatic properties: rhetorical structure, intentionality, situatedness
Introduction: Project goal Subject of the study: The communicative knowledge related to a specific text type T which supports the interpretation of the meaning of a text t ∈ T. Goal: A formal representation of that knowledge such that it can be exploited in text-technological applications. Methodology: Empirical, corpus-based investigation of a specific text type T (T = Scientific Article).
Introduction: Aspects of investigation • Empirical aspects • Investigation of subgeneric variants of T • Identification of characteristic properties of parts of a text t ∈ T • Formal aspects • Modelling of lexical, syntactic, semantic, and pragmatic text type properties • Representation of superordinate knowledge structures which include knowledge of the text type • Technological aspects • Illustration of the influence of text type knowledge on text processing in prototypical applications • Extension of the XML document grammar concept
Project approach: Corpora • Text type: Scientific Article • Disciplines: Psychology (highly standardised structure) Linguistics (less standardised structure)
Project approach: Annotation levels • Textual semantics of specific text types is reflected in • Layout • Typical sequences and hierarchy of text components • Thematic development • Rhetorical structure • Semantics is thus represented at three independent levels: • Structural: layout and surface structuring of text → DocBook • Thematic: topics of text segments (aboutness) → XML Schema Description • Rhetorical: relations between text segments in reader-author interaction → XML Schema Description
Annotation levels: Thematic structure • The themes that text segments deal with are represented in the form of hierarchically structured topics • This notion can be extended to higher-level units • The proposed thematic structure is based on approaches by Kando (1997) and Teufel (1999). Goals: • Extension and refinement of topic sets • Separation of functional and thematic categories • In total, 120 topics are assumed "A topic is some function determining about which item something is being said. [...] The topic of a sentence has the particular cognitive function of selecting a unit of information or concept from knowledge." (van Dijk, 1977)
Annotation levels: Thematic structure Full thematic schema: t120
Annotation levels: Thematic structure Full thematic schema: t120 (extract)
Annotation levels: Thematic structure Full thematic schema: t120 • Reduction by cutting off at a certain level of depth • 18 terminal categories left
Annotation levels: Thematic structure Reduced thematic Schema: t018 (used in classification experiment)
Annotation levels: Markup for thematic structure • Schema Description (thematic-hierarchical.xsd) representing: • Hierarchy of topics in a scientific article • Their canonical order • But: Order in an instance (a specific article) may differ! • Therefore an alternative, "flat" XML Schema is derived from the original one by means of an XSLT stylesheet (thematic-lgv.xsd) • Elements <group> (empty) and <segment> • The hierarchy is still encoded using ID/IDREF attributes
Annotation levels: Markup for thematic structure
Hierarchical: <content> <problem> <background> From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996).5 </background> </problem> ... </content>
Flat: <group id="g19" parent="g3" topic="problem"/> <segment id="s163" parent="g19" topic="background"> From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996).5 </segment> <segment id="s163b" parent="g19" topic="researchTopic"> ... </segment>
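The step from the hierarchical to the flat encoding can be sketched as follows. The slides state that the flat schema is derived with an XSLT stylesheet; the Python below is only an illustration of the same idea for instance documents. The function name `flatten`, the wrapper element `flat`, and the sequential ID scheme (`g1`, `s1`, ...) are assumptions made for this sketch, not the project's actual conventions.

```python
import xml.etree.ElementTree as ET


def flatten(content):
    """Turn nested topic elements into a flat list of <group>/<segment> elements.

    Hypothetical sketch: IDs are numbered sequentially, and text that sits
    directly inside a non-leaf topic element is ignored.
    """
    flat = ET.Element("flat")  # wrapper element, assumed for illustration
    counters = {"g": 0, "s": 0}

    def new_id(prefix):
        counters[prefix] += 1
        return f"{prefix}{counters[prefix]}"

    def visit(elem, parent_id):
        for child in elem:
            attrs = {"topic": child.tag}
            if parent_id is not None:
                attrs["parent"] = parent_id  # hierarchy kept as an ID reference
            if len(child):
                # Topic with sub-topics: becomes an empty <group>
                attrs["id"] = new_id("g")
                ET.SubElement(flat, "group", attrs)
                visit(child, attrs["id"])
            else:
                # Leaf topic: becomes a <segment> that carries the text
                attrs["id"] = new_id("s")
                segment = ET.SubElement(flat, "segment", attrs)
                segment.text = (child.text or "").strip()

    visit(content, None)
    return flat


doc = ET.fromstring(
    "<content><problem>"
    "<background>High stakes civil cases ...</background>"
    "<researchTopic>...</researchTopic>"
    "</problem></content>"
)
for element in flatten(doc):
    print(element.tag, element.attrib)
```

Because each element only points to its parent by ID, segments in the flat form can appear in any order, which is the property the flat schema is meant to provide.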
Annotation levels: Rhetorical structure Rhetorical level: rhetorical relations between the (simple and complex) propositions in a text (RST, Mann and Thompson). [Figure: RST analysis of the example sentence, relation label: Evidence] From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996).5
Annotation levels: Rhetorical structure 41 relations for RST analyses of scientific articles • 31 of the 33 relations in ExtMT.rel provided by O'Donnell (2000) • 10 additional relations • 8 from the set by Carlson/Marcu (2001) • 2 newly defined: Assign, Assigned • Example of Assigned ("broken promise"): Since anticipated positive expectations for new periods are often disappointed by life
Annotation levels: Markup for rhetorical structure • Mann and Thompson's (1988) four constraints on RST analyses: Completeness, Connectedness, Uniqueness, and Adjacency • RST trees are well-suited for an XML representation • hypo-para.xsd: The elements <hypo> and <para> explicitly mark mononuclear vs. multinuclear relations • Annotations are prepared by the two human annotators using RSTTool by O'Donnell (2000) and annotation guidelines • A Python/XSLT program converts the flat XML output format into our hypo-para format • So far, selected sections of 15 English psychological articles have been annotated
Annotation levels: XML-based multi-layer annotations (A2) <sect1> <para> From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996). <footnoteref linkend="i5">5</footnoteref> </para> </sect1> Structural <segment id="s24" parent="g6" topic="background"> From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996).5 </segment> Thematic <hypo relname="evidence"> <nuk id="i25"> <t id="ti25">From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet</t> </nuk> <sat id="i27"> <t id="ti27">(see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996).5</t> </sat> </hypo> Rhetorical
Workflow [Diagram; surviving labels: text type selection; formalisation (DocBook schema, RST schema, text type schema); text type corpus; analysis and annotation (three times), yielding annotated corpora (DocBook, RST, thm); evaluation; formalisation and integration; text type parameters; extended text type schema; development]
Analysis: Research questions • Systematic differences between disciplines concerning • Occurrences of topics, typical structuring → Document classification • Typical relations between configurations on different levels, e.g., thematic ↔ rhetorical or thematic ↔ structural → Text parsing, automatic annotation • Linguistic features of topics and rhetorical text segments (text type parameters) → Automatic annotation, information extraction • Representational framework for relations between levels
Analysis: Overview • Frequency of topics (linguistics vs. psychology) • Topics at structural positions (linguistics vs. psychology) (structural position e.g. /article[1]/sect1[2]) • Rhetorical functions of structural elements, e.g., captions • Linguistic features in thematic topics: unigrams and n-grams • Cluster-analysis: Automatic thematic labelling of text segments
Analysis: Topic occurrences Conclusion Different relative topic frequencies in psychology and linguistics
Analysis: Topics at structural positions: Method • Multiple XML annotations (art-01.doc, art-01.thm, ...) → Python → Prolog representation (art-01-doc-thm.pl, ...) • Prolog: querying of the inclusion relation between doc elements and thm segments → XML representation of inclusion instances (art-01-inclusion.xml, ...) • cat → XML representation of all inclusion instances found in the corpus (inclusion_instances.xml) • XSLT → statistics over inclusion instances (scores.xml)
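The inclusion query at the centre of this pipeline can be illustrated with a small Python sketch. It assumes that both annotation layers cover the same primary text, so that every doc element and every thm segment can be described by character offsets; the tuple layout and the function names are hypothetical. The slides do not say how the figures in scores.xml are computed, so the relative frequency used here is an assumption, not the project's statistic.

```python
from collections import Counter


def inclusion_instances(doc_spans, thm_spans):
    """Yield (position, topic) for every thm segment lying inside a doc element.

    doc_spans: iterable of (xpath_position, start, end)
    thm_spans: iterable of (topic, start, end)
    Offsets are character offsets into the shared primary text (assumed).
    """
    for position, d_start, d_end in doc_spans:
        for topic, t_start, t_end in thm_spans:
            if d_start <= t_start and t_end <= d_end:
                yield position, topic


def topic_shares(instances):
    """Relative frequency of each topic among the segments at each position."""
    pair_counts = Counter(instances)
    position_totals = Counter()
    for (position, _topic), n in pair_counts.items():
        position_totals[position] += n
    return {pair: n / position_totals[pair[0]] for pair, n in pair_counts.items()}


doc_spans = [("/article[1]/sect1[1]", 0, 100), ("/article[1]/sect1[2]", 100, 200)]
thm_spans = [("background", 0, 40), ("researchTopic", 40, 100),
             ("sample", 100, 150), ("sample", 150, 200)]
shares = topic_shares(inclusion_instances(doc_spans, thm_spans))
for (position, topic), share in sorted(shares.items()):
    print(position, topic, round(share, 2))
```

The nested loop is quadratic in the number of spans, which is acceptable for single articles; sorting spans by start offset would be the obvious refinement for larger inputs.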
Analysis: Topics at structural positions (Psychology, t018)
[Figure, shown in several build-up steps: diagram relating structural positions (article) to t018 topics (content). Only the node labels and the values printed next to them survive.]
Structural positions: abstract .20, sect1[1] (no value), sect1[2] .23, sect1[3] .35, sect1[4] .27, sect1[5] .15, sect1[6] .26, sect1[7] .41, para[1] .37, para[2] .41, para[2] .31
Topics: researchQuestion .20, dataCollection .37, interpretation .48, researchTopic_\ .16, theory_frm .40, results .49, method_evd_\ .59, conclusion .54, measures .61, framework_\ .15, background_\ .15, rationale .11, material .35, data .50, dataAnalysis .36, concepts .19, othersWork .16, sample .49
Higher-level topic nodes: background, researchTopic, framework, method_evd, problem, answers, evidence
Analysis: Topics at structural positions (Linguistics, t018)
[Figure, shown in several build-up steps: diagram relating structural positions (article) to t018 topics (content). Only the node labels and the values printed next to them survive.]
Structural positions: abstract (no value), sect1[1] (no value), sect1[2] .18, sect1[3] .20, sect1[4] .18, sect1[5] .27, sect1[6] .25, sect1[7] .59, para[1] .24, para[2] .14, para[2] .17
Topics: researchQuestion .31, dataCollection .62, interpretation .31, researchTopic_\ .33, theory_frm (no value), results .33, method_evd_\ .67, conclusion .38, measures .38, framework_\ .30, dataAnalysis .45, background_\ .25, rationale .100, material .50, data .54, concepts .33, sample .29, othersWork .26
Higher-level topic nodes: background, researchTopic, framework, method_evd, problem, answers, evidence
Analysis: Topics at structural positions: Conclusions • XML-structural position is a good text type parameter, e.g., to serve as a feature in automatic text segment classification • The choice of structural position parameters • must not be too general (sect1, para) • must not be too specific (/article[1]/sect1[1]/sect2[3]/sect3[1]/para[4]/log:table[2]) • Differences between Linguistics and Psychology: • Psychology: Subtopics of evidence abound in sect1[2] • Linguistics: Subtopics of evidence are distributed over sect1[2], sect1[3], and sect1[4]
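The trade-off between overly general and overly specific positions can be made concrete with a small helper that cuts an XPath-like position to a chosen depth and optionally drops the positional indices. This is a minimal sketch; the function name `generalise` and its parameters are hypothetical, and it only handles simple child-step paths like the ones on the slide.

```python
import re


def generalise(xpath, depth, keep_index=True):
    """Cut a position such as /article[1]/sect1[2]/para[4] to `depth` steps.

    keep_index=False additionally strips the [n] predicates, which gives the
    very general form (e.g. /article/sect1).
    """
    steps = [step for step in xpath.split("/") if step][:depth]
    if not keep_index:
        steps = [re.sub(r"\[\d+\]$", "", step) for step in steps]
    return "/" + "/".join(steps)


full = "/article[1]/sect1[1]/sect2[3]/sect3[1]/para[4]/log:table[2]"
print(generalise(full, 2))                    # /article[1]/sect1[1]
print(generalise(full, 2, keep_index=False))  # /article/sect1
```

Which depth works best as a classification feature is an empirical question; the slide only reports that both extremes are unsuitable.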
Analysis: Rhetorical functions of structural elements Which rhetorical relation dominates <rst:sat> elements whose primary data are identical to a <log:caption> element? • Corpus: psy-engl-pte15 (15 articles) • 12 occurrences of <log:caption> • Analysis using seit.pl • Not statistically significant
Analysis: Rhetorical functions of structural elements [Figure; surviving node labels: HYPO, Interpretation, FIGURE, ?, CAPTION, NUK, SAT, IMG] • seit.pl currently does not support complex queries for correspondences between whole rhetorical schemata and structural configurations • Not enough data: more macrostructure annotations are needed to check rhetorical functions of structural elements
Analysis: n-gram statistics Method: corpus query tool operating on thm-annotations stored in Tamino-DB
Analysis: n-gram statistics <topic count="1453" wordforms="50074" name="results" wordformsPercent="14.46"/> <wordformNgramsNclusters floor="2" verbose="['true']" n="4"> <ngrams> <ngram count="81">F|(|1|,</ngram> <ngram count="68">s|.|e|.</ngram> <ngram count="66">S|.|D|.</ngram> <ngram count="66">,|F|(|1</ngram> <ngram count="52">,|S|.|D</ngram> <ngram count="45">,|p|.|05</ngram> <ngram count="40">,|s|.|e</ngram> <ngram count="37">,|p|0|.</ngram> <ngram count="33">,|p|.|01</ngram> <ngram count="30">p|.|05|.</ngram> <ngram count="30">F|(|2|,</ngram> <ngram count="29">,|p|0|:</ngram> <ngram count="28">,|SD|1|.</ngram> <ngram count="26">)|,|F|(</ngram> <ngram count="24">children|in|the|deception</ngram> <ngram count="21">m|.|s|.</ngram> <ngram count="21">P|.|0001|)</ngram> <ngram count="21">.|s|.|e</ngram> <ngram count="21">,|m|.|s</ngram> <ngram count="20">in|the|guilty|condition</ngram> <ngram count="20">as|a|function|of</ngram> <ngram count="19">the|other|hand|,</ngram> <ngram count="19">D|.|0|.</ngram> <ngram count="19">.|D|.|0</ngram> <ngram count="19">(|p|0|.</ngram> <ngram count="18">significantly|more|likely|to</ngram> <ngram count="18">in|the|deception|condition</ngram> <ngram count="18">a|mean|z|score</ngram> <ngram count="18">D|.|1|.</ngram> <ngram count="18">.|In|addition|,</ngram> <ngram count="18">.|D|.|1</ngram> </ngrams> </wordformNgramsNclusters> subcorpus: psychology topic: results stopwords included P/p, M, t, F, SD, SE: statistical coefficients
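A count of this kind can be approximated in a few lines of Python. The sketch below is not the Tamino-based query tool; it assumes a tokeniser that splits punctuation into separate tokens (which matches the shape of entries such as F|(|1|,) and it treats `floor` as a minimum count, which is only a guess at the meaning of the attribute in the output above.

```python
import re
from collections import Counter


def tokenise(text):
    """Split into word forms and single punctuation marks (assumed scheme)."""
    return re.findall(r"\w+|[^\w\s]", text)


def ngram_counts(tokens, n=4, floor=2):
    """Return (ngram, count) pairs with count >= floor, most frequent first.

    N-grams are joined with '|' to mirror the notation of the slide.
    """
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [("|".join(gram), count)
            for gram, count in grams.most_common() if count >= floor]


segment = "F(1, 20) = 4.2, F(1, 20) = 3.1"
for gram, count in ngram_counts(tokenise(segment)):
    print(count, gram)
```

No stop word filtering is applied, in line with the note that stop words are included in the statistics.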
Classification experiment Automated thematic labelling of segments • Methods: • K-nearest-neighbour (KNN) classifier • Topic-bigram model • conditional probability, simple smoothing • Feature extraction: word forms • No stop word list • No lemmatisation or other morphology • Vector space representation • Simple frequency-based probability distribution p(f | T) • No TF*IDF or comparable weighting • Similarity metric: Jensen-Shannon divergence (iRad)
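The combination of frequency-based distributions, Jensen-Shannon divergence, and a nearest-neighbour vote can be sketched as follows. This is an illustration under stated assumptions, not the project's classifier: distributions are built per segment here (the slide's p(f | T) may be estimated differently), `k` is chosen arbitrarily, and all function names are hypothetical. The divergence is computed with base-2 logarithms and the usual factor 1/2; the information radius is sometimes defined without that factor, which does not change the neighbour ranking.

```python
import math
from collections import Counter


def distribution(tokens):
    """Relative frequencies of word forms, with no weighting or stop word list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    def kl_to_mean(a, b):
        # Sum of a(w) * log2(a(w) / m(w)), where m is the average of a and b
        return sum(pa * math.log2(pa / (0.5 * (pa + b.get(word, 0.0))))
                   for word, pa in a.items() if pa > 0)
    return 0.5 * kl_to_mean(p, q) + 0.5 * kl_to_mean(q, p)


def knn_label(segment_tokens, training, k=3):
    """Majority topic among the k training segments with the lowest divergence.

    training: list of (tokens, topic) pairs.
    """
    p = distribution(segment_tokens)
    nearest = sorted(
        training, key=lambda item: js_divergence(p, distribution(item[0]))
    )[:k]
    votes = Counter(topic for _tokens, topic in nearest)
    return votes.most_common(1)[0][0]


training = [
    (["results", "significant", "p"], "results"),
    (["p", "significant", "F"], "results"),
    (["we", "interviewed", "participants"], "sample"),
]
print(knn_label(["significant", "p", "F"], training, k=3))
```

Because the mean distribution is non-zero wherever either input is non-zero, the divergence stays finite even without smoothing, which is one practical reason to prefer it over plain Kullback-Leibler divergence for sparse word form vectors.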
Classification experiment • Corpus: • 31 German linguistics articles, about 4,000 segments • Topics: • The t018 set plus rest classes (void_meta, textual), and some noise (t, references, aspect_rsl) • Data: • 29,100 word form types as features in the vector space model • 181,416 word form tokens • Leave-one-out-split training: • For each test document, the classifier and the bigram model were trained using all text segments from remaining documents
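The leave-one-out split operates on whole documents, so no segment of the test article is ever seen during training. A minimal sketch, assuming segments are stored as (document ID, tokens, topic) triples; this layout and the function name are hypothetical.

```python
def leave_one_document_out(segments):
    """Yield (held_out_doc, train, test) once per document.

    segments: list of (doc_id, tokens, topic) triples.
    train and test are lists of (tokens, topic) pairs.
    """
    doc_ids = sorted({doc_id for doc_id, _tokens, _topic in segments})
    for held_out in doc_ids:
        train = [(tokens, topic) for doc_id, tokens, topic in segments
                 if doc_id != held_out]
        test = [(tokens, topic) for doc_id, tokens, topic in segments
                if doc_id == held_out]
        yield held_out, train, test


segments = [
    ("art-01", ["we", "argue"], "researchTopic"),
    ("art-01", ["the", "data"], "data"),
    ("art-02", ["results", "show"], "results"),
    ("art-03", ["in", "sum"], "conclusion"),
]
for held_out, train, test in leave_one_document_out(segments):
    print(held_out, len(train), len(test))
```

With 31 articles this yields 31 folds, each training on the segments of the remaining 30.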
Classification experiment: Villain-victim analysis Villain: category falsely assigned Victim: correct category of a false classification
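Given gold and predicted topic labels, the two notions defined above reduce to two counters over the misclassified segments. The sketch below follows those definitions directly; the function name and the example labels are made up for illustration.

```python
from collections import Counter


def villains_and_victims(gold, predicted):
    """Count how often each topic is a villain (wrongly assigned)
    and how often it is a victim (correct topic of a wrong classification)."""
    villains, victims = Counter(), Counter()
    for correct, assigned in zip(gold, predicted):
        if correct != assigned:
            villains[assigned] += 1
            victims[correct] += 1
    return villains, victims


gold = ["results", "background", "sample", "background"]
predicted = ["results", "othersWork", "othersWork", "background"]
villains, victims = villains_and_victims(gold, predicted)
print(villains.most_common())
print(victims.most_common())
```

In confusion-matrix terms, the villain counts are the off-diagonal column sums and the victim counts the off-diagonal row sums.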
Classification experiment: Conclusions • The classification quality is better if the classifier is accompanied by a topic bigram model (accuracy 41% → 46%) • Accuracy of 46% is not sufficient for many practical applications → enlarge the database to obtain more training examples for certain topics that are currently underrepresented • The topic othersWork is a major villain, presumably because it may subsume any topic as long as it is attributed to a different author → include parameters that better identify the argumentative status of a segment (cf. Teufel 1999) • Citation features • Verb action type • Discourse markers/cue phrases signalling rhetorical status
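A topic bigram model of the kind mentioned here estimates how likely a topic is given the topic of the preceding segment. The slides only say "conditional probability, simple smoothing", so the add-alpha smoothing, the `<start>` symbol, and the function name below are assumptions; how the model's probabilities were combined with the KNN decision is not stated and is not modelled here.

```python
from collections import Counter, defaultdict


def train_topic_bigrams(topic_sequences, alpha=1.0):
    """Return a function prob(current, previous) = P(current | previous).

    topic_sequences: one list of topic labels per document, in text order.
    Add-alpha smoothing is an assumed stand-in for the unspecified smoothing.
    """
    topics = sorted({topic for seq in topic_sequences for topic in seq})
    counts = defaultdict(Counter)
    for seq in topic_sequences:
        # Pair each topic with its predecessor; the first one follows <start>
        for previous, current in zip(["<start>"] + list(seq), seq):
            counts[previous][current] += 1

    def prob(current, previous):
        total = sum(counts[previous].values())
        return (counts[previous][current] + alpha) / (total + alpha * len(topics))

    return prob


prob = train_topic_bigrams([
    ["background", "researchTopic", "results"],
    ["background", "results"],
])
print(prob("researchTopic", "background"))
print(prob("background", "results"))
```

One plausible use is to rescore the nearest-neighbour candidates for a segment by the probability of each candidate topic given the label of the previous segment.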
Text type parameters: Conclusions Parameters extracted: • Structural position • Wordform n-grams • Topic n-grams (not presented) • Punctuation: '?', '!' (not presented) • Segment length (not presented) ToDo: Additional promising parameters (cf. Teufel 1999) • Morphology parameters: BASE FORM, TENSE, PERSON/NUMBER • Syntax parameters: VOICE, POS:AUX • Headline of current section: similarity with prototypical headlines • Others: keywords, formulaic expressions, further and alternative positional features
Future work • Automated classification • Experiments on the (larger) English psychology corpus • Inclusion of further text type parameters in the feature space • Data: • Extension of the data base, i.e., corpus annotations • Analyses: • Investigating the separate impact of discipline and language on text type characteristics • Differentiation between essential and additional features of text type Scientific Article • Synopsis/Formalisation: • Development of a framework in order to represent text type knowledge
Summary: Representation of text type semantics in semantic markup • thematic-hierarchical.xsd, thematic-lgv.xsd • hypo-para.xsd • fgtt-docbook.xsd
Semantic markup for rhetorical structure • There are notable examples of rhetorical relations between non-adjacent text spans in our corpus • E.g., the content of a footnote is a CONCESSION in relation to a discourse unit somewhere else in the text • Solution in the hypo-para format • An additional element <extra> can appear anywhere within <hypo>, <para>, or <terminal> • Its ID is referred to in an empty <nuk> or <sat> element somewhere else • Annotation of such "long-distance dependencies" is also permitted in RSTTool • But the conversion into the hypo-para format is not implemented yet