
Semantic Knowledge about XML Document Structures

Semantic Knowledge about XML Document Structures. Henning Lobin Petra Saskia Bayerl Hagen Langer Harald Lüngen Georg Rehm Justus-Liebig-Universität Gießen. www.text-technology.de. Overview. Introduction Project approach Corpora Annotation levels Analysis Topic occurrences



  1. Semantic Knowledge about XML Document Structures Henning Lobin Petra Saskia Bayerl Hagen Langer Harald Lüngen Georg Rehm Justus-Liebig-Universität Gießen www.text-technology.de

  2. Overview • Introduction • Project approach • Corpora • Annotation levels • Analysis • Topic occurrences • Topics and structural positions • Rhetorical functions of structural elements • n-gram statistics • Automated classification experiment • Conclusions and Future Work

  3. Introduction: Semantics of Generic Document Structures Generic Document Structure • Text type • Conventional written form • Structurally coherent • E.g., scientific articles in different disciplines Properties of content • Lexical properties: e.g., frequencies, collocations, ... • Syntactic properties: e.g., phrases • Semantic properties: e.g., thematic structure • Pragmatic properties: rhetorical structure, intentionality, situatedness

  4. Introduction: Project goal • Subject of the study: The communicative knowledge related to a specific text type T which supports the interpretation of the meaning of a text t ∈ T. • Goal: A formal representation of that knowledge such that it can be exploited in text-technological applications. • Methodology: Empirical, corpus-based investigation of a specific text type T (T = Scientific Article).

  5. Introduction: Aspects of investigation • Empirical aspects • Investigation of subgeneric variants of T • Identification of characteristic properties of parts of text t ∈ T • Formal aspects • Modelling of lexical, syntactic, semantic, and pragmatic text type properties • Representation of superordinate knowledge structures which include knowledge of the text type • Technological aspects • Illustration of the influence of text type knowledge on text processing in prototypical applications • Extension of the XML document grammar concept

  6. Project approach: Corpora • Text type: Scientific Article • Disciplines: Psychology (highly standardised structure) Linguistics (less standardised structure)

  7. Project approach: Annotation levels • Textual semantics of specific text types is reflected in • Layout • Typical sequences and hierarchy of text components • Thematic development • Rhetorical structure • Semantics is thus represented at three independent levels: • Structural: layout and surface structuring of text → DocBook • Thematic: topics of text segments (aboutness) → XML Schema Description • Rhetorical: relations between text segments in reader-author interaction → XML Schema Description

  8. Annotation levels: Thematic structure • The themes that text segments deal with are represented in the form of hierarchically structured topics • This notion can be extended to higher-level units • The proposed thematic structure is based on approaches by Kando (1997) and Teufel (1999). Goals: • Extension and refinement of topic sets • Separation of functional and thematic categories • In total, 120 topics are assumed • "A topic is some function determining about which item something is being said. [...] The topic of a sentence has the particular cognitive function of selecting a unit of information or concept from knowledge." (van Dijk, 1977)

  9. Annotation levels: Thematic structure Full thematic schema: t120

  10. Annotation levels: Thematic structure Full thematic schema: t120 (extract)

  11. Annotation levels: Thematic structure Full thematic schema: t120 • Reduction by cutting off at a certain level of depth • 18 terminal categories left

  12. Annotation levels: Thematic structure Reduced thematic Schema: t018 (used in classification experiment)

  13. Annotation levels: Markup for thematic structure • Schema Description (thematic-hierarchical.xsd) representing: • Hierarchy of topics in a scientific article • Their canonical order • But: Order in an instance (a specific article) may differ! • Therefore an alternative, "flat" XML Schema is derived from the original one by means of an XSLT stylesheet (thematic-lgv.xsd) • Elements <group> (empty) and <segment> • The hierarchy is still encoded using ID/IDREF attributes

  14. Annotation levels: Markup for thematic structure <content> <problem> <background> From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996).5 </background> </problem> ... </content> <group id="g19" parent="g3" topic="problem"/> <segment id="s163" parent="g19" topic="background"> From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996).5 </segment> <segment id="s163b" parent="g19" topic="researchTopic"> ... </segment>
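To illustrate how the hierarchy survives in the flat format, the following minimal Python sketch follows the `parent` references of each `<segment>` upwards and recovers its topic path. The `<flat>` wrapper element, the `g3` group with topic `content`, and the function name `topic_paths` are assumptions made so that the fragment parses as one document; they are not taken from the project's schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat annotation. The <flat> wrapper and the g3 group are
# assumptions; the g19/s163/s163b items mirror the slide, with text shortened.
FLAT = """
<flat>
  <group id="g3" topic="content"/>
  <group id="g19" parent="g3" topic="problem"/>
  <segment id="s163" parent="g19" topic="background">From the now infamous ...</segment>
  <segment id="s163b" parent="g19" topic="researchTopic">...</segment>
</flat>
"""


def topic_paths(xml_text):
    """Map each segment id to its topic path, outermost topic first."""
    root = ET.fromstring(xml_text)
    # Index all groups and segments by their id attribute.
    nodes = {el.get("id"): el for el in root if el.get("id")}
    paths = {}
    for seg in root.iter("segment"):
        path, node, seen = [], seg, set()
        # Walk up the parent references; `seen` guards against cycles.
        while node is not None and node.get("id") not in seen:
            seen.add(node.get("id"))
            path.append(node.get("topic"))
            node = nodes.get(node.get("parent"))
        paths[seg.get("id")] = list(reversed(path))
    return paths


paths = topic_paths(FLAT)
print(paths["s163"])
```

A parent id that is missing from the document simply ends the walk, so partial files such as the extract above can still be processed.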

  15. Annotation levels: Rhetorical structure Rhetorical level: Rhetorical relations between the (simple and complex) propositions in a text (RST, Mann and Thompson). Example (relation: Evidence): From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996).5

  16. Annotation levels: Rhetorical structure 41 relations for RST analyses of scientific articles • 31 of 33 in the ExtMT.rel provided by O'Donnell (2000) • 10 additional relations • 8 from the set by Carlson/Marcu (2001) • 2 newly defined: Assign, Assigned • Assigned ("broken promise"): Since anticipated positive expectations for new periods are often disappointed by life

  17. Annotation levels: Markup for rhetorical structure • Mann and Thompson's (1988) four constraints on RST analyses: Completeness, Connectedness, Uniqueness, and Adjacency • RST trees are well-suited for an XML representation • hypo-para.xsd: The elements <hypo> and <para> explicitly mark mono- vs. multinuclear relations • Annotations are prepared using RSTTool by O'Donnell (2000) and annotation guidelines for the two human annotators • A Python/XSLT program converts the flat XML output format into our hypo-para format • So far, selected sections of 15 English psychological articles have been annotated

  18. Annotation levels: XML-based multi-layer annotations (A2) Structural: <sect1> <para> From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996). <footnoteref linkend="i5">5</footnoteref> </para> </sect1> Thematic: <segment id="s24" parent="g6" topic="background"> From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996).5 </segment> Rhetorical: <hypo relname="evidence"> <nuk id="i25"> <t id="ti25">From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet</t> </nuk> <sat id="i27"> <t id="ti27">(see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996).5</t> </sat> </hypo>
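As a rough illustration of how the rhetorical layer can be read programmatically, the sketch below extracts (relation, nucleus text, satellite text) triples from `<hypo>` elements. The element and attribute names follow the slide; the shortened sample text and the function name `relations` are assumptions, and multinuclear `<para>` elements are not handled.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the slide's example; the text content is shortened.
HYPO = """
<hypo relname="evidence">
  <nuk id="i25"><t id="ti25">high stakes civil cases have become familiar staples of our media diet</t></nuk>
  <sat id="i27"><t id="ti27">(see e.g., Budiansky, 1995)</t></sat>
</hypo>
"""


def relations(xml_text):
    """Return (relname, nucleus text, satellite text) for every <hypo>."""
    root = ET.fromstring(xml_text)
    triples = []
    for hypo in root.iter("hypo"):  # iter() also visits the root itself
        nuk, sat = hypo.find("nuk"), hypo.find("sat")
        if nuk is None or sat is None:
            continue  # skip incomplete relations
        triples.append((
            hypo.get("relname"),
            "".join(nuk.itertext()).strip(),
            "".join(sat.itertext()).strip(),
        ))
    return triples


triples = relations(HYPO)
print(triples[0][0])
```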

  19. Workflow Development [Figure: workflow diagram. Labels: Text type selection; Text type corpus; Formalisation; DocBook schema; RST schema; Text type schema; Analysis and annotation (three times); annotated corpus (DocBook); annotated corpus (RST); annotated corpus (thm); Evaluation; Formalisation and integration; Text type parameters; extended text type schema]

  20. Workflow Development [Figure: the workflow diagram of slide 19, repeated]

  21. Analysis: Research questions • Systematic differences between disciplines concerning • Occurrences of topics, typical structuring → Document classification • Typical relations between configurations on different levels, e.g., thematic ↔ rhetorical or thematic ↔ structural → Text parsing, Automatic annotation • Linguistic features of topics and rhetorical text segments (Text type parameters) → Automatic annotation, Information extraction • Representational framework for relations between levels

  22. Analysis: Overview • Frequency of topics (linguistics vs. psychology) • Topics at structural positions (linguistics vs. psychology) (structural position e.g. /article[1]/sect1[2]) • Rhetorical functions of structural elements, e.g., captions • Linguistic features in thematic topics: unigrams and n-grams • Cluster-analysis: Automatic thematic labelling of text segments

  23. Analysis: Topic occurrences

  24. Analysis: Topic occurrences

  25. Analysis: Topic occurrences Conclusion Different relative topic frequencies in psychology and linguistics

  26. Analysis: Topics at structural positions: Method • Multiple XML annotations (art-01.doc, art-01.thm, ...) → Python → Prolog representation (art-01-doc-thm.pl, ...) • Prolog: querying of inclusion relation between doc-elements and thm-segments → XML representation of inclusion instances (art-01-inclusion.xml, ...) • cat → XML representation of all inclusion instances found in the corpus (inclusion_instances.xml) • XSLT → Statistics over inclusion instances (scores.xml)
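The inclusion query at the centre of this pipeline can be sketched compactly. The slides implement it in Prolog; the version below is a Python stand-in under the assumption that both annotation layers are available as character spans over the same primary text. The span values, the function names, and the use of relative frequency as the statistic are illustrative assumptions, not the project's actual data or scoring.

```python
from collections import Counter, defaultdict

# Hypothetical spans over one primary text: (label, start, end), end exclusive.
doc_elements = [
    ("/article[1]/sect1[1]", 0, 100),
    ("/article[1]/sect1[2]", 100, 250),
]
thm_segments = [
    ("background", 0, 60),
    ("researchTopic", 60, 100),
    ("sample", 100, 180),
    ("measures", 180, 250),
]


def inclusion_instances(elements, segments):
    """Pair each thematic segment with every structural element containing it."""
    return [
        (path, topic)
        for path, e_start, e_end in elements
        for topic, s_start, s_end in segments
        if e_start <= s_start and s_end <= e_end
    ]


def scores(instances):
    """Relative frequency of each topic within each structural position."""
    by_path = defaultdict(Counter)
    for path, topic in instances:
        by_path[path][topic] += 1
    return {
        path: {topic: n / sum(counts.values()) for topic, n in counts.items()}
        for path, counts in by_path.items()
    }


result = scores(inclusion_instances(doc_elements, thm_segments))
print(result["/article[1]/sect1[1]"])
```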

  27. Analysis: Topics at structural positions (Psychology, t018) [Figure: links between the structural tree (article) and the thematic tree (content). Structural positions: abstract, sect1[1], sect1[2], sect1[3], sect1[4], sect1[5], sect1[6], sect1[7], para[1] .37, para[2], para[2]. Topics: researchQuestion, dataCollection, Interpretation, researchTopic_\, theory_frm, results, method_evd_\, conclusion, measures, framework_\, background_\, rationale, material, data, concepts, othersWork, sample, dataAnalysis. Higher-level topics: background, researchTopic, framework, method_evd, problem, answers, evidence]

  28. Analysis: Topics at structural positions (Psychology, t018) [Figure: links between the structural tree (article) and the thematic tree (content). Structural positions: abstract .20, sect1[1], sect1[2] .23, sect1[3] .35, sect1[4] .27, sect1[5] .15, sect1[6] .26, sect1[7] .41, para[1] .37, para[2] .41, para[2] .31. Topics: researchQuestion, dataCollection, Interpretation, researchTopic_\, theory_frm, results, method_evd_\, conclusion, measures, framework_\, background_\, rationale, material, data, concepts, othersWork, sample, dataAnalysis. Higher-level topics: background, researchTopic, framework, method_evd, problem, answers, evidence]

  29. Analysis: Topics at structural positions (Psychology, t018) [Figure: links between the structural tree (article) and the thematic tree (content). Structural positions: abstract, sect1[1], sect1[2], sect1[3], sect1[4], sect1[5], sect1[6], sect1[7], para[1], para[2], para[2]. Topics: researchQuestion, dataCollection, Interpretation, researchTopic_\, theory_frm, results, method_evd_\, conclusion, measures, framework_\, background_\, rationale, material, data, concepts, othersWork, sample .49, dataAnalysis. Higher-level topics: background, researchTopic, framework, method_evd, problem, answers, evidence]

  30. Analysis: Topics at structural positions (Psychology, t018) [Figure: links between the structural tree (article) and the thematic tree (content). Structural positions: abstract, sect1[1], sect1[2], sect1[3], sect1[4], sect1[5], sect1[6], sect1[7], para[1], para[2], para[2]. Topics: researchQuestion .20, dataCollection .37, Interpretation .48, researchTopic_\ .16, theory_frm .40, results .49, method_evd_\ .59, conclusion .54, measures .61, framework_\ .15, background_\ .15, rationale .11, material .35, data .50, dataAnalysis .36, concepts .19, othersWork .16, sample .49. Higher-level topics: background, researchTopic, framework, method_evd, problem, answers, evidence]

  31. Analysis: Topics at structural positions (Linguistics, t018) [Figure: links between the structural tree (article) and the thematic tree (content). Structural positions: abstract, sect1[1], sect1[2] .18, sect1[3] .20, sect1[4] .18, sect1[5] .27, sect1[6] .25, sect1[7] .59, para[1] .24, para[2] .14, para[2] .17. Topics: researchQuestion .31, dataCollection .62, Interpretation .31, researchTopic_\ .33, theory_frm, results .33, method_evd_\ .67, conclusion .38, measures .38, framework_\ .30, dataAnalysis .45, background_\ .25, rationale .100, material .50, Data .54, concepts .33, sample .29, othersWork .26. Higher-level topics: background, researchTopic, framework, method_evd, problem, answers, evidence]

  32. Analysis: Topics at structural positions (Linguistics, t018) [Figure: links between the structural tree (article) and the thematic tree (content). Structural positions: abstract, sect1[1], sect1[2], sect1[3], sect1[4], sect1[5], sect1[6], sect1[7], para[1], para[2], para[2]. Topics: researchQuestion .31, dataCollection .62, Interpretation .31, researchTopic_\ .33, theory_frm, results .33, method_evd_\ .67, conclusion .38, measures .38, framework_\ .30, dataAnalysis .45, background_\ .25, rationale .100, material .50, Data .54, concepts .33, sample .29, othersWork .26. Higher-level topics: background, researchTopic, framework, method_evd, problem, answers, evidence]

  33. Analysis: Topics at structural positions: Conclusions • XML-structural position is a good text type parameter, e.g., to serve as a feature in automatic text segment classification • The choice of structural position parameters • must not be too general (sect1, para) • must not be too specific (/article[1]/sect1[1]/sect2[3]/sect3[1]/para[4]/log:table[2]) • Differences between Linguistics and Psychology: • Psychology: Subtopics of evidence abound in sect1[2] • Linguistics: Subtopics of evidence are distributed over sect1[2], sect1[3], and sect1[4]
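The trade-off between overly general and overly specific positions can be made concrete by truncating a positional path to a fixed number of location steps. The sketch below is only an illustration; the function name `truncate_position` and the choice of depth 2 are assumptions, not the project's actual feature definition.

```python
def truncate_position(xpath, depth):
    """Keep only the first `depth` location steps of an absolute positional path."""
    steps = [step for step in xpath.split("/") if step]
    return "/" + "/".join(steps[:depth])


full = "/article[1]/sect1[1]/sect2[3]/sect3[1]/para[4]/log:table[2]"
feature = truncate_position(full, 2)
print(feature)  # /article[1]/sect1[1]
```

A depth of 2 yields positions of the kind used in the figures (e.g. /article[1]/sect1[2]), whereas larger depths approach the overly specific end of the scale.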

  34. Analysis: Rhetorical functions of structural elements Which rhetorical relation dominates <rst:sat>-elements whose primary data are identical to a <log:caption>-element? • Corpus: psy-engl-pte15 (15 articles) • 12 occurrences of <log:caption> • Analysis using seit.pl • Not statistically significant

  35. Analysis: Rhetorical functions of structural elements [Figure labels: HYPO, Interpretation, NUK, SAT; FIGURE, CAPTION, IMG; "?"] • seit.pl currently does not support complex queries for correspondences between whole rhetorical schemata and structural configurations • Not enough data: more macrostructure annotations are needed to check rhetorical functions of structural elements

  36. Analysis: n-gram statistics Method: corpus query tool operating on thm-annotations stored in Tamino-DB

  37. Analysis: n-gram statistics <topic count="1453" wordforms="50074" name="results" wordformsPercent="14.46"/> <wordformNgramsNclusters floor="2" verbose="['true']" n="4"> <ngrams> <ngram count="81">F|(|1|,</ngram> <ngram count="68">s|.|e|.</ngram> <ngram count="66">S|.|D|.</ngram> <ngram count="66">,|F|(|1</ngram> <ngram count="52">,|S|.|D</ngram> <ngram count="45">,|p|.|05</ngram> <ngram count="40">,|s|.|e</ngram> <ngram count="37">,|p|0|.</ngram> <ngram count="33">,|p|.|01</ngram> <ngram count="30">p|.|05|.</ngram> <ngram count="30">F|(|2|,</ngram> <ngram count="29">,|p|0|:</ngram> <ngram count="28">,|SD|1|.</ngram> <ngram count="26">)|,|F|(</ngram> <ngram count="24">children|in|the|deception</ngram> <ngram count="21">m|.|s|.</ngram> <ngram count="21">P|.|0001|)</ngram> <ngram count="21">.|s|.|e</ngram> <ngram count="21">,|m|.|s</ngram> <ngram count="20">in|the|guilty|condition</ngram> <ngram count="20">as|a|function|of</ngram> <ngram count="19">the|other|hand|,</ngram> <ngram count="19">D|.|0|.</ngram> <ngram count="19">.|D|.|0</ngram> <ngram count="19">(|p|0|.</ngram> <ngram count="18">significantly|more|likely|to</ngram> <ngram count="18">in|the|deception|condition</ngram> <ngram count="18">a|mean|z|score</ngram> <ngram count="18">D|.|1|.</ngram> <ngram count="18">.|In|addition|,</ngram> <ngram count="18">.|D|.|1</ngram> </ngrams> </wordformNgramsNclusters> subcorpus: psychology topic: results stopwords included P/p, M, t, F, SD, SE: statistical coefficients
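Counts of this kind can be approximated with a few lines of Python. The sketch below is not the Tamino-based query tool; the tokenisation rule (alphanumeric runs are one token, every other non-space character is its own token), the sample sentence, and the function name are assumptions chosen to mimic the `|`-separated output with `n="4"` and `floor="2"`.

```python
import re
from collections import Counter


def wordform_ngrams(text, n=4, floor=2):
    """Count word-form n-grams and keep those occurring at least `floor` times."""
    # Assumed tokenisation: \w+ runs are tokens, punctuation marks stand alone.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    counts = Counter(
        "|".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return [(gram, c) for gram, c in counts.most_common() if c >= floor]


# Invented sample text in the style of a psychology results section.
sample = "F(1, 20) = 4.2, p < .05. F(1, 18) = 3.1, p < .05."
ngrams = dict(wordform_ngrams(sample))
print(ngrams["F|(|1|,"])
```

No stop word filtering is applied, which matches the "stopwords included" setting noted on the slide.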

  38. Classification experiment Automated thematic labelling of segments • Methods: • K-nearest-neighbour (KNN) classifier • Topic-bigram model • conditional probability, simple smoothing • Feature extraction: word forms • No stop word list • No lemmatisation or other morphology • Vector space representation • Simple frequency-based probability distribution p(f |T) • No TF*IDF or comparable weighting • Similarity metric: Jensen-Shannon Divergence (iRad)
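A minimal sketch of the classifier side of this setup follows: segments become relative-frequency distributions over word forms, and neighbours are ranked by Jensen-Shannon divergence. The toy training data, the value of k, and all function names are assumptions; the topic-bigram component is left out here. Some definitions of the information radius omit the factor 0.5 used below, which does not change the ranking of neighbours.

```python
import math
from collections import Counter


def distribution(tokens):
    """Relative frequencies of word forms, without any weighting."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    div = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        m = 0.5 * (pw + qw)
        if pw:
            div += 0.5 * pw * math.log2(pw / m)
        if qw:
            div += 0.5 * qw * math.log2(qw / m)
    return div


def knn_label(segment, training, k=3):
    """training: list of (tokens, topic); majority topic of the k nearest."""
    p = distribution(segment)
    ranked = sorted(training, key=lambda ex: js_divergence(p, distribution(ex[0])))
    votes = Counter(topic for _, topic in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy data; topic names are taken from the t018 set.
training = [
    ("the results were significant".split(), "results"),
    ("we recruited participants".split(), "sample"),
    ("results were not significant".split(), "results"),
]
label = knn_label("significant results".split(), training, k=1)
print(label)
```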

  39. Classification experiment • Corpus: • 31 German linguistics articles, about 4,000 segments • Topics: • The t018 set plus rest classes (void_meta, textual), and some noise (t, references, aspect_rsl) • Data: • 29,100 word form types as features in the vector space model • 181,416 word form tokens • Leave-one-out-split training: • For each test document, the classifier and the bigram model were trained using all text segments from remaining documents
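The leave-one-out regime and the topic-bigram model can be sketched together. The slides only say "simple smoothing"; add-one smoothing is used below as a stand-in, and the toy topic sequences and function names are assumptions.

```python
from collections import Counter, defaultdict


def train_bigrams(documents):
    """documents: list of topic sequences. Returns P(topic | previous topic)."""
    topics = {t for doc in documents for t in doc}
    counts = defaultdict(Counter)
    for doc in documents:
        for prev, cur in zip(doc, doc[1:]):
            counts[prev][cur] += 1

    def prob(cur, prev):
        # Add-one smoothing (an assumption standing in for "simple smoothing").
        return (counts[prev][cur] + 1) / (sum(counts[prev].values()) + max(len(topics), 1))

    return prob


def leave_one_out(documents):
    """Yield (held-out document, bigram model trained on all other documents)."""
    for i, held_out in enumerate(documents):
        rest = documents[:i] + documents[i + 1:]
        yield held_out, train_bigrams(rest)


# Invented toy corpus: one topic sequence per article.
corpus = [
    ["background", "researchTopic", "method"],
    ["background", "researchTopic", "results"],
    ["background", "method", "results"],
]
splits = list(leave_one_out(corpus))
held_out, model = splits[0]
print(model("researchTopic", "background"))
```

In a combined system, the bigram probability of a candidate topic given the previous segment's topic could be multiplied with the classifier's score for that topic; how the two were actually combined is not stated on the slides.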

  40. Classification experiment: Results

  41. Classification experiment: Villain-victim analysis Villain: category falsely assigned Victim: correct category of a false classification
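Following the two definitions above, villain and victim counts can be tallied directly from gold and predicted labels. The label sequences below are invented for illustration, and the function name is an assumption.

```python
from collections import Counter


def villains_and_victims(gold, predicted):
    """Count falsely assigned categories (villains) and missed categories (victims)."""
    villains, victims = Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g != p:
            villains[p] += 1  # category that was falsely assigned
            victims[g] += 1   # correct category that was missed
    return villains, victims


# Invented example labels.
gold = ["results", "sample", "background", "results"]
pred = ["othersWork", "sample", "othersWork", "results"]
villains, victims = villains_and_victims(gold, pred)
print(villains.most_common(1))
```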

  42. Classification experiment: Conclusions • The classification quality is better if the classifier is accompanied by a topic bigram model (accuracy 41% → 46%) • Accuracy of 46% is not sufficient for many practical applications → enlarge the database to obtain more training examples for certain topics that are currently underrepresented • The topic othersWork is a major villain, presumably because it may subsume any topic as long as it is attributed to a different author → include parameters that better identify the argumentative status of a segment (cf. Teufel 1999) • Citation features • Verb action type • Discourse markers/cue phrases signalling rhetorical status

  43. Text type parameters: Conclusions Parameters extracted: • Structural position • Wordform n-grams • Topic n-grams (not presented) • Punctuation: '?', '!' (not presented) • Segment length (not presented) ToDo: Additional promising parameters (cf. Teufel 1999) • Morphology parameters: BASE FORM, TENSE, PERSON/NUMBER • Syntax parameters: VOICE, POS:AUX • Headline of current section: similarity with prototypical headlines • Others: keywords, formulaic expressions, further and alternative positional features

  44. Future work • Automated classification • Experiments on the (larger) English psychology corpus • Inclusion of further text type parameters in the feature space • Data: • Extension of the data base, i.e., corpus annotations • Analyses: • Investigating the separate impact of discipline and language on text type characteristics • Differentiation between essential and additional features of text type Scientific Article • Synopsis/Formalisation: • Development of a framework in order to represent text type knowledge

  45. Thank you!

  46. Additional material

  47. Topics in Psychology and Linguistics

  48. Analysis: Topics at structural positions (Psychology, t018) [Figure: links between the structural tree (article) and the thematic tree (content). Structural positions: abstract .20, sect1[1], sect1[2] .23, sect1[3] .35, sect1[4] .27, sect1[5] .15, sect1[6] .26, sect1[7] .41, para[1] .37, para[2] .41, para[2] .31. Topics: researchQuestion .20, dataCollection .37, Interpretation .48, researchTopic_\ .16, theory_frm .40, results .49, method_evd_\ .59, conclusion .54, measures .61, framework_\ .15, background_\ .15, rationale .11, material .35, data .50, dataAnalysis .36, concepts .19, othersWork .16, sample .49. Higher-level topics: background, researchTopic, framework, method_evd, problem, answers, evidence]

  49. Representation of text type semantics in semantic markup Summary • thematic-hierarchical.xsd → thematic-lgv.xsd • hypo-para.xsd • fgtt-docbook.xsd

  50. Semantic markup for rhetorical structure • There are notable examples of rhetorical relations between non-adjacent text spans in our corpus • E.g., the content of a footnote is a CONCESSION in relation to a discourse unit somewhere else in the text • Solution in the hypo-para format • An additional element <extra> can appear anywhere within <hypo>, <para>, or <terminal> • Its ID is referred to in an empty <nuk> or <sat> element somewhere else • Annotation of such "long-distance dependencies" is also permitted in RSTTool • But conversion into hypo-para format is not implemented yet
