
Dan McIntyre, John Heywood, Tony McEnery, Elena Semino and Mick Short

Building a corpus to investigate the presentation of speech, thought and writing in Spoken British English. Dan McIntyre, John Heywood, Tony McEnery, Elena Semino and Mick Short, Department of Linguistics and Modern English Language, Lancaster University, UK.

Presentation Transcript


  1. Building a corpus to investigate the presentation of speech, thought and writing in Spoken British English. Dan McIntyre, John Heywood, Tony McEnery, Elena Semino and Mick Short, Department of Linguistics and Modern English Language, Lancaster University, UK. PALC 2003.

  2. Aims of the project • To investigate the forms and functions of speech, thought and writing presentation (ST&WP) in spoken data. • To compare the presentation of ST&WP in a corpus of spoken data with the findings from an equivalent corpus of written texts. • To further test the model of speech and thought presentation outlined in Leech and Short (1981).

  3. What is speech, thought and writing presentation? • Prototypically, the presentation in a posterior discourse of what was said, thought or written in a (supposed) anterior discourse. • Scale from the speaker’s words (top) to the reporter’s words (bottom): • Direct speech: [DS] ‘Shut up, you silly old fool,’ [RS] she said. • Indirect speech: [RS] She told him [IS] that he should shut up. • Representation of a speech act: [RSA] She commanded him.

  4. Selecting the corpus data • 120 transcripts - approximately 260,000 words. • Texts taken from the British National Corpus (BNC) and Centre for North West Regional Studies (CNWRS) oral history archives at Lancaster University. • CNWRS interview tapes digitised to be time-aligned with text by Softsound Ltd, Cambridge, UK. • BNC sound files identified where possible.

  5. The ST&WP categories (table: main categories)

  6. ST&WP category features (table: category features)

  7. ST&WP category features (table: category features, continued)

  8. Annotating the corpus for ST&WP • We use the element <sptag> and mark the ST&WP category within the attribute cat. • Tags designed for concordancing using WordSmith Tools. • 15 fields to mark ST&WP categories. • x used as a placeholder for empty positions. • Template (element sptag, attribute cat, attribute value made up of fields 1-15): <sptag cat="xxxxxxxxxxxxxxx">

  9. Annotating the corpus for ST&WP • Example attribute value: <sptag cat="FIW">

  10. Annotating the corpus for ST&WP • Example attribute value: <sptag cat="xDSxxxh">

  11. Annotating the corpus for ST&WP • Example attribute value: <sptag cat="xRSAxxghxxxxp">

  12. Annotating the corpus for ST&WP • Each character of cat fills one field, so <sptag cat="xDS"> is equivalent to <sptag one="x" two="D" three="S">
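The positional reading of the cat attribute on slides 8-12 can be sketched in a few lines of Python. This is only an illustration, not the project's own tooling: the helper names expand_cat and compact_cat are hypothetical, the field names beyond "one", "two" and "three" are extrapolated from slide 12, and the sketch assumes (as the slide examples suggest) that trailing x placeholders may be left out of a tag.

```python
# Hypothetical helpers illustrating the 15-field positional scheme.
# Field names after "three" are extrapolated, not taken from the slides.
FIELD_NAMES = [
    "one", "two", "three", "four", "five",
    "six", "seven", "eight", "nine", "ten",
    "eleven", "twelve", "thirteen", "fourteen", "fifteen",
]
PLACEHOLDER = "x"  # marks an empty position, as on slide 8


def expand_cat(cat, width=15):
    """Map each character of a cat value to its named field.

    Assumption: omitted trailing positions are empty, so the value is
    padded on the right with the placeholder up to `width` fields.
    """
    if len(cat) > width:
        raise ValueError("cat value has more than %d fields" % width)
    padded = cat.ljust(width, PLACEHOLDER)
    return dict(zip(FIELD_NAMES, padded))


def compact_cat(fields):
    """Rebuild the short cat value, dropping trailing placeholders."""
    joined = "".join(fields[name] for name in FIELD_NAMES)
    return joined.rstrip(PLACEHOLDER)


if __name__ == "__main__":
    fields = expand_cat("xDS")
    print(fields["one"], fields["two"], fields["three"])
    print(compact_cat(expand_cat("xRSAxxghxxxxp")))
```

Keeping one character per field means that a concordancer can match a category by position, which is presumably why the tags were designed this way.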

  13. A sample extract from a marked-up file <sptag cat="A">Then they went to Hereford and there were Quakers there and </sptag><sptag cat="xRIxxxxxxi">he had a hard time of it</sptag><sptag cat="xRIxxxxxxi">they didn't like Catholics</sptag><sptag cat="A">and I can remember <note desc="S implied">they sent me</note> I was a manageress in the laundry here and <note desc="S implied">they sent me to Kendal</note> when we opened a laundry at Kendal and I was staying at a lodging in Kendal and the man was th they were Quakers and </sptag><sptag cat="xRSxx2">I said to the young lady, I said</sptag><sptag cat="xDS"> Would you mind if you made my dinner on Friday it doesn't matter if it's only bread and butter, but no meat, because we don't eat meat on a Friday or no bacon just bread anything plain it doesn't matter what it is but no meat</sptag><sptag cat="xRS">and the old man says</sptag><sptag cat="xDS">I'm sorry for thee</sptag><sptag cat="xRT">and I thought </sptag><sptag cat="xDT">oh</sptag><sptag cat="A">but he was a Quaker. Anyway</sptag><sptag cat="xRS">she says</sptag><sptag cat="xDS">shut up you silly old fool</sptag>
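A marked-up extract like the one above can be read back into (category, text) pairs with a short script. The following is a minimal sketch, assuming that sptag elements are never nested and that note elements should simply be unwrapped. The function name parse_sptags is hypothetical, and the sample string is the tail of the extract on slide 13.

```python
import re
from collections import Counter

# Assumption: <sptag> elements are flat (not nested), so a non-greedy
# regular expression is enough; a real XML parser would be more robust.
SPTAG = re.compile(r'<sptag cat="([^"]*)">(.*?)</sptag>', re.DOTALL)
NOTE = re.compile(r'</?note[^>]*>')  # strips <note ...> and </note> markers

sample = (
    '<sptag cat="xRS">and the old man says</sptag>'
    '<sptag cat="xDS">I\'m sorry for thee</sptag>'
    '<sptag cat="xRT">and I thought </sptag>'
    '<sptag cat="xDT">oh</sptag>'
    '<sptag cat="A">but he was a Quaker. Anyway</sptag>'
    '<sptag cat="xRS">she says</sptag>'
    '<sptag cat="xDS">shut up you silly old fool</sptag>'
)


def parse_sptags(text):
    """Return a list of (cat value, tagged text) pairs in document order."""
    return [
        (cat, NOTE.sub("", body).strip())
        for cat, body in SPTAG.findall(text)
    ]


segments = parse_sptags(sample)
counts = Counter(cat for cat, _ in segments)

if __name__ == "__main__":
    for cat, body in segments:
        print(cat, "|", body)
    print(counts.most_common())
```

Counting cat values in this way is one route to per-category totals of the kind reported in the preliminary results.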

  14. Preliminary results: comparative numbers and percentages of speech tags in the Spoken and Written Corpora, in relation to total number of discourse presentation tags • Spoken Corpus: total tags = 34,927; A = 21,467; RU = 255; RS = 2,774; ambiguities = 1,149; ST&WP tags = 8,783. • Written Corpus: total tags = 16,533; N = 3,601; ambiguities = 885; ST&WP tags = 8,588.

  15. Preliminary results: comparative numbers and percentages of thought tags in the Spoken and Written Corpora, in relation to total number of discourse presentation tags • Spoken Corpus: total tags = 34,927; A = 21,467; RU = 255; RT = 1,109; ambiguities = 488; ST&WP tags = 8,783. • Written Corpus: total tags = 16,533; N = 3,601; ambiguities = 885; ST&WP tags = 8,588.

  16. Preliminary results: comparative numbers and percentages of writing tags in the Spoken and Written Corpora, in relation to total number of discourse presentation tags • Spoken Corpus: total tags = 34,927; A = 21,467; RU = 255; RW = 145; ambiguities = 295; ST&WP tags = 8,783. • Written Corpus: total tags = 16,533; N = 3,601; ambiguities = 885; ST&WP tags = 8,588.
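The transcript preserves the raw counts from slides 14-16 but not the percentages shown in the charts. As a rough sketch, the shares can be recomputed as each count divided by the corpus's total number of tags. The counts below are copied from the slides; the function name share and the choice of total tags as the denominator are assumptions, and the results are not figures reported by the authors.

```python
# Counts copied from slides 14-16; everything else here is illustrative.
SPOKEN_TOTAL_TAGS = 34927
WRITTEN_TOTAL_TAGS = 16533

spoken_counts = {
    "A": 21467,
    "RU": 255,
    "RS": 2774,
    "RT": 1109,
    "RW": 145,
    "ST&WP": 8783,
}
written_counts = {
    "N": 3601,
    "ST&WP": 8588,
}


def share(count, total):
    """Percentage of `total` accounted for by `count`."""
    return 100.0 * count / total


spoken_shares = {
    label: round(share(n, SPOKEN_TOTAL_TAGS), 2)
    for label, n in spoken_counts.items()
}
written_shares = {
    label: round(share(n, WRITTEN_TOTAL_TAGS), 2)
    for label, n in written_counts.items()
}

if __name__ == "__main__":
    for label, pct in spoken_shares.items():
        print("spoken  %-6s %6.2f%%" % (label, pct))
    for label, pct in written_shares.items():
        print("written %-6s %6.2f%%" % (label, pct))
```

On this reading, ST&WP tags make up a noticeably larger share of all tags in the Written Corpus than in the Spoken Corpus, although the two corpora differ in overall size and in how non-presentation material (A versus N) is labelled.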

  17. Where next? • Further refinement of ST&WP annotation. • ST&WP and prosodic discontinuities (e.g. voice quality). • Combination of quantitative and qualitative analyses. • Comparison of findings from the two corpora.
