Data in Linguistics

Linguistics, by tradition, is not an exact or empirical science. Modern linguists have attempted to transform linguistics step by step into a science.

An exact science needs formalized models and provable methods for verifying (or, more often, falsifying) theories. An empirical science needs a methodology for obtaining, processing, and evaluating data, and for exploiting data to verify (or falsify) theories. An exact empirical science needs to establish the correspondence between data and formal models. Therefore, data need to be interpreted. Quantitative data require methods and tools for measurement.

In linguistics, the quantitative branch of the discipline was disconnected from the theoretical core of the field for many decades, because quantitative linguists could not measure the phenomena that were at the focus of discussion. It was language technology that finally brought the two together.

Example: Astronomy. Photographs and spectral analyses of distant heavenly bodies are scientific data. However, without interpretation in relation to formal models, they are rather useless.

Data in Linguistics

A well-developed and established concept of linguistic data is still missing. There is no good theory of the relationship between different types of data, e.g.,
• example sentences,
• online performance experiments,
• corpora,
• treebanks,
• test suites.

However, there has been progress in several areas, e.g.,
• evaluating acceptability judgements: a methodology for subjective rating tests,
• annotation and interpretation of data,
• methods for using quantitative data in language technology.

Types of Linguistic Data 1

Linguistic data are often classified into “real” and “unreal” data depending on their origin. However, this dichotomy does not fully cover the range of possible sources.
• naturally occurring data, e.g.,
    • (balanced) reference corpora
    • specialized corpora for specific subject domains or applications
    • incidentally discovered linguistic examples
• evoked or induced data, e.g.,
    • dialogue-scenario data
    • Wizard-of-Oz data
• invented or solicited data, e.g.,
    • sample sentences created by linguists
    • acceptability judgements solicited by linguists
    • test suites

Types of Linguistic Data 2

The dichotomy between real and unreal data does not necessarily coincide with the property of naturalness. Linguists' example sentences are often considered “unnatural”. On the other hand, a large corpus may contain many sentences that are extremely unnatural. Naturalness does not depend solely on the origin of the data.

Concept of Linguistic Data

If we view linguistics as an empirical science, pieces of linguistic knowledge have to be abstractions over linguistic data. These abstractions are parts of our theories about the content and structure of linguistic competence and about the processes, constraints, and preferences that govern linguistic performance.

Linguistic data are individual utterances, parts of utterances, or collections of utterances in a certain human language (or in several languages). The utterances may be represented in written or spoken (or signed) form, i.e., as textual or acoustic signals.

Annotated Data

Usually these collected utterances are annotated with additional information. If an annotation does not contain a partial linguistic interpretation of the utterances, it may be considered part of the data. Annotations that do not include linguistic interpretation are, e.g.:
• judgements of native speakers on the acceptability or appropriateness of the utterance,
• information on the speaker(s),
• information on the hearer(s) or intended audience,
• information on the utterance situation (time, place, circumstances),
• information on the published source,
• typographic information,
• layout and document structure,
• textual transcriptions of spoken utterances,
• transcription of pauses.

Interpreted Data

Annotations involving a partial linguistic interpretation are, e.g.:
• part-of-speech tags,
• word sense information,
• morphosyntactic features of words,
• constituent structures for phrases or sentences,
• coreference markers,
• dependency structures,
• predicate-argument structures,
• reference identifications for term phrases,
• information structures within sentences,
• intonation contours,
• speech acts,
• discourse structures.
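
The split between annotations that count as part of the data and annotations that carry a partial linguistic interpretation can be made concrete in a small data structure. The following is only a minimal sketch: the class names `Token` and `Utterance`, the field names, the tag labels, and the sample sentence are all hypothetical and are not taken from any particular corpus format or annotation scheme.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Token:
    """One word with optional interpretive annotation (hypothetical layout)."""
    form: str                   # the word as it appears in the utterance
    pos: Optional[str] = None   # interpretive: part-of-speech tag
    head: Optional[int] = None  # interpretive: index of the dependency head, -1 for the root


@dataclass
class Utterance:
    text: str
    # Non-interpretive annotations: may be considered part of the data.
    metadata: Dict[str, str] = field(default_factory=dict)
    acceptability: List[int] = field(default_factory=list)  # native-speaker ratings
    # Interpretive annotations: partial linguistic analyses layered on top.
    tokens: List[Token] = field(default_factory=list)

    def raw_view(self) -> "Utterance":
        """Return a copy that keeps the data but drops every interpretive layer."""
        return Utterance(
            text=self.text,
            metadata=dict(self.metadata),
            acceptability=list(self.acceptability),
        )


# Invented example sentence with illustrative tags.
utt = Utterance(
    text="The dog barks",
    metadata={"speaker": "S1", "source": "invented example"},
    acceptability=[5, 4, 5],
    tokens=[Token("The", "DET", 1), Token("dog", "NOUN", 2), Token("barks", "VERB", -1)],
)
raw = utt.raw_view()
```

Keeping the interpretive layer in a separate field means the same utterance can be re-annotated under a different theory without touching the data itself.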

Parameters for Classification

language: Spanish, English, German
sublanguage/register: regional dialect, sociolect, vernacular, professional jargon, toddler speech
text sort(s): newspaper articles, wire news, political speech, control commands
subject domain: stock rates, flight reservations
type of producers: professional journalist, student, radiologist
mode of production: spoken, written, signed, morsed
medium of production: pencil, PC with MS Word, dictaphone
conditions of production: spontaneous, carefully composed, produced under time pressure
transmission encoding: raw ASCII code, HTML, digitized phone signal, Unicode
medium of transmission: telephone, WWW, CB radio
storage encoding: raw ASCII code, HTML, AIFF
medium of storage: DAT tape, CD-ROM, hard disk
mode of presentation: spoken, written, signed
medium of presentation: newspaper, radio, book, TV show, theater performance
type of intended recipients: newspaper reader, booking agent, theater audience
number of intended recipients: point-to-point, multicast, broadcast
synchronicity of discourse: synchronous dialogue, asynchronous
direction: one-way, two-way
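
A few of these parameters can be recorded as a metadata profile attached to each data collection, which makes it possible to select data by classification. This is a sketch under stated assumptions: the class `DataProfile`, the helper `select`, and the two sample profiles are hypothetical; only the parameter names and values come from the list above, and only a subset of the parameters is shown.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DataProfile:
    """Hypothetical record holding a subset of the classification parameters."""
    language: str
    register: str
    text_sort: str
    subject_domain: str
    mode_of_production: str
    conditions_of_production: str
    storage_encoding: str
    direction: str


def select(profiles: Iterable[DataProfile], **criteria: str) -> List[DataProfile]:
    """Keep the profiles whose fields equal every given criterion."""
    return [
        p for p in profiles
        if all(getattr(p, name) == value for name, value in criteria.items())
    ]


# Two invented collections, described with values from the parameter list.
collections = [
    DataProfile("German", "professional jargon", "wire news", "stock rates",
                "written", "produced under time pressure", "HTML", "one-way"),
    DataProfile("English", "vernacular", "control commands", "flight reservations",
                "spoken", "spontaneous", "AIFF", "two-way"),
]

spoken = select(collections, mode_of_production="spoken")
german_news = select(collections, language="German", text_sort="wire news")
```

Treating each parameter as an explicit field, rather than free-text documentation, is what allows a collection to be checked against the requirements of a given study or application.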

Criteria for Usefulness

In order to be useful, data have to be representative:
• representative of a certain linguistic phenomenon,
• representative of a certain text sort,
• representative of the expected input to some language technology application,
• representative of the expected output of some language technology application,
• representative of a certain speaker,
• etc.

Can data be representative of an entire language?

Research Tasks

• Raw data are easy to obtain today.
• The demanding task lies in linguistic interpretation.
• Research tasks:
    • design of annotation schemes for the levels of description
    • design of exchange formats and conversion tools
    • design and implementation of tools for corpus annotation
    • partial automation of annotation
    • design of methods and tools for quality assurance
    • design of tools for using the data in research (retrieval, analysis, additional documentation)
    • design of tools for using the data in application development (methods and tools for “training”)