From CHILDES to TalkBank An International Database of Communicative Interaction
TalkBank • Brian MacWhinney • Carnegie Mellon University, Psychology • Child Language Data Exchange System CHILDES • Steven Bird, Mark Liberman • University of Pennsylvania, Linguistics • Linguistic Data Consortium, LDC • Howard Wactlar • Carnegie Mellon University, Computer Science • Informedia Project
Basic Premise of TalkBank • Human Communication is a unified fact, • but it is studied by 8 disciplines and up to 40 subdisciplines. • Analysis is important, but so is synthesis. • We can put the puzzle back together by focusing all the disciplines on the data.
Some Examples • “My Theory” • Bettino Craxi • Nixon’s Watergate Tapes • MacWhinney’s Lectures • Ross and Mark • Graphics lesson • Bilingual Classroom
My Theory: An Example Special Issue of Discourse Processes edited by Tim Koschmann with articles from • Rogers Hall • Jay Lemke • Annemarie Palincsar • Carl Frederiksen • Commentary by • Judith Green & Marleen McClelland • Jeremy Roschelle
TalkBank Areas • Classroom Discourse - CMU Dec 99 • Conversation Analysis - Odense Oct • Text and Discourse - Santa Barbara July • Child Language Disorders - Madison 2002 • Language and Gesture - CMU October • Child Language Learning - Madison Aug 2002 • Animal Communication - Penn May 2000
More areas ... • Field Linguistics - LSA Dec 99, Penn Dec 2000 • Aphasia • Corpus Linguistics • Signed Language • Second Language Learning • Anthropological Linguistics • Cross-cultural studies
More areas ... • Multilingualism, code-switching - LIDES • Mother-infant interaction • Psychiatry • Conflict Resolution • Management Styles • Small-group Interaction - soon • Human-computer Interaction
More areas ... • Speech Technology - ongoing • Virtual Reality • Guided Robots, Social Robots
Why data-sharing is important • Increasing the size and reliability of the empirical basis • Opening science to the community, practitioners, and students • Opening science to collaborative commentary • Creating transparency across disciplines
Key Features of TalkBank • Multimodal digitized data • Internet access • Defense of confidentiality • Codon: transcription, coding, viewing, and analysis • XML standard for underlying representation • Alliance of databases from many fields
Why TalkBank can be built now • The Internet • Fast computers, big disks, cheap storage • Good audio and video digitization • Advances in web-based database design • Emergence of annotation standards • Maturation of the social sciences
CHILDES: A Prototype • Brian MacWhinney - CMU • Leonid Spektor - CMU • Catherine Snow - Harvard • 2000 Members • 400 Active contributors
1850-1950 Darwin and Diaries • Darwin, Stern, Ament • Emotion, gesture, language, the soul • Card files and shoe boxes
1950-1984 Tapes • Nagras and TEAC, VHS and Beta • Dittos, mimeo, notes in the margins • Good “raw” data, unclear transcription
Universals • Are there basic patterns to babbling? • Are early word orders universal? • Does UG give children a universal set of functional categories? • Is the vocabulary spurt universal? The answer requires LOTS of data
Particulars • Do children have individual styles? • Gestalt vs. Analytic • Enactive (1S) vs. Depictive (3S) • Do children respond differentially to parental recasts? • Do children vary in their match to cue validity? Again, we need LOTS of data
Comparisons • How should we match SLI children to normal controls: by MLU, morphology, or TTR? • How should we compare language socialization processes across social classes? Between cultures? • How should we compare the course of development across languages? The case of Romance.
Three Components • CHAT -- Transcription System • CLAN -- Programs • Database
CHAT Format
@Begin
@Participants: CHI Target_Child Sid, MOT Mother
*MOT: you want them to go in there?
*CHI: yeah. [+ Q]
*CHI: yeah. [+ SR]
*MOT: okay.
*CHI: okay. [+ I]
*CHI: look at this.
%act: CHI picks up piece of paper
@End
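The line types in the sample above (`@` headers, `*` speaker lines, `%` dependent tiers, and `[+ ...]` postcodes) can be read with a few lines of code. The following is a minimal sketch in Python, not part of CLAN; the function name `parse_chat` and the dictionary layout are assumptions made for illustration, and real CHAT files have more conventions (continuation lines, tabs, many more codes) than this handles.

```python
import re

# The sample transcript from the slide, one CHAT line per text line.
SAMPLE = """@Begin
@Participants: CHI Target_Child Sid, MOT Mother
*MOT: you want them to go in there?
*CHI: yeah. [+ Q]
*CHI: yeah. [+ SR]
*MOT: okay.
*CHI: okay. [+ I]
*CHI: look at this.
%act: CHI picks up piece of paper
@End"""


def parse_chat(text):
    """Split a small CHAT transcript into headers and utterances (hypothetical helper)."""
    headers, utterances = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Header line such as @Begin or @Participants: ...
            name, _, value = line[1:].partition(":")
            headers.append((name, value.strip()))
        elif line.startswith("*"):
            # Main speaker line: *CHI: words [+ postcode]
            speaker, _, rest = line[1:].partition(":")
            postcodes = re.findall(r"\[\+ ([^\]]+)\]", rest)
            words = re.sub(r"\[[^\]]*\]", "", rest).strip()
            utterances.append(
                {"speaker": speaker, "text": words, "postcodes": postcodes, "tiers": {}}
            )
        elif line.startswith("%") and utterances:
            # Dependent tier: attach it to the utterance just before it.
            tier, _, value = line[1:].partition(":")
            utterances[-1]["tiers"][tier] = value.strip()
    return headers, utterances


headers, utterances = parse_chat(SAMPLE)
for u in utterances:
    print(u["speaker"], u["text"], u["postcodes"], u["tiers"])
```

Attaching each `%` tier to the preceding utterance mirrors the way the `%act` line in the sample describes the child's last turn.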
String Search • Freq • KWAL • Combo • Gem • GemFreq, GemList
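To give a feel for what the string-search programs do, here is a rough sketch in Python of a word-frequency count and a keyword-and-line search over speaker lines. It is only an illustration of the idea behind Freq and KWAL, not CLAN's implementation; the function names `freq` and `kwal`, their arguments, and the simple tokenizer are assumptions.

```python
import re
from collections import Counter

# A few speaker lines taken from the CHAT sample, with postcodes left out.
LINES = [
    "*MOT: you want them to go in there?",
    "*CHI: yeah.",
    "*MOT: okay.",
    "*CHI: okay.",
    "*CHI: look at this.",
]


def words(line):
    """Lowercased word tokens from the text after the speaker code."""
    _, _, text = line.partition(":")
    return re.findall(r"[a-z']+", text.lower())


def freq(lines, speaker=None):
    """Count word frequencies, optionally for one speaker only (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        if speaker is None or line.startswith("*" + speaker + ":"):
            counts.update(words(line))
    return counts


def kwal(lines, keyword):
    """Return every line that contains the keyword as a whole word (hypothetical helper)."""
    return [line for line in lines if keyword.lower() in words(line)]


print(freq(LINES, speaker="CHI").most_common())
print(kwal(LINES, "okay"))
```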
Indexes • MLU • MLT • WdLen, MaxWd • VOCD • DSS • IPSyn (in progress)
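As a worked illustration of two of the simpler measures, the sketch below computes mean length of utterance and type-token ratio (TTR, mentioned on the Comparisons slide) in Python. It makes a simplifying assumption: length is counted in words, whereas MLU is commonly computed over morphemes. The function names and tokenizer are hypothetical and this is not CLAN's code.

```python
import re


def tokens(utterance):
    """Lowercased word tokens; punctuation is ignored."""
    return re.findall(r"[a-z']+", utterance.lower())


def mlu_words(utterances):
    """Mean length of utterance, counted in words (a simplification)."""
    if not utterances:
        return 0.0
    return sum(len(tokens(u)) for u in utterances) / len(utterances)


def ttr(utterances):
    """Type-token ratio: distinct words divided by total words."""
    all_tokens = [t for u in utterances for t in tokens(u)]
    if not all_tokens:
        return 0.0
    return len(set(all_tokens)) / len(all_tokens)


# The child's utterances from the CHAT sample, postcodes removed.
CHILD = ["yeah.", "yeah.", "okay.", "look at this."]

print(mlu_words(CHILD))  # 6 words over 4 utterances
print(ttr(CHILD))        # 5 distinct words over 6 tokens
```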
Profiles • Chains • Cooccur • Dist • CHIP • KeyMap • TimeDur
Phonology • MakeMod • ModRep • PhonFreq • UniCode • Inventory (in progress, LIPP, CompProf) • Process Analysis (in progress)
Utilities • Dates • Rely • Lines • SaltIn • Check
The Database • English - 25 corpora • Non-English - 18 languages • Clinical - 14 corpora, aphasia, SLI, Down, autism, Williams, and other groups • Narrative - Frog stories, Red Balloon • Childhood Bilingualism • Adult Second Language Learning
Morphology • MOR • Post, PostTrain -- Christophe Parisse • Parse -- Kenji Sagae • --> revised DSS, LARSP, IPSyn • MinMor for 14 languages • MaxMor for English, Spanish, Italian, Hungarian, Dutch, German
New Technologies • Sonic CHAT • Bullets • QuickTime Movies • Sound editor by wave • Movie editor by dragging • Fast mode editing • Web streaming of audio and video
Sample Topics • Past tense debate • Functional categories, tenseless verbs • Verb frame generalization • Fine-tuning of the input • Theory of mind • Lexical range and communicative context • MLU and vocabulary growth in disorders
Research based on CHILDES • Over 1200 published studies • Syntax • Morphology • Discourse • Lexicon • Narrative, Literacy • Language Impairments • Phonology
Allied Efforts • JCHAT, Chinese, Korean • Dutch, Nordic, Celtic • Romance (Italian, Spanish, Portuguese) • Slavic (Krakow, Vienna) • Bilingualism -- Catalan, Basque • Frogs, Disorders, Code-switching • Classroom discourse
Format Babel Alembic Annotator Archivage CA CHAT COCOSDA CSAE CSLU DAISY DAMSL Delta DRI EAGLES Emu Festival FSA’s GATE HIAT Hyperlex Intex ISIP LDC MATE MICASE MPEG MPI Multitext Observer Partitur Praat SABLE SAMPA SGREP SignStream SIL SLAM SMDL SNACK StandOff SUSANN TalkBank TEI Tipster Transcriber TreeBank TSNLP Unicode UTF
Anthropology on the Web Chagnon’s Yanomamö
Confidentiality Levels 1 - fully public 2 - copying block 3 - transcripts public, audio/video protected 4 - non-disclosure 5 - non-disclosure, no copying 6 - data-viewing with approval 7 - data-viewing under direct supervision 8 - archived only
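One way to carry these eight levels in software is as an ordered enumeration. The Python sketch below is an assumption-laden illustration, not TalkBank's implementation: the member names are invented from the slide's wording, and treating a higher number as more restrictive is an assumption about how the scale is meant to be read.

```python
from enum import IntEnum


class Confidentiality(IntEnum):
    """The eight levels from the slide; member names are hypothetical."""
    FULLY_PUBLIC = 1
    COPYING_BLOCK = 2
    TRANSCRIPTS_PUBLIC_MEDIA_PROTECTED = 3
    NON_DISCLOSURE = 4
    NON_DISCLOSURE_NO_COPYING = 5
    VIEWING_WITH_APPROVAL = 6
    VIEWING_UNDER_SUPERVISION = 7
    ARCHIVED_ONLY = 8


def strictest(levels):
    """Pick the most restrictive level, assuming higher numbers are stricter."""
    return max(levels)


# Example: a corpus whose transcripts are level 3 but whose video is level 6
# would be governed by the stricter of the two under this assumption.
combined = strictest([Confidentiality(3), Confidentiality(6)])
print(combined.name)
```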