Timeframes and Corpus Analysis

Brian MacWhinney
CMU - Psychology, Modern Languages, LTI; SDU - IFKI
5/22/04

Goals of this talk
• Explain the theory of meshed time frames
• Explain how the TalkBank principles derive from this vision
• Characterize possible analysis types in principle
• Show what you can do with TalkBank and related tools in practice

The Core Idea
• Human communication is a single unified process.
• However, patterns in communication are analyzed by 20 different fields.
• The time scales of the processes vary across 7 spatio-temporal frames.

7 spatio-temporal frames
• Phylogenetic (Evolutionary)
• Epigenetic (Embryological)
• Developmental
• Processing
• Social
• Diachronic
• Interaction

Data Capture
• All of the space-time frames must show their effects and be conditioned in actual moments in time and space.
• We can capture The Moment and The Place on video.
• However, we will need to compare across time and space to understand the texture of the process.

A sample moment: Transcript linked to video

Meshing of space-time scales
Orloj of Prague -- 1490

The Antikythera mechanism – Greece, 150 BC

Transforming Science
• Science can get locked into repetitive loops.
• Breaking out of loops comes from adding additional constraints and considerations.
• This usually involves adding new data types, data slices, or time frames.
• This can be viewed as linking together a wider data network.

To achieve linking, we must have
• Rich data
• Data sharing
• Interoperability
• Open access

Rich Data
• For data depth, we need
  • Good recording
  • Good microanalytic methods
• For data breadth, we need
  • Sharing across projects – no navigator can map the world alone
• This then leads to the need for data-sharing and interoperability

CHILDES and TalkBank

Chomsky (1962)
Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description would be no more than a mere list.

The Rise of Corpus Studies
During the last ten years of LLBA citations, there was a 50% drop in citations of Chomsky and a 100% rise in citations of “corpus”.
-- Östen Dahl

Data Sharing
• 42 reasons not to share data
• The reason to share: it is our responsibility
• The solutions:
  • Methods for password protection
  • Methods for anonymization
  • Credit to contributor
  • Group commitment

Interoperability
• Format Babel: 86 formats
• Program Babel: 55 programs
The solutions:
• CHAT XML
• Roundtrip converters for 8 formats
• Program uniformity (nice but not crucial)

The Access Problem
• Missing pieces
  • YouTube has no transcripts
  • BNC transcripts have no audio, etc.
• Corpora in people’s closets or private servers
• We need these data for our students
• Maybe when I retire

The Access Solution
• No licenses, fees, or barriers
• Open to every browser
• Programs run directly
• Direct commentary
• But … some protection is needed and some anonymization for some data types

Analysis Methods
• Bag of Words
• QDA, a.k.a. Hand Coding
• Tagging, a.k.a. Automatic Coding
• Profiles, a.k.a. Canned Analyses
• Group/treatment comparisons
• CA Analysis
• Gesture Analysis
• Phonetic Analysis
• Collaborative Commentary
• Error analysis
• Longitudinal analysis
• Modeling

1. Bag of Words (BoW)
• Basic method of Corpus Linguistics
• For written data, there are many resources: Google, BNC, Libraries, LDC
• But for spoken data, TalkBank is the major open source
• Core BoW analyses support
  • Usage-based learning models in L1 and L2
  • Theories in eight other areas

BoW Methods
• Basic Programs (CLAN and BNC)
• FREQ (BNC links to t-tests) / STATFREQ
• KWAL with windows
• COMBO (regular expressions)
• WebCLAN (limited)
• Download and run locally
• X-Query Search Engine (in preparation)
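The two core bag-of-words operations named above, frequency counting and keyword search with a context window, can be sketched in a few lines of Python. This is only an illustration of the idea, not CLAN's FREQ or KWAL: the function names `speaker_words`, `freq`, and `kwal` and the sample transcript are hypothetical, and the line-based window is an assumption.

```python
import re
from collections import Counter

def speaker_words(lines, speaker=None):
    """Yield (line_index, words) for main-tier lines such as '*CHI: want cookie .'"""
    for i, line in enumerate(lines):
        m = re.match(r"\*(\w+):\s*(.*)", line)
        if not m:
            continue  # skip headers (@...) and dependent tiers (%...)
        if speaker and m.group(1) != speaker:
            continue
        yield i, re.findall(r"[a-z']+", m.group(2).lower())

def freq(lines, speaker=None):
    """Count word tokens, optionally for one speaker only (a FREQ-like tally)."""
    counts = Counter()
    for _, words in speaker_words(lines, speaker):
        counts.update(words)
    return counts

def kwal(lines, keyword, window=1):
    """Return each line containing the keyword plus `window` lines of context."""
    hits = []
    for i, words in speaker_words(lines):
        if keyword in words:
            lo, hi = max(0, i - window), min(len(lines), i + window + 1)
            hits.append(lines[lo:hi])
    return hits

# Hypothetical mini-transcript in a CHAT-like layout.
sample = [
    "@Begin",
    "*MOT: do you want a cookie ?",
    "*CHI: want cookie .",
    "*MOT: here is the cookie .",
    "@End",
]
```

Calling `freq(sample, "CHI")` tallies only the child's words, and `kwal(sample, "want")` returns each matching utterance with its neighbors.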

BoW Methods
• FREQ -> STATFREQ -> EXCEL
• KWAL -> clickable output
• Limiting through GEM
  • @Bg: conversation ending
  • ….
  • @Eg: conversation ending
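Limiting an analysis to the stretch between an @Bg and an @Eg marker amounts to a simple filter. The sketch below assumes each marker sits on its own line with the label after the colon, as in the slide; `gem_segments` and the sample lines are hypothetical and do not reproduce CLAN's GEM program.

```python
def gem_segments(lines, label):
    """Collect the lines between '@Bg: label' and the matching '@Eg: label'."""
    segments, current = [], None
    for line in lines:
        if line.startswith("@Bg:") and line[4:].strip() == label:
            current = []                      # open a new segment
        elif line.startswith("@Eg:") and line[4:].strip() == label:
            if current is not None:
                segments.append(current)      # close the open segment
            current = None
        elif current is not None:
            current.append(line)              # inside a segment: keep the line
    return segments

# Hypothetical transcript fragment.
transcript = [
    "*MOT: time to go .",
    "@Bg:\tconversation ending",
    "*MOT: say bye bye .",
    "*CHI: bye bye .",
    "@Eg:\tconversation ending",
    "*MOT: where are your shoes ?",
]
```

The returned segments can then be passed to any counting or search routine, so that the analysis covers only the marked activity.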

2. Qualitative Data Analysis (QDA) = Coding
• Build Coding System
• Use Coder’s Editor to insert codes
• Use RELY to compare coder accuracy
• RELY output pinpoints disagreements
• Click and play disagreements to refine coding system
Examples: Rollins INCA, MUMIN in Anvil
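Comparing two coders and pinpointing where they disagree can be illustrated as follows. The slide does not say which statistic RELY reports, so the use of Cohen's kappa here is an assumption, and the function names and code labels are hypothetical.

```python
from collections import Counter

def disagreements(codes_a, codes_b):
    """List (utterance_index, code_a, code_b) wherever the two coders differ."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(codes_a, codes_b)) if a != b]

def cohen_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two coders over the same utterances."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must code the same non-empty set of utterances")
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Agreement expected by chance, from each coder's marginal code frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)

# Hypothetical codes: Q = question, S = statement.
coder_1 = ["Q", "Q", "S", "S"]
coder_2 = ["Q", "S", "S", "S"]
```

The indices returned by `disagreements` are the utterances a team would replay and discuss when refining the coding system.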

Speech Act Coding

QDA through Naked Video
• Terabytes of video
• Speechome, Classroom, Resident Care
• No transcripts
• Occasional sign posts
• Sparse speech recognition
• Automatic video analysis

3. Tagging
• Morphosyntax – MOR, POST
• 12 languages
• Some languages need more training
• With correct transcription, accuracy is at 98%
• MOR generates tags
• POST disambiguates
• POSTMORTEM examines residual issues

Tagging (cont.)
• GRASP uses output of MOR to add grammatical relation (GR) dependency structure with 38 relations.
• English, Japanese, Hebrew, Spanish
• Accuracy is at 93%; more work is still needed
• Tagging for CA categories?
  • Eckhardt, Mondada, & Wagner

Searchable Features

Propositional Tagging: Polycythemia - Frederiksen

4. Language Profiles
• Phonological inventories, TAKI
• DSS (English, Japanese)
• IPSyn
• MORTABLE
• Parts of speech
• Grammatical morphemes

AphasiaBank Classification

Clinician Types by K-means clusters

5. Group Comparisons
• Pretest – Treatment – Posttest
• Measure gain scores – AphasiaBank Wright
• L2 increases in fluency (Praat and TIMEDUR) from 4/3/2 training – Nel de Jong
• Classroom discourse
• Accountable discourse
• MacWhinney and Arkenberg
• Lauren Resnick, Beth Warren, Sarah Michaels
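A gain score in a pretest, treatment, posttest design is the posttest measure minus the pretest measure for each participant, averaged by group. The sketch below shows only that arithmetic; the participant IDs, scores, and group labels are invented for illustration and do not come from AphasiaBank or any other corpus.

```python
from statistics import mean

def gain_scores(pre, post):
    """Posttest minus pretest for every participant measured at both points."""
    return {pid: post[pid] - pre[pid] for pid in pre if pid in post}

def mean_gain_by_group(pre, post, groups):
    """Average the gain scores within each group (e.g. treatment vs. control)."""
    by_group = {}
    for pid, gain in gain_scores(pre, post).items():
        by_group.setdefault(groups[pid], []).append(gain)
    return {group: mean(gains) for group, gains in by_group.items()}

# Hypothetical scores on some fluency measure.
pre = {"p1": 10, "p2": 12, "p3": 11}
post = {"p1": 15, "p2": 13, "p3": 11}
groups = {"p1": "treatment", "p2": "control", "p3": "control"}
```

With real data, the per-group gains would normally go on to a significance test rather than being compared by eye.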

6. CA Analysis
• CA Database
• SamtaleBank (CALPIU?)
• STEM/L2 classroom data
• Newport Beach, Watergate, CallFriend
• Koschmann Competency
• Santa Barbara

CA Corpora?
The Database / Our Corpora / My Corpus / My Transcript

CA Tools
• Overlap alignment through CAFont and INDENT
• Removal of constraints on sentences, focus on TCUs and turns
• Line numbers on and off
• Alignment to audio – sonic CHAT
• Special characters


7. Gesture Analysis
• Detailed tiers in ANVIL – MUMIN, FORM
• Basic time linkage in Elan – HKSL
• Automatic interoperability between ANVIL, Elan, and CLAN
• Microscopic zooming in CLAN
• Links to “sequence” subfiles
• Links to “snapshot” subfiles

In CHAT and CLAN

In ELAN

Torturtid

Overall transcript

Sequence Subfiles
• Three parts
• Each part has components
• Each part linked
• Each part displayed

Snapshot Files
*C: (0.2) participants↗ ↓ ⁎⌈o::r⌉⁎
%vis: --------1--------|---------2-----------|-----------3--
1. On uttering the syllable "ci", C reaches for a pencil with her right hand and paper with her left hand.
2. On uttering the syllable "pants", C grabs a pencil with her right hand and the paper with her left hand.
3. On "or", she lifts the paper from the table.

8. PHONCLAN Praat 5/22/04 45

Phonetic Data 5/22/04 46

9. Collaborative Commentary 5/22/04 47

Comment Tagging, Filtering
• Automatic: author, date, media begin-end
• Author self-characterized metadata (role, faction, position, credentials)
• Commentary type (refutation, defense, elaboration, analogy, statistics, case law, gesture-speech match)
• Filters: only teacher, only from colleagues, etc.

10. Error Analysis
• Basic to work in CHILDES, BilingBank, and AphasiaBank
• Main line coding system
  • goed [: went] [* +ed]
  • I want 0to go home.
• Complete system for aphasia, speech errors
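The two main-line codes shown above, a replacement with an error code and a zero-prefixed omitted word, are regular enough to extract with regular expressions. The sketch below handles only these two patterns as they appear on the slide; the function names are hypothetical, and the full CHAT error-coding system has many more cases than this covers.

```python
import re

# produced form, then [: target], then [* error code]
REPLACEMENT = re.compile(r"(\S+)\s+\[:\s*([^\]]+?)\s*\]\s*\[\*\s*([^\]]*?)\s*\]")
# a word prefixed with 0 marks an omitted word, e.g. 0to
OMISSION = re.compile(r"\b0([A-Za-z']+)")

def replacements(utterance):
    """Return (produced, target, error_code) triples found on a main line."""
    return REPLACEMENT.findall(utterance)

def omissions(utterance):
    """Return the words marked as omitted on a main line."""
    return OMISSION.findall(utterance)

# Utterances built from the two examples on the slide.
line_1 = "*CHI: I goed [: went] [* +ed] home ."
line_2 = "*CHI: I want 0to go home ."
```

Tallying the error codes returned by `replacements` across a corpus gives a simple error profile per speaker.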

11. Sequential Analysis
• Variation sets, recasting, CHIP, fine-tuning
• If CDS has “want X”, does the child increase use of “want go home”?
• Code sequences through CHAINS and KeyMap
• Phonological Model-Replica analysis
• Richer analysis through MacShapa
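The question of whether the child uses a word more often right after the adult has used it can be framed as a comparison of two rates over adjacent turns. The sketch below is a simplified illustration under that assumption, not the procedure used by CHIP or CHAINS; `uptake_rates`, the speaker codes, and the sample dialogue are hypothetical.

```python
import re

def parse(line):
    """Split a main-tier line into (speaker, words); return None for other lines."""
    m = re.match(r"\*(\w+):\s*(.*)", line)
    return (m.group(1), re.findall(r"[a-z']+", m.group(2).lower())) if m else None

def uptake_rates(lines, word, adult="MOT", child="CHI"):
    """Rate of child use of `word` after adult turns with vs. without it."""
    turns = [t for t in map(parse, lines) if t]
    counts = {True: [0, 0], False: [0, 0]}  # adult used word -> [child uses, pairs]
    for (s1, w1), (s2, w2) in zip(turns, turns[1:]):
        if s1 == adult and s2 == child:
            primed = word in w1
            counts[primed][1] += 1
            counts[primed][0] += int(word in w2)
    return {k: (used / total if total else None) for k, (used, total) in counts.items()}

# Hypothetical dialogue.
dialogue = [
    "*MOT: do you want juice ?",
    "*CHI: want juice .",
    "*MOT: look at the dog .",
    "*CHI: doggie .",
    "*MOT: you want more ?",
    "*CHI: more .",
    "*MOT: here it is .",
    "*CHI: okay .",
]
```

The `True` entry is the child's rate after adult turns containing the word and the `False` entry is the baseline; a real analysis would need far more turns and a test of the difference.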
