3. Using typological databases in historical linguistic research

Prerequisites for using typological features for inferring phylogenies
• An adequate amount of data structured in an adequate way
• A proper selection of features based on their stabilities

What counts as an adequate amount of information structured in an adequate way is an open question. Let's look at examples of phylogenies for (almost) the same set of languages, based on
• lexical data from ASJP
• typological data from Jazyki Mira
• typological data from WALS

Abkhaz (abk) Azerbaijani (North) (azb) Bashkir (bak) Bengali (ben) Breton (bre) Bulgarian (bul) Burushaski (bsk) Catalan (cat) Chechen (che) Chukchi (ckt) Chuvash (chv) Czech (ces) Danish (dan) Dutch (nld) Finnish (fin) French (fra) Georgian (kat) Hebrew Modern (heb) Hungarian (hun) Icelandic (isl) Italian (ita) Itelmen (itl) Kabardian (kbd) Ket (ket) Khanty (kca) Kirghiz (kir) Komi Zyrian (kpv) Lezgian (lez) Nenets (yrk) Ossetic (Osetin) (oss) Persian (pes) Polish (pol) Portuguese (por) Russian (rus) Selkup (sel) Swedish (swe) Tatar (tat) Ukrainian (ukr) Uzbek (uzn) Yakut (sah)

Georgian (kat) Chechen (che) Lezgian (lez) Uzbek (uzn) Abkhaz (abk) Azerbaijani (North) (azb) Kabardian (kbd) Ossetic (Osetin) (oss)

ASJP

Jazyki Mira

WALS

• The amount of data for the languages in Jazyki Mira (JM): currently unknown (Oleg is working on it).
• The amount of data for this language set in WALS: between 37 and 136 features (average: 86.5). That is as good as it gets in WALS.

The relation between the amount of data and the performance in establishing phylogenies in WALS

Correlations of WALS distances with the Ethnologue classification (dotted lines) and the WALS classification (solid lines). In each group of curves, the lowest represents the sample of languages with at least 20 attested features, and successively higher curves represent languages with at least 40, 60, 80, and 100 attested features.
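The slides do not say how a WALS distance is computed or how the coverage samples are built. As a rough sketch only, assume the distance between two languages is the share of mismatching values among the features attested in both, and that a sample keeps the languages with at least a given number of attested features. The function names and the toy feature labels (`F1`, `F2`, `F3`) are made up for illustration.

```python
def attested_count(features):
    """Number of features with a recorded (non-None) value."""
    return sum(1 for value in features.values() if value is not None)


def filter_by_coverage(data, min_attested):
    """Keep only languages with at least `min_attested` attested features."""
    return {
        lang: features
        for lang, features in data.items()
        if attested_count(features) >= min_attested
    }


def typological_distance(a, b):
    """Share of mismatching values among features attested in both languages.

    Returns None when the two languages have no attested feature in common.
    This definition is an assumption, not necessarily the one behind the plot.
    """
    common = [f for f in a if a[f] is not None and b.get(f) is not None]
    if not common:
        return None
    mismatches = sum(1 for f in common if a[f] != b[f])
    return mismatches / len(common)


# Toy data with invented values (not real WALS entries).
data = {
    "lang_a": {"F1": 1, "F2": 2, "F3": None},
    "lang_b": {"F1": 1, "F2": 3, "F3": 1},
    "lang_c": {"F1": None, "F2": None, "F3": 2},
}
kept = filter_by_coverage(data, 2)
dist_ab = typological_distance(data["lang_a"], data["lang_b"])
```

Raising the threshold shrinks the sample but makes each pairwise distance rest on more shared features, which is the trade-off the curves illustrate.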

Figure 5. Results of mixing WALS and ASJP distances: correlations with the WALS classification (solid lines) and the Ethnologue classification (dotted lines) as a function of the percentage of ASJP data in the mixture. In each group of curves, the lowest (on the left side of the graph) represents the sample of languages with 20 attested features, and successively higher curves represent languages with 40, 60, 80, and 100 attested features, respectively.
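The caption does not spell out how the two kinds of distances are mixed. One simple reading, assumed here purely for illustration, is a linear blend in which a share of each pairwise distance comes from ASJP and the rest from WALS. The function name, the pair keys and the numbers below are invented.

```python
def mix_distances(wals, asjp, asjp_share):
    """Blend two distance tables keyed by language pair.

    `asjp_share` is the fraction (0.0 to 1.0) of the mixture taken from ASJP.
    Assumes both tables are on a comparable scale; only pairs present in
    both tables are kept.
    """
    if not 0.0 <= asjp_share <= 1.0:
        raise ValueError("asjp_share must lie between 0 and 1")
    shared_pairs = wals.keys() & asjp.keys()
    return {
        pair: (1.0 - asjp_share) * wals[pair] + asjp_share * asjp[pair]
        for pair in shared_pairs
    }


# Toy distances (invented numbers, not real WALS or ASJP values).
wals = {("dan", "swe"): 0.10, ("dan", "fin"): 0.40}
asjp = {("dan", "swe"): 0.30, ("dan", "fin"): 0.90}
mixed = mix_distances(wals, asjp, 0.5)
```

Sweeping `asjp_share` from 0 to 1 and correlating each mixed table with a reference classification would give one curve of the kind shown in the figure.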

So in spite of the problems with WALS-type features, an ASJP-type classification can be improved when combined with WALS features. But if we don't want to make a selection we need to figure out which are the most stable features. How do we do that?

“…we are far from being able to reduce the different stabilities and viabilities of various linguistic elements to precise numbers…” (Nichols 2003: 283)

Here's what to do:
• Invent some metric or find suggestions in the literature
• Test its performance on a simulated dataset where the stabilities are preset
• Apply it to an empirical dataset
• Look at how the results compare to other people's claims
• Explain the results
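The second step, testing on simulated data with preset stabilities, can be sketched as follows. The evolutionary model is a deliberately crude assumption: each family has one ancestral value and every daughter language keeps it with probability equal to the preset stability. Plain within-family agreement stands in for whichever metric is under test. None of the names or parameters come from the slides.

```python
import random
from itertools import combinations


def simulate_feature(stability, n_families, langs_per_family, n_values, rng):
    """Simulate one feature; returns {family index: [value per language]}.

    Each family draws a random ancestral value. Every daughter keeps it with
    probability `stability`, otherwise it switches to a different value.
    Requires n_values >= 2 and langs_per_family >= 2.
    """
    families = {}
    for fam in range(n_families):
        ancestral = rng.randrange(n_values)
        daughters = []
        for _ in range(langs_per_family):
            if rng.random() < stability:
                daughters.append(ancestral)
            else:
                others = [v for v in range(n_values) if v != ancestral]
                daughters.append(rng.choice(others))
        families[fam] = daughters
    return families


def within_family_agreement(families):
    """Proportion of same-family language pairs sharing a value."""
    same = total = 0
    for daughters in families.values():
        for a, b in combinations(daughters, 2):
            total += 1
            same += a == b
    return same / total


rng = random.Random(0)  # fixed seed so the run is repeatable
preset = [1.0, 0.7, 0.4]  # stabilities chosen in advance
recovered = [
    within_family_agreement(simulate_feature(s, 50, 6, 4, rng))
    for s in preset
]
```

A metric passes this kind of test if the scores in `recovered` rank the features in the same order as `preset`.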

Three different metrics
• Metric A: For the genetic groups and the areal groups, count what percentage of the languages share the feature value that is best represented within each group, taking into account the number of values of this best-represented feature and the number of languages sharing it, not just the proportion (Wichmann and Kamholz, forthc.)
• Metric B: Same as A, but not taking into account the number of features involved (Nichols 1995)
• Metric C: Measure the proportion of language pairs within genera that have the same value for a given feature, and weight this by the proportion of unrelated language pairs that share the same value (Wichmann and Holman, under review)
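Metric C lends itself to a short sketch. Let R be the proportion of same-genus pairs sharing a value and U the proportion of unrelated pairs sharing a value. The slide only says that R is weighted by U. The correction (R - U) / (1 - U) used below is an assumption and should be checked against the Wichmann and Holman paper. As a further simplification, languages in different genera are treated as unrelated.

```python
from itertools import combinations


def metric_c(values, genus):
    """Stability score for one feature.

    values: language -> feature value (None if unattested)
    genus:  language -> genus name
    Returns None when the score is undefined (no pairs of one kind, or U == 1).
    """
    langs = [lang for lang in values if values[lang] is not None]
    related_same = related_total = 0
    unrelated_same = unrelated_total = 0
    for a, b in combinations(langs, 2):
        same = values[a] == values[b]
        if genus[a] == genus[b]:
            related_total += 1
            related_same += same
        else:
            unrelated_total += 1
            unrelated_same += same
    if related_total == 0 or unrelated_total == 0:
        return None
    r = related_same / related_total
    u = unrelated_same / unrelated_total
    if u == 1.0:
        return None
    # Assumed form of the weighting: agreement beyond the chance level U.
    return (r - u) / (1.0 - u)


# Toy example with invented feature values.
genus = {"dan": "Germanic", "swe": "Germanic", "fra": "Romance", "ita": "Romance"}
stable = metric_c({"dan": 1, "swe": 1, "fra": 2, "ita": 2}, genus)
unstable = metric_c({"dan": 1, "swe": 2, "fra": 1, "ita": 2}, genus)
```

With this correction, a feature whose values agree within genera no more often than between unrelated languages scores around zero, however common its most frequent value is.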

Performances of the Wichmann/Holman Metric C, the Wichmann/Kamholz Metric A, and the Nichols Metric B for different situations of data coverage
PL: Probability that a language is included in the sample
PF: Probability that a feature is attested for a language
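One way to produce such coverage situations from a complete simulated dataset is to thin it at random using PL and PF. This is a sketch under that assumption. The slides do not describe the actual procedure, and the function and variable names are hypothetical.

```python
import random


def thin_dataset(data, pl, pf, rng):
    """Return a thinned copy of `data` (language -> {feature: value}).

    Each language is kept with probability `pl`; within a kept language,
    each feature value is kept with probability `pf` and otherwise set to
    None to mark it as unattested.
    """
    thinned = {}
    for lang, features in data.items():
        if rng.random() >= pl:
            continue  # language is left out of the sample
        thinned[lang] = {
            feat: (value if rng.random() < pf else None)
            for feat, value in features.items()
        }
    return thinned


# Toy complete dataset with invented values.
full = {
    "lang_a": {"F1": 1, "F2": 2},
    "lang_b": {"F1": 1, "F2": 3},
}
sample = thin_dataset(full, pl=0.8, pf=0.5, rng=random.Random(0))
```

Running each metric on many such thinned copies, for several combinations of PL and PF, shows how quickly its performance degrades as coverage drops.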

Let's interpret the stabilities
• My show's over
• Your turn

Revisit these lectures:
• Google “Soeren Wichmann”, go to “for students” and look at the slides
• There is a special home page for the ASJP project. Check it out

• Wichmann, Søren and Eric W. Holman. Under review. Assessing temporal stability for linguistic typological features.
• Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller, Pamela Brown, and Dik Bakker. In press. Explorations in automated lexicostatistics. Folia Linguistica (scheduled for 2008).
• Wichmann, Søren and Arpiar Saunders. 2007. How to use typological databases in historical linguistic research. Diachronica 24.2: 373-404.
• Wichmann, Søren and David Kamholz. In press. A stability metric for typological features. Sprachtypologie und Universalienforschung.
