Data Management and Linguistic Analysis: MDS applied to RODA Sheila M. Embleton, Dorin Uritescu & Eric S. Wheeler York University, Toronto, Canada
Order of Presentation • Context • Romanian and RODA • RODA as Linguistic Technology • Examples • Latin Word-final /u/ • Non-palatalized dentals before front vowels • MDS • MDS as an analytic tool • MDS and Romanian Dialects
Romania Source: http://en.wikipedia.org/wiki/Romanian_language#Geographic_distribution
Romanian • 22+ million speakers • critical exemplar of eastern Romance language family
Noul Atlas lingvistic român. Crişana • Crişana region in north-west Romania • Hard copy atlas by Stan and Uritescu (1996, 2003) • Digitize to make it more accessible
Objective • Use Information Technology to permit a broad range of scholars to • access the data, • select the data appropriately, and • present the data clearly; and so gain greater understanding of its significance.
State of the Project (Nov 2007) • Have entered all 407 maps from Vol. I and II • Proofread twice • Consulted source slips, when needed • Have developed search and mapping tools to access the digital data • Initial version now posted at: http://vpacademic.yorku.ca/romanian
The technology allows one to: • View the data • Search for data and count it • Interpret the data or the counts • Analyze the data (e.g. MDS) • See the results as maps • Save the maps as .jpg pictures • Save the results for later use • Hear samples of the data
RODA: function • Custom-defined maps • You select the data • You see the result as a map • Programmable access to the whole set of digitized data • You ask about data spread over many maps • You can customize what you search for (not just the editor’s choice)
RODA: search of data • Context of search becomes important • Word-final vs non-final vs either • Plain character vs accented character • Character vs (superposed) alternate • Choice of fields to search • E.g. With nouns: sg. vs pl. entries • Variations heard by field workers • Flags to mark special situations (e.g. hesitation)
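The search contexts listed above can be illustrated with a minimal sketch. It is not RODA's actual search code: the function names, the `position` and `ignore_accents` parameters, and the sample forms are all hypothetical, and accent handling is assumed to mean stripping Unicode combining marks.

```python
import unicodedata


def strip_accents(text):
    """Remove combining diacritics so an accented character matches its plain form."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_matches(forms, char, position="either", ignore_accents=False):
    """Count the forms that contain the single character `char` in the given position.

    position is "final", "non-final", or "either" (hypothetical option names).
    """
    if position not in ("final", "non-final", "either"):
        raise ValueError("position must be 'final', 'non-final' or 'either'")
    count = 0
    for form in forms:
        text = strip_accents(form) if ignore_accents else form
        is_final = text.endswith(char)
        is_non_final = char in text[:-1]
        if position == "final":
            hit = is_final
        elif position == "non-final":
            hit = is_non_final
        else:
            hit = is_final or is_non_final
        if hit:
            count += 1
    return count


# Invented sample forms, not taken from the atlas.
forms = ["lupu", "lup", "ursú", "casă"]
print(count_matches(forms, "u", position="final"))
print(count_matches(forms, "u", position="final", ignore_accents=True))
print(count_matches(forms, "u", position="non-final"))
```

The same query gives different counts depending on the context chosen, which is why the context of a search has to be stated along with its result.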
Crişana, Romania (from RODA)
Seeing Words Change: Word-final /u/ in Latin and non-Latin words
Is word-final /u/ random? • Look for a geographic pattern over all potential occurrences • Maps for single examples, such as /ochi/ and others, are in the hard-copy dialect atlas • But the total data for all examples is spread widely over many maps
Word-final /u/ • Data from: • 407 maps • Field 1 • Size of cross shows the number of occurrences • Horizontal = syllabic • Vertical = non-syllabic
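The cross symbol described above encodes two counts per location. A minimal sketch of that tally follows; the function name, the `(location, kind)` record shape, and the sample records are assumptions for illustration, not RODA's data model.

```python
from collections import Counter


def cross_sizes(occurrences):
    """Turn a list of (location, kind) records into arm lengths for each cross.

    kind is "syllabic" (horizontal arm) or "non-syllabic" (vertical arm).
    """
    counts = Counter(occurrences)
    locations = sorted({location for location, _ in occurrences})
    return {
        location: {
            "horizontal": counts[(location, "syllabic")],
            "vertical": counts[(location, "non-syllabic")],
        }
        for location in locations
    }


# Invented records, one per occurrence of word-final /u/.
records = [
    ("A", "syllabic"),
    ("A", "syllabic"),
    ("A", "non-syllabic"),
    ("B", "non-syllabic"),
]
print(cross_sizes(records))
```

A location with no syllabic occurrences gets a horizontal arm of length 0, so the cross reduces to a vertical bar.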
Syllabic and non-syllabic /u/ • Data from: • Selected maps • Field 1 • Word-final or non-word-final • Size of cross shows the number of occurrences • Horizontal = syllabic • Vertical = non-syllabic
Word-final, syllabic /u/ • Data from: • 407 maps • Field 1 • word-final only • (horizontal = vertical) • Locations 137, 141, 146 show most examples
Word-final, syllabic /u/ • Can review the data
Word-final, syllabic /u/ • Data from: • selected maps • Field 1 • word-final only • removed non-vocalic /u/, def. art., some clusters + /u/ • (horizontal = vertical) • Locations 137, 141, 146 show most examples
/u/ Pattern • There is a pattern: • Word-final /u/ is retained in central and north-eastern areas • It is syllabic mostly in parts of the central area • The locations with most frequent syllabic final /u/ do not form a continuous area
Dialect sub-regions • Some locations have a given feature; others do not. • On the basis of such (sometimes limited) examples, linguists posit the existence of dialect sub-regions. • MDS analysis of “all” data raises questions about the nature of these sub-regions.
Non-palatalized dentals before front vowels • Crişana: dentals before front vowels are palatalized. • Are they restructured as palatals? • If the process is no longer productive, there may be non-palatalized dentals before front vowels. • If so, where, in what forms and what is the frequency?
Non-palatalized dentals before front vowels • Examples everywhere. • (As is well-known, dentals are not palatalized in Oaş, except for 220.) • Map shows where and how many examples.
Non-palatalized dentals before front vowels • There are examples everywhere (not only in Oaş) • Here we establish a result with the location and frequency of examples. • Can view the examples that support the conclusion.
MDS as Analytic tool • In addition to select, search, count and map functions, RODA can have special-purpose analytic tools. • A built-in MDS tool allows us to create MDS maps based on any selected set of data. • Other analytic techniques could also be implemented.
MDS Process-1 Multidimensional scaling (MDS) uses the “linguistic distance” between n+1 locations to place them in an n-dimensional space exactly...
MDS Process-2 MDS projects an n-space onto a 2-space (a map) so that the distances among the points are preserved as well as possible.
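The projection step can be sketched in a few lines. The slides do not say which MDS algorithm RODA uses; the sketch below swaps in a SMACOF-style iteration (repeated Guttman transforms), one common way to fit a 2-D layout to a distance matrix. The function names, the iteration count, the seed, and the toy matrix are all assumptions.

```python
import math
import random


def stress(points, target):
    """Sum of squared gaps between map distances and target distances."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += (math.dist(points[i], points[j]) - target[i][j]) ** 2
    return total


def guttman_step(points, target):
    """One Guttman transform; it never increases the stress."""
    n = len(points)
    new_points = []
    for i in range(n):
        x = y = 0.0
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            ratio = target[i][j] / d if d > 0 else 0.0
            x += ratio * (points[i][0] - points[j][0])
            y += ratio * (points[i][1] - points[j][1])
        new_points.append((x / n, y / n))
    return new_points


def mds_2d(target, iterations=200, seed=0):
    """Place one 2-D point per row of `target`, starting from a seeded random layout."""
    rng = random.Random(seed)
    points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in target]
    for _ in range(iterations):
        points = guttman_step(points, target)
    return points


# Invented distances among three locations.
target = [[0, 2, 2], [2, 0, 1], [2, 1, 0]]
layout = mds_2d(target)
print(layout, stress(layout, target))
```

The stress value measures how far the map is from preserving the distances; with many locations it will generally stay above zero, which is the sense of "as well as possible".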
MDS Process-3 • The linguistic map may or may not correspond to geography • It does give a high-level picture of the total linguistic relationship: all the data used to get the distances is now displayed as a single picture.
Distance measures • Based on linguistic forms being “same” or “not same” • Does not account for forms that are nearly the same: • “cat” ~ “caţ” ~ “feline” • Missing forms are “not same” • Summed over many comparisons
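The measure described above amounts to counting the maps on which two locations do not give exactly the same form. A minimal sketch, assuming each location's data is a list with one response per map and `None` for a missing form (the function names and sample data are hypothetical):

```python
def linguistic_distance(forms_a, forms_b):
    """Count the maps on which two locations differ.

    Any difference counts as 1, and a missing form (None) is "not same".
    """
    if len(forms_a) != len(forms_b):
        raise ValueError("both locations need one entry per map")
    distance = 0
    for a, b in zip(forms_a, forms_b):
        if a is None or b is None or a != b:
            distance += 1
    return distance


def distance_matrix(responses):
    """Build the full matrix; a location's distance to itself is set to 0."""
    names = sorted(responses)
    matrix = [
        [0 if p == q else linguistic_distance(responses[p], responses[q])
         for q in names]
        for p in names
    ]
    return names, matrix


# Invented data: three locations, three maps.
responses = {
    "A": ["cat", "dog", None],
    "B": ["caţ", "dog", "bird"],
    "C": ["feline", "dog", "bird"],
}
names, matrix = distance_matrix(responses)
print(names)
print(matrix)
```

Note that "cat" vs "caţ" contributes the same 1 as "cat" vs "feline", which is the limitation the slide points out.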
MDS and dialects • Embleton and Wheeler have used an MDS process on • English dialects • Finnish dialects • Dialect roughly correlates with geography
Romanian Dialect groupings • Begin with a hypothesis about dialect groupings in Crişana. • Analyzed all data in 403 maps, using the MDS method. • Identity is exact match; any difference is a difference of 1. • Distance is sum of differences. • We see the groupings on a map.
MDS map: All groups • South-east and South-west are distinct. • The rest are less so. • Suggests the dialect unity of the region • Next step: refine groupings
MDS map: Refined groupings • Still, considerable overlap or closeness • More groups could be identified, e.g.: • Several divisions in West • Two areas in Oaş • Oaş is close to southern areas • Still, its distinctness is clear (cf. also Uritescu 1984a).
Crişana dialect regions When a lot of data is considered: • There is much overlap of regions • A few regions are distinct. It is possible that areas share features in a complex way, based on distance, physical geography and other factors. There is more apparent unity than traditional analyses (based on a few features) would provide.
Further investigation We want to look at: • Differences in vocabulary (rare vs common terms) • Phonetics vs morphology vs syntax • Other definitions of distance
RODA and MDS • RODA provides the large amount of data. • MDS makes the large amount of data readily understandable as a single picture. • Implementing MDS in RODA means that researchers can easily try the approach.
Summary • RODA provides: • Accessible data • Flexible searching and custom presentation • Repeatable processing • MDS makes the data easy to visualize • Result: new linguistic insights based on the greater understanding of the data
Contacts • Sheila Embleton embleton@yorku.ca • Dorin Uritescu dorinu@yorku.ca • Eric Wheeler wheeler@ericwheeler.ca Site: vpacademic.yorku.ca/romanian/