
Data Management and Linguistic Analysis: MDS applied to RODA


Presentation Transcript


  1. Data Management and Linguistic Analysis: MDS applied to RODA • Sheila M. Embleton, Dorin Uritescu & Eric S. Wheeler • York University, Toronto, Canada

  2. Order of Presentation • Context • Romanian and RODA • RODA as Linguistic Technology • Examples • Latin Word-final /u/ • Non-palatalized dentals before front vowels • MDS • MDS as an analytic tool • MDS and Romanian Dialects

  3. Context

  4. Romania Source: http://en.wikipedia.org/wiki/Romanian_language#Geographic_distribution

  5. Romanian • 22+ million speakers • critical exemplar of eastern Romance language family

  6. Noul Atlas lingvistic român. Crişana • Crişana region in north-west Romania • Hard copy atlas by Stan and Uritescu (1996, 2003) • Digitize to make it more accessible

  7. Objective • Use Information Technology to permit a broad range of scholars to • access the data, • select the data appropriately, and • present the data clearly; and so gain greater understanding of its significance.

  8. State of the Project (Nov 2007) • Have entered all 407 maps from Vol. I and II • Twice proof-read • Consulted source slips, when needed • Have developed search and mapping tools to access the digital data • Initial version now posted at: http://vpacademic.yorku.ca/romanian

  9. RODA as linguistic technology

  10. The technology allows one to: • View the data • Search for data and count it • Interpret the data or the counts • Analyze the data (e.g. MDS) • See the results as maps • Save the maps as .jpg pictures • Save the results for later use • Hear samples of the data

  11. RODA: function • Custom-defined maps • You select the data • You see the result as a map • Programmable access to the whole set of digitized data • You ask about data spread over many maps • You can customize what you search for (not just the editor’s choice)

  12. RODA: search of data • Context of search becomes important • Word-final vs non-final vs either • Plain character vs accented character • Character vs (superposed) alternate • Choice of fields to search • E.g. With nouns: sg. vs pl. entries • Variations heard by field workers • Flags to mark special situations (e.g. hesitation)
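The search contexts listed on this slide can be illustrated with a minimal Python sketch. It is not RODA's actual search code: the function names, the `(location, form)` record layout, and the toy forms are all assumptions made for illustration.

```python
import re
import unicodedata
from collections import Counter


def strip_accents(text):
    """Drop combining marks so an accented character matches its plain form."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_matches(form, char, position="either", ignore_accents=False):
    """Count occurrences of `char` in `form`.

    position: "final" (word-final), "nonfinal", or "either".
    ignore_accents: if True, plain and accented characters are treated alike.
    """
    if ignore_accents:
        form, char = strip_accents(form), strip_accents(char)
    else:
        # Compose characters so that an accented letter is a single code point.
        form = unicodedata.normalize("NFC", form)
        char = unicodedata.normalize("NFC", char)
    escaped = re.escape(char)
    if position == "final":
        pattern = escaped + r"(?!\w)"   # not followed by another word character
    elif position == "nonfinal":
        pattern = escaped + r"(?=\w)"   # followed by another word character
    else:
        pattern = escaped
    return len(re.findall(pattern, form))


def tally_by_location(records, char, position="either", ignore_accents=False):
    """Sum the matches per location over hypothetical (location, form) records."""
    counts = Counter()
    for location, form in records:
        counts[location] += count_matches(form, char, position, ignore_accents)
    return counts


# Toy records, invented for illustration only (not taken from the atlas).
records = [(1, "lupu"), (1, "ochi"), (2, "lup")]
final_u = tally_by_location(records, "u", position="final")
```

Per-location counts of this kind are what a map display could then render, for example as symbols sized by the number of occurrences.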

  13. Examples from RODA

  14. Crişana, Romania

  15. Crişana, Romania (from RODA)

  16. Seeing Words Change • Word-final /u/ in Latin and non-Latin words

  17. Word-final /u/ from Latin

  18. Is word-final /u/ random? • Look for a geographic pattern over all potential occurrences • The maps for single examples, such as /ochi/ and others, are in the hard-copy dialect atlas • But the total data for all examples is spread widely over many maps.

  19. Word-final /u/ • Data from: • 407 maps • Field 1 • Size of cross shows the number of occurrences • Horizontal = syllabic • Vertical = non-syllabic

  20. Syllabic and non-syllabic /u/ • Data from: • Selected maps • Field 1 • Word-final or non-word-final • Size of cross shows the number of occurrences • Horizontal = syllabic • Vertical = non-syllabic

  21. Word-final, syllabic /u/ • Data from: • 407 maps • Field 1 • word-final only • (horizontal = vertical) • Locations 137, 141, 146 show most examples

  22. Word-final, syllabic /u/ • Can review the data

  23. Word-final, syllabic /u/ • Data from: • selected maps • Field 1 • word-final only • removed non-vocalic /u/, def. art., some clusters + /u/ • (horizontal = vertical) • Locations 137, 141, 146 show most examples

  24. /u/ Pattern • There is a pattern: • Word-final /u/ is retained in central and north-eastern areas • It is syllabic mostly in parts of the central area • The locations with the most frequent syllabic final /u/ do not form a continuous area

  25. Dialect sub-regions • Some locations have a given feature; others do not. • On the basis of such (sometimes limited) examples, linguists posit the existence of dialect sub-regions. • MDS analysis of “all” data raises questions about the nature of these sub-regions.

  26. Non-palatalized dentals before front vowels

  27. Non-palatalized dentals before front vowels • Crişana: dentals before front vowels are palatalized. • Are they restructured as palatals? • If the process is no longer productive, there may be non-palatalized dentals before front vowels. • If so, where, in what forms, and with what frequency?

  28. Non-palatalized dentals before front vowels • Examples everywhere. • (As is well-known, dentals are not palatalized in Oaş, except for 220.) • Map shows where and how many examples.

  29. Non-palatalized dentals before front vowels • There are examples everywhere (not only in Oaş) • Here we establish a result with the location and frequency of examples. • Can view the examples that support the conclusion.

  30. MDS

  31. MDS as Analytic tool • In addition to select, search, count and map functions, RODA can have special-purpose analytic tools. • A built-in MDS tool allows us to create MDS maps based on any selected set of data. • Other analytic techniques could also be implemented.

  32. MDS Process-1 Multidimensional scaling (MDS) uses the “linguistic distance” between n+1 locations to place them in an n-dimensional space exactly...

  33. MDS Process-2 MDS projects an n-space onto a 2-space (a map) so that the distances among the points are preserved as well as possible.
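The projection step can be sketched with classical (Torgerson) MDS in Python with NumPy. The slides do not say which MDS variant RODA implements, so the choice of classical MDS, the function name `classical_mds`, and the toy distance matrix are assumptions made for illustration.

```python
import numpy as np


def classical_mds(distances, dims=2):
    """Place points in `dims` dimensions from a symmetric distance matrix.

    Classical MDS: double-centre the squared distances, then keep the
    eigenvectors belonging to the largest eigenvalues.
    """
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    # eigh returns eigenvalues in ascending order; take the largest `dims`.
    order = np.argsort(eigenvalues)[::-1][:dims]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    scale = np.sqrt(np.clip(eigenvalues[order], 0.0, None))
    return eigenvectors[:, order] * scale


# Toy input: three points forming a 3-4-5 right triangle (not atlas data).
toy_distances = [[0, 3, 5],
                 [3, 0, 4],
                 [5, 4, 0]]
coords = classical_mds(toy_distances)

# Distances recomputed from the 2-D coordinates, for comparison with the input.
rebuilt = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

Because the toy points already lie in a plane, `rebuilt` reproduces the input distances. With many locations and non-planar distances, the 2-D picture is only the best available approximation, which is the "as well as possible" caveat on the slide.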

  34. Projection to 2-space

  35. MDS Process-3 • The linguistic map may or may not correspond to geography • It does give a high-level picture of the total linguistic relationship: all the data used to get the distances is now displayed as a single picture.

  36. Distance measures • Based on linguistic forms being “same” or “not same” • Does not account for forms that are nearly the same: • “cat” ~ “caţ” ~ “feline” • Missing forms are “not same” • Summed over many comparisons
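A minimal Python sketch of this "same" / "not same" measure follows. The data layout (one dictionary per location, mapping a map identifier to the recorded form, with `None` for a missing form) and the function names are assumptions for illustration. Counting a comparison as "not same" whenever either form is missing is one reading of the slide.

```python
def linguistic_distance(forms_a, forms_b):
    """Sum of differences between two locations over all maps.

    Identical forms contribute 0; any difference contributes 1.
    A missing form (None or absent key) counts as "not same" (assumption).
    """
    map_ids = set(forms_a) | set(forms_b)
    distance = 0
    for map_id in map_ids:
        a = forms_a.get(map_id)
        b = forms_b.get(map_id)
        if a is None or b is None or a != b:
            distance += 1
    return distance


def distance_matrix(locations):
    """Pairwise distances for a dict {location_id: {map_id: form}}."""
    ids = sorted(locations)
    matrix = [[linguistic_distance(locations[p], locations[q]) if p != q else 0
               for q in ids] for p in ids]
    return ids, matrix


# Toy data using the forms quoted on the slide; the locations are invented.
locations = {
    "A": {1: "cat", 2: "cat", 3: None},
    "B": {1: "cat", 2: "caţ", 3: "feline"},
}
ids, matrix = distance_matrix(locations)
```

Note that "cat" vs "caţ" costs exactly as much as "cat" vs "feline", which is the limitation the slide points out: near-identical forms are not given a smaller distance.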

  37. MDS and dialects • Embleton and Wheeler have used an MDS process on • English dialects • Finnish dialects • Dialect roughly correlates with geography

  38. Romanian Dialect groupings • Begin with a hypothesis about dialect groupings in Crişana. • Analyzed all data in 403 maps, using the MDS method. • Identity is exact match; any difference is a difference of 1. • Distance is sum of differences. • We see the groupings on a map.

  39. MDS map: All groups • South-east and South-west are distinct. • The rest are less so. • Suggests the dialect unity of the region • --> refine groupings

  40. MDS map: Refined groupings • Still, considerable overlap or closeness • More groups could be identified, e.g.: • Several divisions in West • Two areas in Oaş • Oaş is close to southern areas • Still, its distinctness is clear (cf. also Uritescu 1984a).

  41. MDS map: Refined groupings

  42. Crişana dialect regions • When a lot of data is considered: • There is much overlap of regions • A few regions are distinct. • It is possible that areas share features in a complex way, based on distance, physical geography and other factors. • There is more apparent unity than traditional analyses (based on a few features) would suggest.

  43. Further investigation We want to look at: • Differences in vocabulary (rare vs common terms) • Phonetics vs morphology vs syntax • Other definitions of distance

  44. RODA and MDS • RODA provides the large amount of data. • MDS makes the large amount of data readily understandable as a single picture. • Implementing MDS in RODA means that researchers can easily try the approach.

  45. Summary • RODA provides: • Accessible data • Flexible searching and custom presentation • Repeatable processing • MDS makes the data easy to visualize • Result: new linguistic insights based on the greater understanding of the data

  46. Contacts • Sheila Embleton embleton@yorku.ca • Dorin Uritescu dorinu@yorku.ca • Eric Wheeler wheeler@ericwheeler.ca Site: vpacademic.yorku.ca/romanian/
