The Ninetieth Anniversary of the LSA: A Commemorative Symposium Morphology: the last 40 years Mark Aronoff January 3, 2014
Preface: Technology and theory • The relation between technology and theory goes both ways • We like to believe that theory leads technology • At least as often it is the other way round • Many of the successes of early science were technology-driven
Antoni van Leeuwenhoek: “In the year of 1675 I discover’d living creatures in Rain water”
A Case Study: Morphological Productivity • Morphological productivity was rarely investigated until the 1980’s • Newly available electronic tools made the quantitative study of morphological productivity possible • New tools have led to breakthroughs in our understanding of both synchronic and diachronic morphology • The tools lead us to question fundamental assumptions about the discreteness of language and the value of the competence/performance distinction
Counting Words: Data Resources and English Morphology • Fundamental discoveries in linguistic morphology over the last half-century have depended on improvements in our ability to count English words • As the resources for counting words have changed and improved, so have our ideas about morphology changed and (we hope) our understanding improved
Laying the Foundations for Studying Morphological Productivity • Early linguistic word data resources were not designed for linguistics, though they were focused on language • Walker 1775 • Thorndike 1921, 1932, 1944 • Only in the 1960’s did the first truly linguistically driven electronic word data resources appear • Brown 1963 (word counts) • Kučera and Francis 1967 (frequency counts)
John Walker: The Godfather of Modern Morphology • Walker’s Rhyming Dictionary. 1775 • Walker’s dictionary has gone through many editions and remains in print • The term rhyming dictionary was misleading, though it was a good selling point • Walker’s dictionary was meant for linguists as much as for poets, though few linguists used it
Notable linguistic remarks from Walker’s original Introduction • As in other Dictionaries words follow each other in an alphabetical order according to the letters they begin with, in this they follow each other according to the letters they end with. • The English Language, it may be said, has hitherto been seen through but one end of the perspective; and though terminations form the distinguishing character and specific difference of every language in the world, we have never before had a prospect of our own, in this point of view.
The Father of Educational Psychology • Thorndike was one of the first American experimental psychologists • Thorndike’s work was a precursor to both behaviorism and modern cognitive psychology • Thorndike spent his entire career at Teachers College, Columbia University • Thorndike is regarded as a founding figure in educational psychology
Thorndike’s word books • Between 1921 and 1944, Thorndike published three frequency-based word books for teachers, to be used in curriculum design • The last edition (Thorndike and Lorge) contained 30,000 words • The books consisted almost entirely of frequency lists: 1/1,000,000; 1/4,000,000; 1000 most frequent • These were the first frequency lists published for any language
A. F. Brown • A. F. Brown was one of the first computational linguists, working at Penn and then at Lehigh • In 1963, he published his Normal and Reverse English Word List, prepared under contract with the Air Force Office of Scientific Research • The list was collated from 18 dictionaries • Each list runs to 400 pages of computer printout, with 100 words per page = 400,000 entries
Kučera and Francis / Francis and Kučera • The Brown Corpus (1964) • 1,014,312 words of running text of edited English prose printed in the United States during the calendar year 1961 • 500 samples of 2000+ words each • Tagged in a variety of ways • Computational Analysis of Present-Day American English (1967) • Frequency Analysis of English Usage (1982) • Approximately 45,000 distinct lemmas listed with their frequencies • Lemmas with adjusted frequency >5/m in rank order
The last 25 years: Large-scale electronic resources • The availability in the last quarter century of large-scale electronic resources has made it possible to study English morphology in hitherto unimagined ways • These resources have changed our perspective on how morphology works • Two types of resources: • Electronic dictionaries • Large corpora
The Oxford English Dictionary • The largest, longest, and most expensive academic publishing project in history • 1857 Inaugurated • 1879 Work begins in earnest • 1933 First full edition • 1989 OED2 • 1992 CD-ROM of OED2 • 2000 – OED Online (by subscription)
À quoi ça sert (l’amour)? • The OED, unlike Webster’s II and others, is a historical dictionary • Recent editions of the OED were designed from the bottom up as electronic resources • The combination allows us to ask questions that we could never before expect to find answers for • We can even ask questions that we might never before have imagined
OED Tools • The OED prides itself on the accuracy of its first citations • The first citations provide the most accurate historical record available in any language of the first use of a word • The ability to use wild cards permits the simple construction of historical timelines for individual affixes • The timelines allow easy and accurate study for the first time of the growth and decline of patterns of affixation in English over the last millennium
What the OED shows us • The system is self-organizing • We can track the emergence of “borrowed” affixes from the borrowing of large numbers of individual words to the productive use of an affix (e.g., -ment, -ation, -ity, -able) • Homonymous affixes compete • The competition between affixes is resolved through competition
Corpora • The Brown corpus, compiled 50 years ago, contained a total of 1 million words • The Google Books database currently contains over 30 million books and over 150 billion words • Other modern large corpora are comparably large and are tagged for part of speech • The COCA corpus contains over 450 million words • Corpora allow for the counting of individual words/lemmas and their frequencies in a corpus
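The kind of counting that corpora make possible can be sketched in a few lines of Python. This is only an illustration: the tokenizer is a crude regular expression, the sample sentence is made up, and the function name `word_frequencies` is an assumption of this sketch. It counts word forms, not lemmas; grouping forms into lemmas would need a tagger or lemmatizer, which is not shown.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return a Counter mapping each lowercased word form to its token count."""
    # Crude tokenizer (an assumption of this sketch): runs of letters,
    # optionally followed by an apostrophe and more letters.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

# Made-up sample text, not drawn from any real corpus.
sample = "The cat saw the dog, and the dog saw the cat's toy."
freqs = word_frequencies(sample)

# Print the three most frequent word forms with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

On a real corpus the same counts would normally be taken over tagged lemmas, so that inflected forms of one lexeme are pooled.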
Baayen’s Productivity Indices • In a series of publications from 1989 on, Harald Baayen developed a number of corpus-based indices intended to capture the intuitive notion of morphological productivity • Baayen’s indices are based on the idea that words that only occur once in a corpus, hapax legomena, are a window into morphological productivity • This idea makes no sense in the absence of a searchable corpus of reasonable size • The general method becomes less useful as the corpus grows in size
P = n1/N • The best known of Baayen’s indices is P, which measures the “growth rate” of the affix: the probability that an encounter with a word containing the affix is a new type. • In the equation, n1 represents the total number of hapaxes containing the affix, and N represents the total number of tokens containing the affix. • P fits linguists’ intuitions about productivity reasonably well in corpora of fewer than 100M words, except when both n1 and N are small (for unproductive affixes)
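As a rough illustration of the definition, P can be computed directly from a list of tokens. The sketch below is not Baayen's own procedure: the function name `productivity_p`, the `has_affix` predicate, and the toy token list are all assumptions made for the example. In particular, matching on a word ending is a naive stand-in for real morphological analysis (it would wrongly count words such as "witness" for -ness).

```python
from collections import Counter

def productivity_p(tokens, has_affix):
    """Compute P = n1 / N for one affix.

    tokens:    iterable of word tokens from a corpus
    has_affix: predicate deciding whether a token contains the affix
               (hypothetical stand-in for real morphological analysis)
    """
    counts = Counter(t for t in tokens if has_affix(t))
    big_n = sum(counts.values())                    # N: tokens containing the affix
    n1 = sum(1 for c in counts.values() if c == 1)  # n1: types occurring exactly once
    return n1 / big_n if big_n else 0.0             # define P as 0 when the affix is absent

# Toy token list, invented for illustration only.
toy = ["kindness", "kindness", "kindness",
       "darkness", "darkness", "fondness", "cat", "dog"]
p_ness = productivity_p(toy, lambda w: w.endswith("ness"))
print(p_ness)
```

Here six tokens contain the ending and one of them ("fondness") is a hapax, so the function returns 1/6.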
V and P* • V is the total number of lexeme types containing a given affix • Differences in V between affixes reflect the extent to which relevant base words have been used • Baayen plots P against V to obtain P*, the relative “global productivity” of affixes • This measure is problematic, as Baayen notes, because there is no principled way of scaling the axes
Hapax vs. Hapax • Baayen’s final measure is P *, the hapax-conditioned degree of productivity • P * = n1/h1, where h1 is the total number of hapaxes across all types in the corpus • Since h1 is the same for all affixes in a corpus, this measure simply counts the numbers of hapaxes for each affix identified in a corpus • The difference in P * yields intuitively satisfactory results for Baayen’s corpora • The greatest weakness of P * is that it cannot easily be compared across corpora
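The hapax-conditioned measure n1/h1 can be sketched the same way. Again this is an illustration under stated assumptions: the function name `hapax_conditioned`, the suffix matching by word ending, and the toy tokens are invented for the example and are not taken from Baayen's work.

```python
from collections import Counter

def hapax_conditioned(tokens, suffixes):
    """Return {suffix: n1 / h1} for each suffix.

    h1 is the number of hapaxes in the whole corpus; n1 is the number of
    those hapaxes that end in the given suffix (naive string matching,
    a hypothetical stand-in for morphological analysis).
    """
    counts = Counter(tokens)
    hapaxes = [w for w, c in counts.items() if c == 1]
    h1 = len(hapaxes)
    return {
        s: (sum(w.endswith(s) for w in hapaxes) / h1 if h1 else 0.0)
        for s in suffixes
    }

# Toy token list, invented for illustration only.
toy = ["kindness", "kindness", "fondness", "oddness",
       "purity", "purity", "sanity", "cat", "cat", "dog"]
scores = hapax_conditioned(toy, ["ness", "ity"])
print(scores)
```

Because h1 is shared by every suffix, ranking suffixes by this score is the same as ranking them by their raw hapax counts, which is the point made on the slide.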
Where hapaxes fail • Both P and P * measurements are dependent on the size (N) of the corpus • The number of hapaxes in a corpus is a decreasing function of N • The rate of increase in the number of hapaxes slows as the size of the corpus increases • Very large corpora show few if any hapaxes • There is no way to know what the “proper” size of a corpus is for hapax-based measures to be useful • It is not clear what the value of a measure of global productivity is
So far, so good • We gain insights into morphological productivity if we use quantitative tools • We cannot treat productivity as a discrete phenomenon if we want to learn about it • The methods and measures we use depend on the machinery that we have • The notion of an absolute measure of productivity that is valid across corpora is elusive and problematic
Escape from Hapax • The number of hapaxes decreases as the size of the corpus increases • With very large corpora hapaxes are not helpful • We can learn a great deal from very large corpora if we confine ourselves to the direct comparison of pairs of competing affixes • This method is not based on hapaxes • This line of research does not address the question of global productivity at all • Google Fight!
Using Google Search • We use Google Search Estimated Total Matches (ETM) as a measure of usage • PROBLEMS • Google is very noisy and must be used with great caution • ETM is not an actual count but an estimate based on a proprietary method • SOLUTIONS • Little weight is placed on raw numbers or on individual word pairs • Only large differences between affixes are taken into account
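The cautious use of ETM described here can be sketched as a tally over word pairs in which only large differences count. Everything specific in this sketch is an assumption: the function names, the ratio threshold of 10 (the slides give no number), and the counts, which are invented placeholders and not real ETM figures. Counts are supplied by the caller; no search API is queried.

```python
def compare_pair(a, b, min_ratio):
    """Classify one pair of counts as 'first', 'second', or 'undecided'.

    A side wins only if its count is at least min_ratio times the other's.
    min_ratio should be greater than 1.
    """
    hi, lo = max(a, b), min(a, b)
    # Treat a zero count as 1 so the ratio test stays defined.
    if hi == 0 or hi < min_ratio * max(lo, 1):
        return "undecided"
    return "first" if a > b else "second"

def tally_winners(pairs, labels=("-ic", "-ical"), min_ratio=10.0):
    """Count how many pairs each suffix wins by a large margin."""
    tally = {labels[0]: 0, labels[1]: 0, "undecided": 0}
    for a, b in pairs.values():
        outcome = compare_pair(a, b, min_ratio)
        key = {"first": labels[0], "second": labels[1]}.get(outcome, "undecided")
        tally[key] += 1
    return tally

# Placeholder counts for hypothetical stems; NOT real search estimates.
pairs = {"stem_a": (5000, 100), "stem_b": (120, 90000), "stem_c": (800, 600)}
tally = tally_winners(pairs)
print(tally)
```

The result is read only in aggregate: a clear majority of decisive wins for one suffix is informative, while any single pair or raw number is not.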
A test case: Comparing -ic and -ical • Sample ETM counts for high frequency doublets (Lindsay & Aronoff 2013)
Comparing -ic and -ical • Sample ETM counts for high frequency singletons (Lindsay & Aronoff 2013)
Usually -ic wins; sometimes -ical wins • -ical is productive in stems ending in -olog (from Lindsay and Aronoff 2013)
Why -olog? • -olog defines the largest set by far of stems with neighborhood length 4 preceding either of the two suffixes (475 members) • The -olog set contains 2/3 of all stems in -g • The -olog set is thus a very large morphologically defined subsystem with very few neighbors • The -olog set is uniquely suited to sustain -ical as a productive suffix, in spite of the clear dominance of -ic overall
Conclusion • The combination of rich computational resources and quantitative methods allows us to make progress in understanding questions that could not be profitably studied a quarter century ago • As the resources change, so do the questions, the methods, and the theories that they drive
THANK YOU • Special thanks to those who have joined in my personal struggle over the last 40 years to understand morphological productivity by counting: Morris Halle, Frank Anshen, Mark Lindsay • La lotta continua!