Jerid Francom (Wake Forest University) Adam Ussishkin (University of Arizona)

How specialized are specialized corpora? Behavioral evaluation of corpus representativeness for Maltese • Jerid Francom (Wake Forest University) • Adam Ussishkin (University of Arizona) • Amy LaCross (University of Arizona) • 19 May 2010: O7 (Evaluation of Methodologies), 14.45-15.05 • LREC 2010, Mediterranean Conference Center • Valletta, Malta

Acknowledgements • Generous contribution of data to this project by Dr. Albert Gatt (Univ. of Malta) • Statistical expertise from Jeff Berry (Univ. of Arizona) • Funding from the United States National Science Foundation (BCS-0715500) to Adam Ussishkin

Goals • Issue: For many languages, the quality of available textual data is less than ideal for corpus creation in light of standard sampling practices. • Propose: Behavioral data can provide a valuable metric to evaluate corpus resources otherwise considered ‘specialized’. • Case: PsyCoL Maltese Lexical Corpus • Contribute: Novel, cross-discipline metric for evaluating the quality of language resources

Sparse coverage • Most of the world’s 5,000-7,000 languages have no corpus resources • Efforts to fill the gap often exploit the availability of language data on the web • An Crúbadán project, 446 languages (Scannell, 2007) • McEnery et al. (2006) survey of recent work

Sparse coverage • Low-density languages (Borin, 2009): languages in which resources exist, but in limited quantity/quality • Limited access to print and/or electronic data • Available primary data may be less than representative • Weakens assurance that results from low-density language resources are credible

Corpus representativeness • What is a ‘representative corpus’? • An externally valid sample of language use • A sample that approximates what the language is. • Full range of structural types (language units) • What are the characteristics of such a sample? • Genre/register • Modality

An issue for low-density languages • Standard practice to achieve representativeness • Apply rigorous sampling methods • Collect large amounts of data • Problematic for low-density languages: a representativeness bottleneck • Lack large amounts of data • Available data is often limited in register, modality, etc. • Corpus resources are typically specialized

Assessing representativeness • How do we know whether we have a ‘representative’ sample? • We don’t, in an absolute sense. • Faith in survey sampling practices: casting the net far and wide • Can we be assured we don’t have a representative sample? • Not exactly. • It is logically possible that smaller, less diverse samples are externally valid for linguistic units that appear in the collection.

Proposal • Need for an external metric. • Current proposal suggests findings from behavioral experimentation can provide a valuable metric to evaluate corpus resources. • Exploit the correlation between derived frequency counts and elicited behavioral reactions • Behavioral data and adjusted frequency (Gries 2008; 2009) • Of particular importance for specialized corpora

Behavioral findings • Well-known robust effects for relative frequency in language processing • Word naming RTs (e.g., Forster & Chambers, 1973) • Lexical decision RTs (e.g., Carroll & White, 1973) • Sentence reading RTs (e.g., MacDonald, 1994) • Word familiarity ratings (e.g., Gernsbacher 1984) • Log frequency is a good predictor of behavior.

Approach • Evaluating corpus representativeness through behavioral assessment • Derive frequency counts from a specialized corpus • Elicit behavioral response of participants from target population • Assess correlation strength: how well do behavioral responses correlate with corpus measures?

Case study and predictions • Case study • Calculate: log frequency of subset of items in a Maltese lexical corpus • Measure: subjective word familiarity ratings of native speakers of Maltese • Assess: relative distribution of the measures • Prediction • Congruence between relative distributions indicates a representative sample of the language • Mismatches underscore potential sampling issues

The specialized corpus • PsyCoL Maltese Lexical Corpus (PMLC) (Francom, Ussishkin, and Woudstra, 2009) http://psycol.sbs.arizona.edu/resources/ • Online Maltese newspapers, 1998-1999 and 2005-2007; data from PsyCoL lab (59.8%) and Dr. Albert Gatt (40.2%) • 3,323,325 total tokens (53,000 unique); type/token ratio of 1.6% • Typical for low-density languages • A sizable corpus, but still relatively small (cf. British National Corpus 100+ million; Corpus of Contemporary American English 400+ million) • Limited in register, modality
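
The 1.6% figure on this slide corresponds to unique word forms divided by running tokens, that is, a type/token ratio. A minimal sketch of that arithmetic follows; the function name `type_token_ratio` is illustrative and not part of the PMLC tooling.

```python
def type_token_ratio(n_types: int, n_tokens: int) -> float:
    """Return distinct word forms divided by running words."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return n_types / n_tokens


# Counts as reported on the slide for the PMLC.
pmlc_ttr = type_token_ratio(53_000, 3_323_325)
print(f"{pmlc_ttr:.1%}")  # prints "1.6%"
```

Because type/token ratios fall as corpus size grows, the value is most informative when compared against corpora of similar size.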

Linguistic variable to quantify • Because there is little previous quantitative research on Maltese, the empirical focus of this investigation was narrowed to: • Semitic-origin verbs/binyanim (also known as forms) • Semitic-origin verbs in Maltese conform to the classical Semitic binyan system (categories based on morphosyntactic and phonological properties) • Question: How does frequency as measured in our corpus correlate with behavior? Can the binyan categories be exploited to provide correlations?

Maltese binyanim

A behavioral task: word familiarity • We devised three tests to measure corpus representativeness • Each test measured a different aspect of our corpus counts and our behavioral task. • The behavioral task involved native speakers of Maltese, who gave subjective word familiarity ratings for all Semitic-origin Maltese verbs taken from Aquilina (2000); n=1536. • Scale from very unfamiliar to very familiar • Familiarity has been shown to be a reliable predictor of lexical processing (Connine et al. 1990)

Word familiarity experiment • Participants • 107 native speakers of Maltese • Task • Subjective word familiarity task, online

Measuring frequency in the corpus • We then used the PMLC to calculate word frequency measures for the same set of verbs. • Using regular expression-enabled searching, we counted token frequency for all verbs occurring in the PMLC (n=447). • Frequency was then encoded as a log-based measure.
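
The counting step above can be sketched as follows. This is a minimal illustration, not the authors' script: the toy text, the patterns, and the choice of `log10(count + 1)` are all assumptions, since the slides state neither the search expressions nor the base or smoothing of the log measure.

```python
import math
import re
from collections import Counter


def count_tokens(text: str, patterns: dict[str, str]) -> Counter:
    """Count whole-word matches of each verb's regular expression in text."""
    counts = Counter()
    for verb, pattern in patterns.items():
        # \b anchors at word boundaries so substrings of longer words are skipped.
        regex = re.compile(rf"\b(?:{pattern})\b", flags=re.IGNORECASE)
        counts[verb] = len(regex.findall(text))
    return counts


def log_frequency(count: int) -> float:
    # Assumption: log10(count + 1), so that a zero count maps to 0.0.
    return math.log10(count + 1)


# Toy strings for illustration only; not drawn from the PMLC.
toy_text = "kiteb kisser kiteb kitbet kisser kiteb"
toy_patterns = {"kiteb": r"kiteb|kitbet", "kisser": r"kisser", "qasam": r"qasam"}

counts = count_tokens(toy_text, toy_patterns)
log_freqs = {verb: log_frequency(n) for verb, n in counts.items()}
```

Grouping inflected forms under one pattern per verb is one possible design; a real search would need patterns covering the full inflectional paradigm of each verb.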

Three tests • Next, we conducted three distinct statistical analyses to assess correlation between these corpus measures and the results of our word familiarity experiment • 1. Statistical regression between corpus log frequency and behavioral data. • 2. Binned groups by frequency to determine whether any correlation is found. • 3. Binned items by binyan to determine whether any correlation is found.

1. Statistical regression • We found a weak correlation (r=.14). At best this shows a trend toward correlation; it suggests that familiarity ratings likely do not predict word frequency.
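
For scale, r=.14 corresponds to r² of about .02, so roughly 2% of the variance in one measure is shared with the other. The sketch below shows how a Pearson correlation between log frequency and mean familiarity could be computed; the data values are invented for illustration and the slides do not specify the exact regression model used.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-verb values: corpus log frequency and mean familiarity rating.
toy_log_freq = [0.0, 0.5, 1.0, 1.5, 2.0]
toy_familiarity = [3.0, 5.0, 4.0, 6.0, 5.0]

r = pearson_r(toy_log_freq, toy_familiarity)
r_squared = r ** 2  # share of variance accounted for
```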

2. Binning by frequency • Binning into two bands shows a correlation: • Binning into three bands also shows a correlation:

2. Binning by frequency • An LMER analysis of each binning (2 groups and 3 groups) shows significance: • All contrasts for two-bin intervals (High/Low=4.2, t=2.0) and three-bin intervals (High/Mid=7.1, t=3.9; Mid/Low=7.0, t=2.2) were significant. • These results support the hypothesis that behavior and corpus measures are correlated.
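
The binning step can be sketched as below. This assumes equal-sized, rank-based bins, which the slides do not confirm, and it only computes per-bin mean ratings; it does not reproduce the LMER analysis. All data values are hypothetical.

```python
from statistics import mean


def bin_by_rank(items, n_bins):
    """Split (log_freq, rating) pairs into n_bins groups, lowest frequency first.

    Assumption: bins hold equal numbers of items, with any remainder
    going to the lowest-frequency bins.
    """
    ordered = sorted(items, key=lambda pair: pair[0])
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


# Hypothetical (log frequency, familiarity rating) pairs.
toy_items = [(0.0, 3.0), (0.3, 4.0), (0.7, 4.0), (1.0, 5.0), (1.5, 5.0), (2.0, 6.0)]

two_bin_means = [mean(r for _, r in b) for b in bin_by_rank(toy_items, 2)]
three_bin_means = [mean(r for _, r in b) for b in bin_by_rank(toy_items, 3)]
```

Binning trades item-level resolution for stability, which is one reason contrasts between bins can reach significance where an item-level correlation is weak.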

3. Binning by binyan • Earlier and ongoing work (Frost et al. 1997, 1998, 2000; Ussishkin et al. in progress) shows binyan effects in Hebrew in both visual and auditory modalities, so Maltese could be expected to show similar effects. • Our goal here is to measure whether verbs, when grouped by binyan, show a correlation between word frequency measures and word familiarity ratings.

3. Binning by binyan • Only binyanim 1, 2, 5, 7 were analyzed; binyanim 3, 6, 8, 9, and 10 were not included in the analyses because they are so sparsely populated:

3. Binning by binyan • Word frequency results: significant contrasts found between Binyanim 7 and 2 (β=.54, t=6.0); and between Binyanim 7 and 5 (β=1.15, t=-2.2). • Word familiarity results: no significant contrasts found. • [Figures: binyan by word frequency; binyan by word familiarity]

General assessment • The results show that verb frequency distributions in the PMLC pattern to some degree with the psychological representations of native speakers (the representative population) • On the surface, this suggests the PMLC is on the right track, but it also underscores the specialized nature of the corpus • However, a response bias in the word familiarity task may play a part in the mismatches • A ceiling effect may have contributed to lower correlation scores

General assessment • Reasons to be optimistic about the verb distributions in the PMLC: • Distribution of verb count/ frequency (Zipf, 1949) • Distribution of word length/ frequency (Li, 1992) • Both measures trend as expected for representative samples
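
One common way to check for a Zipf-like trend is to fit a line to log frequency against log rank and inspect the slope, with values near -1 being the classic signature. The sketch below uses ordinary least squares on invented counts; it is an assumption about how such a check might be run, not the procedure behind this slide.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) on log(rank); zero counts are dropped."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two non-zero frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Hypothetical counts following frequency = 1200 / rank exactly.
toy_counts = [1200, 600, 400, 300, 240]
slope = zipf_slope(toy_counts)
```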

Conclusion • Novel methodology: direct comparison between corpus resource and behavior. • Highlighting a robust effect from psycholinguistics (frequency of linguistic units predicts behavior). • We predicted that the reverse could also hold (behavior can be used to assess frequency counts); this provides a way to validate low-density language (LDL) resources. • This approach encourages cross-discipline endeavors for resource development and theoretical investigation.

Thank you very much! • Grazzi ħafna!
