Using the WWW to resolve PP attachment ambiguities in Dutch

Vincent Vandeghinste, Centrum voor Computerlinguïstiek, K.U.Leuven, Belgium

Introduction • Finding the correct attachment site for prepositional phrases (PPs) is a recurring problem in parsing natural language • Volk (2000; 2001) presented an approach for German that uses cooccurrence frequencies from the WWW

Introduction (2) • We present a replication of Volk's approach, applied to Dutch • We present a number of changes made to the initial formula and their effect on the results

Cooccurrence values • The cooccurrence strength between nouns and prepositions (N+P) is measured • The cooccurrence strength between verbs and prepositions (V+P) is measured • The competing N+P and V+P values decide whether the PP is attached to the noun or to the verb

Experiment 1 • Method • AltaVista search engine • noun NEAR preposition vs. verb NEAR preposition • restricted to Dutch documents • lemmata are used for lookup • minimal cooccurrence threshold

Experiment 1 • Evaluation • 500 PPs were selected that immediately follow a noun or a pronoun functioning as a noun • Each PP was manually annotated as attached to either the verb or the noun

Experiment 1 • Algorithm • if cooc(N+P) and cooc(V+P) are available, the higher value decides • if one is not available (2% of test cases), the other value is compared to a threshold • if both are unavailable, no decision can be made
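
The decision rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the default threshold value, the tie-break in favour of the noun, and the choice to return no decision when the single available value falls below the threshold are all assumptions.

```python
from typing import Optional


def decide_attachment(
    cooc_np: Optional[float],
    cooc_vp: Optional[float],
    threshold: float = 0.5,  # hypothetical value; the slides do not state one
) -> Optional[str]:
    """Return "N", "V", or None when no decision can be made."""
    # Both values available: the higher value decides (ties go to N, an assumption).
    if cooc_np is not None and cooc_vp is not None:
        return "N" if cooc_np >= cooc_vp else "V"
    # Only one value available: compare it to the threshold.
    # Returning None below the threshold is an assumption.
    if cooc_np is not None:
        return "N" if cooc_np >= threshold else None
    if cooc_vp is not None:
        return "V" if cooc_vp >= threshold else None
    # Neither value available: no decision.
    return None
```

Here `None` stands for an unavailable cooccurrence value on input and for an undecided case on output.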

Experiment 1 • Results • at 100% coverage: 58.4% correct attachments • maximum accuracy 59% at 98% coverage • Conclusion • better than pure guessing (50%) • much lower than Volk's results for German • lower than always defaulting to noun attachment (68%)

Experiment 2 • Method • Full forms instead of lemmata • Results • we compare at a fixed rate of 75% correct attachments • with the threshold set to reach 75% correct attachments: coverage = 21.6% • Conclusion: results are much better than with lemmata, but still low
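
The results from here on trade coverage against accuracy. The sketch below shows how the two figures are commonly computed, assuming coverage is the share of test cases that receive a decision and accuracy is the share of correct decisions among the decided cases. The slides do not spell out these definitions, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def coverage_and_accuracy(
    predictions: List[Optional[str]], gold: List[str]
) -> Tuple[float, float]:
    """Compute (coverage, accuracy) for a list of attachment decisions.

    A prediction of None means the algorithm made no decision.
    """
    # Keep only the cases where a decision was made.
    decided = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(decided) / len(gold) if gold else 0.0
    # Accuracy is measured over decided cases only (an assumption).
    correct = sum(1 for p, g in decided if p == g)
    accuracy = correct / len(decided) if decided else 0.0
    return coverage, accuracy
```

Raising a decision threshold typically removes uncertain cases, so accuracy goes up while coverage goes down. Fixing accuracy at 75% and reporting coverage makes the experiments comparable.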

Experiment 3 • Method • Full forms • Minimal distance threshold • Results • 75% correct attachment: coverage = 27% • Conclusion: still much lower than Volk (58%), but improving
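
One way to read the minimal distance threshold is as a required gap between the two competing cooccurrence values. The sketch below assumes the distance is the absolute difference between cooc(N+P) and cooc(V+P); the slides do not define it, so this definition and the function name are assumptions.

```python
from typing import Optional


def decide_with_min_distance(
    cooc_np: float, cooc_vp: float, min_distance: float
) -> Optional[str]:
    """Decide only when the two values are at least min_distance apart."""
    # Assumed distance measure: absolute difference of the two values.
    if abs(cooc_np - cooc_vp) < min_distance:
        return None  # values too close: leave the case undecided
    return "N" if cooc_np > cooc_vp else "V"
```

Under this reading, a larger `min_distance` leaves more cases undecided, lowering coverage.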

Experiment 4 • Method • We include the head noun of the PP (N2) in the queries • cooc(X,P,N2)=freq(X,P,N2)/freq(X) • without thresholds • defaulting to N-attachment if the cooccurrence values are unavailable • Results • General accuracy = 68% with coverage = 100% • Conclusion: results are only as accurate as defaulting to N-attachment
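
The triple formula and the default rule can be illustrated as follows. This is a sketch under stated assumptions: a cooccurrence value is treated as unavailable when either frequency is zero, ties go to the noun, and all function and parameter names are hypothetical.

```python
from typing import Optional


def cooc_triple(freq_x_p_n2: int, freq_x: int) -> Optional[float]:
    """cooc(X,P,N2) = freq(X,P,N2) / freq(X), or None if unavailable."""
    # Treating zero counts as "unavailable" is an assumption.
    if freq_x == 0 or freq_x_p_n2 == 0:
        return None
    return freq_x_p_n2 / freq_x


def decide_triple(
    freq_n1_p_n2: int, freq_n1: int, freq_v_p_n2: int, freq_v: int
) -> str:
    """Compare the N1 and V triple values; default to N-attachment."""
    cooc_n = cooc_triple(freq_n1_p_n2, freq_n1)
    cooc_v = cooc_triple(freq_v_p_n2, freq_v)
    if cooc_n is None or cooc_v is None:
        return "N"  # default when a value is unavailable
    return "N" if cooc_n >= cooc_v else "V"
```

Because every case gets a decision, coverage is 100% by construction.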

Experiment 5 • Method • a minimal cooc threshold is applied when the triple cooc is unavailable for one of the two candidates • when both are unavailable: no decision • Results • no threshold setting reaches an accuracy of 75%

Experiment 6 • Method • full forms + lemmata • Results: • maximum accuracy is 68.77% • Conclusions: • Under the conditions just described, Volk reports 63% coverage at 75% accuracy • We reach only 27% coverage at the same accuracy

Experiment 7 • Method • doubles and triples are combined into one algorithm • minimal distance with 2 different thresholds • when the minimal distance for triples is below its threshold, the minimal distance of doubles is used • Results: • coverage of 48.8% with an accuracy of 75% • coverage of 50% with an accuracy of 74.4%
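
The back-off from triples to doubles might look like the sketch below. It assumes the distance is the absolute difference between the noun and verb cooccurrence values and that a case stays undecided when neither level clears its threshold; both points, and all names, are assumptions not confirmed by the slides.

```python
from typing import Optional, Tuple

# A pair holds (cooc for the noun candidate, cooc for the verb candidate).
CoocPair = Optional[Tuple[float, float]]


def decide_combined(
    triple: CoocPair,
    double: CoocPair,
    triple_min_distance: float,
    double_min_distance: float,
) -> Optional[str]:
    """Try the triple values first, then back off to the double values."""
    for values, min_distance in (
        (triple, triple_min_distance),
        (double, double_min_distance),
    ):
        if values is None:
            continue  # this level is unavailable; try the next one
        cooc_n, cooc_v = values
        # Assumed distance measure: absolute difference of the two values.
        if abs(cooc_n - cooc_v) >= min_distance:
            return "N" if cooc_n > cooc_v else "V"
    return None  # neither level is decisive
```

The two thresholds can be tuned separately, which is what allows the combined algorithm to recover cases that the triples alone leave undecided.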

Experiment 8 • Method • accuracy with preprocessed triples • test cases where N1 is not a real noun are removed from the test set (492 cases remaining) • unlexicalized compounds are reduced to their heads: krijtstreepjeskostuum => kostuum (pinstripe suit => suit) • Results • coverage of 60.4% with an accuracy of 75% • coverage of 50% with an accuracy of 76.8%
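
The slides do not say how compounds were reduced to their heads. One simple possibility, shown purely as an illustration, is to strip material from the left until the remainder is a known noun, since Dutch compounds are right-headed. The lexicon, the minimum head length, and the function name are all hypothetical.

```python
from typing import Set


def reduce_compound(noun: str, lexicon: Set[str], min_head_len: int = 3) -> str:
    """Reduce an unlexicalized compound to its longest known right-hand part."""
    # Lexicalized words are left untouched.
    if noun in lexicon:
        return noun
    # Try ever shorter suffixes, so the longest known suffix wins.
    for start in range(1, len(noun) - min_head_len + 1):
        candidate = noun[start:]
        if candidate in lexicon:
            return candidate
    # No known head found: keep the original form.
    return noun


print(reduce_compound("krijtstreepjeskostuum", {"krijt", "kostuum"}))
```

A suffix match of this kind can pick a wrong head when a short word happens to end a longer unrelated one, so a real system would need further checks.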

Experiment 8 • Results: • combining the two minimal-distance algorithms (for doubles and triples) gives a large rise in coverage at the same accuracy • preprocessing the nouns and leaving out pronouns gives a second large rise in coverage at the same accuracy • after defaulting the remaining cases to N-attachment, we end up with an accuracy of 70.33%

General Conclusions • using the WWW helps to get a more accurate estimate of PP-attachment • difference between our results and the German results: more cases are decidable for German, since the number of WWW documents is higher for German • Querying cooccurrence frequencies with WWW search engines using the NEAR operator allows only very rough queries

Future improvements • Using cooccurrence frequencies from a controlled corpus might improve results: • more exact queries are possible than with AltaVista • less noise in the corpus

References • Volk, M. (2000). Scaling up. Using the WWW to resolve PP-attachment ambiguities. In Proceedings of Konvens, Ilmenau. • Volk, M. (2001). Exploiting the WWW as a corpus to resolve PP-attachment ambiguities. In Proceedings of Corpus Linguistics, Lancaster.
