Using the WWW to resolve PP attachment ambiguities in Dutch Vincent Vandeghinste Centrum voor Computerlinguïstiek K.U.Leuven, Belgium
Introduction • Finding the correct attachment site for prepositional phrases (PPs) is one of the problems in parsing natural languages • Volk (2000; 2001) presented an approach for German that uses cooccurrence frequencies from the WWW
Introduction (2) • We present a replication of Volk's approach, applied to Dutch • We present a number of changes made to the initial formula and their effect on the results
Cooccurrence values • The cooccurrence strength between nouns and prepositions (N+P) is measured • The cooccurrence strength between verbs and prepositions (V+P) is measured as well • The competing values of N+P vs. V+P are used to decide whether to attach the PP to the noun or to the verb
Experiment 1 • Method • Altavista search engine • noun NEAR preposition vs. verb NEAR preposition • restricted to Dutch documents • lemmata are used for lookup • minimal cooccurrence threshold
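The lookup described on this slide can be sketched as follows. This is only an illustration under stated assumptions: the slides do not give the formula for the N+P and V+P values, so the sketch assumes cooc(X,P) = freq(X NEAR P) / freq(X), by analogy with the triple formula given for Experiment 4. The function names and the `hit_count` callable (standing in for a search engine hit count restricted to Dutch documents) are hypothetical.

```python
from typing import Callable, Optional


def near_query(word: str, preposition: str) -> str:
    """Build a 'word NEAR preposition' query string."""
    return f"{word} NEAR {preposition}"


def double_cooc(word: str,
                preposition: str,
                hit_count: Callable[[str], int]) -> Optional[float]:
    """Assumed cooccurrence value: freq(word NEAR prep) / freq(word).

    hit_count is a hypothetical stand-in for the number of documents
    the search engine reports for a query.
    Returns None when the word itself is never found.
    """
    freq_word = hit_count(word)
    if freq_word == 0:
        return None
    return hit_count(near_query(word, preposition)) / freq_word
```

Passing the hit-count lookup in as a function keeps the sketch independent of any particular search engine interface.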
Experiment 1 • Evaluation • 500 PPs were selected that immediately follow a noun or a pronoun functioning as a noun • Each PP was manually annotated as attached either to the verb or to the noun
Experiment 1 • Algorithm • if cooc(N+P) and cooc(V+P) are available, the higher value decides • if one is not available (2% of test cases), the other value is compared to a threshold • if both are unavailable, no decision can be made
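The decision procedure above can be sketched in a few lines. Two points are assumptions, since the slide does not specify them: ties go to noun attachment, and a single available value that falls below the threshold leaves the case undecided. The function name and the `'N'`/`'V'` labels are illustrative.

```python
from typing import Optional


def decide_attachment(cooc_noun: Optional[float],
                      cooc_verb: Optional[float],
                      threshold: float) -> Optional[str]:
    """Return 'N', 'V', or None when no decision can be made."""
    if cooc_noun is not None and cooc_verb is not None:
        # Both values available: the higher value decides
        # (assumption: ties go to the noun).
        return "N" if cooc_noun >= cooc_verb else "V"
    if cooc_noun is not None:
        # Only the noun value is available: compare it to the threshold
        # (assumption: below the threshold, the case stays undecided).
        return "N" if cooc_noun >= threshold else None
    if cooc_verb is not None:
        return "V" if cooc_verb >= threshold else None
    # Both values unavailable: no decision.
    return None
```

Under these assumptions a threshold of 0 decides every case for which at least one value exists, and raising the threshold can only remove cases in which exactly one value is available.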
Experiment 1 • Results • 100% coverage: 58.4% correct attachment • max. accuracy 59%, at a coverage of 98% • Conclusion • better than pure guessing (50%) • much lower than Volk's results for German • lower than always defaulting to noun attachment (68%)
Experiment 2 • Method • Full forms, not lemmata • Results • results are compared at a fixed rate of 75% correct attachments • if the threshold is set to reach 75% correct attachments: coverage = 21.6% • Conclusion: results are much better than with lemmata, but still low
Experiment 3 • Method • Full forms • Minimal distance threshold • Results • 75% correct attachment: coverage=27% • Conclusion: Still a lot lower than Volk (58%), but improving
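A minimal sketch of how a minimal distance threshold trades coverage for accuracy, and how one might search for the threshold that reaches a target accuracy. It assumes that accuracy is measured over the decided cases only and that the distance is the absolute difference between the two cooccurrence values; the slides do not spell out either point. All names are illustrative.

```python
from typing import Iterable, List, Optional, Tuple

# (cooc_noun, cooc_verb, gold label 'N' or 'V')
Case = Tuple[float, float, str]


def decide_min_distance(cooc_n: float, cooc_v: float,
                        min_distance: float) -> Optional[str]:
    """Decide only when the two values are at least min_distance apart."""
    if abs(cooc_n - cooc_v) < min_distance:
        return None
    return "N" if cooc_n >= cooc_v else "V"


def evaluate(cases: List[Case], min_distance: float) -> Tuple[float, float]:
    """Return (coverage, accuracy over decided cases)."""
    decided = correct = 0
    for cooc_n, cooc_v, gold in cases:
        decision = decide_min_distance(cooc_n, cooc_v, min_distance)
        if decision is None:
            continue
        decided += 1
        correct += decision == gold
    coverage = decided / len(cases) if cases else 0.0
    accuracy = correct / decided if decided else 0.0
    return coverage, accuracy


def best_coverage_at(cases: List[Case], target_accuracy: float,
                     candidates: Iterable[float]):
    """Among candidate thresholds, pick the one with the highest coverage
    whose accuracy reaches the target; None if no candidate does."""
    best = None
    for threshold in candidates:
        coverage, accuracy = evaluate(cases, threshold)
        if accuracy >= target_accuracy and (best is None or coverage > best[1]):
            best = (threshold, coverage, accuracy)
    return best
```

With annotated test cases, `best_coverage_at(cases, 0.75, candidates)` would give the kind of "coverage at 75% correct attachment" figure reported on these slides.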
Experiment 4 • Method • We include the head noun of the PP (N2) in the queries • cooc(X,P,N2) = freq(X,P,N2) / freq(X), where X is the noun or the verb • without thresholds • defaulting to N-attachment if the cooc values don't exist • Results • General accuracy = 68% with coverage = 100% • Conclusion: results are only as accurate as defaulting to N-attachment
Experiment 5 • Method • a minimal cooc threshold is applied when the triple cooc is unavailable for one of the two attachment sites • when both are unavailable: no decision • Results • setting the threshold to reach an accuracy of 75% is impossible
Experiment 6 • Method • full forms + lemmata • Results • maximum accuracy is 68.77% • Conclusions • Under the conditions just described, Volk reports a coverage of 63% at an accuracy of 75% • We get only 27% coverage at the same accuracy
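The triple formula and the default rule can be illustrated as below. The sketch assumes that a cooccurrence value "does not exist" when either frequency is zero, and that the default applies as soon as one of the two values is missing; the slide leaves both details open. Names are illustrative.

```python
from typing import Optional


def triple_cooc(freq_triple: int, freq_head: int) -> Optional[float]:
    """cooc(X,P,N2) = freq(X,P,N2) / freq(X).

    Assumption: a zero frequency means the value is unavailable.
    """
    if freq_head == 0 or freq_triple == 0:
        return None
    return freq_triple / freq_head


def attach_with_triples(cooc_noun: Optional[float],
                        cooc_verb: Optional[float]) -> str:
    """Higher triple value decides; default to noun attachment otherwise."""
    if cooc_noun is None or cooc_verb is None:
        # Assumption: default to 'N' when either value is missing.
        return "N"
    return "N" if cooc_noun >= cooc_verb else "V"
```

Because every case receives a decision, coverage is 100% by construction, which matches the setup reported for this experiment.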
Experiment 7 • Method • combining doubles and triples into one algorithm • minimal distance with 2 different thresholds (one for triples, one for doubles) • when the minimal distance for triples is below its threshold, the minimal distance of the doubles is used instead • Results • coverage of 48.8% with an accuracy of 75% • coverage of 50% with an accuracy of 74.4%
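One possible reading of the combined algorithm, as a hedged sketch: triples decide when their two values are far enough apart, otherwise the doubles are consulted with their own threshold, and the case stays undecided if neither distance is large enough. Treating missing values as "not far enough apart" is an assumption, as are all names.

```python
from typing import Optional


def decide_by_distance(cooc_n: Optional[float],
                       cooc_v: Optional[float],
                       min_distance: float) -> Optional[str]:
    """Decide only if both values exist and differ by at least min_distance."""
    if cooc_n is None or cooc_v is None:
        return None  # assumption: missing values give no decision
    if abs(cooc_n - cooc_v) < min_distance:
        return None
    return "N" if cooc_n >= cooc_v else "V"


def decide_combined(triple_n: Optional[float], triple_v: Optional[float],
                    double_n: Optional[float], double_v: Optional[float],
                    triple_threshold: float,
                    double_threshold: float) -> Optional[str]:
    """Try the triples first; fall back to the doubles."""
    decision = decide_by_distance(triple_n, triple_v, triple_threshold)
    if decision is not None:
        return decision
    return decide_by_distance(double_n, double_v, double_threshold)
```

The two thresholds can then be tuned jointly to maximise coverage at the target accuracy.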
Experiment 8 • Method • accuracy with preprocessed triples • test cases where N1 is not a real noun are removed from the test set (492 cases remaining) • unlexicalized compounds are reduced to the head of the compound, e.g. krijtstreepjeskostuum (pinstripe suit) => kostuum (suit) • Results • coverage of 60.4% with an accuracy of 75% • coverage of 50% with an accuracy of 76.8%
Experiment 8 • Results • combining the two minimal-distance algorithms (for doubles and triples) gives a big rise in coverage at the same accuracy • preprocessing the nouns and leaving out pronouns gives a second big rise in coverage at the same accuracy • after defaulting the remaining cases to N-attachment we end up with an accuracy of 70.33%
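The compound reduction step could look roughly like this. The slides do not say how heads were identified, so the sketch assumes a simple heuristic: Dutch compounds are right-headed, so a noun missing from the lexicon is replaced by its longest proper suffix that is in the lexicon. The function name, the `lexicon` set and the minimum head length are hypothetical.

```python
from typing import Set


def compound_head(noun: str, lexicon: Set[str], min_head_length: int = 3) -> str:
    """Reduce an unlexicalized compound to its (assumed) head.

    Returns the noun unchanged if it is in the lexicon or if no
    suffix of at least min_head_length characters is found there.
    """
    if noun in lexicon:
        return noun
    # Scan from the longest proper suffix to the shortest allowed one.
    for start in range(1, len(noun) - min_head_length + 1):
        suffix = noun[start:]
        if suffix in lexicon:
            return suffix
    return noun
```

For the example on the slide, a lexicon containing "kostuum" would reduce "krijtstreepjeskostuum" to "kostuum".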
General Conclusions • using the WWW helps to get a more accurate estimate of PP-attachment • difference between our results and the German results: the number of decidable cases is higher for German, since the number of WWW documents is higher for German • Querying cooccurrence frequencies with WWW search engines using the NEAR operator allows only very rough queries
Future improvements • Using cooccurrence frequencies from a controlled corpus might improve results: • more exact queries are possible than with AltaVista • less noise in the corpus
References • Volk, M. (2000). Scaling up. Using the WWW to resolve PP-attachment ambiguities. In Proceedings of Konvens, Ilmenau. • Volk, M. (2001). Exploiting the WWW as a corpus to resolve PP-attachment ambiguities. In Proceedings of Corpus Linguistics, Lancaster.