STATISTICAL METHODS IN COMPUTATIONAL LINGUISTICS
Massimo Poesio, Università di Venezia
Course objectives
• An introduction to the use of corpora and to statistical methods
Course plan
• Foundations of statistics, use of corpora
• Basic tasks & techniques: word prediction, n-grams, smoothing, spelling, Bayesian inference
• POS tagging: tagsets, Brill tagger, HMM tagging
• System evaluation
• The lexicon
• Probabilistic grammars, statistical parsing
Today
• Statistics and Linguistics (Abney, 1996)
• Foundations of probability
• Corpora
Practical details
• Schedule: 10:30-13:00, 14:30-17:00
• Labs: 17:00 to 18:00 (not today)
• Office hours: 9:30-10:30, 18:00-19:00
• Email: poesio@essex.ac.uk
• Web page (temporary): csstaff.essex.ac.uk/staff/poesio/Courses/Venezia/Stat_NLP/
Empiricism vs. Rationalism
• Chomskyan linguistics:
  • Assumption: linguistic knowledge is mostly innate
  • Emphasis on explanation
  • Primary goal: simplicity of the theory
• Empirical methods:
  • Assumption: linguistic knowledge derives primarily from generalizations over experience
  • Emphasis on data
  • Primary goal: fact discovery
• Computational Linguistics between 1960 and 1980 was mostly Chomskyan
Problems statistical methods are meant to address
• Ambiguity resolution: the previous options were
  • Narrow domains to avoid ambiguity
  • Hand-coded rules
  • Hand-tuned preference weights
• Adaptation to new domains
• Measuring improvement
Case study: POS tagging
“Time flies like an arrow”
Time    flies    like      an     arrow
N/V     N/V      V/N/CJ    Det    N
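The ambiguity in this case study can be made concrete by enumerating every tag sequence the candidate tags allow. The sketch below is only an illustration: the tag lists are copied from the slide, while the names `candidates` and `tag_sequences` are hypothetical helpers introduced here.

```python
from itertools import product

# Candidate tags per word, copied from the slide.
candidates = [
    ("Time", ["N", "V"]),
    ("flies", ["N", "V"]),
    ("like", ["V", "N", "CJ"]),
    ("an", ["Det"]),
    ("arrow", ["N"]),
]

def tag_sequences(candidates):
    """Return every tag sequence allowed by the per-word candidate tags."""
    tag_lists = [tags for _, tags in candidates]
    return [list(seq) for seq in product(*tag_lists)]

sequences = tag_sequences(candidates)
print(len(sequences))  # 2 * 2 * 3 * 1 * 1 = 12
for seq in sequences[:3]:
    print(seq)
```

Even a five-word sentence yields 12 candidate analyses, which is why a tagger needs some way to rank them rather than list them.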
The rise of statistical methods
• The first area in which statistical techniques truly proved their worth was Automatic Speech Recognition (ASR)
• ASR techniques were then used for POS tagging, and then in all areas of CL
• A synthesis of statistical methods and linguistic insights is now underway
Modern empiricism in Computational Linguistics
• Large data collections
• Rigorous collection techniques (interannotator agreement)
• Rigorous evaluation techniques
• Discovery of generalizations via learning techniques
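Interannotator agreement is usually reported as a chance-corrected score. The slide does not name a specific measure, so the sketch below assumes Cohen's kappa for two annotators; the function name `cohen_kappa` and the toy label lists are invented for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always used the same single label
    return (observed - expected) / (1 - expected)

# Toy data (hypothetical): POS tags assigned to eight tokens.
annotator_1 = ["N", "V", "N", "N", "V", "N", "V", "N"]
annotator_2 = ["N", "V", "N", "V", "V", "N", "N", "N"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 6 of 8 tokens (0.75), but because both use "N" most of the time, a fair amount of that agreement is expected by chance, and kappa is correspondingly lower.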
Statistics & the study of language?
• Theoretical advances
  • Language acquisition: the role of experience
  • Linguistic theory: graded grammaticality
  • Language change: shifts in grammaticality
• Empirical
  • Quantify linguistic phenomena
  • Analyze data
  • Test hypotheses
• Psychological
  • Express preferences
Some interesting statistics about language
• Lexical biases
  • Category: “bank” = Noun 85%, Verb 15%
  • Sense: Bank(river) 22%, Bank(money) 78%
• Syntax
  • Subcategorization of “realised”: NP 20%, S 65%, Other 15%
• Semantics / discourse
  • “he” in subject position 65% of the time
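One way to see why such biases matter is to use them as a baseline disambiguator that always picks the most frequent option. The sketch below uses the percentages from the slide; the table layout and the helper `most_likely` are assumptions made for this illustration.

```python
# Distributions for "bank", copied from the slide.
category_probs = {"bank": {"Noun": 0.85, "Verb": 0.15}}
sense_probs = {"bank": {"river": 0.22, "money": 0.78}}

def most_likely(word, table):
    """Return the highest-probability label for `word` and its probability."""
    dist = table[word]
    best = max(dist, key=dist.get)
    return best, dist[best]

print(most_likely("bank", category_probs))
print(most_likely("bank", sense_probs))
```

If the percentages hold for new text, always choosing "Noun" for "bank" would be right about 85% of the time without looking at context at all, which gives a floor that any context-sensitive method should beat.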
Corpora
• The use of statistical techniques has been made possible by the availability of CORPORA: large collections of text, typically ANNOTATED with linguistic information
• English:
  • The Brown corpus (1M words) and the British National Corpus (100M words), annotated with POS tags
  • Penn Treebank (4M words), syntactically annotated
  • SEMCOR (250K words), annotated with word sense information
  • The MapTask, annotated with dialogue information
• Italian:
  • CORIS (100M+ words, Bologna)
  • Si-TAL (220K words, written, annotated with syntactic and word sense information)
  • IPAR (‘MapTask Italiano’)
Basic uses of corpora: Collocations
• COMPOUNDS: “computer program”, “disk drive”, “calcio di rigore” (Italian: “penalty kick”)
• PHRASAL VERBS: “wake up”, “come on”
• PHRASAL EXPRESSIONS: “bacon and eggs”, “the bees’ knees”, “siamo alla frutta” (Italian idiom, roughly “we have hit rock bottom”)
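A first step toward finding collocations in a corpus is simply counting adjacent word pairs. The sketch below does only that, on an invented toy sentence; `bigram_counts` is a hypothetical helper, and raw frequency is a deliberately crude criterion assumed here for brevity.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count every pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Toy text (invented for illustration).
text = ("the disk drive failed so the computer program stopped "
        "and the disk drive was replaced")
tokens = text.split()
counts = bigram_counts(tokens)

# Print the pairs that occur more than once.
for pair, freq in counts.items():
    if freq > 1:
        print(pair, freq)
```

Raw counts also promote pairs such as ("the", "disk") that are frequent only because "the" is frequent, so in practice counts are usually combined with a filter or an association measure.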
Statistical Language Processing
• Statistical inference:
  • Collect statistics about occurrences of X
  • Predict new occurrences
• Example: language modeling
  • Problem: predict the word that follows, given the previous ones
  • Find Wn that maximizes P(Wn | W1 ... Wn-1)
• Applications:
  • Speech recognition
  • Spell-checking
  • POS tagging
  • …
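The language modeling example can be sketched with the simplest possible approximation: condition only on the immediately preceding word (a bigram model) and estimate probabilities by relative frequency. This is an assumption made for brevity, not the full P(Wn | W1 ... Wn-1); the toy corpus and the function names are invented for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Collect counts of each word following each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()  # <s> marks the sentence start
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def predict_next(counts, prev):
    """Return (word, P(word | prev)) for the most likely next word, or None."""
    following = counts.get(prev)
    if not following:
        return None  # prev was never seen with a following word
    total = sum(following.values())
    word, freq = following.most_common(1)[0]
    return word, freq / total

# Toy corpus (invented for illustration).
toy_corpus = [
    "time flies like an arrow",
    "fruit flies like a banana",
    "time flies fast",
]
model = train_bigram_model(toy_corpus)
print(predict_next(model, "flies"))
```

In this corpus "flies" is followed by "like" twice and "fast" once, so the model predicts "like" with estimated probability 2/3. A word never seen as context returns None, which is the gap that smoothing is meant to fill.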
Bibliography
• Steven Abney, Statistical Methods and Linguistics, in Judith Klavans and Philip Resnik (eds.), The Balancing Act, The MIT Press, Cambridge, Mass., 1996.
• Textbooks:
  • Daniel Jurafsky and James Martin, Speech and Language Processing, Prentice-Hall
    • More general, and easier to follow
  • Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press
    • More complete, and written from a more linguistic perspective, but technically more advanced