STATISTICAL METHODS IN COMPUTATIONAL LINGUISTICS

Massimo Poesio, Università di Venezia

Course objectives • An introduction to the use of corpora and to statistical methods

Course plan • Foundations of statistics, use of corpora • Basic tasks & techniques: word prediction, n-grams, smoothing, spelling, Bayesian inference • POS tagging: tagsets, Brill tagger, HMM tagging • System evaluation • The lexicon • Probabilistic grammars, statistical parsing

Today • Statistics and Linguistics (Abney, 1996) • Foundations of probability • Corpora

Practical details • Schedule: 10:30-13, 14:30-17 • Labs: from 17 to 18 (not today) • Office hours: 9:30-10:30, 18-19 • Email: poesio@essex.ac.uk • Web page (temporary): csstaff.essex.ac.uk/staff/poesio/Courses/Venezia/Stat_NLP/

Empiricism vs. Rationalism • Chomskyan linguistics: • Assumption: linguistic knowledge mostly innate • Emphasis on explanation • Primary goal: simplicity of the theory • Empirical methods • Assumption: linguistic knowledge primarily derives from generalizations over experience • Emphasis on data • Primary goal: fact discovery • Computational Linguistics between 1960 & 1980 mostly Chomskyan

Problems statistical methods are meant to address • Ambiguity resolution: previous choices were • Narrow domains to avoid ambiguity • Hand-coded rules • Hand-tuned preference weights • Adaptation to new domains • Measuring improvement

Case study: POS tagging • “Time flies like an arrow” • Time: N/V • flies: N/V • like: V/N/CJ • an: Det • arrow: N
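To make the size of the ambiguity concrete, here is a minimal Python sketch that enumerates every tag sequence allowed by the candidate tags on the slide. The names `CANDIDATE_TAGS` and `tag_sequences` are illustrative, not part of the course material.

```python
from itertools import product

# Candidate tags for each word, as listed on the slide.
CANDIDATE_TAGS = {
    "Time": ["N", "V"],
    "flies": ["N", "V"],
    "like": ["V", "N", "CJ"],
    "an": ["Det"],
    "arrow": ["N"],
}


def tag_sequences(words, lexicon):
    """Return every tag sequence obtained by picking one candidate tag per word."""
    return list(product(*(lexicon[word] for word in words)))


sentence = ["Time", "flies", "like", "an", "arrow"]
sequences = tag_sequences(sentence, CANDIDATE_TAGS)

# 2 * 2 * 3 * 1 * 1 = 12 candidate analyses for a five-word sentence.
print(len(sequences))
for seq in sequences:
    print(" ".join(f"{word}/{tag}" for word, tag in zip(sentence, seq)))
```

The number of sequences is the product of the per-word tag counts, so it grows quickly with sentence length. A tagger has to choose among them, by hand-written rules or by statistics.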

The rise of statistical methods • First area in which statistical techniques truly proved their worth was Automatic Speech Recognition (ASR) • ASR techniques then used for POS tagging, and then in all areas of CL • A synthesis of statistical methods and linguistic insights now underway

Modern empiricism in Computational Linguistics • Large data collections • Rigorous collection techniques (interannotator agreement) • Rigorous evaluation techniques • Discovery of generalizations: via learning techniques

Statistics & the study of language? • Theoretical advances • Language acquisition: the role of experience • Linguistic theory: graded grammaticality • Language change: shifts in grammaticality • Empirical • Quantify linguistic phenomena • Analyze data • Test hypotheses • Psychological • Express preferences

Some interesting statistics about language • Lexical biases • Category: “bank” = Noun 85%, Verb 15% • Sense: Bank(river) 22%, Bank(money) 78% • Syntax • Subcategorization of “realised”: NP 20%, S 65%, Other 15% • Semantics / discourse • “he” in subject position 65% of the time
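Figures like these are typically relative frequencies computed from an annotated corpus. A minimal Python sketch follows; the 100 tagged occurrences of “bank” are invented toy data, chosen only so that the result matches the 85% / 15% split quoted on the slide.

```python
from collections import Counter


def relative_frequencies(observations):
    """Map each observed label to its share of all observations."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical toy data: NOT taken from a real corpus.
bank_tags = ["Noun"] * 85 + ["Verb"] * 15

freqs = relative_frequencies(bank_tags)
most_likely = max(freqs, key=freqs.get)

print(freqs)
print(most_likely)
```

Always choosing the most frequent category for a word is the simplest way to exploit such a lexical bias.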

Corpora • The use of statistical techniques has been made possible by the availability of CORPORA, large collections of text typically ANNOTATED with linguistic information: • The Brown corpus (1M words) and British National Corpus (100 million words), annotated with POS tags (English) • Penn Treebank (4M words), syntactically annotated (English) • SEMCOR (250K), annotated with word-sense information • The MapTask, annotated with dialogue information • Italian: CORIS (100M words+, Bologna), Si-TAL (220K words, written, annotated with syntactic information & word-sense information), IPAR (‘MapTask Italiano’)

Basic uses of corpora: Collocations • COMPOUNDS: “computer program”, “disk drive”, “calcio di rigore” (Italian: “penalty kick”) • PHRASAL VERBS: “wake up”, “come on” • PHRASAL EXPRESSIONS: “bacon and eggs”, “the bees’ knees”, “siamo alla frutta” (Italian idiom, literally “we are at the fruit course”, i.e. at the very end)

Bigrams: New York
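One simple way to look for collocation candidates such as “New York” is to count adjacent word pairs (bigrams). The Python sketch below does this on an invented toy sentence; `bigram_counts` is an illustrative helper. Raw frequency is only the crudest possible criterion and is used here purely for illustration.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


# Toy text, invented for the example (already lowercased and tokenizable by spaces).
text = "she moved to new york because new york is where the new job is"
tokens = text.split()

counts = bigram_counts(tokens)
print(counts.most_common(3))
```

On this toy text the pair ("new", "york") occurs twice and every other pair once, so it comes out on top.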

Statistical Language Processing • Statistical inference: • Collect statistics about occurrence of X • Predict new occurrences • Example: language modeling • Problem: predict word that follows, given previous ones • Find W_n that maximizes P(W_n | W_1 ... W_{n-1}) • Applications: • Speech recognition • Spell-checking • POS tagging …
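The prediction problem can be sketched in a few lines of Python. This sketch assumes a bigram approximation: the history W_1 ... W_{n-1} is replaced by the single previous word, and probabilities are plain relative frequencies with no smoothing. The toy corpus and the helpers `train_bigram_model` and `predict_next` are illustrative only.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """For each word, count which words follow it in the training tokens."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following


def predict_next(model, prev):
    """Return the most frequent follower of `prev` and its estimated P(word | prev).

    Returns (None, 0.0) if `prev` was never seen with a following word.
    """
    counts = model.get(prev)
    if not counts:
        return None, 0.0
    word, count = counts.most_common(1)[0]
    return word, count / sum(counts.values())


# Toy corpus, invented for the example.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(corpus)

word, prob = predict_next(model, "the")
print(word, prob)
```

In the toy corpus “the” is followed by “cat” twice and “mat” once, so the model predicts “cat” with estimated probability 2/3. An unseen history gets no prediction at all, which is the problem smoothing is meant to address.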

Bibliography • Steven Abney, Statistical Methods and Linguistics, in Judith Klavans and Philip Resnik (eds.), The Balancing Act, The MIT Press, Cambridge, Mass., 1996. • Textbooks: • Daniel Jurafsky and James Martin, Speech and Language Processing, Prentice-Hall • More general, and easier to follow • Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press • More complete, and written from a more linguistic perspective, but technically more advanced
