
LINGUISTICA GENERALE E COMPUTAZIONALE


Presentation Transcript


  1. LINGUISTICA GENERALE E COMPUTAZIONALE: TEXT CLASSIFICATION

  2. CLASSIFICATION • A CLASSIFIER is a FUNCTION from the objects to be classified to labels • Assigning the part of speech to words • Assigning a SPAM/NO SPAM value to emails • Positive / negative • The various aspects of language interpretation seen in the first lecture (part-of-speech disambiguation, syntactic analysis, etc.) can all be viewed as classification problems

  3. EXAMPLE: PART-OF-SPEECH DISAMBIGUATION AS CLASSIFICATION • Part-of-speech disambiguation (POS tagging) can be seen as a classifier that determines the most probable interpretation of each word (noun, verb, etc.)

  4. SECOND EXAMPLE: WORD SENSE DISAMBIGUATION

  5. SENSES OF “line” • Product: “While he wouldn’t estimate the sale price, analysts have estimated that it would exceed $1 billion. Kraft also told analysts it plans to develop and test a line of refrigerated entrees and desserts, under the Chillery brand name.” • Formation: “C-LD-R L-V-S V-NNA reads a sign in Caldor’s book department. The 1,000 or so people fighting for a place in line have no trouble filling in the blanks.” • Text: “Newspaper editor Francis P. Church became famous for an 1897 editorial, addressed to a child, that included the line ‘Yes, Virginia, there is a Santa Claus.’” • Cord: “It is known as an aggressive, tenacious litigator. Richard D. Parsons, a partner at Patterson, Belknap, Webb and Tyler, likens the experience of opposing Sullivan & Cromwell to ‘having a thousand-pound tuna on the line.’” • Division: “Today, it is more vital than ever. In 1983, the act was entrenched in a new constitution, which established a tricameral parliament along racial lines, with separate chambers for whites, coloreds and Asians but none for blacks.” • Phone: “On the tape recording of Mrs. Guba's call to the 911 emergency line, played at the trial, the baby sitter is heard begging for an ambulance.”

  6. A GEOMETRIC VIEW OF CLASSIFICATION (SPAM vs. NON-SPAM)

  7. EXAMPLE OF A CLASSIFIER: DECISION TREE

  8. THE ROLE OF MACHINE LEARNING • In modern computational linguistics these classifiers are not specified by hand, but are LEARNED AUTOMATICALLY from examples.

  9. PROBABILISTIC CLASSIFICATION • Each label typically has an associated PROBABILITY • The classifier can be developed by hand or LEARNED from (large quantities of) EXAMPLES using MACHINE LEARNING methods

  10. PROBABILISTIC POS TAGGERS • A POS TAGGER is a classifier that receives as input information about the word (FEATURES) • UNIGRAM PROBABILITY: P(N|salto), P(V|salto) • AFFIXES (‘ing’, ‘ould’) • N-GRAM PROBABILITIES: P(NN|un salto) • …. • It produces as output a probability • P(N|UProb,AFF,Nprob) = … • P(V|UProb,AFF,Nprob) = …
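As an illustration of the FEATURES listed above, here is a minimal sketch in Python of what a tagger's feature extractor might return for one word; the function name extract_features and the exact feature set are hypothetical, not code from the course.

```python
# Hypothetical sketch of a POS-tagger feature extractor (not the course's own code).
def extract_features(sentence, i):
    word = sentence[i]
    return {
        "word": word.lower(),          # basis for unigram probabilities, e.g. P(N|salto)
        "suffix2": word[-2:],          # affix features such as 'ing', 'ould'
        "suffix3": word[-3:],
        "prev_word": sentence[i - 1].lower() if i > 0 else "<s>",  # n-gram context, e.g. P(NN|un salto)
    }

print(extract_features(["un", "salto"], 1))
# {'word': 'salto', 'suffix2': 'to', 'suffix3': 'lto', 'prev_word': 'un'}
```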

  11. TYPES OF CLASSIFIERS • SUPERVISED • Learn from labelled examples • Model learning from a teacher • UNSUPERVISED • Discover the structure of the problem on their own • Model language acquisition • SEMI-SUPERVISED • Receive a few examples as input, then proceed by similarity

  12. SUPERVISED CLASSIFICATION FOR POS TAGGING • The learning algorithm receives as input a TRAINING corpus annotated with POS tags • La/Art gatta/N fece/V un/Art salto/N ./. • Giuseppe/PN e’/V matto/Adj ./. • It extracts the features / computes the probabilities • It builds a MODEL that can then be used to classify OTHER texts
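A minimal sketch of this training loop in Python with NLTK; the two-sentence corpus is the one on the slide, while the feature set is an assumption for illustration, not the course's actual model.

```python
import nltk

# Tiny POS-tagged training corpus taken from the slide.
tagged_sents = [
    [("La", "Art"), ("gatta", "N"), ("fece", "V"), ("un", "Art"), ("salto", "N"), (".", ".")],
    [("Giuseppe", "PN"), ("e'", "V"), ("matto", "Adj"), (".", ".")],
]

def features(words, i):
    # Word identity, a suffix, and the previous word as simple features.
    return {"word": words[i].lower(), "suffix2": words[i][-2:],
            "prev": words[i - 1].lower() if i > 0 else "<s>"}

# Extract (features, tag) pairs and learn a MODEL that can tag OTHER texts.
train_set = [(features([w for w, _ in sent], i), tag)
             for sent in tagged_sents
             for i, (_, tag) in enumerate(sent)]
model = nltk.NaiveBayesClassifier.train(train_set)
print(model.classify(features(["un", "volo"], 1)))  # likely 'N', by analogy with 'salto'
```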

  13. TRAIN/TEST

  14. LEARNING METHODS • DECISION TREES • NAÏVE BAYES

  15. NAÏVE BAYES • Bayesian methods: the classification decision is based on • a PROBABILISTIC model • that combines PRIOR and POSTERIOR information as in Bayes’ rule • NAÏVE BAYES methods: assumptions are made that greatly simplify the computation of the probabilities

  16. Bayes’ Rule
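The formula on this slide is an image in the original; the rule it refers to is the standard one:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```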

  17. Bayes applied to text classification • P(Class|Properties) = P(Properties|Class) * P(Class) / P(Properties)

  18. Maximum a posteriori Hypothesis
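The slide body is an image; the usual statement of the maximum a posteriori (MAP) hypothesis, consistent with the Bayes rule above (the denominator is constant across classes and can be dropped), is:

```latex
c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1,\dots,x_n)
        = \arg\max_{c_j \in C} \frac{P(x_1,\dots,x_n \mid c_j)\, P(c_j)}{P(x_1,\dots,x_n)}
        = \arg\max_{c_j \in C} P(x_1,\dots,x_n \mid c_j)\, P(c_j)
```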

  19. Naive Bayes Classifiers • Task: classify a new instance based on a tuple of attribute values

  20. Naïve Bayes Classifier: Assumptions • P(cj) • Can be estimated from the frequency of classes in the training examples. • P(x1,x2,…,xn|cj) • O(|X|^n · |C|) parameters • Could only be estimated if a very, very large number of training examples was available. • Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.
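Written out, the conditional independence assumption amounts to the following (standard notation; the slide's own formula is an image):

```latex
P(x_1, x_2, \dots, x_n \mid c_j) \approx \prod_{i=1}^{n} P(x_i \mid c_j)
```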

  21. The Naïve Bayes Classifier • Conditional Independence Assumption: features are independent of each other given the class • [Diagram: class node Flu with feature nodes X1, …, X5 = runny nose, sinus, cough, fever, muscle-ache]

  22. Learning the Model • Common practice: maximum likelihood • Simply use the frequencies in the data • [Diagram: class node C with feature nodes X1, …, X6]
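The maximum-likelihood estimates ("frequencies in the data") can be written as follows; the notation is the standard one, since the slide's own formulas are images:

```latex
\hat{P}(c_j) = \frac{N(C = c_j)}{N}
\qquad
\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i,\; C = c_j)}{N(C = c_j)}
```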

  23. Problem with Max Likelihood • What if we have seen no training cases where the patient had no flu and muscle aches? • Zero probabilities cannot be conditioned away, no matter the other evidence! • [Diagram: class node Flu with feature nodes X1, …, X5 = runny nose, sinus, cough, fever, muscle-ache]

  24. Smoothing to Avoid Overfitting • Smoothed estimate using the # of values of Xi • Somewhat more subtle version: uses the overall fraction in the data where Xi = xi,k and a parameter for the extent of “smoothing” • [The formulas appear as images on the original slide; a reconstruction follows]
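The annotations on the slide ("# of values of Xi", "overall fraction in data where Xi = xi,k", "extent of smoothing") match the standard add-one (Laplace) and m-estimate smoothing formulas; the reconstruction below is an assumption about what the slide's formula images showed.

```latex
% Add-one (Laplace) smoothing, where k is the number of values of X_i
\hat{P}(x_{i,k} \mid c_j) = \frac{N(X_i = x_{i,k},\, C = c_j) + 1}{N(C = c_j) + k}

% "Somewhat more subtle version": p is the overall fraction of the data with
% X_i = x_{i,k}, and m controls the extent of smoothing
\hat{P}(x_{i,k} \mid c_j) = \frac{N(X_i = x_{i,k},\, C = c_j) + m\,p}{N(C = c_j) + m}
```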

  25. Using Naive Bayes Classifiers to Classify Text: Basic method • Attributes are text positions, values are words. • Naive Bayes assumption is clearly violated. • Example? • Still too many possibilities • Assume that classification is independent of the positions of the words • Use same parameters for each position

  26. EXAMPLE OF CLASSIFICATION: DOCUMENT CLASSIFICATION (NLTK book, pp. 227-228)
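A condensed sketch along the lines of the NLTK book's document-classification example (movie reviews, presence/absence of frequent words as features); consult the cited pages for the book's exact code.

```python
import random
import nltk
from nltk.corpus import movie_reviews   # requires nltk.download('movie_reviews')

# Each review file becomes a (word list, category) pair; category is 'pos' or 'neg'.
documents = [(list(movie_reviews.words(fid)), cat)
             for cat in movie_reviews.categories()
             for fid in movie_reviews.fileids(cat)]
random.shuffle(documents)

# Features: presence/absence of the 2000 most frequent words in the corpus.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    return {"contains({})".format(w): (w in document_words) for w in word_features}

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)
```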

  27. EVALUATION • ACCURACY: percentage of correct answers • For problems where the class of interest is only a tiny percentage of the total: PRECISION and RECALL

  28. PRECISION AND RECALL
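The slide body is an image; the standard definitions, in terms of true positives (TP), false positives (FP) and false negatives (FN) of the class of interest, are (the F-measure combining them is added here for completeness):

```latex
\text{Precision} = \frac{TP}{TP + FP}
\qquad
\text{Recall} = \frac{TP}{TP + FN}
\qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```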

  29. EXAMPLE OF CLASSIFICATION: GENDER IDENTIFICATION (NLTK book, pp. 222-227)
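A condensed sketch along the lines of the NLTK book's gender-identification example (classifying names as male or female from their last letter); see the cited pages for the full version.

```python
import random
import nltk
from nltk.corpus import names            # requires nltk.download('names')

def gender_features(word):
    # The simplest feature from the book's example: the final letter of the name.
    return {"last_letter": word[-1]}

labeled_names = ([(n, "male") for n in names.words("male.txt")] +
                 [(n, "female") for n in names.words("female.txt")])
random.shuffle(labeled_names)

featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(gender_features("Neo")))    # e.g. 'male'
print(nltk.classify.accuracy(classifier, test_set))   # roughly 0.75
classifier.show_most_informative_features(5)
```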

  30. LEARNING DECISION TREES • Top-down: given a set of examples, find the property that splits them into more COHERENT subgroups • Then proceed recursively • Choice of the property: INFORMATION GAIN

  31. Top-down DT induction • Partition training examples into good “splits”, based on values of a single “good” feature: (1) Sat, hot, no, casual, keys -> + (2) Mon, cold, snow, casual, no-keys -> - (3) Tue, hot, no, casual, no-keys -> - (4) Tue, cold, rain, casual, no-keys -> - (5) Wed, hot, rain, casual, keys -> +

  32. Top-down DT induction • Split on keys? • keys? = yes -> Drive: 1, 5 • keys? = no -> Walk: 2, 3, 4

  33. Top-down DT induction • Partition training examples into good “splits”, based on values of a single “good” feature (1) Sat, hot, no, casual -> + (2) Mon, cold, snow, casual -> - (3) Tue, hot, no, casual -> - (4) Tue, cold, rain, casual -> - (5) Wed, hot, rain, casual -> + • No acceptable classification: proceed recursively

  34. Top-down DT induction • Split on t? • t = cold -> Walk: 2, 4 • t = hot -> Drive: 1, 5 and Walk: 3 (still mixed)

  35. Top-down DT induction • t = cold -> Walk: 2, 4 • t = hot -> split on day? • day = Sat -> Drive: 1 • day = Tue -> Walk: 3 • day = Wed -> Drive: 5

  36. Top-down DT induction • t = cold -> Walk: 2, 4 • t = hot -> split on day? • day = Sat -> Drive: 1 • day = Tue -> Walk: 3 • day = Wed -> Drive: 5 • day = Mon, Thu, Fri, Sun -> ? (no training examples; default Drive)

  37. Property selection • The choice of the property used to split the current set into more coherent subsets is based on a DISORDER-REDUCTION criterion built on the notion of ENTROPY

  38. ENTROPY
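The slide body is an image; the definition of entropy used in the computations on the following slides is the standard one:

```latex
E(S) = - \sum_{i} p_i \log_2 p_i
```

For the five examples above (3 Walk, 2 Drive), this gives E(S) = -0.6*lg(0.6) - 0.4*lg(0.4) ≈ 0.97, as on the next slide.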

  39. Entropy and Decision Trees • E(S) = -0.6*lg(0.6) - 0.4*lg(0.4) = 0.97 • Split on keys? • keys? = yes -> Drive: 1, 5, E(Skeys) = 0 • keys? = no -> Walk: 2, 3, 4, E(Sno) = 0

  40. Entropy and Decision Trees • E(S) = -0.6*lg(0.6) - 0.4*lg(0.4) = 0.97 • Split on t? • t = cold -> Walk: 2, 4, E(Scold) = 0 • t = hot -> Drive: 1, 5 and Walk: 3, E(Shot) = -0.33*lg(0.33) - 0.66*lg(0.66) = 0.92

  41. INFORMATION GAIN

  42. Information gain • For each feature f, compute the reduction in entropy on the split: Gain(S,f) = E(S) - ∑(Entropy(Si) * |Si| / |S|) • f = keys?: Gain(S,f) = 0.97 • f = t?: Gain(S,f) = 0.97 - 0*2/5 - 0.92*3/5 = 0.42 • f = clothing?: Gain(S,f) = ?
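A small Python check of the numbers on this slide; the toy commuting data is the one from slide 31, while the function names are chosen here for illustration.

```python
from math import log2

# Toy commuting examples from slide 31: (keys, temperature, decision)
examples = [("keys",    "hot",  "drive"),   # (1)
            ("no-keys", "cold", "walk"),    # (2)
            ("no-keys", "hot",  "walk"),    # (3)
            ("no-keys", "cold", "walk"),    # (4)
            ("keys",    "hot",  "drive")]   # (5)

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(l) / total) * log2(labels.count(l) / total)
                for l in set(labels))

def gain(examples, f):
    # E(S) minus the size-weighted entropy of each subset produced by feature f.
    labels = [e[-1] for e in examples]
    split = sum(entropy([e[-1] for e in examples if e[f] == v])
                * sum(e[f] == v for e in examples) / len(examples)
                for v in set(e[f] for e in examples))
    return entropy(labels) - split

print(round(entropy([e[-1] for e in examples]), 2))  # 0.97
print(round(gain(examples, 0), 2))                   # keys? -> 0.97
print(round(gain(examples, 1), 2))                   # t?    -> 0.42
```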

  43. TEXT CATEGORIZATION WITH DT • Build a separate decision tree for each category • Use WORD COUNTS as features

  44. Reuters Data Set (21578 - ModApte split) • 9603 training, 3299 test articles; avg. 200 words • 118 categories • An article can be in more than one category • Learn 118 binary category distinctions • Common categories (#train, #test): • Earn (2877, 1087) • Acquisitions (1650, 179) • Money-fx (538, 179) • Grain (433, 149) • Crude (389, 189) • Trade (369, 119) • Interest (347, 131) • Ship (197, 89) • Wheat (212, 71) • Corn (182, 56)

  45. AN EXAMPLE OF REUTERS TEXT (from Foundations of Statistical Natural Language Processing, Manning and Schuetze)

  46. Decision Tree for Reuter classification (from Foundations of Statistical Natural Language Processing, Manning and Schuetze)

  47. Information gain & text classification
