The NLP TOOLTORIAL: Tools for Natural Language Processing and Text Mining
Dragomir R. Radev
April 22, 2011
Abuja, Nigeria (CNN) -- Incumbent Goodluck Jonathan is the winner of the presidential election in Nigeria, the chairman of Nigeria's Independent National Electoral Commission declared Monday. "Goodluck E. Jonathan of PDP, having satisfied the requirements of the law and scored the highest number of votes, is hereby declared the winner," Chairman Attahiru Jega said. He said the ruling People's Democratic Party won 22,495,187 of the 39,469,484 votes cast Saturday. That number far outstripped the votes for Muhammadu Buhari, of the Congress for Progressive Change, the main opposition party, which won 12,214,853. To avoid a runoff, Jonathan needed at least a quarter of the vote in two-thirds of the 36 states and the capital. He won that amount in 31 states. Only the PDP signed the results; representatives of the other parties refused to do so. Nigeria's main opposition party, the Congress for Progressive Change, alleged that would-be voters had been intimidated and driven away from polling stations and that ballot boxes were stuffed with votes for Jonathan.
Recent advances in molecular genetics have permitted the development of novel virus-based vectors for the delivery of genes and expression of gene products [6,7,8]. These live vectors have the advantage of promoting robust immune responses due to their ability to replicate, and induce expression of genes at high efficiency. Sendai virus is a member of the Paramyxoviridae family, belongs in the genus respirovirus and shares 60–80% sequence homology to human parainfluenza virus type 1 (HPIV-1) [9,10]. The viral genome consists of a negative sense, non-segmented RNA. Although Sendai virus was originally isolated from humans during an outbreak of pneumonitis [11], subsequent human exposures to Sendai virus have not resulted in observed pathology [12]. The virus is commonly isolated from mouse colonies, and Sendai virus infection in mice leads to bronchopneumonia, causing severe pathology and inflammation in the respiratory tract. The sequence homology and similarities in respiratory pathology have made Sendai virus a mouse model for HPIV-1. Immunization with Sendai virus promotes an immune response in non-human primates that is protective against HPIV-1 [13,14], and clinical trials are underway to determine the efficacy of this virus for protection against HPIV-1 in humans [15]. Sendai virus naturally infects the respiratory tract of mice, and recombinant viruses have been reported to efficiently transduce luciferase, lacZ and green fluorescent protein (GFP) genes in the airways of mice or ferrets as well as primary human nasal epithelial cells [16]. These data support the hypothesis that intranasal (i.n.) immunization with a recombinant Sendai virus will mediate heterologous gene expression in mucosal tissues and induce antibodies that are specific to a recombinant protein. A major advantage of a recombinant Sendai virus-based vaccine is the observation that recurrence of parainfluenza virus infections is common in humans [12,17], suggesting that anti-vector responses are limited, making repeated administration of such a vaccine possible.
Background • What is NLP? • Components: tokenization, parsing, semantic analysis, information extraction, machine translation, speech processing, question answering, text summarization, sentiment analysis, relationship extraction, named entity recognition, text classification
Preprocessing • Text extraction • Sentence segmentation • Word normalization • Stemming • Part of speech tagging • Named entity extraction
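Before any of the later tools can run, raw text has to be segmented into sentences and tokens. The following is a deliberately naive Perl sketch of sentence segmentation and tokenization with regular expressions (purely illustrative, not any of the tools listed here; it will, for example, wrongly split on "Dr."):

use strict;
use warnings;

my $text = "Dr. Smith arrived. He spoke at 9 p.m. about NLP tools!";

# Naive sentence segmentation: split after ., ! or ? followed by whitespace.
# (A real segmenter must not split on abbreviations such as "Dr." or "p.m.")
my @sentences = split /(?<=[.!?])\s+/, $text;

foreach my $sentence (@sentences) {
    # Naive tokenization: put spaces around punctuation, then split on whitespace.
    (my $spaced = $sentence) =~ s/([.,!?;:()"])/ $1 /g;
    my @tokens = grep { length } split /\s+/, $spaced;
    print join(" | ", @tokens), "\n";
}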
Part of speech tagging Secretariat is expected to race tomorrow
Part of speech tagging
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
Secretariat/NNP is/VBZ expected/VBN to/TO race/NN tomorrow/NR
Parsing • Myriam slept. • Myriam wrote a novel. • Myriam gave Sally flowers. • Myriam ate pizza with olives. • Myriam ate pizza with Sally. • Myriam ate pizza with a fork. • Myriam ate pizza with remorse.
Sample phrase-structure grammar
S → NP VP
NP → DET N
NP → NP PP
VP → VBD
VP → VBD NP
VP → VBD NP NP
VP → VP PP
PP → PRP NP
DET → the
DET → that
DET → a
N → child
N → butterfly
N → net
VBD → caught
VBD → ate
VBD → saw
PRP → in
PRP → of
PRP → with
Parse trees
"That child caught the butterfly with a net"
(S (NP (DET That) (N child))
   (VP (VP (VBD caught) (NP (DET the) (N butterfly)))
       (PP (PRP with) (NP (DET a) (N net)))))
Syntactic parsing
• Collins parser
• Bikel parser (includes Chinese, Arabic)

# Parse a pre-tagged sentence with the Collins parser via Lingua::CollinsParser
use Lingua::CollinsParser;

my $p = Lingua::CollinsParser->new();
my $cp_home = '/path/to/COLLINS-PARSER';
$p->load_grammar("$cp_home/models/model1/grammar");   # grammar file from the Collins distribution
$p->load_events("$cp_home/models/model1/events");     # training events file
my @words = qw(The bird flies);
my @tags  = qw(DT NN VBZ);                             # one part-of-speech tag per word
my $tree  = $p->parse_sentence(\@words, \@tags);       # returns the parse tree for the sentence
Stanford parser
(ROOT (S (S (NP (NP (NN Housing) (NNS starts)) (, ,) (NP (NP (DT the) (NN number)) (PP (IN of) (NP (NP (JJ new) (NNS homes)) (VP (VBG being) (VP (VBN built)))))) (, ,)) (VP (VBD rose) (NP (CD 7.2) (NN %)) (PP (IN in) (NP (NNP March))) (PP (TO to) (NP (NP (DT an) (JJ annual) (NN rate)) (PP (IN of) (NP (CD 549,000) (NNS units))))) (, ,) (ADVP (RB up) (PP (IN from) (NP (NP (DT a) (VBN revised) (CD 512,000)) (PP (IN in) (NP (NNP February)))))))) (, ,) (NP (DT the) (NNP Commerce) (NNP Department)) (VP (VBD said)) (. .)))
Stanford parser
nn(starts-2, Housing-1)
nsubj(rose-12, starts-2)
det(number-5, the-4)
appos(starts-2, number-5)
prep(number-5, of-6)
amod(homes-8, new-7)
pobj(of-6, homes-8)
auxpass(built-10, being-9)
partmod(homes-8, built-10)
ccomp(said-36, rose-12)
num(%-14, 7.2-13)
dobj(rose-12, %-14)
prep(rose-12, in-15)
pobj(in-15, March-16)
prep(rose-12, to-17)
det(rate-20, an-18)
amod(rate-20, annual-19)
pobj(to-17, rate-20)
prep(rate-20, of-21)
num(units-23, 549,000-22)
pobj(of-21, units-23)
advmod(rose-12, up-25)
dep(up-25, from-26)
det(512,000-29, a-27)
amod(512,000-29, revised-28)
pobj(from-26, 512,000-29)
prep(512,000-29, in-30)
pobj(in-30, February-31)
det(Department-35, the-33)
nn(Department-35, Commerce-34)
nsubj(said-36, Department-35)
Named Entity Recognition • Abner
WordNet
Noun
S: (n) board (a committee having supervisory powers) "the board has seven members"
S: (n) board, plank (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes)
S: (n) board (a flat piece of material designed for a special purpose) "he nailed boards across the windows"
S: (n) board, table (food or meals in general) "she sets a fine table"; "room and board"
S: (n) display panel, display board, board (a vertical surface on which information can be displayed to public view)
S: (n) dining table, board (a table at which meals are served) "he helped her clear the dining table"; "a feast was spread upon the board"
S: (n) control panel, instrument panel, control board, board, panel (electrical device consisting of a flat insulated surface that contains switches and dials and meters for controlling other electrical devices) "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree"
S: (n) circuit board, circuit card, board, card, plug-in, add-in (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities)
S: (n) board, gameboard (a flat portable surface (usually rectangular) designed for board games) "he got out the board and set up the pieces"
Verb
S: (v) board, get on (get on board of (trains, buses, ships, aircraft, etc.))
S: (v) board, room (live and take one's meals at or in) "she rooms in an old boarding house"
S: (v) board (lodge and take meals (at))
S: (v) board (provide food and lodging (for)) "The old lady is boarding three men"
WordNet
dog, domestic dog, Canis familiaris
=> canine, canid
=> carnivore
=> placental, placental mammal, eutherian, eutherian mammal
=> mammal
=> vertebrate, craniate
=> chordate
=> animal, animate being, beast, brute, creature, fauna
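The hypernym chain above can also be retrieved programmatically. A minimal sketch, assuming a local WordNet installation and using WordNet::QueryData (the same module the similarity example below relies on) to follow the "hype" (hypernym) relation:

use strict;
use warnings;
use WordNet::QueryData;

my $wn = WordNet::QueryData->new();

# Walk up the hypernym ("is-a") chain starting from the first noun sense of "dog".
my $sense = "dog#n#1";
while ($sense) {
    print "$sense\n";
    my @hypernyms = $wn->querySense($sense, "hype");   # direct hypernyms of this sense
    $sense = $hypernyms[0];                            # follow the first hypernym, if any
}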
WordNet similarity
• http://www.d.umn.edu/~tpederse/similarity.html

# Compute Leacock-Chodorow similarity between two WordNet senses
use WordNet::QueryData;
use WordNet::Similarity::lch;

my $wn    = WordNet::QueryData->new();            # interface to the WordNet database
my $myobj = WordNet::Similarity::lch->new($wn);   # Leacock-Chodorow similarity measure
my $value = $myobj->getRelatedness("car#n#1", "bus#n#2");
my ($error, $errorString) = $myobj->getError();
die "$errorString\n" if ($error);
print "car (sense 1) <-> bus (sense 2) = $value\n";
CMU-CAM LM toolkit • Language modeling • Extracting n-grams • Smoothing • Computing perplexities • http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html
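The CMU-CAM toolkit does all of this at scale from the command line. Purely to illustrate what extracting n-grams and estimating their probabilities involves (this is not the toolkit itself, and it applies no smoothing), here is a tiny Perl sketch that counts bigrams over a toy corpus and prints maximum-likelihood estimates:

use strict;
use warnings;

my @corpus = ("the child caught the butterfly", "the child ate the pizza");

my (%unigram, %bigram);
foreach my $sentence (@corpus) {
    my @w = ("<s>", split(/\s+/, $sentence), "</s>");   # add sentence-boundary markers
    for my $i (0 .. $#w - 1) {
        $unigram{ $w[$i] }++;                           # history counts
        $bigram{ "$w[$i] $w[$i+1]" }++;                 # bigram counts
    }
}

# Unsmoothed maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)
foreach my $pair (sort keys %bigram) {
    my ($w1, $w2) = split / /, $pair;
    printf "P(%s | %s) = %.3f\n", $w2, $w1, $bigram{$pair} / $unigram{$w1};
}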
Machine translation • Moses
Machine Translation
• Noisy channel model (“Chinese Whispers”): an English sentence e passes through a noisy channel (the encoder, E → F) and comes out as the foreign sentence f; the decoder (F → E) recovers the most likely English source e’.
e’ = argmax_e P(e|f) = argmax_e P(f|e) P(e)
where P(f|e) is the translation model and P(e) is the language model.
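To make the decoding objective concrete, here is a minimal Perl sketch (not Moses; the candidate translations and the probabilities are made up for illustration) that picks the English sentence e maximizing P(f|e) · P(e) over a small candidate list:

use strict;
use warnings;

my %translation_model = (                 # P(f|e): how well e explains the foreign sentence f
    "this is your favorite play"   => 0.20,
    "is this your favorite play"   => 0.18,
    "this is your preferred piece" => 0.25,
);
my %language_model = (                    # P(e): how fluent e is as English
    "this is your favorite play"   => 0.010,
    "is this your favorite play"   => 0.002,
    "this is your preferred piece" => 0.001,
);

# Decode: e' = argmax_e P(f|e) * P(e)
my ($best_e, $best_score) = (undef, -1);
for my $e (keys %translation_model) {
    my $score = $translation_model{$e} * $language_model{$e};
    ($best_e, $best_score) = ($e, $score) if $score > $best_score;
}
print "e' = $best_e (score $best_score)\n";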
Machine Translation
• IBM Method (generative story behind the translation model)
English source (scored by the LM):  IS THIS YOUR FAVORITE PLAY ?
Fertility (word copying):           IS THIS YOUR FAVORITE PLAY PLAY PLAY ?  ** ** **
Distortion (reordering):            THIS IS YOUR PLAY PLAY PLAY FAVORITE ?
Word translation (TM):              EST-CE QUE C’ EST VOTRE PIECE DE THEATRE PREFEREE ?
Clairlib

mkdir corpus
cd corpus
wget -r -nd -nc http://clair.si.umich.edu/clair/corpora/chemical
cd ..
directory_to_corpus.pl --corpus chemical --base produced --directory corpus
index_corpus.pl --corpus chemical --base produced
tf_query.pl -c chemical -b produced -q health
tf_query.pl -c chemical -b produced --all
idf_query.pl -c chemical -b produced -q health
idf_query.pl -c chemical -b produced --all
corpus_to_network.pl -c chemical -b produced -o chemical.graph
print_network_stats.pl -i chemical.graph --all > chemical.graph.stats
extract_ngrams.pl -r "$CLAIRLIB/corpora/1984/1984.txt" \
  -f text -w 1984.2gram -N 2 -sort -v
list_to_cos_stats.pl --corpus sentences --base sentences_produced --data sentences_data --input 11sent.txt --step 0.05
Text summarization

# Summarize an HTML document with Clair::Document (from Clairlib)
use Clair::Document;

my $file = "gulf.html";
my $doc  = new Clair::Document(type => "html", file => $file);
$doc->strip_html();                # remove HTML markup, keep the text
$doc->split_into_sentences();      # sentence segmentation
my @summary = $doc->get_summary(size => 5, preserve_order => 1);   # 5-sentence extract
foreach my $sent (@summary) {
    print "$sent->{text} ";
}
Latent Semantic Indexing (LSI/LSA) • SenseClusters (Ted Pedersen)
Machine learning • Clustering and classification • Weka – decision trees, naïve Bayes, k-means, EM, etc. • For text classification (e.g., spam recognition) – svmlight • Feature extraction – using pointwise mutual information, chi-square…
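As an illustration of the pointwise-mutual-information idea for feature scoring (a standalone sketch with made-up counts, not Weka or svmlight): PMI(w, c) = log2( P(w, c) / (P(w) P(c)) ), and words with high PMI for a class are good candidate features for a classifier.

use strict;
use warnings;

# Made-up toy counts: how often each word occurs in documents of each class.
my %count = (
    "viagra"  => { spam => 40, ham => 1  },
    "meeting" => { spam => 2,  ham => 35 },
    "free"    => { spam => 30, ham => 10 },
);

# Total number of (word, class) observations
my $N = 0;
$N += $_ for map { values %$_ } values %count;

foreach my $word (sort keys %count) {
    foreach my $class (sort keys %{ $count{$word} }) {
        my $p_wc = $count{$word}{$class} / $N;                       # P(w, c)
        my $p_w  = 0;
        $p_w += $count{$word}{$_} / $N for keys %{ $count{$word} };  # P(w)
        my $p_c  = 0;
        $p_c += $count{$_}{$class} / $N for keys %count;             # P(c)
        my $pmi  = log( $p_wc / ($p_w * $p_c) ) / log(2);            # log base 2
        printf "PMI(%s, %s) = %+.2f\n", $word, $class, $pmi;
    }
}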
Main URLs • Collins parser: • http://people.csail.mit.edu/mcollins/code.html • Bikel parser: • http://www.cis.upenn.edu/~dbikel/software.html • Moses • http://www.statmt.org/ • Clairlib/mead • http://www.clairlib.org • NLTK • http://www.nltk.org/ • Abner • http://pages.cs.wisc.edu/~bsettles/abner/ • JavaRAP • http://aye.comp.nus.edu.sg/~qiu/NLPTools/JavaRAP.html
Main URLs • Mallet • http://mallet.cs.umass.edu • Svmlight • http://svmlight.joachims.org • Weka • http://www.cs.waikato.ac.nz/ml/weka
Additional tools • Language identification: • http://www.let.rug.nl/~vannoord/TextCat/ • Porter stemmer: • http://tartarus.org/~martin/PorterStemmer/ • MXTerminator: • ftp://ftp.cis.upenn.edu/pub/adwait/jmx/ • PDFBOX: • http://pdfbox.apache.org/ • Coreference: • http://www.bart-coref.org/ • Stanford NER • http://nlp.stanford.edu/software/CRF-NER.shtml • Citation parsing • http://wing.comp.nus.edu.sg/parsCit/ • Lingpipe • http://alias-i.com/lingpipe/
Additional tools • Sentiment analysis • Sentence compression • Time-series analysis • More links: • http://cogcomp.cs.illinois.edu/page/software • http://nlp.stanford.edu/links/statnlp.html • http://www.coli.uni-saarland.de/~csporled/page.php?id=tools • http://www.aclweb.org