Linguistic techniques for Text Mining
NaCTeM team, www.nactem.ac.uk
Sophia Ananiadou, Chikashi Nobata, Yutaka Sasaki, Yoshimasa Tsuruoka
[Figure: the Natural Language Processing pipeline turns raw (unstructured) text into annotated (structured) text, supported by a lexicon and an ontology. Components shown: part-of-speech tagging, named entity recognition, deep syntactic parsing. The example sentence "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells ." is annotated with POS tags (NN IN NN VBZ VBN IN NN IN JJ NN NNS .), a parse tree (S, VP, NP, PP nodes), the entities protein_molecule, organic_compound, and cell_line, and a negative regulation event.]
Basic Steps of Natural Language Processing • Sentence splitting • Tokenization • Part-of-speech tagging • Shallow parsing • Named entity recognition • Syntactic parsing • (Semantic Role Labeling)
Sentence splitting
Input (raw text):
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
Output (one sentence per line):
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.
However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs.
Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
A heuristic rule for sentence splitting
• Sentence boundary = period + space(s) + capital letter
• Regular expression in Perl: s/\. +([A-Z])/\.\n\1/g;
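The same heuristic can be sketched in Python (a minimal illustration; the function name is ours, not from the slides):

```python
import re

def split_sentences(text):
    # Heuristic: a sentence boundary is a period, one or more spaces,
    # and a capital letter; insert a newline before the capital letter.
    marked = re.sub(r'\. +([A-Z])', r'.\n\1', text)
    return marked.split('\n')

print(split_sentences("Protocols reduce Th1 cytokines. However, production is important."))
# → ['Protocols reduce Th1 cytokines.', 'However, production is important.']
```

As the next slide shows, the same rule also fires inside abbreviations such as "e.g.", so a pure regex splitter makes errors.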
Errors
• The heuristic wrongly splits after the period of "e.g.":
IL-33 is known to induce the production of Th2-associated cytokines (e.g.
IL-5 and IL-13).
• Two solutions:
• Add more rules to handle exceptions
• Machine learning
Tools for sentence splitting • JASMINE • Rule-based • http://uvdb3.hgc.jp/ALICE/program_download.html • Scott Piao’s splitter • Rule-based • http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector • OpenNLP • Maximum-entropy learning • https://sourceforge.net/projects/opennlp/ • Needs training data
Tokenization
• Convert a sentence into a sequence of tokens
• Why do we tokenize? Because we do not want to treat a sentence as a sequence of characters!
Input: The protein is activated by IL2.
Output: The protein is activated by IL2 .
Tokenization
• Tokenizing general English sentences is relatively straightforward.
• Use spaces as the boundaries
• Use some heuristics to handle exceptions
Input: The protein is activated by IL2.
Output: The protein is activated by IL2 .
Tokenisation issues
• Separate possessive endings or abbreviated forms from preceding words:
• Mary’s → Mary ’s
• Mary’s → Mary is
• Mary’s → Mary has
• Separate punctuation marks and quotes from words:
• Mary. → Mary .
• “new” → “ new ”
Tokenization
• Tokenizer.sed: a simple script in sed
• http://www.cis.upenn.edu/~treebank/tokenization.html
• Undesirable tokenization
• original: “1,25(OH)2D3”
• tokenized: “1 , 25 ( OH ) 2D3”
• Tokenization for biomedical text
• Not straightforward
• Needs dictionary? Machine learning?
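A minimal space-plus-heuristics tokenizer can be sketched in Python (illustrative only; it reproduces both the normal case and the undesirable “1,25(OH)2D3” behaviour described above):

```python
import re

def tokenize(sentence):
    # Pad punctuation with spaces, then split on whitespace.
    # A simple heuristic approach with no biomedical dictionary.
    padded = re.sub(r'([.,!?;:()"])', r' \1 ', sentence)
    return padded.split()

print(tokenize("The protein is activated by IL2."))
# → ['The', 'protein', 'is', 'activated', 'by', 'IL2', '.']
print(tokenize("1,25(OH)2D3"))
# → ['1', ',', '25', '(', 'OH', ')', '2D3']
```

The second call shows why biomedical text needs dictionaries or machine learning: the chemical name is torn apart.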
Tokenisation problems in Bio-text
• Commas
• 2,6-diaminohexanoic acid
• tricyclo(3.3.1.13,7)decanone
• Four kinds of hyphens
• “Syntactic”: Calcium-dependent, Hsp-60
• Knocked-out gene: lush-- flies
• Negation: -fever
• Electric charge: Cl-
(K. Cohen, NAACL 2007)
Tokenisation
• Tokenization divides the text into smallest units (usually words), removing punctuation.
• Challenge: what should be done with punctuation that has linguistic meaning?
• Negative charge (Cl-)
• Absence of symptom (-fever)
• Knocked-out gene (Ski-/-)
• Gene name (IL-2-mediated)
• Plus, “syntactic” uses (insulin-dependent)
(K. Cohen, NAACL 2007)
Part-of-speech tagging • Assign a part-of-speech tag to each token in a sentence. The peri-kappa B site mediates human immunodeficiency DT NN NN NN VBZ JJ NN virus type 2 enhancer activation in monocytes … NN NN CD NN NN IN NNS
Part-of-speech tags • The Penn Treebank tagset • http://www.cis.upenn.edu/~treebank/ • 45 tags NN Noun, singular or mass NNS Noun, plural NNP Proper noun, singular NNPS Proper noun, plural : : VB Verb, base form VBD Verb, past tense VBG Verb, gerund or present participle VBN Verb, past participle VBZ Verb, 3rd person singular present : : JJ Adjective JJR Adjective, comparative JJS Adjective, superlative : : DT Determiner CD Cardinal number CC Coordinating conjunction IN Preposition or subordinating conjunction FW Foreign word : :
Part-of-speech tagging is not easy
• Parts-of-speech are often ambiguous
• We need to look at the context
• But how?
• “go” is a verb in “I have to go to school.” but a noun in “I had a go at skiing.”
Writing rules for part-of-speech tagging
• If the previous word is “to”, then it’s a verb: I have to go to school.
• If the previous word is “a”, then it’s a noun: I had a go at skiing.
• If the next word is … :
• Writing rules manually is impossible
Learning from examples
• Training data: text annotated with part-of-speech tags
The involvement of ion channels in B and T lymphocyte activation is
DT NN IN NN NNS IN NN CC NN NN NN VBZ
supported by many reports of changes in ion fluxes and membrane
VBN IN JJ NNS IN NNS IN NN NNS CC NN
• A machine learning algorithm trained on the annotated text is then applied to unseen text:
We demonstrate that … → We/PRP demonstrate/VBP that/IN …
Part-of-speech tagging with Hidden Markov Models
• [Figure: an HMM in which hidden tag states emit words; arrows between tags carry transition probabilities, and arrows from tags to words carry output probabilities.]
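In symbols, the first-order HMM on the slide factorizes the joint probability of a word sequence and a tag sequence into transition and output probabilities (standard textbook form, reconstructed here since the slide's formula was an image):

```latex
P(w_1 \dots w_n,\ t_1 \dots t_n) \;=\; \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)
```

Tagging then selects the tag sequence that maximizes this probability given the words.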
First-order Hidden Markov Models
• Training
• Estimate the transition probabilities P(t_i | t_{i-1}) and the output probabilities P(w_i | t_i)
• By counting occurrences in the training data (+ smoothing)
• Using the tagger: find the most probable tag sequence for the input sentence
Machine learning using diverse features
• We want to use diverse types of information when predicting the tag.
• Example: tagging “opened” in “He opened it” as Verb
• The word is “opened”
• The suffix is “ed”
• The previous word is “He”
• … many clues
Machine learning with log-linear models
• [Formula: a model built from feature functions and feature weights.]
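The formula the slide refers to is, in its standard form (our reconstruction, with $f_k$ the feature functions and $\lambda_k$ the feature weights):

```latex
p(y \mid x) \;=\; \frac{1}{Z(x)} \exp\Big( \sum_k \lambda_k f_k(x, y) \Big),
\qquad
Z(x) \;=\; \sum_{y'} \exp\Big( \sum_k \lambda_k f_k(x, y') \Big)
```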
Machine learning with log-linear models
• Maximum likelihood estimation
• Find the parameters that maximize the conditional log-likelihood of the training data
• Gradient: empirical feature counts minus the model’s expected feature counts
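The gradient of the conditional log-likelihood $L$ has the standard form (our reconstruction, summing over training samples $(x^{(i)}, y^{(i)})$): empirical feature counts minus the model's expected feature counts:

```latex
\frac{\partial L}{\partial \lambda_k}
\;=\; \sum_i f_k\big(x^{(i)}, y^{(i)}\big)
\;-\; \sum_i \sum_{y} p\big(y \mid x^{(i)}\big) \, f_k\big(x^{(i)}, y\big)
```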
Computing likelihood and model expectation
• Example
• Two possible tags: “Noun” and “Verb”
• Two types of features: “word” and “suffix”
• [Figure: scores for “He opened it” with the middle word tagged tag = noun vs. tag = verb.]
Conditional Random Fields (CRFs) • A single log-linear model on the whole sentence • The number of classes is HUGE, so it is impossible to do the estimation in a naive way.
Conditional Random Fields (CRFs) • Solution • Let’s restrict the types of features • You can then use a dynamic programming algorithm that drastically reduces the amount of computation • Features you can use (in first-order CRFs) • Features defined on the tag • Features defined on the adjacent pair of tags
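With features restricted to single tags and adjacent tag pairs, the first-order CRF over a tag sequence $t_1 \dots t_n$ takes the standard form (our reconstruction):

```latex
p(t_1 \dots t_n \mid x)
\;=\; \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_k \lambda_k f_k(t_{i-1}, t_i, x, i) \Big)
```

Here $Z(x)$ sums over all possible tag sequences, which is exactly where dynamic programming becomes necessary.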
Features
• Feature weights are associated with states and edges of the tag lattice
• [Figure: the Noun/Verb lattice for “He has opened it”]
• State feature example: W0 = He & Tag = Noun
• Edge feature example: Tag_left = Noun & Tag_right = Noun
A naive way of calculating Z(x)
Noun Noun Noun Noun = 7.2    Verb Noun Noun Noun = 4.1
Noun Noun Noun Verb = 1.3    Verb Noun Noun Verb = 0.8
Noun Noun Verb Noun = 4.5    Verb Noun Verb Noun = 9.7
Noun Noun Verb Verb = 0.9    Verb Noun Verb Verb = 5.5
Noun Verb Noun Noun = 2.3    Verb Verb Noun Noun = 5.7
Noun Verb Noun Verb = 11.2   Verb Verb Noun Verb = 4.3
Noun Verb Verb Noun = 3.4    Verb Verb Verb Noun = 2.2
Noun Verb Verb Verb = 2.5    Verb Verb Verb Verb = 1.9
Sum = 67.5
Dynamic programming
• Results of intermediate computation can be reused.
• [Figure: forward pass over the Noun/Verb lattice for “He has opened it”, accumulating scores left to right.]
Dynamic programming
• Results of intermediate computation can be reused.
• [Figure: backward pass over the Noun/Verb lattice for “He has opened it”, accumulating scores right to left.]
Dynamic programming
• Computing marginal distribution
• [Figure: combining forward and backward scores at a state of the “He has opened it” lattice gives its marginal probability.]
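The forward recursion can be checked against the naive enumeration on a toy version of the “He has opened it” lattice (all potentials below are made-up numbers for illustration, not the values from the slides):

```python
import math
from itertools import product

TAGS = ["Noun", "Verb"]
WORDS = ["He", "has", "opened", "it"]

# Made-up state potentials (score of tagging word i with a tag).
STATE = {("He", "Noun"): 2.0, ("He", "Verb"): 0.5,
         ("has", "Noun"): 0.7, ("has", "Verb"): 1.8,
         ("opened", "Noun"): 0.6, ("opened", "Verb"): 2.2,
         ("it", "Noun"): 1.9, ("it", "Verb"): 0.4}

# Made-up edge potentials (score of an adjacent tag pair).
EDGE = {("Noun", "Noun"): 1.1, ("Noun", "Verb"): 1.6,
        ("Verb", "Noun"): 1.5, ("Verb", "Verb"): 0.6}

def z_naive():
    """Enumerate all 2^4 tag sequences and sum their scores."""
    total = 0.0
    for seq in product(TAGS, repeat=len(WORDS)):
        score = STATE[(WORDS[0], seq[0])]
        for i in range(1, len(WORDS)):
            score *= EDGE[(seq[i - 1], seq[i])] * STATE[(WORDS[i], seq[i])]
        total += score
    return total

def z_forward():
    """Forward dynamic programming: reuse the total score of every prefix."""
    alpha = {t: STATE[(WORDS[0], t)] for t in TAGS}
    for i in range(1, len(WORDS)):
        alpha = {t: STATE[(WORDS[i], t)] *
                    sum(alpha[p] * EDGE[(p, t)] for p in TAGS)
                 for t in TAGS}
    return sum(alpha.values())

print(math.isclose(z_naive(), z_forward()))  # → True
```

The naive sum touches 2^n sequences; the forward pass touches only n × |tags|^2 terms, which is why CRF training is tractable.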
Maximum entropy learning and Conditional Random Fields • Maximum entropy learning • Log-linear modeling + MLE • Parameter estimation • Likelihood of each sample • Model expectation of each feature • Conditional Random Fields • Log-linear modeling on the whole sentence • Features are defined on states and edges • Dynamic programming
POS tagging algorithms • Performance on the Wall Street Journal corpus
POS taggers • Brill’s tagger • http://www.cs.jhu.edu/~brill/ • TnT tagger • http://www.coli.uni-saarland.de/~thorsten/tnt/ • Stanford tagger • http://nlp.stanford.edu/software/tagger.shtml • SVMTool • http://www.lsi.upc.es/~nlp/SVMTool/ • GENIA tagger • http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
Tagging errors made by a WSJ-trained POS tagger … and membrane potential after mitogen binding. CC NN NN IN NN JJ … two factors, which bind to the same kappa B enhancers… CD NNS WDT NN TO DT JJ NN NN NNS … by analysing the Ag amino acid sequence. IN VBG DT VBG JJ NN NN … to contain more T-cell determinants than … TO VB RBR JJ NNS IN Stimulation of interferon beta gene transcription in vitro by NN IN JJ JJ NN NN IN NN IN
Taggers for general text do not work well on biomedical text
• Performance of the Brill tagger evaluated on 1,000 randomly selected MEDLINE sentences: 86.8% (Smith et al., 2004)
• Accuracies of a WSJ-trained POS tagger evaluated on the GENIA corpus (Tsuruoka et al., 2005)
MedPost (Smith et al., 2004)
• Hidden Markov Models (HMMs)
• Training data: 5,700 sentences randomly selected from various thematic subsets
• Accuracy: 97.43% (native tagset), 96.9% (Penn tagset), evaluated on 1,000 sentences
• Available from ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost.tar.gz
Training POS taggers with bio-corpora (Tsuruoka and Tsujii, 2005)
Performance on new data • Relative performance evaluated on recent abstracts selected from three journals: • - Nucleic Acid Research (NAR) • - Nature Medicine (NMED) • - Journal of Clinical Investigation (JCI)
Chunking (shallow parsing)
• A chunker (shallow parser) segments a sentence into non-recursive phrases:
[He]NP [reckons]VP [the current account deficit]NP [will narrow]VP [to]PP [only #1.8 billion]NP [in]PP [September]NP .
Extracting noun phrases from MEDLINE(Bennett, 1999) • Rule-based noun phrase extraction • Tokenization • Part-Of-Speech tagging • Pattern matching Noun phrase extraction accuracies evaluated on 40 abstracts
Chunking with Machine learning • Chunking performance on Penn Treebank
Machine learning-based chunking • Convert a treebank into sentences that are annotated with chunk information. • CoNLL-2000 data set • http://www.cnts.ua.ac.be/conll2000/chunking/ • The conversion script is available • Apply a sequence tagging algorithm such as HMM, MEMM, CRF, or Semi-CRF. • YamCha: an SVM-based chunker • http://www.chasen.org/~taku/software/yamcha/
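In the CoNLL-2000 representation each token carries a B-X / I-X / O chunk tag; a short decoder (a sketch, function name is ours) recovers the phrases from a tagged sequence:

```python
def bio_to_chunks(tokens, tags):
    """Group tokens into chunks from B-X / I-X / O tags (CoNLL-2000 style)."""
    chunks, cur, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        # A chunk ends at O, at a new B-, or at an I- of a different type.
        if tag == "O" or tag.startswith("B-") or \
           (tag.startswith("I-") and tag[2:] != ctype):
            if cur:
                chunks.append((ctype, " ".join(cur)))
                cur, ctype = [], None
        if tag.startswith("B-") or (tag.startswith("I-") and not cur):
            cur, ctype = [tok], tag[2:]
        elif tag.startswith("I-"):
            cur.append(tok)
    if cur:
        chunks.append((ctype, " ".join(cur)))
    return chunks

tokens = "He reckons the current account deficit".split()
tags = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP"]
print(bio_to_chunks(tokens, tags))
# → [('NP', 'He'), ('VP', 'reckons'), ('NP', 'the current account deficit')]
```

This tag-per-token view is what lets sequence taggers (HMM, MEMM, CRF, Semi-CRF) be reused directly for chunking.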
GENIA tagger • Algorithm: Bidirectional MEMM • POS tagging • Trained on WSJ, GENIA and Penn BioIE • Accuracy: 97-98% • Shallow parsing • Trained on WSJ and GENIA • Accuracy: 90-94% • Can output base forms • Available from http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
Named-Entity Recognition
• Recognize named entities in a sentence
• Gene/protein names
• Protein, DNA, RNA, cell_line, cell_type
• Example:
We have shown that [interleukin-1]protein ([IL-1]protein) and [IL-2]protein control [IL-2 receptor alpha (IL-2R alpha) gene]DNA transcription in [CD4-CD8- murine T lymphocyte precursors]cell_line.
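The simplest NER baseline is longest-match lookup in a gazetteer (a sketch with a toy dictionary; the systems evaluated in the following slides are learned, not dictionary-based):

```python
def dictionary_ner(tokens, gazetteer):
    """Longest-match lookup of token spans in an entity dictionary."""
    entities, i = [], 0
    while i < len(tokens):
        hit = None
        # Try the longest span starting at position i first.
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in gazetteer:
                hit = (span, gazetteer[span], j)
                break
        if hit:
            entities.append((hit[0], hit[1]))
            i = hit[2]
        else:
            i += 1
    return entities

gaz = {"interleukin-1": "protein", "IL-2": "protein",
       "IL-2 receptor alpha": "protein"}
print(dictionary_ner("interleukin-1 and IL-2 receptor alpha".split(), gaz))
# → [('interleukin-1', 'protein'), ('IL-2 receptor alpha', 'protein')]
```

Dictionaries miss spelling variants and newly coined gene names, which is one reason the shared-task systems rely on machine learning.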
Performance of biomedical NE recognition • Shared task data for Coling 2004 BioNLP workshop • - entity types: protein, DNA, RNA, cell_type, and cell_line
Features
• Classification models and main features used in NLPBA (Kim, 2004)
• Classification Model (CM): S: SVM; H: HMM; M: MEMM; C: CRF
• Features: lx: lexical features; af: affix information (character n-grams); or: orthographic information; sh: word shapes; gn: gene sequences; gz: gazetteers; po: part-of-speech tags; np: noun phrase tags; sy: syntactic tags; tr: word triggers; ab: abbreviations; ca: cascaded entities; do: global document information; pa: parentheses handling; pre: previously predicted entity tags
• External resources: B: British National Corpus; W: WWW; V: virtually generated corpus; M: MEDLINE
CFG parsing
• [Figure: phrase-structure tree (S, VP, NP, QP nodes over the POS tags VBN NN VBD DT JJ CD CD NNS .) for “Estimated volume was a light 2.4 million ounces .”]
Phrase structure + head information
• [Figure: the same parse tree for “Estimated volume was a light 2.4 million ounces .”, with the head word marked on each phrase.]
Dependency relations
• [Figure: dependency arcs between the words of “Estimated volume was a light 2.4 million ounces .”]