Biomedical Named Entity Recognition from Text using Genetic Algorithm Based Classifier Subset Selection Nazife Dimililer Supervisor: Asst. Prof. Dr. Ekrem Varoğlu
Outline
• Motivation
• Background
  • Overview of IE tasks
  • Definition of NER
  • Corpus Used
• Objective of Thesis
• Related Work
• Proposed System
  • Corpus
  • Individual Classifiers
  • Multi-Classifier System
• Future Work
Motivation of the Thesis
• Vast amount of biomedical literature available online
• Need for
  • intelligent information retrieval
  • automatically populating databases
  • document understanding/summarization
  • …
• NER is the first step of all IE tasks
• Annotated corpora: GENIA, BioCreative, FlyBase
• Room for improvement
• Applicability to other domains
Background: What is Named Entity Recognition?
Named entity recognition (NER) is a subtask of information extraction that identifies and labels strings of text as belonging to predefined classes (named entities).
Example NEs: persons, organizations, expressions of time, drugs, proteins, cell types.
NER poses a significant challenge in the biomedical domain.
Background: Overview of IE Tasks in the Biomedical Domain
Background: Sources of Problems in Biomedical NER
• Irregularities and mistakes in tokenization and tagging
• (Irregular) use of special symbols
• Lack of standard naming conventions
• Changing names and notations
• Continuous introduction of new names
• Abbreviations, synonyms, variations
• Homonyms and ambiguous names
• Cascaded named entities
• Complicated constructions
  • comma-separated lists
  • disjunctions and conjunctions
• Inclusion of adjectives as part of some NEs
Background: State of Current Research in Biomedical NER
A large number of systems have been proposed for biomedical NER:
• systems based on individual classifiers
• multiple-classifier systems with a small number of members
Direct comparison is difficult because systems differ in their use of external sources, hand-crafted post-processing, corpora with differing NEs, and evaluation schemes.
Background: State of Current Research in Biomedical NER (cont.)
• An important milestone in this area was the Bio-Entity Recognition Task at JNLPBA in 2004.
• The same systems as in the newswire domain were used with slight changes.
• Rich feature sets were exploited.
• Successful classifiers relied on external resources and post-processing.
• Similar systems were used in the BioCreative tasks in 2004, 2006, and 2009, and in other publications.
Objective of the Thesis
• Improve biomedical NER performance
• Use a benchmark corpus
• Apply classifier selection techniques to biomedical NER
• Train a reliable and diverse set of individual classifiers
• Utilize a large set of individual classifiers
• Use a genetic algorithm to perform vote-based classifier subset selection and form an ensemble
Corpus Used
JNLPBA data: based on the GENIA corpus v. 3.02.
Contains 5 entities: protein, RNA, DNA, cell line, cell type.
IOB2 tagged: 11 classes
• B-protein, I-protein
• B-RNA, I-RNA
• B-DNA, I-DNA
• B-cell_line, I-cell_line
• B-cell_type, I-cell_type
• O (Outside)
Format of JNLPBA Data (one token and its tag per line)
The O
peri-kappa B-DNA
B I-DNA
site I-DNA
mediates O
human B-DNA
immunodeficiency I-DNA
virus I-DNA
type I-DNA
2 I-DNA
enhancer I-DNA
activation O
…
Human O
immunodeficiency O
virus O
type O
2 O
…
Our O
data O
suggest O
that O
lipoxygenase B-protein
metabolites I-protein
activate O
ROI O
formation O
which O
then O
induce O
IL-2 B-protein
expression O
via O
NF-kappa B-protein
B I-protein
activation O
. O
Corpus: Data Set Statistics
• Training data: abstracts retrieved with the MeSH terms "human", "blood cells", and "transcription factors"
• Test data: drawn from the super-domain of "blood cells" and "transcription factors"
(Data set statistics table not reproduced here.)
Individual Classifier Architecture
• Why use SVMs?
  • Successfully used in many NLP and bioinformatics tasks
    • CoNLL 2000 and CoNLL 2004 shared tasks
    • BioCreAtIvE competition 2004
  • Ability to handle large feature sets
• IOB2 notation is used to represent entities
• Multi-class classification problem
• Features extracted from the training data only
Individual Classifier System Used
• YamCha: a generic, customizable, open-source text chunker based on Support Vector Machines
• Tunable parameters:
  • parsing direction: left-to-right / right-to-left
  • range of the context window
  • degree of the polynomial kernel
Context Window
The default YamCha setting is "F:-2..2:0.. T:-2..-1": static features (columns from 0 onward) of the tokens at positions -2..+2 relative to the current token, plus the previously predicted tags at positions -2 and -1.
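As a rough sketch of how such a window specification expands into per-token features (the function name and padding symbols are illustrative, not YamCha's actual implementation):

```python
def window_features(tokens, tags, i, tok_range=(-2, 2), tag_range=(-2, -1)):
    """Expand a YamCha-style context window for token i.

    "F:-2..2:0.." -> surface tokens at offsets -2..+2;
    "T:-2..-1"    -> previously predicted tags at offsets -2 and -1.
    Positions outside the sentence get a boundary padding symbol.
    """
    feats = {}
    for off in range(tok_range[0], tok_range[1] + 1):
        j = i + off
        feats[f"F{off:+d}"] = tokens[j] if 0 <= j < len(tokens) else "__BOS/EOS__"
    for off in range(tag_range[0], tag_range[1] + 1):
        j = i + off
        feats[f"T{off:+d}"] = tags[j] if 0 <= j < len(tags) else "__BOS__"
    return feats
```

Note that only *preceding* tags are available, since the classifier produces tags left to right (or right to left for backward parsing).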
Training Individual Classifiers
All individual classifiers are trained using the one-vs-all approach, varying:
• backward or forward parse direction
• context windows
• degree of the polynomial kernel
• features and feature combinations
Feature Types
All classifiers are based on SVMs. Feature types used:
• lexical features
• morphological features
• orthographic features
• surface word feature
Tokens and the previously predicted tags are also used as features.
Features Used
• Tokens: the words in the training data; the token to be classified plus the preceding and following tokens, as specified by the context window.
• Previously predicted tags: the predicted tags of the preceding tokens, as specified by the context window.
Features Used (cont.)
Lexical features represent the grammatical functions of tokens.
• Part-of-speech tags: Penn Treebank tags added using the GENIA tagger. Ex.: adverb, determiner, adjective.
• Phrase tags: phrasal categories added using an SVM trained on newswire data. Ex.: noun phrase, verb phrase, adjective phrase.
• Base noun phrase tags: basic noun phrases tagged using the fnTBL tagger.
Features Used: Morphological
Different n-grams of the current token. An n-gram of a token is formed from the first or last n characters of the token:
• last 1/2/3/4 letters
• first 1/2/3/4 letters
Example: TRANSCRIPTION → first 4 letters: TRAN; last 4 letters: TION.
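The n-gram extraction described above is straightforward to sketch:

```python
def char_ngrams(token, n_max=4):
    """First-n and last-n character n-grams (n = 1..n_max) of a token,
    as used for the morphological feature."""
    prefixes = [token[:n] for n in range(1, n_max + 1)]
    suffixes = [token[-n:] for n in range(1, n_max + 1)]
    return prefixes, suffixes
```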
Features Used: Orthographic
Also known as word formation patterns: information about the form of the word (e.g., contains uppercase letters, digits, etc.). Two different approaches are used:
• Simple: the existence of a particular word formation pattern is represented by a binary yes/no feature.
• Intricate: multiple word formation patterns are represented using a list ordered by representation score.
Features Used: Orthographic (cont.)
Intricate approach: a list of word formation patterns is formed in decreasing order of representation score. The representation score RS_{i,j} of an orthographic property i for entity label j is the fraction of tokens exhibiting property i that are labeled j:

RS_{i,j} = (number of tokens with property i labeled j) / (total number of tokens with property i)

Orthographic features with a representation score of more than 10% for Outside-tagged tokens are eliminated from the list.
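Under the assumption that RS_{i,j} is the fraction of tokens exhibiting property i that carry label j (consistent with the 10% Outside elimination rule), the score and the pruning step can be sketched as follows; the property tests here are illustrative, not the thesis's full pattern list:

```python
from collections import Counter

def representation_scores(labeled_tokens):
    """Compute RS_{i,j} for a few illustrative orthographic properties.
    `labeled_tokens` is a list of (token, label) pairs."""
    properties = {
        "init_cap":  lambda t: t[:1].isupper(),
        "has_digit": lambda t: any(c.isdigit() for c in t),
        "alpha_num": lambda t: any(c.isalpha() for c in t)
                               and any(c.isdigit() for c in t),
    }
    totals, per_label = Counter(), Counter()
    for tok, label in labeled_tokens:
        for prop, test in properties.items():
            if test(tok):
                totals[prop] += 1
                per_label[(prop, label)] += 1
    return {(p, l): per_label[(p, l)] / totals[p] for (p, l) in per_label}

def prune(scores, outside_label="O", threshold=0.10):
    """Drop properties whose score for Outside-tagged tokens exceeds 10%."""
    bad = {p for (p, l), s in scores.items()
           if l == outside_label and s > threshold}
    return {k: v for k, v in scores.items() if k[0] not in bad}
```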
Features Used: Orthographic (cont.)
(Table of the orthographic features used not reproduced here.)
Features Used: Orthographic (cont.)
Intricate use of the orthographic feature, priority based: each token is tagged with the first applicable word formation pattern on the list.
(Example not reproduced here.)
Features Used: Orthographic (cont.)
Intricate use of the orthographic feature, binary string: a binary string containing one bit to represent each word formation pattern in the list:
• initial letter capitalized
• combination of upper-case letters and other symbols
• combination of upper- and lower-case letters
• combination of upper-case letters and numbers
• contains a number
• combination of alphabetical characters and numbers
• combination of lower-case letters and other symbols
• combination of alphabetic characters and other symbols
• combination of lower-case letters and numbers
• contains an upper-case letter
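A hedged sketch of the binary-string encoding, using only four of the listed patterns (the exact property tests are illustrative):

```python
def orth_binary_string(token):
    """Binary-string orthographic feature: one bit per word formation
    pattern.  Only four of the listed patterns are shown here."""
    patterns = [
        lambda t: t[:1].isupper(),                      # initial letter capitalized
        lambda t: any(c.isupper() for c in t)
                  and any(not c.isalnum() for c in t),  # upper letter + other symbol
        lambda t: any(c.isdigit() for c in t),          # contains a number
        lambda t: any(c.isupper() for c in t)
                  and any(c.islower() for c in t),      # upper + lower case letters
    ]
    return "".join("1" if p(token) else "0" for p in patterns)
```

For example, "IL-2" sets the capitalized, upper-plus-symbol, and contains-number bits, while "NF-kappa" sets the mixed-case bit instead of the number bit.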
Features Used (cont.)
Surface words: a separate pseudo-dictionary for each entity containing the tokens with the highest counts in the training data, such that x% of all tokens in that entity's names are in the dictionary. Pseudo-dictionaries with 50%, 60%, 70%, and 80% coverage are used. Each token is tagged with a 5-bit string where each bit corresponds to the pseudo-dictionary of one entity.
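A sketch of building one pseudo-dictionary with a given coverage and emitting the 5-bit surface word feature (function names and the greedy construction are illustrative):

```python
from collections import Counter

def pseudo_dictionary(entity_tokens, coverage=0.5):
    """Greedily take the highest-count tokens occurring inside one entity
    class's names until they cover >= `coverage` of all occurrences."""
    counts = Counter(entity_tokens)
    total = sum(counts.values())
    dictionary, covered = set(), 0
    for tok, c in counts.most_common():
        if covered / total >= coverage:
            break
        dictionary.add(tok)
        covered += c
    return dictionary

def surface_feature(token, dictionaries):
    """5-bit string: bit k is set if the token occurs in the
    pseudo-dictionary of the k-th entity class."""
    return "".join("1" if token in d else "0" for d in dictionaries)
```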
Effect of Feature Extraction
• Each feature type improves performance from a different perspective:
  • precision
  • recall
  • boundaries
  • entity-based performance
• Careful combination of features improves overall performance
Effect of Parse Direction and Lexical Features
• Effect of backward parsing:
  • Precision and recall increased for both boundaries
  • Precision scores improved more than recall scores
  • An overall increase in full recall, precision, and F-score
• Effect of lexical features:
  • Single lexical features: higher precision than recall
  • Combinations: recall and precision values are more balanced
  • Combinations slightly improve both the left- and right-boundary F-scores
Individual Classifiers Effect of Morphological Features • F-score improves compared to the baseline system • Suffixes alone result in higher recall than precision • Prefixes alone result in higher precision than recall • Combination improves the overall performance • Morphological feature improves recall but degrades precision compared to the baseline
Effect of Orthographic Features
• Performance is improved by all orthographic features
• Best performance is achieved by the binary string
• For simple orthographic features, precision scores are slightly higher than recall scores
• Intricate orthographic features provide higher recall values, resulting in an overall improvement in F-scores
Effect of the Surface Word Feature
• Precision scores improved more than recall scores compared to the baseline classifier
• Improvement on the right boundary is more pronounced
• Precision score is greater than the recall score
• The pseudo-dictionary can thus be used to generate classifiers with higher precision than recall
Effect of Feature Combinations
• Some specific combinations do not significantly improve performance
• Careful combination of features is useful for improving overall performance
• Different combinations of feature/parameter sets favor different entities
Motivation for Multiple Classifier Systems
• For individual classifiers:
  • A set of carefully engineered features improves performance
  • Unfortunately, performance is still not satisfactory
• Combining multiple classifiers into ensembles:
  • The combined opinion of a number of experts is more likely to be correct than that of a single expert
Multiple Classifier System Classifier Pool • Classifiers exploiting state-of-the-art feature sets => highest F-scores • Classifiers with high precision or recall • Classifiers with high precision but low recall and vice versa • One or more classifiers providing the highest F-score for each entity
Classifier Fusion Architecture
Training phase: training data → feature extraction (feature sets 1..M, plus dictionary and context words) → SVM classifiers 1..M → GA-based classifier selection → best-fitting ensemble.
Testing phase: test data → feature extraction (feature sets 1..M) → SVM classifiers 1..M → classifier fusion → post-processing → predicted class.
Fusion Algorithm: Weighted Majority Voting
• The full-object F-score of each classifier on cross-validation data is used as its weight
• Votes are combined as a weighted sum over all classifiers
• The class that receives the highest total vote wins the competition
• Ties are broken by a random coin toss
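The weighted majority voting scheme described above can be sketched as follows (function name illustrative):

```python
import random
from collections import defaultdict

def weighted_majority_vote(predictions, weights, rng=None):
    """Fuse per-token class predictions by weighted majority voting.

    predictions : list of class labels, one per classifier
    weights     : matching list of full-object F-scores (cross-validation)
    Ties are broken by a random choice among the tied classes.
    """
    rng = rng or random.Random(0)
    votes = defaultdict(float)
    for label, w in zip(predictions, weights):
        votes[label] += w
    best = max(votes.values())
    winners = [label for label, v in votes.items() if v == best]
    return rng.choice(winners)
```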
Genetic Algorithm Setup
• Initial population: randomly generated bit strings
• Population size: 100
• Mutation rate: 2%
• Crossover rate: 70%
• Crossover operators: two-point crossover, uniform crossover
• Tournament size: 40
• Elitist population: 20%
Flow Chart of the Genetic Algorithm
Start → initialize population randomly → compute fitness of each chromosome → [loop] select parents and apply crossover → mutate offspring → compute fitness of each chromosome → apply elitist policy to form the new population → if not terminated, repeat → select the best chromosome as the resultant ensemble → end.
Genetic Algorithm Setup (cont.)
• Chromosome: the list of classifiers to be combined
• 3-fold cross-validation results are used for the individual classifiers
• Fitness of a chromosome: full-object F-score of the classifier ensemble
• Static classifier selection: each bit represents a classifier
• Proposed vote-based classifier selection: each bit represents the reliability of a classifier for predicting one class
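A GA skeleton matching the slide's settings (population 100, 2% mutation, 70% crossover, tournament size 40, 20% elitism). In the thesis the fitness of a chromosome is the ensemble's full-object F-score on cross-validation data; here it is passed in as an arbitrary function of the bit string, and only two-point crossover is shown:

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=100, generations=50,
                      p_mut=0.02, p_cross=0.70, tour_size=40,
                      elite_frac=0.20, rng=None):
    """GA skeleton: tournament selection, two-point crossover,
    bit-flip mutation, and an elitist replacement policy."""
    rng = rng or random.Random(0)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    n_elite = int(elite_frac * pop_size)

    def tournament():
        return max(rng.sample(pop, tour_size), key=fitness)

    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        new_pop = [c[:] for c in pop[:n_elite]]          # elitist policy
        while len(new_pop) < pop_size:
            a, b = tournament()[:], tournament()[:]
            if rng.random() < p_cross:                   # two-point crossover
                i, j = sorted(rng.sample(range(n_bits), 2))
                a[i:j], b[i:j] = b[i:j], a[i:j]
            for child in (a, b):
                child[:] = [bit ^ (rng.random() < p_mut) for bit in child]
                new_pop.append(child)
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)
```

With a toy "one-max" fitness (count of set bits), the skeleton converges to a nearly all-ones chromosome.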
Chromosome Structure for Static Classifier Selection
With M classifiers, the chromosome has M bits: one gene per classifier. If a gene is 1, the corresponding classifier participates in the decision for all classes; otherwise it remains silent.
Chromosome Structure for the Proposed Vote-Based Classifier Selection
With M classifiers and N classes, the chromosome has N×M bits: for each classifier, one gene per class determines whether it participates in the decision for that class.
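Decoding such a chromosome into per-classifier vote masks might look like this (the classifier-major bit layout is an assumption about the encoding):

```python
def decode_votes(chromosome, n_classes):
    """Vote-based chromosome: N x M bits, one gene per (classifier, class).
    Returns, for each classifier, the set of class indices it may vote on.
    Assumes bits are laid out classifier by classifier."""
    assert len(chromosome) % n_classes == 0
    n_classifiers = len(chromosome) // n_classes
    return [
        {c for c in range(n_classes)
         if chromosome[m * n_classes + c] == 1}
        for m in range(n_classifiers)
    ]
```

During fusion, a classifier's prediction would then be counted only when the predicted class is in its mask.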
Motivation for Vote-Based Classifier Subset Selection
• A classifier cannot predict all classes with the same performance
• A subset of its predictions may be unreliable
• A subset of its predictions may be correlated with the predictions of other classifiers
• Allow a classifier to vote only for the classes it trusts
Multiple Classifier System Multiple Classifier Systems used • Single Best (SB) : not an MCS, included as a reference • Full Ensemble (FE) : Ensemble containing all classifiers • Forward Selection (FS) : Ensemble formed using forward selection • Backward Selection (BS) : Ensemble formed using Backward Selection • GA generated Static Ensemble (GAS) : Ensemble formed using GA • Vote Based Classifier Subset Selection using GA (VBS) : vote based ensemble formed using GA.
Performance of Ensembles
Single Best < Full Ensemble ≈ Forward Selection < Backward Selection ≈ GA Static Ensemble (72.51) < Proposed Method
Discussion on Ensembles
• All ensembles outperform SB
• VBS has the highest F-score
• GA-based ensembles perform better
• BS chose 38 classifiers
  • FE and BS similar: precision >> recall
• FS and GAS chose 9 classifiers
  • Precision and recall more balanced
• VBS is different: uses 46 classifiers partially
  • Recall > precision
Discussion on Ensembles (cont.)
• BS eliminates mainly classifiers using only two features
  • All eliminated classifiers are backward parsed
  • Backward- and forward-parsed classifiers are more balanced
• FS and GAS are almost the same:
  • 8 classifiers are identical
  • The 9th classifier is forward parsed for GAS
  • Even though the 9th classifier has a lower F-score, the GAS ensemble achieves a higher F-score
Entity-Based F-scores for the Ensembles
(Table comparing Single Best, Full Ensemble, Forward Selection, Backward Selection, GA Static Ensemble, and the proposed VBS not reproduced here.)
Discussion of Entity-Based Scores
• VBS achieves the best scores for all entities except RNA (the smallest data set), where the GAS ensemble outperforms VBS
• VBS obtains its highest F-score for protein (the largest data set) and its lowest F-score for cell line
Distribution of Vote Counts for the VBS
• None of the classifiers is eliminated from the ensemble
• None of the classifiers votes for all eleven classes