Development of a Pediatric Text-Corpus for Part-of-Speech Tagging

John Pestian1, Lukasz Itert1,2, and Włodzisław Duch2,3
1 BioMedical Informatics, 3333 Burnet Avenue, Children’s Hospital Research Foundation, Cincinnati, OH 45229, USA, {jpestian, litert}@cchmc.org
2 Department of Informatics, Nicolaus Copernicus University, 87-100 Torun, Poland, duch@ieee.org
3 School of Computer Engineering, Nanyang Technological University, Singapore
Zakopane, V-2004

Outline • The project • Tools • Data description and problems • Software • Results • Plans…

CCHMC project outline (simplified)
• INPUT (free medical text)
• Preprocessing
• MetaMap input
• MetaMap software: UMLS concept discovery and indexing
• Annotations: concept space (UMLS concepts)
• Decision support system
• Conclusions, important relations, any useful information, i.e. hypothesis generation and validation

CCHMC project
• Semantic Retrieval System: retrieving knowledge from the clinical annotations database, discovering relations, extracting rules, facts, and any other useful information; semantic analysis of medical text.
• Ontology-based approach (merging UMLS and a common-sense ontology?)
• XML as the standard
• In fact: background for expert systems (an artificial physician's helper?)

Goals (too ambitious?)
In the final stage we would like our system to help answer questions like:
• Is X related to Y?
• Will X help a patient with Y?
• What causes X?
• What causes changes in X?
• What are the therapy options for X?

UMLS* – the biggest medical ontology
• The Unified Medical Language System was started in 1986 and is maintained by the National Library of Medicine (NLM).
• Goal: “to aid the development of systems that help health professionals and researchers retrieve and integrate electronic biomedical information from a variety of sources.”
• Consists of three main parts:
  • METATHESAURUS
  • SEMANTIC NETWORK
  • THE SPECIALIST LEXICON
*http://www.nlm.nih.gov/research/umls/

UMLS in numbers
• Metathesaurus: 875,255 concepts and 2.14 million concept names in its source vocabularies.
• Semantic Network: 135 semantic types, 54 relationships.
• SPECIALIST lexicon: 183,000 lexical entries covering more than 292,000 strings.

UMLS – Example (keyword: “virus”)
• Metathesaurus:
  Concept: Virus, CUI: C0042776, Semantic Type: Virus
  Definition (1 of 3): “Group of minute infectious agents characterized by a lack of independent metabolism and by the ability to replicate only within living host cells; have capsid, may have DNA or RNA (not both)”. (CRISP Thesaurus)
  Synonyms: Virus, Vira, Viridae
• Semantic Network:
  “Virus” causes “Disease or Syndrome”. The network relates concepts, not words!
  Other relations: “interacts with”, “contains”, “consists of”, “result of”, “related to”, …
  Other types: “Body location or region”, “Injury or Poisoning”, “Diagnostic procedure”, …

UMLS – Example (cont'd)
• SPECIALIST lexicon: {base=virus entry=E0064702 cat=noun variants=reg}
  “reg” means that the noun takes the regular plural form.
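As an illustration of how such a lexicon record could be read by a program, here is a minimal Python sketch. It assumes the one-line `{key=value ...}` layout shown on the slide (the distributed lexicon files may be laid out differently), and `regular_plural` is a hypothetical helper that approximates ordinary English pluralization, not the SPECIALIST rule set itself.

```python
import re


def parse_lexicon_record(record: str) -> dict:
    """Parse a one-line '{key=value ...}' record into a dict of strings."""
    body = record.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("record must be wrapped in braces")
    # Each field is key=value; values in this example contain no spaces.
    return dict(re.findall(r"(\w+)=(\S+)", body[1:-1]))


def regular_plural(noun: str) -> str:
    """Hypothetical helper: approximate English regular plural formation."""
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"


entry = parse_lexicon_record("{base=virus entry=E0064702 cat=noun variants=reg}")

# Expand the base form into the word forms a tagger lexicon would need.
forms = [entry["base"]]
if entry["cat"] == "noun" and entry["variants"] == "reg":
    forms.append(regular_plural(entry["base"]))

print(entry)
print(forms)
```

Expanding each entry into its inflected forms is one way a lexicon like this could be turned into a full-form word list for a tagger.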

UMLS – additional features
• Extended API (Java) with 20 packages and 151 classes.
• Easy support for other languages (currently 15).
• The UMLS Knowledge Server: a set of web-based interaction tools and a programmer interface that give users and developers access to the biomedical terminologies found within the UMLS.

MetaMap Transfer (MMTx)*
• MetaMap: a set of tools (programs) that map arbitrary text to concepts in the UMLS Metathesaurus. Equivalently, it discovers Metathesaurus concepts in text.
• Built-in part-of-speech tagger (helps in syntactic analysis).
• Our goal in using MetaMap: the best possible accuracy, i.e. correctly mapping as many concepts as possible.
*http://mmtx.nlm.nih.gov/

Medical data
• Clinical annotations: nurses' and surgical notes, discharge summaries. Information about symptoms, procedures, findings, therapeutic response.
• Typical annotation: 4 y/o with H/O JRA and 6d H/O fever, headache, photophobia. Presented originally to Grant County and St Luke's West and was started on Ceftriaxone on 7/30 for presumed sepsis.
• Problems:
  • Multiple abbreviations, synonyms, punctuation, misspellings (handwritten text), capitalization, spacing, non-letter characters.
  • Understandable by specialists only.
  • The data set is huge.

Initial data processing – Encryption Broker software
Features:
• Segments raw input text into paragraphs, sentences, and words.
• Searches for medical and common-sense abbreviations and symbols in the text and replaces them with the appropriate definitions.
• Searches for ambiguous abbreviations and replaces them with the proper meaning based on the surrounding text.
• Processes multi-word terms properly.
• Makes confidential data harmless.
• Handles punctuation, special symbols, and exceptions properly.
• The output text is ready to be tagged.
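The preprocessing features listed above can be sketched in a few lines of Python. This is not the Encryption Broker itself: the abbreviation table, the `[DATE]` placeholder, and all function names are assumptions made for illustration, and a real system would need context-sensitive disambiguation and much broader de-identification (names, hospitals, record numbers).

```python
import re

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {
    "y/o": "year old",
    "H/O": "history of",
    "JRA": "juvenile rheumatoid arthritis",
}

# Dates such as 7/30 or 7/30/2004 are masked as a stand-in for
# "making confidential data harmless".
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b")
# Durations such as "6d" are spelled out.
DURATION_PATTERN = re.compile(r"\b(\d+)d\b")


def expand_abbreviations(text):
    """Replace each known abbreviation when it stands alone as a token."""
    for abbr, meaning in ABBREVIATIONS.items():
        pattern = r"(?<!\w)" + re.escape(abbr) + r"(?!\w)"
        text = re.sub(pattern, meaning, text)
    return text


def preprocess(text):
    """Mask dates, spell out durations, then expand abbreviations."""
    text = DATE_PATTERN.sub("[DATE]", text)
    text = DURATION_PATTERN.sub(r"\1 day", text)
    return expand_abbreviations(text)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


note = ("4 y/o with H/O JRA and 6d H/O fever, headache, photophobia. "
        "Was started on Ceftriaxone on 7/30 for presumed sepsis.")
sentences = split_sentences(preprocess(note))
for sentence in sentences:
    print(sentence)
```

Dates are masked before abbreviations are expanded so that slash-separated abbreviations such as "y/o" and numeric dates cannot interfere with each other. The naive sentence splitter would break on abbreviations that end in a period, which is one of the "exceptions" a production tool has to handle.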

Tagging process
Creating the training set:
• Random sample of data from a set of 20,000,000 words
• Hand tagging in India by a linguistic group
• Uses the Penn Treebank tagset
• Find UMLS multi-word terms in the text
Training:
• Training TreeTagger
Validation:
• 10-fold cross-validation (10-CV); each 1/10th test part is created by the bootstrapping method, and the rest forms the training part
• The final result is the average of three independent 10-CV procedures
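The validation scheme can be sketched as follows. The slide does not spell out the sampling details, so this Python sketch makes assumptions: each test part is drawn by sampling sentence indices with replacement (duplicates collapse), everything not drawn forms the training part, and a simple most-frequent-tag baseline stands in for TreeTagger. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from statistics import mean


def unigram_accuracy(train, test):
    """Stand-in tagger: most frequent tag per word, scored on the test part."""
    counts = defaultdict(Counter)
    for sentence in train:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    all_tags = Counter(tag for s in train for _, tag in s)
    default = all_tags.most_common(1)[0][0]  # fallback for unknown words
    correct = total = 0
    for sentence in test:
        for word, tag in sentence:
            seen = counts.get(word.lower())
            guess = seen.most_common(1)[0][0] if seen else default
            correct += guess == tag
            total += 1
    return correct / total if total else 0.0


def bootstrap_cv(items, evaluate, folds=10, repeats=3, seed=0):
    """Average of `repeats` independent runs of `folds` train/test splits."""
    rng = random.Random(seed)
    n = len(items)
    run_means = []
    for _ in range(repeats):
        fold_scores = []
        for _ in range(folds):
            # Draw about n/folds test indices with replacement.
            test_idx = {rng.randrange(n) for _ in range(n // folds)}
            test = [items[i] for i in sorted(test_idx)]
            train = [items[i] for i in range(n) if i not in test_idx]
            fold_scores.append(evaluate(train, test))
        run_means.append(mean(fold_scores))
    return mean(run_means)


# Toy hand-tagged corpus (Penn Treebank tags), repeated to get 20 sentences.
toy = [
    [("the", "DT"), ("patient", "NN"), ("has", "VBZ"), ("fever", "NN")],
    [("fever", "NN"), ("persists", "VBZ")],
    [("the", "DT"), ("child", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("headache", "NN"), ("persists", "VBZ")],
] * 5

score = bootstrap_cv(toy, unigram_accuracy)
print(f"mean accuracy over 3 x 10 splits: {score:.3f}")
```

Passing the evaluation routine as a parameter keeps the resampling logic independent of the tagger, so the baseline could be swapped for a function that trains and runs an external tagger on the same splits.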

TreeTagger*
• Part-of-speech tagger based on an ID3 decision tree.
• Tagging is performed by analyzing the context of a word using trigrams.
• Needs both a lexicon (a base list of words) and a training set (which provides information about correlations between part-of-speech tags within a sentence).
• Fast, and handles misspelled words and words missing from the lexicon well.
• As the base lexicon we use TreeTagger's own lexicon and the UMLS lexicon converted to a suitable format.
*H. Schmid, Probabilistic Part-of-Speech Tagging Using Decision Trees, in Proceedings of the Conference on New Methods in Language Processing, 1994.

Tagging results (395,000-word set)
Figure 1. The tagging accuracy vs. the training set size. All results are from the 10-CV tests. Black squares show points where actual calculations were done.

Training vs. testing accuracy
Figure 2. Accuracies on the training set (dashed line) and the test set (solid line) for 215,000 words.

Unique (occurring once) trigrams Figure 3. The percentage of unique trigrams vs. the training set size.
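A statistic like the one plotted in Figure 3 can be computed with a short Python sketch. Two assumptions are made here because the slide does not state them: the trigrams are taken over part-of-speech tag sequences within a sentence, and the percentage is relative to the number of distinct trigrams. The function names are hypothetical.

```python
from collections import Counter


def trigrams(sequences):
    """Yield every run of three consecutive symbols within each sequence."""
    for seq in sequences:
        for i in range(len(seq) - 2):
            yield tuple(seq[i:i + 3])


def unique_trigram_percentage(sequences):
    """Percentage of distinct trigrams that occur exactly once."""
    counts = Counter(trigrams(sequences))
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return 100.0 * singletons / len(counts)


tag_sequences = [
    ["DT", "NN", "VBD", "DT", "NN"],
    ["DT", "NN", "VBD", "JJ"],
]
pct = unique_trigram_percentage(tag_sequences)
print(pct)  # 3 of the 4 distinct trigrams occur once: 75.0
```

The same function works on word sequences instead of tag sequences; only the input changes.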

Conclusions
• The accuracy of the tagging system increases with the size of the training set.
• The number of unique trigrams decreases as the training set grows, so ML methods are expected to give better results.
• Multi-word support helps to increase the accuracy; a bigger multi-word contribution is expected in the MetaMap mapping process.
• Results on the training set indicate that the accuracy limit for tests is near 93%.

Plans
1. Cleaning the text: many misspellings.
2. About 1000 ambiguous acronyms in >700K trigrams: disambiguation rules?
3. Recognition memory techniques applied to sequence => term mapping.
4. Semantic corrections via high-dimensional vector representation of the words; later episodic memory.
5. Creation of XML and later DAML versions of annotations from unstructured text.
6. Discovery System for intelligent decision support.
