Applications of Text Mining
Ewan Klein
School of Informatics & NeSC

Text Mining
• Goals
  • Extract useful information from large bodies of unstructured or semi-structured documents
  • Look for patterns in natural language text
  • Driven by application needs
• Three Areas:
  • Adding Metadata
    • E.g., identify Dublin Core elements from document headers
  • Information Extraction
    • Identify nuggets of text data and marshal them into a fixed format
  • Assisting Curation
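
As a rough illustration of the "Adding Metadata" and "Information Extraction" bullets, the sketch below pulls labelled lines out of a plain-text document header and marshals them into a fixed record keyed by Dublin Core element names. The label-to-element mapping (`LABEL_TO_DC`), the `Label: value` header layout, and the function name `extract_dublin_core` are assumptions made for this example, not something described in the slides; real headers are far messier and would normally need a learned model rather than one regular expression.

```python
import re

# Hypothetical mapping from header labels to Dublin Core element names.
# "title", "creator", "date" and "subject" are Dublin Core elements;
# the labels on the left are assumed for this example only.
LABEL_TO_DC = {
    "title": "title",
    "author": "creator",
    "date": "date",
    "keywords": "subject",
}

# Matches lines of the form "Label: value" (assumed header layout).
HEADER_LINE = re.compile(r"^\s*(\w+)\s*:\s*(.+?)\s*$")


def extract_dublin_core(header_text):
    """Return a dict mapping Dublin Core element names to lists of values."""
    record = {}
    for line in header_text.splitlines():
        match = HEADER_LINE.match(line)
        if not match:
            continue  # not a "Label: value" line
        label, value = match.group(1).lower(), match.group(2)
        element = LABEL_TO_DC.get(label)
        if element is not None:
            record.setdefault(element, []).append(value)
    return record


# Illustrative header built from the title slide of this deck.
SAMPLE_HEADER = """\
Title: Applications of Text Mining
Author: Ewan Klein
Affiliation: School of Informatics & NeSC
"""

if __name__ == "__main__":
    print(extract_dublin_core(SAMPLE_HEADER))
```

Lines with labels that have no mapping (here, `Affiliation`) are skipped, which keeps the output in a fixed format at the cost of dropping information.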

Text Mining and Curation
• Example workflow:
  • Make an observation
  • Search the research literature for knowledge
  • Incorporate relevant information into database
• Challenges:
  • Current Information Retrieval (IR) techniques are often too imprecise
    • Example query: "Which enzymes act as catalysts in the glycolysis pathway?"
    • We want to identify a relation between two entities
  • Move to augmenting IR with more knowledge of text structure
    • Mostly supervised machine learning techniques
    • Still need training data for each domain
  • Need to integrate text mining into Grid applications
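
To make the contrast with keyword-based IR concrete, the sketch below returns structured (enzyme, relation, target) triples instead of whole documents. It uses a single hand-written pattern as a stand-in; the slides point to supervised machine learning for this step, so the regular expression, the function name `extract_catalysis_relations`, and the sample sentences are all assumptions for illustration only.

```python
import re

# Assumed pattern: an enzyme name (a word ending in "ase", optionally preceded
# by one modifier word), the verb "catalyzes"/"catalyses", then the target
# phrase up to the end of the clause.
CATALYSIS = re.compile(
    r"\b((?:[A-Za-z]+\s)?[A-Za-z]+ase)\s+cataly[sz]es\s+(?:the\s+)?([^.;]+)"
)


def extract_catalysis_relations(text):
    """Return (enzyme, relation, target) triples found in the text."""
    return [
        (m.group(1), "catalyzes", m.group(2).strip())
        for m in CATALYSIS.finditer(text)
    ]


# Illustrative sentences, not taken from the slides.
SAMPLE_TEXT = (
    "Hexokinase catalyzes the phosphorylation of glucose. "
    "Glycolysis occurs in the cytosol. "
    "Pyruvate kinase catalyses the transfer of a phosphate group to ADP."
)

if __name__ == "__main__":
    for triple in extract_catalysis_relations(SAMPLE_TEXT):
        print(triple)
```

A keyword search for "glycolysis" would return the second sentence, which names no enzyme at all; the relation pattern skips it and returns only sentences that link two entities.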

BlueDwarf for Text Mining
• BioCreative Competition
  • Joint entry with Stanford
  • Recognition of drug names, chemical names, and protein names in MEDLINE abstracts
  • Java maximum entropy tagger
    • Used roughly 700,000 features in the early stages
    • Java memory size of 1950 MB
    • Died on available Informatics and Stanford machines
• BlueDwarf
  • Arrived at 1,247,77 features, memory: 2560 MB
  • Several experiments running in parallel
  • Provisional results: top-scoring
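
For readers unfamiliar with maximum entropy tagging, the sketch below shows the general idea in miniature: each token is turned into a set of binary features, and a multinomial logistic regression (maximum entropy) model assigns it a label. This is not the competition system. The feature templates, the `MaxEntTagger` class, the toy training sentences, and the training settings are all assumptions, and the sketch classifies each token independently with no sequence decoding.

```python
import math


def token_features(tokens, i):
    """Binary features for token i (hypothetical templates for illustration)."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"
    return [
        "bias",
        "word=" + word.lower(),
        "suffix3=" + word[-3:].lower(),
        "has_digit=" + str(any(c.isdigit() for c in word)),
        "is_capitalized=" + str(word[:1].isupper()),
        "prev=" + prev_word,
    ]


class MaxEntTagger:
    """Per-token maximum entropy classifier trained by stochastic gradient ascent."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.weights = {}  # (feature, label) -> weight

    def probabilities(self, feats):
        scores = [
            sum(self.weights.get((f, y), 0.0) for f in feats) for y in self.labels
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return {y: e / total for y, e in zip(self.labels, exps)}

    def train(self, examples, epochs=100, rate=0.2):
        for _ in range(epochs):
            for feats, gold in examples:
                probs = self.probabilities(feats)
                for y in self.labels:
                    # Gradient of the log-likelihood: observed minus expected.
                    gradient = (1.0 if y == gold else 0.0) - probs[y]
                    for f in feats:
                        key = (f, y)
                        self.weights[key] = self.weights.get(key, 0.0) + rate * gradient

    def tag(self, tokens):
        tags = []
        for i in range(len(tokens)):
            probs = self.probabilities(token_features(tokens, i))
            tags.append(max(probs, key=probs.get))
        return tags


def build_examples(tagged_sentences):
    """Flatten (tokens, tags) pairs into per-token (features, label) examples."""
    return [
        (token_features(tokens, i), tags[i])
        for tokens, tags in tagged_sentences
        for i in range(len(tokens))
    ]


# Toy training data, invented for this example.
TRAIN = [
    (["Aspirin", "inhibits", "COX1"], ["DRUG", "O", "PROTEIN"]),
    (["Ibuprofen", "inhibits", "COX2"], ["DRUG", "O", "PROTEIN"]),
]

if __name__ == "__main__":
    tagger = MaxEntTagger(["DRUG", "PROTEIN", "O"])
    tagger.train(build_examples(TRAIN))
    print(tagger.tag(["Aspirin", "inhibits", "COX1"]))
    print(len(tagger.weights), "weights stored")
```

The model stores one weight per (feature, label) pair, so the weight table grows with the number of distinct features times the number of labels. That is one plausible reason a tagger with hundreds of thousands of features needs gigabytes of memory, though the slides do not break down where the memory went.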