Domain Adaptation for Biomedical Information Extraction Jing Jiang BeeSpace Seminar Oct 17, 2007
Outline • Why do we need domain adaptation? • Solutions: • Intelligent learning methods • Knowledge bases • Expert supervision • Connections with BeeSpace V4
Why do we need domain adaptation? • Many biomedical information extraction problems are solved by supervised machine learning methods such as support vector machines (SVMs). • Entity recognition • Relation extraction • Sentence categorization • In supervised machine learning, it is assumed that the training data and the test data have the same distribution.
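The i.i.d. assumption above can be illustrated with a small sketch: a linear SVM trained on one synthetic distribution loses accuracy when the test distribution shifts. The data, the shift, and the use of scikit-learn's `LinearSVC` are illustrative choices, not part of the talk.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def sample(n, shift):
    """Two Gaussian classes; `shift` moves both class centers (domain shift)."""
    X0 = rng.normal([-2 + shift, 0], 1.0, size=(n, 2))
    X1 = rng.normal([2 + shift, 0], 1.0, size=(n, 2))
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

X_train, y_train = sample(200, shift=0.0)   # source-domain training data
X_src, y_src = sample(200, shift=0.0)       # in-domain test set
X_tgt, y_tgt = sample(200, shift=3.0)       # shifted target-domain test set

clf = LinearSVC().fit(X_train, y_train)
acc_src = clf.score(X_src, y_src)   # high: same distribution as training
acc_tgt = clf.score(X_tgt, y_tgt)   # much lower under the shift
```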
Why do we need domain adaptation? • Existing labeled training data is often limited to certain domains. • GENIA corpus: human, blood cells, transcription factors • PennBioIE: genetic variation in malignancy, cytochrome P450 inhibition • Training data for sentence categorization in gene summarizer: fly • Even when the training data is diverse (containing multiple domains), it would still be nice to customize the classifier for the particular target domain that we are working on.
Solutions to domain adaptation • Intelligent learning methods: instance weighting, feature selection (thesis research) • Knowledge bases, expert supervision (future work / discussion)
Domain adaptive learning methods • Two-stage approach • Two frameworks • Instance weighting • Feature selection • Use of unlabeled data
Intuition [Figures: Source Domain vs. Target Domain diagrams illustrating each step] • Goal • Start from the source domain • Focus on the common part • Pick up some part from the target domain
Formal formulation? How can we formally formulate these ideas?
Instance weighting • The instance space (each point represents an example) • Idea: assign different weights to different instances in the objective function
Instance weighting – Observation [Figure: source domain vs. target domain instance distributions]
Instance weighting – Analysis of domain difference
p(x, y) = p(x) p(y | x)
• Labeling difference: ps(y | x) ≠ pt(y | x) → labeling adaptation
• Instance difference: ps(x) ≠ pt(x) → instance adaptation
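The "instance difference" case (ps(x) ≠ pt(x)) is often handled by reweighting source instances by the density ratio w(x) = pt(x)/ps(x). One standard way to estimate this ratio, sketched below on synthetic data (not a method from the talk itself), is a logistic-regression domain classifier: with equal sample sizes, w(x) ≈ P(target | x) / P(source | x).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_src = rng.normal(0.0, 1.0, size=(500, 2))  # source instances
X_tgt = rng.normal(1.0, 1.0, size=(500, 2))  # target instances (shifted mean)

# Domain classifier: d = 1 for target, 0 for source
X = np.vstack([X_src, X_tgt])
d = np.array([0] * len(X_src) + [1] * len(X_tgt))
dom = LogisticRegression().fit(X, d)

# With equal sample sizes, w(x) = pt(x)/ps(x) ≈ P(d=1|x) / P(d=0|x)
p = dom.predict_proba(X_src)[:, 1]
w = p / (1.0 - p)  # importance weights for the source instances
```

Source instances that look like target instances receive large weights; source instances far from the target distribution are downweighted.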
Instance weighting – Three sets of instances (all drawn from the instance space X)
• Ds: labeled source data
• Dt,l: labeled target data
• Dt,u: unlabeled target data
Instance weighting – Framework • An objective function over the labeled source data, the labeled target data, and the unlabeled target data, each with its own instance weights • A flexible setup covering both standard methods and new domain adaptive methods
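One simple instantiation of such a framework, sketched here with scikit-learn's `sample_weight` on synthetic data (an illustrative choice, not the talk's exact objective): give the few labeled target instances a much larger weight than the many labeled source instances.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
# Hypothetical domains: the source decision boundary is x0 = 0,
# the target boundary is x0 = 1 (a deliberate labeling difference).
X_s = rng.normal(0.0, 1.0, size=(400, 2))
y_s = (X_s[:, 0] > 0.0).astype(int)          # many labeled source instances
X_t = rng.normal(1.0, 1.0, size=(40, 2))
y_t = (X_t[:, 0] > 1.0).astype(int)          # few labeled target instances

X = np.vstack([X_s, X_t])
y = np.concatenate([y_s, y_t])
# Weight the scarce target labels more heavily than the source labels
w = np.concatenate([np.full(len(X_s), 1.0), np.full(len(X_t), 25.0)])
clf = LinearSVC().fit(X, y, sample_weight=w)
```

Compared with an unweighted fit on the same pooled data, the weighted classifier's boundary moves toward the target domain's labeling.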
Feature selection • The feature space (each point represents a feature) • Idea: identify features that behave similarly across domains
Feature selection – Observation • Domain-specific features: wingless, daughterless, eyeless, apexless, … (fly genes) • "suffix -less" is weighted high in the model trained from fly data • Useful for other organisms? In general, NO! • May cause generalizable features to be downweighted
Feature selection – Observation • Generalizable features: generalize well in all domains • fly: "…decapentaplegic and wingless are expressed in analogous patterns in each…" • mouse: "…that CD38 is expressed by both neurons and glial cells… that PABPC5 is expressed in fetal brain and in a range of adult tissues." • "wi+2 = expressed" is generalizable
Feature selection – Intuition for identification of generalizable features [Figure: ranked feature lists for the source domains fly, mouse, D3, …, DK; "expressed" ranks high in every domain, while "-less" ranks high only in fly]
Feature selection – Framework • Matrix A is for feature selection
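The intuition above, that features ranked high in every source domain are generalizable, can be sketched as a top-k rank intersection. The domains and feature names below are illustrative toy data, not the talk's actual rankings.

```python
# Per-domain feature rankings (top = most useful for that domain's model)
rankings = {
    "fly":   ["suffix=-less", "w+2=expressed", "w-1=gene", "capitalized"],
    "mouse": ["w+2=expressed", "w-1=gene", "capitalized", "suffix=-less"],
    "D3":    ["w-1=gene", "w+2=expressed", "capitalized", "suffix=-less"],
}

def generalizable(rankings, k=2):
    """Features that appear in the top-k of every source domain's ranking."""
    tops = [set(r[:k]) for r in rankings.values()]
    return set.intersection(*tops)
```

Here "w+2=expressed" survives the intersection while the fly-specific "suffix=-less" does not.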
New directions to explore • Knowledge bases • Expert supervision
Knowledge bases – entity recognition • Well-documented nomenclatures • Fly, Mouse, Rat • Help filter out false positives? • Help select features? • Dictionaries of entities • “Dictionary features” • Automatic summarization of nomenclatures? • Automatic identification of good features?
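A "dictionary feature" can be as simple as a binary indicator for membership in a nomenclature-derived lexicon. A minimal sketch, with a toy dictionary standing in for the fly/mouse/rat nomenclatures and an invented feature map:

```python
# Toy gene dictionary; in practice, built from the organism nomenclatures.
GENE_DICT = {"wingless", "decapentaplegic", "cd38"}

def token_features(tokens, i):
    """Feature map for token i, including a dictionary-membership feature."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "suffix3": tok[-3:].lower(),
        "in_gene_dict": tok.lower() in GENE_DICT,   # the dictionary feature
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
    }
```

The dictionary feature can both filter false positives (non-dictionary tokens in gene-like contexts) and serve as a strong input feature to the learner.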
Knowledge bases – sentence categorization in gene summarizer • For fly, the training sentences are automatically extracted from FlyBase. For other organisms, do we have similar resources?
Expert supervision – entity recognition • Computer system selects ambiguous examples for human experts to judge. • Computer system asks human experts other questions. • Similar organisms? • Typical surface features? (e.g. cis-regulatory elements, “-RE”) • Computer system summarizes possible features from pseudo labeled data, and asks human experts for confirmation.
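Selecting ambiguous examples for expert judgment is classic uncertainty sampling: pick the pool instances whose predicted class probability is closest to 0.5. A sketch on synthetic data (the model and data are illustrative, not from the talk):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(200, 2))
y = (X[:, 0] > 0.0).astype(int)
clf = LogisticRegression().fit(X, y)

def most_ambiguous(clf, pool, n=5):
    """Indices of the pool examples with predicted probability closest to 0.5."""
    p = clf.predict_proba(pool)[:, 1]
    return np.argsort(np.abs(p - 0.5))[:n]

pool = rng.normal(0.0, 1.0, size=(100, 2))  # unlabeled pool
idx = most_ambiguous(clf, pool)             # send these to the human expert
```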
Connections to BeeSpace V4 • A major challenge in BeeSpace V4 is extraction of new types of entities and relations. • Exploiting knowledge bases and expert supervision is especially important. • For new types, no labeled data is available even from other domains. Use of bootstrapping methods should be explored.
New entity types • Recognition of many new types will be dictionary based: organism, anatomy, biological process, etc. • Recognition of some new types will need some NER techniques: chemical, regulatory element
New relation types • Bootstrapping (?) • Seed patterns from knowledge bases or human experts • Human inspection of newly discovered patterns?
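A bootstrapping loop of the kind suggested above alternates between two steps: seed patterns extract entity pairs, and known pairs induce new candidate patterns for human inspection. A minimal string-matching sketch; the gene names, sentences, and seed pattern are invented for illustration.

```python
# Toy corpus; each relation is expressed as "a <middle words> b".
corpus = [
    "dpp interacts with wg in the wing disc",
    "dpp binds to wg directly",
    "vg interacts with sd during development",
]

def match_pairs(corpus, middles):
    """Extract (a, b) pairs where 'a <middle> b' occurs in a sentence."""
    pairs = set()
    for sent in corpus:
        toks = sent.split()
        for mid in middles:
            m = mid.split()
            for i in range(1, len(toks) - len(m)):
                if toks[i:i + len(m)] == m:
                    pairs.add((toks[i - 1], toks[i + len(m)]))
    return pairs

def induce_middles(corpus, pairs):
    """Propose new middle patterns from sentences containing a known pair."""
    middles = set()
    for sent in corpus:
        toks = sent.split()
        for a, b in pairs:
            if a in toks and b in toks and toks.index(a) < toks.index(b):
                middles.add(" ".join(toks[toks.index(a) + 1:toks.index(b)]))
    return middles

pairs = match_pairs(corpus, {"interacts with"})   # seed pattern
new_middles = induce_middles(corpus, pairs)       # includes 'binds to'
```

In a real system the newly discovered patterns would be shown to a human expert before being added to the seed set for the next iteration.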