Domain Adaptation for Biomedical Information Extraction Jing Jiang BeeSpace Seminar Oct 17, 2007
Outline • Why do we need domain adaptation? • Solutions: • Intelligent learning methods • Knowledge bases • Expert supervision • Connections with BeeSpace V4
Why do we need domain adaptation? • Many biomedical information extraction problems are solved by supervised machine learning methods such as support vector machines (SVMs). • Entity recognition • Relation extraction • Sentence categorization • In supervised machine learning, it is assumed that the training data and the test data have the same distribution.
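The i.i.d. assumption above can be illustrated with a small sketch: a linear SVM trained on one synthetic distribution loses accuracy when the test distribution shifts. The data, the shift, and the use of scikit-learn's `LinearSVC` are illustrative choices, not part of the talk.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def sample(n, shift):
    """Two Gaussian classes; `shift` moves both class centers (domain shift)."""
    X0 = rng.normal([-2 + shift, 0], 1.0, size=(n, 2))
    X1 = rng.normal([2 + shift, 0], 1.0, size=(n, 2))
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

X_train, y_train = sample(200, shift=0.0)   # source-domain training data
X_src, y_src = sample(200, shift=0.0)       # in-domain test set
X_tgt, y_tgt = sample(200, shift=3.0)       # shifted target-domain test set

clf = LinearSVC().fit(X_train, y_train)
acc_src = clf.score(X_src, y_src)   # high: same distribution as training
acc_tgt = clf.score(X_tgt, y_tgt)   # much lower under the shift
```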
Why do we need domain adaptation? • Existing labeled training data is often limited to certain domains. • GENIA corpus: human, blood cells, transcription factors • PennBioIE: genetic variation in malignancy, cytochrome P450 inhibition • Training data for sentence categorization in gene summarizer: fly • Even when the training data is diverse (containing multiple domains), it would still be nice to customize the classifier for the particular target domain that we are working on.
Solutions to domain adaptation • Intelligent learning methods: instance weighting, feature selection (thesis research) • Knowledge bases, expert supervision (future work / discussion)
Domain adaptive learning methods • Two-stage approach • Two frameworks • Instance weighting • Feature selection • Use of unlabeled data
Intuition [Figures: Source Domain vs. Target Domain diagrams illustrating each step] • Goal • Start from the source domain • Focus on the common part • Pick up some part from the target domain
Formal formulation? How can we formally formulate these ideas?
Instance weighting • The instance space (each point represents an example) • Idea: assign different weights to different instances in the objective function
Instance weighting – Observation [Figure: source domain vs. target domain instance distributions]
Instance weighting – Analysis of domain difference
p(x, y) = p(x) p(y | x)
• Labeling difference: ps(y | x) ≠ pt(y | x) → labeling adaptation
• Instance difference: ps(x) ≠ pt(x) → instance adaptation
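The "instance difference" case (ps(x) ≠ pt(x)) is often handled by reweighting source instances by the density ratio w(x) = pt(x)/ps(x). One standard way to estimate this ratio, sketched below on synthetic data (not a method from the talk itself), is a logistic-regression domain classifier: with equal sample sizes, w(x) ≈ P(target | x) / P(source | x).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_src = rng.normal(0.0, 1.0, size=(500, 2))  # source instances
X_tgt = rng.normal(1.0, 1.0, size=(500, 2))  # target instances (shifted mean)

# Domain classifier: d = 1 for target, 0 for source
X = np.vstack([X_src, X_tgt])
d = np.array([0] * len(X_src) + [1] * len(X_tgt))
dom = LogisticRegression().fit(X, d)

# With equal sample sizes, w(x) = pt(x)/ps(x) ≈ P(d=1|x) / P(d=0|x)
p = dom.predict_proba(X_src)[:, 1]
w = p / (1.0 - p)  # importance weights for the source instances
```

Source instances that look like target instances receive large weights; source instances far from the target distribution are downweighted.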
Instance weighting – Three sets of instances (all drawn from the instance space X)
• Ds: labeled source data
• Dt,l: labeled target data
• Dt,u: unlabeled target data
Instance weighting – Framework • An objective function over the labeled source data, the labeled target data, and the unlabeled target data, each with its own instance weights • A flexible setup covering both standard methods and new domain adaptive methods
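One simple instantiation of such a framework, sketched here with scikit-learn's `sample_weight` on synthetic data (an illustrative choice, not the talk's exact objective): give the few labeled target instances a much larger weight than the many labeled source instances.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
# Hypothetical domains: the source decision boundary is x0 = 0,
# the target boundary is x0 = 1 (a deliberate labeling difference).
X_s = rng.normal(0.0, 1.0, size=(400, 2))
y_s = (X_s[:, 0] > 0.0).astype(int)          # many labeled source instances
X_t = rng.normal(1.0, 1.0, size=(40, 2))
y_t = (X_t[:, 0] > 1.0).astype(int)          # few labeled target instances

X = np.vstack([X_s, X_t])
y = np.concatenate([y_s, y_t])
# Weight the scarce target labels more heavily than the source labels
w = np.concatenate([np.full(len(X_s), 1.0), np.full(len(X_t), 25.0)])
clf = LinearSVC().fit(X, y, sample_weight=w)
```

Compared with an unweighted fit on the same pooled data, the weighted classifier's boundary moves toward the target domain's labeling.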
Feature selection • The feature space (each point represents a feature) • Idea: identify features that behave similarly across domains
Feature selection – Observation • Domain-specific features: wingless, daughterless, eyeless, apexless, … (fly genes) • "suffix -less" is weighted high in the model trained from fly data • Useful for other organisms? In general, NO! • May cause generalizable features to be downweighted
Feature selection – Observation • Generalizable features: generalize well in all domains • fly: "…decapentaplegic and wingless are expressed in analogous patterns in each…" • mouse: "…that CD38 is expressed by both neurons and glial cells… that PABPC5 is expressed in fetal brain and in a range of adult tissues." • "wi+2 = expressed" is generalizable
Feature selection – Intuition for identification of generalizable features [Figure: ranked feature lists for the source domains fly, mouse, D3, …, DK; "expressed" ranks high in every domain, while "-less" ranks high only in fly]
Feature selection – Framework • Matrix A is for feature selection
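The intuition above, that features ranked high in every source domain are generalizable, can be sketched as a top-k rank intersection. The domains and feature names below are illustrative toy data, not the talk's actual rankings.

```python
# Per-domain feature rankings (top = most useful for that domain's model)
rankings = {
    "fly":   ["suffix=-less", "w+2=expressed", "w-1=gene", "capitalized"],
    "mouse": ["w+2=expressed", "w-1=gene", "capitalized", "suffix=-less"],
    "D3":    ["w-1=gene", "w+2=expressed", "capitalized", "suffix=-less"],
}

def generalizable(rankings, k=2):
    """Features that appear in the top-k of every source domain's ranking."""
    tops = [set(r[:k]) for r in rankings.values()]
    return set.intersection(*tops)
```

Here "w+2=expressed" survives the intersection while the fly-specific "suffix=-less" does not.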
New directions to explore • Knowledge bases • Expert supervision
Knowledge bases – entity recognition • Well-documented nomenclatures • Fly, Mouse, Rat • Help filter out false positives? • Help select features? • Dictionaries of entities • “Dictionary features” • Automatic summarization of nomenclatures? • Automatic identification of good features?
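A "dictionary feature" can be as simple as a binary indicator for membership in a nomenclature-derived lexicon. A minimal sketch, with a toy dictionary standing in for the fly/mouse/rat nomenclatures and an invented feature map:

```python
# Toy gene dictionary; in practice, built from the organism nomenclatures.
GENE_DICT = {"wingless", "decapentaplegic", "cd38"}

def token_features(tokens, i):
    """Feature map for token i, including a dictionary-membership feature."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "suffix3": tok[-3:].lower(),
        "in_gene_dict": tok.lower() in GENE_DICT,   # the dictionary feature
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
    }
```

The dictionary feature can both filter false positives (non-dictionary tokens in gene-like contexts) and serve as a strong input feature to the learner.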
Knowledge bases – sentence categorization in gene summarizer • For fly, the training sentences are automatically extracted from FlyBase. For other organisms, do we have similar resources?
Expert supervision – entity recognition • Computer system selects ambiguous examples for human experts to judge. • Computer system asks human experts other questions. • Similar organisms? • Typical surface features? (e.g. cis-regulatory elements, “-RE”) • Computer system summarizes possible features from pseudo labeled data, and asks human experts for confirmation.
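Selecting ambiguous examples for expert judgment is classic uncertainty sampling: pick the pool instances whose predicted class probability is closest to 0.5. A sketch on synthetic data (the model and data are illustrative, not from the talk):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(200, 2))
y = (X[:, 0] > 0.0).astype(int)
clf = LogisticRegression().fit(X, y)

def most_ambiguous(clf, pool, n=5):
    """Indices of the pool examples with predicted probability closest to 0.5."""
    p = clf.predict_proba(pool)[:, 1]
    return np.argsort(np.abs(p - 0.5))[:n]

pool = rng.normal(0.0, 1.0, size=(100, 2))  # unlabeled pool
idx = most_ambiguous(clf, pool)             # send these to the human expert
```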
Connections to BeeSpace V4 • A major challenge in BeeSpace V4 is extraction of new types of entities and relations. • Exploiting knowledge bases and expert supervision is especially important. • For new types, no labeled data is available even from other domains. Use of bootstrapping methods should be explored.
New entity types • Recognition of many new types will be dictionary based: organism, anatomy, biological process, etc. • Recognition of some new types will need some NER techniques: chemical, regulatory element
New relation types • Bootstrapping (?) • Seed patterns from knowledge bases or human experts • Human inspection of newly discovered patterns?
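A bootstrapping loop of the kind suggested above alternates between two steps: seed patterns extract entity pairs, and known pairs induce new candidate patterns for human inspection. A minimal string-matching sketch; the gene names, sentences, and seed pattern are invented for illustration.

```python
# Toy corpus; each relation is expressed as "a <middle words> b".
corpus = [
    "dpp interacts with wg in the wing disc",
    "dpp binds to wg directly",
    "vg interacts with sd during development",
]

def match_pairs(corpus, middles):
    """Extract (a, b) pairs where 'a <middle> b' occurs in a sentence."""
    pairs = set()
    for sent in corpus:
        toks = sent.split()
        for mid in middles:
            m = mid.split()
            for i in range(1, len(toks) - len(m)):
                if toks[i:i + len(m)] == m:
                    pairs.add((toks[i - 1], toks[i + len(m)]))
    return pairs

def induce_middles(corpus, pairs):
    """Propose new middle patterns from sentences containing a known pair."""
    middles = set()
    for sent in corpus:
        toks = sent.split()
        for a, b in pairs:
            if a in toks and b in toks and toks.index(a) < toks.index(b):
                middles.add(" ".join(toks[toks.index(a) + 1:toks.index(b)]))
    return middles

pairs = match_pairs(corpus, {"interacts with"})   # seed pattern
new_middles = induce_middles(corpus, pairs)       # includes 'binds to'
```

In a real system the newly discovered patterns would be shown to a human expert before being added to the seed set for the next iteration.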