Developing a system for extracting interactions from text based on linguistic patterns and using iterative learning algorithms to acquire and refine extraction patterns for genic interactions.
Automatically Acquiring a Linguistically Motivated Genic Interaction Extraction System. Mark A. Greenwood, Mark Stevenson, Yikun Guo, Henk Harkema, Angus Roberts. Natural Language Processing Group, Department of Computer Science, University of Sheffield, UK
Outline of Talk • Background to our Approach • Extraction Patterns • Acquiring and Using Extraction Patterns • Challenge Evaluation • Analysis • Conclusions and Future Work LLL05: Automatically Acquiring a Linguistically Motivated Genic Interaction Extraction System
Background to our Approach • We had developed a system to perform sentence filtering • Sentence filtering involves classifying sentences based on whether or not they are relevant to a given scenario. • We reported F-measure results of approximately 55% on a management succession task (Stevenson & Greenwood, 2005). • For participation in the LLL challenge we extended this system • We moved to extracting interactions rather than sentence filtering • We extended the pattern representation • Previously we had represented sentences using the verbs in the sentence and their direct arguments.
Extraction Patterns • We represent extraction patterns as paths in a dependency tree. • Dependency trees represent text by linking each word in a sentence with the words that directly modify it. • For example, the noun phrase “the brown dog” is represented by two dependency relations: det, which links “the” to “dog”, and adj, which links “brown” to “dog”. • In these experiments we used MINIPAR (Lin, 1999) to generate the dependency trees from which the extraction patterns were taken. • The supplied dependency relations were not used because there was too little time to adapt our approach to them.
Extraction Patterns • The nodes in a dependency tree can be either: • Lexical items (i.e. words) • Semantic categories such as gene, protein, agent, target, etc. • Lexical items are represented in lower case • Semantic categories are capitalised
Extraction Patterns • Given the dependency tree representing the phrase “…AGENT represses the transcription of TARGET…” we extract chain-shaped paths as extraction patterns: • verb[v/repress](subj[n/AGENT]) • verb[v/repress](obj[n/transcription](of[n/TARGET])) • verb[v/repress](obj[n/transcription]+subj[n/AGENT]) • verb[v/repress](obj[n/transcription](of[n/TARGET])+subj[n/AGENT])
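One way to read the four patterns above is as every single chain, plus every pair of chains under different children of the verb, that mentions at least one semantic category. The sketch below follows that reading only; the `Node` class, the function names and the alphabetical ordering of sibling relations are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    rel: str      # dependency relation to the parent, e.g. "subj"
    pos: str      # part-of-speech tag, e.g. "n" or "v"
    filler: str   # lexical item (lower case) or semantic category (capitalised)
    children: list = field(default_factory=list)


def chains(node):
    """Yield every downward path that starts at node, as a list of nodes."""
    yield [node]
    for child in node.children:
        for tail in chains(child):
            yield [node] + tail


def render_chain(path):
    """Render a path in the nested rel[pos/filler](...) notation."""
    head, rest = path[0], path[1:]
    text = f"{head.rel}[{head.pos}/{head.filler}]"
    return f"{text}({render_chain(rest)})" if rest else text


def has_category(path):
    """True if any node on the path is a capitalised semantic category."""
    return any(n.filler.isupper() for n in path)


def patterns(verb):
    """Enumerate single chains and pairs of chains rooted at the verb."""
    head = f"verb[{verb.pos}/{verb.filler}]"
    # Assumption: sibling relations are written in alphabetical order.
    groups = [list(chains(c)) for c in sorted(verb.children, key=lambda n: n.rel)]
    found = []
    for group in groups:                      # single chains
        for path in group:
            if has_category(path):
                found.append(f"{head}({render_chain(path)})")
    for g1, g2 in combinations(groups, 2):    # two chains, different children
        for p1 in g1:
            for p2 in g2:
                if has_category(p1 + p2):
                    found.append(f"{head}({render_chain(p1)}+{render_chain(p2)})")
    return found


# "...AGENT represses the transcription of TARGET..."
tree = Node("verb", "v", "repress", [
    Node("subj", "n", "AGENT"),
    Node("obj", "n", "transcription", [Node("of", "n", "TARGET")]),
])
for p in patterns(tree):
    print(p)
```

On this tree the sketch yields the same four patterns as the slide, although in a different order.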
Learning Extraction Patterns • Iterative Learning Algorithm: 1. Begin with a set of seed patterns which are known to be good extraction patterns. 2. Compare every other pattern with the ones known to be good. 3. Choose the highest-scoring of these and add them to the set of good patterns. 4. Stop if enough patterns have been learned, else repeat from step 2. [Diagram labels: Patterns, Candidates, Seeds, Rank]
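The loop can be sketched as follows. This is a minimal illustration under stated assumptions: a candidate's score is taken to be its best similarity to any accepted pattern, patterns are modelled as sets of element-filler pairs, and a Jaccard overlap stands in for the real similarity measure. The names `learn_patterns`, `jaccard`, `per_round` and `max_patterns` are hypothetical.

```python
def learn_patterns(seeds, candidates, similarity, per_round=1, max_patterns=10):
    """Grow the set of accepted patterns from the seeds, best candidates first."""
    accepted = list(seeds)                                  # step 1
    pool = [c for c in candidates if c not in accepted]
    while pool and len(accepted) < max_patterns:            # step 4
        # Step 2: compare each remaining candidate with the accepted patterns.
        ranked = sorted(
            pool,
            key=lambda c: max(similarity(c, a) for a in accepted),
            reverse=True,
        )
        # Step 3: accept the highest-scoring candidates.
        chosen = ranked[:per_round]
        accepted.extend(chosen)
        pool = [c for c in pool if c not in chosen]
    return accepted


def jaccard(a, b):
    """Stand-in similarity: overlap between two sets of element-filler pairs."""
    return len(a & b) / len(a | b)


seed = frozenset({"verb_repress", "subj_AGENT"})
cand_a = frozenset({"verb_repress", "obj_TARGET"})
cand_b = frozenset({"verb_bind", "obj_TARGET"})
cand_c = frozenset({"verb_eat", "obj_apple"})

learned = learn_patterns([seed], [cand_a, cand_b, cand_c], jaccard, max_patterns=3)
```

Here `cand_a` is accepted first because it shares a pair with the seed. `cand_b` follows because it resembles `cand_a`, which shows how acceptance can drift away from the original seeds over several iterations.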
Pattern Similarity • We determine the similarity between two patterns using a vector space model inspired by that commonly used in IR. • Each pattern can be represented by a set of pattern element-filler pairs. • The set of pattern element-filler pairs in a corpus forms the basis for a vector space in which a pattern's value along a dimension is 1 if the pattern contains that pair, 0 otherwise. • The similarity of two patterns a and b can then be computed as: sim(a, b) = (a W b^T) / (|a| |b|) • This is the cosine measure augmented with a matrix W which lists the similarity between each pair of pattern element-filler pairs. • The similarity between pattern element-filler pairs is computed using a WordNet similarity measure proposed by Banerjee and Pedersen (2002), referred to as Adapted Lesk.
Pattern Similarity • Extraction Patterns: a. verb[v/block](subj[n/protein]) b. verb[v/repress](subj[n/enzyme]) c. verb[v/promote](subj[n/protein]) • Matrix Labels: 1. subj_protein, 2. subj_enzyme, 3. verb_block, 4. verb_repress, 5. verb_promote • [Figure: Similarity Matrix] • Similarity Values: sim(a, b) = 0.925, sim(a, c) = 0.55, sim(b, c) = 0.525
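The example can be reproduced with a small sketch of the matrix-weighted cosine, sim(a, b) = (a W b^T) / (|a| |b|). The similarity matrix is not reproduced in this transcript, so the off-diagonal entries of `W` below are assumed values, chosen only because they are consistent with the three similarity values listed; they are not the authors' figures. The function names are hypothetical.

```python
import math

LABELS = ["subj_protein", "subj_enzyme", "verb_block", "verb_repress", "verb_promote"]

# Assumed pairwise similarities (illustrative, not taken from the slides).
ASSUMED_SIMS = {
    ("subj_protein", "subj_enzyme"): 0.95,
    ("verb_block", "verb_repress"): 0.90,
    ("verb_block", "verb_promote"): 0.10,
    ("verb_repress", "verb_promote"): 0.10,
}


def make_w(labels, pair_sims):
    """Build a symmetric matrix with 1.0 on the diagonal and 0.0 by default."""
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    w = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (x, y), value in pair_sims.items():
        w[index[x]][index[y]] = value
        w[index[y]][index[x]] = value
    return w


def to_vector(pairs, labels):
    """Binary vector: 1.0 where the pattern contains the element-filler pair."""
    return [1.0 if label in pairs else 0.0 for label in labels]


def similarity(a, b, w):
    """Cosine measure augmented with the matrix w: (a W b^T) / (|a| |b|)."""
    numerator = sum(a[i] * w[i][j] * b[j]
                    for i in range(len(a)) for j in range(len(b)))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return numerator / (norm_a * norm_b)


W = make_w(LABELS, ASSUMED_SIMS)
a = to_vector({"verb_block", "subj_protein"}, LABELS)
b = to_vector({"verb_repress", "subj_enzyme"}, LABELS)
c = to_vector({"verb_promote", "subj_protein"}, LABELS)

print(round(similarity(a, b, W), 3))
print(round(similarity(a, c, W), 3))
print(round(similarity(b, c, W), 3))
```

With a plain cosine (W as the identity matrix), patterns a and b would score 0 because they share no pair. The matrix lets near-synonymous fillers such as block/repress contribute to the score.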
Acquiring Patterns • We use this approach to learn patterns containing a known agent or target from the training data. • The texts are pre-processed to include AGENT and TARGET as semantic class labels. • We restricted certain terms (e.g. repress) to domain-specific senses in WordNet for similarity calculations. • We started from the 30 verbs in the PASBio project • A further 28 nouns and verbs were also restricted • At each iteration of the algorithm we accepted up to 4 new patterns whose scores were within 0.95 of the score of the best pattern accepted in that iteration. • The algorithm was allowed to run until no more patterns could be acquired.
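The acceptance rule can be sketched as below. The slide does not say exactly how "within 0.95" is measured; this sketch assumes it means a score of at least 0.95 times the best score in the iteration. The function name, parameter names and scores are hypothetical.

```python
def select_patterns(scores, max_new=4, ratio=0.95):
    """Pick up to max_new patterns scoring at least ratio * best score.

    scores maps each candidate pattern to its score for this iteration.
    """
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_score = ranked[0][1]
    close_enough = [p for p, s in ranked if s >= ratio * best_score]
    return close_enough[:max_new]


# Hypothetical scores for one iteration.
iteration_scores = {
    "p1": 0.90, "p2": 0.88, "p3": 0.80,
    "p4": 0.86, "p5": 0.87, "p6": 0.859,
}
accepted = select_patterns(iteration_scores)
```

Five candidates clear the assumed threshold here, but only the top four are accepted. An iteration can also accept fewer than four when the runners-up fall too far below the best score.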
Seed Patterns • We used the following manually selected seed patterns in all our experiments: • verb[v/transcribe](by[n/AGENT]+obj[n/TARGET]) • verb[v/be](of[n/AGENT]+s[n/expression](of[n/TARGET])) • verb[v/inhibit](obj[n/activity](nn[n/TARGET])+subj[n/AGENT]) • verb[v/bind](mod[r/specifically](to[n/TARGET])+subj[n/AGENT]) • verb[v/block](obj[n/capacity](of[n/TARGET])+subj[n/AGENT]) • verb[v/regulate](obj[n/expression](nn[n/TARGET])+subj[n/AGENT]) • verb[v/require](obj[n/AGENT]+subj[n/gene](nn[n/TARGET])) • verb[v/repress](obj[n/transcription](of[n/TARGET])+subj[n/AGENT])
Extracting Relations • Text from which we wish to extract relations is processed to produce extraction patterns in the same way as before. • Any pattern which matches an acquired pattern is used to extract information. • When matching, AGENT and TARGET in an acquired pattern match any filler. • Not all patterns contain both an AGENT and a TARGET, so post-processing links partial relations together. • For example: • The pattern verb[v/stimulates](subj[n/AGENT]+obj[n/TARGET]) • Matches against verb[v/stimulates](subj[n/GerE]+obj[n/cotD]) • Resulting in the interaction between GerE (agent) and cotD (target)
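A minimal sketch of the matching step, assuming patterns are compared as strings and AGENT and TARGET act as wildcards over a single filler. The regular-expression approach and the names `compile_pattern` and `extract` are illustrative assumptions, not the authors' implementation. Linking partial relations in post-processing is not shown.

```python
import re


def compile_pattern(pattern):
    """Turn an acquired pattern into a regex with AGENT/TARGET as wildcards."""
    escaped = re.escape(pattern)
    # A filler is any run of characters up to the closing bracket.
    escaped = escaped.replace("AGENT", r"(?P<agent>[^\]]+)")
    escaped = escaped.replace("TARGET", r"(?P<target>[^\]]+)")
    return re.compile(escaped)


def extract(acquired, candidate):
    """Return the matched fillers as a dict, or None when there is no match."""
    match = compile_pattern(acquired).fullmatch(candidate)
    return match.groupdict() if match else None


acquired = "verb[v/stimulates](subj[n/AGENT]+obj[n/TARGET])"
from_text = "verb[v/stimulates](subj[n/GerE]+obj[n/cotD])"

result = extract(acquired, from_text)
print(result)
```

A pattern that contains only AGENT or only TARGET returns a dict with a single key, which corresponds to a partial relation that post-processing would still have to link.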
Challenge Evaluation • We submitted three runs for evaluation • Baseline: A simple baseline system which pairs all dictionary elements in a sentence with each other in both orders. • Basic: A system trained on the basic data set without coreference as provided for the LLL-05 challenge. • Expanded: A system trained on the basic data set augmented with 78 automatically acquired weakly labelled MedLine sentences. • The basic and expanded systems differ only in the training data used to acquire the extraction patterns.
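The baseline run can be sketched in a few lines. The sketch assumes a sentence is already tokenised and the dictionary is a set of known gene or protein names; the function name, the example sentence and the dictionary contents are illustrative.

```python
from itertools import permutations


def baseline_interactions(tokens, dictionary):
    """Pair every dictionary element in the sentence with every other, both orders."""
    found = []
    for token in tokens:
        if token in dictionary and token not in found:
            found.append(token)
    # Each ordered pair is read as (agent, target).
    return list(permutations(found, 2))


dictionary = {"GerE", "cotD", "sigK"}
tokens = "GerE stimulates cotD transcription".split()
pairs = baseline_interactions(tokens, dictionary)
print(pairs)
```

Because each ordered pair is proposed exactly once, this baseline cannot report two distinct interactions between the same agent and target.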
Challenge Evaluation • The baseline system did not achieve 100% recall because some constructs, such as “… A activates or represses B…”, require two interactions between A and B to be recognised. • Both approaches have low recall but a precision twice that of the baseline system. • While performance is low, supplying extra training data appears to improve the performance of our approach.
Analysis • If we examine the algorithm at each iteration instead of just the final result we can see that: • The seed patterns are unable to extract a single interaction, i.e. the initial F-measure is zero. • As the seeds do not extract relations the performance of the system is solely due to the acquired patterns. • The algorithm is fairly resilient to the acquisition of bad patterns, i.e. with few exceptions, the F-measure steadily increases.
Conclusions • We used a pattern representation based on dependency trees and an iterative algorithm to learn representative patterns. • The seed patterns were not well suited to the task and future work will include experimenting with different seed sets. • The small amount of training data seems to hinder our approach (adding 78 extra sentences saw a 2.7% increase in F-measure) • The similarity measure we adopted seems well suited to this task where similar meaning can be conveyed in different ways.
Future Work • We intend to try dependency parsers other than MINIPAR to see if they are more suited to biomedical texts. • We are already looking at other pattern representations to see if they are more suited to the task/domain. • We intend to continue our work on sentence filtering as this would provide a useful first step in any extraction system.
Any Questions? Copies of these slides can be found at: http://www.dcs.shef.ac.uk/~mark/nlp/pubs/
Bibliography • Satanjeev Banerjee and Ted Pedersen. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In Proceedings of the Fourth International Conference on Computational Linguistics and Intelligent Text Processing (CICLING-02), 2002. • Mark Craven and Johan Kumlien. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, 1999. • Dekang Lin. MINIPAR: a minimalist parser. Maryland Linguistics Colloquium, University of Maryland, College Park, 1999. • Mark Stevenson and Mark A. Greenwood. A Semantic Approach to IE Pattern Induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), 2005. • Tuangthong Wattarujeekrit, Parantu Shah and Nigel Collier. PASBio: Predicate-Argument Structures for Event Extraction in Molecular Biology. BMC Bioinformatics, 5:155, 2004.