Relation Extraction for Academic Collaboration
10-709 Project Proposal
Justin Betteridge, Matthew Bilotti, Simon Fung, Sophie Wang
Jan 26, 2006
Relation Extraction
• We want: CollaboratesWith( <x>, <y> ), where <x> and <y> are of type ‘person’
• Two redundant sources of information for co-training:
  • Extraction Patterns to find Relations expressed in surface text or tables on the web
  • A rote learner that keeps track of the Relations it is told about, aggregating evidence in the form of confidence scores when a Relation is extracted multiple times from different sources
Sketch of a Co-Training Algorithm
Let: R = a set of Relations; P = a set of Extraction Patterns
Initialize: R <- seed Relations, P <- seed Patterns
Repeat until a termination condition is reached:
• For each p in P, where p is of the form ( “before context”, <x>, “between context”, <y>, “after context” ), query Google using the literal context strings in the Pattern to retrieve text windows from which a set of Relations ( <x>, <y> ) can be extracted.
• For each new Relation, compute a new confidence score and add it to R, combining evidence if necessary.
• Weed out any r in R whose confidence is below a threshold and, optionally, any r whose arguments are unlikely to be of type person.
• For each r in R, where r is of the form ( <x>, <y> ), query Google to retrieve a set of text windows containing the strings <x> and <y>. From these text windows, generalize a set of Patterns ( “before”, <x>, “between”, <y>, “after” ).
• For each new Pattern, compute a new confidence score and add it to P, combining evidence if necessary.
• Weed out any p in P whose confidence is below a threshold.
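The loop above can be sketched in Python as follows. This is only an illustration under several assumptions that are not in the proposal: the web search is replaced by a hypothetical `search` function over a small in-memory corpus, names are single tokens, generalized Patterns keep only the "between" context, the scoring rule (a fixed `DISCOUNT` combined with `max`) is a placeholder, and the loop stops when a round adds nothing new or after `max_rounds`.

```python
import re

# Hypothetical placeholder: a derived item gets 0.9 times the confidence
# of the Relation or Pattern that produced it (not from the proposal).
DISCOUNT = 0.9


def make_search(corpus):
    """Stand-in for a web search: return text windows containing all terms."""
    def search(terms):
        return [w for w in corpus if all(t in w for t in terms)]
    return search


def extract(pattern, window):
    """Apply a (before, between, after) Pattern to one text window."""
    before, between, after = pattern
    regex = (re.escape(before) + r"(\w+)" + re.escape(between)
             + r"(\w+)" + re.escape(after))
    return [(m.group(1), m.group(2)) for m in re.finditer(regex, window)]


def generalize(relation, window):
    """Derive a Pattern from a window; simplified to the between context."""
    x, y = relation
    m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), window)
    return [("", m.group(1), "")] if m else []


def co_train(seed_relations, seed_patterns, search, threshold=0.5, max_rounds=10):
    R = {r: 1.0 for r in seed_relations}   # Relation -> confidence
    P = {p: 1.0 for p in seed_patterns}    # Pattern  -> confidence
    for _ in range(max_rounds):
        sizes_before = (len(R), len(P))
        # Patterns -> Relations
        for p, p_conf in list(P.items()):
            terms = [c.strip() for c in p if c.strip()]
            for window in search(terms):
                for r in extract(p, window):
                    R[r] = max(R.get(r, 0.0), DISCOUNT * p_conf)
        R = {r: c for r, c in R.items() if c >= threshold}
        # Relations -> Patterns
        for r, r_conf in list(R.items()):
            for window in search(list(r)):
                for p in generalize(r, window):
                    P[p] = max(P.get(p, 0.0), DISCOUNT * r_conf)
        P = {p: c for p, c in P.items() if c >= threshold}
        if (len(R), len(P)) == sizes_before:   # assumed termination condition
            break
    return R, P


# Toy corpus; "alice" and "bob" are made-up names for illustration.
search = make_search([
    "mbilotti in collaboration with ehn",
    "mbilotti joint work with ehn",
    "alice joint work with bob",
])
R, P = co_train([("mbilotti", "ehn")], [("", " in collaboration with ", "")], search)
print(sorted(R))
print(sorted(P))
```

On this toy corpus the seed Relation yields the “joint work with” Pattern, which in turn extracts the new pair. A real system would swap in the confidence measures and combination rules discussed on the following slides.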
Coverage as a Confidence Measure
• Confidence for an Extraction Pattern p
  • For each r in R, query Google to see if p can extract r
  • Coverage is the number of Relations in R extractable by p, divided by |R|
• Confidence for a Relation r
  • For each p in P, query Google to see if p can extract r
  • Similarly, coverage is the number of Patterns in P that can extract r, divided by |P|
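A minimal sketch of the two coverage measures, assuming a hypothetical `can_extract(pattern, relation)` predicate that stands in for the Google query. Here it is backed by a small made-up table of observations; returning 0.0 for an empty set is also an assumption.

```python
def pattern_coverage(pattern, relations, can_extract):
    """Fraction of Relations in R that the Pattern can extract."""
    if not relations:
        return 0.0
    hits = sum(1 for r in relations if can_extract(pattern, r))
    return hits / len(relations)


def relation_coverage(relation, patterns, can_extract):
    """Fraction of Patterns in P that can extract the Relation."""
    if not patterns:
        return 0.0
    hits = sum(1 for p in patterns if can_extract(p, relation))
    return hits / len(patterns)


# Made-up observations standing in for search results.
OBSERVED = {
    ("in collaboration with", ("mbilotti", "ehn")),
    ("joint work with", ("mbilotti", "ehn")),
    ("joint work with", ("jbetter", "teruko")),
}


def can_extract(pattern, relation):
    return (pattern, relation) in OBSERVED


relations = [("mbilotti", "ehn"), ("jbetter", "teruko")]
patterns = ["in collaboration with", "joint work with"]

print(pattern_coverage("in collaboration with", relations, can_extract))  # 0.5
print(relation_coverage(("mbilotti", "ehn"), patterns, can_extract))      # 1.0
```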
Combining Confidence Scores
• Given a Relation with confidence c
• Extracted again; the pattern has confidence p
• The new extraction has a confidence score of s (may be < c)
• One idea: MYCIN Calculus [Shortliffe 76]
  • new confidence = c + ( 1 - c ) * p * s
  • Intuitively, this moves a fraction p * s of the way from the old confidence c to the maximal confidence 1.0
• Another idea: new confidence = ( c + p * s ) / ( 1 + c * p * s )
  • Confidences increase monotonically and stay between 0 and 1.0, but never reach 1.0 (as long as c and p * s are below 1.0)
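Both combination rules are easy to compare numerically. The sketch below implements them as written on the slide; the function names and the sample values are assumptions chosen for illustration.

```python
def combine_mycin(c, p, s):
    """MYCIN-style update: move a fraction p*s of the way from c to 1.0."""
    return c + (1.0 - c) * p * s


def combine_alternative(c, p, s):
    """Second rule from the slide: (c + p*s) / (1 + c*p*s)."""
    x = p * s
    return (c + x) / (1.0 + c * x)


# Sample values (made up): old confidence 0.5, pattern confidence 0.8,
# new extraction score 0.5, so p * s = 0.4.
c, p, s = 0.5, 0.8, 0.5
print(combine_mycin(c, p, s))        # 0.5 + 0.5 * 0.4 = 0.7
print(combine_alternative(c, p, s))  # 0.9 / 1.2 = 0.75

# Repeated evidence keeps raising the score without reaching 1.0.
conf = 0.5
for _ in range(5):
    conf = combine_alternative(conf, p, s)
print(conf)
```

With these inputs the second rule gives the larger update. Both rules leave c unchanged when p * s is 0 and return 1.0 only when c or p * s is already 1.0.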
Example Seed Data for Co-Training
• Extraction Patterns
  • <x> “in collaboration with” <y>
  • <x> “joint work with” <y>
  • Patterns that extract information from tables, lists of citations, etc.
• Relations
  • CollaboratesWith( mbilotti, ehn )
  • CollaboratesWith( jbetter, teruko )
  • ...
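One way a surface-text seed Pattern might be applied to a text window is to compile it into a regular expression. The sketch below is an assumption-laden illustration: the template uses straight quotes, a person name is crudely approximated as a run of capitalized tokens (`NAME`), and the sentence and the names in it are made up. Real pages would need Named Entity Recognition or stronger heuristics.

```python
import re

# Crude assumption: a person name is one or more capitalized tokens.
NAME = r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*"


def compile_pattern(template):
    """Turn a template like '<x> "literal" <y>' into a compiled regex."""
    m = re.fullmatch(r'\s*<x>\s*"([^"]+)"\s*<y>\s*', template)
    if m is None:
        raise ValueError(f"unsupported pattern template: {template!r}")
    literal = re.escape(m.group(1))
    return re.compile(rf"\b(?P<x>{NAME})\s+{literal}\s+(?P<y>{NAME})")


def apply_pattern(template, text):
    """Return all (x, y) pairs the Pattern extracts from the text."""
    regex = compile_pattern(template)
    return [(m.group("x"), m.group("y")) for m in regex.finditer(text)]


text = "Results were obtained by Alice Smith in collaboration with Bob Jones."
print(apply_pattern('<x> "in collaboration with" <y>', text))
# [('Alice Smith', 'Bob Jones')]
```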
Extraction Pattern Examples
• Query: “in collaboration with” site:web.mit.edu/biology/www
Open Questions
• Additional useful sources of information:
  • Anchor text and link structure: advisor-advisee cross-references, department or lab organization
  • Heuristics or Named Entity Recognition to weed out relation arguments that are not people
• Confidence metrics for patterns and relations
• Methods of combining confidence scores
• Termination condition