Terminology problems in literature mining and NLP
John MacMullen
SILS Bioinformatics Journal Club, Fall 2003
Assumptions of the paper
• “knowledge encoded in textual documents is organized around sets of domain-specific terms, which are used as a basis for sophisticated knowledge acquisition.” [938]
• “Terms represent the most important concepts in a domain and characterize documents semantically.” [939]
• “the basic problem is to recognize domain-specific concepts and to extract instances of specific relationships among them.” [938]
Current approaches to automatic term recognition (ATR)
• Morpho-syntactic feature identification
• Hybrid linguistic and statistical approaches
• Machine learning techniques

Problems
• Terms are ambiguous and vary in form; they are hardly ever mono-referential
• The lack of naming conventions (controlled vocabularies), the existence of acronyms, and the large, heterogeneous existing literature all increase complexity
Context: Term variation problems in NLP
Terminology Processing Workflow
• Corpus: 2,082 MEDLINE abstracts related to ‘nuclear receptors’
• Source: Nenadic, Spasic & Ananiadou (2003), Fig. 1
ATR approach
• C-values (“termhoods”) [940]
  • Term frequency
  • “Frequency of occurrence as a substring of other candidate terms” (receptor)
  • “Number of candidate terms containing the given candidate term as a substring”
  • “Number of words contained in the candidate term”
• NC-values (“termhood estimations”) [940]
  • Includes context of candidate terms
  • “Frequency of co-occurrence with top-ranked context words”
  • NC-value = a linear combination of the C-value and context factors for each term
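
The four C-value ingredients listed above can be combined as in the sketch below. This is a minimal illustration, assuming the commonly cited C-value formulation (log2 of term length times an adjusted frequency); the paper's exact variant may differ. The names `contains_term`, `c_value`, `nc_value`, the 0.8/0.2 weights, and the counts in `toy_freq` are all hypothetical and chosen for illustration only, not taken from the paper's data.

```python
import math


def contains_term(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside a longer term."""
    lw, sw = longer.split(), shorter.split()
    if len(lw) <= len(sw):
        return False
    return any(lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1))


def c_value(term, freq):
    """C-value of a multi-word candidate; `freq` maps candidate term -> frequency.

    Assumed formulation: log2(number of words) * adjusted frequency.
    Single-word candidates score 0 under this variant.
    """
    length_weight = math.log2(len(term.split()))
    # Candidate terms that contain `term` as a nested substring.
    longer = [c for c in freq if contains_term(c, term)]
    if not longer:
        return length_weight * freq[term]
    # Discount occurrences that are explained by the longer candidates.
    nested_freq = sum(freq[c] for c in longer)
    return length_weight * (freq[term] - nested_freq / len(longer))


def nc_value(c_val, context_factor, w_c=0.8, w_ctx=0.2):
    """Linear combination of a C-value and a context factor (weights are assumed)."""
    return w_c * c_val + w_ctx * context_factor


# Made-up toy counts, for illustration only.
toy_freq = {
    "nuclear receptor": 10,
    "orphan nuclear receptor": 4,
    "nuclear receptor family": 2,
}
nested_score = c_value("nuclear receptor", toy_freq)
top_level_score = c_value("orphan nuclear receptor", toy_freq)
combined = nc_value(nested_score, context_factor=5.0)
```

In the toy data, "nuclear receptor" is nested in two longer candidates, so its frequency is reduced by their average frequency before the length weight is applied; "orphan nuclear receptor" is not nested and keeps its raw frequency.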
Clustering & Evaluation
• Clustering
  • CSL (contextual, syntactical, lexical)
  • Clustering implies underlying perspectives or queries
• Evaluation
  • Recall – the probability that a relevant item will be retrieved
  • Precision – the probability that a retrieved item will be relevant
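
The two evaluation measures above can be estimated from a set of retrieved items and a set of relevant items. The sketch below is a minimal illustration; the function name `precision_recall` and the example term lists are hypothetical and not drawn from the paper.

```python
def precision_recall(retrieved, relevant):
    """Estimate precision and recall from retrieved and relevant item collections."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # items that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: four extracted candidates, three true terms.
precision, recall = precision_recall(
    retrieved=["nuclear receptor", "retinoic acid", "this study", "binding domain"],
    relevant=["nuclear receptor", "retinoic acid", "thyroid hormone receptor"],
)
```

Here two of the four retrieved candidates are relevant, and two of the three relevant terms were retrieved, so the two measures differ even though the number of hits is the same.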
Other questions
• Corpus construction: “a larger corpus does not have a proportionally higher number of acronyms” [942] True?
• “All term variants are considered jointly for the calculation of termhood” [942] What would happen if they weren’t?
• In what ways is the hybrid similarity measure corpus dependent? [942]
References
• Nenadic, G., Spasic, I., & Ananiadou, S. (2003). Terminology-driven mining of biomedical literature. Bioinformatics, 19(8), 938-943. http://bioinformatics.oupjournals.org/cgi/reprint/19/8/938