This article explores the representation of biomedical sublanguages using symbolic notation. It discusses the challenges of recognizing domain-specific concepts and extracting relationships among them. The use of linguistic probabilities and representing meaning with notation is also examined. Practical applications and potential barriers to implementation are discussed, along with the relevance of these assumptions in an interdisciplinary environment.
Representation of biomedical sublanguages using symbolic notation
John MacMullen
SILS Bioinformatics Journal Club, Fall 2003
Terminology article assumptions
• “Knowledge encoded in textual documents is organized around sets of domain-specific terms…” [938]
• “Terms represent the most important concepts in a domain and characterize documents semantically.” [939]
• “[T]he basic problem is to recognize domain-specific concepts and to extract instances of specific relationships among them.” [938]
• Terms are ambiguous and have variation; they are hardly ever mono-referential.
• The lack of naming conventions (controlled vocabularies), the existence of acronyms, and the large existing heterogeneous literatures increase complexity.
[from Nenadic, G., Spasic, I., & Ananiadou, S. (2003). Terminology-driven mining of biomedical literature. Bioinformatics, 19(8), 938-943.]
Harris’ Assumptions
• “[T]here is a particular structure to science information in general, and to the information of each subscience in particular”, [because] “for each subscience there are particular subsets of nouns that occur with particular subsets of verbs or other words” [215].
• “[I]t is not intrinsic properties of sounds and meanings that determine the possible word-sequences of sentences.” […] “For each word, we find roughly stable inequalities of probability among the words in its required (positive probability) set” [216].
• “What is common to the texts of a given subject matter is that first-level words of a given subset require zero-level words of only a particular subset” [217].
• “We thus obtain for the science several statement-types […]” [217].
• “What we have here is thus an information-theoretic approach to the structure of information, as against solely the amount of information” [217, emphasis added].
Linguistic probabilities
• “For each word, we find roughly stable inequalities of probability among the words in its required (positive probability) set” [216].
• “The meaning of a word is indicated, and in part created, by the meanings of the words in respect to which it has higher than average probability” [217].
• “Words with highest probability in respect to another word, or which otherwise can be shown structurally to have highest expectancy, add little or no information” [217].
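One way to make the last point concrete is to estimate P(argument | operator) by relative frequency and read off Shannon surprisal, -log2(p): the most expected word carries the fewest bits. This is a minimal sketch, not Harris' own procedure. The word pairs are invented toy data, and the function names `conditional_probabilities` and `surprisal_bits` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy data: (operator word, argument word) pairs.
# Invented for illustration; not drawn from Harris' corpus.
pairs = [
    ("injected", "antigen"),
    ("injected", "antigen"),
    ("injected", "antigen"),
    ("injected", "serum"),
    ("found", "antibody"),
    ("found", "antigen"),
]

def conditional_probabilities(pairs, operator):
    """Estimate P(argument | operator) by relative frequency."""
    counts = Counter(arg for op, arg in pairs if op == operator)
    total = sum(counts.values())
    return {arg: n / total for arg, n in counts.items()}

def surprisal_bits(probability):
    """Shannon surprisal in bits: the rarer the outcome, the more bits."""
    return -math.log2(probability)

probs = conditional_probabilities(pairs, "injected")
for arg, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"P({arg} | injected) = {p:.2f}, surprisal = {surprisal_bits(p):.2f} bits")
```

In this toy sample "antigen" is the most probable argument of "injected" and therefore has the lowest surprisal, which matches the claim that highest-expectancy words add little information.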
Representing Meaning with Notation
• Movement towards structured rather than natural language
• Representing sentences as propositions whose truth can be tested
• Example [from 217-218]: If ‘G’ = “antigen”, and ‘J’ = “injected into”, and ‘B’ = “ear”, then ‘GJB’ = “antigen injected into ear”
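The G/J/B example can be sketched as a small lookup table with an encoder and a decoder. Only the three symbols come from the slide; the greedy longest-match strategy and the names `SYMBOLS`, `encode`, and `decode` are assumptions made for illustration, not Harris' procedure.

```python
# Symbol table copied from the slide's example. A real sublanguage
# grammar would define many more word classes and subclasses.
SYMBOLS = {
    "G": "antigen",
    "J": "injected into",
    "B": "ear",
}

def decode(formula, symbols=SYMBOLS):
    """Expand a formula such as 'GJB' into its word sequence."""
    return " ".join(symbols[ch] for ch in formula)

def encode(phrase, symbols=SYMBOLS):
    """Map a phrase to symbols using greedy longest match (an assumption)."""
    inverse = {text: sym for sym, text in symbols.items()}
    words = phrase.split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest remaining chunk first, then shrink it.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in inverse:
                out.append(inverse[chunk])
                i = j
                break
        else:
            raise ValueError(f"no symbol for {words[i]!r}")
    return "".join(out)

print(encode("antigen injected into ear"))
print(decode("GJB"))
```

Phrases containing words outside the table raise a `ValueError`, which is a simple stand-in for the harder problem of recognizing which words belong to which class.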
Symbolic representation
Applications
• Investigate “the possibilities of obtaining standard notations for science languages, not by fiat but by boiling down from actual use…” [220].
• “[R]elate the information structure of a science to anything else that characterizes the field, in order to reach if possible a ‘structure’ of the science” [220].
• “[S]ee how tabular or other two-dimensional displays can represent the data (or the Result statements) of articles, for human inspection or for computer processing” [220].
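The third application, tabular display of statements, might look like the sketch below once sentences have been split into slots. Only the first row comes from the earlier example. The other rows, the article labels, the slot names `N1`/`operator`/`N2`, and the function `format_table` are all hypothetical.

```python
# Each row is one statement already split into slots.
# Row 1 is the slide's example; rows 2 and 3 are invented for illustration.
rows = [
    {"article": "A1", "N1": "antigen", "operator": "injected into", "N2": "ear"},
    {"article": "A2", "N1": "antigen", "operator": "injected into", "N2": "footpad"},
    {"article": "A2", "N1": "antibody", "operator": "found in", "N2": "lymph node"},
]
COLUMNS = ["article", "N1", "operator", "N2"]

def format_table(rows, columns):
    """Render rows as a fixed-width text table, one statement per line."""
    widths = {
        c: max([len(c)] + [len(r[c]) for r in rows])
        for c in columns
    }
    header = " | ".join(c.ljust(widths[c]) for c in columns)
    rule = "-+-".join("-" * widths[c] for c in columns)
    body = [" | ".join(r[c].ljust(widths[c]) for c in columns) for r in rows]
    return "\n".join([header, rule] + body)

print(format_table(rows, COLUMNS))
```

Reading down a column shows which members of a word class actually occur in a slot, and the same list of rows can be handed to a program unchanged.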
Other Propositions
• “[W]hen, in a given science, articles written in different languages are analyzed […], we obtain the same sentence-types and structures, with only small differences due to the languages.”
• “The word class and subclass symbols, and the sentence-types, are therefore not just a sublanguage of a particular language, but an independent symbolic linguistic system” [219].
• Difference between ‘equivalence’ and ‘equal’? Example:
• La casa di Gianni è bianca. [original]
• The house of Gianni is white. [literal or equal]
• John’s house is white. [equivalent]
Questions
• Assume Harris’ notation method is valid and works well.
• How might it be implemented in practice? (This is both an algorithm question and a policy question.)
• Who would apply it?
• What would some of the barriers be?
• Do Harris’ arguments hold in an interdisciplinary environment?
References
• Harris, Z. S. (2002). The structure of science information. Journal of Biomedical Informatics, 35, 215-221.
• Linguistic String Project @ NYU: http://www.cs.nyu.edu/cs/projects/lsp/
• MedLEE project (Medical Language Extraction and Encoding System): http://cat.cpmc.columbia.edu/medleexml/
• Zellig Harris homepage: http://www.dmi.columbia.edu/zellig/