Research Computing Center of Moscow State University; NCO Center for Information Research
Sociopolitical Domain as a Bridge from General Words to Terms of Specific Domains
Natalia V. Loukachevitch, Boris V. Dobrov
General Words and Terms in Automatic Text Processing
• Texts in electronic collections contain both general words and terms
• Two different research domains: lexicology and terminology
• Wüster (founder of the Vienna school of terminology): terminologists start from a concept, whereas lexicologists start from the form of a linguistic expression
Wüster: difference between lexicological and terminological approaches
• terminological research starts from the concept, which has to be precisely delimited
• in terminology, concepts are considered to be independent of their designations
• terminologists talk about ‘concepts’ while linguists talk about ‘word meanings’
Construction of Wordnets and Terminology Research
• Development of wordnets:
  • Construction of hierarchical semantic networks
  • Search for similar “synsets” in different languages
  • Building the top ontology of language-independent concepts
• Approaches to the study of general words and terms are converging
Theory of Terminology: Properties of Ideal Term
• the term must relate directly to the concept and express the concept clearly,
• there should be no synonyms, whether absolute, relative or apparent,
• the contents of terms should be precise and not overlap in meaning with other terms,
• the meaning of the term should be independent of context.
Theory of terminology: serious difference between a general word and a term
• Biunivocal relationship between concepts and terms in each special field of knowledge
• For terminology, nothing could be better than that: no synonymy, no homonymy and no polysemy
• A huge gap between general words and terms
BUT!
Term Formation and Words of General Language
• A general sense and the terminological senses of a word are really different: “function” as a general word, “function” in mathematics, “function” in biology
• Cruse: “senses of a lexical form are antagonistic to one another; that is to say, they can not be brought into play simultaneously without oddness”
A word and a term are very similar in meaning
• arson - Law. the malicious burning of another's house or property, or in some statutes, the burning of one's own house or property, as to collect insurance (Random House Unabridged Dictionary)
• A general dictionary uses a very strict definition
How to distinguish terminological and general senses
• Teacher in court accused of school arson
  A teacher charged with setting fire to a West Yorkshire school has appeared in court. Amina Ditta, 23, of Scholemoor Road, Bradford, has faced the city's magistrates court charged with one count of arson. The charge relates to an incident last Wednesday at Atlas Primary School in Manningham, where Ms Ditta was employed. She spoke only to confirm her personal details and was represented by barrister Mr Narinda Sekhon. She was granted conditional bail to return to court on June 12.
• (http://www.ananova.com/news/story/sm_579296.html)
Traditional point of view: definitions
• Traditional terminologists: definitions of terms are strict in comparison to glosses of general words
• Contemporary point of view: the degree of vagueness in term definitions is lower, but in many cases vagueness is inevitable.
• Example: taxation in Russian legislation, new construction vs. repair
How many general and terminological senses are so close? - 1
• Building - relatively permanent enclosed construction over a plot of land, having a roof and usually windows and often more than one level, used for any of a wide variety of activities, as living, entertaining, or manufacturing (Unabridged Webster dictionary)
• Domains
  • Construction industry
  • Domain of public utilities
• It is impossible to separate senses
• Practically all denotations are the same
How many general and terminological senses are so close? - 2
• means of transportation, job positions, technical devices, food, agricultural plants and animals, other natural objects, artworks and others:
  • they are produced by professionals
  • we use them in everyday life
• social, political and economic processes:
  • they are planned or restricted by professionals
  • our life is influenced by them
General words and terminologies
• The intersection is significant
• 40-50 percent of the words in general dictionaries belong to the intersection area
• We call this intersection area the sociopolitical domain (the domain of social life): it describes the everyday life of contemporary society
The sociopolitical domain and domains in WordNet
• Many researchers have proposed sets of domains for WordNet and EuroWordNet
• The sociopolitical domain is approximately equal to the sum of the proposed domains
• A synset is related to the sociopolitical domain if there is a professional domain (not a science) that has a term with a very similar sense (allowing for some vagueness)
• Emotions and feelings do not belong to the sociopolitical domain
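The membership criterion on this slide can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy domains, the keyword sets, and in particular the `similar` function, which stands in for "very similar sense" with a simple keyword-overlap measure. The slides do not say how similarity of senses is actually judged.

```python
from dataclasses import dataclass

# Hypothetical toy data: professional domains, flagged as science or not.
DOMAIN_IS_SCIENCE = {"law": False, "construction": False, "mathematics": True}


@dataclass(frozen=True)
class Term:
    name: str
    domain: str
    gloss_keywords: frozenset


@dataclass(frozen=True)
class Synset:
    words: tuple
    gloss_keywords: frozenset


def similar(a, b, threshold=0.5):
    """Assumed stand-in for 'very similar sense': Jaccard overlap of gloss keywords."""
    union = a | b
    return bool(union) and len(a & b) / len(union) >= threshold


def in_sociopolitical_domain(synset, terms):
    """True if some term of a non-science professional domain has a very similar sense."""
    return any(
        not DOMAIN_IS_SCIENCE[t.domain]
        and similar(synset.gloss_keywords, t.gloss_keywords)
        for t in terms
    )


# Illustrative terms and synsets (keyword sets are invented for the example).
terms = [
    Term("arson", "law", frozenset({"malicious", "burning", "property", "house"})),
    Term("function", "mathematics", frozenset({"mapping", "set", "element"})),
]
arson = Synset(("arson",), frozenset({"malicious", "burning", "property"}))
function = Synset(("function",), frozenset({"mapping", "set", "element"}))
joy = Synset(("joy",), frozenset({"feeling", "happiness"}))

arson_in = in_sociopolitical_domain(arson, terms)        # similar legal term exists
function_in = in_sociopolitical_domain(function, terms)  # only a science term matches
joy_in = in_sociopolitical_domain(joy, terms)            # no professional term matches
```

The threshold makes the "allowing for some vagueness" part explicit: a lower value admits looser correspondences between a general sense and a professional term.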
Multiword terms from specific domains
• A lot of multiword terms from professional domains are understandable to native speakers:
  • Multinational country
  • Single member constituency
  • Amicable agreement
  • Global market
  • Criminal omission
• Special criteria for inclusion of multiword expressions
Features of Sociopolitical Domain-1
• Texts of various genres (official documents, international treaties, legislative documents, newspaper articles) are related to the sociopolitical domain.
• Development of a unified linguistic resource for automatic processing of such varied texts
• A broad basis for development of domain-specific resources
Features of Sociopolitical Domain-2
• Inclusion of multiword terms facilitates disambiguation procedures
• Ambiguity within the domain is much lower than in the whole resource, and distinctions between senses are more definite and more important, so different disambiguation procedures can be used inside and outside the sociopolitical domain
• Procedures for identifying lexical cohesion and lexical chains can also differ for synsets inside and outside the sociopolitical area, because concepts in the sociopolitical domain are thematically more definite (“privatization” vs. “creation”)
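One way to see why multiword terms help disambiguation is a greedy longest-match lookup, sketched below. This is an illustration under assumptions, not the procedure used by the authors: the `thesaurus` dictionary, its concept labels, and the `match_terms` function are all hypothetical. The point is only that matching "global market" as a unit avoids having to choose among the senses of the single word "market".

```python
def match_terms(tokens, thesaurus, max_len=4):
    """Greedy left-to-right longest match of thesaurus entries over a token list."""
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in thesaurus:
                result.append((candidate, thesaurus[candidate]))
                i += n
                break
        else:
            # No entry starts at this token; move on.
            i += 1
    return result


# Hypothetical entries: each maps to its candidate concepts (labels are invented).
thesaurus = {
    "market": ["MARKET (economic system)", "MARKETPLACE (trading venue)"],
    "global market": ["GLOBAL MARKET"],
    "agreement": ["AGREEMENT (contract)", "AGREEMENT (consent)"],
    "amicable agreement": ["AMICABLE AGREEMENT"],
}

tokens = "The parties reached an amicable agreement on the global market".split()
matches = match_terms(tokens, thesaurus)
# Entries left with more than one candidate concept still need disambiguation.
ambiguous = [entry for entry, concepts in matches if len(concepts) > 1]
```

In this toy input both ambiguous single words are absorbed by unambiguous multiword entries, so `ambiguous` ends up empty; a sentence containing "market" on its own would still require a separate disambiguation step.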
Experience of Work in Sociopolitical Domain
• Project University Information System RUSSIA (www.cir.ru): 800 thousand Russian documents (after 1991)
• Russian thesaurus on sociopolitical life (since 1994): a concept-based network of 30 thousand concepts, 75 thousand words and terms
• Automatic text processing since 1995: text categorization, automatic conceptual indexing, text summarization
University Information System RUSSIA (www.cir.ru): 800,000 documents / 7.5 Gb
Socio-Political Domain vs. Lexicon
[Diagram. Labels: Sciences; Lexicon (110,000 text entries, 50,000 concepts); Socio-Political Domain (75,000 text entries, 30,000 concepts). Axis: Levels of Hierarchy]
Specific Domains vs. Socio-Political
[Diagram. Labels: Socio-Political Domain; Elections; Geography; Industrial Production. Axis: Levels of Hierarchy]
Interrelations between Socio-Political Domains
[Diagram. Labels: Socio-Political Domain; Taxation; Law; Accounting; Banking. Axis: Levels of Hierarchy]
Sciences vs. Socio-Political Domain
[Two diagrams: Social Sciences and the Socio-Political Domain; Natural Sciences and the Socio-Political Domain]
Specific applications of Sociopolitical thesaurus
• Terms of economics and sociology were included: automatic text categorization of scientific papers (700 categories of JEL, the Journal of Economic Literature subject headings)
• Terms of non-production spheres were added: automatic text categorization of Russian legislation (3000 categories of the commercial subject headings system)
Conclusions-1
• The border between the general language lexicon and the terminologies of specific domains is neither sharp nor abrupt.
• It looks more like a broad strip that contains general language senses practically coinciding with concepts of social subdomains, as well as concepts of specific domains that are understandable to native speakers
Conclusions-2
• A detailed description of concepts, terms and words from this “transition area”, called the “sociopolitical domain”, can be naturally added to a wordnet's semantic network
• It can facilitate the solution of such problems as lexical disambiguation and identification of text structure, enhance the coverage of domain-specific texts by wordnet synsets, and improve the effectiveness of wordnet use in various automatic text processing applications