Lomonosov Moscow State University, Research Computing Center, Center for Information Research. Problems of Ontology Development for a Broad Domain. Loukachevitch Natalia, louk_nat@mail.ru, Leading Researcher of Lomonosov Moscow State University.
Technologies • Ontologies for Natural Language Processing and Information Retrieval Applications • Applications • Conceptual indexing • Query expansion • Text categorization • Document clustering • Question answering • Automatic summarization • Linguistic ontologies • RuThes thesaurus (52 thousand concepts, 150 thousand words and expressions) • Ontology on Natural Sciences and Technologies (60 thousand concepts) • Banking thesaurus for information retrieval applications, etc.
Projects of Our Research Group-1 • State bodies • Central Bank of the Russian Federation (2006 – ..) • Development of banking thesaurus, conceptual indexing, text categorization • Central Election Committee of the RF (1999 – ..) • Information retrieval system, conceptual indexing, text categorization • State Duma of the RF (1999 – ..) • Information retrieval system on Duma records • Accounting Chamber of the RF (2003) • Creation of a terminology dictionary • Other state bodies • Text categorization, clustering, development of domain-specific ontologies
Projects of Our Research Group-2 • Commercial organizations • Rambler Media company (2007– ..) • Automatic clustering, categorization, and summarization of the news flow • Personalization of news and advertisements • Spam detection • Information extraction • Garant Legal Information Company (2002 – …) • Text categorization of legal documents • Summarization of court decisions • Learning to rank in information retrieval • etc.
Plan of Tutorial • Ontologies: general remarks • Main paradigms and their problems • Level of formalization • Broad vs. simple domains • Boundaries of a domain • Main source of knowledge - texts • Domain-specific texts • Concepts and terms, term extraction • Synonyms and near-synonyms • Ambiguity of terms • Establishing relations • Example: Ontology-based text categorization
Domains and Tasks • Ontology vs. machine learning? • Description of domains is difficult • Data may need generalization • Some knowledge may already be described in ontology-based resources • Therefore, for many tasks we need • Ontology + machine learning
Ontologies: general remarks • Ontology - formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts • Main components: • Concepts (classes) • Instances (individuals) • Relations • Attributes • Axioms (rules)
[Figure: taxonomy diagram showing classes (object, organism, animal, mammal, cat, siamese, frog) and instances]
Ontology development paradigms • Formal, logically sound ontologies • Logical inference • Some domains are difficult to formalize • Inconsistency is a huge problem • Semantic Web • Many specific ontologies • RDF triples, same-as links • A lot of “messy” data • Ontologies for natural language processing • Less formal • Related to language semantics • Formalization is limited by the current state of natural language processing
Ontology-1: Ontology Spectrum (Obrst, 2006) • From weak semantics to strong semantics, from less to more expressive • Taxonomy: “is sub-classification of” • Thesaurus: “has narrower meaning than” • Conceptual model: “is subclass of” • Logical theory: “is disjoint subclass of, with transitivity property” • Formalisms along the spectrum: relational model, XML • DB schemas, XML Schema • ER, extended ER • RDF/S, XTM • UML • Description logic, DAML+OIL, OWL • First-order logic • Modal logic • Interoperability levels: syntactic, structural, semantic
Ontology-2: Semantic Web. Linking Open Data project • http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
Approach 3. Ontologies for Natural Language Processing • Relations between concepts and lexical meanings are quite complex • How to represent synonyms and near-synonyms • In how much detail the senses of ambiguous words should be represented • Large volume vs. complexity of description • WordNet as a symbol of this approach • (!) Different tasks require different types of ontologies
Complicated vs. simple domains • Simple domains (wine ontology) • Explicit boundaries • Boundaries are determined by “physical processes”, e.g. production, services • Clear roles of entities • A small number of classes (which may have many instances) or many uniform classes • Complicated domains (terrorism, financial control) • Vague boundaries • The same entities are used in different roles and functions • Knowledge is stored in text documents
Wine ontology • http://www.w3.org/TR/owl-guide/wine.rdf • [Figure: fragment of the wine ontology with the classes Wine, WhiteWine, RedWine, SweetWine, TableWine, WhiteLoire, WhiteBurgundy, WhiteBordeaux, Region, Grape, Meal course]
Complicated domains: vague boundaries • Interdisciplinarity • State financial control (economy + law + finances) • Counter-terrorism (criminal law + international law + constitutional law + state bodies + buildings + vehicles + weapons…) • Two main parts • Center of the domain • Additional concepts from neighboring domains
Boundaries of a domain: Terrorism • Center of the domain • Terrorist acts, groups, terrorists • Anti-terrorist activity • Additional spheres • Geographic places • Weapons and explosives • Transport • Financial payment • Ideology, religion, etc. • Re-use of ontologies?
Problem: Distortion of Reality • General concepts necessary for the domain description are treated as subordinates of domain concepts • The name of a concept is general, but its intended sense is domain-specific • Law (= antiterrorist law) • Intelligence (= antiterrorist intelligence) • Problems in ontology mapping and ontology reuse • Thesaurus on radiological terrorism • http://www.jasonmorrison.net/content/2004/a-thesaurus-for-radiological-terrorism-research/
Ontology Development and Domain-Specific Texts • Knowledge is stored in texts • Domain-specific text collection • As many texts as possible • Necessary to find exact boundaries • Automatic extraction of terms from texts (term acquisition) • Terms are expressions corresponding to concepts of a specific domain • Top-level modeling • Use of existing ontologies
Automatic Term Acquisition from Texts • Linguistic criteria (noun groups) • Lexical restrictions (e.g., evaluative words such as “good” and “bad” are rarely parts of terms) • Statistical criteria (frequency, mutual information, and many others) • (!) Use of machine learning approaches to improve term extraction • Formation of an ordered list of term candidates
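The statistical and lexical criteria above can be sketched in a few lines. This is a minimal illustration, not the extraction method used in the projects described here: it ranks two-word candidates by frequency and pointwise mutual information, and it replaces the linguistic noun-group filter with a simple exclusion list. The function name `rank_bigram_candidates` and its parameters are assumptions made for the example.

```python
import math
from collections import Counter


def rank_bigram_candidates(tokens, excluded_words=frozenset(), min_freq=2):
    """Return (phrase, frequency, PMI) tuples, most frequent first.

    excluded_words plays the role of the lexical restrictions
    (function words, evaluative words such as "good" and "bad").
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    candidates = []
    for (first, second), freq in bigrams.items():
        if freq < min_freq:
            continue
        if first in excluded_words or second in excluded_words:
            continue
        p_pair = freq / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        pmi = math.log2(p_pair / (p_first * p_second))
        candidates.append((f"{first} {second}", freq, pmi))

    # Ordered list of term candidates: frequency first, PMI as tie-breaker.
    candidates.sort(key=lambda item: (-item[1], -item[2]))
    return candidates


tokens = "federal budget of the federal budget and good law federal budget".split()
ranked = rank_bigram_candidates(tokens, excluded_words={"of", "the", "and", "good"})
```

A real pipeline would add part-of-speech filtering for noun groups and longer phrases; the resulting ordered list would then go to domain experts for review.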
The most frequent phrases in documents of financial control domain • Translation from Russian • Federal budget • Russian Federation • Accounting Chamber • Federal law • Overall sum (-) • Resources of federal budget (?) • Oblast budget • Financial means • Use of financial means (?) • Wages • Ministry of finance • Budget resources • Tax body
Analysis of Term-Candidate List • At the beginning of the list there are many evident terms • Further down there are many unclear expressions • Are they terms? (domain experts can have different opinions) • Are they related to the domain? • Where is the boundary of the domain? • Many synonymous variants • Ambiguity of terms
Boundaries of the domain • Bottom-up + top-down • Term extraction from texts is the bottom-up stage • The extracted expressions are needed to understand what types of entities the domain requires; in effect, this is the design of a top-level taxonomy • Top-down analysis • Combined approach to concept selection (frequency in the collection + top-level taxonomy restrictions)
Synonyms and variants of “money laundering” • CRIMINAL LAUNDERING • ILLEGAL LAUNDERING • LAUNDERING • LAUNDERING ACTIVITIES • LAUNDERING OF MONEY • LAUNDERING OPERATIONS • MONEY LAUNDERING • MONEY LAUNDERING ACTIVITIES • MONEY LEGALIZATION • MONEY WASHING • PROFIT LAUNDERING • PROFIT WASHING
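One way to handle such variants is to attach all of them to a single concept and look them up during indexing. The sketch below assumes a hypothetical concept label `MONEY LAUNDERING` and a simple greedy longest-match lookup over whitespace tokens; it is an illustration only, not the indexing procedure of any particular system.

```python
# Text entries from the slide above, all attached to one (assumed) concept.
CONCEPT_TERMS = {
    "MONEY LAUNDERING": [
        "criminal laundering", "illegal laundering", "laundering",
        "laundering activities", "laundering of money", "laundering operations",
        "money laundering", "money laundering activities", "money legalization",
        "money washing", "profit laundering", "profit washing",
    ],
}


def build_term_index(concept_terms):
    """Map every lower-cased variant to its concept."""
    return {
        term.lower(): concept
        for concept, terms in concept_terms.items()
        for term in terms
    }


def index_text(text, term_index, max_len=3):
    """Greedy longest-match lookup of known variants in a text."""
    tokens = text.lower().split()
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in term_index:
                found.append((phrase, term_index[phrase]))
                i += n
                break
        else:
            i += 1  # no variant starts at this token
    return found


term_index = build_term_index(CONCEPT_TERMS)
hits = index_text(
    "the bank reported money laundering activities and profit washing",
    term_index,
)
```

Both phrases found in the sample sentence map to the same concept, which is what allows conceptual indexing to treat them as one unit.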
Lexical ambiguity • Homonyms are words that share the same spelling but have different meanings (unrelated in origin) • bank (financial institution vs. river bank) • Rarely met within the same domain, unless the domain is broad • Easily recognized by non-linguists • Different concepts, different sets of relations • Polysemes are words with the same spelling and distinct but related meanings • bank (financial institution vs. building) • Very often met in any domain • Regular polysemes (institutions and their buildings) • Difficult for non-linguists to recognize • Tendency to use the same ontology concept for related senses
Lexical ambiguity (polysemes) • Transport • “They have succeeded in stopping the transport of live animals” (= moving) • “mechanism of contactless payment in public transport” (= vehicles) • Regular polysemy • Tree – wood (material): birch • Non-linguists often fail to recognize the different senses but notice strange deviations in relations
Lexical ambiguity (polysemes) • How to help yourself: use unambiguous synonymous phrases • Transport1 = transportation process • Transport2 = transport vehicle • Birch1 = birch tree • Birch2 = birch wood • This makes it possible to see the different entities behind closely related senses
Relations of an ontology • The set of relations of an ontology can be non-evident • Main relations • Class-subclass • Instance relation • Role relations • Different properties: transitivity, etc. • Old AI books and manuals: the same relation in all cases – “is_a” • The diagnostic expression “X is a Y” can seem appropriate in all of these cases, so it does not distinguish between them
Class-subclass relation • Relation between two sets of entities (classes), many-to-many: birch – tree • Properties: transitivity, inheritance • Rules: • If class A is a subclass of class B, then each instance of class A is also an instance of B • Top-level classes (categories) should coincide for A and B • Real example of a mistake: • river – water object – water – substance -> • Is the Moscow river a substance?
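The two rules can be checked mechanically. The sketch below encodes the faulty chain from the slide and flags every subclass link whose two ends fall into different top-level categories. The category labels in `TOP_CATEGORY` are assumptions made for this illustration; they are not taken from any particular ontology.

```python
# Subclass links from the faulty example on the slide.
SUBCLASS_OF = {
    "river": "water object",
    "water object": "water",  # the questionable link
    "water": "substance",
}

# Top-level category of each class (assumed labels for this sketch).
TOP_CATEGORY = {
    "river": "physical object",
    "water object": "physical object",
    "water": "substance",
    "substance": "substance",
}


def superclasses(cls):
    """Follow subclass links upward; transitivity makes all of them apply."""
    chain = []
    seen = {cls}
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        if cls in seen:  # guard against accidental cycles
            break
        seen.add(cls)
        chain.append(cls)
    return chain


def inconsistent_links():
    """Subclass links whose two ends belong to different top-level categories."""
    return [
        (child, parent)
        for child, parent in SUBCLASS_OF.items()
        if TOP_CATEGORY[child] != TOP_CATEGORY[parent]
    ]


river_chain = superclasses("river")
bad_links = inconsistent_links()
```

Because the relation is transitive, every instance of river (such as the Moscow river) inherits "substance" through the chain, and the category check points at the single link that causes this.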
Instance relation • One-to-many relation • Moscow river – instance of river • Teacher – instance of profession • Not transitive • Rex, poodle, dog breed, dog – what are the relations? • Rex is an instance of poodle • Poodle is an instance of dog breed • Poodle is a subclass of dog • Rex is not a dog breed • Rex is a dog • [Figure: Rex is an instance of Poodle; Poodle is a subclass of Dog and an instance of Dog breed; the instance link from Rex to Dog breed is crossed out]
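The Rex example can be reproduced with two separate relation tables. In this minimal sketch (the function names are illustrative), instance membership is propagated along subclass links but never along a second instance link, which is what makes the instance relation non-transitive.

```python
SUBCLASS_OF = {"poodle": {"dog"}}
INSTANCE_OF = {"Rex": {"poodle"}, "poodle": {"dog breed"}}


def superclasses(cls):
    """All classes reachable from cls through subclass links."""
    result = set()
    stack = [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


def is_instance_of(individual, cls):
    """True if cls is a direct class of the individual or one of its superclasses.

    Instance links are followed exactly once, so "Rex instance of poodle"
    and "poodle instance of dog breed" do not combine.
    """
    for direct_class in INSTANCE_OF.get(individual, ()):
        if direct_class == cls or cls in superclasses(direct_class):
            return True
    return False


rex_is_dog = is_instance_of("Rex", "dog")
rex_is_breed = is_instance_of("Rex", "dog breed")
poodle_is_breed = is_instance_of("poodle", "dog breed")
```

Here `rex_is_dog` and `poodle_is_breed` are true, while `rex_is_breed` is false, matching the statements on the slide.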
Roles and types • Roles: student, employer, terrorist, player • Types: person, animal, building, car • A role is a type under certain conditions • A student is a person in the role of learning • Properties of roles: • Roles are created dynamically • Roles can play other roles • A type can play many different roles
Confusion of type-role relations with class-subclass relations • A frequent mistake of almost every beginner • Not every person is an employer, and an organization is not an employer in all situations • Problems with inference • [Figure: class-subclass links between Employer and Organization and between Employer and Person, both crossed out]
Text-motivated confusion of types and roles • “Natural substances such as salt, sugar, vinegar, alcohol, .. are also used as traditional preservatives.” (Wikipedia) • “Often salt and other preservatives are added to canned foods.” (http://www.family-health-and-nutrition.com/this-vs-that.html) • What is the relation between salt and preservative? • Class-subclass? • Class-instance? • .. • In practice, beginners usually try to establish a class-subclass relation. However, this is a type-role relation: preservative is a role of substances.
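A simple way to keep the two kinds of relations apart is to store types and roles in separate structures. The sketch below is only an illustration: the entities "Ann" and "Acme" and the situation labels are invented for the example, and real ontologies model roles in richer ways.

```python
# Types: what an entity is in every situation.
TYPE_OF = {"salt": "substance", "Ann": "person", "Acme": "organization"}

# Roles: (entity, role, situation) facts that hold only in some situations.
PLAYS_ROLE = {
    ("salt", "preservative", "canning"),
    ("Ann", "student", "university"),
    ("Ann", "employer", "household"),
    ("Acme", "employer", "labour market"),
}


def is_a(entity, type_name):
    """Type membership; roles never count as types here."""
    return TYPE_OF.get(entity) == type_name


def roles_of(entity, situation=None):
    """Roles an entity plays, optionally restricted to one situation."""
    return {
        role
        for (name, role, where) in PLAYS_ROLE
        if name == entity and (situation is None or where == situation)
    }


salt_roles_in_canning = roles_of("salt", "canning")
ann_roles = roles_of("Ann")
```

With this separation, "salt is a substance" holds everywhere, while "salt is a preservative" holds only in the canning situation and never enters the class hierarchy.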
Automatic extraction of relations from texts • Many scientific publications: extraction of synonyms, taxonomies, part-whole relations, etc. • But in a complex domain it is impossible to rely fully on automatic tools • In many cases only evident relations are extracted • Causes • Multiword expressions • Ambiguity of language expressions • Contextual dependence • Need to process a very large domain text collection
Automatic text categorization • Main approaches • Knowledge-based methods (based on rules) • Machine learning methods – very popular at scientific conferences • Text categorization in real practice (operational text categorization) • A training collection should exist • Experts should categorize documents in a consistent way • Every category needs a sufficient number of training examples • In practice, knowledge-based systems are widely used • Reuters (provider of the well-known Reuters-21578 training collection) uses a knowledge-based system to categorize its own documents
Subjectivity of experts • Experts’ agreement in manual text categorization is around 60%
Our text categorization projects • Use of both approaches, depending on the task and data • The knowledge-based approach uses the knowledge of our large resource, the RuThes thesaurus • Projects • Classifier for Central Election Committee (450 categories, 4 levels) • Classifier of Russian legislation (1169 categories, 3000 categories) • Classifier of English economic research papers (700 categories) • Classifier of public opinion polls (350 categories) • Classifier of banking documents and news (200 categories) • General news classifiers • and others
Thesaurus on sociopolitical life • Sociopolitical domain: social life of contemporary society • Includes thematic vocabulary and terminology from such domains as economy, finance, defense, law, sport, arts, military conflicts, etc. • Domain for such documents as government documents, legal acts, international treaties, newspaper articles, news reports • 36 thousand concepts, 100 thousand terms, 140 thousand direct relations • Applications: conceptual indexing, automatic text categorization, document clustering, automatic text summarization, question answering
Socio-Political Domain • [Figure: levels of hierarchy, showing Taxation, Law, Accounting, and Banking within the socio-political domain]
Thesaurus-based text categorization • Use of knowledge described in the thesaurus • Manual description of Boolean expressions for categories, based on a small number of thesaurus concepts • Automatic thesaurus-based expansion of the Boolean expressions • Thesaurus-based thematic representation of the text content, independent of the genre and length of the text (lexical chain technique)
Describing a category with supporting concepts • Categorization of legal acts • 200.020.020. Heads of states summits • { ( HEADS OF STATES SUMMIT_Y ) • OR { ( NEGOTIATIONS_N ) ( INTERNATIONAL NEGOTIATIONS_Y ) ( INTERNATIONAL CONTACTS_N ) ( MEETING_N ) } AND ( HEAD OF STATE_L ) }
Expanded representation of the category • { ( HEADS OF STATES SUMMIT_Y ) • ( summit, summit meeting, top-level meeting, head of states meeting ) • OR { ( NEGOTIATIONS_N ) ( negotiations, talks ) ( INTERNATIONAL NEGOTIATIONS_Y ) ( international talks, interstate talks, diplomatic negotiations, multinational talks, intergovernmental talks, contracting nations, negotiating states … ) ( INTERNATIONAL CONTACTS_N ) ( international intercourse, transnational contacts … ) ( MEETING_N ) } AND ( HEAD OF STATE_L ) ( leader of country, president, president of country, federal president, RF president, US president, monarch, …, emir, emir of Kuwait … ) }
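How such an expanded Boolean expression might be applied to a text can be sketched as follows. This is an illustration under several assumptions: the expansion lists are shortened copies of those above, the entry for MEETING is invented because the slide gives none, the expansion-mode markers on the concepts are ignored, and matching is plain word-boundary search rather than thesaurus-based conceptual indexing with lexical chains.

```python
import re

# Shortened expansion lists; "meeting" for MEETING is an assumed entry.
EXPANSION = {
    "HEADS OF STATES SUMMIT": ["summit", "summit meeting", "top-level meeting"],
    "NEGOTIATIONS": ["negotiations", "talks"],
    "INTERNATIONAL NEGOTIATIONS": ["international talks", "diplomatic negotiations"],
    "INTERNATIONAL CONTACTS": ["international intercourse", "transnational contacts"],
    "MEETING": ["meeting"],
    "HEAD OF STATE": ["leader of country", "president", "monarch", "emir"],
}

# Category 200.020.020 as a nested (operator, operand, ...) expression.
SUMMIT_RULE = (
    "OR",
    "HEADS OF STATES SUMMIT",
    ("AND",
     ("OR", "NEGOTIATIONS", "INTERNATIONAL NEGOTIATIONS",
      "INTERNATIONAL CONTACTS", "MEETING"),
     "HEAD OF STATE"),
)


def concept_matches(concept, text):
    """True if any text entry of the concept occurs in the text."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text)
        for term in EXPANSION[concept]
    )


def evaluate(rule, text):
    """Recursively evaluate a Boolean category expression against a text."""
    if isinstance(rule, str):
        return concept_matches(rule, text)
    operator, *operands = rule
    results = (evaluate(operand, text) for operand in operands)
    return all(results) if operator == "AND" else any(results)


def belongs_to_category(text, rule=SUMMIT_RULE):
    return evaluate(rule, text.lower())
```

For example, `belongs_to_category("The president held talks with the emir of Kuwait.")` is true through the AND branch, while a sentence about talks between two companies does not match because no head-of-state entry occurs in it.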
ROMIP: Russian Seminar on Information Retrieval • Russian TREC • Text categorization task • Categories: DMOZ, 247 categories of the 2nd level, Top/World/Russian/*/* • Training collection: «DMOZ» (provided by Rambler) • 300 000 documents, 2100 sites • Testing collection: Belarusian Internet «BY.web» (provided by the Yandex company) • 1 500 000 documents, 19 000 sites • Our task: • Thesaurus-based text categorization • Measuring the time needed to create a categorization system • Evaluation