The Harmony of Music and Computing Expanding a Domain-Specific Database Jantine Trapman
Overview • Components • LT4eL • Cornetto • Creation / expansion of Music Ontology • Automatic Creation • Watson • Prompt • Mapping • Music Ontology • Cornetto
Components • LT4eL • Cornetto
Components: LT4eL Language Technology for eLearning www.lt4el.eu • Development of search and management facilities in the LMS: • Keyword Extractor • Glossary Candidate Finder • Semantic Search
Semantic Search • Based on: • (multilingual) documents (LOs) for eight languages • semantic annotation of LOs • ontology • lexicon for each language involved • Corpus and ontology are restricted to Computing domain
Computing Ontology (1) • Creation: • Manually annotated keywords in eight languages extracted from LOs • Translated into (English) concepts • Definitions collected on the WWW and added to concepts • Extension with additional concepts from: • Restrictions on existing concepts • Superconcepts of existing concepts • Missing subconcepts • Annotation of LOs
Computing Ontology (2) • Domain ontology: • Domain: Computing • Manually created • 1406 concepts • 50 from DOLCE • 250 intermediate concepts from OntoWordNet • Use: • Lexicon development for 8 languages • Semantic annotation of LOs • LO indexing • [Diagram labels: DOLCE; WordNet; Computing; LT4eL lexicons: German, Polish, Romanian, Maltese, Portuguese, English, Bulgarian, Czech, Dutch]
Computing Lexicon • Concepts were translated into all languages • Each entry contains three types of information: • Concept (and superconcept): CDDrive (is-a Drive) • Definition: a drive that reads a compact disc and that is connected to an audio system • Set of terms in a given language: CD-speler, CD drive
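The three kinds of information in a lexicon entry can be pictured as a small record. This is only a minimal sketch, not the LT4eL storage format: the class name `LexiconEntry`, the helper `concepts_for_term` and the language code `"nl"` are assumptions made for illustration, while the CDDrive values come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LexiconEntry:
    """One lexicon entry: concept, superconcept, definition, terms per language."""
    concept: str
    superconcept: str
    definition: str
    # Hypothetical layout: language code -> terms that lexicalise the concept.
    terms: Dict[str, List[str]] = field(default_factory=dict)


# Values from the slide; filing the terms under "nl" is an assumption.
cd_drive = LexiconEntry(
    concept="CDDrive",
    superconcept="Drive",
    definition="a drive that reads a compact disc and that is connected to an audio system",
    terms={"nl": ["CD-speler", "CD drive"]},
)


def concepts_for_term(entries, term, language):
    """Return the concepts that list `term` (case-insensitively) for `language`."""
    wanted = term.casefold()
    return [
        entry.concept
        for entry in entries
        if any(t.casefold() == wanted for t in entry.terms.get(language, []))
    ]
```

A lookup such as `concepts_for_term([cd_drive], "cd-speler", "nl")` then goes from a term in a document to the concept it annotates, which is the direction semantic search needs.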
Expansion of the LT4eL KB • Future: more domains needed • Task: • Expansion of ontology and lexicons • Preferably semi-automatic • Three options: • Top-down • Bottom-up • Both, ingredients: • Cornetto, WordNet • Music ontology • Watson, Prompt
Cornetto • Combinatorial and Relational Network as Toolkit for Dutch Language Technology • Referentie Bestand Nederlands (RBN): lexical units • Dutch part of EuroWordNet, Dutch WordNet (DWN): synsets • SUMO/MILO plus extensions: terms and axioms • Core: table of Cornetto Identifiers (CIDs) http://www.let.vu.nl/onderzoek/projectsites/cornetto/index.html • [Diagram labels: SUMO/MILO; WordNet; Dutch WordNet (DWN); Referentie Bestand Nederlands (RBN); Cornetto Database]
Example Lexical Entry Cornetto (1) • [noun] zanger:1c_n-42316 • Morphology: type:derivation; structure:zingen[*er]; plurforms:zangers • Syntax: gender:m/f; article:de • Semantics: reference:common; countability:count; type:human; subclass:beroepsnaam/beoefenaar; resume:iemand die zingt • Pragmatics: domain:muz
Example Lexical Entry Cornetto (2) • Combinatorics zanger1: • De redacteur van het woordenboek was ook een zanger • De zanger van de band • SUMO: (+, , hasSkill) • Synonyms: zanger, zangeres HAS_HYPERONYM musicus, musicienne, muzikant HAS_HYPONYM baszanger, sopraan, blueszanger, charmezanger, ... • Equivalence relations:EQ_SYNONYM singer, vocalist, vocalizer, vocaliser /ENG20-09908715-n link with WordNet 2.0! • WordNet Domains: music
Tasks • Extract music related terms from Cornetto • Create a domain ontology for Music • Map between terms from lexicon and concepts in ontology • Map music ontology to OntoWN and DOLCE • Adjust Cornetto data to LT4eL format
Questions (1) • How, and to what extent, can we automate the process of ontology building? • How can we profit from existing Semantic Web resources to enrich ontologies? • To what extent do Watson and PROMPT support the reuse of existing resources?
Music Ontology • Automatic Creation • Expansion with: • Watson • Prompt
Automatic Creation (1) • (Basili et al. 2007): automatic ontology extraction from open-domain corpus (BNC) • Designed for three tasks: • lexical ambiguity resolution within a specific domain • restricting a set of terms to a subset relevant for an ontology to be constructed • expanding this new ontology with other, novel and relevant concepts, relations and instances.
Automatic Creation (2) • Preprocessing: • Corpus split in 40 sentence text segments • PoS tagging • Filtering of noun phrases • General steps: • Term extraction through Latent Semantic Analysis (Deerwester et al. 1990) • Ontology extraction from WordNet based on Conceptual Density (Agirre and Rigau 1996)
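The first preprocessing step, splitting the corpus into 40-sentence segments, could look roughly as follows. This is an illustrative sketch only: the regex-based sentence splitter and the function names are assumptions, and the procedure of Basili et al. is not reproduced here. PoS tagging, noun-phrase filtering and LSA would follow on the resulting segments.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (an assumption; a real pipeline would use a tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def segment(sentences, size=40):
    """Group sentences into consecutive segments of at most `size` sentences."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


# Toy corpus of 95 sentences: two full segments plus a remainder.
corpus = " ".join(f"Sentence {i}." for i in range(95))
segments = segment(split_sentences(corpus))
```

Each segment then serves as one "document" in the term-by-segment matrix that LSA decomposes.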
Music Ontology (Basili et al. 2007) • 46 primitive classes • Leaf concepts have a synset ID from WordNet • No properties, only super-/subconcept relation • So: a rather small and shallow ontology • Hence: expansion by exploiting Semantic Web techniques
Watson (1) http://watson.kmi.open.ac.uk/WatsonWUI/ • Every URI is clickable: all resources are available • Information about: • Size • Representation language • Number of classes, properties, individuals etc. • Review rating • Interface for SPARQL queries • Possibility of (upwards) navigation
Watson (2) • Also available as • Protégé plug-in (under development) • API • New concepts can be added • Manually • One by one • Much human action required • Faster than creation from scratch, but still a tedious exercise
Watson (3) • Watson provides: • a list of URIs of available semantic databases • a list of candidate concepts • What is still lacking: • a (semi-)automatic way to merge or align new concepts or ontologies with an existing one • Possible solution: Prompt
PROMPT (1) http://protege.stanford.edu/plugins/prompt/prompt.html • Protégé plug-in • Functionalities: • Comparison • Inclusion • Merging • Alignment • Requirement: ontologies to be merged etc. must be available offline • Prompt goes beyond purely syntactic matching • Evaluation shows that experts followed 90% of Prompt’s suggestions
Prompt (2) • Saves time and effort: • linguistically similar classes are found quickly • inherited properties and subclasses can be added automatically • similar structures are automatically detected • automatic consistency check • Resources must use exactly the same markup language • Merging: • faster but more complex • requires good insight into the resources
Mapping • Music Ontology • Cornetto
Resources • Music Ontology: • Some nodes have a WordNet ID (from the automatic process) • Many do not, especially those added with Watson • Cornetto entries: • have a synset ID from Dutch WN • have a mapping to a WordNet entry through equivalence or near-equivalence e.g.
Questions (2) • To what extent does WordNet support a mapping between: • The Cornetto lexicon and a newly created ontology partly based on WordNet; • The existing ontology and lexicon from LT4eL, and Cornetto + ontology?
Procedures • A concept either has a WN synset ID or it does not • Mapping via WordNet synset ID: • Look up the synset ID in Cornetto • Establish the related DWN synset(s) • Results: no problems so far, although near-equivalence relations are expected to give mismatches • Mapping without synset ID: • Syntactic matching of the concept name with terms from WordNet synsets • Compare definitions and glosses
Examples “easy match” • zanger:1 d_n-20810 (iemand die zingt) is [EQ_SYNONYM] of: singer, vocalist, vocalizer, vocaliser /ENG20-09908715-n (a person who sings) • strijkkwartet:1 d_n-14287 (ensemble van vier strijkers [ensemble of four string players]) and strijkkwartet:2 d_n-19905 (ensemble voor vier strijkers [ensemble for four string players]) are [EQ_NEAR_SYNONYM] of: soloist:1/ENG20-09931035 • Note: Cornetto contains mismatches between WN and DWN (here a string quartet is linked to soloist)
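Mapping via synset ID amounts to an index lookup. The sketch below assumes the Cornetto equivalence relations have already been exported as simple tuples; that export format and the function names are hypothetical. The IDs are the ones shown on the slides. Exact and near equivalences are kept apart because the near ones are where mismatches are expected.

```python
from collections import defaultdict

# (DWN synset ID, lemma, relation name, WordNet 2.0 ID); IDs from the slides.
CORNETTO_EQUIVALENCES = [
    ("d_n-20810", "zanger:1", "EQ_SYNONYM", "ENG20-09908715-n"),
    ("d_n-14287", "strijkkwartet:1", "EQ_NEAR_SYNONYM", "ENG20-09931035"),
]


def build_index(rows):
    """Index the equivalence relations by WordNet synset ID."""
    index = defaultdict(list)
    for dwn_id, lemma, relation, wn_id in rows:
        index[wn_id].append((dwn_id, lemma, relation))
    return index


def map_concept(wn_id, index):
    """Return (exact, near): DWN synset IDs linked to `wn_id`, split by relation."""
    linked = index.get(wn_id, [])
    exact = [dwn for dwn, _, rel in linked if rel == "EQ_SYNONYM"]
    near = [dwn for dwn, _, rel in linked if rel == "EQ_NEAR_SYNONYM"]
    return exact, near


index = build_index(CORNETTO_EQUIVALENCES)
```

Under this design, anything that only appears in the `near` list, such as the strijkkwartet/soloist link, can be routed to manual review instead of being accepted automatically.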
Matching without ID (1) • For each owl:Class in the Music ontology • try to match with: • the target attribute in a relation element of the Cornetto XML structure, where • the attribute relation_name is (EQ_)NEAR_SYNONYM • Add the synset ID to the concept (for mapping to OntoWordNet) • E.g.:
<owl:Class rdf:about="http:///myOntos/music.owl#orchestra"/>
<relation relation_name="EQ_NEAR_SYNONYM" target20-previewtext="symphony orchestra:1, symphony:2" version="pwn_1_6" target20="ENG20-07750308-n" target="ENG16-06123240-n">
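One possible reading of this step, sketched with Python's standard XML parser. Several things are assumptions here: the `<cornetto>` wrapper element, the self-closed `relation` element, and above all the match rule (every word of the class name must occur in one lemma of `target20-previewtext`). The slide does not spell out the comparison, so this is a stand-in, not the method actually used.

```python
import re
import xml.etree.ElementTree as ET

RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"
OWL_CLASS = "{http://www.w3.org/2002/07/owl#}Class"

ONTOLOGY = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http:///myOntos/music.owl#orchestra"/>
</rdf:RDF>"""

# The <cornetto> wrapper is hypothetical; the relation attributes are from the slide.
CORNETTO = """<cornetto>
  <relation relation_name="EQ_NEAR_SYNONYM"
            target20-previewtext="symphony orchestra:1, symphony:2"
            version="pwn_1_6" target20="ENG20-07750308-n"
            target="ENG16-06123240-n"/>
</cornetto>"""


def lemmas(previewtext):
    """'symphony orchestra:1, symphony:2' -> ['symphony orchestra', 'symphony']."""
    return [re.sub(r":\d+$", "", part.strip()).lower() for part in previewtext.split(",")]


def match_classes(ontology_xml, cornetto_xml):
    """Map class names to WordNet 2.0 synset IDs via near-synonym relations."""
    relations = [
        rel for rel in ET.fromstring(cornetto_xml).iter("relation")
        if rel.get("relation_name") in ("EQ_NEAR_SYNONYM", "NEAR_SYNONYM")
    ]
    matches = {}
    for cls in ET.fromstring(ontology_xml).iter(OWL_CLASS):
        name = cls.get(RDF_ABOUT).rsplit("#", 1)[-1].replace("_", " ").lower()
        for rel in relations:
            candidates = lemmas(rel.get("target20-previewtext", ""))
            # Assumed rule: all words of the class name occur in one lemma.
            if any(set(name.split()) <= set(lemma.split()) for lemma in candidates):
                matches[name] = rel.get("target20")
                break
    return matches
```

The returned synset ID is what would be added to the owl:Class for the later mapping to OntoWordNet.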
Matching without ID (2) • Compare definitions and glosses: • many ontology classes have a definition • each WN synset has a gloss • preprocess: stemming and filtering nouns • Consider percentage of nouns in concept definition that match with a certain gloss • Evaluate results • Note: some definitions are equal to WN glosses
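The overlap score can be sketched as the share of content words in a concept definition that also occur in a gloss. Two simplifications are assumptions, not the procedure from the slides: a stopword list stands in for real noun filtering (which needs a PoS tagger), and crude suffix stripping stands in for a real stemmer.

```python
import re

# Stand-in for noun filtering: drop a few function words (assumption).
STOPWORDS = {"a", "an", "the", "of", "that", "who", "and", "is", "to", "for", "in"}


def stem(word):
    """Very crude suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def content_stems(text):
    """Lower-case, tokenise, drop stopwords, stem."""
    return {stem(w) for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def overlap(definition, gloss):
    """Fraction of the definition's content stems that also occur in the gloss."""
    def_stems = content_stems(definition)
    if not def_stems:
        return 0.0
    return len(def_stems & content_stems(gloss)) / len(def_stems)
```

For each class, the synset whose gloss scores highest would be the candidate to evaluate. A definition identical to the WordNet gloss, as the note on the slide mentions, scores 1.0.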
Current work • Matching without ID on class name and definitions/glosses • Manually check results for precision and recall • Problem: MWEs, e.g. class Brass_Instrument: • has no precise WN counterpart, but • Brass does exist, but • it has multiple senses: how can we disambiguate? • Question: an ID allows an easy and reliable match, but can we do the task without one?
Remaining and Future work • Adapting the lexicon format to the LT4eL format • Mapping to OntoWordNet (semi-automatic) • Mapping to DOLCE (manual task) • Ontology evaluation • Experiments with WordNets from different languages • Use additional lexical information to improve the LT4eL search engine, e.g. morphological information about plural forms