From Enumerative Structures in Texts towards Hierarchical Structures in Ontologies. Nathalie Aussenac-Gilles Mouna KAMEL – Bernard ROTHENBURGER IC3 team at IRIT, Toulouse ILIKS 2011, Aix-en-Provence. Semantic relations. lexico-syntactic patterns NP0 (is|are) NP 1.
From Enumerative Structures in Texts towards Hierarchical Structures in Ontologies
Nathalie Aussenac-Gilles, Mouna KAMEL, Bernard ROTHENBURGER
IC3 team at IRIT, Toulouse
ILIKS 2011, Aix-en-Provence
Semantic relations: lexico-syntactic patterns
NP0 (is|are) NP1
• The remaining elite American companies are Allen Edmonds and Alden Shoe Company.
• High-heeled footwear is footwear that raises the heels, …
• Slingbacks are shoes which are secured by a strap behind the heel, …
• Court shoes, known in the US as pumps, are typically high-heeled, …
• Platform shoe: shoe with very thick soles and heels
ILIKS 2011 - Hierarchical structures
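The pattern on this slide can be illustrated with a deliberately naive sketch. It assumes NP0 is everything before the first "is"/"are" and NP1 everything after it; a real extractor would rely on a noun-phrase chunker. The names `IS_A` and `match_is_a` are hypothetical and not taken from the talk.

```python
import re

# Naive rendering of the pattern NP0 (is|are) NP1 (assumption: no real
# noun-phrase detection, just the text on either side of the verb).
IS_A = re.compile(r"^(?P<np0>.+?)\s+(?:is|are)\s+(?P<np1>.+?)[.,…]?$")

def match_is_a(sentence: str):
    """Return (NP0, NP1) if the sentence fits the pattern, else None."""
    m = IS_A.match(sentence.strip())
    return (m.group("np0"), m.group("np1")) if m else None

if __name__ == "__main__":
    print(match_is_a("Slingbacks are shoes which are secured by a strap behind the heel"))
    # The pattern has nothing to say about a definition introduced by a colon.
    print(match_is_a("Platform shoe: shoe with very thick soles and heels"))
```

The second call returns `None`: the relation is carried by punctuation and layout rather than by a verb, which is the kind of case the rest of the talk addresses.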
Semantic relations: lexico-syntactic patterns ???
NP0 (is|are) NP1
• Variants include kitten heels (typically 1½-2 inches high) and stiletto heels (with a very narrow heel post) and wedge heels (with a wedge-shaped sole rather than a heel post).
• Ballet flats, known in the UK as ballerinas, ballet pumps or skimmers, are shoes with a very low heel …
Semantic relations: lexico-syntactic patterns ???
Semantic relations: TITLES
lexico-syntactic patterns ???
What's new?
• Going beyond the sentence
  - Discourse analysis (Rhetorical Structure Theory, Segmented Discourse Representation Theory, …)
• Using typo-dispositional clues
  - Logical structure of a text, Textual Architecture Model, …
• Typology of Enumerative Structures (ES)
  - Vertical and paradigmatic ES
• Translation: Enumerative Structure > Hierarchical Structure
• Evaluation
Discourse Structure
The key U.S. and foreign annual interest rates below are a guide to general levels but don't always represent actual transactions.
PRIME RATE: 10 1/2%. The base rate on corporate loans at large U.S. money center commercial banks.
FEDERAL FUNDS: 8 3/4% high, 8 11/16% low, 8 5/8% near closing bid, 8 11/16% offered. Reserves traded among commercial banks for overnight use in amounts of $1 million or more. Source: Fulton Prebon (U.S.A.) Inc.
DISCOUNT RATE: 7%. The charge on loans to depository institutions by the New York Federal Reserve Bank.
CALL MONEY: 9 3/4% to 10%. The charge on loans to brokers on stock exchange collateral.
COMMERCIAL PAPER placed directly by General Motors Acceptance Corp.: 8.50% 30 to 44 days; 8.25% 45 to 65 days; 8.375% 66 to 89 days; 8% 90 to 119 days; 7.875% 120 to 149 days; 7.75% 150 to 179 days; 7.50% 180 to 270 days.
Discourse Structure
(190) [The key U.S. and foreign annual interest rates below are a guide to general levels but don't always represent actual transactions.A]
[PRIME RATE: 10 1/2%. The base rate on corporate loans at large U.S. money center commercial banks.B]
[FEDERAL FUNDS: 8 3/4% high, 8 11/16% low, 8 5/8% near closing bid, 8 11/16% offered. Reserves traded among commercial banks for overnight use in amounts of $1 million or more. Source: Fulton Prebon (U.S.A.) Inc.C]
[DISCOUNT RATE: 7%. The charge on loans to depository institutions by the New York Federal Reserve Bank.D]
[CALL MONEY: 9 3/4% to 10%. The charge on loans to brokers on stock exchange collateral.E]
[COMMERCIAL PAPER placed directly by General Motors Acceptance Corp.: 8.50% 30 to 44 days; 8.25% 45 to 65 days; 8.375% 66 to 89 days; 8% 90 to 119 days; 7.875% 120 to 149 days; 7.75% 150 to 179 days; 7.50% 180 to 270 days.F] (wsj_0602)
Relation: Elaboration-Set-Member between A and the List formed by B, C, D, E, F.
The representation in RST is generally carried out manually.
Carlson, L. and Marcu, D. (2001). Discourse Tagging Manual. Unpublished manuscript, http://www.isi.edu/~marcu/discourse/tagging-ref-manual.pdf.
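To make the annotated structure concrete, here is a minimal sketch of how such an RST analysis could be held in memory. It assumes that unit A is the nucleus of the Elaboration-Set-Member relation and that the List of B to F is its satellite; the `Relation` class and the `leaves` helper are hypothetical names, not part of any RST toolkit.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Relation:
    """One RST relation node; children are unit labels or nested relations."""
    name: str
    nuclei: List["Node"]
    satellites: List["Node"] = field(default_factory=list)

Node = Union[str, Relation]

def leaves(node: Node) -> List[str]:
    """Collect discourse-unit labels, nuclei first, then satellites."""
    if isinstance(node, str):
        return [node]
    out: List[str] = []
    for child in node.nuclei + node.satellites:
        out.extend(leaves(child))
    return out

# Assumption: A is the nucleus, the multinuclear List B..F is the satellite.
tree = Relation(
    "Elaboration-Set-Member",
    nuclei=["A"],
    satellites=[Relation("List", nuclei=["B", "C", "D", "E", "F"])],
)

if __name__ == "__main__":
    print(leaves(tree))
```

Read this way, A plays the role of a primer and the List members play the role of items.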
Typo-dispositional markers
• Typographical markers
• Dispositional markers
Enumerative Structure
The act of enumerating: stating the successive elements of the same conceptual domain, these elements being directly or indirectly linked, hierarchically, to a classifying concept.
Can take several forms (examples from the “Shoe” Wikipedia page)
Enumerative Structures (ES)
• PRIMER {ITEM} CONCLUSION
• Horizontal vs. Vertical
• Syntagmatic vs. Paradigmatic
Enumerative Structures (ES)
Erich von Hornbostel and Curt Sachs adopted Mahillon's scheme and published an extensive new scheme for classification in Zeitschrift für Ethnologie in 1914. Hornbostel and Sachs used most of Mahillon's system, but replaced the term autophone with idiophone.
[Primer] The original Hornbostel-Sachs system classified instruments into four main groups:
[Items]
• Idiophones, which would be an instrument that you could hit, strike, shake or scrape – such as the xylophone and rattle. They produce sound by vibrating themselves; they are sorted into concussion, percussion, shaken, scraped, split, and plucked idiophones.
• Membranophones, which would be an instrument that uses a stretched skin, or membrane (key word being "stretched") such as drums or kazoos, produce sound by a vibrating membrane; they are sorted into predrum membranophones, tubular drums, friction idiophones, kettledrums, friction drums, and mirlitons.
• Chordophones, which would be an instrument that uses stretched string or cord – such as the piano or cello, produce sound by vibrating strings; they are sorted into zithers, keyboard chordophones, lyres, harps, lutes, and bowed chordophones.
• Aerophones, which would be an instrument that you produce a sound by blowing air into – such as the pipe organ or oboe, produce sound by vibrating columns of air; they are sorted into free aerophones, flutes, organs, reedpipes, and lip-vibrated aerophones.
[Conclusion] Sachs later added a fifth category, electrophones, such as theremins, which produce sound by electronic means. Within each category are many subgroups. The system has been criticised and revised over the years, but remains widely used by ethnomusicologists and organologists.
From the "Musical instrument" Wikipedia page
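The PRIMER {ITEM} CONCLUSION model can be sketched as a small data structure plus a parser for vertical ES. This is an illustration under stated assumptions, not the annotation module from the talk: items are assumed to be marked by a leading bullet character, the primer is whatever precedes the first bullet, and the conclusion is unbulleted text after the last item. `EnumerativeStructure`, `parse_vertical_es` and `BULLETS` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional

BULLETS = ("•", "-", "*")  # assumed typographical item markers

@dataclass
class EnumerativeStructure:
    primer: str
    items: List[str] = field(default_factory=list)
    conclusion: Optional[str] = None

def parse_vertical_es(lines: List[str]) -> Optional[EnumerativeStructure]:
    """Split lines into primer, bulleted items and optional conclusion."""
    primer, items, tail = [], [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(BULLETS):
            if tail:
                # A bullet after trailing text: not a single ES, give up.
                return None
            items.append(line.lstrip("•-* ").rstrip(" ,;"))
        elif items:
            tail.append(line)
        else:
            primer.append(line)
    if not primer or not items:
        return None
    return EnumerativeStructure(" ".join(primer), items, " ".join(tail) or None)

if __name__ == "__main__":
    # Shortened version of the example above (items reduced to their heads).
    es = parse_vertical_es([
        "The original Hornbostel-Sachs system classified instruments into four main groups:",
        "• Idiophones",
        "• Membranophones",
        "• Chordophones",
        "• Aerophones",
        "Sachs later added a fifth category, electrophones.",
    ])
    print(es)
```

Horizontal ES carry no such layout markers, so this sketch does not apply to them.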
Horizontal vs. Vertical ES
Horizontal:
Under IAU definitions, there are eight planets in the Solar System. In order of increasing distance from the Sun, there are the four terrestrial planets, Mercury, Venus, Earth, and Mars, then the four gas-giant ones, Jupiter, Saturn, Uranus, and Neptune.
versus
Vertical:
[PRIMER] Under IAU definitions, in the Solar System and in order of increasing distance from the Sun, there are eight planets:
[ITEMS]
• four terrestrial planets:
  - Mercury,
  - Venus,
  - Earth,
  - Mars.
• four gas-giant planets:
  - Jupiter,
  - Saturn,
  - Uranus,
  - Neptune.
Horizontal vs. Vertical ES
For the same content (the eight planets of the Solar System), the vertical version is less ambiguous than the horizontal one.
False enumerative structures?
• This overconsumption is mainly responsible for the growing resistance of bacteria to antibiotics:
  - The more a country consumes antibiotics, the more resistant the bacteria become: in France, Staphylococcus aureus is resistant to methicillin in 57% of the cases, though the observed frequency is only 1% in Denmark and 9% in Germany.
  - and to each noticeable and lasting decrease of antibiotic consumption corresponds a decrease of this resistance phenomenon.
Segmentation: [This overconsumption A][is mainly responsible for the growing resistance of bacteria to antibiotics B]. [The more a country consumes antibiotics the more resistant the bacteria become C]: [In France, Staphylococcus aureus is resistant to methicillin in 57% of the cases D], [though the observed frequency is only 1% in Denmark and 9% in Germany E]. [and to each noticeable and lasting decrease of antibiotic consumption F][corresponds a decrease of this resistance phenomenon. G]
RST: justify(non-volitional-cause(A,B), sequence(motivation(contrast(D,E),C), non-volitional-cause(F,G)))
ES: ES(Primer([A,B]), enum(item([C,E,D]), item([F,G])))
From "The Sunday Times" Wikipedia page
Syntagmatic vs. Paradigmatic ES
Syntagmatic: there are dependencies between items
• Character shoes have:
  - a one to three inch heel
  - which is usually made of leather
versus
Paradigmatic: heads of items are syntactically equivalent
• Men's shoes can also be decorated in various ways:
  - Plain-toes: have a sleek appearance and no extra decorations on the vamp.
  - Cap-toes: has an extra layer of leather that "caps" the toe. This is possibly the most popular decoration.
Paradigmatic ES > Hierarchical Structure
• A shoe is mainly composed of:
  - the sole, which protects the bottom of the feet, more or less raised on the back of the heel, and
  - the vamp, upper part that wraps the foot
Relation: Part-of (the sole and the vamp are parts of the shoe)
Typology of primers
The primer is incomplete (a proposition which is syntactically incomplete and whose missing components are given by the items)
• Women's dress shoes:
  - Pumps
  - Slingbacks
  - Loafers
  - Mules
  - Ballet flats
  - Sandals
The primer is a noun phrase:
  root of the tree > noun phrase
  relation > is-a
• The sole of a sandal can be made of:
  - rubber,
  - leather,
  - wood,
  - tatami, or
  - rope
The primer is composed only of a noun phrase and a verb:
  root of the tree > noun phrase
  relation > meaning of the verb
Typology of primers
The primer is complete (a proposition which is syntactically complete)
• Some shoes are exclusively worn by women:
  - pumps
  - Stiletto heels
  - Ballet flats
  root of the tree > subject or object ???
  relation > is-a
• Snowshoes today are divided into three types:
  - aerobic/running (small and light; not intended for backcountry use);
  - recreational (a bit larger; meant for use in gentle-to-moderate walks of 3–5 miles (4.8–8.0 km)); and
  - mountaineering (the largest, meant for serious hill-climbing, long-distance trips and off-trail use).
The primer contains a numeral or a linguistic clue such as "categories", "types", "following", etc.:
  root of the tree > term which co-occurs with this marker
  relation > is-a
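The primer typology reads as a small rule table, which the following sketch encodes. It assumes an upstream syntactic analysis has already filled in the primer's noun phrase, verb, completeness and clue term; none of that analysis is attempted here. For a complete primer without a clue, the slides leave the choice of root open ("subject or object ???"), so falling back on the supplied noun phrase is purely an assumption. `Primer`, `Edge` and `translate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, NamedTuple, Optional

class Edge(NamedTuple):
    parent: str    # root of the tree
    relation: str  # "is-a" or the meaning of the primer's verb
    child: str     # one item of the enumerative structure

@dataclass
class Primer:
    noun_phrase: str                 # head noun phrase (assumed given)
    verb: Optional[str] = None       # verb of an incomplete primer, if any
    complete: bool = False           # syntactically complete proposition?
    clue_term: Optional[str] = None  # term co-occurring with a numeral or "types", etc.

def translate(primer: Primer, items: List[str]) -> List[Edge]:
    """Turn a paradigmatic ES into parent/relation/child edges."""
    if primer.complete:
        # Complete primer: is-a, rooted at the clue's term when there is one.
        root, relation = primer.clue_term or primer.noun_phrase, "is-a"
    elif primer.verb is None:
        # Incomplete primer that is a bare noun phrase: is-a.
        root, relation = primer.noun_phrase, "is-a"
    else:
        # Incomplete primer made of noun phrase + verb: relation from the verb.
        root, relation = primer.noun_phrase, primer.verb
    return [Edge(root, relation, item) for item in items]

if __name__ == "__main__":
    print(translate(Primer("Women's dress shoes"), ["Pumps", "Mules"]))
    print(translate(Primer("sole of a sandal", verb="can be made of"), ["rubber", "rope"]))
    print(translate(Primer("Snowshoes", complete=True, clue_term="Snowshoes"),
                    ["aerobic/running", "recreational", "mountaineering"]))
```

An `is-a` edge is read from child to parent (Pumps is-a Women's dress shoes), while a verb edge is read from parent to child (sole of a sandal can be made of rubber).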
Application
Enrichment of the OntoTopo ontology (ANR-07-MDCO-005, http://www.lri.fr/geonto)
Domain: map data description – geography
Resources:
Application
• Source ontology: 728 concepts
• Wikipedia: 402 pages (183 with Enumerative Structures)
• Module for annotating: 383 Parallel ES
Application
• Source ontology: 728 concepts
• Wikipedia: 402 pages (183 with ES)
• Module for annotating: 383 PES
• Module for extracting hierarchical structures: ~300 Hierarchical Structures, ~400 new concepts, ~300 new instances
Conclusion
• Results:
  - PES are increasingly found in electronic documents
  - Additional tools based on the layout are possible
  - A translation process by successive annotations exists; this process depends on the type of the primer
  - It can dramatically improve an ontology
• Perspectives:
  - Combine structure with lexicon and syntax for ontology learning
  - Tackle more complex grammatical constructions and spelling variations
  - Improve ontology enrichment with hierarchical structures