Lexical knowledge schemes for modeling words and expressions in communication
Computational Lexicology & Terminology Lab
Wauter Bosma, Isa Maks, Roxane Segers, Hennie van der Vliet, Piek Vossen
LCC meeting, October 9th, 2008, VU University Amsterdam
Overview
• Genre as a knowledge scheme
• What do we do at CLTL?
• How does it relate to genre?
• Projects at CLTL
• Discussion
A view on genre
• Genre is an abstract knowledge scheme that natural language speakers can apply to effectively structure communication.
• How and where is such a scheme stored?
• How is this knowledge activated and applied in a communicative setting?
• How can we benefit from these insights in computerized information and communication systems?
[Diagram. Labels: Social behaviour; Text: structure & content; Communication targets; Intentions; Strategy; Genre; Participants; Medium; Attitudes; Form; Language; Grammar; Lexicon; World knowledge; Ontology; Entities; Objects; Relations]
Focus of Computational Lexicology and Terminology Lab (CLTL)
• Lexicon = model of abstract knowledge to efficiently process and produce natural language in communicative settings
• Symbolic & abstract representation of forms related to concepts:
  • forms are variants that can refer to more or less the same semantic content:
    • shoot (V) – shooting (N) – aggression (N) – fight (N) – conflict (N) – war (N) – WOII (Name)
    • pay (V) – exchange (V) – buy (V) – sell (V) – merchandise (N) – trade (N) – business (N)
• Also encode pragmatic aspects of use:
  • Sentiment, subjectivity & attitude
  • Perspective
  • Domain restrictions
Focus of Computational Lexicology and Terminology Lab (CLTL)
• Broad notion of knowledge:
  • words & expressions (what is a word, what is a concept?)
  • phrases, sentences and text (incorporating grammar)
  • genres
• Abstract symbolic representations related to statistical expectation patterns
• A tagged corpus represents an 'experience' of language use:
  • "X drinks beer", "Y drinks wine", "Z drinks milk"
• The lexicon is the highest abstraction of these experiences that gives the most effective prediction of how words and expressions behave:
  • "XYZ drink beverages"
• Corpus-based lexicon, or corpus data represented as a lexicon
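The step from "X drinks beer / wine / milk" to "XYZ drink beverages" can be illustrated with a small sketch. It assumes a toy hypernym table (the `HYPERNYM` entries and the function names are invented for illustration and are not taken from any CLTL database) and picks the most specific concept that still covers every observed object.

```python
# Toy hypernym links (child -> parent); hypothetical data for illustration only.
HYPERNYM = {
    "beer": "alcoholic beverage",
    "wine": "alcoholic beverage",
    "alcoholic beverage": "beverage",
    "milk": "beverage",
    "beverage": "liquid",
    "liquid": "substance",
}


def ancestors(word):
    """Return the word followed by all of its hypernyms, most specific first."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def generalize(objects):
    """Return the most specific concept shared by all observed objects."""
    chains = [ancestors(o) for o in objects]
    common = set(chains[0]).intersection(*chains[1:])
    # Walk the first chain from specific to general; the first shared concept wins.
    for concept in chains[0]:
        if concept in common:
            return concept
    return None


corpus = [("X", "drinks", "beer"), ("Y", "drinks", "wine"), ("Z", "drinks", "milk")]
objects = [obj for _, verb, obj in corpus if verb == "drinks"]
print(generalize(objects))
```

Choosing the lowest shared concept rather than the top of the hierarchy is what keeps the abstraction predictive: "substance" would also cover all three observations but says much less about what can be drunk.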
Focus of Computational Lexicology and Terminology Lab (CLTL)
• Validation of models and databases with lexical knowledge:
  • Can we define types of structures (lexical and compositional expressions) that correctly predict their behavior in language use? For example:
    • pluriform-object-count-noun (police)
    • object-count-noun (police officer)
    • group-object-count-noun (eikenbos (oak forest))
    • mass-object-uncount-noun (bos (forest))
  • Can we build a comprehensive database using these types?
• Use the database in corpus research and analysis:
  • import corpus data into the lexical database
  • apply the database to textual corpora in computer applications:
    • Automatic tagging of corpora with features
    • Automatic mining of textual data using the lexicon as a background knowledge resource, e.g. to find facts about causal relations for environmental phenomena
[Diagram: four knowledge layers connected by arrows labeled Map, Validate, Integrate and Derive]
• Ontology:
  • concepts instead of words
  • identity criteria
  • language neutral
  • domain and perspective neutral
  • no genre dependency
  • logically valid, for inferencing
• Lexical database:
  • generic list of words and terms
  • abstracts from various text corpora
  • differentiation for different domains and genres
  • most generic representation in a language community
• Term database:
  • generic list of terms
  • derived from text corpus
  • patterns and features that are dominant in domain and genre
• Text corpus with empirical data:
  • linear text
  • every word occurrence is unique
  • domain and genre specific
Projects at CLTL
• Cornetto (Stevin project: STE05039)
• Kyoto (FP7 ICT Work Programme 2007 under Challenge 4 - Digital libraries and Content, project ICT-211423)
• Camera projects:
  • From sentiments and opinions in text to positions of political parties
  • The semantics of history
• A term bank for the Belastingdienst (Steunpunt Terminologie)
• DutchSemCor (NWO investment grant)
Cornetto
• COmbinatorial Relational NEtwork voor Taal TOepassingen (combinatorial relational network for language applications)
• Goal: to develop a lexical semantic database for Dutch:
  • 90K entries: generic and central part of the language
  • Rich horizontal and vertical semantic relations
  • Combinatoric information
  • Ontological information
Lexical Unit & Synsets
• Lexical Unit (LU) = form-meaning relation, such that:
  • form = abstract representation of certain realizations;
  • part-of-speech is the same;
  • meaning is the same, where meaning is defined by a reference to a unique Synset.
• Synset = set of synonyms (LUs) that refer to the same entities in most contexts:
  • defined by lexical semantic relations;
  • defined by reference to ontology Terms or logical expressions involving Terms from the ontology.
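The two definitions above can be read as a small data model. The following is a minimal sketch, not the Cornetto schema: the class and field names, the synset identifiers and the ontology term `TransportationDevice` are assumptions made for illustration. The one constraint taken from the slide is that each lexical unit points to exactly one synset.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LexicalUnit:
    form: str       # abstract form that stands for its realizations
    pos: str        # part of speech, shared by all realizations
    sense: int      # sense number of this form
    synset_id: str  # reference to exactly one synset


@dataclass
class Synset:
    synset_id: str
    members: list = field(default_factory=list)         # synonymous lexical units
    relations: dict = field(default_factory=dict)       # relation name -> synset ids
    ontology_terms: list = field(default_factory=list)  # terms from the ontology

    def add(self, lu):
        """Add a lexical unit, enforcing its reference to this synset."""
        if lu.synset_id != self.synset_id:
            raise ValueError(f"{lu.form}:{lu.sense} refers to {lu.synset_id}")
        self.members.append(lu)


# Hypothetical identifiers; only the word forms come from the slides.
bike = Synset("syn-bicycle", relations={"has_hyperonym": ["syn-vehicle"]},
              ontology_terms=["TransportationDevice"])
bike.add(LexicalUnit("fiets", "noun", 1, "syn-bicycle"))
bike.add(LexicalUnit("rijwiel", "noun", 1, "syn-bicycle"))
print([f"{lu.form}:{lu.sense}" for lu in bike.members])
```

Keeping the lexical unit immutable and letting the synset own the relations mirrors the split on the slide: form-level information lives with the unit, meaning-level information with the synset.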
Data Organization
[Diagram of the database layers and linked resources]
• Lexical Unit: corresponds to a word-meaning pair; carries form, morphology, syntax, semantics, pragmatics and usage examples; linked by internal relations
• Synset: groups synonyms; models meaning relations
• Ontology: collection of Terms and Axioms (SUMO, MILO)
• Linked resources: Princeton Wordnet, Wordnet Domains, Czech Wordnet, German Wordnet, Korean Wordnet, Spanish Wordnet, Arabic Wordnet, French Wordnet
Data overview
Combinatorics
[Diagram: the senses of Dutch "band", their typical combinations and related concepts]
• Senses: band#1 (band), band#2 (tire), band#3/geluidsband (audio tape), band#5 (bond)
• Combinations with the music-group sense: in een band spelen (to play in a band), een band oprichten (to start a band), de band speelt (the band plays)
• Combinations with the tire sense: de band oppompen (to pump air into a tire), een band plakken (to fix a hole in a tire), een lekke band (flat tire), de band springt (the tire explodes)
• Combinations with the tape sense: de band starten (to start a tape), op de band opnemen (to record on a tape), de band afspelen (to play from a tape)
• Combinations with the bond sense: een goede/sterke band (a good/strong bond), de banden verbreken (to break all bonds), een band hebben met iemand (to have a bond with s.o.)
• Related concepts: artiest (artist), groep (group), voorwerp (object), toestand (state), middel (device), muziek (music), informatiedrager (data carrier), gezelschap (group of people), relatie (relation), muzikant (musician), lezen (read), schrijven (write), ring (ring), verhouding (relation), muziekgezelschap (music group), geluidsdrager (audio carrier), musiceren (to make music)
• More specific words: zwemband (tire for swimming), cassettebandje (audio cassette), fietsband (bike tire), autoband (car tire), moederband (mother bond), familieband (family bond), jazzband (jazz band), popgroep (pop group), binnenband (inner tire), buitenband (outer tire), bloedband (blood bond)
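A small sketch can show why combinatoric information is useful: the words that combine with "band" usually give away which sense is meant. The collocate sets below are read off the example phrases on the slide and their English glosses; the function name and the simple overlap count are illustrative assumptions, not a description of how Cornetto is used in practice.

```python
# Collocates per sense of "band", taken from the example phrases on the slide.
COLLOCATES = {
    "band#1 (band)": {"spelen", "speelt", "oprichten"},
    "band#2 (tire)": {"oppompen", "plakken", "lekke", "springt"},
    "band#3 (audio tape)": {"starten", "opnemen", "afspelen"},
    "band#5 (bond)": {"goede", "sterke", "verbreken"},
}


def guess_sense(phrase):
    """Return the sense whose collocates overlap most with the phrase, or None."""
    tokens = set(phrase.lower().split())
    scores = {sense: len(tokens & words) for sense, words in COLLOCATES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


for phrase in ["de band oppompen", "op de band opnemen", "de band speelt"]:
    print(phrase, "->", guess_sense(phrase))
```

A phrase without any known collocate returns `None` instead of a guess, which keeps the sketch honest about what the combinatoric data can and cannot decide.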
Integrating the ontology: SUMO terms and axioms
Lexicon versus Ontology
[Diagram. Ontology nodes: Abstract; Physical; Organism; Element; Process; Possession; Transaction; Dog; PoodleDog; H2O; CO2]
• NAMES for TYPES: {poodle}EN, {poedel}NL, {pudoru}JP: ((instance x Poodle))
• LABELS for ROLES: {watchdog}EN, {waakhond}NL, {banken}JP: ((instance x Canine) (role x GuardingProcess))
• LABELS for ROLES: {bluswater} (water for fire fighting), {theewater} (water for tea), {koffiewater} (water for coffee)
• {buy} and {sell}: both linked to Transaction; grammatical functions subj, obj and ind obj shown against the roles receiver, giver and goods
Kyoto
• Yielding Ontologies for Transition-Based Organization
• Funded by:
  • 7th Framework Program-ICT of the European Union: Intelligent Content and Semantics
• Goal:
  • Platform for knowledge sharing across languages and cultures
  • Enables knowledge transition and information search across different target groups, crossing linguistic, cultural and geographic boundaries
  • Open text mining and deep semantic search
  • Wiki environment that allows people in the field to maintain their knowledge and agree on meaning without knowledge engineering skills
• URL: http://www.kyoto-project.eu/
• Duration: March 2008 – March 2011
• Effort: 364 person months of work
KYOTO (ICT-211423) Overview
• Languages: English, Dutch, Italian, Spanish, Basque, Chinese, Japanese
• Domain: environmental domain, BUT usable in any domain
• Global: both European and non-European languages
• Available: free, as open source system and data (GPL)
• Future perspective:
  • Content standardization that supports worldwide communication
  • Global Wordnet Grid
[Diagram of the Kyoto platform. Labels: Citizens; Governors; Companies; Environmental organizations; Experts; Global Wordnet Grid; Wikyoto; Capture; Universal Ontology; Wordnets; Concept Mining; Fact Mining; Dialogue; Search; Index; Docs; URLs; Images; Kybots; Tybots; ontology levels Top (Abstract, Physical), Middle (Process, Substance) and Domain (water, CO2, water pollution, CO2 emission). Example fact: "Sudden increase of CO2 emissions in 2008 in Europe"]
User perspective
• Ecosystem services
• nature as a resource: food, transport, recreation, medicine, material
• nature for waste absorption
• economic dependency
• state of nature
• footprint
• poverty
Lexicon versus Ontology
• Ecosystem services
• Nature as a resource
• Nature for waste absorption
• State of nature
• Threats to nature
[Diagram. Ontology nodes: Abstract; Physical; Artifacts; Organism; Element; Process; Possession; Transaction; Spider; H2O; CO2. Domain terms linked by "qualifies" relations: alien invasive species; species migration; green house gas; ecosystem-based drinking water production; green roof; branding rural products; sustainable products]
System components
• Wikyoto = wiki environment for a social group:
  • to model the terms and concepts of a domain and agree on their meaning, within the group and across languages and cultures
  • to define the types of knowledge and facts of interest
• Tybots = Term extraction robots: extract term data from a text corpus
• Kybots = Knowledge yielding robots: extract facts from a text corpus
• Linguistic processors:
  • tokenizers, segmentizers, taggers, grammars
  • named entity recognition
  • word sense disambiguation
  • generate a layered text annotation in Kyoto Annotation Format (KAF)
[Architecture diagram. Labels: Capture Server; Document Base (Linear KAF); Semantic Annotation; Concept User; Tybot Server (Term Extraction); Extracted Terms (Generic K-TMF); TermEditor (Wikyoto); Fact User; KybotEditor; Kybot Profiles; Kybot Server (Fact Extraction); Document Base (Linear Generic KAF); Domain Wordnet (K-LMF); Domain Ontology (OWL_DL)]
Conceptual modeling
[Diagram: from source documents to ontology, in three steps labeled Synthesize, Ontologize and Axiomatize]
• Source documents pass through linguistic processors (morpho-syntactic analysis): [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP]NP
• TYBOT concept miners build a term hierarchy: area > agricultural area; emission > CO2 emission; gas > greenhouse gas; relations "of" and "in" between the terms
• English Wordnet senses: location:3, region:3, geographical area:1, area:1, rural area:1, farmland:2; naturalprocess:1, emission:3, CO2 emission:2; substance:1, gas:1, greenhouse gas:1
• Ontology nodes: Abstract; Physical; Process; Substance; Chemical Reaction; H2O; CO2; GreenhouseGas; CO2Emission; GlobalWarming; WaterPollution
• Axioms: (instance s1 Substance) (instance e1 Warming) (katalyist s1 e1)
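The term hierarchy on this slide (gas > greenhouse gas, emission > CO2 emission, area > agricultural area) can be approximated with a very simple head-based heuristic. The slides do not say how Tybots actually build the hierarchy, so the sketch below is only an assumption: for English head-final terms, the parent is taken to be the longest trailing part of the term that is itself a known term.

```python
def term_parent(term, terms):
    """Return the longest proper suffix of `term` that is also a known term."""
    words = term.split()
    for i in range(1, len(words)):
        candidate = " ".join(words[i:])
        if candidate in terms:
            return candidate
    return None


# Terms taken from the slide's example phrase and hierarchy.
terms = {"emission", "CO2 emission", "gas", "greenhouse gas",
         "area", "agricultural area"}

hierarchy = {term: term_parent(term, terms) for term in sorted(terms)}
for term, parent in hierarchy.items():
    print(f"{term} -> {parent}")
```

The heuristic only works for languages and terms where the head comes last; it would need a different rule for, say, Spanish or Italian terms in the project's language set.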
Fact mining by Kybots
[Diagram: from source documents to facts]
• Source documents pass through linguistic processors (morpho-syntactic analysis): [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP]NP
• Kybots combine the ontology (logical expressions) with wordnets and linguistic expressions, at a generic and a domain level
• Fact analysis of the example:
  • [the emission]NP = Process: e1
  • [of greenhouse gases]PP = Patient: s2
  • [in agricultural areas]PP = Location: a3
• Ontology nodes: Abstract; Physical; Substance; Process; Chemical Reaction; H2O; CO2
• Domain terms: CO2 emission; water pollution
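The fact analysis on this slide can be sketched as a mapping from syntactic chunks to semantic roles. This is an illustration only, not the Kybot profile format: the chunk representation, the `PREP_ROLES` table and the function name are assumptions, and a real Kybot would consult the wordnet and ontology rather than a fixed preposition table.

```python
# Hypothetical rule table: which role a prepositional phrase fills.
PREP_ROLES = {"of": "Patient", "in": "Location"}
DETERMINERS = {"the", "a", "an"}


def extract_fact(chunks):
    """Map (label, text) chunks to a role -> filler dictionary."""
    fact = {}
    for label, text in chunks:
        words = text.split()
        if not words:
            continue
        if label == "NP" and "Process" not in fact:
            # The head noun phrase names the process; drop determiners.
            fact["Process"] = " ".join(w for w in words if w.lower() not in DETERMINERS)
        elif label == "PP" and words[0].lower() in PREP_ROLES:
            fact[PREP_ROLES[words[0].lower()]] = " ".join(words[1:])
    return fact


chunks = [("NP", "the emission"),
          ("PP", "of greenhouse gases"),
          ("PP", "in agricultural areas")]
print(extract_fact(chunks))
```

For the slide's example this yields the same three roles as the fact analysis: a process (emission), its patient (greenhouse gases) and its location (agricultural areas).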
[Diagram of the Kyoto Server and the Wikyoto interview dialogue]
• Data formats: wordnets in LMF; ontologies in OWL; facts in RDF (FactAF); text in KAF
• Resources: G-WN, G-KON, DE-WN, DE-KON, DE-TN, SUMO, DOLCE, GEO, WIKIPEDIA, FRAMENET, connected via plugins to the Kyoto Server, Kybots and Tybots
• Simplified term fragment and simplified ontology fragment: population, Group, ?Population, marine species, terrestrial species
• Shown to the user (interview):
  • Do populations consist of marine species?
  • Are terrestrial species a type of populations?
  • Do populations always consist of marine species?
  • Are terrestrial species never marine species?
• Hidden from the user (Smart Kytext, from pdf): ".... populations such as terrestrial and marine species ....."; ".... populations declined ..... terrestrial and marine species .. in forests ..... declined"; index entries A .. decline ... population .. Z
[Diagram: the four knowledge layers applied to the CO2 example, connected by the steps Concept Mining by Tybots, Text mining by Kybots, Synthesize, Ontologize and Axiomatize]
• Text corpus (linear text): "Sudden increase of green house gases in 2003 ........ CO2 emission in European countries .... Green house gases such as CO2, ...."
• Term database (generic, text based):
  • gas
  • green house gas -> gas
  • CO2 -> green house gas
  • increase (AG)
  • in 2003 (TIME)
  • emission (PA)
  • in European countries (LO)
• Lexical database: wordnet (language neutral integrity): substance:1, gas:1, greenhouse gas:1, natural process:1, emission:3, CO2 emission:2
• Ontology (maximal abstraction & integrity): Abstract; Physical; Process; Substance; ChemicalReaction; H2O; CO2; Greenhouse Gas; CO2 Emission; Global Warming; axioms (instance s1 Substance) (instance e1 Warming) (katalyist s1 e1)
From sentiments and opinions in text to positions of political parties
• Most language use does not express facts but personal opinions and positions with respect to facts or issues, often disguised for some communicative or manipulative goal.
• CAMERA project involving 2 AIOs (PhD students) from FdL and 1 AIO from Political Sciences
• Combines contemporary theories and methods in linguistics and political science to develop an automated research tool for rich text mining:
  • Complexity of language use, the linguistic modeling of subjectivity and the representation of this knowledge in a lexicon.
  • Complex dimensionality of competition between political parties.
• The mining tool for language-meaning research can be applied to enhance the Kieskompas (Electoral Compass).
[Diagram: division of work in the project]
• aio-1: Modeling; Lexical database; Lexical analysis; Derivation; Co-occurrence; Lexical acquisition
• aio-2: Corpus linguistics; Linguistic rules; Quantitative text analysis; Search; Concordance; Manual coding & tagging
• aio-3: Political analysis; Interpretation rules; Manual coding; Search; Quantitative data analysis
• system integrator-4: Political text corpus; Automated tagging & analysis; Morpho-syntactic parsers; Political database
• Omstreden democratie (Contested democracy): Jan Kleinnijenhuis, Wouter van Atteveldt
AIO-1: Lexical model and acquisition for sentiment and opinion analysis in Dutch text
• Words & expressions in political text
• Model sentiment, subjectivity, lexical framing and attitudinal implications
• Build a lexicon encoding these layers
• Validate the lexicon in the mining application applied to the text corpus
Levels of subjectivity
• sentiment orientation, e.g.
  • small (neutral), splendid (positive), dull (negative)
  • funeral (negative), birthday party (positive), meeting (neutral)
• explicit attitudinal and deontic implications
  • hate, love, favour, desire, want
  • impossible, possible, can, cannot
  • demand, beg, hope, wish
• implicit attitudinal and deontic implications
  • neutral: describe, cite, quote
  • subjective: tell my story, shout, cry out, suggest
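The first level, sentiment orientation, is the easiest to picture as a lexicon lookup. The sketch below is a toy illustration under stated assumptions: the entries and their polarities are the six examples from the slide, the numeric coding (+1, 0, -1) and the function name are invented here, and the other two levels (explicit and implicit attitudinal implications) are not modeled.

```python
import re

# Orientation per lexicon entry: +1 positive, 0 neutral, -1 negative.
ORIENTATION = {
    "small": 0,
    "splendid": 1,
    "dull": -1,
    "funeral": -1,
    "birthday party": 1,
    "meeting": 0,
}


def orientation_score(text):
    """Sum the orientation of every lexicon entry found in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = 0
    for entry, polarity in ORIENTATION.items():
        entry_tokens = entry.split()
        n = len(entry_tokens)
        # Slide a window over the tokens so multiword entries match too.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == entry_tokens:
                score += polarity
    return score


print(orientation_score("A splendid birthday party."))
print(orientation_score("A dull, dull meeting after the funeral."))
```

Matching on token windows instead of single words is what lets an expression such as "birthday party" carry its own orientation, in line with the lab's focus on words and expressions.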
Some concepts of saying
• The reporter expresses attitude towards the subject (is not aware):
  • nazeggen:1, herhalen:4, echoën:2
  • meesmuilen:1
  • herkauwen:2
  • toesnauwen:1, aanblaffen:2, sissen:2, toebijten:1, toeblaffen:1
  • toesmijten:2, toevoegen:4
  • uitputten:3
  • verzuchten:1
  • pretenderen:1, beweren:1
• Subject of speech act has attitude towards (is aware):
  • afzeggen:1, cancellen:1
  • ontkennen:1, miskennen:1, ontveinzen:1
  • toewensen:1, wensen:2
  • verbieden:1
  • aanzetten:12, beklemtonen:2, hameren:2, tamboereren:2
  • onderstrepen:2, onderlijnen:1, accentueren:1
  • toezeggen:1, beloven:1
  • uitlaten:5, beoordelen:1
  • distantiëren:1
  • erkennen:2, toegeven:1
  • opmerken:2, aantekenen:4
Synsets or lexical units
• {brilliant:3, glorious:4, magnificent:1, splendid:2}
• {bus:4, jalopy:1, heap:3}
  • has_hyperonym: {car:1, auto:1, automobile:1, machine:4, motorcar:1}
• {fiets:1, brik:7, kar:3, karretje:2, rijwiel:1, velo:1}
The semantics of history
• Camera project involving 1 AIO from FdL and 1 AIO from FEW (Exact Science)
• Goal: an ontology and lexicon for a historical multimedia archive of the Rijksmuseum.
• Applied to an innovative information system for accessing the historical archive.
The semantics of history = semantics of change
• Represent different realities:
  • related through causal changes over time
  • representing different views or perspectives on the same reality, e.g. from a different historical angle or from different geographical or social parties.
• Changes are typed as events
Events as key notions
• Historical events:
  • events considered from a distance in time, abstracting from detail.
  • referenced by names (WOII, de Val van Srebrenica), nouns (war) or nominalizations (the violation of human rights)
• News events:
  • Report on (the same) reality, but more in the active verbal form: US soldiers shoot Iraqi citizens.
  • Close to the actual event
  • Lack historical abstraction and filtering.
• Both news and historical accounts imply subjectivity and perspective on these events, but they probably make different selections and use different genres to convey this information.
• News becomes history over time, and we therefore expect a smooth transition in the use of language to refer to the same events, adding more and more historical perspective.
“Val van Srebrenica” in Wikipedia
• Headings:
  • 1992 ethnic cleansing campaign
  • The conflict in eastern Bosnia
  • Struggle for Srebrenica
• Text:
  • A fierce struggle for territorial control then ensued among the three major groups in Bosnia: Bosniak (commonly known as 'Bosnian Muslims'), Serb and Croat. In the eastern part of Bosnia, close to Serbia, conflict was particularly fierce between Serbs and Bosniaks
  • Serb military and paramilitary forces from the area and neighboring parts of eastern Bosnia and Serbia gained control of Srebrenica for several weeks in early 1992, killing and expelling Bosniak civilians. In May 1992, Bosnian government forces under the leadership of Naser Orić recaptured the town
  • thus proceeded with the ethnic cleansing of Bosniaks from Bosniak ethnic territories in Eastern Bosnia and Central Podrinje
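Letter from the Dutch minister of defense
• De afgelopen zes maanden werd de uitvoering van deze taken aanzienlijk bemoeilijkt door de Bosnisch-Servische weigering de enclave voldoende te laten bevoorraden. Door een gebrek aan brandstof moesten patrouilles te voet worden uitgevoerd. Ook blokkeerden de Bosnische Serviers sinds mei jl. de rotatie van het personeel van Dutchbat, waardoor de bezetting werd teruggebracht van 630 naar 430 blauwhelmen. De vijandelijkheden namen geleidelijk toe, waardoor op 3 juni jl. een observatiepost in het zuidoostelijke deel van de enclave moest worden opgegeven
  • English: Over the past six months, carrying out these tasks was considerably hampered by the Bosnian Serb refusal to allow the enclave to be adequately supplied. Because of a lack of fuel, patrols had to be carried out on foot. Since May, the Bosnian Serbs have also blocked the rotation of Dutchbat personnel, which reduced the force from 630 to 430 blue helmets. Hostilities gradually increased, so that on 3 June an observation post in the south-eastern part of the enclave had to be abandoned
• Historical terms: blokkade (blockade), val (fall), opgave (abandonment), overgave (surrender)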
[Diagram: from archive data to smart retrieval. Labels: Event Ontology; Historic Ontology; Alignment; Data model; Structured data; Ontology; Data conversion; Ontolization; Terms & relations; Semi-structured data; Lexical mapping; Term extraction; Lexicalization; Free text; Lexicon; Smart indexing (Objects, Events, Locations, People); Smart retrieval; Validation. Example terms: conflict, struggle, ethnic cleansing, …, killing, expelling, gain control]
AIO at FdL
• Lexical framing of events in news reporting and historical descriptions.
• Use a historical thesaurus to group all the words and expressions in a lexicon that relate to the same events
• Differentiate implications of the lexical variation: packaging of events
• Classification of news