Darja Fišer, Senja Pollak, Špela Vintar
University of Ljubljana, Dept. of Translation Studies
{darja.fiser, spela.vintar}@guest.arnes.si, senja.pollak@ff.uni-lj.si
Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources
Aim
• Extract definitions of specialised concepts from texts (journals, textbooks etc.).
• Use Wikipedia to learn rules that help distinguish between proper definitions and non-definitions.
• Extract candidate sentences from texts using 3 approaches:
  • patterns (A cell is the smallest living unit in an organism)
  • automatic term recognition
  • wordnet
• Apply rules to select good definitions and discard non-definitions.
LREC 2010, Malta
Learning rules from Wikipedia
[Figure: a Wikipedia article with three labelled parts: title, definition, non-definition]
Learning rules from Wikipedia
• Slovene Wikipedia (December 2009): 162,500 articles
• only well-formed pages retained
• morphosyntactic annotation and lemmatization with ToTaLe (Erjavec et al. 2005)
• structural parsing: 19,964 instances
• building a classification model in Weka (Witten and Frank 2005)
• features: most frequent PoS and lemmata
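The feature idea on this slide (the most frequent lemmata or PoS tags, encoded per sentence) can be illustrated with a small Python sketch. This is not the authors' Weka setup: the function names, the vocabulary size and the toy lemmatized sentences are assumptions made only for illustration. The `binary` switch mirrors the absolute-frequency versus binary encodings mentioned on the results slide.

```python
from collections import Counter

def top_k_vocabulary(sentences, k):
    """Return the k most frequent items (lemmas or PoS tags) over all sentences."""
    counts = Counter(item for sent in sentences for item in sent)
    return [item for item, _ in counts.most_common(k)]

def to_feature_vector(sentence, vocabulary, binary=False):
    """Encode one sentence over the vocabulary.

    binary=False gives absolute frequencies, binary=True gives 0/1 presence flags.
    """
    counts = Counter(sentence)
    if binary:
        return [1 if counts[item] > 0 else 0 for item in vocabulary]
    return [counts[item] for item in vocabulary]

# Hypothetical lemmatized sentences, not taken from the Wikipedia data set.
sentences = [
    ["celica", "biti", "enota"],
    ["enota", "biti"],
    ["on", "biti", "biti"],
]
vocab = top_k_vocabulary(sentences, k=2)
af_vector = to_feature_vector(sentences[2], vocab)
binary_vector = to_feature_vector(sentences[2], vocab, binary=True)
```

Vectors of this kind, paired with a definition / non-definition label, are what a classifier would be trained on.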
Learning rules - Results
• best: J48 decision tree classifier
• experimenting with full and merged PoS, absolute frequency (AF) and binary values
• 10-fold cross-validation
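For readers unfamiliar with 10-fold cross-validation, the procedure can be sketched in plain Python. This is a simplified illustration, not Weka's implementation: the folds are interleaved without stratification, and the majority-class "model" is a stand-in chosen only to keep the example self-contained. All names and the toy labels are assumptions.

```python
from collections import Counter

def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) so that each item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(items, labels, train_fn, predict_fn, k=10):
    """Return overall accuracy: correct predictions over all held-out items."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(items), k):
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, items[i]) == labels[i] for i in test_idx)
    return correct / len(items)

# Stand-in learner (hypothetical): always predicts the most common training label.
def train_majority(items, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, item):
    return model

# Toy data: 20 definitions and 5 non-definitions; the items themselves are unused here.
toy_items = list(range(25))
toy_labels = ["definition"] * 20 + ["non-definition"] * 5
accuracy = cross_validate(toy_items, toy_labels, train_majority, predict_majority)
```

Any real learner can be plugged in through `train_fn` and `predict_fn`; the majority baseline simply shows what accuracy a classifier has to beat.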
Extracting definitions from texts: Resources
• “unstructured texts”: subset of the FidaPlus corpus (http://www.fidaplus.net)
  • knowledge-rich: textbooks, popular science volumes (e.g. “All about mushrooms”)
  • various domains: astronomy, physics, geography, botany ...
• sloWNet – Slovene wordnet (Fišer 2007, http://lojze.lugos.si/~darja/slownet.html)
• Automatic term recognition system for Slovene (Vintar 2004, http://lojze.lugos.si/cgitest/extract.cgi)
Extracting definitions from texts: 1. Using wordnet hyperonymy
• The sentence is a definition candidate if:
  • the sentence starts with a sloWNet literal and contains at least one more literal from the same hyperonymy chain (i.e. its hyponym or its hypernym)
<term id=ENG20-13313485-n>Diabetes</term> je <term id=ENG20-13268088-n>bolezen</term>, ki je posledica pomanjkanja inzulina, hormona, ki skrbi, da celice v telesu dobivajo glukozo (sladkor).
[Diabetes is a disease resulting from insulin deficiency, the hormone providing glucose (sugar) for body cells.]
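The wordnet rule above can be sketched as follows. The tiny hypernym table stands in for sloWNet and is purely hypothetical, as are the function names. The sketch also assumes that "same hyperonymy chain" includes indirect hypernyms; the slide does not say whether only direct links count.

```python
# Toy stand-in for sloWNet: literal -> its direct hypernym (hypothetical entries).
HYPERNYM_OF = {
    "diabetes": "bolezen",
    "bolezen": "stanje",
    "inzulin": "hormon",
}
LITERALS = set(HYPERNYM_OF) | set(HYPERNYM_OF.values())

def hypernym_chain(literal):
    """Return all hypernyms of a literal, nearest first (guards against cycles)."""
    chain, seen = [], {literal}
    while literal in HYPERNYM_OF:
        literal = HYPERNYM_OF[literal]
        if literal in seen:
            break
        seen.add(literal)
        chain.append(literal)
    return chain

def same_chain(a, b):
    """True if one literal is a (possibly indirect) hypernym of the other."""
    return b in hypernym_chain(a) or a in hypernym_chain(b)

def is_wordnet_candidate(lemmas):
    """Sentence starts with a literal and contains another literal from its chain."""
    if not lemmas or lemmas[0] not in LITERALS:
        return False
    first = lemmas[0]
    return any(same_chain(first, other)
               for other in lemmas[1:] if other in LITERALS)

# Lemmatized version of the diabetes example (lemmas written by hand).
diabetes = ["diabetes", "biti", "bolezen", "ki", "biti", "posledica",
            "pomanjkanje", "inzulin"]
diabetes_is_candidate = is_wordnet_candidate(diabetes)
```

Note that "inzulin" is also a literal in the toy table, but it does not trigger the rule on its own because it lies on a different chain than "diabetes".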
Extracting definitions from texts: 2. Using automatic term recognition
• The sentence is a definition candidate if:
  • the sentence contains at least two domain-specific terms and the first term is in the nominative case
<term score="80.45">Ekvator</term> je najdaljši vzporednik, ki deli Zemljo na severno in <term score="43.21">južno poloblo</term>.
[The Equator is the longest circle of latitude dividing the Earth into the Northern and the Southern Hemispheres.]
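A minimal sketch of this rule, assuming the term recogniser and the morphosyntactic tagger have already run and their output is available per term. The `Term` record and its field names are hypothetical; only the two scores come from the example above.

```python
from dataclasses import dataclass

@dataclass
class Term:
    text: str
    score: float  # termhood score from the term recogniser
    case: str     # grammatical case of the term's head, e.g. "nominative"

def is_atr_candidate(terms, min_terms=2):
    """terms: recognised domain terms of one sentence, in sentence order."""
    return len(terms) >= min_terms and terms[0].case == "nominative"

# The Equator example: two terms, the first one in the nominative.
equator_terms = [
    Term("Ekvator", 80.45, "nominative"),
    Term("južno poloblo", 43.21, "accusative"),
]
equator_is_candidate = is_atr_candidate(equator_terms)
```

The rule is deliberately permissive, which fits the later observation that the term-based approach gives the best recall.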
Extracting definitions from texts: 3. Using patterns
• The sentence is a definition candidate if:
  • the sentence contains a defining morphosyntactic pattern (NP[nominative] is_a NP[nominative]).
NP is_a NP
Celica je strukturna in funkcionalna enota vseh živih organizmov.
[A cell is a structural and functional unit of all living organisms.]
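The NP[nominative] is_a NP[nominative] pattern could be matched over tagged tokens roughly as below. This is a simplified sketch: the tag set ("N", "A", "V", "C", "S"), the case labels and the very small notion of a noun phrase (optional nominative adjectives and conjunctions followed by a nominative noun) are assumptions, not the pattern grammar used by the authors.

```python
from typing import NamedTuple, Optional

class Token(NamedTuple):
    lemma: str
    pos: str             # "N" noun, "A" adjective, "V" verb, "C" conjunction, "S" preposition
    case: Optional[str]  # "nom", "gen", "loc", ... or None if not applicable

def np_end(tokens, start):
    """Return the index just past a nominative NP starting at `start`, or None."""
    i = start
    while i < len(tokens) and (
        (tokens[i].pos == "A" and tokens[i].case == "nom") or tokens[i].pos == "C"
    ):
        i += 1
    if i < len(tokens) and tokens[i].pos == "N" and tokens[i].case == "nom":
        return i + 1
    return None

def matches_is_a(tokens):
    """True if a nominative noun, the copula 'biti' and a nominative NP occur in a row."""
    for i, tok in enumerate(tokens):
        if i > 0 and tok.lemma == "biti" and tok.pos == "V":
            left = tokens[i - 1]
            if left.pos == "N" and left.case == "nom" and np_end(tokens, i + 1) is not None:
                return True
    return False

# "Celica je strukturna in funkcionalna enota vseh živih organizmov." (tags assigned by hand)
cell = [
    Token("celica", "N", "nom"), Token("biti", "V", None),
    Token("strukturen", "A", "nom"), Token("in", "C", None),
    Token("funkcionalen", "A", "nom"), Token("enota", "N", "nom"),
    Token("ves", "A", "gen"), Token("živ", "A", "gen"), Token("organizem", "N", "gen"),
]
cell_matches = matches_is_a(cell)
```

A sentence such as "Celica je v organizmu" does not match, because the copula is followed by a prepositional phrase rather than a nominative noun phrase.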
Results
• manual evaluation of all definition candidates
• sloWNet: best precision, ATR: best recall
• what is a definition?
Classification accuracy
For definitions only:
Which is the “best” definition?
• The Equator is an imaginary line on the Earth's surface equidistant from the North Pole and South Pole that divides the Earth into a Northern Hemisphere and a Southern Hemisphere.
• An equator is the intersection of a sphere's surface with the plane perpendicular to the sphere's axis of rotation and containing the sphere's center of mass.
• The longest of the five main circles of latitude on Earth (the others being the Arctic and Antarctic Circles and the Tropics of Cancer and Capricorn) is called the Equator.
Definitions depend on context... and may span several sentences
• Head lice are parasites that live in the hair and scalp of humans.
• HEAD LICE, also called Pediculus Humanus Capitis are small blood-sucking, wingless insects found on the human scalp. They are approximately the size of a sesame seed and cannot jump or fly. They are six-legged creatures with claws, which help them cling to and crawl through human hair.
• Head lice are an emerging social problem, not only in economically poor countries but also in practically all other societies.
Conclusions & future work
• Wikipedia can help us learn the properties of definitions,
• Knowledge-rich texts are a good source of definitions,
• A semantically-rich approach (using wordnet and ATR) yields many definitions and defining contexts.
• Defining a definition is hard...
• Encyclopaedic definitions differ from those found in running texts,
• Future work:
  • use other features in learning,
  • use active learning,
  • redefine definitions and possibly re-evaluate definition candidates