Semiautomatic Extension of CoreNet using a Bootstrapping Mechanism on Corpus-based Co-occurrences
Chris Biemann (University of Leipzig), Sa-Im Shin (KORTERM, KAIST), Key-Sun Choi (KORTERM, KAIST)
Friday, 27th of August, Coling 2004, Genève
Outline
• The necessity of extending lexical-semantic word nets
• CoreNet: a WordNet for Korean, Japanese and Chinese
• Co-occurrence statistics on large corpora
• The Pendulum Algorithm
• Results and Evaluation
Why extend WordNet?
• Manual construction is done by experts
  - time-consuming
  - expensive
• A general-purpose WordNet often does not fit a specialized domain
• Existing resources have coverage problems
Bootstrapping of lexical items
For learning by bootstrapping, two things are needed: a start set of known items with their classes, and a rule set that states how more information can be obtained from known items.
Generic bootstrapping algorithm:
    Knowledge = 0
    New = Start_set
    While New > 0
        Knowledge += New
        New = 0
        New = find new items using Knowledge and Rule_set
[Figure: number of known items and of new items per iteration, showing a phase of growth followed by a phase of exhaustion]
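The generic loop above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the names `bootstrap`, `find_new`, `max_iterations`, `NEIGHBOURS` and `neighbours_rule` are assumptions made for this example, and the toy rule set simply follows hand-written links between items.

```python
from typing import Callable, Dict, Set


def bootstrap(start_set: Set[str],
              find_new: Callable[[Set[str]], Set[str]],
              max_iterations: int = 100) -> Set[str]:
    """Generic bootstrapping: grow `knowledge` until no new items are found.

    `find_new` stands in for the rule set (hypothetical interface): it maps
    the currently known items to items derivable from them.
    `max_iterations` is a safety cap that is not part of the slide.
    """
    knowledge: Set[str] = set()
    new = set(start_set)
    iterations = 0
    while new and iterations < max_iterations:
        knowledge |= new                        # Knowledge += New
        new = find_new(knowledge) - knowledge   # keep only unseen items
        iterations += 1
    return knowledge


# Toy rule set (assumption): each item points to the items it can reveal.
NEIGHBOURS: Dict[str, Set[str]] = {"a": {"b"}, "b": {"c"}, "c": {"a"}, "x": {"y"}}


def neighbours_rule(known: Set[str]) -> Set[str]:
    found: Set[str] = set()
    for item in known:
        found |= NEIGHBOURS.get(item, set())
    return found


result = bootstrap({"a"}, neighbours_rule)
```

Subtracting `knowledge` from the newly found items is what makes the loop reach the phase of exhaustion: once the rules only return items that are already known, `new` is empty and the loop stops.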
Benefits and Drawbacks of Bootstrapping
Pros:
• Only small start sets (seeds) are needed, and these can be prepared rapidly
• The process needs no further supervision (weakly supervised learning)
Cons:
• Danger of error propagation
• It is unclear when to stop
CoreNet: ontology for Korean, Japanese and Chinese
Size of Korean part: 2,954 concepts
Features
• Rather large groups of words per concept, as opposed to the fine-grained WordNet structure
• The same concept hierarchy is used for all word classes
KAIST Corpus and Co-occurrences
Size of KAIST corpus (unannotated version):
• 38 million tokens
• 2.3 million sentences
• 3.8 million types
Co-occurrence statistics (sentence based):
• Co-occurrence: occurrence of two or more words within a well-defined unit of information (here, the sentence)
• Significant co-occurrences reflect semantic relations between words
• Significance measure: log-likelihood, with k = number of sentences containing both a and b
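To make the sentence-based statistics concrete, here is a small Python sketch. The exact significance formula is not reproduced in these slides, so the sketch uses the standard log-likelihood ratio (G²) on a 2×2 contingency table as a stand-in; it may differ from the measure the authors actually used. The function names and the toy sentences are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def cooccurrence_counts(sentences: Iterable[List[str]]) -> Tuple[Counter, Counter]:
    """Count, per word and per word pair, the number of sentences containing them."""
    word_freq: Counter = Counter()
    pair_freq: Counter = Counter()
    for sentence in sentences:
        types = sorted(set(sentence))          # each word counts once per sentence
        word_freq.update(types)
        pair_freq.update(combinations(types, 2))  # pairs are alphabetically ordered
    return word_freq, pair_freq


def _xlogx(x: float) -> float:
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood(k: int, a: int, b: int, n: int) -> float:
    """Log-likelihood ratio G^2 for a 2x2 table (assumed stand-in formula).

    k: sentences containing both words, a / b: sentences containing the
    first / second word, n: total number of sentences.
    Note: G^2 is large for any deviation from independence, attraction or not.
    """
    cells = [k, a - k, b - k, n - a - b + k]
    rows = [a, n - a]
    cols = [b, n - b]
    return 2.0 * (sum(map(_xlogx, cells)) - sum(map(_xlogx, rows))
                  - sum(map(_xlogx, cols)) + _xlogx(n))


toy_sentences = [
    ["pencil", "eraser"],
    ["pencil", "eraser", "paper"],
    ["court", "law"],
    ["court", "jurisdiction"],
]
word_freq, pair_freq = cooccurrence_counts(toy_sentences)
k = pair_freq[("eraser", "pencil")]
score = log_likelihood(k, word_freq["eraser"], word_freq["pencil"], len(toy_sentences))
```

Ranking all partners of a reference word by such a score and keeping the top entries yields co-occurrence sets of the kind used as input to the bootstrapping step.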
Co-occurrence set examples
Top 25 co-occurrences per reference word, ordered by significance:
Reference word 연필 (pencil): 지우개 (eraser) (25), 만년필 (fountain pen) (22), 국어 (Korean) (14), 볼펜 (ball pen) (14), 쥐는 (grasping) (14), 한자루도 (a pen) (14), 한쪼가리 (a part of) (14), 문구세트 (stationery set) (13), 문화연필은 (Mun-Hwa pencil) (13), 자루 (the measure of numbering pencils) (11), 필통 (pencil box) (11), 한토막 (a part) (11), 공책 (notebook) (10), 기념품을 (souvenir) (9), 노트 (notebook) (9), 시간 (time) (9), 그린 (drawing) (8), 사진 (picture) (8), 한글을 (Korean) (8), 가방 (bag) (7), 쓰던 (writing) (7), 쓰면 (writing) (7), 아이들은 (children) (7), 종이 (paper) (7), 줄은 (decreasing) (7) [..]
Reference word jurisdiction: over (305), court (188), under (183), courts (145), federal (121), Court (95), case (73), court's (68), state (45), within (43), Appeals (38), ruled (38), Circuit (36), SEC (36), law (36), Commission (34), GSBCA (34), appeals (34), House (33), committees (33), Judge (31), Act (29), CFTC (29), Committee (29), subcommittee (28) [...]
Co-occurrence sets alone exhibit too many different relations to the reference word to be used directly for CoreNet extension.
Pendulum Algorithm: bootstrapping with verification
    LastLearned = StartSet;
    Knowledge = StartSet;
    NewLearned = 0;
    while (LastLearned > 0) {
        for all i in LastLearned {
            Candidates = getCooccurrences(i);           // search step
            for all c in Candidates {
                VerifySet = getCooccurrences(c);        // verification step
                if |VerifySet ∩ Knowledge| > threshold {
                    NewLearned += c;
                    Knowledge += c;
                }
            }
        }
        LastLearned = NewLearned;
        NewLearned = 0;
    }
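A runnable Python sketch of the Pendulum pseudocode follows. It is one possible reading, not the authors' code: `getCooccurrences` is replaced by a plain dictionary lookup (`cooc`), the toy co-occurrence data is invented for the example, and candidates that are already known are skipped explicitly, which the pseudocode leaves implicit but which is needed for the loop to terminate.

```python
from typing import Dict, Set


def pendulum(start_set: Set[str],
             cooc: Dict[str, Set[str]],
             threshold: int) -> Set[str]:
    """Bootstrapping with verification over co-occurrence sets."""
    knowledge = set(start_set)
    last_learned = set(start_set)
    while last_learned:
        new_learned: Set[str] = set()
        for item in last_learned:
            # Search step: co-occurrences of a known word are candidates.
            for candidate in cooc.get(item, set()):
                if candidate in knowledge:
                    continue  # assumption: never re-learn a known word
                # Verification step: the candidate's own co-occurrences
                # must overlap sufficiently with what is already known.
                verify_set = cooc.get(candidate, set())
                if len(verify_set & knowledge) > threshold:
                    new_learned.add(candidate)
                    knowledge.add(candidate)
        last_learned = new_learned
    return knowledge


# Toy co-occurrence sets (invented data, for illustration only).
toy_cooc: Dict[str, Set[str]] = {
    "eye": {"nose", "mouth", "cheek", "see"},
    "nose": {"eye", "mouth", "cheek", "smell"},
    "mouth": {"eye", "nose", "cheek", "eat"},
    "cheek": {"eye", "nose", "mouth"},
    "see": {"eye", "movie"},
    "smell": {"nose", "flower"},
    "eat": {"mouth", "food"},
}
learned = pendulum({"eye", "nose", "mouth"}, toy_cooc, threshold=1)
```

In this toy run "cheek" is accepted because its co-occurrence set contains three known words, whereas "see", "smell" and "eat" each link back to only one known word and are rejected by the verification step.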
Pendulum Example
Seed: 관자놀이 (temple), 눈 (eye), 뺨 (cheek), 시 (poem), 쌍꺼풀 (double eyelid), 부위마다 (part of face), 아랫입술 (lower lip), 오관 (the five sensory organs), 입 (mouth), 코 (nose), 혀 (tongue)
Search with 관자놀이 (temple): …, 복사뼈 (malleolus bone), …
Verify for 복사뼈 (malleolus bone): 부위마다 (part of face), 안면부 (part of the face), 인당 (ligament), 인중 (philtrum), 경골 (tibial), 관자놀이 (temple), 경혈을 (spots on the body suitable for acupuncture), 손끝으로 (with fingertip), 용천 (spring), 청명 (serenity), 4차례씩 (per 4 times), 두드릴 (tapping), 발바닥 (the sole of the foot), 코와 (with nose), 등 (back), 오리 (duck), 영향 (influence), 상부 (high part), 위쪽 (front part), 신체 (body), 예방하는 (preparing), 중간 (middle), 입 (mouth), 질병을 (disease), 코 (nose), 한가운데 (center), 가볍게 (lightly), 곳 (place), 누르고 (pressing), 지정된 (appointed).
Four words of this verification set are in the seed: 부위마다, 관자놀이, 입 and 코.
Evaluation
• Selection of concepts performed by a non-Korean speaker
• Evaluation performed manually; only new words counted
• Heuristics for avoiding result set infection:
  - iteratively lower the verification threshold from 8 down to 3 until the result set is too large
  - take the lowest threshold that yields a result set of reasonable size (not exceeding the start set)
• A typical run needed 3-7 iterations to converge
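The threshold heuristic can be sketched as follows. This reflects one reading of the slide, namely that "reasonable size" means the number of new words does not exceed the size of the start set. The function `choose_threshold`, the `run` callback and the fake results are hypothetical names and data introduced only for this illustration.

```python
from typing import Callable, Dict, Optional, Set, Tuple


def choose_threshold(run: Callable[[int], Set[str]],
                     start_size: int,
                     high: int = 8,
                     low: int = 3) -> Optional[Tuple[int, Set[str]]]:
    """Lower the threshold from `high` to `low`; stop once the result is too large.

    `run(threshold)` is assumed to return the set of newly learned words for
    that threshold. Returns the lowest threshold whose result set is still of
    reasonable size, together with that result set, or None if even the
    highest threshold yields too many words.
    """
    best: Optional[Tuple[int, Set[str]]] = None
    for threshold in range(high, low - 1, -1):
        new_words = run(threshold)
        if len(new_words) > start_size:
            break  # result set too large: likely infected by errors
        best = (threshold, new_words)
    return best


# Fake results per threshold (invented data, for illustration only).
FAKE_RESULTS: Dict[int, Set[str]] = {
    8: set(),
    7: {"a"},
    6: {"a", "b"},
    5: {"a", "b", "c"},
    4: {"a", "b", "c", "d", "e", "f"},
    3: {"a", "b", "c", "d", "e", "f", "g", "h"},
}
chosen = choose_threshold(lambda t: FAKE_RESULTS[t], start_size=4)
```

With a start set of four words, the toy run stops at threshold 4, where six new words appear, and keeps the result obtained at threshold 5.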
Results
CoreNet ID   Name of Concept        Size   # new   # ok   precision
50           human good/bad          119      36      5      13.89%
111          human relation          274       3      2      66.67%
113          partner / co-worker     123      23      8      34.78%
114          partner / member         71       5      3      60.00%
181          human ability           213       7      2      28.57%
430          store                   128      12     11      91.67%
471          land, area              260      10      2      20.00%
548          insect, bug              75      43      6      13.95%
552          part of animal          736      10      6      60.00%
553          head                    139       7      4      57.14%
577          forehead                 72       4      2      50.00%
590          legs and arms            86       7      3      42.86%
672          plant (vegetation)      461      30     15      50.00%
817          clothes                 246      34     18      52.94%
Sum:                                3439     231     87      37.67%
Not enough for automatic extension, but a good source for candidates
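The precision column is the share of accepted words among the newly proposed ones (# ok divided by # new). The short Python sketch below recomputes it from the counts in the table; the variable names are assumptions, and the pooled value may differ from the printed sum row in the last digit because of rounding.

```python
from typing import Dict, List, Tuple

# (CoreNet ID, concept name, # new, # ok), copied from the results table.
ROWS: List[Tuple[int, str, int, int]] = [
    (50, "human good/bad", 36, 5),
    (111, "human relation", 3, 2),
    (113, "partner / co-worker", 23, 8),
    (114, "partner / member", 5, 3),
    (181, "human ability", 7, 2),
    (430, "store", 12, 11),
    (471, "land, area", 10, 2),
    (548, "insect, bug", 43, 6),
    (552, "part of animal", 10, 6),
    (553, "head", 7, 4),
    (577, "forehead", 4, 2),
    (590, "legs and arms", 7, 3),
    (672, "plant (vegetation)", 30, 15),
    (817, "clothes", 34, 18),
]

# Precision per concept: accepted new words / proposed new words.
per_concept: Dict[int, float] = {cid: ok / new for cid, _, new, ok in ROWS}

total_new = sum(new for _, _, new, _ in ROWS)
total_ok = sum(ok for _, _, _, ok in ROWS)

# Pooled precision weights every proposed word equally ...
pooled_precision = total_ok / total_new
# ... while the mean over concepts weights every concept equally.
mean_precision = sum(per_concept.values()) / len(per_concept)

for cid, name, new, ok in ROWS:
    print(f"{cid:>4} {name:<22} {per_concept[cid]:.2%}")
print(f"pooled: {pooled_precision:.2%}  mean over concepts: {mean_precision:.2%}")
```

The two summary figures answer different questions: the pooled value is dominated by concepts with many proposals, such as "insect, bug", while the mean over concepts shows how the method behaves on a typical concept.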
Problems... ...and possible solutions
• "Coverage is low"
  - increase corpus size for relevant domains
  - make use of other features, e.g. patterns
• "Precision is not satisfactory"
  - obtain multiple concepts simultaneously
  - meta-level bootstrapping
  - make use of other features, e.g. POS tags for word class information
This work gives a baseline of what is reachable without employing language-dependent features
Summary
Language-independent method for semi-automatic extension of lexical-semantic word nets using
• Co-occurrence data on the basis of a plain text corpus
• The Pendulum Algorithm for keeping precision high in bootstrapping
Questions? THANK YOU!
Local Ontology Engineering
• Bottom-up approach: given an existing (small) ontology, how can it be (semi)automatically extended?
• Top nodes of ontologies are scarcely lexicalized, so the focus is on leaves rather than on branch and trunk nodes
• Local view: extension does not take the global structure into account but operates within sub-trees
Focus on local areas
Using word class information: German Example
Algorithm:
    Word set W
    As long as new words w are found:
        candidates C = co-occurrences of w of different word class
        for all c in C:
            if co-occurrence set of c contains enough words of W with different class of c:
                add c to W
19.23 Hieb- und Stichwaffe (cutting and stabbing weapons) (DORNSEIFF 2003)
Waffe
• Stichwaffe · Bajonett · Damaszener · Degen · Dolch · Florett · Lanze · Säbel · Schwert · Sense · Speer · Spieß
• Messer · Fahrtenmesser · Jagdmesser · Klinge · Stilett
• Hiebwaffe · Baseballschläger · Faustkeil · Keule · Knüppel · Morgenstern · Prügel · Schlagring · Schlagstock · Stock · Totschläger
• Bumerang · Hellebarde · Streitaxt · Tomahawk
• Armatur · Bewaffnung · Rüstung · Wehr
• Arsenal · Rüstkammer · Waffenkammer · Waffenlager · Zeughaus
• bewaffnen · rüsten · wappnen
• einprügeln · einschlagen · einstechen · erschlagen · erstechen · prügeln · schlagen · stechen · verprügeln · zuschlagen · zustechen
New for 19.23
Abrißbirne · Axt · Drahtesel · Eisenstange · Fäuste · Golfschläger · Hüften · Lüfte · Peitsche · Pendel · Racket · Sattel · Schläger · Skins · Takt · Tanzbein · Unterleib · Zepter · einschlug · ersticht · fechten · ficht · kreuzen · rammt · schwang · schwangen · schwingen · schwingt · traktiert · zückt · zückte
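The word-class variant of the algorithm can be sketched in Python as below. It is an illustration under stated assumptions, not the authors' code: "enough" is modelled by a hypothetical `min_support` parameter, word classes come from a plain dictionary, and the small co-occurrence data set is invented, although some of its words appear in the lists above.

```python
from typing import Dict, Set


def extend_with_word_class(words: Set[str],
                           cooc: Dict[str, Set[str]],
                           word_class: Dict[str, str],
                           min_support: int) -> Set[str]:
    """Extend a word set, alternating between word classes at every step."""
    known = set(words)
    frontier = set(words)
    while frontier:
        found: Set[str] = set()
        for w in frontier:
            for c in cooc.get(w, set()):
                # Candidates must have a different word class than w.
                if c in known or word_class.get(c) == word_class.get(w):
                    continue
                # Count known words of a different class than c among
                # the co-occurrences of c.
                support = {v for v in cooc.get(c, set())
                           if v in known and word_class.get(v) != word_class.get(c)}
                if len(support) >= min_support:
                    found.add(c)
                    known.add(c)
        frontier = found
    return known


# Invented toy data: N = noun, V = verb.
toy_class = {"Schwert": "N", "Dolch": "N", "Säbel": "N", "Axt": "N",
             "Haus": "N", "zückt": "V", "schwingt": "V", "kauft": "V"}
toy_links: Dict[str, Set[str]] = {
    "Schwert": {"zückt", "schwingt", "Haus"},
    "Dolch": {"zückt"},
    "Säbel": {"zückt", "schwingt"},
    "zückt": {"Schwert", "Dolch", "Säbel", "Axt"},
    "schwingt": {"Schwert", "Säbel", "Axt"},
    "Axt": {"zückt", "schwingt", "kauft"},
    "Haus": {"Schwert", "kauft"},
    "kauft": {"Haus", "Axt"},
}
extended = extend_with_word_class({"Schwert", "Dolch", "Säbel"},
                                  toy_links, toy_class, min_support=2)
```

In the toy run the nouns first pull in the verbs "zückt" and "schwingt", which in turn pull in the noun "Axt". "Haus" is never a candidate because it shares the word class of the seed nouns, and "kauft" fails because only one known noun supports it.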