Unsupervised Natural Language Processing using Graph Models: The Structure Discovery Paradigm. Chris Biemann, University of Leipzig, Germany. Doctoral Consortium at HLT-NAACL 2007, Rochester, NY, USA, April 22, 2007
Outline Review of traditional approaches • Knowledge-intensive vs. knowledge-free • Degrees of Supervision • Computational Linguistics vs. statistical NLP A new approach • The Structure Discovery Paradigm Graph-based SD procedures • Graph models for language processing • Graph-based SD procedures • Results in task-based evaluation
Knowledge-Intensive vs. Knowledge-Free In traditional automated language processing, knowledge is involved wherever humans manually tell machines • How to process language by explicit knowledge • How a task should be solved by implicit knowledge Knowledge can be provided by means of: • Dictionaries, e.g. thesauri, WordNet, ontologies, … • (grammar) rules • Annotation
Degrees of Supervision Supervision is providing positive and negative training examples to Machine Learning algorithms, which use this as a basis for building a model that reproduces the classification on unseen data Degrees: • Fully supervised (Classification): Learning is only carried out on fully labeled training set • Semi-supervised: Unlabeled examples are also used for building a data model • Weakly-supervised (Bootstrapping): A small set of labeled examples is grown and classifications are used for re-training • Unsupervised (Clustering): No labeled examples are provided
Computational Linguistics and Statistical NLP CL: • Implementing linguistic theories with computers • Rule-based approaches • Rules found by introspection, not data-driven • Explicit knowledge • Goal: understanding language itself Statistical NLP: • Building systems that perform language processing tasks • Machine Learning approaches • Models are built by training on annotated datasets • Implicit knowledge • Goal: build robust systems with high performance There is a continuum rather than a sharp dividing line
Structure Discovery Paradigm SD: • Analyze raw data and identify regularities • Statistical methods, clustering • Knowledge-free, unsupervised • Structures: as many as can be discovered • Language-independent, domain-independent, encoding-independent • Goal: Discover structure in language data and mark it in the data
Example: Discovered Structures
"Increased interest rates lead to investments in banks."
<sentence lang=12, subj=34.11>
  <chunk id=c25>
    <word POS=p3 m=0.0 s=s14>In=creas-ed</word>
    <MWU POS=p1 s=s33>
      <word POS=p1 m=5.1 s=s44>interest</word>
      <word POS=p1 m=2.12 s=s106>rate-s</word>
    </MWU>
  </chunk>
  <chunk id=c13>
    <MWU POS=p2>
      <word POS=p2 m=17.3 s=74>lead</word>
      <word POS=p117 m=11.98>to</word>
    </MWU>
  </chunk>
  <chunk id=c31>
    <word POS=p1 m=1.3 s=33>investment-s</word>
    <word POS=p118 m=11.36>in</word>
    <word POS=p1 m=1.12 s=33>bank-s</word>
  </chunk>
  <word POS=298>.</word>
</sentence>
• Annotation on various levels • Similar labels denote similar properties as found by the SD algorithms • Similar structures in the corpus are annotated in a similar way
Consequences of Working in SD • Only input allowed is raw text data • Machines are told how to algorithmically discover structure • Self-annotation process by marking regularities in the data • Structure Discovery process is iterated (Diagram: text data -> SD algorithms find regularities by analysis -> annotate data with regularities -> iterate)
Pros and Cons of Structure Discovery Advantages: • Cheap: only raw data needed • Alleviation of acquisition bottleneck • Language and domain independent • No data-resource mismatch (all resources leak) Disadvantages: • No control over self-annotation labels • Congruence to linguistic concepts not guaranteed • Much computing time needed
Building Blocks in SD Hierarchical levels of basic units in text data: • Letters • Words • Sentences • Documents These are assumed to be recognizable in the remainder. SD allows for • arbitrary numbers of intermediate levels • grouping of basic units into complex units, but these have to be found by SD procedures.
Similarity and Homogeneity For determining which units share structure, a similarity measure for units is needed. Two kinds of features are possible: • Internal features: compare units based on the lower level units they contain • Context features: compare units based on other units of same or other level that surround them A clustering based on unit similarity yields sets of units that are homogeneous w.r.t. structure This is an abstraction process: Units are subsumed under the same label.
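To illustrate context features, here is a minimal sketch (not from the talk; window size and weighting are illustrative choices): each word is described by the words in a one-word window around it, and two words are compared by the cosine of their context vectors.

from collections import Counter
from math import sqrt

def context_vectors(sentences, window=1):
    # Context features: for each word, count the words in a +/-window around it.
    vectors = {}
    for sent in sentences:
        for i, word in enumerate(sent):
            context = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(a, b):
    # Similarity of two context vectors; similar words can then be clustered.
    num = sum(a[k] * b[k] for k in a.keys() & b.keys())
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0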
What is it good for? How do I know? • Many structural regularities can be thought of; some are interesting, some are not. • Structures discovered by SD algorithms will not necessarily match the concepts of linguists • Working in the SD paradigm means over-generating structure acquisition methods and checking whether they are helpful Methods for telling helpful from useless SD procedures: • "Look at my nice clusters" approach: examine data by hand. While good in the initial phase of testing, this is inconclusive: choice of clusters, coverage, … • Task-based evaluation: use the labels obtained as features in a Machine Learning scenario and measure the contribution of each label type. This involves supervision and is indirect
Graph models for SD procedures Motivation for graph representation • Graphs are an intuitive and natural way to encode language units as nodes and their similarities as edges, though other representations are also possible • Graph clustering can efficiently perform abstraction by grouping units into homogeneous sets with Chinese Whispers Some graphs on basic units • Word co-occurrence (neighbor/sentence), significance, higher orders • Word context similarity based on local context vectors • Sentence/document similarity based on common words
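As an illustration of one such graph model, the following is a minimal sketch of a sentence-based word co-occurrence graph with a simple Poisson-style significance threshold; the exact significance measure and threshold used in the actual work may differ, and all names here are illustrative.

from collections import Counter
from itertools import combinations
from math import log

def cooccurrence_graph(sentences, min_sig=6.63):
    # Nodes are words; edges connect words that co-occur in a sentence
    # significantly more often than expected by chance.
    word_freq, pair_freq, n = Counter(), Counter(), 0
    for sent in sentences:
        words = sorted(set(sent))
        word_freq.update(words)
        pair_freq.update(combinations(words, 2))
        n += 1
    edges = {}
    for (a, b), k in pair_freq.items():
        expected = word_freq[a] * word_freq[b] / n
        # Poisson-style log-likelihood significance (an approximation)
        sig = 2 * (k * log(k / expected) - (k - expected)) if k > expected else 0.0
        if sig >= min_sig:
            edges[(a, b)] = sig
    return edges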
Some graph-based SD procedures • Language Separation: • Cluster the sentence-based significant word co-occurrence graph • Use word lists for language identification • Induced POS • Cluster the local stop word context vector similarity graph • Cluster the second-order neighbor word co-occurrence graph • Train and apply a trigram tagger • Word Sense Disambiguation • Cluster the neighborhood of the target word in the sentence-based significant co-occurrence graph into sense clusters • Compare sense clusters with the local context for disambiguation • Semantic classes • Cluster the similarity graph of words and induced POS contexts • Use contexts for assigning semantic classes
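For the word sense disambiguation step, a minimal sketch of the final comparison: given sense clusters for a target word (e.g. sets of neighboring words obtained by clustering its co-occurrence neighborhood), choose the sense whose cluster overlaps most with the local context. Function and variable names are illustrative.

def disambiguate(context_words, sense_clusters):
    # sense_clusters: dict sense_id -> set of words forming that sense cluster.
    # Returns the sense whose cluster shares the most words with the local context.
    context = set(context_words)
    return max(sense_clusters, key=lambda s: len(sense_clusters[s] & context))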
“Look at my nice languages!” Cleaning CUCWeb Latin: In expeditionibus tessellata et sectilia pauimenta circumferebat. Britanniam petiuit spe margaritarum: earum amplitudinem conferebat et interdum sua manu exigebat .. Scripting: @echo @cd $(TLSFDIR);$(CC) $(RTLFLAGS) $(RTL_LWIPFLAGS) -c $(TLSFSRC) … @echo @cd $(TOOLSDIR);$(CC) $(RTLFLAGS) $(RTL_LWIPFLAGS) -c $(TOOLSSRC) .. Hungarian: A külügyminiszter a diplomáciai és konzuli képviseletek címjegyzékét és konzuli … Köztestületek, jogi személyiséggel és helyi jogalkotási jogkörrel. Esperanto: Por vidi ghin kun internacia kodigho kaj kun kelkaj bildoj kliku tie chi ) La Hispana.. Ne nur pro tio, ke ghi perdigis la vivon de kelk-centmil hispanoj, sed ankau pro ghia efiko.. Human Genome: 1 atgacgatga gtacaaacaa ctgcgagagc atgacctcgt acttcaccaa ctcgtacatg 61 ggggcggaca tgcatcatgg gcactacccg ggcaacgggg tcaccgacct ggacgcccag 121 cagatgcacc … Isoko (Nigeria): (1) Kọ Ileleikristi a rẹ rọwo inọ Ọghẹnẹ yọ Esanerọvo? (5) Kọ Jesu o whu evaọ uruwhere?
Task-based unsuPOS evaluation UnsuPOS tags are used as features; performance is compared to no POS and supervised POS. The tagger was induced in one CPU day from the BNC • Kernel-based WSD: better than noPOS, equal to suPOS • POS tagging: better than noPOS • Named Entity Recognition: no significant differences • Chunking: better than noPOS, worse than suPOS
Summary • Structure Discovery Paradigm contrasted to traditional approaches: • no manual annotation, no resources (cheaper) • language- and domain-independent • iteratively enriching structural information by finding and annotating regularities • Graph-based SD procedures • Evaluation framework and results
Questions? THANKS FOR YOUR ATTENTION!
Structure Discovery Machine I From linguistics, we have the following intuitions that can lead to SD algorithms that capture their underlying structure: • There are different languages • Words belong to word classes • Short sequences of words form multi word units • Words can be semantically decomposable (compounds) • Words are subject to inflection • Morphological congruence between words • There are grammatical dependencies between words and sequences of words • Words can have different semantic properties • Semantic congruence between words • A word can have several meanings
Structure Discovery Machine II The following methods are SD algorithms: • Language Identification: as introduced • POS Induction: as introduced • MWU detection by Collocation extraction • Unsupervised Compound Decomposition and Paraphrasing: (work in progress) • Unsupervised Morphology (MorphoChallenge): letter successor varieties • Unsupervised Parsing: Grammar Induction based on POS and neighbor-based co-occurrences • Semantic classes: Similarity in context patterns of words and POS (work in progress) • WSI+WSD: Clustering Co-occurrences+Disambiguation (work in progress)
Chinese Whispers Graph Clustering (Diagram: small example graph with nodes A to E, class labels L1 to L4 and edge weights)
• Explanations • Nodes have a class and communicate it to their adjacent nodes • A node adopts the majority class in its neighborhood • Nodes are processed in random order for some iterations
• Properties • Time-linear in the number of edges: very efficient • Randomized, non-deterministic • Parameter-free • Number of clusters is found by the algorithm • Small World graphs converge fast
Algorithm:
  initialize: forall vi in V: class(vi) = i;
  while changes:
    forall v in V, randomized order:
      class(v) = highest ranked class in neighborhood of v;
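A minimal sketch of Chinese Whispers as described on this slide, assuming a weighted undirected graph given as a dict of edges; the iteration limit and tie handling are illustrative choices, not prescribed by the algorithm.

import random
from collections import defaultdict

def chinese_whispers(nodes, edges, iterations=20):
    # edges: dict (u, v) -> weight. Returns dict node -> class label.
    neighbors = defaultdict(dict)
    for (u, v), w in edges.items():
        neighbors[u][v] = w
        neighbors[v][u] = w
    label = {n: n for n in nodes}               # initialize: class(vi) = i
    for _ in range(iterations):
        order = list(nodes)
        random.shuffle(order)                   # process nodes in random order
        changed = False
        for v in order:
            if not neighbors[v]:
                continue
            scores = defaultdict(float)         # sum edge weights per class in the neighborhood
            for u, w in neighbors[v].items():
                scores[label[u]] += w
            best = max(scores, key=scores.get)  # adopt the highest ranked class
            if best != label[v]:
                label[v], changed = best, True
        if not changed:                         # no node changed its class: converged
            break
    return label

For example, chinese_whispers(["A", "B", "C", "D", "E"], {("A", "B"): 5, ("A", "D"): 8, ("C", "E"): 6}) typically groups A, B, D into one cluster and C, E into another (the edge weights here are made up for illustration).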
Language Separation Evaluation • Cluster the co-occurrence graph of a multilingual corpus • Use words of the same class in a language identifier as lexicon • Almost perfect performance
unsuPOS: Steps
Pipeline: Unlabelled Text -> high frequency words: Distributional Vectors -> Graph 1; medium frequency words: NB-cooccurrences -> Graph 2 -> Chinese Whispers Graph Clustering -> Partitioning 1, Partitioning 2 -> Maxtag Lexicon -> Partially Labelled Text -> Trigram Viterbi Tagger -> Fully Labelled Text
Example:
... , sagte der Sprecher bei der Sitzung .
... , rief der Vorsitzende in der Sitzung .
... , warf in die Tasche aus der Ecke .
Clusters: C1: sagte, warf, rief | C2: Sprecher, Vorsitzende, Tasche | C3: in | C4: der, die
Partially labelled:
... , sagte|C1 der|C4 Sprecher|C2 bei der|C4 Sitzung .
... , rief|C1 der|C4 Vorsitzende|C2 in|C3 der|C4 Sitzung .
... , warf|C1 in|C3 die|C4 Tasche|C2 aus der|C4 Ecke .
Fully labelled:
... , sagte|C1 der|C4 Sprecher|C2 bei|C3 der|C4 Sitzung|C2 .
... , rief|C1 der|C4 Vorsitzende|C2 in|C3 der|C4 Sitzung|C2 .
... , warf|C1 in|C3 die|C4 Tasche|C2 aus|C3 der|C4 Ecke|C2 .
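A minimal sketch (with invented names) of the lexicon-lookup step in this pipeline: words covered by the clusterings receive their cluster id as a tag, the remaining words stay unlabelled and are later tagged by the trigram Viterbi tagger trained on the partially labelled text (the tagger itself is not shown here).

def partially_label(sentences, lexicon):
    # lexicon: dict word -> cluster id, e.g. {'sagte': 'C1', 'der': 'C4'}.
    # Returns sentences as lists of (word, tag-or-None) pairs.
    return [[(w, lexicon.get(w)) for w in sent] for sent in sentences]

lexicon = {"sagte": "C1", "rief": "C1", "warf": "C1",
           "Sprecher": "C2", "Vorsitzende": "C2", "Tasche": "C2",
           "in": "C3", "der": "C4", "die": "C4"}
tagged = partially_label([["sagte", "der", "Sprecher", "bei", "der", "Sitzung", "."]], lexicon)
# -> [[('sagte', 'C1'), ('der', 'C4'), ('Sprecher', 'C2'), ('bei', None), ...]]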
unsuPOS: Ambiguity Example
Word | cluster ID | cluster members (size)
I    | 166 | I (1)
saw  | 2   | past tense verbs (3818)
the  | 73  | a, an, the (3)
man  | 1   | nouns (17418)
with | 13  | prepositions (143)
a    | 73  | a, an, the (3)
saw  | 1   | nouns (17418)
.    | 116 | . ! ? (3)
unsuPOS: Medline tagset
1 (13721): recombinogenic, chemoprophylaxis, stereoscopic, MMP2, NIPPV, Lp, biosensor, bradykinin, issue, S-100beta, iopromide, expenditures, dwelling, emissions, implementation, detoxification, amperometric, appliance, rotation, diagonal,
2 (1687): self-reporting, hematology, age-adjusted, perioperative, gynaecology, antitrust, instructional, beta-thalassemia, interrater, postoperatively, verbal, up-to-date, multicultural, nonsurgical, vowel, narcissistic, offender, interrelated,
3 (1383): proven, supplied, engineered, distinguished, constrained, omitted, counted, declared, reanalysed, coexpressed, wait,
4 (957): mediates, relieves, longest, favor, address, complicate, substituting, ensures, advise, share, employ, separating, allowing,
5 (1207): peritubular, maxillary, lumbar, abductor, gray, rhabdoid, tympanic, malar, adrenal, low-pressure, mediastinal,
6 (653): trophoblasts, paws, perfusions, cerebrum, pons, somites, supernatant, Kingdom, extra-embryonic, Britain, endocardium,
7 (1282): acyl-CoAs, conformations, isoenzymes, STSs, autacoids, surfaces, crystallins, sweeteners, TREs, biocides, pyrethroids,
8 (1613): colds, apnea, aspergilloma, ACS, breathlessness, perforations, hemangiomas, lesions, psychoses, coinfection, terminals, headache, hepatolithiasis, hypercholesterolemia, leiomyosarcomas, hypercoagulability, xerostomia, granulomata, pericarditis,
9 (674): dysregulated, nearest, longest, satisfying, unplanned, unrealistic, fair, appreciable, separable, enigmatic, striking,
10 (509): differentiative, ARV, pleiotropic, endothermic, tolerogenic, teratogenic, oxidizing, intraovarian, anaesthetic, laxative,
13 (177): ewe, nymphs, dams, fetuses, marmosets, bats, triplets, camels, SHR, husband, siblings, seedlings, ponies, foxes, neighbor, sisters, mosquitoes, hamsters, hypertensives, neonates, proband, anthers, brother, broilers, woman, eggs,
14 (103): considers, comprises, secretes, possesses, sees, undergoes, outlines, reviews, span, uncovered, defines, shares,
15 (87): feline, chimpanzee, pigeon, quail, guinea-pig, chicken, grower, mammal, toad, simian, rat, human-derived, piglet, ovum,
16 (589): dually, rarely, spectrally, circumferentially, satisfactorily, dramatically, chronically, therapeutically, beneficially, already,
18 (124): 1-min, two-week, 4-min, 8-week, 6-hour, 2-day, 3-minute, 20-year, 15-minute, 5-h, 24-h, 8-h, ten-year, overnight, 120-
21 (12): July, January, May, February, December, October, April, September, June, August, March, November
23 (13): acetic, retinoic, uric, oleic, arachidonic, nucleic, sialic, linoleic, lactic, glutamic, fatty, ascorbic, folic
25 (28): route, angle, phase, rim, state, region, arm, site, branch, dimension, configuration, area, Clinic, zone, atom, isoform,
247 (6): P<0_001, P<0_01, p<0_001, p<0_01, P<_001, P<0_0001
391 (119): alcohol, ethanol, heparin, cocaine, morphine, cisplatin, dexamethasone, estradiol, melatonin, nicotine, fibronectin,