
Label Propagation for Tax Law Thesaurus Extension

This master's thesis explores label propagation on graphs as a way to extend thesauri in the legal context, where synonym sets support information retrieval for legal content providers. A tool was built to extend existing thesauri and to evaluate the suggestions, with the aim of semi-automating thesaurus maintenance. The presentation covers the research questions, methods, and potential use-cases, and compares semi-supervised learning with a nearest-neighbors approach.

Presentation Transcript


  1. Label Propagation for Tax Law Thesaurus Extension. Markus Müller, 09.11.2018, Master’s Thesis Final Presentation. Advisors: Prof. Dr. Stephan Günnemann (Group for Data Mining and Analytics), Jörg Landthaler, Elena Scepankova

  2. Outline. Motivation: • Problem: Thesauri in the Legal Context • Base Technology: Word Embeddings • Opportunity: Label Propagation on Graphs. Research Approach: • Research Questions • Research Methods • Thesaurus Extension Tool. Evaluation Results: • Quantitative Evaluation • Qualitative Evaluation • Baseline Comparison. Conclusion & Future Work. 181109 Mueller Label Propagation Thesaurus Extension MA Final

  3. Problem: Thesauri in the Legal Context. Thesauri enhance information retrieval via synonym sets: legal content providers provide their users with access to relevant legal documents, e.g. through search query expansion. Example: a search for “Abwrackprämie” is also showing results for “Umweltprämie” (“[…] Abwrackprämie, the colloquial term for Umweltprämie […]”). Creating and maintaining thesauri is hard: mostly manual work, multiple domain-specific thesauri. Leading providers in Germany: Wolters Kluwer 2016 [1]

  4. Focus: Thesaurus Extension as a Solution Approach. Inputs: existing thesaurus and text corpus. Suggest words from the text corpus as synonym set (synset) additions. Subject to research at this chair: Landthaler et al. (2017) extended synsets starting from individual synset words

  5. Potential Use-Cases for Thesaurus Extension. • Semi-automation, quality assurance: suggest synset adjustments • Semi-automation: suggest additions to manually created synsets • Thesauri linkage: identify relations between synsets of different thesauri

  6. Problem with Vanilla Word Embeddings for Thesaurus Extension. Word embedding technologies map similar words to similar vectors (example words: state, country, king, monarch). ⇒ Nearest neighbors: extend synset with words close to synset words (blue & red: words from different existing synsets; green: extension suggestion). But then: overall structure is not taken into account (A & B: labeled with different synsets; rest: unlabeled; X would fit better to B than to A). ⇒ Opportunity: semi-supervised learning
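
The nearest-neighbors idea on this slide can be illustrated with a minimal sketch. The 2-d vectors, the `EMBEDDINGS` table, and the `suggest_additions` helper are all hypothetical and only serve the illustration; cosine similarity is assumed as the closeness measure, which the slides do not specify.

```python
import math

# Hypothetical 2-d embeddings; real models use hundreds of dimensions.
EMBEDDINGS = {
    "king": (0.90, 0.10),
    "monarch": (0.85, 0.20),
    "state": (0.10, 0.90),
    "country": (0.20, 0.85),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def suggest_additions(synset, k):
    """Return the k words outside the synset that lie closest to any synset word."""
    candidates = [w for w in EMBEDDINGS if w not in synset]

    def best_similarity(word):
        return max(cosine(EMBEDDINGS[word], EMBEDDINGS[m]) for m in synset)

    return sorted(candidates, key=best_similarity, reverse=True)[:k]


suggestions = suggest_additions({"king"}, k=1)
print(suggestions)
```

Each suggestion depends only on the distance to the synset's own words, so words labeled with other synsets have no influence. That is the limitation the slide points out.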

  7. Research Idea: Label Propagation for Thesaurus Extension. Label propagation is used by Google in combination with word embeddings for knowledge graph extension, e.g. for Emotion Association and Smart Replies (https://ai.googleblog.com/2016/10/graph-powered-machine-learning-at-google.html & Ravi and Diao (2015)). RQ1: Can we apply label propagation to word embeddings to find new synonyms? Intuition: text corpus → embedding generation → word embeddings → graph construction → sparsely-labeled graph with labels from existing thesaurus → label propagation → fully-labeled graph
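
The step from a sparsely-labeled to a fully-labeled graph can be sketched as follows. This is a generic hard-clamping label propagation written for illustration, not the thesis tool's implementation; the words, edge weights, and synset labels "A" and "B" are made up.

```python
def build_graph(edges):
    """Turn (word, word, weight) triples into a symmetric adjacency dict."""
    graph = {}
    for a, b, weight in edges:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def propagate_labels(graph, seeds, labels, iterations=20):
    """Spread seed labels along weighted edges; seed nodes stay clamped."""
    dist = {n: {l: 0.0 for l in labels} for n in graph}
    for node, label in seeds.items():
        dist[node][label] = 1.0
    for _ in range(iterations):
        updated = {}
        for node, neighbors in graph.items():
            if node in seeds:
                updated[node] = dict(dist[node])  # keep the known label
                continue
            total = sum(neighbors.values())
            updated[node] = {
                l: sum(w * dist[m][l] for m, w in neighbors.items()) / total
                for l in labels
            }
        dist = updated
    # Each node receives the label with the highest propagated mass.
    return {n: max(d, key=d.get) for n, d in dist.items()}


# Hypothetical similarity graph: two clusters joined by one weak edge.
graph = build_graph([
    ("king", "monarch", 0.9), ("monarch", "ruler", 0.8),
    ("state", "country", 0.9), ("country", "nation", 0.8),
    ("ruler", "nation", 0.1),
])
predicted = propagate_labels(graph, {"king": "A", "state": "B"}, ["A", "B"])
print(predicted)
```

In this toy graph, "ruler" has no direct edge to a labeled word and still receives synset A through "monarch", which is the kind of structural information a pure nearest-neighbors lookup ignores.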

  8. Research Questions. RQ1: Is label propagation (LP) a suitable technology for thesaurus extension in the legal domain? RQ2: How can we get semantic & context information into a graph for LP? RQ3: Can we model the thesaurus extension problem on the LP technology? RQ4: What LP algorithms work best? RQ5: How much automation for thesaurus creation is achievable with LP?

  9. Research Approach. Build a thesaurus extension tool for trying out many approaches. Quantitative evaluation: automatic parameter studies. Qualitative evaluation: manual studies. Comparison with vanilla word embeddings approach

  10. Thesaurus Extension Tool: Architecture. Extendable & open source on sebischair @ GitHub. Pipes & Filters architecture, Buschmann et al. (1996), with input/output and filter stages. Pre-processing: special character handling. Embeddings: word2vec, fastText, GloVe. Graph construction: • k-nearest-neighbors • 𝜀-neighborhood graph. Label propagation: • LabelPropagation • LabelSpreading, via Group for Data Mining and Analytics. Many variants incl. parameters implemented for later evaluation
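
The two graph types named on this slide can be sketched in a few lines. The code below is an illustration under assumptions, not the tool's code: the vectors are hypothetical, cosine similarity is assumed as the edge weight, and the 𝜀-neighborhood is assumed to be defined on cosine distance (1 minus similarity).

```python
import math

# Hypothetical 2-d embeddings for illustration only.
VECTORS = {
    "king": (0.90, 0.10),
    "monarch": (0.85, 0.20),
    "state": (0.10, 0.90),
    "country": (0.20, 0.85),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_graph(vectors, k):
    """Undirected weighted graph: connect each word to its k most similar words."""
    graph = {w: {} for w in vectors}
    for word, vec in vectors.items():
        # Excluding the word itself means the graph has no self-references.
        sims = [(cosine(vec, other_vec), other)
                for other, other_vec in vectors.items() if other != word]
        sims.sort(reverse=True)
        for sim, other in sims[:k]:
            graph[word][other] = sim
            graph[other][word] = sim  # store both directions
    return graph


def epsilon_graph(vectors, eps):
    """Undirected weighted graph: connect words whose cosine distance is <= eps."""
    graph = {w: {} for w in vectors}
    words = list(vectors)
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            sim = cosine(vectors[a], vectors[b])
            if 1.0 - sim <= eps:
                graph[a][b] = sim
                graph[b][a] = sim
    return graph


knn = knn_graph(VECTORS, k=1)
eps_graph = epsilon_graph(VECTORS, eps=0.1)
print(sorted(knn["king"]), sorted(eps_graph["state"]))
```

A k-nearest-neighbors graph gives every word at least k edges regardless of how far its neighbors are, whereas an 𝜀-neighborhood graph can leave isolated words unconnected. Which behaviour is preferable depends on the corpus.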

  11. Quantitative Evaluation: Set-up. Tax law data set by DATEV (in German): • text corpus: 132,581 legal documents • handcrafted existing thesaurus: 12,288 synsets. Evaluation thesaurus (subset): • 2,552 thesaurus synsets • training set: 3,277 words • test set: 2,887 words. Hyper-parameter studies on these phases: pre-processing, embeddings generation, graph construction, label propagation. Goal: find the hyper-parameter configuration with the highest accuracy ⇒ as input for the qualitative evaluation. Challenge: lots of possible configurations (> 1,000 runs)

  12. Quantitative Evaluation: Lessons Learned & Final Result. Greatest performance impact: word embeddings choice. High performance through hyper-parameter optimization. But: also good suggestions outside of the existing thesaurus? Optimized configuration: • Pre-processing: keep letters & hyphens, muß ⇒ muss, single line saving • Embedding generation: 400 dimensions, 40 iterations • Graph construction: k-nearest neighbors, k=12, weighted undirected edges, no self-references allowed • Label propagation: LabelSpreading, 𝛼=0.2, 15 iterations
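
To make the roles of 𝛼 and the iteration count concrete, here is a minimal sketch of the standard label spreading update F ← 𝛼·S·F + (1 − 𝛼)·Y, where S is the symmetrically normalized weight matrix and Y holds the seed labels. It is a plain-Python illustration, not the thesis tool or the scikit-learn implementation; the graph, the labels, and the normalized-score "confidence" are assumptions made for the example.

```python
import math


def build_graph(edges):
    """Turn (word, word, weight) triples into a symmetric adjacency dict."""
    graph = {}
    for a, b, weight in edges:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def label_spreading(graph, seeds, labels, alpha=0.2, iterations=15):
    """Iterate F <- alpha * S * F + (1 - alpha) * Y on the given graph."""
    degree = {n: sum(neighbors.values()) for n, neighbors in graph.items()}
    y = {n: {l: 1.0 if seeds.get(n) == l else 0.0 for l in labels} for n in graph}
    f = {n: dict(y[n]) for n in graph}
    for _ in range(iterations):
        f = {
            n: {
                l: alpha * sum(
                    w / math.sqrt(degree[n] * degree[m]) * f[m][l]
                    for m, w in graph[n].items()
                ) + (1 - alpha) * y[n][l]
                for l in labels
            }
            for n in graph
        }
    return f


def predict(scores):
    """Pick the best label per node, with its share of the total score."""
    result = {}
    for node, by_label in scores.items():
        best = max(by_label, key=by_label.get)
        total = sum(by_label.values())
        result[node] = (best, by_label[best] / total if total else 0.0)
    return result


# Hypothetical similarity graph and seed labels.
graph = build_graph([
    ("king", "monarch", 0.9), ("monarch", "ruler", 0.8),
    ("state", "country", 0.9), ("country", "nation", 0.8),
    ("ruler", "nation", 0.1),
])
predictions = predict(label_spreading(graph, {"king": "A", "state": "B"}, ["A", "B"]))
print(predictions)
```

With 𝛼=0.2 each update keeps 80% of the original seed information and takes 20% from the neighbors, so labels spread conservatively. Unlike hard clamping, a seed word's own scores can also shift slightly.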

  13. Qualitative Evaluation: Set-up. Show synset suggestions to humans & get ratings. Pre-study: identify influence factors for good suggestions. Main study (2x): rate suggestions of best configurations. Scores: 0 = not similar to predicted synset, 1 = same semantic area, 2 = should be added to synset. • Rated 54 synsets per study, 10 suggestions per synset ⇒ 540 ratings/study • Originally planned with legal experts • In the end, conducted by Jörg Landthaler & Markus Müller, supported by the text corpus via an ElasticSearch instance

  14. Qualitative Evaluation: Pre-Study Lessons Learned. High confidence, a high synset training number, and a low synset prediction number lead to better ratings. E.g. correlation between prediction confidence and score. [Figure: human score (0 = not similar to predicted synset, 1 = same semantic area, 2 = should be added to synset) against prediction confidence of the algorithm]
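
A correlation of this kind can be checked with a Pearson coefficient. The sketch below uses invented confidence values and ratings purely for illustration; they are not the study's data, and the slides do not say which correlation measure was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: algorithm confidence and human score (0, 1, or 2).
confidences = [0.2, 0.4, 0.5, 0.7, 0.9]
human_scores = [0, 0, 1, 2, 2]

r = pearson(confidences, human_scores)
print(round(r, 3))
```

A value close to 1 would mean that higher confidence goes with higher ratings. The same function could be applied to the other two factors named on the slide.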

  15. Qualitative Evaluation: Main Study Lessons Learned. fastText again considerably better than word2vec. But: why does fastText perform better? [Figure: ratings by score, 0 to 2]

  16. Qualitative Evaluation: Interpretation. fastText predominantly suggests syntactically similar words, word2vec suggests really different words (⇒ more interesting). Our evaluations favored syntactically similar words. We compiled a list of common challenges around thesaurus extension

  17. "Synset Vector" Baseline: Approach. • Nearest neighbors approach, operates directly on word embeddings • Self-designed, inspired by Rothe and Schütze (2015) [4]. [Figure: intuition with k=2]
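
The slide does not spell out how the baseline works, so the following is only one plausible reading, stated as an assumption: each synset is represented by the mean of its training word vectors, and the k unlabeled words nearest to that synset vector become suggestions. The vectors, the synsets, and the meaning of k are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def synset_vector_suggestions(embeddings, synsets, k):
    """Assumed baseline: suggest the k unlabeled words nearest to each synset mean."""
    labeled = {w for words in synsets.values() for w in words}
    unlabeled = [w for w in embeddings if w not in labeled]
    suggestions = {}
    for name, words in synsets.items():
        centroid = mean_vector([embeddings[w] for w in words])
        ranked = sorted(unlabeled,
                        key=lambda w: cosine(centroid, embeddings[w]),
                        reverse=True)
        suggestions[name] = ranked[:k]
    return suggestions


# Hypothetical embeddings and training synsets.
embeddings = {
    "king": (0.90, 0.10), "ruler": (0.80, 0.30), "monarch": (0.85, 0.20),
    "state": (0.10, 0.90), "nation": (0.30, 0.80), "country": (0.20, 0.85),
}
synsets = {"A": ["king", "ruler"], "B": ["state", "nation"]}

suggestions = synset_vector_suggestions(embeddings, synsets, k=1)
print(suggestions)
```

Under this reading no graph is built and no iteration is needed, which fits the later remark that the baseline is less complex than label propagation.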

  18. "Synset Vector" Baseline: Lessons Learned. Baseline performs equal or better than the label propagation approach, while being less complex. [Figure panels: Quantitative Results, Qualitative Results; panel notes: with baseline k=200, with baseline k=30; scores 0 to 2]

  19. Conclusion. Label propagation approach was not better than the baseline, but overall results were promising. fastText and word2vec predictions could be used in a semi-automated way for thesaurus extension. And: we contributed to the problem area

  20. Conclusion: Contributions & Future Work. Contributions: • Created open source "ThesaurusLabelPropagation" tool • Found implementation issues around label propagation in "scikit-learn" (32,000 stars) • Significantly optimized performance for graph construction on word embeddings • Conducted multiple hyper-parameter studies (> 1,000 individual runs) & optimized configurations • Rated configurations within 5 qualitative evaluations (overall 2,500 suggestions manually rated) • Identification of influence factors for quality of suggestion results • Classification of typical thesaurus challenges • Introduced & evaluated new baseline approach. Future work with regards to label propagation: • Evaluation with a corpus in a different language and/or more training data? • Evaluation within a different application area besides tax law? • Augment word embeddings with other semantic knowledge, e.g. Wikidata, Wikipedia, Freebase

  21. References • Buschmann, Frank, Regine Meunier, Hans Rohnert, Peter Sommerlad, and Michael Stal. 1996. “A System of Patterns: Pattern-Oriented Software Architecture.” • Dirschl, Christian. 2016. “Thesaurus Generation and Usage at Wolters Kluwer Deutschland GmbH.” Jusletter IT, 25 February 2016. • Landthaler, Jörg, Bernhard Waltl, Dominik Huth, Daniel Braun, Christoph Stocker, Thomas Geiger, and Florian Matthes. 2017. “Extending Thesauri Using Word Embeddings and the Intersection Method.” In Proceedings of 2nd Workshop on Automated Semantic Analysis of Information in Legal Texts. London, UK. • Ravi, Sujith, and Qiming Diao. 2015. “Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation.” ArXiv:1512.01752 [Cs], December. http://arxiv.org/abs/1512.01752. • Rothe, Sascha, and Hinrich Schütze. 2015. “AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes.” In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1:1793–1803.

  22. www.muellermarkus.com Markus Müller Master’s Student Informatics 17132 mail@muellermarkus.com

  23. Backup: Hyper-Parameter Study on Word Embeddings

  24. Backup: Hyper-Parameter Study on Graph Construction

  25. Backup: Qualitative Evaluation: Correlations. Prediction confidence; number of training words in synset; number of predictions for synset

  26. Backup: Challenges around Thesaurus Extension

  27. Backup: Possible Reasons and Future Work. Language & training data: evaluation with a corpus in a different language and/or more training data? Context of tax law: evaluation within a different application area? Graph type: augment word embeddings with other semantic knowledge, e.g. Wikidata, Wikipedia, Freebase [3]

  28. Backup: Supervised, Semi-Supervised, Transductive. • Supervised learning: learn on labeled training instances, perform prediction on unknown test data. • Inductive semi-supervised learning: learn on labeled training instances and unlabeled training instances, perform prediction on unknown test data. • Transductive semi-supervised learning: learn on labeled training instances and unlabeled training instances, perform prediction on known test [=training] data.
