
Language technologies in processing (social) media texts: Project XLike

Language technologies in processing (social) media texts: Project XLike. Marko Tadić, University of Zagreb, Faculty of Humanities and Social Sciences, Department of Linguistics. InFuture2013, Zagreb, 2013-11-06. Outline: XLike project presentation; rationale, scope and objectives


Presentation Transcript


  1. Language technologies in processing (social) media texts: Project XLike • Marko Tadić • University of Zagreb, Faculty of Humanities and Social Sciences, Department of Linguistics • InFuture2013, Zagreb, 2013-11-06

  2. Outline • XLike project presentation • Rationale, scope and objectives • Expected results and their use • Language Technologies • Multilingual pipelines • Semantic-based cross-lingual annotation • Informal language processing • Conclusions

  3. XLike project

  4. XLike: facts • XLike • Cross-lingual Knowledge Extraction • FP7-ICT-2011-7 • Objective ICT-2011.4.2 – Language Technologies • Information access and mining • cross-lingual information search and retrieval • text-mining and information extraction for multilingual collections • Small or medium scale focused research project (STREP) • Web site • http://xlike.org • Videos • http://videolectures.net/xlike/

  5. XLike: project partners • Jožef Stefan Institute, Ljubljana, Slovenia (coordinator) • Karlsruhe Institute of Technology, Karlsruhe, Germany • Universitat Politècnica de Catalunya, Barcelona, Spain • University of Zagreb, Zagreb, Croatia • Tsinghua University, Beijing, China • iSOCO, Madrid, Spain • Bloomberg, New York, USA • Slovenian Press Agency, Ljubljana, Slovenia

  6. XLike: associated partners • Indian Institute of Technology, Mumbai, India • to cover Hindi language • group around Prof. Pushpak Bhattacharyya • New York Times, New York, USA • R&D department • active participation at project meetings • Xinhua News, Beijing, China • AFP, Paris, France • as two additional case studies • adding French processing

  7. XLike: Rationale, scope and objectives • Two key research objectives • To extract formal knowledge from multilingual texts into a cross-lingual interlingua, and • To adapt linguistic techniques to deal with informal language from social media

  8. The goal of the project: logic-based and statistics-based interlingua • Logic / Statistics

  9. The need for cross-linguality: Increasing proportion of non-English articles in Wikipedia

  10. The need for an interdisciplinary approach: Research areas dealing with text

  11. Project architecture: processing pipeline

  12. Expected results and their use • In terms of research: • The key result is a semantic and statistical interlingua for linking information across languages • In terms of technology: • The key result is the XLike Toolkit, an open-source environment for cross-lingual language processing • Easy reusability and potential further commercial exploitation

  13. Results so far: data acquisition • NewsFeed service (http://newsfeed.ijs.si/) • continuous, real-time aggregated stream of semantically enriched mainstream global news articles • A large comparable cross-lingual corpus • covering up to 50 major languages, based on Wikipedia

  14. Statistics-based cross-lingual linking • Scalable approach for statistics-based cross-lingual document linking • Based on Canonical Correlation Analysis (CCA) • Allows training language-pair similarity models for millions of documents from Wikipedia used as a comparable corpus • The project developed novel techniques based on “hub languages” for dealing with language pairs with small overlap in comparable corpora • Allows substantial improvement of state-of-the-art performance for under-resourced languages • Target coverage: 100 languages
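
The “hub language” idea on slide 14 can be illustrated with a minimal Python sketch. It rests on an assumption the slides do not spell out: that aligned document pairs for a low-overlap language pair are obtained by composing each language's alignment with a hub language. The CCA training step itself is omitted, and all document identifiers are hypothetical.

```python
def compose_via_hub(src_to_hub, tgt_to_hub):
    """Pair source and target documents that align to the same hub document."""
    # Invert the target->hub alignment so hub documents can be looked up directly.
    hub_to_tgt = {}
    for tgt_doc, hub_doc in tgt_to_hub.items():
        hub_to_tgt.setdefault(hub_doc, []).append(tgt_doc)
    pairs = []
    for src_doc, hub_doc in sorted(src_to_hub.items()):
        for tgt_doc in sorted(hub_to_tgt.get(hub_doc, [])):
            pairs.append((src_doc, tgt_doc))
    return pairs


# Hypothetical Wikipedia-style alignments; English acts as the hub (an assumption).
sl_to_en = {"sl:Pariz": "en:Paris", "sl:Unesco": "en:UNESCO", "sl:Bled": "en:Bled"}
ca_to_en = {"ca:París": "en:Paris", "ca:Unesco": "en:UNESCO", "ca:Girona": "en:Girona"}

training_pairs = compose_via_hub(sl_to_en, ca_to_en)
print(training_pairs)
```

Pairs recovered this way could serve as additional training data for a Slovene-Catalan similarity model even when direct Slovene-Catalan links are scarce; documents without a counterpart in the hub are simply skipped.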

  15. Language Technologies

  16. Language Technologies: processing pipeline

  17. WP2: Linguistic processing • The XLike project combines scientific capabilities and insights from several areas of science • computational linguistics • machine learning • text mining • semantic technologies • in order to enable cross-lingual text “understanding” by machines • WP2 goals • develop tools to extract entities and relations found in documents • for multiple languages, domains, language registers (standard vs. non-standard) • a solid LT foundation used throughout the project

  18. WP2: XLike multilingual pipelines • entry module: automatic language identification • 6 pipelines, one for each of the XLike languages • sentence splitting • tokenization • lemmatization • POS/MSD-tagging • NERC • dependency parsing • semantic role labelling • pipelines function as web services • currently • English, Spanish, German, Chinese, Catalan, Slovene • in preparation • Croatian, French, Hindi
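
The structure on slide 18 (a language-identification entry module that routes text to a per-language chain of stages) can be sketched roughly as follows. This is an illustrative toy, not the XLike implementation: the stopword lists, function names and regular expressions are all assumptions, only the first two stages are implemented, and the real pipelines run as web services.

```python
import re

# Tiny hypothetical stopword lists; a real identifier would use a trained model.
STOPWORDS = {
    "en": {"the", "is", "its", "in", "to", "now"},
    "es": {"el", "la", "es", "su", "en", "para"},
    "de": {"der", "die", "ist", "sein", "in", "um"},
}


def identify_language(text):
    """Entry module: pick the language whose stopwords overlap most with the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return max(sorted(STOPWORDS), key=lambda lang: len(words & STOPWORDS[lang]))


def split_sentences(doc):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", doc["text"].strip())
    doc["sentences"] = [s for s in parts if s]
    return doc


def tokenize(doc):
    # Words and individual punctuation marks become separate tokens.
    doc["tokens"] = [re.findall(r"\w+|[^\w\s]", s) for s in doc["sentences"]]
    return doc


# Later stages (lemmatization, tagging, NERC, parsing, SRL) would be appended here.
PIPELINES = {lang: [split_sentences, tokenize] for lang in STOPWORDS}


def process(text):
    doc = {"text": text, "lang": identify_language(text)}
    for stage in PIPELINES[doc["lang"]]:
        doc = stage(doc)
    return doc


doc = process("Unesco is now holding its biennial meeting in Paris. "
              "It will devise its next projects.")
print(doc["lang"], doc["tokens"][0])
```

Keeping each stage as a function over a shared document record makes it straightforward to give every language its own list of stages while reusing the same driver loop.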

  19. WP2: XLike multilingual pipelines

  20. WP2: XLike multilingual pipelines

  21. WP2: XLike multilingual pipelines • starting from • “Unesco is now holding its biennial meeting in Paris to devise its next projects.” • at the end we want to come to

  22. WP2: XLike multilingual pipelines • Five types of target extraction elements: • Tokens (requires tokenization) • Lemmas (requires lemmatization and PoS tagging) • Syntactic Triples (requires syntactic parsing) • Semantic Triples (requires semantic role labeling) • Entity Relations • Dependency-based Extraction • Syntactic: subject-verb-object • Semantic: agent-predicate-theme
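
Dependency-based extraction of syntactic (subject-verb-object) triples, as listed on slide 22, might look like the sketch below. The parse is hand-annotated for part of the slide-21 sentence and is hypothetical: the label names (`SBJ`, `OBJ`, and so on) and the `Token` record are assumptions, not the tagset or data format used by the XLike parsers.

```python
from collections import namedtuple

Token = namedtuple("Token", "id form lemma head deprel")

# Hand-annotated, hypothetical parse; head = id of the governing token, 0 = root.
SENTENCE = [
    Token(1, "Unesco", "Unesco", 4, "SBJ"),
    Token(2, "is", "be", 4, "AUX"),
    Token(3, "now", "now", 4, "TMP"),
    Token(4, "holding", "hold", 0, "ROOT"),
    Token(5, "its", "its", 7, "NMOD"),
    Token(6, "biennial", "biennial", 7, "NMOD"),
    Token(7, "meeting", "meeting", 4, "OBJ"),
    Token(8, "in", "in", 4, "LOC"),
    Token(9, "Paris", "Paris", 8, "PMOD"),
]


def syntactic_triples(tokens):
    """Return (subject, verb, object) lemma triples for heads that have both roles."""
    subjects, objects = {}, {}
    for tok in tokens:
        if tok.deprel == "SBJ":
            subjects[tok.head] = tok.lemma
        elif tok.deprel == "OBJ":
            objects[tok.head] = tok.lemma
    by_id = {tok.id: tok for tok in tokens}
    return [
        (subjects[head], by_id[head].lemma, objects[head])
        for head in sorted(subjects)
        if head in objects
    ]


triples = syntactic_triples(SENTENCE)
print(triples)
```

A semantic (agent-predicate-theme) triple could be read off in the same way from semantic role labels instead of dependency labels.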

  23. WP2: XLike multilingual pipelines • We may be interested in detecting relations between entities (beyond subject-object) • Syntactic/Semantic paths provide a rich, meaningful source of features to characterize such relations • Statistical classification methods can be used to label such relations.
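
To make the notion of a syntactic path between two entities concrete, the sketch below turns the tree path between two tokens into a feature string that a statistical classifier could consume. The parse, the label names and the `^`/`v` notation for upward and downward arcs are assumptions made for illustration; the slides do not specify a feature format.

```python
# Hypothetical parse: token id -> (lemma, head id, dependency label); 0 is the root.
PARSE = {
    1: ("Unesco", 4, "SBJ"),
    4: ("hold", 0, "ROOT"),
    7: ("meeting", 4, "OBJ"),
    8: ("in", 4, "LOC"),
    9: ("Paris", 8, "PMOD"),
}


def ancestors(parse, tok_id):
    """Token ids from tok_id up to the top word of the tree, inclusive."""
    chain = []
    while tok_id != 0:
        chain.append(tok_id)
        tok_id = parse[tok_id][1]
    return chain


def dependency_path(parse, source, target):
    """Describe the tree path between two tokens as a feature string."""
    up = ancestors(parse, source)
    down = ancestors(parse, target)
    # Lowest common ancestor: first node above the source that also governs the target.
    common = next(node for node in up if node in down)
    steps = [parse[source][0]]
    for node in up[:up.index(common)]:
        # Climbing: label of the arc leaving `node`, then the lemma of its head.
        steps.append("^" + parse[node][2])
        steps.append(parse[parse[node][1]][0])
    for node in reversed(down[:down.index(common)]):
        # Descending: label of the arc entering `node`, then its lemma.
        steps.append("v" + parse[node][2])
        steps.append(parse[node][0])
    return " ".join(steps)


path = dependency_path(PARSE, 1, 9)
print(path)
```

For the entity pair Unesco and Paris, the path runs through the verb and the preposition, which is the kind of evidence a classifier could use to label a relation such as "organization holds event in location".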

  24. WP2: XLike multilingual pipelines

  25. WP2: XLike multilingual pipelines

  26. Analysis of non-standard language • non-standard language (as used in social media) • has particular features that distinguish it from the standard variety of the language • orthography, lexicon, entities, syntax, dialog form, etc. • applying standard language models to non-standard language • major source of errors: words unknown to the statistical linguistic models • additional analysis of unknown words was needed for different collections of informal language in English, Spanish and Catalan • non-standard language prototype developed • new tools: shallow pipeline for English Tweets • adaptation of existing tools: Tweet normalization for Spanish
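
The unknown-word problem on slide 26 can be made tangible with a small sketch that measures the out-of-vocabulary (OOV) rate of a tweet before and after a simple normalization step. Everything here is an assumption for illustration: the vocabulary, the informal-spelling lookup and the rules are invented, and English is used for readability although the slide mentions normalization for Spanish.

```python
import re

# Hypothetical vocabulary standing in for the words a standard-language model knows.
VOCABULARY = {"so", "good", "see", "you", "tomorrow", "at", "the", "meeting"}
# Hypothetical lookup of common informal spellings.
INFORMAL = {"u": "you", "2moro": "tomorrow", "c": "see"}


def normalize(token):
    """Map an informal token to a standard form, or None to drop it."""
    token = token.lower()
    if token.startswith(("@", "#", "http")):
        return None  # drop mentions, hashtags and links in this sketch
    token = re.sub(r"(.)\1{2,}", r"\1\1", token)  # "soooo" -> "soo"
    if token not in VOCABULARY:
        shorter = re.sub(r"(.)\1", r"\1", token)  # "soo" -> "so"
        if shorter in VOCABULARY:
            token = shorter
    return INFORMAL.get(token, token)


def oov_rate(tokens):
    """Share of tokens that the vocabulary does not cover."""
    return sum(t not in VOCABULARY for t in tokens) / len(tokens)


tweet = "soooo goooood c u 2moro at the meeting @unesco".split()
raw = [t.lower() for t in tweet]
normalized = [t for t in map(normalize, tweet) if t is not None]
print(oov_rate(raw), oov_rate(normalized))
```

Lowering the OOV rate before tagging or parsing is one plausible way to reduce the errors that standard-language models make on social media text.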

  27. Analysis of non-standard language • shallow pipeline for English Tweets • simplified set of POS-tags • NE detection trained on NE Tweet Dataset • collected by University of Washington • 10 NE types: geo-location, person, company, product, facility, tvshow, movie, musicartist, sportsteam, other • NE performance • ca. 60% precision • ca. 40% recall
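
The slide reports only precision and recall for named entity detection. As a worked example, the corresponding F1 score (the harmonic mean of the two) can be derived as below; note that this F1 value is computed here from the approximate figures on the slide and is not reported in the presentation itself.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Approximate figures from the slide: ca. 60% precision, ca. 40% recall.
f1 = f1_score(0.60, 0.40)
print(round(f1, 2))
```

With these inputs the score comes out at roughly 0.48, which shows how the lower recall pulls the combined measure below the average of the two values.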

  28. Conclusions

  29. Conclusions and future directions • XLike project overview • Language Technologies • used as an integral (preprocessing) part of Knowledge Processing • XLike multilingual pipelines • covering 6 languages (en, de, es, zh, ca, sl) • more languages to come (hr, fr, hi...) • will be accessible through the META-SHARE platform for language resources, tools and services • http://www.meta-share.eu

  30. Thank you for your attention!
