Language technologies in processing (social) media texts: Project Xlike Marko Tadić University of Zagreb Faculty of Humanities and Social Sciences Department of Linguistics InFuture2013, Zagreb 2013-11-06
Outline • XLike project presentation • Rationale, scope and objectives • Expected results and their use • Language Technologies • Multilingual pipelines • Semantic-based cross-lingual annotation • Informal language processing • Conclusions
XLike: facts • XLike • Cross-lingual Knowledge Extraction • FP7-ICT-2011-7 • Objective ICT-2011.4.2 – Language Technologies • Information access and mining • cross-lingual information search and retrieval • text-mining and information extraction for multilingual collections • Small or medium scale focused research project (STREP) • Web site • http://xlike.org • Videos • http://videolectures.net/xlike/
XLike: project partners • Jožef Stefan Institute, Ljubljana, Slovenia (coordinator) • Karlsruhe Institute of Technology, Karlsruhe, Germany • Universitat Politècnica de Catalunya, Barcelona, Spain • University of Zagreb, Zagreb, Croatia • Tsinghua University, Beijing, China • iSOCO, Madrid, Spain • Bloomberg, New York, USA • Slovenian Press Agency, Ljubljana, Slovenia
XLike: associated partners • Indian Institute of Technology, Mumbai, India • to cover the Hindi language • group around Prof. Pushpak Bhattacharyya • New York Times, New York, USA • R&D department • active participation at project meetings • Xinhua News, Beijing, China • AFP, Paris, France • as two additional case studies • adding French processing
XLike: Rationale, scope and objectives Two key research objectives • To extract formal knowledge from multilingual texts into a cross-lingual interlingua, and • To adapt linguistic techniques to deal with informal language from social media
The goal of the project: logic-based and statistics-based interlingua [Figure labels: Logic / Statistics]
The need for cross-linguality: Increasing proportion of non-English articles in Wikipedia
The need for interdisciplinary approach: Research areas dealing with text
Project architecture: processing pipeline
Expected results and their use • In terms of research: • The key result is a semantic and statistical interlingua for linking information across languages • In terms of technology: • The key result is the XLike Toolkit, an open-source environment for cross-lingual language processing • Easy reusability and potential further commercial exploitation
Results so far: data acquisition • NewsFeed service (http://newsfeed.ijs.si/) • continuous, real-time aggregated stream of semantically enriched mainstream global news articles • A large comparable cross-lingual corpus • covering up to 50 major languages, based on Wikipedia
Statistics-based cross-lingual linking • Scalable approach for statistics-based cross-lingual document linking • Based on Canonical Correlation Analysis (CCA) • Allows training language-pair similarity models for millions of documents from Wikipedia used as a comparable corpus • The project developed novel techniques based on “hub languages” for dealing with language pairs with small overlap in comparable corpora • Allows substantial improvement over state-of-the-art performance for under-resourced languages • Target coverage: 100 languages
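To make the CCA step above concrete, here is a minimal sketch of textbook regularized CCA on tiny dense arrays. It is not the project's scalable implementation, which the slides do not show. The function names (`inv_sqrt`, `fit_cca`), the regularization value and the synthetic "documents" are all assumptions made for illustration; real input would be feature vectors of aligned Wikipedia articles in two languages.

```python
import numpy as np

def inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T

def fit_cca(X, Y, k, reg=1e-6):
    """Fit CCA on row-aligned views X and Y; return projections and correlations."""
    mean_x, mean_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mean_x, Y - mean_y
    n = X.shape[0]
    # Regularized within-view covariances and the cross-view covariance.
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]
    B = Wy @ Vt.T[:, :k]
    return A, B, s[:k], mean_x, mean_y

# Synthetic stand-in for an aligned comparable corpus (assumption):
# 200 document pairs sharing 2 latent "topics", seen through two feature spaces.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))
Y = Z @ rng.normal(size=(2, 4)) + 0.01 * rng.normal(size=(200, 4))

A, B, corrs, mean_x, mean_y = fit_cca(X, Y, k=2)
proj_x = (X - mean_x) @ A  # language-1 documents in the shared space
proj_y = (Y - mean_y) @ B  # language-2 documents in the shared space
print("canonical correlations:", corrs)
```

Once both views are projected into the shared space, documents from different languages can be compared directly, for instance with cosine similarity.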
Language Technologies: processing pipeline
WP2: Linguistic processing • XLike project combines scientific capabilities and insights from several areas of science • computational linguistics • machine learning • text mining • semantic technologies in order to enable cross-lingual text “understanding” by machines • WP2 goals • develop tools to extract entities and relations found in documents • for multiple languages, domains and language registers (standard vs. non-standard) • a solid LT foundation used throughout the project
WP2: XLike multilingual pipelines • entry module: automatic language identification • 6 pipelines, one for each of the XLike languages • sentence splitting • tokenization • lemmatization • POS/MSD-tagging • NERC • dependency parsing • semantic role labelling • pipelines function as web services • currently • English, Spanish, German, Chinese, Catalan, Slovene • in preparation • Croatian, French, Hindi
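The pipeline layout above (a language-identification entry module that dispatches to a per-language chain of stages) can be sketched as follows. This is only an illustration of the architecture: the stopword-based language guesser, the regex-based splitter and tokenizer, and all names (`identify_language`, `PIPELINES`, `process`) are assumptions, and the later stages (lemmatization, tagging, NERC, parsing, SRL) are left out because they need trained models.

```python
import re
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]
Stage = Callable[[Doc], Doc]

# Toy stopword lists standing in for a real language identifier (assumption).
STOPWORDS = {
    "en": {"the", "is", "and", "of", "to", "in"},
    "es": {"el", "la", "es", "y", "de", "en"},
}

def identify_language(text: str) -> str:
    """Entry module: pick the language whose stopwords occur most often."""
    words = re.findall(r"\w+", text.lower())
    return max(STOPWORDS, key=lambda lang: sum(w in STOPWORDS[lang] for w in words))

def split_sentences(doc: Doc) -> Doc:
    """Split on whitespace that follows sentence-final punctuation."""
    doc["sentences"] = [s for s in re.split(r"(?<=[.!?])\s+", doc["text"].strip()) if s]
    return doc

def tokenize(doc: Doc) -> Doc:
    """Separate words from punctuation marks, sentence by sentence."""
    doc["tokens"] = [re.findall(r"\w+|[^\w\s]", s) for s in doc["sentences"]]
    return doc

# One ordered list of stages per language; both share the toy stages here.
PIPELINES: Dict[str, List[Stage]] = {
    "en": [split_sentences, tokenize],
    "es": [split_sentences, tokenize],
}

def process(text: str) -> Doc:
    """Identify the language, then run that language's stages in order."""
    doc: Doc = {"text": text, "language": identify_language(text)}
    for stage in PIPELINES[doc["language"]]:
        doc = stage(doc)
    return doc

doc = process("Unesco is now holding its biennial meeting in Paris. "
              "It will devise its next projects.")
print(doc["language"], doc["tokens"])
```

Keeping each stage as a function from document to document means a new language only needs a new entry in the dispatch table, which fits the slide's note that further languages are in preparation.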
WP2: XLike multilingual pipelines • starting from • “Unesco is now holding its biennial meeting in Paris to devise its next projects.” • at the end we want to arrive at [figure]
WP2: XLike multilingual pipelines • Five types of target extraction elements: • Tokens (requires tokenization) • Lemmas (requires lemmatization and PoS tagging) • Syntactic Triples (requires syntactic parsing) • Semantic Triples (requires semantic role labeling) • Entity Relations • Dependency-based Extraction • Syntactic: subject-verb-object • Semantic: agent-predicate-theme
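As an illustration of dependency-based extraction of syntactic triples, the sketch below pulls subject-verb-object triples out of a dependency parse of the Unesco example sentence. The parse is hand-annotated for this example, and the relation labels (`nsubj`, `dobj` and so on) are an assumption; the label set used by the XLike parsers is not given in the slides.

```python
from collections import namedtuple

Token = namedtuple("Token", "idx form lemma head deprel")

# Hand-annotated parse (assumption); head 0 marks the root.
PARSE = [
    Token(1, "Unesco", "Unesco", 4, "nsubj"),
    Token(2, "is", "be", 4, "aux"),
    Token(3, "now", "now", 4, "advmod"),
    Token(4, "holding", "hold", 0, "root"),
    Token(5, "its", "its", 7, "poss"),
    Token(6, "biennial", "biennial", 7, "amod"),
    Token(7, "meeting", "meeting", 4, "dobj"),
    Token(8, "in", "in", 4, "prep"),
    Token(9, "Paris", "Paris", 8, "pobj"),
    Token(10, "to", "to", 11, "aux"),
    Token(11, "devise", "devise", 4, "xcomp"),
    Token(12, "its", "its", 14, "poss"),
    Token(13, "next", "next", 14, "amod"),
    Token(14, "projects", "project", 11, "dobj"),
    Token(15, ".", ".", 4, "punct"),
]

def extract_svo(tokens):
    """Return (subject, verb, object) lemma triples for heads that have both."""
    triples = []
    for head in tokens:
        children = [t for t in tokens if t.head == head.idx]
        subjects = [t for t in children if t.deprel == "nsubj"]
        objects = [t for t in children if t.deprel == "dobj"]
        for subj in subjects:
            for obj in objects:
                triples.append((subj.lemma, head.lemma, obj.lemma))
    return triples

triples = extract_svo(PARSE)
print(triples)
```

Note that "devise" has an object but no overt subject in this parse, so no triple is produced for it. Recovering "Unesco" as its agent is the kind of information that semantic triples, built on semantic role labeling, are meant to add.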
WP2: XLike multilingual pipelines • We may be interested in detecting relations between entities (beyond subject-object) • Syntactic/Semantic paths provide a rich, meaningful source of features to characterize such relations • Statistical classification methods can be used to label such relations.
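A syntactic path between two entities can be read off the dependency tree by walking from each entity up to their lowest common ancestor. The sketch below computes such a path between "Unesco" and "Paris"; the partial parse, the relation labels and the `up:`/`down:` feature encoding are assumptions made for illustration, not the project's feature format.

```python
from collections import namedtuple

Token = namedtuple("Token", "idx lemma head deprel")

# Hand-annotated partial parse of the Unesco sentence (assumption); head 0 is the root.
PARSE = [
    Token(1, "Unesco", 4, "nsubj"),
    Token(4, "hold", 0, "root"),
    Token(7, "meeting", 4, "dobj"),
    Token(8, "in", 4, "prep"),
    Token(9, "Paris", 8, "pobj"),
]

def ancestors(heads, idx):
    """Chain of token indices from idx up to the root, inclusive."""
    chain = [idx]
    while heads[chain[-1]] != 0:
        chain.append(heads[chain[-1]])
    return chain

def dependency_path(tokens, start, end):
    """Path features from start to end through their lowest common ancestor."""
    heads = {t.idx: t.head for t in tokens}
    rels = {t.idx: t.deprel for t in tokens}
    lemmas = {t.idx: t.lemma for t in tokens}
    up, down = ancestors(heads, start), ancestors(heads, end)
    common = next(i for i in up if i in down)
    up_part = [f"up:{rels[i]}" for i in up[:up.index(common)]]
    down_part = [f"down:{rels[i]}" for i in reversed(down[:down.index(common)])]
    return up_part + [lemmas[common]] + down_part

path = dependency_path(PARSE, start=1, end=9)
print(path)
```

A list like this can be joined into a single string feature, or split into n-grams, before being passed to a statistical classifier that labels the relation.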
Analysis of non-standard language • non-standard language (as used in social media) • has particular features that distinguish it from the standard variety of the language • orthography, lexicon, entities, syntax, dialog form, etc. • applying standard language models to non-standard language • a major source of errors is words unknown to the statistical linguistic models • additional analysis of unknown words was needed for different collections of informal language in English, Spanish and Catalan • non-standard language prototype developed • new tools: shallow pipeline for English Tweets • adaptation of existing tools: Tweet normalization for Spanish
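Tweet normalization aims to map unknown, non-standard words onto forms the standard models already know. The toy sketch below shows two common ingredients, collapsing letter repetitions and expanding abbreviations from a lexicon. It is not the XLike normalizer: the three-entry lexicon, the repetition rule and the function name `normalize_tweet` are assumptions chosen for illustration.

```python
import re

# Tiny illustrative lexicon of Spanish chat abbreviations (assumption).
ABBREVIATIONS = {"xq": "porque", "q": "que", "tb": "también"}

def normalize_tweet(text, lexicon=ABBREVIATIONS):
    """Collapse runs of 3+ identical characters, then expand known abbreviations."""
    normalized = []
    for token in text.split():
        # "holaaaa" -> "hola"; legitimate doubles such as "ll" are left alone.
        token = re.sub(r"(.)\1{2,}", r"\1", token)
        normalized.append(lexicon.get(token.lower(), token))
    return " ".join(normalized)

result = normalize_tweet("holaaaa xq no vienes tb")
print(result)
```

After a step like this, the standard tokenizer, tagger and parser see far fewer unknown words, which is the error source identified above.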
Analysis of non-standard language • shallow pipeline for English Tweets • simplified set of POS-tags • NE detection trained on NE Tweet Dataset • collected by University of Washington • 10 NE types: geo-location, person, company, product, facility, tvshow, movie, musicartist, sportsteam, other • NE performance • ca. 60% precision • ca. 40% recall
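For readers who want to relate the two figures above, the sketch below shows how precision, recall and their harmonic mean (F1) are computed for entity detection. The slide does not report F1; plugging the approximate values 0.6 and 0.4 into the standard formula gives about 0.48. The toy gold and predicted entity sets are invented for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def evaluate(predicted, gold):
    """Precision, recall and F1 over sets of (entity, type) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)

# Invented toy annotations (assumption): 5 gold entities, 3 predictions, 2 correct.
gold = {("Paris", "geo-location"), ("Unesco", "company"), ("Bloomberg", "company"),
        ("Zagreb", "geo-location"), ("Marko", "person")}
predicted = {("Paris", "geo-location"), ("Unesco", "company"), ("Zagreb", "person")}

precision, recall, f1 = evaluate(predicted, gold)
print(precision, recall, f1)
print(f1_score(0.6, 0.4))  # the slide's approximate figures
```

A prediction counts as correct only when both the entity and its type match, which is why the mistyped "Zagreb" lowers precision and recall at the same time.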
Conclusions and future directions • XLike project overview • Language Technologies • used as an integral (preprocessing) part of • Knowledge Processing • XLike multilingual pipelines • covering 6 languages (en, de, es, zh, ca, sl) • more languages to come (hr, fr, hi...) • will be accessible through META-SHARE language resources, tools and services platform • http://www.meta-share.eu
Thank you for your attention!