Data Centres, Social Networks and other stories
Stelios Piperidis, spip@ilsp.gr
ILSP-ATHENA Research Centre, ELRA
In the good old days...
• Linguistic knowledge and language data, in different incarnations, have been at the heart of HLT development for the last 60 years
• In the good old AI days, when knowledge-based, deductive techniques were the prevailing scientific method(s) at large, formalised lexical databases, grammars and world knowledge bases constituted the knowledge/data backbone
• Take the example of MT, of good old MT
New Horizons for LRs in a Global Context, Barcelona, July 2009
MT history (1)
• Phase 1 (’50 – ’80): rule-based approaches (direct translation, transfer-based, interlingua)
• the necessary data infrastructure consisted mainly of bilingual lexical databases, analysis/synthesis and transfer grammars, and the codification of semantic & pragmatic knowledge in some formal language
• This was to be expected, given the general thinking of the historical period
Analytic philosophy of language up to the mid 20th century
• The general trend converged on the analysis of complex concepts into simple, and often the simplest, concept(s).
• In this framework:
  • Frege’s (1848-1925) Begriffsschrift, logicism, maths reducible to logic, foundation of formal logic, logic-based ideal language, sense & reference distinction
  • Russell’s (1872-1970) logical atomism and definite descriptions through logical analysis, construction of a logically ideal language, correspondence between language and reality
  • Early Wittgenstein’s (1889-1951) Tractatus (influenced by Russell): perception of language as a mirror of the world, with language and reality as two different levels connected through logic
MT’s interlingua(s)
• all these main features of analytic philosophy until the mid 20th century were taken up by early RBMT/KBMT
• the interlingua approach in the knowledge-based version, but also direct and transfer-based MT, presupposes the exclusively representational function of language, relying on logical atomism (facts that cannot be reduced/broken down any further)
• with meaning as image/picture and directly connected to reference
Philosophical context for MT
• The possibility or impossibility of MT, and whatever this entails, has prompted intense philosophical investigation, coupled with controversies around the nature of language and the issue of translatability
  • linguistic and ontological relativity
  • indeterminacy of translation
  • the inscrutability of reference
  • the ideal (universal) language
  • the problem of meaning
  • the representational function of language, etc.
Philosophy against
• Bar-Hillel
  • MT is impossible not only because of linguistic intricacies but because of a total inability to represent the necessary world knowledge in a sound and complete way (the famous “box was in the pen” example)
• Sapir-Whorf
  • Meaning is culturally dependent
  • Language and perception of the world are community-based
  • Linguistic determinism: thought is determined by the language we speak
  • Linguistic relativity: people who speak different languages perceive and think about the world in different ways
Problems unsolved
• Solving the problems that emerged both in the creation of the virtual universal language and in the broader view of the function of language as exclusively representational would, in theory, have made high-level machine translation possible.
• However, these problems were not solved, either at the philosophical or at the computational level, resulting, for some at least, in a …
Meaning as use
• shift towards the perception of meaning as use, pragmatics, behaviourism and the blossoming of indeterminacies from 1950 onwards, with the later Wittgenstein and Quine.
• Wittgenstein: rejects the thesis that language is constituted through logic, that language is a mirror of reality, and that its unique function is to represent/picture the state of affairs
  • perceives language not as a whole, but as a set consisting of related language games,
  • considers meaning as use
• the notion of language acts emerges and the general pragmatic shift in language issues takes place
Quine, indeterminacy, translation manuals
• Quine criticizes the concept of meaning to advance his indeterminacy thesis: there is nothing more to meaning than what emerges from the use of words and the corresponding behaviour of the speakers of a language.
• He does not claim that there is no acceptable translation; he argues that there is not just one acceptable translation, but many (CL/LT/MT took this up ca. 40 years later).
• Even in radical translation, at the practical level, the linguist-translator, using specific rules and methods that are mainly inductive, achieves his objective, i.e. the construction of a good translation manual, on pragmatic criteria.
Optimism?
• In this light, these theses have had a completely different effect on the field of machine translation.
• They no longer serve as arguments for the impossibility of MT. Instead, they can be seen as offering an optimistic outlook, marking the shift from the broader trend of logical analysis, the representational function of language and rule-based techniques to the corpus-based techniques developed from the end of the ’80s onwards, purely statistical or example-based, in which the use of rules and representational schemas alone starts to fade out.
Language games and domain-specific MT
• The restriction of many MT models to specific subject areas and sublanguages is directly related to Wittgenstein’s ideas regarding the variety of language games: language is formed not as a whole, but as a set of myriad language games with strong similarities between them.
MT history (2)
• Phase 1 (’50 – ’80): rule-based approaches (direct translation, transfer-based, interlingua)
• Phase 2 (’84): Example-based machine translation
• Phase 2 (’89 onwards): Corpus-based approaches (SMT, CBMT – parallel and monolingual+lexica based approaches)
• Phase 3 (’00 onwards): More sophisticated SMT (phrase-based, hybrids, linguistic structures in translation models, etc.)
MT today
• Rule-based systems still heavily used (several incarnations of Systran’s engines) for gisting purposes
• SMT systems highly popular with researchers
  • Serious lowering of the entry barrier to MT compared to the past
  • Real hope that we could do without any linguistic processing (language independence in one shot)
  • Hopes diminish: the ceiling is approached fast
• Hybrids (RBMT with LMs, factored models in SMT, syntax-based SMT, RBMT+SMT combinations, …)
Requirements for MT today (1)
• Corpora, lexica and linguistic processors
• Very limited variety of parallel corpora
• Serious problems of access to more interesting domain- and text-type-specific corpora (other than JRC and EUParl)
• All sorts of tricks but also principled approaches to augmentation of translation models
  • Entailment and paraphrasing
  • Syntactic transformations
  • Social computing methods
  • …
Requirements for MT today (2)
• Requirements are expected, anticipated and in line with the indeterminacy hypothesis: data of all kinds (corpora, lexica, etc.)
• But also linguistic structures pertaining to specific Wittgensteinian language games
• The absence of a single gold translation (but also erroneous translation!) drives the “suggest your own translation here” prompt
Requirements for MT today (3)
• Or, paraphrase what you have available, so that we can generate a translation manual
• To help with “Little John’s box”, which “was in the pen”, we resort to occurrences of “pen” in corpora and do some LSA or design context vectors (based on meaning as use), not to mention that we hope to find “box in the pen” and “pen in the box” in the petabytes of available text
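The context-vector idea mentioned for the “pen” example can be illustrated with a minimal sketch. It is not a description of any particular MT system: the two sense labels, the four example sentences, the stopword list and the function names are all hypothetical, chosen only to show how "meaning as use" turns into counting the words a target word occurs with.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical toy "corpus": a few usage examples per sense of "pen".
# A real system would collect such contexts from large corpora.
SENSE_EXAMPLES = {
    "writing_instrument": [
        "she signed the letter with a pen and ink",
        "the pen ran out of ink on the paper",
    ],
    "enclosure": [
        "the child played with his toys in the pen",
        "the toy box was left inside the play pen",
    ],
}

# Function words and the target word itself carry no sense information here.
STOPWORDS = {"a", "an", "and", "his", "in", "of", "on", "out",
             "the", "was", "with", "pen"}


def context_vector(sentences):
    """Count the content words that co-occur with the target word."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def disambiguate(sentence):
    """Return the sense whose context vector is closest to the sentence."""
    query = context_vector([sentence])
    scores = {sense: cosine(query, context_vector(examples))
              for sense, examples in SENSE_EXAMPLES.items()}
    return max(scores, key=scores.get), scores


sense, scores = disambiguate("The box was in the pen")
print(sense)  # enclosure
```

The toy data is constructed so that “box” only co-occurs with the enclosure sense; with real corpora the evidence is far noisier, which is why the slide also hopes to find the literal strings in petabytes of text.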
Current reality of language data (1)
• primary language data in abundance on the web
  • Dec 2008: 487 billion GB of digital content created in 2008, that is ~3.892.179.868.480.350.000.000 bits, ~162T digital images, ~19B Blu-ray discs
  • Digital universe expected to double every 18 months
• however, abundant data not for all languages (e.g. less-widely-used languages, minority languages, dialects, etc.)
• Things change, however:
  • 1996: 66% of online community in the US
  • 2008 (Dec): 83% of online community outside the US (28% EU, 42% Asia)
• different types of web data (proper text, well-formed but also “ill-formed” textual communication in blogs/chatrooms etc., images, videos, etc.)
• some of these data are actually annotations, at varying levels, of other primary data (e.g. summaries, transcripts/subtitles, image captions, or even opinions)
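The size figures above can be checked with a few lines of arithmetic. The sketch below assumes decimal gigabytes (1 GB = 10^9 bytes), which the slide does not state; the function names are illustrative only.

```python
# Assumption: decimal gigabytes, i.e. 1 GB = 10**9 bytes = 8 * 10**9 bits.
BITS_PER_GB = 8 * 10**9


def gb_to_bits(gigabytes):
    """Convert a size in (decimal) gigabytes to bits."""
    return gigabytes * BITS_PER_GB


def projected_size(initial, months, doubling_months=18):
    """Size after `months`, if the total doubles every `doubling_months`."""
    return initial * 2 ** (months / doubling_months)


created_2008_gb = 487 * 10**9           # 487 billion GB
print(gb_to_bits(created_2008_gb))      # 3896000000000000000000 bits
print(projected_size(1.0, 36))          # 4.0, i.e. fourfold after three years
```

Under this assumption 487 billion GB is about 3.896 × 10^21 bits, slightly above the more precise figure quoted on the slide, which corresponds to roughly 486.5 billion GB before rounding. A doubling time of 18 months implies a fourfold increase every three years.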
Current reality of language data (2)
• social networks, web2.0 and the upcoming web3.0 (will) provide further new sources, but also new challenges as well as new applications, e.g. opinion mining, market monitoring, digital diplomacy, etc.
• broadband networks enable the above to be increasingly multimedia in nature
• while new markets and competition reinforce their multilingual aspect
Current trends in language processing
• remarkable improvements in computational methods and techniques (notably in machine learning) in recent years
• no clearly manual or automatic techniques, but mixtures of both
• many “quick and dirty” techniques
• the above render the availability of the appropriate data and tools, i.e. resources, a sine qua non for any technological progress
Language, other media, other sciences
• Language does not appear in isolation from other communicative media
• The Vision and Language interrelation is considered almost established, bringing together computer vision and language processing: symbol grounding approaches (Little John’s problem re-solved in multimedia contexts?)
• Progress in brain activation imaging techniques, neurophysiology, neurolinguistics, and cognitive and neuro-cognitive approaches to language is expected to shed more light on language and semantics problems
• Will fMRI data be all the rage in the next years?
Towards new perspectives for LRs
• Language resources as living entities
• Language resources can no longer be considered monolithic, static objects; they are living entities, constantly evolving, (part of) behavioural data, that can be processed by linguistic and other tools, interrelated with other behavioural and sensorimotor data, with mental or bodily dimensions
• As such, language data, related data in other media (e.g. still and moving images) and modalities (e.g. gestures, signs, …), and the tools processing these data to annotate them, to extract knowledge from them, to re-engineer them, to build new connections between them, to generate new or make explicit hidden information, form an integral domain
• Such a domain, coupled with related technologies to enable new applications for content and communication processing, is the long-term vision of a new initiative
Open Resource Infrastructure (first steps)
• ORI: an open, integrated, secured, and interoperable language resources (LR) and language technologies (LT) infrastructure for the HLT (Human Language Technologies) domain and other applicative domains (e.g. digital libraries, cognitive systems, robotics, etc.)
• ever-evolving, scalable, incl. free and for-a-fee LRs/LTs and services;
• including legacy, contemporary and emerging datasets, tools and technologies
• based on distributed networked repositories & data centres accessible through common interfaces;
• standards-compliant, overcoming format, terminological and semantic differences; allowing/enabling service offerings and static or dynamic compositions, i.e. workflows thereof
• complying with legal and security-related restrictions
ELRA in 2009, in a new setting
• Continues its mission
• Recasts its definition of resources by extending it to the tools that are, by contemporary methods, used in the LR production, validation and evaluation process (LRs=data+tools)
• Reinforces the LR identification tasks through the ELRA and the Universal catalogues
• Reinforces cooperation with LDC and NICT through the Universal Catalogue
• Participates in the metadata and other standardisation initiatives
• Launches service offerings based on its existing language resources (living resources)
ELRA instruments in 2009
• Infrastructure
  • Infrastructure Committee (ICOM)
    • Identification, cataloguing, standardised metadata documentation, IPR clearance, (new) distribution mechanisms
• Scientific
  • Production and Validation Committee (PVCOM)
    • Merger of existing P- and V-Com
    • Production of new derivative resources based on the existing (legacy-golden-normative-…) resources
  • Evaluation Committee (ECOM)
    • Promotion and support of evaluation
    • Evaluation portal upgrade and evaluation services
• Promotion of HLT and LRs
  • Marketing and Promotion Committee (MPCOM)
    • Organisation of LREC and LangTech
    • Revival of an information aggregation service à la Hltcentral.org
LRs & LTs of ORI
• language data (written & spoken: corpora, lexica, grammars, ontologies, terminologies, etc.),
• language-related data (including or associated to other media & modalities),
• processing and annotation tools and technologies,
• services through the use of language tools and technologies,
• workflows by combining interoperable services,
• evaluation tools, metrics and protocols,
• services addressing assessment and evaluation.
ORI players
• public and private R&D as well as industrial practitioners/suppliers and users of HLT:
  • 1) academic institutions, research organizations, universities,
  • 2) individual researchers and students,
  • 3) industrial organizations and SMEs, and
  • 4) national governments, EC institutions, and private investors.
• More specifically
  • a) LRs and LTs producers and providers,
  • b) LRs and LTs users as well as technology integrators,
  • c) LRs and LTs repositories, and
  • d) HLT policy makers and other LR and LT funders and sponsors.
ORI services
• registration, authorization, inventory and referral services, resource/tool description and uploading, browsing and downloading, validation-maintenance-evaluation, archiving-preservation and access/distribution services including IPR and legal clearance issues
• services and functionalities offered as computer services, but also by human expert agents (e.g. legal services)
"The vision I have for the Web is about anything being potentially connected with anything. It is a vision that provides us with new freedom, and allows us to grow faster than we ever could. . . . it brings the workings of society closer to the workings of our minds." Tim Berners-Lee: Weaving the Web, 2000
Thank you!