Discover the hidden biases impacting linguistic diversity on the web and learn alternative measurement approaches. Explore key sources of bias and understand the complexities of language data representation. Be cautious of interpreting internet language statistics to derive meaningful insights.
Global Expert Meeting Multilingualism in Cyberspace for Inclusive Sustainable Development - 4 June - 9 June, 2017 Khanty-Mansiysk, Russian Federation
Daniel Pimienta pimienta@funredes.org Networks & Development Foundation http://funredes.org Observatory of languages & cultures in the Internet http://funredes.org/lc Executive Committee Member of http://maaya.org
Linguistic Indicators in cyberspace: the biases is all Daniel Pimienta pimienta@funredes.org MAAYA
CREDITS Part of the information presented is taken from a study carried out on behalf of MAAYA by the team D. Prado/D. Pimienta in 2015-2017 on the place of French on the Internet, which is funded by
CREDITS The original idea to approach the space of languages on the Internet by collecting Internet application/spaces figures & transforming country figures into language figures belongs to Daniel Prado (2012)
CREDITS Thanks to Deirdre Williams… (and to William Shakespeare) for the idea of borrowing from Hamlet Act 5 Scene 2 Not a whit, we defy augury; there's a special providence in the fall of a sparrow. If it be now, ‘tis not to come, if it be not to come, it will be now; if it be not now, yet it will come. The readiness is all. Since no man has aught of what he leaves, what is't to leave betimes? Let be.
ANTECEDENTS • IN 1998-2007 FUNREDES/UNION LATINE AND LOP PRODUCED INDICATORS FOR LANGUAGES ON THE INTERNET. • AFTER 2007, THE EVOLUTION OF THE WEB AND SEARCH ENGINES ENDED THE PRODUCTION AND LEFT A TERRIBLE VOID. • 2012-2014: DILINET PROJECT, A FAILED ATTEMPT COORDINATED BY MAAYA TO PROVIDE A HIGH-LEVEL RESPONSE.
ANTECEDENTS THE VOID WAS FILLED BY INTERNETWORLDSTATS (the 10 languages with the highest number of Internauts) and W3TECHS (contents by language)
BUT the data nicely provided by IWS and W3T is used by many people without a careful look at the BIASES !!!!
and the BIASES need to be understood if you plan to derive serious conclusions from the data…
And, by the way, most of those BIASES ARE NOT NEUTRAL: they provoke an overestimation of the place of English in the Net.
THE AIM OF THIS PRESENTATION • IS TO WARN YOU ABOUT THOSE SERIOUS BIASES
AND ALSO TO REVEAL ALTERNATIVE APPROACHES WHICH SERVE TO MEASURE THE BIASES • THE IDEA IS CERTAINLY NOT TO DENIGRATE EXISTING AND USEFUL INITIATIVES • BUT TO CALL FOR CAUTION IN THE USE OF THEIR DATA.
LINGUISTIC DIVERSITY INDICATORS PARADOX [Figure: timeline 1997-2017 contrasting INTEREST and CAPACITY, with the initiatives LOP, FUNREDES/UL, W3TECH, IDESCAT, IWS, ALIS/ISOC, OCLC, FUNREDES and XEROX placed along it]
ARE BIASES DEEPLY ROOTED IN THE SUBJECT OF LANGUAGES ON THE INTERNET? But why so, then?
CLASH OF BIASES
THE INTERNET • Marketing took over • Infiniteness of the web • Search engine vs. Ad engine
LANGUAGES • Demo-linguistic data • No consensus • Fuzzy boundaries • Huge domain • L1? L2? Li?
BIASES TAXONOMY • Sources bias • Methodological bias • Statistical bias • Hypothesis bias
SOURCE BIAS: “good practice” example ITU provides the most important data on Internet penetration: the percentage of individuals using the Internet per country. Along with the data, ITU provides a precise definition of the collection process and the associated assumptions. http://www.itu.int/en/ITU-D/Statistics/Documents/statistics/2016/Individuals_Internet_2000-2015.xls
ITU DATA • Method: country survey • Definition: individuals from 16 to 75 years old having connected to the Internet from any type of device on fixed or mobile network at least once in the last 3 months. • Analysis: 60% of the data is given by ITU, 15% by Eurostat with the same criteria, and the rest by country authorities, often with different criteria (especially on age).
ITU DATA • Best practice indeed, yet bias-free data does not exist. • Careful screening shows that the figures do not all correspond to the same year (no big deal!) • Some data from countries willing to promote their “efficient digital divide policies” are probably exaggerated… • Careful with country definitions!
STATISTICAL BIAS: an extremely influential… and terrible example From 1999 to 2007, the (wrong) steady figure of 80% of webpages in English was propagated in the media from an OCLC study with 2 publications, in 1999 and 2003, using the same methodology as a pioneering 1997 study by ALIS Technology, a Canadian company, with ISOC support. The fact that it was totally flawed did not prevent the media from treating it as the truth for 10 years!
OCLC WEB CHARACTERIZATION METHODOLOGY: • Random selection of 3000 IP numbers leading to webpages. • Application of language recognition algorithm • Publication of results. • Where was the major flaw?
STATISTICAL BIAS: gross example INKTOMI (a former search engine) published its study in 2000, with absolutely no methodology revealed… but strong marketing, with a figure of 86.5% of webpages in English.
W3TECHS: WHERE ARE THE BIASES? METHODOLOGY: On a daily basis W3Techs applies a language recognition algorithm to the 10 million sites classified by ALEXA as the most visited in the world.
W3TECHS: A COHERENCE CONTROL The content productivity indicator P(L) = % content (L) / % internauts (L) Experience has shown that this indicator rarely falls outside the window 0.5 – 1.5, reflecting an understandable statistical law linking the number of internauts and the amount of content.
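As a rough illustration of the coherence control, the indicator and the 0.5 – 1.5 window can be sketched in a few lines. The function names and the percentages for languages "A", "B" and "C" are hypothetical placeholders, not figures from the study.

```python
def productivity(content_share, internaut_share):
    """P(L) = % content (L) / % internauts (L)."""
    if internaut_share <= 0:
        raise ValueError("internaut share must be positive")
    return content_share / internaut_share


def coherence_check(shares, low=0.5, high=1.5):
    """Flag each language whose P(L) falls outside the [low, high] window.

    `shares` maps a language to (% content, % internauts).
    """
    report = {}
    for lang, (content, internauts) in shares.items():
        p = productivity(content, internauts)
        if p < low:
            verdict = "too low"
        elif p > high:
            verdict = "too high"
        else:
            verdict = "plausible"
        report[lang] = (p, verdict)
    return report


# Hypothetical percentages, for illustration only.
example = {"A": (30.0, 10.0), "B": (2.0, 20.0), "C": (5.0, 5.0)}
report = coherence_check(example)
for lang, (p, verdict) in report.items():
    print(f"{lang}: P = {p:.2f} ({verdict})")
```

A value flagged "too high" or "too low" does not say which of the two inputs is wrong; it only signals that the content figure and the internaut figure are hard to reconcile.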
W3TECHS: A COHERENCE CONTROL Data from previous FUNREDES/UNION LATINE studies (2005 & 2008) and data from W3Techs (2017), combined with internauts data derived from ITU Hardly credible…
W3TECHS: A COHERENCE CONTROL Data from W3Techs (2017) combined with country internauts data derived from ITU and transformed into language data by simple arithmetic weighting. Too high Too low
W3TECHS: A COHERENCE CONTROL FROM DATA Data from W3Techs (2017) combined with country internauts data derived from ITU and transformed into language data by simple arithmetic weighting. Way too low!!! Too low
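The "simple arithmetic weighting" from country figures to language figures can be sketched as follows. This is a minimal sketch under assumptions: the function name, the countries "X" and "Y", the languages "a" and "b" and all numbers are hypothetical, and the exact weights used in the study are not given here.

```python
from collections import defaultdict


def internauts_by_language(internauts_by_country, language_share_by_country):
    """Spread each country's internauts over its languages.

    `language_share_by_country[country][lang]` is the fraction of that
    country's population speaking `lang`. Fractions within a country may
    sum to more than 1 if second-language speakers are counted.
    """
    totals = defaultdict(float)
    for country, users in internauts_by_country.items():
        for lang, share in language_share_by_country[country].items():
            totals[lang] += users * share
    return dict(totals)


# Hypothetical figures (millions of internauts), for illustration only.
users = {"X": 100.0, "Y": 50.0}
shares = {"X": {"a": 0.8, "b": 0.2}, "Y": {"b": 1.0}}
by_lang = internauts_by_language(users, shares)
print(by_lang)
```

The sketch assumes that Internet access is spread evenly across language groups within a country, which is itself a hypothesis that can introduce bias.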
China + India together represent more than 1 billion persons connected to the Internet in 2016, that is, close to 1 in 3 Internauts… Would you believe they have less than 7% of the world's content? For some reason W3Techs is unable to reflect the reality of Asian languages on the Internet… Is that related to Alexa?
AS FOR CHINA, ONE REASON IS THAT ONLY 20% OF CHINESE DOMAINS ARE IN THE ICANN DNS ROOT! THIS MEANS AN UNDERESTIMATION BY W3TECHS BY A FACTOR OF 5!
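The arithmetic behind the factor 5 can be checked quickly. The sketch below assumes that the invisible domains carry content in the same proportion as the visible ones; the 2% measured share is a hypothetical input, not a W3Techs figure, and the function names are illustrative.

```python
def underestimation_factor(visible_fraction):
    """If only `visible_fraction` of the domains is seen, the raw count is
    too small by a factor of 1 / visible_fraction."""
    if not 0 < visible_fraction <= 1:
        raise ValueError("visible_fraction must be in (0, 1]")
    return 1.0 / visible_fraction


def corrected_share(measured_share, visible_fraction):
    """Scale up one language's content, then renormalize the world total."""
    scaled = measured_share * underestimation_factor(visible_fraction)
    return scaled / (scaled + (1.0 - measured_share))


factor = underestimation_factor(0.20)  # 20% visible in the ICANN DNS root
share = corrected_share(0.02, 0.20)    # hypothetical 2% measured share
print(f"factor = {factor:.1f}, corrected share = {share:.1%}")
```

Note that the raw amount of content is multiplied by 5, but the corrected percentage grows slightly less than fivefold, because the world total grows as well.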
ALEXA: WHERE IS THE BIAS, AND HOW DOES IT AFFECT W3Techs? ALEXA OFFERS MARKETING DATA TO WEBSITE OWNERS. ALEXA.COM measures traffic to websites by means of a banner. It is not transparent about the distribution of the banner per country or per language; it only claims to have “millions of banners installed”. Wikipedia reported 10 million in 2005.
ALEXA SUSPECTED BIAS • TELL ME THE BANNER DISTRIBUTION BY COUNTRY AND I’LL TELL YOU WHERE THE BIAS IS • A PRIORI ONE CAN EXPECT A PRO-WESTERN BIAS AND PROBABLY ALSO A PRO-ENGLISH ONE. HOW TO CONTROL IT IN A CONTEXT OF ZERO TRANSPARENCY?
ALEXA BIAS Comparison of Alexa traffic data with subscribers data for Facebook, Twitter and LinkedIn. The data is transformed from a per-country basis to a per-language basis by weighting with the distribution of languages within countries (more later).
ALEXA BIAS The pro-Western bias appears clearly in the test, although with some exceptions which call for further studies…
BEFORE WE CHECK IWS, AN EXCLUSIVE: MY OWN BIAS • INSPIRED BY THE CURRENT WORK OF MAAYA FOR OIF • FIGURES ON: • INTERNAUTS PER LANGUAGE ON THE INTERNET • CONTENTS PER LANGUAGE ON THE WEB
COMPARING WITH W3Techs RABBIT > W3Techs RABBIT < W3Techs
COMPARING WITH IWS RABBIT > W3Techs RABBIT < W3Techs
COMPARING IWS/RABBIT We are supposed to rely on the same ITU source, so L1 + L2 should explain all differences. However, simulating Rabbit with IWS L1+L2 figures and 100% reasoning (instead of 125%) still shows an underestimation of between 20% and 50% for French, Russian, German and Spanish… Why so? Multilingualism management. The simulation comparison yields a negative figure for the remaining languages…