Strategy for systematic anonymisation of multi-lingual interaction corpora
C. Reffay (1), F.-M. Blondel (1), E. Giguet (2)
(1) STEF – ENS-Cachan / IFÉ – ENS-Lyon
(2) GREYC, Université Caen Basse-Normandie, CNRS

Outline
• Introduction
• Anonymisation process
• Marking process
• Finding new forms
• Replacement process
• Testing the process on a Galanet session
• What did we learn? What works?
• Next step…
IC'2012 - C Reffay, F-M Blondel, E Giguet

The corpus
• Galanet session 2011-2012: “Nômades...nomadi...nómades... des langues” (Resp.: SandrineD)
• 4 teams: Italy, Brazil, France and Spain
• Duration: 3.5 months
• 103 teenagers; 83 authors wrote 915 messages
• The message bodies contain:
  • Volume: 47 740 forms, 217 477 characters
  • Lexicon: 9 655 distinct forms

Anonymisation… the solution?
The objective is to share! But personal data cannot be shared.
Anonymising by hand is hard work:
• The corpus may be enormous
• Subtleties: homonyms and synonyms
=> Software support is needed

Anonymisation purpose
• Hide personal information systematically:
  • Names (first names, last names, usernames…)
  • Identifiers (passport, national student number, …)
  • Locations (city, street, address, coordinates)
  • Institution/workplace (school, sport club, firm, …)
  • Contact references (e-mail, mobile, MSN, Skype, Twitter, telephone/fax)
  • Explicit references (URLs of homepages, blogs)
  • Social media usernames (Facebook, MySpace, Hi5, SoundCloud, Badoo, Bebo, Friendster, Netlog, …)
• Maintain text coherence and consistency

Personal data: examples
• {(f331s2970m2)2011-11-30T19:24 Gabibr Re: Quelques informations ... answers SandrineD (f331s2970m1)} “Eu amo a língua Francesa! Quem sabe falar francês me adiconem no meu FACEBOOK;) J'aime parler français! Qui peut parler français? M'ajouter dans FACEBOOK;) Nom: GABRIELA MEDEIROS.”
• {(f333s3016m2)2011-12-27T09:25 Miche Re: Les stéréotypes culinaires answers SandrineD (f333s3016m1)} “inviate i vostri documenti alla mia mail mikinessi@yahoo.it grazie!!!;)”
• {(f330s2914m8)2011-10-22T19:52 PBS Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Yo me llamo Peimikà Bibiana. Como mi madre es tailandesa y mi padre es italiano, mi primer nombre, Peimikà, es tailandés y significa " dueña del amor ", mientras mi según nombre, Bibiana, es italiano y procede del etrusco " vibius " que significa " vida ". Me gusta mucho tener dos nombres (en Italia es más usual tener un nombre) y sobre todo estoy orgullosa de los orígenes diferentes que tienen y que hacen mi nombre aún más particular (además Peimikà no es muy difundido en tampoco en Tailandia y tampoco Bibiana en Italia”

Just google it!

Peimikà Bibiana… Google search (2)

Anonymisation principles
Once anonymised, no participant may be identifiable by an external person.
• Every identified lexical form must be (computationally) marked, even if it is not modified by a replacement form.
• Any reference that is kept (e.g. name, institution or location) should be imprecise enough to encompass several hundred people.
[Diagram labels: Original lexical form; Mark; Replaced by; Replacement form]

Before / after anonymisation
• Before: {(f330s2880m3)2011-10-17T08:22 KellyM Re: Qui sommes-nous? answers CarlaN (f330s2880m1)} Bonjour, je m'appelle Kellly. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Rosa Luxemburg à Canet, non loin de Perpignan…
• After: {(f330s2880m3)2011-10-17T08:22 FLG01 Re: Qui sommes-nous? answers ILG02 (f330s2880m1)} Bonjour, je m'appelle Kittty*. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Margherita Duras* à Aigues-Vives*, non loin de Perpignan…

Hypotheses
• A fully automated method does not exist for all corpora
• Some decisions have to be taken by the researcher, not by the software
• The method will be accurate only for a given context (e.g. Galanet)
• “Named entities” do not occur randomly
=> Let’s find the regularities, interactively with the expert: the researcher

Concepts manipulated
• Real world: existing objects (Institution, Participant, Public person, Relative, Street, City…)
• Reference: named entities (Name, Surname, Username, First name, Last name, Addresses, Tel. number, MSN…)
• Corpus: lexical forms (Pedro, KellyM, Eli, Elô, Kelly, Bergamo, Canet, Rosa Luxembourg, 0609785643)

Anonymisation process
[Flow diagram]
• Inputs: corpus to anonymise; initial list of participants, usernames, institution…
• Flow: Corpus to anonymise -> Marking process -> Corpus with marked entities -> Replacement process -> Anonymised corpus
• Supporting elements: named entities transformation table; process/rules for discovering new forms

Transformation table: example
• Synonyms: the same entity has different forms
• Homonyms: the same form refers to different entities
[Table not reproduced]

Marking one form: example (Kelly)
A- List all occurrences (with their context) using a concordancer

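Step A can be illustrated with a minimal keyword-in-context sketch in Python. It is not the concordancer used by the authors (the slides mention TXM); the function name `concordance` and the `width` parameter are assumptions made for illustration only.

```python
import re


def concordance(form, text, width=20):
    """Return (left context, occurrence, right context) for each whole-word
    occurrence of `form` in `text`. Hypothetical helper, case-sensitive."""
    pattern = re.compile(r"\b%s\b" % re.escape(form))
    rows = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        rows.append((left, match.group(0), right))
    return rows


# Each row lets the researcher decide which entity the occurrence refers to.
for left, hit, right in concordance("Kelly", "Galdric, Kelly et Antonhy"):
    print("%s[%s]%s" % (left, hit, right))
```

Matching is on whole words, so a variant such as "Kellly" is not returned when searching for "Kelly"; variants have to be found by the lexical and context rules.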
Marking one form: example (Kelly)
B- Update the transformation table (e.g. add the public person Gene Kelly)

Marking one form: example (Kelly)
C- Associate each occurrence with the appropriate entity (=> in the corpus: surround the occurrence with XML tags)
• Last name, normal form, unchanged: refers to the public person Gene Kelly
• First name, normal form, to be changed: refers to the participant KellyM

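Step C could look like the sketch below. The slides do not show the actual tag syntax, so the element name `ne` and the attributes `entity`, `category` and `change` are hypothetical, as is the function name `mark`.

```python
from xml.sax.saxutils import quoteattr


def mark(text, start, end, entity, category, change):
    """Surround text[start:end] with a hypothetical <ne> tag that records
    the entity referred to, the kind of form, and whether to replace it."""
    tag = "<ne entity=%s category=%s change=%s>" % (
        quoteattr(entity),
        quoteattr(category),
        quoteattr("yes" if change else "no"),
    )
    return text[:start] + tag + text[start:end] + "</ne>" + text[end:]


# The occurrence of "Kelly" at offsets 9..14 refers to the participant KellyM.
print(mark("Galdric, Kelly et Antonhy", 9, 14, "KellyM", "first name", True))
```

Recording the entity on each occurrence, rather than on the form, is what lets the same form "Kelly" be changed in one message and left unchanged in "Gene Kelly".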
Detecting new forms: 2 strategies
• Lexical rules: similar forms
  • Eli -> Elô, Ely, ELY, Seli
  • Gabriela -> GABRIELA
  • José -> Jose
• Context rules: similar context
  • First names: “mi chiamo …”, “accord avec …”
  • Cities: “Soy de …”, “vivo en …”, “j’habite à …”

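A context rule can be sketched as "a known left context followed by one form". The contexts below are the ones quoted on the slide; the regular expression, the function name `forms_after` and the one-word capture are assumptions, and every candidate would still need to be approved by the researcher.

```python
import re

# Left contexts quoted on the slide (lower-cased).
FIRST_NAME_CONTEXTS = ["mi chiamo", "accord avec"]
CITY_CONTEXTS = ["soy de", "vivo en", "j'habite à"]


def forms_after(contexts, text):
    """Return the single form that follows any of the given left contexts.
    Hypothetical helper: case-insensitive, one-word capture only."""
    text = text.replace("\u2019", "'")  # treat curly and straight apostrophes alike
    alternatives = "|".join(re.escape(context) for context in contexts)
    pattern = re.compile(r"(?:%s)\s+(\w+)" % alternatives, re.IGNORECASE)
    return pattern.findall(text)


print(forms_after(CITY_CONTEXTS, "J'habite à Canet, non loin de Perpignan"))
```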
1st strategy: lexical variation rules
• 103 known forms (extract): Adriana, Alèxia, Anthony, Baptiste, Cleissa, Eli, …, Elouise, Emmanuel, Federica, Ferran, Gabriela, Guillem, Iñigo, Jaqueline, Jean, José, Kelly, Léo, Mariana, Mary, Michela, Monica, Olalla, Oleguer
• 31 new forms: adriana, Alexia, Antonhy, baptiste, Cleisa, Elô, Ely, ELY, Seli, Louise, MAnuel, Federiac, fran, Fran, GABRIELA, guillem, iñigo, Jacqueline, jean, Jose, Kellly, Leo, léo, MariAna, mary, May, Miche, michelina, moni, olalla, oleguer

2nd strategy: context rules
• 103 known first names (Adrià, …, Veronica)
• 145 contexts: left/right
• Total: more than 250 tested rules
• 47 rules approved
• 15 good new forms: Antonhy, Belle, Bet, Christine, Fede, Federiac, Kellly, Leo, Line, Maria, May, Peimikà, Regina, fran, jean, léo

Replacement process
• Before: {(f330s2880m3)2011-10-17T08:22 KellyM Re: Qui sommes-nous? answers CarlaN (f330s2880m1)} Bonjour, je m'appelle Kellly. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Rosa Luxemburg à Canet, non loin de Perpignan…
• After: {(f330s2880m3)2011-10-17T08:22 FLG01 Re: Qui sommes-nous? answers ILG02 (f330s2880m1)} Bonjour, je m'appelle Kittty*. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Margherita Duras* à Aigues-Vives*, non loin de Perpignan…

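The replacement step can be sketched as a pass over the marked corpus that looks up each marked form in the transformation table. The `<ne>` tag syntax, the table layout keyed by (entity, form) and the function name `replace_marked` are assumptions; the trailing asterisk simply mirrors the "After" example, where replaced forms carry a `*`.

```python
import re

# Hypothetical tag syntax for a marked occurrence.
NE_TAG = re.compile(
    r'<ne entity="([^"]*)" category="[^"]*" change="(yes|no)">(.*?)</ne>'
)


def replace_marked(text, table):
    """Replace each marked form flagged change="yes" using `table`,
    a dict keyed by (entity, original form); drop the tags otherwise."""
    def substitute(match):
        entity, change, form = match.groups()
        if change == "no":
            return form  # e.g. a public person: the form is kept as it is
        return table[(entity, form)] + "*"
    return NE_TAG.sub(substitute, text)


table = {("KellyM", "Kellly"): "Kittty"}
marked = ('je m\'appelle <ne entity="KellyM" category="first name" '
          'change="yes">Kellly</ne>')
print(replace_marked(marked, table))
```

Keying the table by entity and form keeps the spelling variant visible after replacement (Kellly becomes Kittty, not Kitty), which is one way to maintain text coherence.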
Conclusion
• A new process/algorithm for anonymisation
• Hypotheses tested against a first corpus:
  • 47 rules approved for first names => 15 new forms
  • 103 first names => 31 existing derivations
  • Anonymisation cannot be 100% automatic => confirmed
• Is anonymisation possible in a world with Google?
• Use Google to evaluate the frequency of a first name!

Next steps…
• Finalize the concrete anonymisation of this corpus
• Discuss some choices with SandrineD for:
  • Usernames, cities, email addresses, …
• Get feedback from SandrineD
• Verify on a bigger (Galanet) corpus:
  • The process
  • The rules
• Co-develop the tool:
  • within the research community…
  • in the (ANR) CORDIAL project?

Discovering new forms: 2 strategies
[Diagram] Starting point: 103 known first names (Adrià, …, Veronica), processed by lexical rules and context rules
• 317 candidates
• 145 contexts: left/right
  • Left, one form: 75 => 13780 occ.
  • Left, sequence of 2 forms: 123 => 1700 occ.
• Total: more than 250 tested rules
• 47 rules approved
• 15 good new forms

Contexts of 145 occ. of 103 first names (using TXM, case insensitive)

The corpus lexicon
• A list of (lexical form ► frequency):
  • de ► 1015
  • que ► 965
  • la ► 673
  • …
  • porque ► 48
  • …
  • Addams ► 1
• 9655 unique forms

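A lexicon of this shape can be sketched with a frequency counter. The tokenisation used here (runs of word characters, case-sensitive) is an assumption and will not necessarily reproduce the counts on the slide, which come from the authors' own tools; `build_lexicon` is a hypothetical name.

```python
import re
from collections import Counter


def build_lexicon(message_bodies):
    """Count how often each lexical form occurs across all message bodies.
    Case is kept, so "Ely" and "ELY" are counted as distinct forms."""
    counts = Counter()
    for body in message_bodies:
        counts.update(re.findall(r"\w+", body))
    return counts


lexicon = build_lexicon(["Soy de Canet", "vivo en Canet"])
for form, frequency in lexicon.most_common():
    print(form, "►", frequency)
```

Rare forms at the bottom of such a list (frequency 1, like "Addams" on the slide) are natural places to look for names that the known-forms list missed.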
Who is concerned?
“Computer applications for pedagogical and educational purposes involve data that make it possible to identify natural persons directly, but also indirectly. Particular attention must be paid to the collection of sensitive data and to the procedures for anonymising data.” (Mallet-Poujol 2004: p 21, translated from French)
For more information, see Directive 95/46/EC of the European Parliament and of the Council.

Legal context (95/46/EC)
• (Art7) Member States shall provide that personal data may be processed only if: the data subject has unambiguously given his consent; …
• (Art8) Member States shall prohibit the processing of personal data revealing sensitive information (racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership, and the processing of data concerning health or sex life)
• (Art10) […] Inform the data subject of:
  • The identity of the controller of the data collection
  • The purposes of the processing
  • The recipients or categories of recipients of the data
  • The existence of the right of access to and the right to rectify the data concerning him

Text coherence and consistency
• {(f330s2914m11)2011-10-20T16:43 M_Cavalcanti Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “aaah, o meu é uma homenagem a uma de minhas tias e minha avó que se chamam Ana e ao resto de minhas tias que se chamam Maria. Daí, Mariana:)”
• {(f330s2914m10)2011-10-20T21:06 Eloandrade Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Gostei da criatividade da sua mãe MariAna! Rsrsrs”
• {(f330s2914m3)2011-10-28T00:54 LineCosta Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Ah meu nome é em homenagem a Jacqueline Kennedy, esposa do ex- presidente dos EUA, e também porque sempre foi um dos nomes preferidos do meu pai.: D”
• {(f330s2914m18)2011-10-19T20:36 Eloandrade Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Bem, minha mãe queria que meu nome começasse com a letra E (como o dela!), um certo dia ela viu o nome de uma atriz brasileira chamada Louise Cardoso. Gostou do " Louise ", mas queria com a letra E, então ficou " Elouise "! Só depois, quando eu cresci é que descobri que meu nome era de origem francesa.. Hahaha”

TXM: http://textometrie.ens-lyon.fr/

Named entities
A named entity is a lexical form identifying a precise object (first/last name, communication ref., city, institution, etc.)
Examples:
• Names: Christophe, Blondel, Giguet, Paris, …
• Communication ref.: 0678600614, …
• Location: Grenoble, Paris, Parigi, …
• Institution: ENS Cachan, CNRS, …

Managing named entities
• Homonyms refer to different objects
  • In the corpus we have 2 participants named “Guillem”: the same first name refers to different persons.
  • In “Gene Kelly”, Kelly is the last name of a public person
  • In “Galdric, Kelly et Antonhy”, it is the first name of a participant
• Different synonyms refer to the same object
  • Kellly & Kelly
  • Anthony & Antonhy
  • Elô & Elouise

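One way to keep homonyms and synonyms apart is to store one table row per (form, entity) pair and index the rows in both directions. The row layout, the entity labels and the function names below are illustrative assumptions, not the authors' actual transformation table.

```python
from collections import defaultdict

# One row per (lexical form, entity, kind of form); labels are illustrative.
ROWS = [
    ("Kelly", "participant KellyM", "first name"),
    ("Kellly", "participant KellyM", "first name"),
    ("Kelly", "public person Gene Kelly", "last name"),
    ("Elouise", "participant Eloandrade", "first name"),
    ("Elô", "participant Eloandrade", "first name"),
]


def homonyms(rows):
    """Forms that refer to more than one entity."""
    entities_by_form = defaultdict(set)
    for form, entity, _kind in rows:
        entities_by_form[form].add(entity)
    return sorted(f for f, ents in entities_by_form.items() if len(ents) > 1)


def synonyms(rows):
    """Entities that are referred to by more than one form."""
    forms_by_entity = defaultdict(set)
    for form, entity, _kind in rows:
        forms_by_entity[entity].add(form)
    return {e: sorted(fs) for e, fs in forms_by_entity.items() if len(fs) > 1}


print(homonyms(ROWS))
print(synonyms(ROWS))
```

Homonyms are the forms that cannot be replaced blindly: each occurrence has to be assigned to an entity first, which is the purpose of the marking step.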
Referring to global entities

Overall method and tools
• Define a process/algorithm for anonymisation
• Test the hypotheses against a first corpus
• Use existing tools (Excel, TXM/Calico, Notepad++)
• Do much of the work by hand (with automation in mind)
• Face, solve or avoid problems
• Evaluate and suggest (new) hypotheses
• Discuss the result with the original researcher
• Verify on a second (bigger) corpus
• Co-develop the tool within the research community

Find Nei/nei with a concordancer
All occurrences refer to the Italian common word “nei”

Another example
• {(f330s2914m5)2011-10-23T21:52 CR_Martins Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Meu nome é Cleissa Regina, Cleissa porque minha mãe viu na tv uma repórter chamada Cleisa e achou parecido com o nome dela, Cléia e Regina porque o nome do meu pai é Reginaldo. Assim como a PBS gosto muito de ter 2 nomes e Cleissa é bem raro, nunca conheci ninguém chamado assim.”

Peimikà Bibiana… a unique case?
No! Let’s try Cleissa Regina…

How to detect new forms?
• Lexical rules (look for similar forms):
  • Ignoring accents (e.g. José, Jose)
  • Ignoring case (e.g. José, jose, JOSÉ, …)
  • Levenshtein distance between 2 forms: number of extra, missing or inverted characters
    • For forms shorter than 5 characters: distance <= 1
    • For forms of 5 characters or more: distance <= 2
• Context rules (e.g. “mi chiamo …”, “merci …”)

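The lexical rules above can be sketched as follows. Because the slide counts inverted characters, the sketch uses the optimal string alignment variant of the edit distance (insertion, deletion, substitution, adjacent transposition); that choice, the decision to measure length on the known form, and the function names are assumptions rather than the authors' implementation.

```python
import unicodedata


def normalise(form):
    """Strip accents and case: "JOSÉ" and "José" both become "jose"."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def osa_distance(a, b):
    """Edit distance counting insertions, deletions, substitutions and
    transpositions of adjacent characters (optimal string alignment)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def is_candidate_variant(known, candidate):
    """Apply the thresholds from the slide to the normalised forms."""
    k, c = normalise(known), normalise(candidate)
    limit = 1 if len(k) < 5 else 2
    return osa_distance(k, c) <= limit


print(is_candidate_variant("Federica", "Federiac"))
```

A match only makes the form a candidate: common words can fall within the same distance of a first name, so each candidate still has to be checked in context by the researcher.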
Lexical variations 1/2

Lexical variations 2/2

Some good context rules (1/3)

Some good context rules (2/3)

Generic context rules