Strategy for systematic anonymisation of multi-lingual interaction corpora.

C. Reffay (1), F.-M. Blondel (1), E. Giguet (2). (1) STEF – ENS-Cachan / IFÉ – ENS-Lyon; (2) GREYC, Université Caen Basse-Normandie, CNRS.

Presentation Transcript


  1. Strategy for systematic anonymisation of multi-lingual interaction corpora.
     C. Reffay (1), F.-M. Blondel (1), E. Giguet (2)
     (1) STEF – ENS-Cachan / IFÉ – ENS-Lyon
     (2) GREYC, Université Caen Basse-Normandie, CNRS

  2. Outline
     • Introduction
     • Anonymisation process
     • Marking process
     • Finding new forms
     • Replacement process
     • Testing the process on a Galanet session
     • What did we learn? What works?
     • Next step…
     IC'2012 - C Reffay, F-M Blondel, E Giguet

  3. The corpus
     • Galanet session 2011-2012: “Nômades...nomadi...nómades... des langues” (Resp.: SandrineD)
     • 4 teams: Italy, Brazil, France & Spain
     • Duration: 3.5 months
     • 103 teenagers, of whom 83 authors wrote 915 messages, whose bodies contain:
       • Volume: 47 740 forms, 217 477 characters
       • Lexicon: 9 655 distinct forms

  4. Anonymisation… the solution?
     The objective is to share, but personal data cannot be shared.
     Anonymising by hand is hard work:
     • The corpus may be enormous
     • Subtleties: homonyms & synonyms
     Software support is needed.

  5. Anonymisation purpose
     • Hide personal information systematically:
       • Names (first names, last names, usernames…)
       • Identifiers (passport, national student number, …)
       • Locations (city, street, address, coordinates)
       • Institution/workplace (school, sport club, firm, …)
       • Contact references (e-mail, mobile, MSN, Skype, Twitter, telephone/fax)
       • Explicit references (URLs of homepages, blogs)
       • Social media usernames (Facebook, MySpace, Hi5, Soundcloud, Badoo, Bebo, Friendster, Netlog, …)
     • Maintain text coherence and consistency

  6. Personal data: examples
     • {(f331s2970m2) 2011-11-30T19:24 Gabibr Re: Quelques informations ... answers SandrineD (f331s2970m1)} “Eu amo a língua Francesa! Quem sabe falar francês me adiconem no meu FACEBOOK;) J'aime parler français! Qui peut parler français? M'ajouter dans FACEBOOK;) Nom: GABRIELA MEDEIROS.”
     • {(f333s3016m2) 2011-12-27T09:25 Miche Re: Les stéréotypes culinaires answers SandrineD (f333s3016m1)} “inviate i vostri documenti alla mia mail mikinessi@yahoo.it grazie!!!;)”
     • {(f330s2914m8) 2011-10-22T19:52 PBS Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Yo me llamo Peimikà Bibiana. Como mi madre es tailandesa y mi padre es italiano, mi primer nombre, Peimikà, es tailandés y significa "dueña del amor", mientras mi según nombre, Bibiana, es italiano y procede del etrusco "vibius" que significa "vida". Me gusta mucho tener dos nombres (en Italia es más usual tener un nombre) y sobre todo estoy orgullosa de los orígenes diferentes que tienen y que hacen mi nombre aún más particular (además Peimikà no es muy difundido en tampoco en Tailandia y tampoco Bibiana en Italia”

  7. Just google it!

  8. Peimikà Bibiana… google search (2)

  9. Anonymisation principles
     Once the corpus is anonymised, no participant should be identifiable by an outsider.
     • Every identified lexical form must be (computationally) marked, even if it is not changed to a replacement form.
     • Any reference (e.g. name, institution or location) should be imprecise enough to cover several hundred people.
     [Diagram: original lexical form → mark → replaced by → replacement form]

  10. Before/after anonymisation
      • Before: {(f330s2880m3) 2011-10-17T08:22 KellyM Re: Qui sommes-nous? answers CarlaN (f330s2880m1)} Bonjour, je m'appelle Kellly. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Rosa Luxemburg à Canet, non loin de Perpignan…
      • After: {(f330s2880m3) 2011-10-17T08:22 FLG01 Re: Qui sommes-nous? answers ILG02 (f330s2880m1)} Bonjour, je m'appelle Kittty*. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Margherita Duras* à Aigues-Vives*, non loin de Perpignan…

  11. Let’s find the regularities
      Interactively with the expert: the researcher.
      Hypotheses:
      • No fully automated method exists that works for all corpora
      • Some decisions have to be taken by the researcher, not by the software
      • The method will be accurate only for a given context (ex: Galanet)
      • “Named entities” do not occur randomly

  12. Concepts manipulated
      • Real world: existing objects (institution, participant, public person, relative, street, city…)
      • Reference: named entities (name, surname, username, first name, last name, addresses, tel. number, MSN…)
      • Corpus: lexical forms (Pedro, KellyM, Eli, Elô, Kelly, Bergamo, Canet, Rosa Luxembourg, 0609785643, …)

  13. Anonymisation process
      [Flow diagram] An initial list of participants, usernames, institutions… seeds the named-entities transformation table. The marking process, supported by the process/rules for discovering new forms, turns the corpus to anonymise into a corpus with marked entities. The replacement process then produces the anonymised corpus.

  14. Transformation table: example
      • Synonyms: the same entity has different forms
      • Homonyms: the same form refers to different entities
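
A minimal Python sketch of how such a transformation table might be held in memory, assuming one row per (lexical form, entity) pair. The field names, the entity identifiers and the replacement `Kitty` for `Kelly` are illustrative assumptions; only `Kellly` → `Kittty` and the Gene Kelly homonym come from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    form: str                   # lexical form as it appears in the corpus
    entity: str                 # hypothetical identifier of the real-world entity
    kind: str                   # e.g. "first name", "last name"
    replacement: Optional[str]  # None means: marked, but left unchanged


# Illustrative rows only; the real table is built with the researcher.
TABLE = [
    Row("Kelly", "participant:KellyM", "first name", "Kitty"),    # replacement assumed
    Row("Kellly", "participant:KellyM", "first name", "Kittty"),  # from the slides
    Row("Kelly", "public:GeneKelly", "last name", None),          # public person, unchanged
]


def synonyms(table):
    """Entities that appear under more than one lexical form."""
    by_entity = defaultdict(set)
    for row in table:
        by_entity[row.entity].add(row.form)
    return {e: forms for e, forms in by_entity.items() if len(forms) > 1}


def homonyms(table):
    """Lexical forms that refer to more than one entity."""
    by_form = defaultdict(set)
    for row in table:
        by_form[row.form].add(row.entity)
    return {f: ents for f, ents in by_form.items() if len(ents) > 1}
```

Keeping one row per (form, entity) pair lets both relations be derived from the same table: grouping by entity gives the synonyms, grouping by form gives the homonyms.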

  15. Marking one form: example (Kelly)
      A- List all occurrences (with their context) with a concordancer

  16. Marking one form: example (Kelly)
      B- Update the transformation table (ex: public person Gene Kelly)

  17. Marking one form: example (Kelly)
      C- Associate each occurrence with the appropriate entity (=> in the corpus: surround the occurrence with XML tags)
      • Last name, normal form, unchanged: refers to the public person Gene Kelly
      • First name, normal form, to be changed: refers to the participant KellyM
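
Step C could be sketched in Python as below. The slides only say that occurrences are surrounded by XML tags; the element name `ne`, the attribute names and the entity identifiers are assumptions made for illustration, not the markup actually used by the authors.

```python
from xml.sax.saxutils import escape, quoteattr


def mark_occurrences(text, spans):
    """Wrap each (start, end, entity, kind, change) span in a hypothetical <ne> tag.

    Spans must not overlap; the text between spans is XML-escaped.
    """
    out, pos = [], 0
    for start, end, entity, kind, change in sorted(spans):
        out.append(escape(text[pos:start]))
        attrs = " ".join(
            f"{name}={quoteattr(value)}"
            for name, value in (
                ("entity", entity),
                ("kind", kind),
                ("change", "yes" if change else "no"),
            )
        )
        out.append(f"<ne {attrs}>{escape(text[start:end])}</ne>")
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)


# The two "Kelly" occurrences are homonyms, so each gets its own entity.
example = mark_occurrences(
    "Gene Kelly et Kelly",
    [
        (5, 10, "public:GeneKelly", "last name", False),
        (14, 19, "participant:KellyM", "first name", True),
    ],
)
```

Marking by character offsets rather than by form is what allows two identical forms to be attached to different entities.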

  18. Detecting new forms: 2 strategies
      • Lexical rules: similar forms
        • Eli -> Elô, Ely, ELY, Seli
        • Gabriela -> GABRIELA
        • José -> Jose
      • Context rules: similar context
        • First names: “mi chiamo …”, “accord avec …”
        • Cities: “Soy de …”, “vivo en …”, “j’habite à …”
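
The context-rule strategy can be illustrated with a small Python sketch, assuming a rule is simply a left context followed by one word. The function name and the rule list are hypothetical: “mi chiamo” is quoted on the slide, while “je m'appelle” and “me llamo” are taken from the corpus excerpts and are not necessarily among the approved rules.

```python
import re

# Left contexts that tend to precede a first name (illustrative list).
LEFT_CONTEXTS = ["mi chiamo", "je m'appelle", "me llamo"]


def candidate_forms(text, left_contexts, known_forms):
    """Return forms that follow a left context and are not yet in the known list."""
    found = set()
    for context in left_contexts:
        pattern = re.compile(re.escape(context) + r"\s+(\w+)", re.IGNORECASE)
        for match in pattern.finditer(text):
            form = match.group(1)
            if form not in known_forms:
                found.add(form)
    return found


sample = "Bonjour, je m'appelle Kellly. Mi chiamo Kelly. Yo me llamo Bibiana."
new_forms = candidate_forms(sample, LEFT_CONTEXTS, known_forms={"Kelly"})
```

The output is only a list of candidates: as the hypotheses state, deciding whether a candidate really is a name remains the researcher's job.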

  19. 1st strategy: lexical variation rules
      • 103 known forms: Adriana, Alèxia, Anthony, Baptiste, Cleissa, Eli, …, Elouise, Emmanuel, Federica, Ferran, Gabriela, Guillem, Iñigo, Jaqueline, Jean, José, Kelly, Léo, Mariana, Mary, Michela, Monica, Olalla, Oleguer
      • 31 new forms: adriana, Alexia, Antonhy, baptiste, Cleisa, Elô, Ely, ELY, Seli, Louise, MAnuel, Federiac, fran, Fran, GABRIELA, guillem, iñigo, Jacqueline, jean, Jose, Kellly, Leo, léo, MariAna, mary, May, Miche, michelina, moni, olalla, oleguer

  20. 2nd strategy: context rules
      • 103 known first names (Adrià, …, Veronica)
      • 145 contexts: left/right
      • Total: more than 250 rules tested, 47 rules approved
      • 15 good new forms
      • Forms listed: Antonhy, Belle, Bet, Christine, Fede, Federiac, Kellly, Leo, Line, Maria, May, Peimikà, Regina, fran, jean, léo

  21. Replacing process
      • Before: {(f330s2880m3) 2011-10-17T08:22 KellyM Re: Qui sommes-nous? answers CarlaN (f330s2880m1)} Bonjour, je m'appelle Kellly. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Rosa Luxemburg à Canet, non loin de Perpignan…
      • After: {(f330s2880m3) 2011-10-17T08:22 FLG01 Re: Qui sommes-nous? answers ILG02 (f330s2880m1)} Bonjour, je m'appelle Kittty*. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Margherita Duras* à Aigues-Vives*, non loin de Perpignan…
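
A hedged Python sketch of the replacement step, assuming the marked corpus uses a simplified `<ne entity="…" change="yes|no">` markup (a hypothetical format; the slides do not show the real tags). The replacement forms and the trailing `*` on replaced forms follow the before/after example; the entity identifiers are invented for illustration.

```python
import re

# Simplified, hypothetical markup: attributes in a fixed order, no nesting.
NE = re.compile(r'<ne entity="([^"]+)" change="(yes|no)">([^<]*)</ne>')

# (entity, original form) -> replacement form, as in the before/after example.
REPLACEMENTS = {
    ("participant:KellyM", "Kellly"): "Kittty",
    ("institution:school1", "Rosa Luxemburg"): "Margherita Duras",
    ("city:town1", "Canet"): "Aigues-Vives",
}


def replace_marked(marked_text, table):
    """Replace every entity marked change="yes"; unwrap the ones marked "no"."""

    def substitute(match):
        entity, change, form = match.groups()
        if change == "no":
            return form  # marked but deliberately left unchanged
        # A missing table entry raises KeyError, so no marked form slips through.
        return table[(entity, form)] + "*"

    return NE.sub(substitute, marked_text)


marked = (
    'je m\'appelle <ne entity="participant:KellyM" change="yes">Kellly</ne>, '
    'non loin de <ne entity="city:Perpignan" change="no">Perpignan</ne>'
)
anonymised = replace_marked(marked, REPLACEMENTS)
```

Keying the table on the pair (entity, form) keeps synonyms consistent: the misspelt `Kellly` maps to the equally misspelt `Kittty`, which preserves the flavour of the original text.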

  22. Conclusion
      • A new process/algorithm for anonymisation
      • Hypotheses tested against a first corpus:
        • 47 rules approved for first names => 15 new forms
        • 103 first names => 31 existing derivations
      • Anonymisation cannot be 100% automatic => confirmed
      • Is anonymisation possible in a world with Google?
      • Use Google to evaluate the frequency of a first name!

  23. Next steps…
      • Finalize the concrete anonymisation of this corpus
        • Discuss some choices with SandrineD for usernames, cities, email addresses, …
        • Get feedback from SandrineD
      • Verify on a bigger (Galanet) corpus:
        • The process
        • The rules
      • Co-develop the tool:
        • within the research community…
        • in the (ANR) CORDIAL project?

  24. Grazie !

  25. More precisely

  26. Discovering new forms: 2 strategies
      Starting point: 103 known first names (Adrià, …, Veronica)
      • Lexical rules: 317 candidates
      • Context rules: 145 contexts (left/right)
        • Left, one form: 75 => 13780 occ.
        • Left, sequence of 2 forms: 123 => 1700 occ.
        • Total: more than 250 rules tested, 47 rules approved, 15 good new forms

  27. Contexts of 145 occ. of 103 first names (using TXM, case insensitive)

  28. The corpus lexicon
      A list of (lexical form ► frequency) pairs, 9655 unique forms in total:
      • de ► 1015
      • que ► 965
      • la ► 673
      • …
      • porque ► 48
      • …
      • Addams ► 1
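
Such a lexicon can be approximated in a few lines of Python. This is only a sketch: the slides do not say how forms are tokenised (the authors used TXM), so splitting on runs of word characters and keeping case distinctions are assumptions.

```python
import re
from collections import Counter


def build_lexicon(message_bodies):
    """Count each lexical form (case-sensitive) over all message bodies."""
    counts = Counter()
    for body in message_bodies:
        # Assumed tokenisation: a form is a maximal run of word characters.
        counts.update(re.findall(r"\w+", body))
    return counts


lexicon = build_lexicon(["de la casa", "que de la"])
volume = sum(lexicon.values())   # total number of forms
distinct = len(lexicon)          # size of the lexicon
```

Calling `lexicon.most_common()` gives the forms sorted by decreasing frequency, in the same layout as the list above.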

  29. Who is concerned?
      “Computer applications for teaching and educational purposes draw on data that make it possible to identify natural persons directly, but also indirectly. Particular attention must be paid to the collection of sensitive data and to the procedures used to anonymise the data.” (Mallet-Poujol 2004: p. 21, translated from French)
      For more information, see the European Directive 95/46/EC.

  30. Legal context (95/46/EC)
      • (Art. 7) Member States shall provide that personal data may be processed only if: the data subject has unambiguously given his consent; …
      • (Art. 8) Member States shall prohibit the processing of personal data revealing sensitive information (racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership, and the processing of data concerning health or sex life)
      • (Art. 10) […] Inform the data subject on:
        • The identity of the controller of the data collection
        • The purposes of the processing
        • The recipients or categories of recipients of the data
        • The existence of the right of access to and the right to rectify the data concerning him

  31. Text coherence and consistency
      • {(f330s2914m11) 2011-10-20T16:43 M_Cavalcanti Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “aaah, o meu é uma homenagem a uma de minhas tias e minha avó que se chamam Ana e ao resto de minhas tias que se chamam Maria. Daí, Mariana:)”
      • {(f330s2914m10) 2011-10-20T21:06 Eloandrade Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Gostei da criatividade da sua mãe MariAna! Rsrsrs”
      • {(f330s2914m3) 2011-10-28T00:54 LineCosta Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Ah meu nome é em homenagem a Jacqueline Kennedy, esposa do ex-presidente dos EUA, e também porque sempre foi um dos nomes preferidos do meu pai. :D”
      • {(f330s2914m18) 2011-10-19T20:36 Eloandrade Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Bem, minha mãe queria que meu nome começasse com a letra E (como o dela!), um certo dia ela viu o nome de uma atriz brasileira chamada Louise Cardoso. Gostou do "Louise", mas queria com a letra E, então ficou "Elouise"! Só depois, quando eu cresci é que descobri que meu nome era de origem francesa.. Hahaha”

  32. TXM: http://textometrie.ens-lyon.fr/

  33. Named entities
      A named entity is a lexical form identifying a precise object (first/last name, communication ref., city, institution, etc.)
      Examples:
      • Names: Christophe, Blondel, Giguet, Paris, …
      • Communication ref.: 0678600614, …
      • Location: Grenoble, Paris, Parigi, …
      • Institution: ENS Cachan, CNRS, …

  34. Managing named entities
      • Homonyms refer to different objects
        • In the corpus we have 2 participants named “Guillem”: the same first name refers to different persons
        • In “Gene Kelly”, Kelly is the last name of a public person
        • In “Galdric, Kelly et Antonhy”, it is the first name of a participant
      • Different synonyms refer to the same object
        • Kellly & Kelly
        • Anthony & Antonhy
        • Elô & Elouise

  35. Referring to global entities

  36. Overall method and tools
      • Define a process/algorithm for anonymisation
      • Test the hypotheses against a first corpus
        • Using existing tools (Excel, TXM/Calico, Notepad++)
        • Doing much of the work by hand (with automation in mind)
        • Facing/solving/avoiding problems
        • Evaluating/suggesting (new) hypotheses
      • Discuss the result with the original researcher
      • Verify on a second (bigger) corpus
      • Co-develop the tool within the research community

  37. Find Nei/nei with a concordancer
      All occurrences refer to the Italian common word “nei”

  38. Another example
      • {(f330s2914m5) 2011-10-23T21:52 CR_Martins Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Meu nome é Cleissa Regina, Cleissa porque minha mãe viu na tv uma repórter chamada Cleisa e achou parecido com o nome dela, Cléia e Regina porque o nome do meu pai é Reginaldo. Assim como a PBS gosto muito de ter 2 nomes e Cleissa é bem raro, nunca conheci ninguém chamado assim.”

  39. Peimikà Bibiana… a unique case?
      No! Let’s try Cleissa Regina…

  40. How to detect new forms?
      • Lexical rules (look for similar forms):
        • Ignoring accents (ex: José, Jose)
        • Ignoring case (ex: José, jose, JOSÉ, …)
        • Levenshtein distance between 2 forms: number of extra, missing or inverted characters
          • For forms shorter than 5 characters: distance <= 1
          • For forms of 5 characters or more: distance <= 2
      • Context rules (ex: “mi chiamo …”, “merci …”)
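
The lexical rules above might be combined as in the following Python sketch. Two points are assumptions rather than statements from the slides: because inverted characters are counted as one error, the distance used here is the optimal string alignment variant (Levenshtein plus adjacent transpositions), and the length threshold is applied to the known form. All function names are hypothetical.

```python
import unicodedata


def normalise(form):
    """Strip accents and case so that José, jose and JOSÉ compare equal."""
    decomposed = unicodedata.normalize("NFD", form)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def edit_distance(a, b):
    """Insertions, deletions, substitutions and adjacent swaps, each costing 1."""
    prev2, prev = None, list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            best = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # Adjacent transposition, e.g. "ca" <-> "ac" in Federica/Federiac.
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                best = min(best, prev2[j - 2] + 1)
            cur.append(best)
        prev2, prev = prev, cur
    return prev[-1]


def is_lexical_variant(known, candidate):
    """Apply the slide's thresholds: distance <= 1 under 5 characters, else <= 2."""
    k, c = normalise(known), normalise(candidate)
    limit = 1 if len(k) < 5 else 2  # assumption: length of the known form
    return edit_distance(k, c) <= limit
```

With these settings the pairs quoted in the slides (Eli/Elô, Eli/Seli, José/Jose, Kelly/Kellly, Anthony/Antonhy, Federica/Federiac) are all accepted as variants, while unrelated short names such as Jean and José are not.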

  41. Lexical variations 1/2

  42. Lexical variations 2/2

  43. Some good context rules (1/3)

  44. Some good context rules (2/3)

  45. Generic context rules
