Searching peoples' digital footprints
A new avenue in sociology and what are the problems with it
János Kertész
Budapest University of Technology and Economics
Outline
• Problems with the classical primary data collection in sociology
  – an example
• Abundance of data: digital footprints
  – new era in social sciences?
• Examples
• Data availability
• Ethical issues
• Summary
Primary Methods of Data Collection
• Interviewing people
• Designing a questionnaire
  – This method is best for discovering factual information about people …
• Observing people
• Content analysis
• Designing an experiment to carry out
• Case study
• Focus group
Statistics about primary data collection (papers over 10 years in American Sociological Review):
• Interpretative: 17%
• Survey: 80%
• Experiment: 3%
An example: The Add Health database
"The (US) National Longitudinal Study of Adolescent Health (Add Health) is a nationally representative study that explores the causes of health-related behaviors of adolescents in grades 7 through 12 and their outcomes in young adulthood. Add Health seeks to examine how social contexts (families, friends, peers, schools, neighborhoods, and communities) influence adolescents' health and risk behaviors."
Designed by J. R. Udry, P. S. Bearman, and K. M. Harris; started in 1994 and still ongoing; funded by the National Institute of Child Health and Human Development (P01-HD31921).
Contact: http://www.cpc.unc.edu/addhealth
DATA (cont.)
• Data based on questionnaires and medical tests
• ~1700 publications (incl. dissertations)
• We used the data from Wave I (1994-95): 75,871 students in 84 high schools were asked 68 questions, including 10 friendship-related ones:
  >> Name 5 best male and 5 best female friends.
  >> For each friend, select from the list the items that apply. During the last 7 days you
     1. visited each other
     2. met after school
     3. spent time together during last weekend
     4. talked with him/her about a problem
     5. talked with him/her on the phone
Threshold analysis (González et al. 2007)
• Links are a priori directed, corresponding to the nominations
• Strength of ties is characterized by discrete weights
• Strong asymmetry may occur: A → B and B → A can carry very different weights (e.g. 1 vs. 5)
[Figure: G/N, the order parameter of percolation, and Σ s² n_s, the "susceptibility"]
• Black line: w = (w_AB + w_BA)/2, mutuality required
• Red line: no mutuality required, a missing nomination is taken as 0
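The threshold analysis on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure of the cited paper: the toy nominations are invented, a link is kept when its averaged weight is at least the threshold, and the "susceptibility" is taken as the sum of s² over all clusters except the largest one (the slide does not state the normalization).

```python
from collections import defaultdict

def symmetric_weights(nominations, require_mutual):
    """nominations maps a directed pair (a, b) to the weight of a's nomination of b.
    Returns undirected weights w = (w_ab + w_ba) / 2.
    If require_mutual is False, a missing reverse nomination counts as 0."""
    pairs = {}
    for (a, b), w_ab in nominations.items():
        if a == b:
            continue
        w_ba = nominations.get((b, a))
        if w_ba is None:
            if require_mutual:
                continue
            w_ba = 0
        pairs[frozenset((a, b))] = (w_ab + w_ba) / 2
    return pairs

def cluster_sizes(nodes, edges):
    """Connected-component sizes (largest first) via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    sizes = defaultdict(int)
    for n in nodes:
        sizes[find(n)] += 1
    return sorted(sizes.values(), reverse=True)

def threshold_scan(nodes, nominations, thresholds, require_mutual=True):
    """For each threshold return (threshold, G/N, sum of s^2 over non-largest clusters)."""
    weights = symmetric_weights(nominations, require_mutual)
    results = []
    for t in thresholds:
        edges = [tuple(pair) for pair, w in weights.items() if w >= t]
        sizes = cluster_sizes(nodes, edges)
        order_parameter = sizes[0] / len(nodes)          # G/N
        susceptibility = sum(s * s for s in sizes[1:])   # assumed normalization
        results.append((t, order_parameter, susceptibility))
    return results

# Hypothetical toy data: weights are the number of shared activities (1..5).
nodes = ["A", "B", "C", "D", "E"]
nominations = {("A", "B"): 1, ("B", "A"): 5, ("B", "C"): 4, ("C", "B"): 4, ("D", "E"): 2}

mutual_scan = threshold_scan(nodes, nominations, [1, 4], require_mutual=True)
loose_scan = threshold_scan(nodes, nominations, [1], require_mutual=False)
```

Raising the threshold removes weak ties, so the largest cluster shrinks; comparing the two variants shows how much of the connectivity rests on one-sided nominations.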
Other ways of finding data for scientific research: huge datasets due to IT
Official data collections (open or can be made available):
• Statistical institutes (e.g. P. Hedström's Stockholm data)
• Fiscal data (income distributions etc.)
• Medical data (e.g. Finnish diabetes data, mortality data)
• …
Work related:
• Commercial data (e.g. point collections, trading data of companies): secret, property of companies
• Financial data (e.g. stock and other markets, banks): partly open (free or for purchase)
• …
Science related (open):
• Human Genome Project
• Chemical data banks
• Archives
• Bibliographies
• …
These data are either produced for analysis, or we assume that they will be used for that purpose.
Data generated in our everyday lives
A new avenue for social sciences: digital footprints
This collection of data raises legal and ethical issues (see later).
At the same time it provides a gold mine for research!
"Until now, social science has struggled to obtain tools that do more than scratch the surface of some of its questions. These range from identifying the driving forces behind violence, to the factors influencing how ideas, attitudes and prejudices spread through human populations. The available tools have largely remained in a time warp, consisting of analyses of national censuses, small-scale surveys, or lone researchers with a notebook observing interactions within small groups. Being able to automatically and remotely obtain massive amounts of continuous data opens up unprecedented opportunities for social scientists to study organizations and entire communities or populations."
NATURE | Vol 449 | 11 October 2007
Communications leave detailed information about who communicates with whom, when and where…
• phone (mobile and fixed line)
• SMS, MMS
• MSN
• email
In a broader sense, all kinds of activities that leave electronic records can be used, including
• commercial activities (eBay, point collecting cards, credit cards, etc.)
• open collaborative environments (Wikipedia, GNU, etc.)
• e-communities (Facebook, MySpace, etc.)
• e-games (role-playing, Where is George, etc.)
Enron Email Dataset (free: www.cs.cmu.edu/~enron/)
• 150 users (Enron management), 0.5M messages
• Made public (including content!) by the Federal Energy Regulatory Commission
• The presently available corpus does not include attachments, and some messages have been deleted (at the request of affected employees)
Triggered much interesting work, e.g.:
• Berkeley Enron Email Analysis (testing methods)
• J. Shetty and J. Adibi: The Enron Email Dataset: Database Schema and Brief Statistical Report
• Z. Eisler, I. Bartos and J. K.: Fluctuation scaling
Huberman et al.: HP data (not publicly available)
Related: Microsoft report MSR-TR-2006-186 (2007) on 30 × 10^9 MSN messages
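An email corpus of this kind is typically turned into a "who writes to whom" network by reading only the message headers. The following is a minimal sketch using Python's standard `email` package; the two raw messages and the `example.com` addresses are invented for illustration, and real corpora need extra cleaning (aliases, mailing lists, duplicated folders) that is not shown here.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

def email_edges(raw_messages):
    """Count directed sender -> recipient pairs from From/To/Cc headers."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = [addr.lower() for _, addr in getaddresses(msg.get_all("From", []))]
        recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        for sender in senders:
            for _, rcpt in recipients:
                rcpt = rcpt.lower()
                # Skip empty addresses and self-copies.
                if rcpt and rcpt != sender:
                    edges[(sender, rcpt)] += 1
    return edges

# Hypothetical messages, not taken from the Enron corpus.
raw1 = "From: alice@example.com\nTo: bob@example.com, carol@example.com\nSubject: hi\n\nbody"
raw2 = "From: alice@example.com\nTo: bob@example.com\nCc: Alice@example.com\n\nbody"

edges = email_edges([raw1, raw2])
```

The resulting counter is a weighted, directed edge list; message bodies are never touched, which is often the only part of such data a study needs.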
Fluctuation scaling: σ ~ ⟨f⟩^α (Eisler et al. 2008)
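Fluctuation scaling states that the standard deviation σ of a node's activity grows as a power of its mean activity ⟨f⟩, σ ~ ⟨f⟩^α. A common way to estimate α is a least-squares fit on log-log axes; the sketch below does this with the standard library only. The activity series are invented so that the exponent is exactly 0.5, and the fitting procedure is an assumption of this sketch, not necessarily the one used in the cited work.

```python
import math
import statistics

def fluctuation_scaling_exponent(series_by_node):
    """Least-squares slope of log(std) against log(mean) over all nodes."""
    xs, ys = [], []
    for series in series_by_node.values():
        mean = statistics.fmean(series)
        std = statistics.pstdev(series)
        if mean > 0 and std > 0:  # logarithm needs positive values
            xs.append(math.log(mean))
            ys.append(math.log(std))
    if len(xs) < 2:
        raise ValueError("need at least two nodes with positive mean and std")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    return covariance / variance

# Hypothetical activity counts per time window: means 4, 16, 64 and stds 2, 4, 8.
toy_activity = {
    "user_a": [2, 6],
    "user_b": [12, 20],
    "user_c": [56, 72],
}

alpha = fluctuation_scaling_exponent(toy_activity)
```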
Constructing social network from mobile phone data
J.-P. Onnela et al., PNAS 104, 7332-7336 (2007)
J.-P. Onnela et al., New J. Phys. 9, 179 (2007)
• Over 7 million private mobile phone subscriptions
• Focus: voice calls within the home operator
• Data aggregated from a period of 18 weeks
• Require reciprocity (X→Y AND Y→X) for a link
• Customers are anonymous (hash codes)
• Data from a European mobile operator
[Figure: calls X→Y (15 min) and Y→X (5 min) combined into one X-Y link (20 min)]
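The construction on this slide (hashed customer identifiers, aggregation over the observation period, reciprocity required for a link) can be sketched as follows. The salt, the record layout and the choice of total call duration as link weight are assumptions made for illustration; the call records are invented.

```python
import hashlib
from collections import defaultdict

SALT = b"hypothetical-secret-salt"  # assumption: kept by the operator, never published

def pseudonymize(subscriber_id):
    """Replace an identifier by a salted hash code (shortened for readability)."""
    return hashlib.sha256(SALT + subscriber_id.encode("utf-8")).hexdigest()[:12]

def reciprocal_links(call_records):
    """call_records: iterable of (caller, callee, duration_minutes).
    Returns undirected links, kept only if calls went in both directions,
    weighted by the total duration in both directions."""
    directed = defaultdict(float)
    for caller, callee, minutes in call_records:
        if caller != callee:
            directed[(pseudonymize(caller), pseudonymize(callee))] += minutes
    links = {}
    for (x, y), minutes_xy in directed.items():
        # Visit each unordered pair once and require the reverse direction.
        if x < y and (y, x) in directed:
            links[(x, y)] = minutes_xy + directed[(y, x)]
    return links

# Hypothetical records: X and Y call each other, X also calls Z once.
calls = [("X", "Y", 15), ("Y", "X", 5), ("X", "Z", 3)]
links = reciprocal_links(calls)
```

The one-way call to Z produces no link. Note that a salted hash only hides the identifier itself; the structure of the network and any attached attributes can still single out individuals.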
Huge network: proxy for network at societal level
• Largest connected component dominates
• 3.9M / 4.6M nodes
• 6.5M / 7.0M links
The study revealed the structure of the network, the interplay between weights and communities, and the relations between local, mesoscopic and global structure (see J.-P. Onnela's talk).
It is possible to ask unprecedented questions and even find the answers to them.
New data (continuously supplied):
• records of each call, SMS, MMS
• information about subscribers' age, gender, ZIP code
New studies started on data from
• Belgium (+ information about the location of the call)
• France, Hungary (fixed lines)
• India
With some effort, individuals could be identified!
No data sharing possible: confidentiality agreement with the provider. Contracts regulate publication rights as in an industrial R&D project.
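Why individuals could be identified "with some effort" can be made concrete by counting how many records share the same combination of attributes such as age, gender and ZIP code. The sketch below is a simple uniqueness check in the spirit of k-anonymity; the four records and the field names are invented for illustration.

```python
from collections import Counter

def anonymity_set_sizes(records, quasi_identifiers):
    """For each record, the number of records sharing its quasi-identifier values."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    counts = Counter(keys)
    return [counts[key] for key in keys]

def fraction_unique(records, quasi_identifiers):
    """Share of records that are the only one with their attribute combination."""
    sizes = anonymity_set_sizes(records, quasi_identifiers)
    return sum(1 for size in sizes if size == 1) / len(sizes)

# Hypothetical subscriber attributes.
subscribers = [
    {"age": 34, "gender": "F", "zip": "1111"},
    {"age": 34, "gender": "F", "zip": "1111"},
    {"age": 34, "gender": "F", "zip": "1052"},
    {"age": 61, "gender": "M", "zip": "1111"},
]

unique_by_gender = fraction_unique(subscribers, ["gender"])
unique_by_all = fraction_unique(subscribers, ["age", "gender", "zip"])
k_all = min(anonymity_set_sizes(subscribers, ["age", "gender", "zip"]))
```

Each added attribute splits the population into smaller groups, so the share of unique (and therefore potentially re-identifiable) records can only grow.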
eBay data
I. Yang, E. Oh, B. Kahng: Phys. Rev. E 74, 016121 (2006)
[Figure: panels 1) and 2)]
A: collectibles; B: clothing, sport, office; C: home decoration, electronics; D: art, hobby; E: books, toys; F: valuables (jewelry, stamps, …)
The traditional classification scheme (2) can be improved by a hierarchical agglomeration algorithm (1)
Where is George? Zip code
("Where is George") The scaling laws of human travel
D. Brockmann, L. Hufnagel and T. Geisel, Nature 439, 462-465 (26 January 2006), doi:10.1038/nature04292
Diapers and beer
Standard story in data mining courses: an investigation of 1.2M consumer baskets at Osco Drug showed that between 5 and 7 pm a significant number of customers bought diapers and beer together (suggesting that bored young fathers were sent to the shop).
(It is an urban legend that, as a consequence, the management had diapers and beer placed closer to each other. But they could have…)
One should have no illusions about the (mis)use of point collecting cards, big prize promotions, etc.
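The kind of basket analysis behind this story is usually expressed through the "lift" of two items: how much more often they are bought together than expected if purchases were independent. A minimal sketch with invented baskets (not the Osco Drug data):

```python
def lift(baskets, item_a, item_b):
    """Lift = P(a and b) / (P(a) * P(b)); values above 1 mean the two items
    co-occur more often than independent purchases would suggest."""
    n = len(baskets)
    count_a = sum(1 for basket in baskets if item_a in basket)
    count_b = sum(1 for basket in baskets if item_b in basket)
    count_ab = sum(1 for basket in baskets if item_a in basket and item_b in basket)
    if count_a == 0 or count_b == 0:
        return 0.0
    # Algebraically the same as (count_ab/n) / ((count_a/n) * (count_b/n)).
    return count_ab * n / (count_a * count_b)

# Hypothetical evening baskets.
baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"milk"},
    {"bread"},
    {"beer"},
    {"milk", "bread"},
]

diapers_beer_lift = lift(baskets, "diapers", "beer")
diapers_bread_lift = lift(baskets, "diapers", "bread")
```

In the toy data diapers and beer appear together twice as often as independence would predict, while diapers and bread never co-occur. With a loyalty card attached to each basket, the same computation can be run per customer, which is where the privacy concern starts.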
LAMENTS OF A SORROWFUL MAN
They've entered me in books of every kind,
I'm registered and checked in every way.
I'm kept in musty, ink-stained offices,
in folders that are growing grizzly-grey.
Oh, gnashing of teeth, oh, humiliation,
that I am captive till my dying day,
that they dispose of me from top to toe,
that I am just a record, filed away.
I'd much prefer to live in the Sahara
or rot beneath a mound of heavy clay,
for I am kept in books of every kind,
and registered and checked in every way.
D. Kosztolányi, 1924
Ethical issues
• Google has all the tools to be Big Brother. It has control over your
  – clicks (interest, taste, purchases, pictures…)
  – mail
  – travel plans
  – etc.
• These data would be of much interest for research, but they contain too much information. Google definitely uses them, e.g. for targeted advertising.
"When web provider AOL's research division published an analysis of search behaviour on the Internet last year, it had what it thought was a bright idea: it would reach out to academics by making an anonymized version of the data freely available for download from its website. But within hours, it had to pull the site, after bloggers managed to infer many identities from the data and view the associated search histories."
NATURE | Vol 449 | 11 October 2007
Two problems related to "computational social science":
• i) Privacy issues
  Data are not produced for scientific evaluation, in contrast to questionnaires, where the target person can decide about delivering data, or cases where data handling is expected. Moreover, in the latter case the utilization of the data is strongly regulated by law and by organizations (e.g. Consortium for Political and Social Research).
• ii) Controllability and reproducibility of research
  Since data are not public (sometimes even the actual source must not be named), the general criterion of controllability of scientific research is violated. As the AOL example shows, this is related to i), or to commercial interests. A good counterexample is the Enron Email Dataset, which can serve as a benchmark for related studies.
Measures?
So far no real scandal… caused by scientific use of data. Institutional framework needed?
Putting People on the Map: Protecting Confidentiality with Linked Social-Spatial Data, www.nap.edu/catalog/11865.html (Natl Acad. Sci., Washington DC, 2007) concluded:
"Institutional solutions involve establishing tiers of risk and access, and developing data-sharing protocols that match the level of access to the risks and benefits of the planned research."
However, "Businesses seem more prone to misuse private data than scientists of any stripe." (Marshall Van Alstyne, BU)
But "trust is of crucial importance to the contract between scientific expertise and the broader society that supports it" (NATURE editorial, October 2007)
Summary:
• Fantastic new possibilities for computational social science
• Multidisciplinary efforts needed
• More open, shared data needed. Benchmarking. Experiments??? Artificial data?
• Ethical and legal issues: privacy, commercial interest and scientific reproducibility
• Institutionalization?
• Surveys cannot be substituted!