We Are More Than Our Features

Craig Evans. CAS587 – Culture As Data. Project Results. 4 December 2012.



Presentation Transcript


  1. Craig Evans • CAS587 – Culture As Data • Project Results • 4 December 2012 • We Are More Than Our Features

  2. Challenge: Finding the Right Data Set • Wide variety of data types presented • Global, national, local • Big data, personal data • Discussed varying technologies • Data mining • Text mining • Machine learning • Visualisation • All very abstract …

  3. Motivation: Something Personal/Relatable • Never lose sight of the data • It's not about the technology • Technology is a tool, not an endpoint • Choose data that we can all see something in • So …

  4. Goal • CAS587 is an interdisciplinary class • We have different interests/focus – do they come out through our analysis of the readings? • Analyse the writings of the CAS587 class, and see if there is any apparent trend in their writing.

  5. Importance … • To the student: • Who else in the class has a similar interest? • Who has expressed skills that are complementary? • Who would you reach out to in order to build a team later? • To the instructor: • Has the right message been communicated? • Have your goals in educating the class been met? • To the wider population: • This is an example of how data can be used in an unintended way. Would you write differently if you knew the text was going to be used for this purpose? • Would you choose to post anonymously instead?

  6. Data Appropriateness • It is a “raw” data set • No prior preprocessing • It is not what the data was intended for • It is a little “random” in nature – not a traditional structured dataset found in an online repository

  7. CAS587 – The Data Set • Starts as a PDF file • Converted to standard ASCII text file • Manual cleanup of data required • Removal of heading/footer information • Result? • 150 files • 96677 words • 1150150 chars

  8. The Process • 1. PDFs submitted to CAS587 website • 2. Results exported to plain text files (used trial version of publicly available PDF2Text tool) • 3. Results imported to database (mySQL) • 4. Results analyzed in custom Java application: text parsed to individual words; text stemmed using WordNet; tf*idf weightings used to generate keywords per person/article; if time permits, run sentiment analysis over corpus • 5. Results returned to database • 6. Results returned to Excel / visualisation tool (Excel is easy, but once the data is processed, I can have some fun with the visualisation)
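The keyword step of the process (parse text to words, weight by tf*idf, keep the top terms per article) can be sketched roughly as follows. This is an illustration in Python rather than the custom Java application the slides mention; the function names `tokenize` and `top_keywords`, the regular-expression tokenizer, the base-10 logarithm, and the two sample texts are all assumptions, and the WordNet stemming step is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphabetic runs count as words; no stemming is applied.
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(docs, k=3):
    """Map each document name to its k highest-scoring tf*idf terms."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    result = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values())
        scores = {
            term: (n / total) * math.log10(n_docs / doc_freq[term])
            for term, n in term_counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[name] = ranked[:k]
    return result


# Hypothetical two-document corpus, made up for illustration only.
sample_docs = {
    "week7": "prediction of customer habit and customer coupon",
    "week10": "privacy policy and privacy setting of public photo",
}
keywords = top_keywords(sample_docs, k=1)
print(keywords)
```

Words that occur in every document ("of", "and" in the sample) get an idf of zero, so they drop out of the ranking without a separate stop-word list.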

  9. tf*idf … term frequency × inverse document frequency (from Wikipedia) … a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is used as a weighting factor in information retrieval and text mining. The tf*idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others. Example: Consider a document containing 100 words where the word cow appears 3 times. The term frequency (TF) for cow is then (3 / 100) = 0.03. Now, assume we have 10 million documents and cow appears in one thousand of these. Then the inverse document frequency, using a base-10 logarithm, is log(10 000 000 / 1 000) = 4. The tf*idf score is the product of these quantities: 0.03 × 4 = 0.12.
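The cow example can be checked with a few lines of Python. This is a minimal sketch: the helper names `term_frequency` and `inverse_document_frequency` are chosen here for illustration, and the base-10 logarithm is inferred from the slide's result of 4.

```python
import math


def term_frequency(term_count, doc_length):
    # Share of the document's words taken up by the term.
    return term_count / doc_length


def inverse_document_frequency(total_docs, docs_with_term):
    # Base-10 logarithm, which is what reproduces the slide's value of 4.
    return math.log10(total_docs / docs_with_term)


cow_tf = term_frequency(3, 100)
cow_idf = inverse_document_frequency(10_000_000, 1_000)
cow_score = cow_tf * cow_idf
print(cow_tf, cow_idf, round(cow_score, 2))
```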

  10. The Class: Weeks 2-6 • Week 2: What is Culture as Data? • filter, comparative, autism, scholarship, writer, closed, overload, library, net, outward, air, inside, coin, ecology, region • Week 3: Social Media - culture, trends, and data • activism, movie, stock, flu, market, happy, mood, tweet, Trends, predict, happier, weak, Democrats, happiness, television • Week 4: Visualization, the challenges of visualizing culture - the challenges of manipulating large amounts of data • visualization, template, analyst, analytic, seer, visual, computing, dot, cloud, distort, manipulate, viewer, map, lie, trap • Week 5: Books, Music, Images, Movies • music, dementia, rating, alzheimer, movie, taste, playlist, political, novel, Books, musical, affiliation, preference, listen, writing • Week 6: Data as Culture: Curating, Scrubbing, and Sampling • classification, hire, narrative, card, database, replicate, icd, scientific, decline, finding, poetic, viscosity, replication, solution, electronic

  11. The Class: Weeks 7-11 • Week 7: Prediction • customer, habit, pregnant, economy, economics, coupon, routine, cue, prediction, purchasing, evaluation, trigger • Week 8: Personal data online. Conversations and Persistence. Interpretations of personal data. • Spider, thesis, speaker, oatmeal, report, communicative, annual, persona, public, email, private, eat, analyzeword, mouth, wife • Week 9: History of Big Data Critiques • skull, friction, reductionism, craniology, maturity, downfall, shimmering, positivism, introspectometer, domain, inaccurate, conflict, economics, igy, dominate • Week 10: Life After Privacy • obfuscation, protect, privacy, car, policy, setting, private, default, public, option, breach, anonymize, identifiable, regulation, photo • Week 11: Art as Data; Data as Art • art, wind, transfinite, installation, artistic, cascade, choir, hint, visualization, rose, color, contents, flow, beautiful

  12. Picking on an Individual: Craig Evans – Total Corpus • Keywords from total corpus • cent, visualisation, suspect, secondary, teach, zip, irb, material, illustrate, interestingly, openly, playlist, artwork, profile, century, experience, lose, computationally, reuse • Most negative sentiment … • not, lose, suspect, base, dementia, secondary, paranoid, bias, present, present, disturbing, insufficient, paranoia, difficult, number • Most positive sentiment … • model, interesting, good, well, better, researcher, accurate, aware, time, time, beneficial, enable, teach, illustrate, find, method, read, add, excellent, art

  13. Picking on an Individual: Craig Evans – Week 7 • Week 7: Prediction … Keywords • customer, habit, pregnant, economy, economics, coupon, routine, cue, prediction, purchasing, evaluation, trigger • Keywords against rest of corpus • model, influence, buying, paper, predictive, joke, series, economist, valid, pregnant, resource, woman, link • Most negative sentiment … • bias, difficult, not, base, nefarious, invalid, defunct, savage, hard, blue, miss, number, scale, pregnant • Most positive sentiment … • model, find, color, joke, read, sound, accurate, interesting, valid, valid, privacy, improve, influence, compare, reasoning, group, improvement, absolute

  14. CAS587 Wordle – Just for Karrie

  15. Questions?