
Flickr Tag Analysis



  1. Flickr Tag Analysis Ahmet Iscen

  2. Outline • Social Media • What is Flickr? • Flickr Photos • Association Rule • Latent Semantic Analysis • Latent Dirichlet Allocation • Conclusions

  3. Social Media • An important part of daily life today • Twitter's user base would make it the 12th largest country in the world • Two new members sign up to LinkedIn every second

  4. What is Flickr? • Image and video hosting • Acquired by Yahoo! in 2005 • 51 million registered members and 80 million unique visitors as of June 2011 • About 6 billion photos • Widely used by researchers

  5. Flickr

  6. Dataset • Xirong Li's Flickr-3.5M Dataset • 3,500,000 images • 570,000 unique tags • 270,000 unique user-ids • Randomly selected 250,000 images with their tags http://staff.science.uva.nl/~xirong/index.php?n=DataSet.Flickr3m

  7. Challenges • Tags totally depend on the user • Can be extremely noisy • Huge range of possible words • Examples: milos tasic milosevrodjendan verjaardagmilos desember 2005 tmo

  8. Preprocessing • Eliminate stopwords (a, for, the, etc.) • Eliminate extreme words (those that appear in fewer than 20 photos or in more than 80% of the photos) • Porter Stemmer (only for association rules) • Convert everything to lowercase • Eliminate tags with fewer than 2 letters or more than 20 letters • Eliminate numerical tags
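The filters on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the stopword list is a tiny stand-in, the function names `clean_tags` and `drop_extreme_tags` are hypothetical, and the Porter stemming step is left out because the standard library has no stemmer.

```python
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def clean_tags(tags, min_len=2, max_len=20):
    """Lowercase tags, then drop stopwords, numeric tags, and tags of extreme length."""
    cleaned = []
    for tag in tags:
        tag = tag.lower()
        if tag in STOPWORDS or tag.isdigit():
            continue
        if not (min_len <= len(tag) <= max_len):
            continue
        cleaned.append(tag)
    return cleaned

def drop_extreme_tags(photos, min_photos=20, max_fraction=0.8):
    """Drop tags used in fewer than min_photos photos or in more than max_fraction of them."""
    # Count each tag once per photo (document frequency).
    doc_freq = Counter(tag for tags in photos for tag in set(tags))
    limit = max_fraction * len(photos)
    keep = {tag for tag, n in doc_freq.items() if min_photos <= n <= limit}
    return [[tag for tag in tags if tag in keep] for tags in photos]

print(clean_tags(["The", "Paris", "2005", "a", "x", "NYC"]))
```

The thresholds default to the values quoted on the slide (20 photos, 80%, 2 to 20 letters); on a toy corpus `min_photos` has to be lowered or every tag is removed.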

  9. Association Rules Mining • Rapid Miner
      [york] --> [new] (confidence: 0.910) Support: 0.04
      [geolat, geolon] --> [geotag] (confidence: 0.986) Support: 0.03
      [hors, lotharlez] --> [caballo, cheval, hestur] (confidence: 0.846) Support: 0.03
      [paard] --> [hors, lotharlenz, zirg] (confidence: 0.802) Support: 0.03
      [hors, paard] --> [lotharlenz, zirg] (confidence: 0.802) Support: 0.03
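For readers unfamiliar with the two numbers attached to each rule, here is a minimal sketch of how support and confidence are usually defined. The toy photo list and the helper names `support` and `confidence` are made up for illustration; this is not how Rapid Miner computes them internally.

```python
def support(photos, itemset):
    """Fraction of photos whose tag set contains every tag in itemset."""
    itemset = set(itemset)
    hits = sum(1 for tags in photos if itemset <= set(tags))
    return hits / len(photos)

def confidence(photos, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    base = support(photos, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule is undefined
    return support(photos, set(antecedent) | set(consequent)) / base

# Hypothetical tag lists, one per photo.
photos = [
    ["new", "york", "city"],
    ["new", "york"],
    ["york", "england"],
    ["new", "zealand"],
    ["paris"],
]

print(support(photos, ["new", "york"]))       # share of photos with both tags
print(confidence(photos, ["york"], ["new"]))  # how often "york" comes with "new"
```

Read this way, the rule [york] --> [new] on the slide says that about 91% of photos tagged "york" are also tagged "new".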

  10. Association Rules Mining • Poor results • Probably due to noise and variance in the data • Takes too much time to process the words and find rules • Need to find alternative methods

  11. Latent Semantic Analysis • Same as LSI (the name used in the IR field) • SVD on the term-document matrix to reduce dimensionality • Words are compared by taking the cosine of the angle between the vectors formed by any two rows
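The idea can be illustrated on a toy matrix with NumPy. This is a sketch under assumptions: the four tags and seven photos are invented, rows are terms and columns are photos, and only `k = 2` latent dimensions are kept.

```python
import numpy as np

terms = ["paris", "france", "halloween", "party"]
# Rows are terms, columns are photos; an entry is 1 if the photo carries the tag.
X = np.array([
    [1, 1, 1, 0, 0, 0, 0],  # paris
    [1, 1, 0, 0, 0, 0, 0],  # france
    [0, 0, 0, 1, 1, 1, 1],  # halloween
    [0, 0, 0, 1, 1, 0, 0],  # party
], dtype=float)

k = 2  # number of latent dimensions ("topics") to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)
term_vectors = U[:, :k] * s[:k]  # each row is a term in the k-dimensional space

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_related = cosine(term_vectors[0], term_vectors[1])    # paris vs france
sim_unrelated = cosine(term_vectors[0], term_vectors[2])  # paris vs halloween
print(sim_related, sim_unrelated)
```

Tags that co-occur end up pointing in nearly the same direction, while tags that never share a photo end up close to orthogonal. The sign of each singular vector is arbitrary, which is one reason LSA topic weights can be negative.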

  12. Implementation • Gensim – topic modeling toolkit • Python • Tested different corpus and topic sizes

  13. Latent Semantic Analysis • 250000 photos, 20 topics
      topic #0: 0.997*"wedding" + 0.047*"family" + 0.023*"friends" + 0.022*"party" + 0.019*"reception" + 0.013*"california" + 0.011*"ceremony" + 0.009*"india" + 0.008*"church" + 0.008*"sanfrancisco"
      topic #11: 0.491*"newyork" + -0.463*"china" + 0.448*"nyc" + -0.233*"beach" + 0.174*"newyorkcity" + 0.146*"italy" + -0.132*"friends" + -0.123*"flowers" + 0.119*"new" + -0.117*"beijing"
      topic #4: 0.586*"paris" + -0.524*"family" + 0.417*"france" + 0.186*"london" + 0.178*"party" + -0.169*"halloween" + 0.156*"europe" + -0.121*"japan" + 0.103*"travel" + 0.063*"birthday"
      topic #1: 0.701*"halloween" + 0.588*"party" + 0.169*"friends" + 0.165*"family" + 0.157*"birthday" + 0.126*"japan" + 0.071*"christmas" + 0.059*"london" + 0.058*"travel" + 0.055*"beach"

  14. Latent Semantic Analysis • 250000 photos, 50 topics
      topic #10: -0.655*"friends" + 0.633*"china" + 0.221*"travel" + 0.166*"beijing" + 0.136*"party" + -0.088*"beach" + 0.075*"vacation" + 0.071*"greatwall" + 0.070*"shanghai" + -0.066*"flowers"
      topic #28: -0.580*"india" + -0.323*"trip" + 0.279*"nature" + 0.262*"snow" + -0.258*"dog" + -0.224*"sunset" + 0.200*"winter"
      topic #20: -0.527*"cat" + 0.511*"sunset" + 0.266*"sky" + -0.242*"california" + -0.209*"sanfrancisco" + 0.198*"clouds" + -0.167*"beach" + -0.156*"flower" + -0.149*"cats" + -0.132*"dog"
      topic #17: -0.323*"california" + -0.272*"sanfrancisco" + 0.269*"cat" + 0.254*"horse" + 0.211*"pferd" + 0.207*"cheval" + 0.205*"caballo" + 0.205*"paard" + 0.204*"hest" + 0.204*"cavalo"

  15. Latent Semantic Analysis • 250000 photos, 100 topics
      topic #29: 0.689*"australia" + 0.279*"sydney" + -0.233*"nature" + 0.220*"trip" + -0.209*"france" + -0.187*"india" + -0.175*"snow" + 0.157*"new" + 0.144*"paris" + -0.134*"winter"
      topic #58: 0.401*"geotagged" + 0.385*"geolat" + 0.380*"geolon" + -0.261*"people" + 0.259*"day" + 0.198*"england" + 0.191*"newzealand" + -0.178*"canada" + 0.168*"water" + -0.144*"portrait"
      topic #45: 0.406*"fall" + 0.398*"park" + 0.315*"october" + -0.291*"animals" + 0.289*"autumn" + -0.262*"art" + 0.182*"leaves" + -0.175*"zoo" + -0.163*"sky" + 0.132*"garden"
      topic #85: -0.673*"hongkong" + 0.221*"florida" + 0.221*"singapore" + 0.209*"winter" + 0.174*"museum" + -0.170*"boston" + -0.165*"scotland" + -0.153*"prague" + 0.153*"cats" + -0.136*"island"

  16. Latent Semantic Analysis • Notice the negative weights • They make the topics hard to interpret • LSA is not a probabilistic method

  17. Latent Dirichlet Allocation • Expectation-Maximization • Each document is a mixture of topics • E-step: find the posterior over topics, p(topic t | document d) • M-step: update the assignment of the current word, p(word w | topic t)
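One common way to read the two quantities on this slide together is the mixture form below. The symbol T for the number of topics is an assumption introduced here (20, 50 or 100 in the runs that follow); the slide itself does not write this equation out.

```latex
p(w \mid d) \;=\; \sum_{t=1}^{T} p(w \mid t)\, p(t \mid d)
```

Because both factors are probabilities, every weight in an LDA topic is non-negative and the weights within a topic sum to one, in contrast to the signed LSA weights shown earlier.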

  18. Latent Dirichlet Allocation • 250000 photos, 20 topics
      topic #13: 0.088*party + 0.072*halloween + 0.027*lake + 0.024*boat + 0.022*home + 0.019*park + 0.018*river + 0.016*ice + 0.015*spring + 0.014*birthday
      topic #3: 0.046*trip + 0.044*vacation + 0.044*sanfrancisco + 0.040*california + 0.026*road + 0.024*cats + 0.018*school + 0.018*cruise + 0.014*ca + 0.014*old
      topic #8: 0.051*paris + 0.042*france + 0.027*july + 0.027*4th + 0.025*music + 0.022*car + 0.021*rock + 0.020*dogs + 0.020*concert + 0.016*geotagged

  19. Latent Dirichlet Allocation • 250000 photos, 50 topics
      topic #7: 0.111*sunset + 0.108*beach + 0.089*holiday + 0.047*fun + 0.029*smile + 0.028*forest + 0.023*rose + 0.020*wood + 0.019*disneyland + 0.019*costarica
      topic #14: 0.141*vacation + 0.046*san + 0.037*francisco + 0.034*sports + 0.020*hockey + 0.020*top + 0.019*cake + 0.014*cafe + 0.013*biking + 0.013*ruins
      topic #23: 0.112*trip + 0.070*bridge + 0.057*road + 0.048*blue + 0.048*building + 0.042*film + 0.035*orange + 0.022*university + 0.021*telephone + 0.018*sky
      topic #29: 0.124*party + 0.110*friends + 0.085*christmas + 0.045*rock + 0.038*lake + 0.038*ireland + 0.031*castle + 0.026*africa + 0.025*live + 0.025*music

  20. Latent Dirichlet Allocation • 250000 photos, 100 topics
      topic #10: 0.109*hawaii + 0.093*island + 0.060*la + 0.030*photoshop + 0.027*walk + 0.026*hdr + 0.024*maui + 0.023*us + 0.019*fountain + 0.018*beach
      topic #24: 0.172*house + 0.106*architecture + 0.077*festival + 0.068*airplane + 0.038*flying + 0.029*flight + 0.026*air + 0.025*aircraft + 0.021*aviation + 0.020*airshow
      topic #34: 0.231*vacation + 0.159*trip + 0.136*lake + 0.095*florida + 0.088*birds + 0.062*san + 0.051*francisco + 0.015*yellowstone + 0.015*kayak + 0.015*maltay
      topic #70: 0.114*november + 0.074*thanksgiving + 0.050*soccer + 0.048*polarbear + 0.048*ski + 0.041*basketball + 0.035*safari + 0.034*bear + 0.023*wien + 0.021*flood

  21. Conclusions • LSA and LDA are more useful for analyzing tags than Association Rule Mining • There is no “best” number of topics • Human interpretation still might be required

  22. Future Work • Increase the corpus size to 1,000,000 documents • Analyze Flickr groups as well
