
Series-O-Rama: Search & Recommend TV series with SQL (http://bit.ly/series-o-rama2012)



Presentation Transcript


  1. Series-O-Rama: Search & Recommend TV series with SQL. Guillaume Cabanac (guillaume.cabanac@univ-tlse3.fr). http://bit.ly/series-o-rama2012. March 27th, 2012.

  2. Toulouse: A Picture is Worth a Thousand Words
     • Toulouse: population 437 000, students 97 000
     • Aberdeen: population 210 400, students ?? ???
     • Map labels: Capbreton (3h ride), Ax-les-Thermes (1h40 ride), Collioure (2h30 ride)

  3. Telly Addicts Need Help to Find TV Series
     • Main topics of Grey’s Anatomy? → Text mining, visualization
     • Series about ‘plane crash island’? → Search engine
     • What should I watch next? → Recommender system
     • Image sources: en.wikipedia.org, amazon.com

  4. Text Mining: Let’s Crunch Subtitles
     • Examples shown: Grey’s Anatomy, Cold Case
     • Main topics of Grey’s Anatomy? → Text mining, visualization
     • Series about ‘plane crash island’? → Search engine
     • What should I watch next? → Recommender system

  5. What’s in a Subtitle File?
     • File name: Title – Season – Episode – Language.srt
     • 1 episode = 1 plain text file
     • Synchronization: start --> stop
     • Dialogue
     • → We can easily extract words:
       [ a, again*2, and, but, com, cuban, different, favorite, food, for*2, forum, going, great, happen*2, has, hungry, i*2, is, it, love, m, my, nice, night*2, miami, now, pork, s*2, sandwiches, something, the, to*2, tonight, town, www ]
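The word extraction described on this slide can be sketched in a few lines of Python. This is only an illustration, not the Java/Oracle code behind Series-O-Rama: the `SRT_SAMPLE` text is made up (it loosely echoes the words listed on the slide), and `extract_words` is a hypothetical helper name.

```python
import re
from collections import Counter

# Made-up two-cue subtitle sample in the usual .srt layout:
# cue number, "start --> stop" timing line, then dialogue.
SRT_SAMPLE = """\
1
00:00:01,000 --> 00:00:03,500
I'm hungry. I love Cuban food.

2
00:00:04,000 --> 00:00:06,000
Pork sandwiches are my favorite.
"""

TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def extract_words(srt_text):
    """Return the lowercase words found in the dialogue lines of an .srt text."""
    words = []
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers and timing lines.
        # (A dialogue line made only of digits would be skipped too.)
        if not line or line.isdigit() or TIMING.match(line):
            continue
        words.extend(re.findall(r"[a-z]+", line.lower()))
    return words

word_counts = Counter(extract_words(SRT_SAMPLE))
```

Splitting on non-letters turns "I'm" into `i` and `m`, which is consistent with the stray `m` and `s` tokens in the slide's word list.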

  6. DB technology at Work! [Home]
     • 7 527 files = 337 MB
     • 100% Java and Oracle

  7. DB technology at Work! [Search engine]
     • Ranked list of results

  8. DB technology at Work! [Infos]
     • Most popular terms
     • Most related series

  9. DB technology at Work! [Recommendations]

  10. DB technology at Work! [Recommendations]
      • I liked
      • I disliked
      • What should I watch next?

  11. DB technology at Work! [Recommendations]
      • Ranked list of recommendations

  12. How Does this Work?

  13. Architecture and Data Model
      • Offline: subtitles → indexing → DB
      • Online: DB → GUI (browsing, searching, recommending)
      • Series = { idS, name }: (12, Lost), (45, Dexter), (45, ????)
      • Dict = { idT, term }: (8, plane), (27, killer), (29, crash)
      • Posting = { idT*, idS*, nb }: (27, 45, 89), (8, 45, 3), (8, 12, 90)

  14. Theory → Text Indexing Pipeline
      • Tokenization + lowercase: [the, plane, crashed, ..., planes, ..., is]
      • Stopwords removal: [plane, crashed, ..., planes, ...]
      • Stemming: [plane, crash, ..., plane, ...], with Porter’s Stemmer (1980), http://qaa.ath.cx/porter_js_demo.html
      • Counting: {(plane, 48), (crash, 15) ...}
      • Sample text: In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon’s Hospital was born. In 1881 this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys girls and adults in primary secondary mechanical and other subjects …
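The four pipeline stages can be sketched as follows. This is a rough illustration under stated assumptions: the stopword list is a tiny hand-picked set, and `toy_stem` is a crude suffix stripper used only as a stand-in for Porter's stemmer (it is not equivalent to it). All names and the sample sentence are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one used by the author.
STOPWORDS = {"the", "is", "a", "on", "and"}

def toy_stem(word):
    """Crude suffix stripping; a stand-in for Porter's stemmer, NOT equivalent."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_text(text):
    """Tokenize + lowercase, remove stopwords, stem, then count."""
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenization + lowercase
    content = [t for t in tokens if t not in STOPWORDS]  # stopwords removal
    stems = [toy_stem(t) for t in content]               # stemming
    return Counter(stems)                                # counting

counts = index_text("The plane crashed. Planes crash. The plane is lost.")
```

On this sample, `plane` and `planes` fall into one bucket and `crashed` and `crash` into another, which mirrors the slide's [plane, crash, ..., plane, ...] step.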

  15. Theory → Similarity of Paired Series
      • Dice’s Coefficient (1945), based on set theory
      • Example: let us model a series as a set of terms
        • House = {hospital, doctor, crazy, psycho}
        • Grey’s = {doctor, care, hospital}
      • A big limitation: the distribution of terms among series is ignored. It makes no difference whether a term occurs 1 time or 1,000,000 times.
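As a worked example, Dice's coefficient for two sets A and B is 2·|A ∩ B| / (|A| + |B|). The sketch below applies that standard definition to the two sets given on the slide; the function name `dice` is only illustrative.

```python
def dice(a, b):
    """Dice's coefficient of two sets: 2 * |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))

house = {"hospital", "doctor", "crazy", "psycho"}
greys = {"doctor", "care", "hospital"}

# Shared terms: {hospital, doctor}, so the score is 2 * 2 / (4 + 3) = 4/7.
score = dice(house, greys)
```

Because only set membership is used, repeating "doctor" a thousand times in one series would leave the score unchanged, which is the limitation the slide points out.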

  16. Theory → Vector Space Model, Term Weighting
      • [Figure: term counts per series over the vocabulary, with each series’ max marked]
      • Raw TF for ‘survive’: dexter > lost
      • Normalization, TF / max(TF): dexter < lost
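A small numeric sketch of why dividing by max(TF) can flip a ranking. The counts below are hypothetical and chosen only to show the effect; they are not taken from the Series-O-Rama data.

```python
def normalized_tf(nb, max_nb):
    """Term frequency scaled by the count of the series' most frequent term."""
    return nb / max_nb

# Hypothetical counts for one term in two series:
# nb = occurrences of the term, max_nb = occurrences of the series' top term.
counts = {
    "series_a": {"nb": 50, "max_nb": 1000},  # long-running series, many subtitles
    "series_b": {"nb": 40, "max_nb": 400},   # shorter series
}

raw = {name: c["nb"] for name, c in counts.items()}
norm = {name: normalized_tf(c["nb"], c["max_nb"]) for name, c in counts.items()}

# Raw TF favors series_a (50 > 40); normalized TF favors series_b (0.05 < 0.10).
```

Normalizing keeps a series with many episodes from dominating just because it has more text.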

  17. Theory → Best Match Retrieval
      • 1 TV series = 1 vector (components indexed by term: 1, ..., 45, ..., 1467, ..., 6790, ..., n)
      • Now, we know how to:
        • Find most popular terms for a TV series
        • Compute similarity between TV series
        • Find TV series matching a query

  18. Theory → More on Term Weighting
      • 1 TV series = 1 vector (components indexed by term: 1, ..., 45, ..., 1467, ..., 6790, ..., n)
      • All terms are supposed to be equally representative …
      • … but ‘survive’ is way more unusual than ‘people’
      • → ‘survive’ better represents Lost than ‘people’ does
      • IDF: Inverse Document Frequency
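The transcript does not show which IDF formula the slides use. A minimal sketch, assuming the common textbook form idf(t) = log(N / df(t)), where N is the number of series and df(t) the number of series containing term t. The collection size and document frequencies below are hypothetical.

```python
import math

def idf(n_series, n_series_with_term):
    """Inverse document frequency, assumed here to be log(N / df)."""
    return math.log(n_series / n_series_with_term)

N_SERIES = 100                        # hypothetical collection size
df = {"people": 98, "survive": 12}    # hypothetical document frequencies

weights = {term: idf(N_SERIES, n) for term, n in df.items()}
# 'survive' occurs in fewer series than 'people', so it gets the larger weight.
```

A term present in every series gets weight 0 under this formula, so it contributes nothing to a ranking.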

  19. Theory → The Big Picture: TF*IDF
      • An important term for series S is frequent in S and globally unusual.
      • 1 TV series = 1 vector
      • Some limitations:
        • Term positions? e.g., “ice truck killer” in Dexter
        • Stemming? e.g., christmas
        • Mixture of languages? e.g., amusant (FR) vs. fun (EN)

  20. Theory … and Practice
      • Series = { idS, name, maxNb }: (12, Lost, 540), (45, Dexter, 125)
      • Dict = { idT, term, idf }: (8, plane, 1.25), (27, killer, 2.87), (29, crash, 3.07)
      • Posting = { idT*, idS*, nb, tf }: (27, 45, 89, 0.71), (8, 45, 3, 0.02), (8, 12, 90, 0.16)
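The three relations can be tried out with Python's built-in `sqlite3` module. This is a sketch only: the original system runs on Oracle, the column types and constraints are assumptions, and the rule tf = nb / maxNb is inferred from the figures on the slides (for example 89 / 125 ≈ 0.71).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict   (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (
    idT INTEGER REFERENCES Dict(idT),
    idS INTEGER REFERENCES Series(idS),
    nb  INTEGER,   -- occurrences of the term in the series
    tf  REAL,      -- normalized term frequency, filled in below
    PRIMARY KEY (idT, idS)
);
""")

# Rows copied from the slide.
conn.executemany("INSERT INTO Series VALUES (?, ?, ?)",
                 [(12, "Lost", 540), (45, "Dexter", 125)])
conn.executemany("INSERT INTO Dict VALUES (?, ?, ?)",
                 [(8, "plane", 1.25), (27, "killer", 2.87), (29, "crash", 3.07)])
conn.executemany("INSERT INTO Posting (idT, idS, nb) VALUES (?, ?, ?)",
                 [(27, 45, 89), (8, 45, 3), (8, 12, 90)])

# tf = nb / maxNb, computed with a correlated subquery.
conn.execute("""
UPDATE Posting
SET tf = 1.0 * nb / (SELECT maxNb FROM Series WHERE Series.idS = Posting.idS)
""")

# Join the three tables to read the postings back with names attached.
rows = conn.execute("""
SELECT s.name, d.term, p.nb, p.tf
FROM Posting p
JOIN Series s ON s.idS = p.idS
JOIN Dict   d ON d.idT = p.idT
ORDER BY p.tf DESC
""").fetchall()
```

The computed values are unrounded (0.712, 0.1667, 0.024), whereas the slide shows two decimals.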

  21. Description of a TV Series
      • Example: Lost, computed with a join (⋈)
      • → Many surnames need to be filtered out

  22. Retrieval of TV Series → queries with 1 term
      • Query: survive, computed with a join (⋈)
      • Importance of normalization:
        • Stargate Atlantis: nb/maxNb = 63/1116 = 0.05645
        • Blade: nb/maxNb = 9/163 = 0.05521

  23. Retrieval of TV Series → queries with n terms
      • Query: survive mulder, computed with a join (⋈); each row reads term | tf | idf
      • 67 | The Vampire Diaries
        • survive | 0.028 | 0.107 = 0.028 * 0.107 = 0.003
        • mulder | 0.007 | 3.977 = 0.007 * 3.977 = 0.028
        • sum = 0.031
      • 18 | X-Files
        • survive | 0.014 | 0.107 = 0.014 * 0.107 = 0.001
        • mulder | 1.000 | 3.977 = 1.000 * 3.977 = 3.977
        • sum = 3.978
      • ⁞
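The scoring on this slide, a sum of tf * idf over the query terms, can be reproduced with the figures it quotes. The Python below is only a sketch (the deck itself does this in SQL with joins), and the names `TF`, `IDF` and `score` are illustrative.

```python
# tf and idf values as quoted on the slide.
TF = {
    ("The Vampire Diaries", "survive"): 0.028,
    ("The Vampire Diaries", "mulder"): 0.007,
    ("X-Files", "survive"): 0.014,
    ("X-Files", "mulder"): 1.000,
}
IDF = {"survive": 0.107, "mulder": 3.977}

def score(series, query_terms):
    """Sum of tf * idf over the query terms; missing terms contribute 0."""
    return sum(TF.get((series, t), 0.0) * IDF.get(t, 0.0) for t in query_terms)

query = ["survive", "mulder"]
series_names = {name for name, _ in TF}
ranking = sorted(((score(name, query), name) for name in series_names), reverse=True)
# X-Files ranks first (about 3.978), The Vampire Diaries second (about 0.031).
```

The rare term `mulder` dominates the result: its idf is roughly 37 times that of `survive`.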

  24. Computing Similarities Among TV Series 1/2
      • First, let’s compute the numerator, where:
        • Ai = terms from House
        • Bi = terms from another TV series
      • Ai and Bi are paired with a join (⋈)
      • Result: similar to House?

  25. Computing Similarities Among TV Series 2/2
      • Three joins (⋈ ⋈ ⋈) combined
      • Result: similar to House?
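The transcript does not name the similarity measure; its formula was an image. The "numerator" over paired terms Ai and Bi is consistent with cosine similarity, so the sketch below assumes that measure: sum(Ai * Bi) / (sqrt(sum(Ai²)) * sqrt(sum(Bi²))). The term weights are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    # Numerator: only terms present in both series contribute (the join on terms).
    numerator = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)

# Hypothetical tf*idf weights, for illustration only.
house = {"hospital": 0.9, "doctor": 0.8, "psycho": 0.3}
others = {
    "Grey's Anatomy": {"hospital": 0.7, "doctor": 0.9, "care": 0.4},
    "Lost": {"plane": 0.8, "crash": 0.6, "doctor": 0.1},
}

similar_to_house = sorted(
    ((cosine(house, vec), name) for name, vec in others.items()), reverse=True
)
```

In SQL terms, the numerator corresponds to joining the postings of House with those of another series on the term id and summing the products. The two norms need one aggregate per series.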

  26. Thank you http://www.irit.fr/~Guillaume.Cabanac
