Series-O-Rama: Search & Recommend TV Series with SQL
Guillaume Cabanac, guillaume.cabanac@univ-tlse3.fr
http://bit.ly/series-o-rama2012
March 27th, 2012
Toulouse: A Picture is Worth a Thousand Words
• Toulouse: population 437 000, students 97 000
• Aberdeen: population 210 400, students ?? ???
• Capbreton: 3h ride
• Ax-les-Thermes: 1h40 ride
• Collioure: 2h30 ride
Telly Addicts Need Help to Find TV Series
• Main Topics of Grey’s Anatomy?
  • Text mining, Visualization
• Series about ‘plane crash island’
  • Search engine
• What should I watch next?
  • Recommender system
(Images: en.wikipedia.org, amazon.com)
Text Mining: Let’s Crunch Subtitles
• Main Topics of Grey’s Anatomy?
  • Text mining, Visualization
• Series about ‘plane crash island’
  • Search engine
• What should I watch next?
  • Recommender system
(Figures: Grey’s Anatomy, Cold Case)
What’s in a Subtitle File?
• Title – Season – Episode – Language.srt
  • 1 episode = 1 plain text file
• Synchronization
  • start --> stop
• Dialogue
  • We can easily extract words
  • [ a, again*2, and, but, com, cuban, different, favorite, food, for*2, forum, going, great, happen*2, has, hungry, i*2, is, it, love, m, my, nice, night*2, miami, now, pork, s*2, sandwiches, something, the, to*2, tonight, town, www ]
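The word extraction described on this slide can be sketched in a few lines of Python. This is only an illustration, not the talk's Java code: the two-cue excerpt below is made up, and the `extract_words` helper and the `[a-z]+` tokenization rule are assumptions. That rule splits contractions, which would explain stray tokens such as "m" and "s" in the slide's word list.

```python
import re
from collections import Counter

# Hypothetical two-cue excerpt in SubRip (.srt) layout; the dialogue is made up.
SAMPLE_SRT = """\
1
00:00:01,000 --> 00:00:03,500
I love Cuban food.

2
00:00:04,000 --> 00:00:06,000
Pork sandwiches? I love pork!
"""

# Matches the "start --> stop" synchronization line of a cue.
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def extract_words(srt_text):
    """Return a Counter of lowercased words found in the dialogue lines."""
    counts = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers, and timing lines.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts

words = extract_words(SAMPLE_SRT)
print(sorted(words.items()))
```

One simplification to be aware of: a dialogue line made only of digits would be mistaken for a cue number and skipped.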
DB technology at Work! [Home]
• 7 527 files = 337 MB
• 100% Java and Oracle
DB technology at Work! [Search engine]
• Ranked list of results
DB technology at Work! [Infos]
• Most popular terms
• Most related series
DB technology at Work! [Recommendations]
DB technology at Work! [Recommendations]
• I liked
• I disliked
• What should I watch next?
DB technology at Work! [Recommendations]
• Ranked list of recommendations
How Does this Work?
Architecture and Data Model
• Offline: subtitles, indexing
• DB
• Online: GUI (browsing, searching, recommending)
Series = { idS, name }
  12 | Lost
  45 | Dexter
  45 | ????
Dict = { idT, term }
  8 | plane
  27 | killer
  29 | crash
Posting = { idT*, idS*, nb }
  27 | 45 | 89
  8 | 45 | 3
  8 | 12 | 90
Theory: Text Indexing Pipeline
• Tokenization + lowercase: [the, plane, crashed, ..., planes, ..., is]
• Stopwords removal: [plane, crashed, ..., planes, ...]
• Stemming: [plane, crash, ..., plane, ...]
• Counting: {(plane, 48), (crash, 15) ...}
• Porter’s Stemmer (1980): http://qaa.ath.cx/porter_js_demo.html
• Sample text: In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon’s Hospital was born. In 1881 this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys girls and adults in primary secondary mechanical and other subjects …
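The four pipeline steps can be chained as in the following minimal Python sketch. Everything specific here is an assumption made for illustration: the tiny `STOPWORDS` set, the sample sentence, and above all `toy_stem`, which is a crude suffix stripper standing in for Porter's stemmer and not an implementation of it.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "is", "on", "a", "and", "of"}

def toy_stem(word):
    """Crude suffix stripping; a stand-in for Porter's stemmer, NOT the real algorithm."""
    for suffix in ("ed", "s"):
        # Only strip when at least 3 characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Tokenize, drop stopwords, stem, and count the resulting terms."""
    tokens = re.findall(r"[a-z]+", text.lower())      # tokenization + lowercase
    kept = [t for t in tokens if t not in STOPWORDS]  # stopwords removal
    stems = [toy_stem(t) for t in kept]               # stemming
    return Counter(stems)                             # counting

counts = index_terms("The plane crashed on the island. The planes crashed.")
print(counts.most_common())
```

With this made-up sentence, "planes" and "plane" collapse to the same term, and so do the two occurrences of "crashed", which is the effect the slide illustrates.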
Theory: Similarity of Paired Series
• Dice’s Coefficient (1945)
  • Based on set theory
• Example: let us model a series as a set of terms
  • House = {hospital, doctor, crazy, psycho}
  • Grey’s = {doctor, care, hospital}
• A big limitation
  • The distribution of terms among series is ignored
  • It makes no difference whether a term occurs 1 time or 1,000,000 times
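As a worked example, Dice's coefficient is 2·|A ∩ B| / (|A| + |B|). Applied to the two sets on this slide, it can be computed as in the sketch below; the `dice` function name and the convention for two empty sets are choices made here, not taken from the talk.

```python
def dice(a, b):
    """Dice's coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))

house = {"hospital", "doctor", "crazy", "psycho"}
greys = {"doctor", "care", "hospital"}

# Shared terms: hospital, doctor -> 2 * 2 / (4 + 3) = 4/7
print(round(dice(house, greys), 3))
```

The result is 4/7, about 0.571, and it would stay the same however often "doctor" occurs in either series, which is the limitation the slide points out.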
Theory: Vector Space Model, Term Weighting
• Vocabulary (figure, with four entries labeled max)
• Raw TF: survive?
• Normalization: TF / max(TF)
• dexter > lost
• dexter < lost
Theory: Best Match Retrieval
• 1 TV series = 1 vector (figure: components indexed 1, 45, 1467, 6790, n)
• Now, we know how to:
  • Find most popular terms for a TV series
  • Compute similarity between TV series
  • Find TV series matching a query
Theory: More on Term Weighting
• 1 TV series = 1 vector (figure: components indexed 1, 45, 1467, 6790, n)
• All terms are supposed to be equally representative …
• … but ‘survive’ is way more unusual than ‘people’
• ‘survive’ better represents Lost than ‘people’ does
• IDF: Inverse Document Frequency
Theory: The Big Picture: TF*IDF
• An important term for series S is frequent in S and globally unusual.
• 1 TV series = 1 vector
• Some limitations
  • Term positions? e.g., “ice truck killer” in Dexter
  • Stemming? e.g., christmas
  • Mixture of languages? e.g., amusant (FR) vs. fun (EN)
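A minimal Python sketch of TF*IDF weighting follows. The term counts are made up, and the IDF formula did not survive in this transcript, so the common variant idf = ln(N / df) is assumed here (N = number of series, df = number of series containing the term). TF is normalized as nb / max(nb), as on the slides.

```python
import math

# Hypothetical term counts per series (made-up numbers, for illustration only).
counts = {
    "Lost":   {"survive": 30, "people": 60, "plane": 48},
    "Dexter": {"killer": 89, "people": 125},
    "House":  {"doctor": 70, "people": 80},
}

def tf_idf(counts):
    """Return {series: {term: tf * idf}} for the given raw counts."""
    n_series = len(counts)
    # df[t] = number of series in which term t occurs
    df = {}
    for terms in counts.values():
        for t in terms:
            df[t] = df.get(t, 0) + 1
    weights = {}
    for series, terms in counts.items():
        max_nb = max(terms.values())
        weights[series] = {
            # tf = nb / max(nb); idf = ln(N / df) is the assumed variant
            t: (nb / max_nb) * math.log(n_series / df[t])
            for t, nb in terms.items()
        }
    return weights

w = tf_idf(counts)
for term, weight in sorted(w["Lost"].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In this toy data, 'people' occurs in every series, so its IDF is ln(3/3) = 0 and it drops to the bottom of the Lost vector even though it is the most frequent term there, while 'survive' and 'plane' keep a positive weight.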
Theory … and Practice
Series = { idS, name, maxNb }
  12 | Lost | 540
  45 | Dexter | 125
Dict = { idT, term, idf }
  8 | plane | 1.25
  27 | killer | 2.87
  29 | crash | 3.07
Posting = { idT*, idS*, nb, tf }
  27 | 45 | 89 | 0.71
  8 | 45 | 3 | 0.02
  8 | 12 | 90 | 0.16
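The three relations can be tried out with the sketch below, which loads the slide's sample rows into an in-memory SQLite database through Python's `sqlite3` module. SQLite is swapped in for Oracle purely for convenience, and the column types and key constraints are assumptions (the `*` marks are read here as foreign keys). The `tf` column is filled as nb / maxNb, which matches the slide's values.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (
    idT INTEGER REFERENCES Dict(idT),     -- assumed foreign key
    idS INTEGER REFERENCES Series(idS),   -- assumed foreign key
    nb  INTEGER,
    tf  REAL,
    PRIMARY KEY (idT, idS)
);
INSERT INTO Series VALUES (12, 'Lost', 540), (45, 'Dexter', 125);
INSERT INTO Dict   VALUES (8, 'plane', 1.25), (27, 'killer', 2.87), (29, 'crash', 3.07);
INSERT INTO Posting (idT, idS, nb) VALUES (27, 45, 89), (8, 45, 3), (8, 12, 90);
""")

# Normalized term frequency: tf = nb / maxNb of the series.
con.execute("""
UPDATE Posting
SET tf = CAST(nb AS REAL) / (SELECT maxNb FROM Series WHERE Series.idS = Posting.idS)
""")

rows = con.execute("""
SELECT s.name, d.term, p.nb, ROUND(p.tf, 2)
FROM Posting p
JOIN Series s ON s.idS = p.idS
JOIN Dict d   ON d.idT = p.idT
ORDER BY p.tf DESC
""").fetchall()
for row in rows:
    print(row)
```

Rounding gives 0.17 for 'plane' in Lost (90/540), where the slide shows 0.16; the slide value looks truncated rather than rounded.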
Description of a TV Series
• Example: Lost (⋈)
• Many surnames need to be filtered out
Retrieval of TV Series: queries with 1 term
• Query: survive (⋈)
• Importance of normalization
  • Stargate Atlantis: nb/maxNb = 63/1116 = 0.05645
  • Blade: nb/maxNb = 9/163 = 0.05521
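The effect of normalization on these two series can be checked with a few lines of Python. The counts are the slide's; the variable names are only illustrative.

```python
# (nb, maxNb) for the term "survive", taken from the slide.
stats = {"Stargate Atlantis": (63, 1116), "Blade": (9, 163)}

# Normalized term frequency: tf = nb / maxNb.
tf = {name: nb / max_nb for name, (nb, max_nb) in stats.items()}

for name, (nb, max_nb) in stats.items():
    print(f"{name}: nb={nb}, tf={tf[name]:.5f}")

raw_ratio = stats["Stargate Atlantis"][0] / stats["Blade"][0]
tf_ratio = tf["Stargate Atlantis"] / tf["Blade"]
print(f"raw ratio: {raw_ratio:.1f}, normalized ratio: {tf_ratio:.2f}")
```

On raw counts Stargate Atlantis looks 7 times more relevant than Blade (63 vs. 9 occurrences); after dividing by each series' maxNb the two are almost tied.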
Retrieval of TV Series: queries with n terms
• Query: survive mulder (⋈)
• 67 | The Vampire Diaries
  survive | 0.028 | 0.107 = 0.028 * 0.107 = 0.003
  mulder | 0.007 | 3.977 = 0.007 * 3.977 = 0.028
  sum = 0.031
• 18 | X-Files
  survive | 0.014 | 0.107 = 0.014 * 0.107 = 0.001
  mulder | 1.000 | 3.977 = 1.000 * 3.977 = 3.977
  sum = 3.978
  ⁞
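The multi-term score on this slide is a sum of tf * idf over the query terms. A possible SQL formulation is sketched below, run on SQLite through Python's `sqlite3` rather than on Oracle. The tf and idf values are the slide's; the reading of 67 and 18 as series ids, the term ids 1 and 2, and the exact query text are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, tf REAL);
INSERT INTO Series  VALUES (67, 'The Vampire Diaries'), (18, 'X-Files');
-- idT values 1 and 2 are placeholders; the slide does not show them.
INSERT INTO Dict    VALUES (1, 'survive', 0.107), (2, 'mulder', 3.977);
INSERT INTO Posting VALUES (1, 67, 0.028), (2, 67, 0.007),
                           (1, 18, 0.014), (2, 18, 1.000);
""")

query_terms = ["survive", "mulder"]
placeholders = ", ".join("?" for _ in query_terms)

# Score of a series = sum of tf * idf over the query terms it contains.
ranking = con.execute(f"""
SELECT s.name, ROUND(SUM(p.tf * d.idf), 3) AS score
FROM Posting p
JOIN Dict d   ON d.idT = p.idT
JOIN Series s ON s.idS = p.idS
WHERE d.term IN ({placeholders})
GROUP BY s.idS, s.name
ORDER BY score DESC
""", query_terms).fetchall()

for name, score in ranking:
    print(name, score)
```

X-Files (about 3.978) ranks far above The Vampire Diaries (about 0.031), as on the slide: 'mulder' is both frequent in X-Files and globally rare, so it dominates the score.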
Computing Similarities Among TV Series 1/2
• Similar to House?
• First, let’s compute the numerator, where:
  • Ai = terms from House
  • Bi = terms from another TV series
• ⋈ on Ai, Bi
Computing Similarities Among TV Series 2/2
• Similar to House? (⋈ ⋈ ⋈)
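The similarity formula itself did not survive in this transcript. A numerator built from the products Ai * Bi is what cosine similarity uses, so the Python sketch below assumes that measure; the weights are made up and the `cosine` helper is illustrative, not the talk's SQL.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    # Numerator: sum of A_i * B_i over the terms the two series share.
    numerator = sum(w * b[t] for t, w in a.items() if t in b)
    # Denominator: product of the two vector norms.
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)

# Hypothetical term weights (made up for illustration).
house = {"hospital": 0.9, "doctor": 0.8, "crazy": 0.3}
greys = {"hospital": 0.7, "doctor": 0.9, "care": 0.4}
dexter = {"killer": 0.9, "blood": 0.6, "doctor": 0.1}

for name, vec in (("Grey's", greys), ("Dexter", dexter)):
    print(f"House vs. {name}: {cosine(house, vec):.3f}")
```

Only the shared terms contribute to the numerator, which is why a join between the postings of the two series is a natural way to compute it in SQL.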
Thank you http://www.irit.fr/~Guillaume.Cabanac