Explore our web portal demo showcasing data crawling, preprocessing, schema alignment, and entity resolution for movie databases. We use Python, R, and PHP to gather data from IMDb, TMDb, and Rotten Tomatoes, align schemas, and resolve entities efficiently.
MOVIE WEB PORTAL GROUP 12 CHI HUNG HUNG TONG KA WAI TERENCE YUEN
CONTENT • Web Crawling • Data Preprocessing • Schema Alignment • Entity Resolution • Data Fusion • Web Portal Demo
OVERVIEW • Database • MySQL • Data Sources • IMDB • TMDB • Rotten Tomatoes • Programming Languages • Python • R • PHP
METHODOLOGY • 1000 most popular movies from 2006 to 2016 • HTTP request sender: Requests • HTML/XML parser: BeautifulSoup
WEB CRAWLER EXAMPLE • Data navigation by traversing the DOM tree top-to-bottom • Nodes are recognized by tag type and class name • Data extraction • [Figures: crawler in Python; page source]
CRAWLING ILLUSTRATION • Tree: List of Movies -> Movie1, Movie2, Movie3, … • Each movie -> Director, List of Actors (Actor1, Actor2, …), List of Genres (Genre1, Genre2, …) • Order of traversing: List of Movies -> Movie1 -> Director -> List of Actors -> Actor1 -> Actor2 -> List of Genres -> Genre1 -> Genre2
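The top-to-bottom traversal can be sketched as follows. This is a minimal sketch, not the group's crawler: the markup and the class names (`movie`, `title`, `director`, `actor`, `genre`) are hypothetical, and the standard library's `xml.etree.ElementTree` stands in for Requests and BeautifulSoup so that the example runs without network access or third-party packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source; the real pages and class names are not shown in the slides.
PAGE = """
<ul class="movies">
  <li class="movie">
    <h2 class="title">Movie1</h2>
    <span class="director">Director1</span>
    <ul class="actors"><li class="actor">Actor1</li><li class="actor">Actor2</li></ul>
    <ul class="genres"><li class="genre">Genre1</li><li class="genre">Genre2</li></ul>
  </li>
</ul>
"""

def texts(node, tag, cls):
    """Return the text of every descendant of `node` with the given tag and class."""
    return [el.text.strip() for el in node.findall(f".//{tag}[@class='{cls}']")]

def crawl(page_source):
    """Walk the tree: list of movies -> movie -> director -> actors -> genres."""
    root = ET.fromstring(page_source.strip())
    movies = []
    for movie in root.findall(".//li[@class='movie']"):
        movies.append({
            "title": texts(movie, "h2", "title")[0],
            "director": texts(movie, "span", "director")[0],
            "actors": texts(movie, "li", "actor"),
            "genres": texts(movie, "li", "genre"),
        })
    return movies

movies = crawl(PAGE)
print(movies)
```

Each node is located by its tag type and class name, mirroring how the slides describe node recognition.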
EXAMPLE • Date/Time format conflicts • August 7, 1975 vs 1975-8-7 vs 1975-08-07 • 2 hrs. 18 mins vs 138 mins • Gender naming convention • “F” vs “Female” • Regional discrepancies • Release date/country • Currencies
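The format conflicts above could be resolved with small normalizers like the ones below. This is a sketch under assumptions: the accepted date formats, the runtime pattern, and the choice of ISO dates, minutes, and full gender names as target forms are illustrative and not taken from the project code.

```python
import re
from datetime import datetime

DATE_FORMATS = ("%B %d, %Y", "%Y-%m-%d")  # assumed set of input formats

def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

# Optional hours part followed by an optional minutes part.
RUNTIME = re.compile(r"(?:(\d+)\s*hrs?\.?)?\s*(?:(\d+)\s*mins?\.?)?")

def normalize_runtime(text):
    """Return the runtime in minutes, or None if the text is not recognized."""
    match = RUNTIME.fullmatch(text.strip())
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

GENDER = {"f": "Female", "female": "Female", "m": "Male", "male": "Male"}

def normalize_gender(text):
    """Map short and long gender labels to one convention."""
    return GENDER.get(text.strip().lower())

for raw in ("August 7, 1975", "1975-8-7", "1975-08-07"):
    print(raw, "->", normalize_date(raw))
print(normalize_runtime("2 hrs. 18 mins"), normalize_runtime("138 mins"))
print(normalize_gender("F"), normalize_gender("Female"))
```

Regional discrepancies such as currencies need external data (exchange rates, country lists) and are left out of this sketch.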
METHODOLOGY • Union the attributes among all 3 sources • Example: • S1: {A1,A2,A3,A4} • S2: {A1,A2,A3,A5,A7} • S3: {A1,A3,A6} • Unified S: {A1,A2,A3,A4,A5,A6,A7}
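The attribute union in this example can be reproduced in a few lines. The helper names `unify` and `align` are hypothetical, and filling missing attributes with `None` is an assumption about how a source record would be mapped onto the unified schema.

```python
def unify(*schemas):
    """Union of attribute lists, keeping first-seen order without duplicates."""
    unified = []
    for schema in schemas:
        for attr in schema:
            if attr not in unified:
                unified.append(attr)
    return unified

def align(record, unified):
    """Project one source record onto the unified schema; missing attributes become None."""
    return {attr: record.get(attr) for attr in unified}

s1 = ["A1", "A2", "A3", "A4"]
s2 = ["A1", "A2", "A3", "A5", "A7"]
s3 = ["A1", "A3", "A6"]

unified = unify(s1, s2, s3)
print(sorted(unified))

# A record from S3 only carries A1, A3 and A6; the rest stay empty.
row = align({"A1": "x", "A3": "y", "A6": "z"}, sorted(unified))
print(row)
```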
UNIFIED SCHEMA • Movie • mid, title, year, overview, runtime, film_location, budget, global_revenue, us_revenue, us_release_date, other_release_date, other_release_country, dvd_date,user_rating, votes_num • Actor • aid, name, gender, date_of_birth, place_of_birth • Director • did, name, gender, date_of_birth, place_of_birth • Genre • gid, genre_type
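As a rough illustration, the unified schema could be declared as follows. The table and column names come from the slide; the column types are assumptions, and an in-memory SQLite database stands in for the project's MySQL database so the sketch is self-contained.

```python
import sqlite3

# Column names follow the slide; the types are assumed for illustration.
SCHEMA = """
CREATE TABLE movie (
    mid INTEGER PRIMARY KEY, title TEXT, year INTEGER, overview TEXT,
    runtime INTEGER, film_location TEXT, budget REAL, global_revenue REAL,
    us_revenue REAL, us_release_date TEXT, other_release_date TEXT,
    other_release_country TEXT, dvd_date TEXT, user_rating REAL, votes_num INTEGER
);
CREATE TABLE actor (
    aid INTEGER PRIMARY KEY, name TEXT, gender TEXT,
    date_of_birth TEXT, place_of_birth TEXT
);
CREATE TABLE director (
    did INTEGER PRIMARY KEY, name TEXT, gender TEXT,
    date_of_birth TEXT, place_of_birth TEXT
);
CREATE TABLE genre (gid INTEGER PRIMARY KEY, genre_type TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
movie_columns = [row[1] for row in conn.execute("PRAGMA table_info(movie)")]
print(tables)
print(movie_columns)
```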
METHODOLOGY • Clustering into Groups by keys • Movie, Genre: by first character of the title name/type name • Actor, Director: by concatenating the first character of actor’s/director’s first name and last name • Pairwise Matching • Distance-based approach • used on Actors, Directors, Genres • edit distance: Levenshtein, Jaro-Winkler • Rule-based approach • used on Movies
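Blocking followed by distance-based pairwise matching might look like the sketch below for actors. The blocking key follows the slide (first character of the first name plus first character of the last name); the `max_distance` threshold, the lowercasing, and the sample names are assumptions, and only Levenshtein distance is implemented here, not Jaro-Winkler.

```python
from collections import defaultdict
from itertools import combinations

def person_key(name):
    """Blocking key: first character of the first name + first character of the last name."""
    parts = name.lower().split()
    return parts[0][0] + parts[-1][0]

def levenshtein(a, b):
    """Edit distance by dynamic programming, keeping one previous row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def match_within_blocks(names, key, max_distance=2):  # threshold is an assumed value
    """Group names by blocking key, then compare pairs only inside each block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[key(name)].append(name)
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if levenshtein(a.lower(), b.lower()) <= max_distance:
                pairs.append((a, b))
    return pairs

# Hypothetical records, for illustration only.
actors = ["Tom Hanks", "Tom Hankes", "Tim Hardy", "Emma Stone"]
pairs = match_within_blocks(actors, person_key)
print(pairs)
```

Because comparisons happen only inside a block, "Emma Stone" is never compared with the three names that share the key `th`.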
PERFORMANCE COMPARISON • Blocking efficiency is compared between 2 different solutions • Experiment performed on Actor entities:
BLOCK SIZE DISTRIBUTION (Solution 2) (Solution 1)
RULE-BASED MATCHING • Rule used to decide whether two movie entities match • Step 1: IF | year1 - year2 | > 2 years, declare a non-match ELSE go to step 2 • Step 2: IF | runtime1 - runtime2 | > 15 mins, declare a non-match ELSE go to step 3 • Step 3: IF edit-distance between title_name1 and title_name2 > threshold, declare a non-match ELSE consider the entities a match
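The three-step rule can be written as a short predicate. This sketch reads Step 3 as "a title edit distance above the threshold is a non-match"; the threshold value, the dictionary layout of a movie record, and the sample records are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance by dynamic programming, keeping one previous row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def movies_match(m1, m2, max_title_distance=3):  # threshold value is an assumption
    # Step 1: release years more than 2 apart -> non-match.
    if abs(m1["year"] - m2["year"]) > 2:
        return False
    # Step 2: runtimes more than 15 minutes apart -> non-match.
    if abs(m1["runtime"] - m2["runtime"]) > 15:
        return False
    # Step 3: titles too far apart in edit distance -> non-match, otherwise a match.
    distance = levenshtein(m1["title"].lower(), m2["title"].lower())
    return distance <= max_title_distance

# Hypothetical records from two sources.
a = {"title": "Example Movie", "year": 2010, "runtime": 120}
b = {"title": "Example Movie.", "year": 2011, "runtime": 118}
c = {"title": "Example Movie", "year": 2015, "runtime": 120}

print(movies_match(a, b))
print(movies_match(a, c))
```

Ordering the cheap numeric checks before the edit-distance computation means most non-matching pairs are rejected without comparing titles at all.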
EXAMPLE After Record Linkage…
METHODOLOGY • Fusion by voting • Assumption made on trustworthiness of the 3 data sources • IMDB > TMDB > Rotten Tomatoes • Extract the most informative value • Example 1: • For actor’s DOB => S1: 1985, S2: 1985-05-05, S3: 1983 • S2: 1985-05-05 will be chosen, as S1 & S2 share the same year value, and S2 adds the month and day that S1 lacks • Example 2:
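One way to combine voting, informativeness, and source trust is sketched below using Example 1. The scoring order (votes first, then value length as a proxy for informativeness, then trust), the prefix test for agreement, and the mapping of S1 > S2 > S3 to the trust ranking are assumptions; the slides do not spell out the exact procedure.

```python
TRUST = ["S1", "S2", "S3"]  # assumed order, most to least trusted

def compatible(a, b):
    """Two values agree if one is a prefix of the other, e.g. '1985' and '1985-05-05'."""
    return a.startswith(b) or b.startswith(a)

def fuse(values):
    """Pick one value from a dict of source -> value (sources with no value omitted)."""
    def score(source):
        value = values[source]
        votes = sum(compatible(value, other) for other in values.values())
        # More votes wins; ties go to the longer (more detailed) value,
        # then to the more trusted source.
        return (votes, len(value), -TRUST.index(source))
    best = max(values, key=score)
    return values[best]

dob = {"S1": "1985", "S2": "1985-05-05", "S3": "1983"}
fused = fuse(dob)
print(fused)
```

When all three sources disagree, every value gets a single vote and equal-length values fall back to the trust order, so the most trusted source wins.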
PORTAL APPLICATION • Search movies for more details. • Filter and rank movies by criteria such as rating or box office. • Find the movies related to a celebrity.