Multimedia search engine Michal Krsek, UISK Charles University at Prague & CESNET Ivan Doležal, CESNET Michal Illich, Jyxo
Electronic Media • TV & radio • Organized in channels • Zero democracy in programming (by channel management) • Centralized production (big guys business)
Internet • Not only web (audio/video and others) • Remember archie.sura.net? • IPTV / Live / Video on demand • Navigation only via web => not easy to find a specific A/V program
Search options I • Voice recognition • Language identification • Accents • Video recognition • Text interpretation (bush vs. Bush) • Low video quality
Search options II • Indexing of web pages • Yahoo! does (Google bomb target) • Metadata • “Out-of-band metadata” (as in the librarian world) • Metadata in files (added during editing or encoding)
Project description • Started in 2003 (oh yes, one year before Truveo) • “Google for audio and video on the Internet” • No support from content owners • Modular concept • Start with the .cz Internet
Technical description I • Crawler • Crawls the web and collects addresses (URLs) • Exports URLs of multimedia files • Software written by Jyxo (Linux console app)
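The crawler step (collect links, export only those that point at multimedia files) can be sketched as follows. This is a minimal illustration, not Jyxo's crawler: the extension list, the tags inspected, and the names `MEDIA_EXTENSIONS`, `LinkCollector`, and `extract_media_urls` are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Assumed set of audio/video extensions; the real crawler's list is not given.
MEDIA_EXTENSIONS = {".mp3", ".wav", ".ogg", ".avi", ".mpg", ".mpeg",
                    ".wmv", ".asf", ".mov", ".rm", ".ram"}


class LinkCollector(HTMLParser):
    """Collects absolute URLs from href/src attributes of a few tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "embed", "source"):
            for name, value in attrs:
                if name in ("href", "src") and value:
                    # Resolve relative links against the page address.
                    self.links.append(urljoin(self.base_url, value))


def is_media_url(url):
    """True if the URL path ends in one of the assumed media extensions."""
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    return extension in MEDIA_EXTENSIONS


def extract_media_urls(html, base_url):
    """Return the multimedia URLs found in one HTML page, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [url for url in collector.links if is_media_url(url)]
```

Filtering by file extension is only a heuristic; streams served from URLs without a telling extension would need a check of the response content type instead.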
Technical description II • Distiller • Imports addresses of multimedia files • Distills metadata (and makes XML files) • Makes screenshots (if the file contains video) • C# software and mplayer (Windows apps) • Runs in a distributed environment
Technical description III • Database • Imports XML metadata files into a full-text DB • Answers back-end queries coming from the web front end • And other full-text features (e.g. language)
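To make the database stage concrete, here is a toy sketch of importing XML metadata records and answering a query. It stands in for the real full-text database, which the slides do not describe; the XML layout (a `media` element with a `url` attribute and text-bearing children) and the class name `TinyFullTextIndex` are assumptions for illustration only.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


class TinyFullTextIndex:
    """Toy inverted index: lowercase token -> set of media URLs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def import_xml(self, xml_text):
        """Index every word found in the text of one metadata record."""
        record = ET.fromstring(xml_text)
        url = record.get("url")
        text = " ".join(record.itertext())
        for token in re.findall(r"\w+", text.lower()):
            self.postings[token].add(url)

    def search(self, query):
        """Return the sorted URLs whose metadata contains all query words."""
        tokens = re.findall(r"\w+", query.lower())
        if not tokens:
            return []
        hits = set.intersection(
            *(self.postings.get(token, set()) for token in tokens)
        )
        return sorted(hits)
```

Usage: call `import_xml` once per distilled record, then `search("prague lecture")` returns the URLs whose metadata contains both words. Lowercasing every token is a simplification of this sketch.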
[Diagram: processing pipeline] • Crawling: crawls web pages, gets addresses, filters A/V addresses • Distillation: gets metadata from multimedia files • Indexing / search: holds the full-text database, provides the back end for queries
Distillation • Process description • Get URL from DB • Get metadata from the file available at the URL • Get screenshots at 1, 30, and 50 sec • Save metadata & screenshots
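The process above can be sketched as one function that turns a URL into an XML metadata record. This is a hedged outline, not the project's C# distiller: the element and attribute names are invented, and `read_metadata` and `take_screenshot` are hypothetical callables standing in for the native players and mplayer.

```python
import xml.etree.ElementTree as ET

# Screenshot positions in seconds, as listed on the slide.
SCREENSHOT_OFFSETS_SEC = (1, 30, 50)


def distill(url, read_metadata, take_screenshot):
    """Build an XML record for one multimedia URL.

    read_metadata(url) -> dict of metadata fields (hypothetical helper).
    take_screenshot(url, offset) -> image file name, or None when no frame
    is available at that offset (hypothetical helper).
    """
    record = ET.Element("media", {"url": url})
    # Sorted keys give a stable field order in the output.
    for name, value in sorted(read_metadata(url).items()):
        field = ET.SubElement(record, "field", {"name": name})
        field.text = str(value)
    for offset in SCREENSHOT_OFFSETS_SEC:
        filename = take_screenshot(url, offset)
        if filename is not None:
            ET.SubElement(record, "screenshot",
                          {"offset": str(offset), "file": filename})
    return ET.tostring(record, encoding="unicode")
```

Passing the two helpers in as arguments keeps the sketch independent of any particular player, so it can be exercised with stub functions.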
Distillation • Use of Win32 applications • Native players (WMP, RP, Qt) for metadata • Mplayer for screenshots • Takes one minute on average • Slow servers/bandwidth • Streaming without fast forward
DistillerGRID • At about one minute per URL, a single machine would need about 16 years to distill 8,500,000 URLs • Ideal application for GRID computing • No need for real-time response • Huge amount of computing time needed • Two ways to create a GRID • Build a dedicated system • Use existing capacity
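The 16-year figure follows from the one-minute average distillation time: 8,500,000 minutes divided by 525,600 minutes per year. A small sketch of the arithmetic, where the `workers` parameter and the assumption of uninterrupted, perfectly parallel operation are illustrative simplifications:

```python
MINUTES_PER_DAY = 60 * 24
MINUTES_PER_YEAR = MINUTES_PER_DAY * 365


def years_to_distill(urls, minutes_per_url=1.0, workers=1):
    """Years of wall-clock time, assuming workers run nonstop in parallel."""
    return urls * minutes_per_url / (workers * MINUTES_PER_YEAR)


def days_to_distill(urls, minutes_per_url=1.0, workers=1):
    """Same estimate expressed in days."""
    return urls * minutes_per_url / (workers * MINUTES_PER_DAY)


print(round(years_to_distill(8_500_000), 1))            # single machine
print(round(days_to_distill(8_500_000, workers=100)))   # 100 machines
```

Under these assumptions a single machine needs about 16.2 years, while 100 machines running around the clock would need about 59 days, which is why the workload suits a GRID.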
Computing machines • PC/Windows based • HW independent • Secure environment • Security of the hosting system • Security of the distillation process • Well connected • No need to run 24x7 • Easy to manage
Configuration • ~100 PCs in student labs • Running on demand during weekends • Virtual machines (MS VPC 2004) in the hosting system (Win XP) • Three different HW configurations • Peak rate about 5,000 URLs per minute • SQL database as back end -> pull-based distribution of work
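Pull distribution means each distillation station asks the database for its next URL instead of being assigned work by a scheduler. The sketch below shows one way this could look; the slides do not give the schema or the SQL product, so SQLite, the `jobs` table, and the function names are assumptions for illustration.

```python
import sqlite3


def create_queue(conn, urls):
    """Create a hypothetical jobs table and load the URLs to distill."""
    conn.execute(
        "CREATE TABLE jobs ("
        " id INTEGER PRIMARY KEY,"
        " url TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT)"
    )
    conn.executemany("INSERT INTO jobs (url) VALUES (?)",
                     [(url,) for url in urls])
    conn.commit()


def claim_next(conn, worker):
    """Claim the oldest pending job for this worker; None if queue is empty."""
    while True:
        row = conn.execute(
            "SELECT id, url FROM jobs WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with conn:  # commits on success, rolls back on error
            cursor = conn.execute(
                "UPDATE jobs SET status = 'claimed', worker = ? "
                "WHERE id = ? AND status = 'pending'",
                (worker, row[0]),
            )
        # The status guard means only one worker's UPDATE can succeed.
        if cursor.rowcount == 1:
            return row
        # Otherwise another worker took this job first; try the next one.


def mark_done(conn, job_id):
    """Record that a claimed job has been distilled."""
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?",
                     (job_id,))
```

A station would loop on `claim_next`, distill the returned URL, and call `mark_done`. Because stations only pull work when they are running, machines that are switched off or reclaimed on Monday morning simply stop asking.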
Actual status I • HW • 20 crawlers • 2 servers for full-text DB (< 1,400 USD) • Distillation stations (X office PCs) • Connected by 1 Gb/s to CESNET2 -> GEANT2
Actual status II • Database • EU + .com, .edu • > 13,000,000 URLs • > 8,000,000 valid • > 2,800,000 with screenshots
Want to test? • URLs • http://multimedia.jyxo.cz • http://videoserver.cesnet.cz/videoarchiv_en.php • For the XML interface, send me an e-mail
Questions? Comments? • Michal Krsek, Michal.Krsek@cesnet.cz (academic service, cooperation) • Michal Illich, michal@illich.cz (business service)