Web Search with Variable User Model
Peter Gurský, Stanislav Krajči, Tomáš Horváth, Róbert Novotný, Jozef Jirásek, Veronika Vaneková, Peter Vojtáš
PF UPJŠ Košice, MFF UK Praha
Datakon, 22.10.2007
Problem: Information Overload
• Multiple sources
• Different structure, layout, usage
• Various software tools with different sets of answers
Objectives
• Integrate data from heterogeneous sources
• Find an adequate number of answers that match user preferences
• Find a suitable representation of user preferences
System Architecture
[Architecture diagram. Labelled components: Web, crawler, HTML files, annotation, ontology (corporate memory), middleware system. Labelled flows: query, evaluation, top-k objects, best objects.]
Text-Oriented Annotation
• Regular expressions
• Analysis of the visual representation
• Structural differences:
  • Element hierarchy
  • HTML attributes
  • HTML node values
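
As an illustration of the regular-expression approach, here is a minimal sketch that pulls one attribute value out of an HTML fragment. The markup, the `price` class name, and the function `extract_price` are hypothetical and chosen only for this example; they are not taken from the presented system.

```python
import re

# Hypothetical markup: a price inside <span class="price">, e.g. "450 SKK".
PRICE_RE = re.compile(r'<span class="price">\s*(\d+)\s*([A-Z]{3})\s*</span>')

SAMPLE_HTML = (
    '<div class="hotel"><h2>Hotel Example</h2>'
    '<span class="price">450 SKK</span></div>'
)


def extract_price(html):
    """Return (amount, currency) from the first matching span, or None."""
    match = PRICE_RE.search(html)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(extract_price(SAMPLE_HTML))
```

A pattern like this only works while the page layout stays the same, which is one reason to combine it with the structural cues listed above.
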
Graphic-Oriented Annotation
• Preliminary exploration
• Web pages may contain pictures, flash animations, ... This information is not available from the web page source.
• We use OCR processing and analysis of color, position, ...
User Dependent Querying
[Diagram. Labelled components: object display and evaluation, suitable object search (top-k), learning preferences (IGAP), RDF repository, preferences. Labelled flows: evaluate, display, find rules, find top-k objects.]
Retrieving Preferences from User
• Direct user specification
• Collaborative filtering
• Learning preferences from sample objects evaluated by the user
• Iterative method: repeat the evaluation until the relevant objects are found
Learning Preference from Evaluation
[Two slides with this title; their figures are not preserved.]
Basic Fuzzy Set Types
• Lower values are better
• Higher values are better
• Middle values are better
• Either high or low, but not middle
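
The four shapes above can be sketched as piecewise-linear membership functions that map an attribute value to a degree in [0, 1]. This is only an illustration: the linear shape, the function names, and the breakpoint parameters are assumptions, not the functions learned by the presented system.

```python
def higher_is_better(x, lo, hi):
    """0 at or below lo, 1 at or above hi, linear in between (assumes lo < hi)."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def lower_is_better(x, lo, hi):
    """Mirror image of higher_is_better."""
    return 1.0 - higher_is_better(x, lo, hi)


def middle_is_better(x, lo, peak_lo, peak_hi, hi):
    """Trapezoid: rises on [lo, peak_lo], is 1 on [peak_lo, peak_hi], falls on [peak_hi, hi]."""
    return min(higher_is_better(x, lo, peak_lo), lower_is_better(x, peak_hi, hi))


def extremes_are_better(x, lo, peak_lo, peak_hi, hi):
    """Either high or low, but not middle: complement of the trapezoid."""
    return 1.0 - middle_is_better(x, lo, peak_lo, peak_hi, hi)


# Example: a price of 300 on a scale where 100 is ideal and 500 is unacceptable.
print(lower_is_better(300, 100, 500))
```
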
Aggregation
Each fuzzy set relates to one attribute, e.g. the number of stars. Thus we obtain a partial relevance for every attribute. The overall relevance is the result of aggregation:
• Weighted average (continuous range)
  good_U = 2/3 * cheap_U + 1/3 * high-class_U
• Rules (discretized range)
  evaluation_U = good IF (price ≤ 500 AND stars ≥ ***)
  evaluation_U = excellent IF (distance ≤ 1 km)
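
Both aggregation styles can be sketched in a few lines. The weights and thresholds are the ones on the slide; the dictionary keys, the order in which the rules are tried, and the "unrated" fallback are assumptions made for this example.

```python
def weighted_average(partial, weights):
    """Continuous aggregation: weighted mean of the partial relevances."""
    total = sum(weights.values())
    return sum(weights[name] * partial[name] for name in weights) / total


def rule_evaluation(hotel):
    """Discretized aggregation using the two rules from the slide.

    Assumption: the 'excellent' rule is tried first, and objects that
    match no rule are labelled 'unrated'.
    """
    if hotel["distance_km"] <= 1:
        return "excellent"
    if hotel["price"] <= 500 and hotel["stars"] >= 3:
        return "good"
    return "unrated"


weights = {"cheap": 2 / 3, "high_class": 1 / 3}
partial = {"cheap": 0.9, "high_class": 0.6}
print(weighted_average(partial, weights))
print(rule_evaluation({"price": 450, "stars": 3, "distance_km": 4}))
```

The weighted average yields a score that can be ranked directly, while the rules yield a small set of labels that are easier to show to a user.
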
[Figure: example preference fuzzy sets for User 1, User 2, User 3 and User 4. Labels shown: Close, Far, Middle distance, Border; Middle price, Cheap, Middle price, Border.]
Relevant Object Search
• Having retrieved local and global preferences, we can find the top-k objects according to the user's preferences
• Do not browse and compute over all the data; use only the data that are necessary
• Use the 3-phased No Random Access algorithm, an improvement of Fagin's algorithm
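
To show how a top-k search can stop before reading all the data, here is a minimal sketch of the basic No Random Access (NRA) scheme, not the 3-phased variant named on the slide, whose details are not given here. It assumes one list per attribute, sorted by partial relevance in descending order, scores in [0, 1], non-negative weights, and a missing score treated as 0. The function name and data layout are hypothetical.

```python
def nra_top_k(lists, weights, k):
    """Basic NRA: sorted access only, with lower/upper bounds per object.

    lists   -- one list per attribute of (object_id, score), sorted descending
    weights -- non-negative weights of the weighted-sum aggregation
    Returns (top-k object ids ordered by lower bound, number of rows read per list).
    """
    m = len(lists)
    seen = {}           # object_id -> {list index: score seen so far}
    last = [1.0] * m    # upper bound for any score not yet read in each list

    def agg(scores):
        return sum(w * s for w, s in zip(weights, scores))

    def lower(obj):     # unknown scores counted as 0
        return agg([seen[obj].get(i, 0.0) for i in range(m)])

    def upper(obj):     # unknown scores counted as the last value read
        return agg([seen[obj].get(i, last[i]) for i in range(m)])

    depth = 0
    max_depth = max((len(lst) for lst in lists), default=0)
    while depth < max_depth:
        for i, lst in enumerate(lists):
            if depth < len(lst):
                obj, score = lst[depth]
                seen.setdefault(obj, {})[i] = score
                last[i] = score
            else:
                last[i] = 0.0   # list exhausted: nothing more can be read
        depth += 1
        ranked = sorted(seen, key=lower, reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            kth = lower(top[-1])
            # Stop when neither an unseen object nor a seen outsider can beat the k-th.
            if kth >= agg(last) and all(upper(o) <= kth for o in rest):
                return top, depth
    return sorted(seen, key=lower, reverse=True)[:k], depth


cheap = [("A", 0.9), ("B", 0.8), ("C", 0.3), ("D", 0.1)]
high_class = [("B", 0.9), ("A", 0.6), ("D", 0.5), ("C", 0.2)]
print(nra_top_k([cheap, high_class], [2 / 3, 1 / 3], k=2))
```

In this small example the two best objects are confirmed after reading two of the four rows in each list. Note that the basic scheme guarantees the top-k set, while the lower bounds it holds at the stopping point need not be the exact final scores.
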
User Independent Querying
• Text-based vector model
• A document is defined as a vector of TF-IDF weights of the document terms
• Weights are stored in a database index
• Similarity of the query and the document collection is determined by the cosine measure
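
The vector model above can be sketched with the standard library alone. This is a minimal illustration that assumes raw term counts, the idf form log(N / df), and whitespace tokenization; the presented system's actual weighting and its database index are not shown, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (one sparse TF-IDF vector per document, idf table)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, docs):
    """Return document indices ordered by decreasing similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(query.lower().split())
    q = {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}
    return sorted(range(len(docs)), key=lambda i: cosine(q, vectors[i]), reverse=True)


docs = [
    "cheap hotel near the beach",
    "luxury hotel in the city centre",
    "cheap flights to the city",
]
print(search("cheap beach hotel", docs))
```

Terms that occur in every document, such as "the" here, get an idf of 0 and so do not influence the ranking.
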
Thank You for Your Attention. Questions?