Geographic Web Information Retrieval • Alexander Markowetz, University of Marburg • Thomas Brinkhoff, FH Oldenburg • Bernhard Seeger, University of Marburg
Current Situation In Web-IR • Everybody is online • But never seen
Current Situation In Web-IR • Queries are too short • Result sets are too large • You can effectively block your competitors • Good results get buried • Smaller results • Ways to drill the iceberg
Solutions • Personalized Search • Dynamic/Interactive Search
Geographic Web-IR • Location is the most personal property • „All business is local“ • People already use the web geographically • „Yoga Brooklyn“ • „Linux usergroup Frankfurt“ • And get poor results • We are going to make that a lot better
How-Not-To • Semantic Web • „If only everybody included geographic markup in their web pages“ • Two problems • Chicken-and-egg • Malicious webmasters • Meta tags, anyone? • Bottom line • Semantic Web is for „B2B“ situations only.
How-To • Modify traditional IR techniques to extract geographic markers • Multigranular approach • Extending basic Web-IR • Map pages to geographic positions • Footprint • Aggregate and Cluster them • Build Applications • Geographic Search • Geographic Web-Mining
Geocoding • Footprint • Geographic Position of a Webpage • Set of points and polygons, associated with some amplitude
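The footprint described on this slide could be held in a small container like the one below. This is only a sketch: the class name `Footprint`, its methods, and the choice of (latitude, longitude) tuples are assumptions for illustration, not the authors' data structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees; an assumed convention


@dataclass
class Footprint:
    """Hypothetical footprint: points and polygons, each with an amplitude."""
    points: List[Tuple[Point, float]] = field(default_factory=list)
    polygons: List[Tuple[List[Point], float]] = field(default_factory=list)

    def add_point(self, lat: float, lon: float, amplitude: float) -> None:
        self.points.append(((lat, lon), amplitude))

    def add_polygon(self, vertices: List[Point], amplitude: float) -> None:
        self.polygons.append((list(vertices), amplitude))

    def total_amplitude(self) -> float:
        """Sum of all amplitudes over points and polygons."""
        return sum(a for _, a in self.points) + sum(a for _, a in self.polygons)

    def strongest_point(self) -> Optional[Point]:
        """Point with the highest amplitude, or None if there are no points."""
        if not self.points:
            return None
        return max(self.points, key=lambda pa: pa[1])[0]
```

A page that mentions one city strongly and another weakly would simply carry two points with different amplitudes.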
Preliminaries • Basic IR Assumptions can easily be extended to „geographic-IR“ • Radius-1 Hypothesis • Radius-2 Hypothesis (co-citation) • Intra-Site Hypothesis • Intra-subdomain • Intra-directory
Multigranularity • Information extraction on different levels • Domain • Subdomain • Directory • File • Need to aggregate • [Figure: tree with a domain (Dom) at the root, subdomains (SDom) below it, then directories (Dir), then files (File)]
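One way to picture the aggregation step is to derive a key for each level from a page's URL and sum the per-page evidence upward. The sketch below assumes that evidence is a plain mapping from place name to weight, that summing is an acceptable aggregate, and that the registered domain is the last two host labels; all three are simplifications chosen for illustration, and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

LEVELS = ("file", "directory", "subdomain", "domain")


def levels(url):
    """Return the file, directory, subdomain and domain keys of a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Naive assumption: the registered domain is the last two host labels.
    domain = ".".join(host.split(".")[-2:])
    path = parts.path or "/"
    directory = path.rsplit("/", 1)[0] + "/"
    return {
        "file": host + path,
        "directory": host + directory,
        "subdomain": host,
        "domain": domain,
    }


def aggregate(page_evidence):
    """Sum per-place evidence from single pages up to every coarser level.

    page_evidence maps URL -> {place_name: weight}.
    Returns level -> key -> {place_name: summed weight}.
    """
    totals = {name: defaultdict(lambda: defaultdict(float)) for name in LEVELS}
    for url, evidence in page_evidence.items():
        for level, key in levels(url).items():
            for place, weight in evidence.items():
                totals[level][key][place] += weight
    return totals
```

With this layout, a directory whose files all point to the same town ends up with a concentrated footprint, while a large domain spread over many towns ends up with a diffuse one.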
Sources • On all levels • Names of places • Zip codes • Area codes • On site level • Whois • Business directories • Links • Density over a given area • Radius-1 and Radius-2 • Kevin S. McCurley, Geospatial Mapping and Navigation of the Web, 10th WWW, 2001 • J. Ding, L. Gravano, and N. Shivakumar, Computing Geographical Scopes of Web Resources, VLDB 2000
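The textual sources listed above (place names, zip codes, area codes) can be illustrated with a small pattern-matching sketch. The lookup tables are tiny illustrative stand-ins for a real gazetteer and should be checked against one; the function name `extract_markers` and the regular expressions are assumptions, not the extraction method used by the authors.

```python
import re

# Illustrative stand-ins for a real gazetteer (assumed entries, verify before use).
ZIP_TO_CITY = {"35037": "Marburg", "60311": "Frankfurt"}
AREA_CODE_TO_CITY = {"06421": "Marburg", "069": "Frankfurt"}
PLACE_NAMES = {"Marburg", "Frankfurt"}

NUMBER_RE = re.compile(r"\b\d{3,5}\b")               # candidate zip or area codes
WORD_RE = re.compile(r"\b[A-ZÄÖÜ][a-zäöüß]+\b")      # capitalised words


def extract_markers(text):
    """Return (kind, matched_text, city) tuples found in the page text."""
    markers = []
    for token in NUMBER_RE.findall(text):
        if token in ZIP_TO_CITY:
            markers.append(("zip", token, ZIP_TO_CITY[token]))
        if token in AREA_CODE_TO_CITY:
            markers.append(("area_code", token, AREA_CODE_TO_CITY[token]))
    for word in WORD_RE.findall(text):
        if word in PLACE_NAMES:
            markers.append(("name", word, word))
    return markers
```

Several independent markers agreeing on the same city, as in an address line with zip code, town name and phone prefix, would be stronger evidence than any single match.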
Geographic Search • A simple interface • Not so exciting, but... • [Figure: search form with fields Key Words, City, Street, State, Area code and a SEARCH button]
Dynamic Geographic-IR • Replacing the „next“ button • [Figure: result navigation with Closer, Continue and Wider buttons in place of Next, over a radius scale of ½ mile, 1 mile, 2 miles, 5 miles, 10 miles, 25 miles, 100 miles]
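The Closer/Wider interaction can be sketched as stepping through the radius scale shown on the slide and re-filtering results by distance. The class `RadiusSearch`, the result tuples, and the use of great-circle (haversine) distance are assumptions for illustration; the Continue button (paging at the current radius) is left out.

```python
import math

RADII_MILES = [0.5, 1, 2, 5, 10, 25, 100]  # radius scale taken from the slide


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    r = 3958.8  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


class RadiusSearch:
    """Hypothetical session state for the Closer/Wider buttons."""

    def __init__(self, center, results, start_index=3):
        self.center = center        # (lat, lon) of the user
        self.results = results      # list of (title, lat, lon)
        self.index = start_index    # position in RADII_MILES

    @property
    def radius(self):
        return RADII_MILES[self.index]

    def current(self):
        """Titles within the current radius, nearest first."""
        lat0, lon0 = self.center
        hits = [(haversine_miles(lat0, lon0, lat, lon), title)
                for title, lat, lon in self.results]
        return [title for dist, title in sorted(hits) if dist <= self.radius]

    def closer(self):
        self.index = max(0, self.index - 1)
        return self.current()

    def wider(self):
        self.index = min(len(RADII_MILES) - 1, self.index + 1)
        return self.current()
```

The point of the design is that the user refines the geographic scope instead of paging through one long ranked list.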
Locality • Final ranking is a (linear) combination of importance and geographic distance. • Chances are: • Amazon will still rank first, no matter where you are • Amazon is a „global bully“ • Idea: • Eliminate global bullies by computing importance differently • Give less weight to links that span a longer distance
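Both ideas on this slide can be written down in a few lines. The sketch below is one possible reading, not the authors' formula: the decay function `1 / (1 + d / scale)`, the parameters `alpha` and `scale_km`, and the use of a weighted in-link count instead of a full link-analysis algorithm are all assumptions made for illustration.

```python
def link_weight(distance_km, scale_km=50.0):
    """Assumed decay: a link counts less the longer the distance it spans."""
    return 1.0 / (1.0 + distance_km / scale_km)


def local_importance(links):
    """Distance-weighted in-link score per target page.

    links is an iterable of (source, target, distance_km) triples.
    """
    scores = {}
    for _source, target, distance_km in links:
        scores[target] = scores.get(target, 0.0) + link_weight(distance_km)
    return scores


def final_score(importance, distance_km, alpha=0.5, scale_km=50.0):
    """Linear combination of importance and geographic proximity to the user."""
    proximity = 1.0 / (1.0 + distance_km / scale_km)
    return alpha * importance + (1.0 - alpha) * proximity
```

Under these assumptions, a site with three in-links from hundreds of kilometres away scores lower than a site with two in-links from the same town, which is the intended effect on a „global bully“.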
Evaluation • Evaluating Web-IR is hard • Evaluating geo-search is even harder • Mistakes are hard to find
Impact of geo-IR • Next generation Search Engine • Location based Service • For cellphones under UMTS • Move traffic from A&E • Local companies will get more traffic • Increase Profits from Adwords • Smallest businesses will advertise online • Locally focused • The „Leaflet-industry“ will shrink
Geographic Web-Mining • The web reflects human society. • Distorted • Delayed/Ahead • A lot of interesting social questions can be answered by looking at a large web crawl • You can save time and money compared to door-to-door surveys • This is widely used • But: • Most of these questions are of a geographic nature
Example Queries • Where in Germany are vintage sneakers a trend? • Draw a map of Germany with all sites about vintage sneakers. • Is there a fashion authority that is accepted in all regions of Germany? • Find all fashion sites that get a minimum of 1000 equally distributed links. • Do Britney and Madonna have the same audience? • Map the areas in Germany where there are significantly more sites for B. than for M. • Precise semantics?
Current Work • Older Prototype • Metasearch on top of lycos.de • Screen-scrape & re-order • Whois only • Did very well
Current Work • Current Prototype for Geographic Search • Limited to Germany = .de domains • 50,000,000 pages • Expected online by late summer • In co-operation with • Yen-Yu Chen • Xiaohui Long • Torsten Suel • Polytechnic University, Brooklyn
Reinventing Web-IR • Nearly no (academic) work in geo-IR • Almost every aspect of Web-IR needs to be looked at again • Interfaces • Query processing • Index distribution • Link analysis • User profile analysis • Spam detection • Even: • Other aspects of personalized search • Changes in the web
Thank you Any questions?