http://cs.joensuu.fi/mopsi/ Ad-hoc Georeferencing of Web-pages Using Street-name Prefix Trees Andrei Tabarcea, Ville Hautamäki, Pasi Fränti University of Eastern Finland
Introduction • Our goal is to find services and points of interest close to the user’s location • We call this “location-based search” • We try to find location information in web-pages
Ad-Hoc Georeferencing
<HTML>
<HEAD profile="http://geotags.com/geo">
<META name="geo.position" content="62.35;29.44">
<META name="geo.region" content="FI">
<META name="geo.placename" content="Joensuu">
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="http://www.joensuu.fi/tkt/sivutyyli.css" type="text/css">
<TITLE>Pages of Pasi Fränti</TITLE>
</HEAD>
• The problem is how to extract and validate location data from free-form text • Most web pages don’t contain explicit georeferencing (e.g. geo-tags) • Postal addresses are the most common location data found • Our goal is to assign geographical coordinates to services mentioned in web-pages • We call this method ad-hoc georeferencing
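For the minority of pages that do carry explicit geo-tags like the header above, the coordinates can be read directly. The following is a minimal sketch using Python's standard `html.parser`; the class and function names (`GeoMetaParser`, `extract_geo_position`) are illustrative assumptions, not part of MOPSI.

```python
from html.parser import HTMLParser


class GeoMetaParser(HTMLParser):
    """Collects <meta name="geo.*" content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.geo = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name.startswith("geo.") and attr.get("content"):
            self.geo[name] = attr["content"]


def extract_geo_position(html):
    """Return (lat, lon) from a geo.position tag, or None if absent or malformed."""
    parser = GeoMetaParser()
    parser.feed(html)
    raw = parser.geo.get("geo.position")
    if raw is None:
        return None
    try:
        lat, lon = (float(part) for part in raw.split(";"))
    except ValueError:
        return None
    return lat, lon


page = """<html><head>
<meta name="geo.position" content="62.35;29.44">
<meta name="geo.region" content="FI">
<meta name="geo.placename" content="Joensuu">
</head></html>"""
print(extract_geo_position(page))
```

Pages without such tags return `None`, which is the case ad-hoc georeferencing is meant to handle.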
MOPSI Location-Based Search MOPSI = Mobiilit paikkatieto-sovellukset ja Internet (Mobile location-based applications and Internet) Available at http://cs.joensuu.fi/mopsi/ Main focus areas: • Mobile search engine • How to collect & present location-based data • Other location-related topics
Mobile search engine • How can you find services: asking for directions, advertisements, wandering around, yellow pages, Internet • A query consists of: keyword, location
Mobile Search engine structure • Search Engine consists of: • User interface • Core server software • Geocoded street-name database [Figure: architecture diagram connecting the mobile application and the web user interface to the core server software and the geocoded street-name database; arrow labels: Keyword, Address, Coordinates, Search results]
Core Server software [Figure: block diagram of the core server software. Blocks: Page parser, Relevant municipalities detector, Georeferencing module, Address and description detector, Address validator, Geocoded database. Data labels: <keyword, municipality> query, Result links, Word list, Municipalities list, Municipalities, Keyword, Addresses, Coordinates, Results list (Keyword, Address, Coordinates), Sorted results list]
Our Solution • A rule-based solution that detects address-based locations using a gazetteer and street-name prefix trees created from the gazetteer • We compare this approach against: • a method that doesn’t require a gazetteer (a heuristic method that assumes that the street name has a certain structure) • a method that also uses data structures created from the gazetteer, in the form of street-name arrays
StreetNameDetection(words) {
  i = 0;
  WHILE i < count(words) DO {
    IF words[i] = street name THEN {
      Search for street number, postal code and other address elements near words[i].
      IF address elements found THEN {
        Create address block
        Get coordinates using Geocoded Database
        IF coordinates found THEN
          Add address block to address list
      }
    }
    i = i + 1;
  }
}
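The StreetNameDetection pseudocode above can be sketched as runnable Python. This is an illustration under stated assumptions: the `GAZETTEER` dictionary, its placeholder coordinates, the three-word search window and the regular expressions are all hypothetical stand-ins for the geocoded database and the address rules, which the slides do not specify.

```python
import re

# Hypothetical toy gazetteer: street name -> {house number: (lat, lon)}.
# Coordinates are placeholders for illustration, not real geocoding data.
GAZETTEER = {
    "kauppakatu": {"29": (62.60, 29.76)},
    "torikatu": {"5": (62.61, 29.75)},
}

NUMBER = re.compile(r"^\d+[a-z]?$", re.IGNORECASE)  # e.g. "29" or "29b"
POSTAL_CODE = re.compile(r"^\d{5}$")  # Finnish postal codes have five digits


def detect_addresses(words, window=3):
    """Scan words; build an address block at each street name and keep geocodable ones."""
    addresses = []
    i = 0
    while i < len(words):
        street = words[i].lower().strip(",.")
        if street in GAZETTEER:
            # Search for address elements in the few words after the street name.
            nearby = [w.strip(",.") for w in words[i + 1 : i + 1 + window]]
            number = next(
                (w for w in nearby if NUMBER.match(w) and not POSTAL_CODE.match(w)),
                None,
            )
            postal = next((w for w in nearby if POSTAL_CODE.match(w)), None)
            if number is not None:
                # Validate the candidate block against the gazetteer.
                coords = GAZETTEER[street].get(number.lower())
                if coords is not None:
                    addresses.append({
                        "street": words[i].strip(",."),
                        "number": number,
                        "postal_code": postal,
                        "coords": coords,
                    })
        i += 1
    return addresses


text = "Visit our cafe at Kauppakatu 29, 80100 Joensuu or call us."
print(detect_addresses(text.split()))
```

Here the street-name test is a plain dictionary lookup; the prefix-tree, heuristic and brute-force variants compared in this talk differ only in how that test is carried out.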
Street-address Detection • We use a rule-based pattern matching algorithm • The detection of street-names is the starting point of the algorithm • An address-block candidate is constructed by detecting typical address elements (street names, numbers, postal codes, telephone numbers and municipal names) • Address block candidates are validated using the gazetteer
Street-name Detection • Street-name detection is the starting point of the address detection • Heuristic and brute-force methods are compared against our prefix-tree solution • Our application uses a commercial gazetteer for Finland and, for Singapore, street data from the free map project OpenStreetMap
Prefix Trees • Invented by Fredkin (1960) • The prefix tree (or trie) is a fast ordered tree data structure used for retrieval • The root is associated with the empty string • All descendants of a node share the string associated with that node as a common prefix • Some nodes can have associated values (usually they mark the end of a word)
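The properties listed above can be sketched as a small character-level trie. This is a generic textbook version for illustration, not the MOPSI implementation; the class and method names are assumptions.

```python
class TrieNode:
    """One node of the trie; the path from the root spells a prefix."""

    def __init__(self):
        self.children = {}    # next character -> child TrieNode
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()  # the root represents the empty string
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, text):
        """Follow text from the root; return the node reached, or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


trie = Trie(["kauppakatu", "kauppatori", "torikatu"])
print(trie.contains("kauppakatu"), trie.contains("kauppa"), trie.has_prefix("kauppa"))
```

A lookup costs time proportional to the length of the word being checked, independent of how many street names are stored, and a non-matching word is usually rejected after a few characters.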
Street-name Prefix Trees • Our solution is to detect street names using prefix trees constructed from the gazetteer • A street-name prefix tree is built for each municipality used in the search • The user’s location and area of interest are known, therefore the prefix trees can be limited to the relevant municipalities
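One way to picture per-municipality prefix trees is a dictionary of tries, with only the tries for the relevant municipalities consulted for a given page. The sketch below uses nested dictionaries as trie nodes and a longest-match scan so that multi-word street names are found too. The street lists, the `END` marker and all function names are illustrative assumptions, and punctuation handling is left out.

```python
END = "$end"  # marker key: a street name ends at this node


def build_trie(street_names):
    """Build a nested-dict trie over the lowercased characters of each name."""
    root = {}
    for name in street_names:
        node = root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node[END] = name
    return root


def longest_street_match(trie, words, start):
    """Return (street_name, words_used) for the longest name starting at words[start], else None."""
    node = trie
    best = None
    for offset, word in enumerate(words[start:]):
        text = word.lower() if offset == 0 else " " + word.lower()
        for ch in text:
            node = node.get(ch)
            if node is None:
                return best
        if END in node:
            best = (node[END], offset + 1)
    return best


# Hypothetical toy street lists standing in for the gazetteer.
STREETS_BY_MUNICIPALITY = {
    "Joensuu": ["Kauppakatu", "Torikatu"],
    "Singapore": ["Orchard Road", "Orchard Boulevard"],
}
TRIES = {m: build_trie(names) for m, names in STREETS_BY_MUNICIPALITY.items()}


def find_streets(words, municipalities):
    """Search only the tries of the given municipalities; return (municipality, street, index)."""
    hits = []
    for municipality in municipalities:
        trie = TRIES[municipality]
        i = 0
        while i < len(words):
            match = longest_street_match(trie, words, i)
            if match:
                name, used = match
                hits.append((municipality, name, i))
                i += used
            else:
                i += 1
    return hits


words = "Shops on Orchard Road 238 and Kauppakatu 29".split()
print(find_streets(words, ["Singapore"]))
print(find_streets(words, ["Joensuu"]))
```

Restricting the search to the municipalities near the user keeps each trie small and avoids matching street names from unrelated places.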
Other solutions • Heuristic solution: relies on regular-expression matching; street names usually have similar endings or similar prefixes; a gazetteer is not needed (except for validation); can be fast but not precise • Brute-force solution: every word is checked against the gazetteer; an optimized version is used (the gazetteer is locally limited and preloaded into arrays)
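The two baseline approaches can be illustrated side by side. This is a sketch only: the suffix list (common Finnish street-name endings such as "katu" and "tie") and the function names are assumptions, since the slides do not give the actual expressions or array layout.

```python
import re

# Assumed suffix list for illustration; a real heuristic would need a fuller set.
STREET_SUFFIX = re.compile(r"^\w+(katu|tie|kuja|polku)$", re.IGNORECASE)


def heuristic_street_names(words):
    """Flag any word that merely looks like a street name (no gazetteer needed)."""
    return [w for w in words if STREET_SUFFIX.match(w)]


def brute_force_street_names(words, street_array):
    """Check every word against a preloaded array of street names (linear scan)."""
    lowered = [s.lower() for s in street_array]
    return [w for w in words if w.lower() in lowered]


words = "Kahvila Kauppakatu 29 rautatie".split()
print(heuristic_street_names(words))
print(brute_force_street_names(words, ["Kauppakatu", "Torikatu"]))
```

In this example the heuristic also flags "rautatie" (Finnish for railway), an ordinary noun that happens to end in "tie", which shows why it can be fast but imprecise. The brute-force scan avoids that false positive at the cost of comparing each word with every entry in the array.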
Experiments • 10 urban locations (blue) and 10 rural locations (orange) were used for testing • Testing was done using the MOPSI prototype for Finland and Singapore • Both commercial and non-commercial keywords were used.
Results • Average processing times for every solution were calculated • The prefix tree solution proved to be on average 57% faster and 10% more accurate than the heuristic solution and 10 times faster than the brute-force solution • The resulting solution improves the speed and quality of web-page georeferencing
Open Problems • Support approximate matching to avoid problems caused by misspellings • Improve the flexibility of the address detection algorithm • Implement a way to learn rules automatically from a hand-tagged example corpus
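To make the first open problem concrete, approximate matching could look roughly like the sketch below, which uses `difflib.get_close_matches` from the Python standard library. This is not the authors' proposal and does not use a prefix tree; the street list, the function name and the 0.85 similarity cutoff are assumptions chosen for illustration.

```python
from difflib import get_close_matches

# Hypothetical toy street list standing in for the gazetteer.
STREET_NAMES = ["kauppakatu", "torikatu", "siltakatu"]


def closest_street(word, street_names=STREET_NAMES, cutoff=0.85):
    """Return the most similar known street name, or None if nothing is close enough."""
    matches = get_close_matches(word.lower(), street_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest_street("Kaupakatu"))  # misspelling with one missing letter
print(closest_street("Helsinki"))   # not a street name
```

Comparing each word against the whole list is slow for a full gazetteer, so a practical solution would likely need to combine fuzzy matching with the prefix-tree lookup.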
http://cs.joensuu.fi/mopsi Thank you!