Mining the Web for Knowledge of Vague Places
Chris Jones* with Paul Clough#, Hideo Joho+ and Ross Purves^
*Cardiff University, #University of Sheffield, +University of Glasgow, ^University of Zurich
Place Names for Information Retrieval
• Place names are used to specify location on the internet, e.g. timetables, routing instructions, yellow pages, general web search
• Gazetteers are used to recognise / disambiguate places and convert them to a quantitative footprint (point, MBR, polygon) if spatial search / mapping is required
• It is common to use vernacular place names
  – not recorded in conventional gazetteers
  – often no precise boundary
• Need to acquire knowledge of vernacular names
Sources of place name knowledge
• Gazetteers (incomplete / inadequate)
• Maps with place annotation
  – Associate names with terrain features (valleys, mountain ranges, peaks…)
  – Difficult to derive extent of other vague places
• People
  – Interviews / questionnaires: traditional methods are inefficient, but the web offers great potential
• Text documents
  – Detailed descriptions are difficult to interpret automatically, and there are not enough of them for mass data acquisition
  – Associations between places may be useful…
Mining text on the web for place name knowledge
• Documents that refer to vague places may also refer to more precise places inside them
• Places that occur frequently in association with a target named place may have a higher chance of being inside it
• Analyse the frequency of occurrence of co-located places
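The frequency analysis above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `count_cooccurring_places`, the sample documents and the candidate list are all assumptions, and plain substring matching stands in for proper place name recognition.

```python
from collections import Counter


def count_cooccurring_places(documents, target, candidate_places):
    """Count candidate places in documents that mention the target place.

    Returns two counters: total name occurrences per place, and the number
    of documents that mention each place. Matching is naive, case-insensitive
    substring search (an assumption made to keep the sketch short).
    """
    target_lc = target.lower()
    occurrences = Counter()
    doc_frequency = Counter()
    for text in documents:
        text_lc = text.lower()
        if target_lc not in text_lc:
            continue  # only documents that refer to the target place count
        for place in candidate_places:
            n = text_lc.count(place.lower())
            if n:
                occurrences[place] += n
                doc_frequency[place] += 1
    return occurrences, doc_frequency


# Hypothetical sample documents, for illustration only.
docs = [
    "Hotels in the Cotswolds: Stow-on-the-Wold, Burford and Cirencester.",
    "Visit the Cotswolds and stay in Burford. Burford has many inns.",
    "Trains from London to Oxford.",
]
occ, df = count_cooccurring_places(
    docs, "Cotswolds", ["Burford", "Cirencester", "Oxford"]
)
```

Keeping occurrence counts and document counts separate allows either to be used as the weight of a place when its frequency of association with the target is analysed.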
Vague place: Highlands of Scotland
[Figure: three density surfaces, panels a, b and c]
a) Density of unique places
b) Density of occurrences of the name of each place
c) Density of number of documents that mention each place
• Main peak corresponds to Inverness, the main town in the Highlands
• 2nd peak in b) is due to mis-geocoding of “Cameron”
Vague place: Mittelland
Evidence for validity of method
[Figure: human interpretations of the extent (+ is the “core”)]
[Figure: density surface of web mining results]
Summary of procedure
• Submit web search engine queries referring to a target place
• Parse the resulting highest ranking web pages for occurrences of place names
• Geocode (“ground”) place names with coordinates
• Create a geometric model (surface model) and extract an approximate boundary
Formulating appropriate web queries
• Region only, e.g. “Rocky Mountains”
  – Retrieves all documents mentioning the name
• Region + Concept, e.g. “Hotels in Cotswolds”
  – Tends to retrieve directory pages listing places associated with the target place
• Region and lexical pattern (trigger phrase), e.g. “Midwest towns such as”; “in the South of France”
  – Reduces the number of relevant documents retrieved but can work well for those documents
  – Problem of not enough places for statistical analysis
• Region + Concept produces the highest numbers of co-associated places in top ranking documents
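As a small illustration of the three query styles, the sketch below builds one quoted query string of each kind for a given region. The function name `build_queries`, the dictionary keys and the default concept and trigger phrase are assumptions chosen to mirror the examples on the slide; no particular search engine API is implied.

```python
def build_queries(region, concept="hotels", trigger="towns such as"):
    """Return one phrase query per query style for a target region.

    The phrases are wrapped in double quotes, the usual way of asking a
    search engine for an exact phrase match.
    """
    return {
        "region_only": f'"{region}"',
        "region_concept": f'"{concept} in {region}"',
        "region_trigger": f'"{region} {trigger}"',
    }


queries = build_queries("Cotswolds")
midwest = build_queries("Midwest")
```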
Geo-parsing
• Use Named Entity Recognition (NER) methods to identify names
• Gazetteers are used to recognise place names
• Distinguish between geographical and non-geographical uses: many place names occur in organisation names and in people’s names
• Use rules / patterns to identify these cases, e.g. <Forename><Placename> indicates a person’s name
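A toy version of the gazetteer lookup combined with the <Forename><Placename> rule might look like the following. The gazetteer, the forename list and the function name `find_place_names` are hypothetical, and only single-word names are handled; a real geo-parser would use a full NER component.

```python
import re

# Hypothetical toy resources, for illustration only.
GAZETTEER = {"Inverness", "Cameron", "Devon", "Exeter"}
FORENAMES = {"James", "Mary", "David"}

# A word: a letter followed by letters, apostrophes or hyphens.
TOKEN = re.compile(r"[A-Za-z][A-Za-z'-]*")


def find_place_names(text):
    """Return gazetteer names found in text, skipping likely person names."""
    tokens = TOKEN.findall(text)
    places = []
    for i, tok in enumerate(tokens):
        if tok not in GAZETTEER:
            continue
        # Rule <Forename><Placename>: treat the match as a person's surname.
        if i > 0 and tokens[i - 1] in FORENAMES:
            continue
        places.append(tok)
    return places


found = find_place_names(
    "David Cameron stayed in Inverness, not far from Exeter."
)
```

Here "Cameron" is dropped because it follows a known forename, which is the kind of non-geographical use the rule is meant to filter out.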
Geocoding
• Need to assign a coordinate to each name
• But many places have the same name
• Crude approach:
  – default interpretation (assume the most commonly occurring instance)
• More sophisticated:
  – search for co-occurrence of parent places that establish uniqueness
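The two approaches can be combined in one short sketch: look for a parent place in the surrounding text, and fall back to a default interpretation otherwise. Everything here is an assumption made for illustration: the function name `geocode`, the toy gazetteer, the rounded coordinates and the `prior` values that stand in for "most commonly occurring instance".

```python
# Hypothetical toy gazetteer: two places called Newport, with approximate
# coordinates and an assumed prior (higher = more commonly intended).
GAZETTEER = {
    "Newport": [
        {"parent": "Wales", "lat": 51.59, "lon": -3.00, "prior": 0.7},
        {"parent": "Isle of Wight", "lat": 50.70, "lon": -1.29, "prior": 0.3},
    ],
}


def geocode(name, context_text):
    """Return (lat, lon) for name, or None if it is not in the gazetteer."""
    candidates = GAZETTEER.get(name, [])
    if not candidates:
        return None
    context_lc = context_text.lower()
    # More sophisticated: a parent place in the context establishes uniqueness.
    matches = [c for c in candidates if c["parent"].lower() in context_lc]
    if len(matches) == 1:
        chosen = matches[0]
    else:
        # Crude approach: default to the candidate with the highest prior.
        chosen = max(candidates, key=lambda c: c["prior"])
    return chosen["lat"], chosen["lon"]


by_parent = geocode("Newport", "Ferry to the Isle of Wight, then bus to Newport")
by_default = geocode("Newport", "Hotels in Newport")
```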
Experiments
• Queries of the form “hotels in <place>” were submitted to Google via its API
• Initially used target places that are precise (English counties) and compared the results with the known exact boundary
• Then used several vague target places and evaluated the results qualitatively
Devon (county)
[Figure panels]
• Density surface
• Density surface at three threshold levels (1, 0.5, 0.25 points per cell)
• Distribution of associated places (note: some places wrongly geocoded)
• Thresholded boundary compared with actual boundary
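One simple way to obtain a density surface in "points per cell" and threshold it is sketched below. It is only an assumed stand-in for the surface model used in the talk: points are binned into a regular grid, smoothed with a 3x3 mean filter (which is what makes fractional values such as 0.5 or 0.25 possible), and the cells at or above a threshold are returned. All function names and the sample points are hypothetical.

```python
def density_grid(points, min_x, min_y, cell_size, n_cols, n_rows):
    """Count points per grid cell; points outside the grid are ignored."""
    grid = [[0.0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        col = int((x - min_x) // cell_size)
        row = int((y - min_y) // cell_size)
        if 0 <= col < n_cols and 0 <= row < n_rows:
            grid[row][col] += 1.0
    return grid


def smooth(grid):
    """3x3 mean filter; cells beyond the grid edge count as zero."""
    n_rows, n_cols = len(grid), len(grid[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        total += grid[rr][cc]
            out[r][c] = total / 9.0
    return out


def cells_above(grid, threshold):
    """Return the set of (row, col) cells whose density is >= threshold."""
    return {
        (r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value >= threshold
    }


# Hypothetical geocoded points on a 5 x 5 grid of unit cells.
points = [(2.2, 2.3), (2.6, 2.1), (2.8, 2.9), (1.5, 2.5), (3.5, 2.5)]
surface = smooth(density_grid(points, 0.0, 0.0, 1.0, 5, 5))
core = cells_above(surface, 0.5)
wider = cells_above(surface, 0.25)
```

Lower thresholds give larger, nested regions, so the choice of threshold directly controls how generous the approximate boundary is.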
Other vague places: Mid Wales
[Figure: Mid Wales approximated with thresholds of 0.5, 0.25 and 0.12]
Vague place: Cotswolds
[Figure: Cotswolds (large region in centre) with thresholds of 0.5, 0.25 and 0.125 points per grid cell]
Vague place: Strathspey
[Figure: Strathspey with a threshold of 0.5 points per grid cell]
Conclusions
• Method has potential, but is currently flawed
• Thresholding of surfaces needs to be automated
  – ? machine learning methods with training data for different types of places
• Quality of geocoding needs to be improved
• Problem of areas that have no settlements
  – Use queries that target topographic features (hills, valleys etc.) as well as “Hotels” or other types of settlement; preliminary experiments are promising for the UK
  – Use terrain modelling methods to extract regions associated with valleys, peaks, mountain ranges…
• Need systematic evaluation methods
• Could elicit large quantities of place knowledge from people via web questionnaires