Place Names for Information Retrieval

Mining the Web for Knowledge of Vague Places
Chris Jones* with Paul Clough#, Hideo Joho+ and Ross Purves^
*Cardiff University, #University of Sheffield, +University of Glasgow, ^University of Zurich

Place Names for Information Retrieval
• Place names used to specify location on the internet, e.g. timetables, routing instructions, yellow pages, general web search
• Gazetteers used to recognise / disambiguate places and convert to quantitative footprint (point, MBR, polygon) if spatial search / mapping is required
• Common to use vernacular place names
  – not recorded in conventional gazetteers
  – often no precise boundary
• Need to acquire knowledge of vernacular names

Sources of place name knowledge
• Gazetteers (incomplete / inadequate)
• Maps with place annotation
  – Associate names with terrain features (valleys, mountain ranges, peaks…)
  – Difficult to derive extent of other vague places
• People
  – Interviews / questionnaires: traditional methods inefficient, but web offers great potential
• Text documents
  – Detailed descriptions difficult to interpret automatically, and not enough for mass data acquisition
  – Associations between places may be useful…

Mining text on the web for place name knowledge
• Documents that refer to vague places may also refer to more precise places inside them
• Places that occur frequently in association with a target named place may have a higher chance of being inside it
• Analyse frequency of occurrence of co-located places
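The frequency analysis described on this slide can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `count_cooccurring_places`, the sample documents and the tiny gazetteer are hypothetical, and plain substring matching stands in for proper geo-parsing.

```python
from collections import Counter


def count_cooccurring_places(documents, target, gazetteer):
    """Count, for each gazetteer place, the documents that mention it
    together with the target (vague) place name.

    Assumption: case-insensitive substring matching is good enough for
    illustration; a real system would use geo-parsing instead.
    """
    doc_counts = Counter()
    target_lower = target.lower()
    for text in documents:
        lowered = text.lower()
        if target_lower not in lowered:
            continue  # the document does not refer to the target place
        for place in gazetteer:
            if place.lower() in lowered:
                doc_counts[place] += 1
    return doc_counts


# Hypothetical sample data, for illustration only.
docs = [
    "Hotels in the Cotswolds: Burford, Stow-on-the-Wold and Cirencester.",
    "A weekend in the Cotswolds, staying in Burford.",
    "Train times from London to Oxford.",
]
gazetteer = ["Burford", "Cirencester", "Oxford", "London"]
counts = count_cooccurring_places(docs, "Cotswolds", gazetteer)
```

Places with higher counts are the stronger candidates for lying inside the target region; places that never co-occur with the target simply do not appear in the counter.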

Places mentioned in queries on the Cotswolds

Vague place: Highlands of Scotland
a) Density of unique places
b) Density of occurrences of the name of each place
c) Density of number of documents that mention each place
Main peak corresponds to Inverness, the main town in the Highlands
2nd peak in b) is due to mis-geocoding of “Cameron”

Vague place: Mittelland
Evidence for validity of method
• Human interpretations of the extent (“+” marks the “core”)
• Density surface of web mining results

Summary of procedure
• Submit web search engine queries referring to a target place
• Parse the resulting highest-ranking web pages for occurrences of place names
• Geocode (“ground”) place names with coordinates
• Create geometric model (surface model) and extract approximate boundary

Formulating appropriate web queries
• Region only, e.g. “Rocky Mountains”
  – Retrieves all documents mentioning the name
• Region + Concept, e.g. “Hotels in Cotswolds”
  – Tends to retrieve directory pages listing places associated with the target place
• Region and lexical pattern (trigger phrase), e.g. “Midwest towns such as”; “in the South of France”
  – Reduces the number of relevant documents retrieved but can work well for those documents
  – Problem of not enough places for statistical analysis
• Region + Concept produces the highest numbers of co-associated places in top-ranking documents
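The three query styles on this slide can be generated from simple templates. The sketch below is only illustrative: the function `build_queries`, its parameter names and the dictionary keys are hypothetical, and the default concept and trigger phrase are taken from the slide's own examples.

```python
def build_queries(region, concept="hotels", trigger="towns such as"):
    """Return one quoted phrase query per query style.

    Assumption: the search engine treats a double-quoted string as an
    exact phrase query.
    """
    return {
        "region_only": f'"{region}"',
        "region_concept": f'"{concept} in {region}"',
        "region_pattern": f'"{region} {trigger}"',
    }


queries = build_queries("Midwest")
```

For a target such as the Cotswolds, `build_queries("Cotswolds")["region_concept"]` gives the “hotels in <place>” form used in the experiments.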

Geo-parsing
• Use Named Entity Recognition (NER) methods to identify names
• Gazetteers used to recognise place names
• Distinguish between geographical and non-geographical uses – many place names occur in organisation names and in people’s names
• Use rules / patterns to identify these cases, e.g. <Forename><Placename> indicates person’s name
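A gazetteer lookup combined with the <Forename><Placename> rule might look like the following sketch. The gazetteer and forename lists are tiny hypothetical stand-ins, only single-word place names are handled, and a real NER system would be considerably more elaborate.

```python
import re

# Hypothetical lookup lists, for illustration only.
GAZETTEER = {"Devon", "Inverness", "Lincoln"}
FORENAMES = {"Abraham", "David", "James"}


def find_place_names(text):
    """Return gazetteer names used geographically, in order of appearance.

    A gazetteer match directly preceded by a known forename is treated
    as part of a person's name and skipped.
    """
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", text)
    places = []
    for i, token in enumerate(tokens):
        if token not in GAZETTEER:
            continue
        if i > 0 and tokens[i - 1] in FORENAMES:
            continue  # <Forename><Placename> pattern: person, not place
        places.append(token)
    return places


found = find_place_names("Abraham Lincoln never stayed in Devon")
```

Here `found` keeps “Devon” but drops “Lincoln”, whereas the same name in “Hotels in Lincoln” would be kept.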

Geocoding
• Need to assign coordinate to name
• But many places have the same name
• Crude approach:
  – default interpretation (assume most commonly occurring instance)
• More sophisticated:
  – search for co-occurrence of parent places that establish uniqueness
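The two disambiguation strategies can be combined in a small sketch: look for a parent place in the surrounding text first, and fall back to a default interpretation otherwise. The candidate table, its `rank` field (standing in for "most commonly occurring instance") and the approximate coordinates are illustrative assumptions, not data from the talk.

```python
# Hypothetical gazetteer entries; coordinates are approximate and the
# "rank" field is an assumed ordering of default interpretations.
CANDIDATES = {
    "Newport": [
        {"parent": "Wales", "lat": 51.58, "lon": -3.00, "rank": 1},
        {"parent": "Isle of Wight", "lat": 50.70, "lon": -1.29, "rank": 2},
    ],
}


def geocode(name, context_text, candidates=CANDIDATES):
    """Pick one gazetteer entry for an ambiguous place name.

    First try the candidates whose parent place is mentioned in the
    context; if none is mentioned, use the default (lowest rank).
    """
    options = candidates.get(name, [])
    if not options:
        return None
    lowered = context_text.lower()
    for option in options:
        if option["parent"].lower() in lowered:
            return option  # a parent place co-occurs with the name
    return min(options, key=lambda o: o["rank"])  # default interpretation


by_parent = geocode("Newport", "Ferry to the Isle of Wight, then a bus to Newport")
by_default = geocode("Newport", "Hotels in Newport")
```

If several parents are mentioned, this sketch simply returns the first matching candidate; a fuller treatment would need a way to weigh competing evidence.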

Experiments
• Queries of the form “hotels in <place>” submitted to Google via its API
• Initially used target places that are precise (English counties) and compared results with the known exact boundary
• Then used several target vague places and evaluated qualitatively

Devon (county)
• Density surface
• Density surface at three threshold levels (1, 0.5, 0.25 points per cell)
• Distribution of associated places
• Note: some places wrongly geocoded
• Thresholded boundary compared with actual boundary
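Thresholding a density surface can be sketched as follows. This is an assumption-laden illustration rather than the surface model used in the talk: points are counted per grid cell, a 3x3 moving average is used as a stand-in smoothing step, and cells at or above a threshold (in points per cell) form the approximate region. All function names and sample points are hypothetical.

```python
def density_grid(points, min_x, min_y, cell_size, n_cols, n_rows):
    """Count the points falling in each cell of a regular grid."""
    grid = [[0.0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        col = int((x - min_x) // cell_size)
        row = int((y - min_y) // cell_size)
        if 0 <= col < n_cols and 0 <= row < n_rows:
            grid[row][col] += 1.0
    return grid


def smooth(grid):
    """3x3 moving average; edge cells average over in-grid neighbours."""
    n_rows, n_cols = len(grid), len(grid[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            window = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(n_rows, r + 2))
                for cc in range(max(0, c - 1), min(n_cols, c + 2))
            ]
            out[r][c] = sum(window) / len(window)
    return out


def threshold_cells(grid, threshold):
    """Return the (row, col) cells whose density reaches the threshold."""
    return {
        (r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value >= threshold
    }


# Hypothetical geocoded points on a 4 x 4 grid of unit cells.
points = [(0.5, 0.5), (0.6, 0.4), (1.5, 0.5), (3.5, 3.5)]
surface = smooth(density_grid(points, 0.0, 0.0, 1.0, 4, 4))
core = threshold_cells(surface, 0.5)
wider = threshold_cells(surface, 0.25)
```

Lowering the threshold can only grow the selected region, so `core` is always a subset of `wider`; an isolated, possibly mis-geocoded point may enter the region only at the lower threshold.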

4 precise places + thresholded boundaries

Other vague places: Mid Wales
Mid Wales approximated with thresholds of 0.5, 0.25 and 0.12

Vague place: Cotswolds
Cotswolds (large region in centre) with thresholds of 0.5, 0.25 and 0.125 points per grid cell.

Vague place: Strathspey
Strathspey with a threshold of 0.5 points per grid cell.

Conclusions
• Method has potential, but is currently flawed
• Thresholding of surfaces needs to be automated
  – Possibly machine learning methods with training data for different types of places
• Quality of geocoding needs to be improved
• Problem of areas that have no settlements
  – Use queries that target topographic features (hills, valleys etc.) as well as “Hotels” or other types of settlement; preliminary experiments are promising for the UK
  – Use terrain modelling methods to extract regions associated with valleys, peaks, mountain ranges…
• Need systematic evaluation methods
• Could elicit large quantities of place knowledge from people via web questionnaires
