SECOGIS – ER 2009, Gramado – RS – Brazil, 13th November 2009
A Model for Geographic Knowledge Extraction on Web Documents
Cláudio E. C. Campelo and Cláudio de Souza Baptista
University of Campina Grande, Computer Science Department, Information Systems Laboratory
http://www.lsi.dsc.ufcg.edu.br
Cláudio Baptista, UFCG http://lsi.dsc.ufcg.edu.br
Agenda
• Introduction
• Main Challenges
• Detection of Geographic References
• The Geographic Scope
• GeoSEn Prototype
  • Architecture
  • GUI
• Experiments
• Conclusion and Future Work
Introduction
• Web: users need to search by geographic context;
• Traditional search engines: search based on keywords only;
• Example:
  • A Web document: “...With the arrival of the industry in Gramado, one thousand new jobs for Java programmers will be created...”;
  • User query: “Java programmer jobs Brazil”;
  • The document will not be retrieved by this query, because it never mentions “Brazil”!
Introduction
• What is the geographic context of a Web document?
  • The place where the information was created?
  • The places mentioned in the document content?
  • The location of the people most interested in the information?
  • etc.
• Many documents have this context:
  • A study in Portugal that considered only occurrences of the names of Portuguese cities (308 in total):
    • About 4 million pages analyzed;
    • 2.2 references per document;
    • 4% of the submitted queries referred to one of those cities.
Main Challenges
• Detection of geographic references in documents;
• Modeling of the geographic scope of documents;
• Relevance ranking according to geographic context;
• Efficient indexing techniques that cope with both the textual and the spatial dimension;
• Usable interfaces that let users work with both dimensions.
Detection of Geographic References
• Aim: to identify document features that can be mapped to a geographic place name;
• Challenge: eliminating ambiguities, e.g.:
  • Places named after a thing (e.g. Gramado, Canela);
  • Places named after a person (e.g. Garibaldi);
  • Places with the same name and the same type (e.g. Cachoeirinha-PE and Cachoeirinha-RS);
  • Places with the same name and different types (e.g. the city of Rio de Janeiro and the state of Rio de Janeiro);
  • Places and gentilics with the same name (e.g. the city of Paulista-PE and “paulista”, a person born in São Paulo).
Detection of Geographic References
• Another example of ambiguity:
  • São Paulo as a state
  • São Paulo as a city
  • São Paulo as a football team
  • São Paulo as the name of a hospital
  • São Paulo as the saint!
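The ambiguity cases above can be made concrete with a toy gazetteer lookup. This is only a sketch: the `GAZETTEER` entries, the tuple layout, and the function names are illustrative assumptions, not the GeoSEn data model. It shows that a single surface form can map to several candidate places, which is what a disambiguation step has to resolve.

```python
from collections import defaultdict

# Hypothetical gazetteer rows: (name, place type, containing place).
# Illustrative only; not the data used by GeoSEn.
GAZETTEER = [
    ("São Paulo", "city", "São Paulo (state)"),
    ("São Paulo", "state", "Southeast"),
    ("Rio de Janeiro", "city", "Rio de Janeiro (state)"),
    ("Rio de Janeiro", "state", "Southeast"),
    ("Cachoeirinha", "city", "Pernambuco"),
    ("Cachoeirinha", "city", "Rio Grande do Sul"),
    ("Gramado", "city", "Rio Grande do Sul"),
]


def build_index(entries):
    """Map each case-folded name to every place that carries it."""
    index = defaultdict(list)
    for name, place_type, parent in entries:
        index[name.casefold()].append((place_type, parent))
    return index


def candidates(index, term):
    """Return all candidate places for a term (empty list if unknown)."""
    return index.get(term.casefold(), [])


INDEX = build_index(GAZETTEER)

# Same name, same type: two cities called Cachoeirinha.
print(candidates(INDEX, "Cachoeirinha"))
# Same name, different types: a city and a state.
print(candidates(INDEX, "São Paulo"))
```

A lookup like this only enumerates candidates; non-geographic senses (the football team, the hospital, the saint) never appear in a gazetteer and have to be ruled out by other evidence in the text.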
Detection of Geographic References
• Sources examined for detection: page content, page title, URL;
• Types of detected places: the whole spatial hierarchy (from city to region);
• Types of detected references: place names, postal codes, telephone area codes, gentilics.
Definitions
• Confidence Rate (CR): the probability that a given reference is a valid place name.
• Confidence Factor (CF): a measure associated with each feature analyzed during the detection of a geographic reference.
(Formula relating CR to the N confidence factors CF; garbled in extraction: “CR 1 N CF”)
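The exact formula on this slide is not recoverable, so the following is a sketch under a stated assumption: that CR is the arithmetic mean of N confidence factors, each in the range [0, 1]. The function name and the sample values are hypothetical; the factor names CFST, CFTS, CFCROSS and CFFMT are the ones introduced on the following slides.

```python
def confidence_rate(confidence_factors):
    """Combine confidence factors into one confidence rate.

    ASSUMPTION: CR is the plain mean of the N factors. The original
    slide's formula may differ (for example, it may be weighted).
    """
    if not confidence_factors:
        return 0.0
    for cf in confidence_factors:
        if not 0.0 <= cf <= 1.0:
            raise ValueError(f"confidence factor out of range: {cf}")
    return sum(confidence_factors) / len(confidence_factors)


# Hypothetical factor values for one candidate reference.
factors = {"CFST": 0.9, "CFTS": 0.6, "CFCROSS": 0.8, "CFFMT": 0.7}
cr = confidence_rate(list(factors.values()))
print(f"CR = {cr:.2f}")
```

Under this assumption every factor has the same influence; a weighted mean would be the natural variation if some features proved more reliable than others.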
Confidence Factor
• CFST: analyzes the occurrence of special terms (STs) associated with geographic references;
• Examples of STs: “in” (e.g. “in Gramado”); “city” (e.g. “city of São Paulo”); “ZIP” (e.g. “ZIP: 58109-000”);
• Data stored for each special term:
  • Term;
  • Type of geographic reference (zip code, telephone area code, place name, etc.);
  • Type of place (city, state, region);
  • Minimum distance (DMIN);
  • Maximum distance (DMAX);
  • Maximum confidence grade (CMAX).
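The stored fields listed above map naturally onto a small record type. The sketch below is hypothetical: the slide does not say how distance is measured or how DMIN, DMAX and CMAX turn into a score, so it assumes distance is counted in tokens and that confidence falls linearly from CMAX at DMIN to zero just beyond DMAX.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpecialTerm:
    term: str
    reference_type: str        # e.g. "place name", "zip code"
    place_type: Optional[str]  # e.g. "city"; None if not restricted
    dmin: int                  # minimum distance (ASSUMED: in tokens)
    dmax: int                  # maximum distance (ASSUMED: in tokens)
    cmax: float                # maximum confidence grade


def cf_st(special: SpecialTerm, distance: int) -> float:
    """ASSUMED scoring: cmax at dmin, decreasing linearly, 0 outside the range."""
    if distance < special.dmin or distance > special.dmax:
        return 0.0
    span = special.dmax - special.dmin + 1
    return special.cmax * (1 - (distance - special.dmin) / span)


def score_candidate(tokens, candidate_index, special):
    """Best CFST score over occurrences of the special term before the candidate."""
    best = 0.0
    for i, token in enumerate(tokens[:candidate_index]):
        if token.casefold() == special.term.casefold():
            best = max(best, cf_st(special, candidate_index - i))
    return best


IN = SpecialTerm("in", "place name", None, dmin=1, dmax=3, cmax=0.8)
tokens = "new jobs will be created in Gramado".split()
print(score_candidate(tokens, tokens.index("Gramado"), IN))
```

Here “in” sits directly before “Gramado”, so the candidate receives the full CMAX of the (hypothetical) rule; a special term further away would contribute less, and one beyond DMAX nothing.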
Confidence Factor
• CFTS: considers the probability that a term is a geographic reference, estimated using a traditional search engine.
Confidence Factor
• CFCROSS: analyzes the occurrence of cross references based on topological relationships (inside, contains, etc.);
• CFFMT: evaluates the syntax used to describe the geographic references:
  • Abbreviation of place names (R. de Janeiro, RJ);
  • Use of uppercase in place names;
  • Telephone format: (083)-999-3456;
  • Postal code format: 58.104-867.
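The format checks behind CFFMT can be sketched with regular expressions. The patterns below are assumptions derived only from the examples on these slides (postal codes written as 58109-000 or 58.104-867, an area code in parentheses such as (083)); they are not the rules used in GeoSEn, and the capitalisation check with its particle list is likewise illustrative.

```python
import re

# ASSUMED patterns, inferred from the slide examples only.
CEP_RE = re.compile(r"\b(\d{2})\.?(\d{3})-(\d{3})\b")  # 58109-000, 58.104-867
AREA_CODE_RE = re.compile(r"\(\s*0?(\d{2})\s*\)")      # (083), ( 083), (83)

# Lowercase particles commonly found inside Portuguese place names.
PARTICLES = {"de", "do", "da", "dos", "das", "e"}


def find_postal_codes(text):
    """Return postal codes found in text, normalised to 8 digits."""
    return ["".join(m.groups()) for m in CEP_RE.finditer(text)]


def find_area_codes(text):
    """Return two-digit telephone area codes written in parentheses."""
    return [m.group(1) for m in AREA_CODE_RE.finditer(text)]


def is_capitalised(name):
    """True if every word, apart from lowercase particles, starts uppercase."""
    words = name.split()
    return bool(words) and all(
        w in PARTICLES or w[:1].isupper() for w in words
    )


print(find_postal_codes("ZIP: 58109-000 or 58.104-867"))
print(find_area_codes("Phone: (083)-999-3456"))
print(is_capitalised("Rio de Janeiro"), is_capitalised("rio de janeiro"))
```

A match by one of these patterns would raise the confidence that the surrounding text is a geographic reference; how much it raises it is a design choice the slides do not specify.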
Modeling of the Geographic Scope
• A document may be associated with one or more places;
• A geographic scope may include places that are not mentioned directly in the document (geographic expansion);
• Each place in the scope has an associated relevance value.
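Geographic expansion can be illustrated by propagating relevance up a containment hierarchy. This is a minimal sketch, not the authors' model: the `PARENT` map reuses the example places from the GeoSEn overview slide, while the halving `decay` and the keep-the-maximum rule are assumptions chosen only to make the idea concrete.

```python
# Child -> containing place, using the example hierarchy from the deck.
PARENT = {
    "Gramado": "Gramado-Canela",
    "Gramado-Canela": "Metropolitana de Porto Alegre",
    "Metropolitana de Porto Alegre": "Rio Grande do Sul",
    "Rio Grande do Sul": "South",
}


def expand_scope(detected, parent=PARENT, decay=0.5):
    """Expand a scope {place: relevance} to the containing places.

    ASSUMPTION: relevance is multiplied by `decay` at each level up, and
    when a place is reached more than once the highest value is kept.
    """
    scope = dict(detected)
    for place, relevance in detected.items():
        current, value = place, relevance
        while current in parent:
            current = parent[current]
            value *= decay
            scope[current] = max(scope.get(current, 0.0), value)
    return scope


scope = expand_scope({"Gramado": 0.8})
for place, relevance in scope.items():
    print(f"{place}: {relevance:.2f}")
```

With such an expanded scope, a document that mentions only Gramado can still match a query restricted to Rio Grande do Sul or to the South region, with a lower relevance than a query for Gramado itself.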
Geographic Dispersion Rate
• Another factor used in the composition of the geographic relevance value;
• Hypothesis: dispersed references may characterize regions that share common features (e.g. cultural, economic, social).
[Figure with panels (a) and (b)]
GeoSEn – an overview
• Geographic Search Engine:
  • Indexes a subset of the Brazilian Web;
  • Deals with 6,291 places in Brazil, organized in a five-level hierarchy, from city to region:
    • Region: e.g. South
    • State: e.g. Rio Grande do Sul
    • MesoRegion: e.g. Metropolitana de Porto Alegre
    • MicroRegion: e.g. Gramado-Canela
    • Municipality: e.g. Gramado
GeoSEn - Architecture
[Architecture diagram]
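Query Example
• Example of a query using a user-defined area of interest:
-- Select the top-most places inside the area of interest: places within
-- specified_geometry that are not contained in another such place.
SELECT plc1.id
FROM places plc1
WHERE within(plc1.geometry, specified_geometry)
  AND NOT EXISTS (
    SELECT plc2.id
    FROM places plc2
    WHERE plc2.id <> plc1.id  -- a place is trivially within itself
      AND within(plc2.geometry, specified_geometry)
      AND within(plc1.geometry, plc2.geometry))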
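The logic of the query above can be traced in a few lines of Python. This sketch substitutes axis-aligned bounding boxes for real geometries, which is a simplifying assumption; the `Box` type, the place identifiers and the coordinates are all hypothetical. It keeps the places that lie within the area of interest and are not themselves within another place that also lies in that area.

```python
from typing import Dict, List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box standing in for a real geometry."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def within(inner: Box, outer: Box) -> bool:
    """True if `inner` lies entirely inside `outer` (boundaries included)."""
    return (outer.min_x <= inner.min_x and outer.min_y <= inner.min_y
            and inner.max_x <= outer.max_x and inner.max_y <= outer.max_y)


def topmost_places(places: Dict[str, Box], area: Box) -> List[str]:
    """Ids of places inside `area` that no other place inside `area` contains."""
    inside = {pid: geom for pid, geom in places.items() if within(geom, area)}
    return sorted(
        pid for pid, geom in inside.items()
        if not any(other != pid and within(geom, other_geom)
                   for other, other_geom in inside.items())
    )


places = {
    "state": Box(0, 0, 10, 10),
    "city_a": Box(1, 1, 2, 2),      # inside "state"
    "city_b": Box(12, 1, 13, 2),    # in the area, outside "state"
    "far": Box(50, 50, 51, 51),     # outside the area of interest
}
area_of_interest = Box(-1, -1, 20, 20)
print(topmost_places(places, area_of_interest))
```

Selecting only the top-most places avoids listing every city of a state when the state itself already falls inside the user's area; the contained places remain reachable through the hierarchy.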
Experiments
• Experiments using 66,531 indexed documents;
• 5 classes: .edu, .gov, blogs, tourism, arts;
• Detection of terms:
  • Web documents analyzed manually;
  • Documents with strong ambiguities created for the test bed.
Conclusion
• We have presented a heuristic-based approach to implementing a GIR system;
• The techniques presented may be combined with other known techniques;
• Precomputed relevance values may be used to simplify the search process.
Future Work
• Retrieval of georeferenced images and videos;
• Recognition of other kinds of places;
• Integration of other data sources;
• Evaluation using large data collections.
Thank you very much! Questions?