Extraction of Address Data from Unstructured Text using Free Knowledge Resources. Sebastian Schmidt, M.Sc.

Presentation Transcript


  1. Extraction of Address Data from Unstructured Text using Free Knowledge Resources • Sebastian Schmidt, M.Sc. • Image source: http://upload.wikimedia.org/wikipedia/en/7/7f/World_Map_flat_Mercator.png, http://www.frdc.at/hp_frdc_pictures/frdc_dart_pfeil.jpg.gif

  2. Outline • Motivation (Why Business Address Data?; Application Scenarios) • Structure of German Addresses • Solution • Evaluation (Methodology; Results; Challenges) • Related Work • Adaptation • Conclusion and Future Work • Image source: http://www.yourshiningredthread.com/wp-content/uploads/2012/08/WavyThreadImage.jpg

  3. 1. Motivation: General • Text documents are everywhere around us (e.g. 189 million Web sites [1]) • All of them contain lots of valuable information • The Semantic Web is a vision of annotating information with its meaning • Only 12% of Web sites make use of any semantic annotation such as RDFa, microformats or Microdata [Mühleisen12] → Most content remains incomprehensible to machines • → Tools are required that allow automatic identification of certain information in text • Image source: http://www.netresearch.de/blog/wp-content/uploads/2009/04/semantic_web_day.jpg • [1] http://news.netcraft.com/archives/2013/08/09/august-2013-web-server-survey.html

  4. 1. Motivation: Business Address Data • Addresses consist of several attributes • Extracted data is only valuable if all attributes have been identified correctly • Sequentiality can be exploited • Business addresses are highly volatile • Need to track them automatically • Business address data is of interest in various domains

  5. 1. Motivation: Application Scenarios • Semantic Web • Web sites aggregating existing content: often rely on addresses given on Web sites, e.g. restaurant recommendations, job search engines, product search engines • Address repositories: can be created automatically • Location-based services: can gain from the population of geographical repositories with business information • Image source: http://www.thedigitalbus.com/wp-content/uploads/2011/09/Location-Based-Services.jpg

  7. 2. Structure of German Addresses • Pattern: <Company Name> / <Street> <Street Number> / <Postal Code> <City> • Company name: no common pattern; variable length; the type of business entity can be part of the name • Street: a number of common suffixes, but many exceptions; spelling varies a lot (abbreviations); variable length • Street number: single digit or number; can be suffixed by a character • Postal code: five digits; might be prefixed by “D-” • City: no common structure; some suffixed indicators, but not for all cities; different naming schemes for a single city, e.g. “Frankfurt”, “Frankfurt/Main”, “Ffm”, … • Overall: a general structure exists, but with many exceptions: addresses fragmented by other attributes (e.g. the name of a company is not mentioned next to the address but somewhere else on a Web site), all attributes within one line, …

  9. 3. Solution: Overview • Approach: pre-processing; identification of single attributes, with some dependencies defined by patterns; afterwards, aggregation of the results into complete addresses • Pipeline figure: Pre-Processing → Identification of Postal Codes, Cities, Street Numbers, Street Names and Company Names → Aggregation

  10. 3. Solution: Steps • Preprocessing: stripping of HTML markup; data cleaning; line splitting; tokenization; Part-of-Speech (POS) tagging • Identification of single attributes: independent of previous identifications; only some dependencies, for improving precision; leads to a large number of candidates for each attribute
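
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the implementation from the talk: the class `_TextExtractor`, the function `preprocess`, the set of block-level tags and the token pattern `TOKEN_RE` are all assumptions made for this example, and POS tagging is left out because it needs an external tagger.

```python
import re
from html.parser import HTMLParser

# Hypothetical token pattern: words (optionally hyphenated, optionally with
# a trailing dot as in "Str.") or single punctuation characters.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*\.?|[^\w\s]")


class _TextExtractor(HTMLParser):
    """Collects text nodes and turns block-level tags into line breaks."""

    BLOCK_TAGS = {"br", "p", "div", "li", "tr", "h1", "h2", "h3", "address"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def preprocess(html):
    """Strip markup, clean whitespace, split into lines, tokenize each line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = []
    for raw in "".join(parser.parts).splitlines():
        cleaned = " ".join(raw.split())  # collapse runs of whitespace
        if cleaned:
            lines.append(TOKEN_RE.findall(cleaned))
    return lines
```

Each returned line is a list of tokens, which is the shape the attribute identifiers in the following steps would work on.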

  11. 3. Solution: Steps • Identification of postal codes: regular expression • Identification of cities: (a) terms within a certain distance (3 tokens) of a postal code candidate that exist in a gazetteer (assembled from OpenStreetMap; 28,087 entries); (b) capitalized terms that are directly preceded by a postal code candidate
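
A small sketch of these two identification rules, under stated assumptions: the slides do not give the actual regular expression, so `POSTAL_CODE_RE` is a guess based on the "five digits, optionally prefixed by D-" description, and `CITY_GAZETTEER` is a three-entry stand-in for the OpenStreetMap-derived gazetteer. The function names are hypothetical.

```python
import re

# Assumed pattern: five digits, optionally prefixed by "D-".
POSTAL_CODE_RE = re.compile(r"(?:D-)?\d{5}")

# Tiny stand-in for the gazetteer assembled from OpenStreetMap.
CITY_GAZETTEER = {"Frankfurt", "Darmstadt", "Berlin"}


def find_postal_codes(tokens):
    """Return the indices of tokens that look like German postal codes."""
    return [i for i, tok in enumerate(tokens) if POSTAL_CODE_RE.fullmatch(tok)]


def find_cities(tokens, postal_idxs, window=3):
    """Return the indices of city candidates near postal code candidates."""
    candidates = set()
    for p in postal_idxs:
        # Rule (a): gazetteer terms within `window` tokens of the postal code.
        lo, hi = max(0, p - window), min(len(tokens), p + window + 1)
        for i in range(lo, hi):
            if i != p and tokens[i] in CITY_GAZETTEER:
                candidates.add(i)
        # Rule (b): a capitalized term directly preceded by the postal code.
        nxt = p + 1
        if nxt < len(tokens) and tokens[nxt][:1].isupper():
            candidates.add(nxt)
    return sorted(candidates)
```

Rule (b) lets the sketch pick up cities that are missing from the gazetteer, at the cost of more false candidates; those would have to be filtered out during aggregation.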

  12. 3. Solution: Steps • Identification of street numbers: regular expression, also covering ranges of street numbers • Identification of street names: (a) token chains ending with an indicator term, using a gazetteer of indicators assembled from OpenStreetMap that contains the 30 most common endings of German street names and covers 70% of German street names; (b) token chains that follow one of 6 manually defined POS patterns
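
The street number and indicator-term rules could look roughly like this. The regular expression, the short `STREET_INDICATORS` tuple (a stand-in for the 30 endings mentioned on the slide) and the `max_len` limit on chain length are assumptions for illustration; the POS-pattern rule is omitted because it depends on a tagger.

```python
import re

# Assumed pattern: up to four digits, optional letter suffix, optional range
# such as "12-14" or "3a-5b".
STREET_NUMBER_RE = re.compile(r"\d{1,4}[a-zA-Z]?(?:-\d{1,4}[a-zA-Z]?)?")

# Hypothetical subset of common German street name endings.
STREET_INDICATORS = ("straße", "strasse", "str.", "weg", "platz",
                     "allee", "gasse", "ring")


def find_street_numbers(tokens):
    """Return the indices of tokens that look like street numbers."""
    return [i for i, tok in enumerate(tokens)
            if STREET_NUMBER_RE.fullmatch(tok)]


def find_street_names(tokens, max_len=3):
    """Return (start, end) spans of token chains ending in an indicator.

    `end` is exclusive. Every chain of 1..max_len tokens that ends at an
    indicator token is emitted, so several candidates overlap on purpose.
    """
    spans = []
    for end, tok in enumerate(tokens):
        if tok.lower().endswith(STREET_INDICATORS):
            for start in range(max(0, end - max_len + 1), end + 1):
                spans.append((start, end + 1))
    return spans
```

Emitting all chain lengths mirrors the observation that independent identification produces many candidates per attribute; choosing among them is deferred to a later step.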

  13. 3. Solution: Steps • Identification of company names: (a) token chains ending with an indicator term, taken from a list of terms on a Wikipedia page on types of business entities (29 indicator terms); (b) token chains preceding a street name
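
Both company name rules can be sketched in a few lines. `BUSINESS_ENTITY_TYPES` below is a hypothetical subset standing in for the 29 indicator terms, and `find_company_names`, `street_starts` and `max_len` are names invented for this example.

```python
# Hypothetical subset of German legal forms used as indicator terms.
BUSINESS_ENTITY_TYPES = {"GmbH", "AG", "KG", "OHG", "UG", "GbR", "eG"}


def find_company_names(tokens, street_starts=(), max_len=4):
    """Return sorted (start, end) spans of company name candidates.

    tokens        -- tokens of one document (or line)
    street_starts -- start indices of street name candidates
    max_len       -- assumed upper bound on the length of a token chain
    """
    spans = set()
    # Rule (a): chains ending with a business entity indicator. At least
    # one token before the indicator is required.
    for end, tok in enumerate(tokens):
        if tok in BUSINESS_ENTITY_TYPES:
            for start in range(max(0, end - max_len + 1), end):
                spans.add((start, end + 1))
    # Rule (b): chains directly preceding a street name candidate.
    for s in street_starts:
        for start in range(max(0, s - max_len), s):
            spans.add((start, s))
    return sorted(spans)
```

With the tokens `["Kontakt", ":", "Beispiel", "Software", "GmbH", "Berliner", "Straße", "12"]`, rule (a) yields spans ending at "GmbH", and passing `street_starts=[5]` adds the chains that end right before "Berliner".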

  14. 3. Solution: Steps • Aggregation • Company candidates serve as seeds • Search for the closest combination of street name and street number candidates • Search for the closest combination of postal code and city candidates • If all elements are found for a company candidate → complete address • Image source: http://d3sdoylwcs36el.cloudfront.net/online_content_distribution_strategies_aggregation_getty_images.jpg/
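
One possible reading of the aggregation step, as a sketch: the slides do not define "closest", so this example measures the distance in tokens between the seed and a candidate pair, and it assumes the German ordering (street before number, postal code before city). `Candidate`, `closest_pair`, `aggregate` and `max_gap` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    start: int  # index of the first token
    end: int    # index after the last token (exclusive)


def closest_pair(seed, firsts, seconds, max_gap=1):
    """Find the (first, second) pair nearest to the seed, or None.

    A pair is valid if `second` follows `first` with at most `max_gap`
    tokens in between and the pair does not overlap the seed.
    """
    best = None
    for f in firsts:
        for s in seconds:
            if not 0 <= s.start - f.end <= max_gap:
                continue
            if f.start < seed.end and seed.start < s.end:
                continue  # pair overlaps the seed
            if f.start >= seed.end:
                dist = f.start - seed.end   # pair lies after the seed
            else:
                dist = seed.start - s.end   # pair lies before the seed
            if best is None or dist < best[0]:
                best = (dist, f, s)
    return None if best is None else (best[1], best[2])


def aggregate(companies, streets, numbers, postal_codes, cities):
    """Build complete addresses around company name seeds."""
    addresses = []
    for company in companies:
        street = closest_pair(company, streets, numbers)
        place = closest_pair(company, postal_codes, cities)
        if street and place:  # all elements found: complete address
            addresses.append({
                "company": company.text,
                "street": street[0].text,
                "street_number": street[1].text,
                "postal_code": place[0].text,
                "city": place[1].text,
            })
    return addresses
```

Seeds for which either pair is missing are dropped, which matches the requirement that only complete addresses are reported.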

  16. 4. Evaluation: Methodology • Evaluation with legal notices (“Impressum”) from German company Web sites • 1576 documents containing one or more addresses • Each Web site annotated with the address of the owner of the Web site (→ our Gold Standard) • Recall: fraction of addresses from the Gold Standard that were found; an address counts as correct only if all of its attributes are completely correct • Precision: fraction of found addresses that are correct • F1-Measure • Image source: http://wisesyracuse.wordpress.com/2012/05/23/how-to-measure-the-effectiveness-of-your-social-media-efforts/
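
The measures above can be computed as in the following sketch. It assumes one gold address per document and treats every extracted address that is not an exact match of the gold address as incorrect; how the talk handled additional, non-owner addresses is not stated, so that part is an assumption. The function name `evaluate` and the data layout are hypothetical.

```python
def evaluate(gold, extracted):
    """Compute precision, recall and F1 with exact-match scoring.

    gold      -- dict: document id -> gold address (a tuple of attributes)
    extracted -- dict: document id -> list of extracted address tuples
    """
    # An address is a true positive only if every attribute matches.
    true_pos = sum(1 for doc, address in gold.items()
                   if address in extracted.get(doc, []))
    n_extracted = sum(len(found) for found in extracted.values())

    precision = true_pos / n_extracted if n_extracted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Because addresses are compared as whole tuples, a single wrong attribute (for example a truncated company name) makes the entire address count as a miss.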

  17. 4. Evaluation: Results

  18. 4. Evaluation: Challenges • The structure of company names is often very unusual, which leads to partly correct detection, e.g. “oberüber Agentur für digitale Wertschöpfung” has been detected as “Agentur für digitale Wertschöpfung” • Several company names on one Web site: the wrong company is assigned to an address • The transformation from HTML code to text introduces errors

  20. 5. Related Work • [Loos08]: usage of Conditional Random Fields; small annotated dataset for bootstrapping; result of an unsupervised tagger as an additional feature • [Asadi08]: manually defined patterns for address extraction with confidence scores; usage of some geographic information from an unknown source • [Cai05]: exploiting graph-based similarity to a template graph; usage of a commercial GIS database • [Ahlers08]: relying on a complete database of street names, postal codes and cities; matching of text to valid combinations of those attributes • → Relying on manual effort and/or extensive proprietary data sources • → No identification of business addresses

  21. 5. Related Work: Results • Comparison to related work • Restricted to addresses without company name

  23. 6. Adaptation to other Country/Language • Define the overall pattern (order of attributes) • Adapt the identification of single attributes • Re-create the gazetteers: cities, street name indicators, business entity types → OpenStreetMap and Wikipedia exist in most countries/languages

  25. 7. Conclusion & Future Work • A new approach for the identification of address data: outperforming existing approaches; no usage of commercial databases; adaptable to other languages/countries; tailored to the identification of business addresses • Next steps: adapt patterns to other languages/countries; evaluate in other languages/countries

  26. Questions & Contact • Image source: http://www.dreifragezeichen.de/

  27. References • [Ahlers08] D. Ahlers and S. Boll. Retrieving Address-based Locations from the Web. In Proceedings of the 2nd International Workshop on Geographic Information Retrieval, GIR ’08, pages 27–34, New York, NY, USA, 2008. ACM. • [Asadi08] S. Asadi, G. Yang, X. Zhou, Y. Shi, B. Zhai, and W.-R. Jiang. Pattern-Based Extraction of Addresses from Web Page Content. In Y. Zhang, G. Yu, E. Bertino, and G. Xu, editors, Progress in WWW Research and Development, volume 4976 of Lecture Notes in Computer Science, pages 407–418. Springer Berlin Heidelberg, 2008. • [Cai05] W. Cai, S. Wang, and Q. Jiang. Address Extraction: Extraction of Location-Based Information from the Web. In Y. Zhang, K. Tanaka, J. Yu, S. Wang, and M. Li, editors, Web Technologies Research and Development - APWeb 2005, volume 3399 of Lecture Notes in Computer Science, pages 925–937. Springer Berlin Heidelberg, 2005. • [Loos08] B. Loos and C. Biemann. Supporting Web-based Address Extraction with Unsupervised Tagging. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme, and R. Decker, editors, Data Analysis, Machine Learning and Applications, Studies in Classification, Data Analysis, and Knowledge Organization, pages 577–584. Springer Berlin Heidelberg, 2008. • [Mühleisen12] H. Mühleisen and C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora. In Proceedings of the 5th Workshop on Linked Data on the Web, 2012.
