Search engine and services. Course: Location Aware Machine Intelligence Presented by : Celestine Mkama Kalendero 25.02.2014. Outline. Search Engine results ranking based on location Review of Personalized Mobile Search Engine Extraction of Address Data from Unstructured Text.
Search engine and services Course: Location Aware Machine Intelligence Presented by : Celestine Mkama Kalendero 25.02.2014
Outline • Search Engine results ranking based on location • Review of Personalized Mobile Search Engine • Extraction of Address Data from Unstructured Text
Search Engine Results Ranking based on Location
Carolyn Watters and Ghada Amoudi, Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
E-mail: watters@cs.dal.ca
Publication Year: 2003
Result Ranking in Search Engines (as of 2002)
Search engines build their indexes based on
• Keyword occurrence
• Frequency of query negotiation
Pros
+ Robust, fast
Cons
• Users must sort through pages of results when queries relate to physical distance and location
• 44% of users are frustrated by search engines (Realname, 2000)
Geosearcher
• Location-based ranking system
• Translates the search reference point into coordinates (longitude, latitude)
• Ranks search results in ascending order of distance
[Figure: Geosearcher architecture]
Geosearcher architecture: Query
• Presented by end users, e.g. "skiing resort District of Columbia"
  Query: skiing resort
  Reference point: District of Columbia
• Sample of random URLs available (used for evaluation)
Geosearcher architecture: Geocoding
• The process of assigning latitude and longitude coordinates to the host of each site
  - Preliminary work (performed by the researchers)
• Determine location
• Create lookup table
Geosearcher architecture: Geocoding
a) Determining location
  - From host URLs: DNS, country codes, Whois database
  - Map the location into coordinates, e.g. use the Getty Thesaurus (GS)
    + Contains state and area code for the US and Canada
    + Other countries
b) Lookup table
  - Country codes with coordinates
  - Hosts shown: www.about.com www.dartmouth.camathresource.com
Example: Location Information
[Figures: Whois database; Getty Thesaurus]
Geosearcher architecture: Geocoding, the process
a) Check for coordinates in the host table
b) If not found, send the domain to Whois
  - Returns country code (CC) and area code on a match
  - If the CC is ca or us and an area code is present, look it up in the table to get the state or province name
c) If there is no match, strip the domain down by one level (e.g. data.about.com to about.com)
d) Unmatched names are checked in IPtoLL (host to latitude/longitude conversion)
  - IPtoLL uses the administrative contact
e) Store the results in the host table
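The lookup cascade on this slide can be sketched roughly as follows. This is only an illustration, not the authors' implementation: `whois_lookup` and `ip_to_ll` are hypothetical callbacks standing in for the Whois-plus-lookup-table step and the IPtoLL service, and both are assumed to return a (latitude, longitude) pair or `None`.

```python
from typing import Callable, Dict, Optional, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)


def strip_one_level(host: str) -> Optional[str]:
    """Drop the leftmost label, e.g. data.about.com -> about.com.

    Returns None when only two labels remain (nothing left to strip).
    """
    parts = host.split(".")
    if len(parts) <= 2:
        return None
    return ".".join(parts[1:])


def geocode_host(
    host: str,
    host_table: Dict[str, Coord],
    whois_lookup: Callable[[str], Optional[Coord]],  # hypothetical callback
    ip_to_ll: Callable[[str], Optional[Coord]],      # hypothetical callback
) -> Optional[Coord]:
    """Resolve a host to coordinates following the slide's order of checks."""
    candidate: Optional[str] = host
    coord: Optional[Coord] = None
    while candidate is not None:
        # a) host table first
        if candidate in host_table:
            coord = host_table[candidate]
            break
        # b) then Whois (assumed to already map CC/area code to coordinates)
        coord = whois_lookup(candidate)
        if coord is not None:
            break
        # c) no match: retry with the domain stripped by one level
        candidate = strip_one_level(candidate)
    else:
        # d) every level failed: fall back to the IPtoLL stand-in
        coord = ip_to_ll(host)
    # e) cache the result under the original host name
    if coord is not None:
        host_table[host] = coord
    return coord
```

Caching under the original host name means a repeated query for the same host is answered by step a) alone.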
Distance and Ranking
• Distance is computed from the reference location to each URL in the host table
• Calculated using the haversine distance
• Stored in the session host table
• Results are ranked by distance (insertion sort)
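A minimal sketch of this step, assuming coordinates are given in decimal degrees and a mean Earth radius of 6371 km (the slides do not state the radius used). The function names and the example hosts are made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rank_by_distance(ref, hosts):
    """Insertion-sort (url, (lat, lon)) pairs by distance from ref, ascending.

    Returns a list of (url, distance_km) pairs.
    """
    ranked = []
    for url, (lat, lon) in hosts:
        d = haversine_km(ref[0], ref[1], lat, lon)
        # Walk back from the end until the right slot is found.
        i = len(ranked)
        while i > 0 and ranked[i - 1][1] > d:
            i -= 1
        ranked.insert(i, (url, d))
    return ranked


# Hypothetical hosts, reference point roughly at Washington, DC.
hosts = [("far.example", (44.6, -63.6)), ("near.example", (39.0, -77.1))]
ranking = rank_by_distance((38.9, -77.0), hosts)
```

Insertion sort suits this setting because results arrive one at a time and each can be placed as soon as its distance is known.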
Results
[Figures: unranked result (AltaVista); result using Geosearcher]
Results (contd.): Validation of accuracy
• 100 results examined manually for location information
• 90 websites assigned correctly
• 78% of 83 URLs were accurately identified
Results (contd.): Algorithm effectiveness
• Tested with 10 sets of 100 URLs using the Yahoo Random Link generator
Personalized Mobile Search Engine Using Location and Content Concepts
Namrata G Kharate, ME-Computer-II, MCOERC, Nasik, India
Prof. S. A. Bhavsar, Assistant Prof., Computer Dept., MCOERC, Nasik, India
Publication: November, 2013
Search on Mobile Devices
• Search queries on mobile devices: shorter, ambiguous
• Search results: less accurate
Solution
• We need a system that captures user preferences to return a personalized result ranking
• Personalized Mobile Search Engine (PMSE)
PMSE: System Architecture
RSVM: Ranking Support Vector Machine
PMSE
Client
• Receives user requests
• Stores clickthrough data (location, content)
• Submits requests to the server
• Displays results
• Profiles preferences in an ontology-based user profile
Server
• Forwards requests to a commercial search engine
• RSVM training
• Search result reranking
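The server-side reranking step might look like the sketch below. It assumes RSVM training has already produced a linear weight per concept feature and covers only the scoring and reordering of results returned by the backend search engine. The feature names (`content:hotel`, `location:nashik`) and weight values are hypothetical, and the training itself is not shown.

```python
from typing import Dict, List


def score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Linear score: dot product of feature values and learned weights."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rerank(results: List[dict], weights: Dict[str, float]) -> List[dict]:
    """Order results by descending score; ties keep the backend order."""
    return sorted(results, key=lambda r: score(r["features"], weights), reverse=True)


# Hypothetical weights, as if learned from a user's clickthrough data.
weights = {"content:hotel": 0.5, "location:nashik": 0.8}
results = [
    {"url": "a.example", "features": {"content:hotel": 1.0}},
    {"url": "b.example", "features": {"content:hotel": 1.0, "location:nashik": 1.0}},
]
reranked = rerank(results, weights)
```

Because Python's `sorted` is stable, results with equal scores stay in the order the commercial search engine returned them.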
Extraction of Address Data from Unstructured Text using Free Knowledge Resources
Sebastian Schmidt (schmidt@kom.tudarmstadt.De), Simon Manschitz (manschitz@stud.tudarmstadt.de), Ralf Steinmetz (steinmetz@kom.tudarmstadt.de), Christoph Rensing (rensing@kom.tudarmstadt.de)
Multimedia Communications Lab, Technische Universität Darmstadt, Germany
Publication: November, 2013
Extraction of Address Data
• Of interest in various domains
  - Location-based services
  - Address repositories, automatically created
  - Automatic harvesting of web addresses is not possible
Solution
• Identify business address data with a hybrid approach
• Combine patterns and gazetteers
Address Structure: Germany
• Company name: no special pattern
• Street: varies, e.g. Burgermeister-Jung, Bgm.-Jung
• Street number: digit sequence, e.g. 45a, 45-47
• Postal code: exactly 5 digits, reserved
• Cities: Frankfurt, Ffm, Frankfurt/Main
Address Data Identification Workflow
Address Data Identification: Preprocessing
• Strip HTML markup, e.g. using the Beautiful Soup library
• Cleaning: removing non-Unicode characters and white space between numbers
• Line splitting and tokenizing, using the Apache OpenNLP toolkit
• Part-of-speech tagging, using TreeTagger
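A rough, standard-library-only sketch of the first three preprocessing steps. Python's built-in `html.parser` stands in for Beautiful Soup here, and plain whitespace splitting stands in for the OpenNLP tokenizer. Part-of-speech tagging is omitted. The sample address is made up.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content and turns block-level tags into line breaks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("br", "p", "div", "li"):
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def preprocess(html: str):
    """Return a list of token lists, one per non-empty text line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Remove spaces/tabs between digits on the same line ("64 283" -> "64283").
    text = re.sub(r"(?<=\d)[ \t]+(?=\d)", "", text)
    lines = [line.strip() for line in text.splitlines()]
    # Whitespace tokenization: a simplification of a real tokenizer.
    return [line.split() for line in lines if line]


# Made-up example input.
tokens = preprocess("<p>Muster GmbH<br>Hauptstr. 45a<br>64 283 Darmstadt</p>")
```

Keeping the line structure matters for the later steps, since a postal code and city usually share a line while the street sits on the line above.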
Address Data Identification
1. Postal codes
• Token regular expression: [0-9]{5}
2. Cities
• Generated list based on OpenStreetMap, accessed via the Overpass API (28,087 entries)
• Known city found in the list
• Preceded directly by a postal code
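One way to combine these two rules, as a sketch: a token counts as a postal code if it fully matches `[0-9]{5}`, and the following token counts as a city if it appears in the gazetteer. Requiring both conditions together is an assumption about how the slide's bullets interact, and the tiny `cities` set stands in for the OpenStreetMap-derived list.

```python
import re

POSTAL_CODE = re.compile(r"[0-9]{5}")  # pattern from the slide


def find_postal_city(tokens, cities):
    """Return (postal_code, city) pairs where a known city directly follows a code."""
    hits = []
    for i in range(len(tokens) - 1):
        code, city = tokens[i], tokens[i + 1]
        # fullmatch rejects tokens that merely contain five digits.
        if POSTAL_CODE.fullmatch(code) and city in cities:
            hits.append((code, city))
    return hits


# Stand-in gazetteer; the slides describe a list of 28,087 entries.
cities = {"Darmstadt", "Frankfurt"}
pairs = find_postal_city(["Hauptstr.", "45a", "64283", "Darmstadt"], cities)
```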
Address Data Identification
3. Street numbers
• Regular expression: ([0-9]{1,3})([a-zA-Z][0-9]?)?(([+|-])([0-9]{1,3})([a-zA-Z][0-9]?)?)?
4. Street names
• Generated list based on OpenStreetMap, accessed via the Overpass API (300,000 entries)
• Use street name endings, e.g. str
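The street-number pattern can be tried out directly. The sketch below applies it to whole tokens with `fullmatch`, which is an assumption since the slides do not say how the pattern is anchored. Of the endings in `STREET_ENDINGS`, only "str" comes from the slide, and "str." is added as an assumption.

```python
import re

# Pattern copied from the slide: number, optional letter suffix, optional range part.
STREET_NUMBER = re.compile(
    r"([0-9]{1,3})([a-zA-Z][0-9]?)?(([+|-])([0-9]{1,3})([a-zA-Z][0-9]?)?)?"
)

# "str" is from the slide; "str." is an assumed variant.
STREET_ENDINGS = ("str", "str.")


def is_street_number(token: str) -> bool:
    """True if the whole token matches the street-number pattern."""
    return STREET_NUMBER.fullmatch(token) is not None


def looks_like_street_name(token: str) -> bool:
    """Heuristic: the token ends with a known street-name ending."""
    return token.lower().endswith(STREET_ENDINGS)
```

Inside the character class `[+|-]` the `|` is a literal character, so the separator may be `+`, `|`, or `-`. This covers the slide's examples 45a and 45-47, while a four-digit token such as 1234 is rejected.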
Address Data Identification
5. Company name
• Search for identical (exactly matching) terms from Wikipedia: 29 terms, e.g. GmbH (private), AG (public)
• Exploit the standard address structure
Evaluation & Methodology
• Sites with a legal note (1,576 websites)
• Fraction of full addresses identified correctly: Rcorrect address = 0.946, Rcompany = 0.82
Conclusion
Search engine ranking
• Evaluation: the algorithm was accurate and effective
• Efficiency: impacted by reliance on external databases
Recommendation
• Maintain a database of special resources to increase efficiency
• Address extraction: adaptation to other languages
Thank You! (Q&A)