Information Search & Retrieval: Problems, solutions, trends… Tony Rose, PhD MBCS CEng Vice-Chair, BCS IRSG going further together
Contents • The BCS Information Retrieval SG • What is IR anyway? • How search engines work • Why search is hard • Where’s it all going?
Information Retrieval SG • Growing rapidly • 750+ members • Annual conference (ECIR) • FDIA • Various 1-day events • Search Solutions • Informer • Discounts for various events, e.g. SIGIR • … is free to join!
Information Retrieval SG • Traditional focus on search (text retrieval) • Knowledge management, Multimedia retrieval, User experience, Information visualisation, extraction, summarisation, etc. • Latest issue of Informer: • “Searching for the Music You Like” • “Exploring Maps through Geo-referenced Images and RDF Shared Metadata” • “Using Semantic Relations to improve Question Answering” • “Modeling & Annotation of Dance Media Semantics”
What is IR? • “Science of searching for: • information in documents • documents themselves • metadata which describe documents, • within databases • …whether relational stand-alone databases or hypertextually-networked databases such as the World Wide Web”
The Need for IR • In a word … Infoglut • 800 MB of recorded information is produced per person per year [Computing magazine] • Up to 80% of corporate information is unstructured • Documents, emails, images, voicemail, etc. • So … can’t we just use Google?
How do Search Engines Work? • On the surface: • Understand what the user wants • Find documents about that topic • In reality: • Count words • Apply a simple equation
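To make "count words, apply a simple equation" concrete, here is a minimal sketch that scores documents with one common choice of equation, tf-idf. The slides do not say which equation is meant, so tf-idf is an assumption, and the corpus, `tokenize` and `tf_idf_scores` are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
DOCS = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "stock markets fell sharply today",
}

def tokenize(text):
    """Lower-case and split on whitespace (no stemming, no stop list)."""
    return text.lower().split()

def tf_idf_scores(query, docs):
    """Score each document by summing tf * idf over the query terms."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))
    scores = {}
    for doc_id, terms in tokenized.items():
        counts = Counter(terms)  # "count words"
        score = 0.0
        for term in tokenize(query):
            if term in counts:
                # "apply a simple equation": term frequency * log(N / df)
                score += counts[term] * math.log(n_docs / df[term])
        scores[doc_id] = score
    return scores

scores = tf_idf_scores("cat mat", DOCS)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Nothing here "understands" the query: the rarer term ("mat") simply contributes more to the score than the commoner one ("cat").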
How do Search Engines Work? • Measure the conceptual distance between your query and each document in the DB • Return the best matches [Source: Maristella Agosti, University of Padova]
The Central Problem in IR • Information seeker: concepts → query terms • Author: concepts → document terms • Do these represent the same concepts? [Source: Jimmy Lin, University of Maryland]
The Central Problem in IR • How do you represent the concepts? • Documents and queries = “bag of words” • Unordered set of terms + numeric weights • How do you calculate similarity? • Set theory (e.g. Boolean) • Algebraic (e.g. vector space) • Probabilistic
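The bag-of-words representation and two of the similarity families listed above can be sketched as follows. This is an illustration under simple assumptions (raw term counts as the numeric weights, whitespace tokenisation); the function names are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Unordered terms with numeric weights (here: raw counts)."""
    return Counter(text.lower().split())

def boolean_and_match(query_bag, doc_bag):
    """Set-theoretic model: match only if every query term occurs in the document."""
    return set(query_bag) <= set(doc_bag)

def cosine_similarity(a, b):
    """Algebraic (vector space) model: cosine of the angle between term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = bag_of_words("search engine")
doc_a = bag_of_words("how a search engine ranks pages")
doc_b = bag_of_words("engine oil change")

print(boolean_and_match(query, doc_a), boolean_and_match(query, doc_b))
print(cosine_similarity(query, doc_a), cosine_similarity(query, doc_b))
```

The Boolean model gives a yes/no answer, whereas the vector space model gives a graded score that can be used for ranking.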
IR models [Source: Wikipedia]
How do we Evaluate Search? • Assume that results are either relevant or non-relevant • Precision: • Proportion of retrieved documents that are relevant • Recall: • Proportion of known-relevant documents that were actually retrieved • But what about: indexing / retrieval speed, query language, user experience, etc.? [Figure: relevant vs. retrieved documents]
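The two definitions above translate directly into set arithmetic. A minimal sketch, with invented document ids and a hypothetical `precision_recall` helper:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, treating relevance as binary."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# The engine returned four documents; three documents are known to be relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 2/4) and two of the three known-relevant documents were found (recall 2/3).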
Why Search is Hard • Document representation • Keywords are not enough • Blind Venetian = Venetian Blind • Terms are not independent • Structural & discourse dependencies, co-references, etc. • Imperfect “stop lists” • the, and, of…
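Two of these representation problems are easy to reproduce. The sketch below indexes text as a bag of words after stop-word removal; the stop list extends the slide's "the, and, of" with a few extra words chosen for illustration, and `index_terms` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical stop list: "the, and, of" from the slide plus illustrative extras.
STOP_WORDS = {"the", "and", "of", "to", "be", "or", "not"}

def index_terms(text, stop_words=STOP_WORDS):
    """Bag-of-words index terms after stop-word removal."""
    return Counter(t for t in text.lower().split() if t not in stop_words)

# Word order is lost, so the two phrases become indistinguishable.
same_bag = index_terms("blind Venetian") == index_terms("Venetian blind")

# An over-eager stop list can remove an entire query.
hamlet = index_terms("to be or not to be")

print(same_bag, hamlet)
```

The first check comes out true and the second leaves an empty index entry, which is one reason stop lists have to be tuned to the collection.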
Why Search is Hard • Morphological relationships • Computer, computing, compute, computed… • Index documents using word stems • False positives: • organization, organ → organ • police, policy → polic • arm, army → arm • False negatives: • cylinder, cylindrical • create, creation • Europe, European • Prefixes are particularly difficult • Un*, dis* • Delegate = de-leg-ate • Ratify = rat-ify
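The false positives and false negatives above can be reproduced with a deliberately naive suffix stripper. This is a toy written for illustration, not the Porter algorithm or any production stemmer; the suffix list and minimum stem length are assumptions.

```python
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ization", "ing", "ed", "er", "e", "y", "s")
MIN_STEM = 3  # never leave a stem shorter than this

def toy_stem(word):
    """Strip the first matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word

# Intended conflation: all four forms share one stem.
print({toy_stem(w) for w in ["computer", "computing", "compute", "computed"]})

# False positives: unrelated words collapse to the same stem.
print(toy_stem("organization"), toy_stem("organ"))
print(toy_stem("police"), toy_stem("policy"))
print(toy_stem("arm"), toy_stem("army"))

# False negatives: related words keep different stems.
print(toy_stem("cylinder"), toy_stem("cylindrical"))
print(toy_stem("create"), toy_stem("creation"))
```

Adding more suffix rules fixes some of the false negatives but tends to introduce new false positives, which is the trade-off the slide points at.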
Why Search is Hard • Named entity recognition • Companies in New York • New companies in York • NEs are highly discriminatory • People • Places • Organisations • Many vertical applications • e.g. bioscience
Why Search is Hard • Semantic relationships • Car = automobile • Buy = purchase • Sick = ill • Synonym rings • Car, automobile, truck, bus, taxi... • Appropriate level of abstraction depends on user & task • Development of subject-specific taxonomies • “concept matching”
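One simple way to use synonym rings is query expansion: every query term is replaced by its whole ring before matching. A minimal sketch, using the pairs from the slide as rings; `expand_query` is a hypothetical helper and real systems usually draw the rings from a curated thesaurus or taxonomy.

```python
# Synonym rings taken from the examples above.
SYNONYM_RINGS = [
    {"car", "automobile"},
    {"buy", "purchase"},
    {"sick", "ill"},
]

def expand_query(terms, rings=SYNONYM_RINGS):
    """Return the query terms plus every member of any ring they belong to."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        for ring in rings:
            if term in ring:
                expanded |= ring
    return expanded

print(sorted(expand_query(["buy", "car"])))

# A broader ring trades precision for recall: "car" now also matches "bus".
BROAD_RINGS = [{"car", "automobile", "truck", "bus", "taxi"}]
print(sorted(expand_query(["car"], BROAD_RINGS)))
```

Whether the broad ring helps or hurts depends on the user and task, which is the "appropriate level of abstraction" point above.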
Why Search is Hard • Word sense disambiguation • “Bank” • Financial institution? • Part of a river? • An aerial manoeuvre? • Active research area • Categorisation & clustering of results
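A very rough illustration of disambiguation by context overlap, in the spirit of the simplified Lesk approach: pick the sense whose signature words overlap most with the words around "bank". The signature words are invented for this sketch, and `disambiguate` is a hypothetical helper, not a method described in the slides.

```python
# Hypothetical sense signatures for "bank" (invented for illustration).
SENSES = {
    "financial institution": "money deposit account loan cash savings",
    "part of a river": "river water shore stream fishing",
    "aerial manoeuvre": "aircraft turn wing pilot roll",
}

def disambiguate(context, senses=SENSES):
    """Return the sense whose signature shares the most words with the context."""
    context_terms = set(context.lower().split())
    overlaps = {
        sense: len(context_terms & set(signature.split()))
        for sense, signature in senses.items()
    }
    return max(overlaps, key=overlaps.get)

print(disambiguate("she opened a savings account at the bank"))
print(disambiguate("we went fishing from the river bank"))
print(disambiguate("the pilot put the aircraft into a steep bank"))
```

With no overlapping words at all, this sketch simply falls back to the first sense listed, which hints at why the problem remains an active research area.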
Google’s Insight • Exploit the link structure inherent in the web • Calculate a measure of each document’s value • Independent of any query • “PageRank” • Overall relevance based on 100+ parameters • Constant battle with SEOs • Enterprise search is a different proposition… • As is desktop search
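The query-independent link measure can be sketched with the textbook power-iteration form of PageRank. This is not Google's production ranking; the four-page link graph is invented, and the damping factor of 0.85 is the commonly cited value, used here as an assumption.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to.

    Every link target must itself appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy web: a links to b and c, b links to c, c links to a, d links to c.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(LINKS)
print(sorted(rank, key=rank.get, reverse=True))
```

Page "c" ends up on top because three pages link to it, and "d" ends up last because nothing links to it; no query is involved at any point.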
Where’s it all going? • Vertical search • Jobs, travel, health, people, etc. • Rich media search • Audio, video, TV, images • Specialised content search • blogs, news, classifieds • Social search • Personalisation
Where’s it all going? • Mobile search • Answer engines • Active research community in Question Answering • Multi / cross-lingual search • Search agents • Human UI
Further Information • www.irsg.bcs.org • Informer • ECIR (March 2008, Glasgow) • Search Solutions 2008 (Sept 2008, London)