This lecture discusses the main problem of the World Wide Web - the difficulty of finding the desired web pages. It explains what search engines are and their types, components, and query interface. It also addresses the problems with search engines and provides tips for improving query results. The lecture covers the operation of search engines, how pages get into search engines, and strategies for getting pages to the top of search results. It concludes with an introduction to the WWLib-TNG search engine and the use of automatic classifiers for generating metadata.
CP3024 Lecture 12 Search Engines
What is the main WWW problem? • With an estimated 800 million web pages, finding the one you want is difficult!
What is a Search Engine? • A page on the web connected to a backend program • Allows a user to enter words which characterise a required page • Returns links to pages which match the query
Types of Search Engine • Automatic search engine e.g. Altavista, Lycos • Classified Directory e.g. Yahoo! • Meta-Search Engine e.g. Dogpile
Components of a Search Engine • Robot (or Worm or Spider) • collects pages • checks for page changes • Indexer • constructs a sophisticated file structure to enable fast page retrieval • Searcher • satisfies user queries
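The indexer and searcher components can be sketched as follows. This is a minimal illustration, assuming the "sophisticated file structure" is something like an inverted index held in memory; the function names and the example pages are hypothetical and not taken from any particular engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Indexer: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Searcher: return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages standing in for what a robot would have collected.
pages = {
    "http://example.org/a": "Search engines index web pages",
    "http://example.org/b": "Robots collect pages by following links",
}
index = build_index(pages)
```

Looking up a word is then a single dictionary access, which is why retrieval stays fast even when the robot has collected many pages.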
Query Interface • Usually a boolean interface • (Fred and Jean) or (Bill and Sam) • Normally allows phrase searches • "Fred Smith" • Also proximity searches • Not generally understood by users • May have extra 'friendlier' features?
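The three query styles on this slide can be illustrated against a single page of text. This is only a sketch: the helper names (`contains_word`, `contains_phrase`, `within`) are hypothetical, and a real engine would evaluate such queries against its index rather than scanning page text.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_word(text, word):
    """True if the word occurs anywhere in the text."""
    return word.lower() in tokenize(text)


def contains_phrase(text, phrase):
    """Phrase search: True if the phrase's words appear consecutively."""
    words, target = tokenize(text), tokenize(phrase)
    n = len(target)
    return n > 0 and any(
        words[i:i + n] == target for i in range(len(words) - n + 1)
    )


def within(text, a, b, distance):
    """Proximity search: True if a and b occur at most `distance` words apart."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


page = "Fred Smith met Jean and later Bill"

# Boolean query from the slide: (Fred and Jean) or (Bill and Sam)
boolean_match = (
    (contains_word(page, "Fred") and contains_word(page, "Jean"))
    or (contains_word(page, "Bill") and contains_word(page, "Sam"))
)
```

The boolean query matches because the first bracket is satisfied, even though "Sam" never appears on the page.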
Search Results • Presented as links • Supposedly ordered in terms of relevancy to the query • Some Search Engines score results • Normally organised in groups of ten per page
Problems • Links are often out of date • Usually too many links are returned • Returned links are not very relevant • The Engines don't know about enough pages • Different engines return different results • U.S. bias
Improving query results • To look for a particular page use an unusual phrase you know is on that page • Use phrase queries where possible • Check your spelling! • Progressively use more terms • If you don't find what you want, use another Search Engine!
Who operates Search Engines? • People who can get money from venture capitalists! • Many search engines originate from U.S. universities • Often paid for by advertisements • Engines monitor carefully what else interests you (paid by the click)
How do pages get into a Search Engine? • Robot discovery (following links) • Self submission • Payments
Robot Discovery • Robots visit sites while following links • The more links the more visits • Make sure you don't exclude Robots from visiting public pages
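Robots are usually excluded through a robots.txt file (the Robots Exclusion Protocol), so one way to check that public pages are not excluded by accident is to test the rules before publishing them. The sketch below uses Python's standard `urllib.robotparser`; the rules, the robot name "ExampleBot" and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot is kept out of /private/ only.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ExampleBot" is an invented robot name used only for this check.
public_ok = parser.can_fetch("ExampleBot", "http://example.org/index.html")
private_ok = parser.can_fetch("ExampleBot", "http://example.org/private/notes.html")
```

Here `public_ok` is true and `private_ok` is false. A rule such as `Disallow: /` would make both false and hide the whole site from well-behaved robots.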
Payments • Some search engines only index paying customers • The more you pay the higher you appear on answers to queries
Self submission • Register your page with a search engine • Pay a company to register you with many search engines • Get registration with many search engines for free!
Getting to the top • Only pages relevant to the query should be ranked highly • Search engines only look at text • Search engine operators try to stop "search engine spamming" • Some queries are pre-answered
Get where you should be! • Put more than graphics on a page • Don't use frames • Use the ALT attribute of the <IMG> tag • Make good use of <TITLE> and <H1> • Consider using the <META> tag • Get people to link to your page
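To see why these tips matter, it helps to look at a page the way a text-only indexer might. The sketch below uses Python's standard `html.parser` to pull out the title, first-level heading, META keywords/description and image ALT text. The class name, the choice of fields and the sample page are assumptions for illustration, not a description of how any real engine works.

```python
from html.parser import HTMLParser


class IndexableText(HTMLParser):
    """Collects text an indexer could use: title, h1, meta content, alt text."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "h1": "", "meta": [], "alt": []}
        self._current = None  # name of the text element being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._current = tag
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.fields["meta"].append(attrs.get("content") or "")
        elif tag == "img" and attrs.get("alt"):
            self.fields["alt"].append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data


def extract_indexable(html):
    """Parse the page and return the collected fields with tidy whitespace."""
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    fields = parser.fields
    fields["title"] = fields["title"].strip()
    fields["h1"] = fields["h1"].strip()
    return fields


# Hypothetical page used only to demonstrate the extraction.
sample = """<html><head><title>Bike Shop Home</title>
<meta name="keywords" content="bicycles, repairs">
</head><body><h1>Bicycle repairs</h1>
<img src="shop.gif" alt="Front of the bike shop">
</body></html>"""

fields = extract_indexable(sample)
```

A page consisting only of an image with no ALT text would leave every field empty, giving a text-only indexer nothing to match queries against.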
Summary • Search Engines are vital to the Web user • Search Engines are not perfect by a long way • There are tactics for better searching • Page design can bring more visitors via Search Engines • The more links the better!
WWLib-TNG A Next Generation Search Engine
In the beginning • WWLib-TOS • Manually constructed directory • Classified on Dewey Decimal • Simple data structure • Proof of concept
Motive - Why Generate Metadata Automatically? • Meta tags are not compulsory • Old pages are less likely to have meta tags • Available data can be unreliable • The Web of Trust requires comprehensive resource description • An essential prerequisite for widespread deployment of RDF applications
Method - How can Metadata be Generated Automatically? • Using an automatic classifier • The classifier classifies Web Pages according to Dewey Decimal Classification • Other useful metadata can be extracted during the process of automatic classification
Automatic Classification • Intended to combine the intuitive accuracy of manually maintained classified directories with the speed and comprehensive coverage of automated search engines • DDC has been adopted because of its universal coverage, multilingual scope and hierarchical nature
Automatic Classifier - How does it work? Firstly, the page is retrieved from a URL or local file and parsed to produce a document object
Automatic Classifier - How does it work? The document object is then compared with DDC objects representing the ten top-level DDC classes
Automatic Classifier - How does it work? • Each time a word in the document matches a word in the DDC object, the two associated weights are added to a total score • A measure of similarity is then calculated using a similarity coefficient
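The scoring step can be sketched as below. The slide does not say which similarity coefficient is used, so this example assumes a Dice-style normalisation (matched weight divided by the total weight of both objects); the word weights are invented for illustration.

```python
def match_score(doc_weights, ddc_weights):
    """Add both weights for every word found in the document and the DDC object."""
    return sum(
        doc_weights[word] + ddc_weights[word]
        for word in doc_weights
        if word in ddc_weights
    )


def similarity(doc_weights, ddc_weights):
    """Dice-style coefficient (an assumption): matched weight over total weight."""
    total = sum(doc_weights.values()) + sum(ddc_weights.values())
    if total == 0:
        return 0.0
    return match_score(doc_weights, ddc_weights) / total


# Hypothetical word weights for a document object and one DDC object.
document = {"engine": 2, "search": 3, "football": 1}
ddc_class = {"search": 2, "engine": 1, "library": 2}

score = match_score(document, ddc_class)      # (2 + 1) + (3 + 2) = 8
coefficient = similarity(document, ddc_class)  # 8 / (6 + 5)
```

Under this assumed normalisation the coefficient lies between 0 (no shared words) and 1 (identical vocabularies), which makes it straightforward to compare against a fixed significance threshold.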
Automatic Classifier - How does it work? • If there is a significant measure of similarity the document will be compared with any subclasses of that DDC class • If there are no subclasses (i.e. the DDC class is a leaf node) the document is assigned the classmark • If the result is not significant, the comparison process will proceed no further through that particular branch of the DDC object hierarchy
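The descent through the DDC hierarchy described above can be sketched as a recursive walk that prunes branches with no significant match. The `DDCNode` class, the 0.3 threshold, the Dice-style coefficient and the tiny three-class tree are all assumptions made for illustration; the slides do not give the real data structures or the significance test.

```python
from dataclasses import dataclass, field


@dataclass
class DDCNode:
    """Hypothetical DDC object: a classmark, word weights and subclasses."""
    classmark: str
    weights: dict
    children: list = field(default_factory=list)


def similarity(doc, ddc):
    """Dice-style coefficient (an assumption): matched weight over total weight."""
    matched = sum(doc[w] + ddc[w] for w in doc if w in ddc)
    total = sum(doc.values()) + sum(ddc.values())
    return matched / total if total else 0.0


def classify(doc, nodes, threshold=0.3):
    """Return the classmarks of leaf nodes reached through significant matches."""
    classmarks = []
    for node in nodes:
        if similarity(doc, node.weights) < threshold:
            continue  # not significant: go no further down this branch
        if node.children:
            classmarks.extend(classify(doc, node.children, threshold))
        else:
            classmarks.append(node.classmark)  # leaf node: assign the classmark
    return classmarks


# Toy hierarchy with invented weights; only the classmark numbers are real DDC.
tree = [
    DDCNode("000", {"computer": 1, "information": 1, "search": 1}, [
        DDCNode("004", {"computer": 2, "search": 1}),
        DDCNode("020", {"library": 2}),
    ]),
    DDCNode("700", {"art": 2, "music": 1}),
]

document = {"search": 2, "computer": 1}
assigned = classify(document, tree)
```

For this toy document only the 000 branch is explored, and within it only the 004 leaf is significant, so `assigned` holds the single classmark "004". Pruning means most of the hierarchy is never compared against the document.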
Metadata elements • The automatic classification process can be used to extract useful metadata elements in addition to the classmarks • Keywords • Classmarks • Word count • Title • URL • Abstract • A unique accession number and associated dates can be obtained and supplied by the system
RDF Schema • There is a significant overlap with the Dublin Core element set • Requirement for implementation clarity • Those that have Dublin Core equivalents are declared as sub-properties • Maintain interoperability with Dublin Core applications
RDF Schema
<rdf:Description ID="Keyword">
  <rdf:type rdf:resource="http://www.w3.org/TR/WD-rdf-syntax#Property"/>
  <rdfs:subPropertyOf resource="http://purl.org/metadata/dublin_core#Subject"/>
  <rdfs:label>Keyword</rdfs:label>
</rdf:Description>
<rdf:Description ID="Classmark">
  <rdf:type rdf:resource="http://www.w3.org/TR/WD-rdf-syntax#Property"/>
  <rdfs:subPropertyOf resource="http://purl.org/metadata/dublin_core#Subject"/>
  <rdfs:label>Classmark</rdfs:label>
</rdf:Description>
Classifier Evaluation • Automatic metadata generation will become important for the widespread deployment of RDF-based applications • Documents created before the invention of RDF-generating authoring tools also need to be described • RDF utilised in this manner may encourage interoperability between search engines • More info: http://www.scit.wlv.ac.uk/~ex1253/
Current Status of WWLib-TNG • New results interface proposed • R-wheel (CirSA) • Builder and searcher constructed, now being tested • Classifier constructed • Test Dispatcher/Analyser/Archiver in place