Explore the intricate components and functions of search engines, discover how web crawlers index data, and learn effective search strategies to navigate the online world efficiently.
What Are They? • Four Components • A database of references to webpages • An indexing robot that crawls the WWW • An interface • Enables users to submit queries • Displays results • Information retrieval system • Each engine is unique, but most work in much the same way
Database • Where the user's query is matched • Contains only essential parts of pages • Only includes pages that were indexed • Search engines are always somewhat out of date, because pages change between crawls
Web Crawler • A robot that follows links • Records data it finds • Words in the webpage • Metadata • ALT attributes in IMG tags • Robot Exclusion Protocol
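The crawler behaviour listed above can be sketched with Python's standard library. This is a minimal illustration, not how any particular search engine works: the `PageRecorder` class, the `ExampleBot` user agent, the robots.txt rules, and the sample page are all hypothetical, and nothing is fetched from the network.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class PageRecorder(HTMLParser):
    """Records the data a crawler might keep from one page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []       # links the robot could follow next
        self.alt_texts = []   # ALT attributes found in IMG tags
        self.metadata = {}    # <meta name=... content=...> pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.metadata[attrs["name"].lower()] = attrs["content"]


# Assumed robots.txt rules, parsed from memory instead of being downloaded.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Assumed sample page.
page = """
<html><head>
<meta name="description" content="Department home page">
</head><body>
<img src="map.png" alt="Campus map">
<a href="https://example.com/about.html">About</a>
<a href="https://example.com/private/notes.html">Notes</a>
</body></html>
"""

recorder = PageRecorder()
recorder.feed(page)

# Robot Exclusion Protocol: only follow links the site allows robots to fetch.
allowed = [url for url in recorder.links if robots.can_fetch("ExampleBot", url)]
print(allowed)
print(recorder.alt_texts, recorder.metadata)
```

A well-behaved robot would apply the `can_fetch` check before every request, which is why the excluded `/private/` link never reaches the list of pages to visit.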
Search Engine Interfaces • Gathers input from users • Presents results from the IR system • Often in ranked order
Search Engine Interfaces • Input • User requirements • Search expression, search limits • Presentation style • Presentation format, search type
Search Engine Interfaces • Output • Results • Descriptions • Clusters
Search Term Matching • Trying to find a match in the database • Two main methods • Keyword searching • Matching single terms, computing the cosine similarity between query and document term vectors • Concept-based searching • Examining clusters of words • Attempt to determine meaning of query and find records related to that meaning
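Keyword matching with a cosine measure can be illustrated as follows. This is a sketch under simple assumptions: raw term counts are used as vector components (real engines typically weight terms), and the function names and sample documents are made up for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Turn text into a bag-of-words vector of raw term counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Assumed toy "database" of three documents.
documents = [
    "web crawler follows links",
    "search engine database index",
    "cooking pasta recipes",
]
query = term_vector("web crawler")

scores = [cosine_similarity(query, term_vector(doc)) for doc in documents]
best = max(range(len(documents)), key=lambda i: scores[i])
print(documents[best], round(scores[best], 4))
```

Documents that share no terms with the query score 0, and a document identical to the query scores 1, so the value can be used directly to order the matches.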
Basic IR Features • Boolean operators • AND, OR, NOT, grouping • Extended operators • NEAR, ADJACENT, quoted phrases (") • Stop word deletion • Stemming • Searching in fields (e.g. host)
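Several of these features fit in a few lines once the database is an inverted index (term to set of document IDs). The sketch below is illustrative only: the stop-word list is a tiny assumed sample, `naive_stem` is a deliberately crude suffix stripper and not a real stemming algorithm such as Porter's, and all names are hypothetical.

```python
# Assumed, very small stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is"}


def naive_stem(word):
    """Crude stemming: strip a trailing 'ing' or 's' if enough of the word remains."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, strip punctuation, delete stop words, then stem."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [naive_stem(w) for w in words if w and w not in STOP_WORDS]


def build_index(docs):
    """Build an inverted index mapping each term to the IDs of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def lookup(index, word):
    """Look up a query word after applying the same stemming as the index."""
    return index.get(naive_stem(word.lower()), set())


docs = {
    1: "Crawlers follow links in the web",
    2: "Indexing the web is expensive",
    3: "Search engines rank pages",
}
index = build_index(docs)

# Boolean operators map onto set operations:
web_and_links = lookup(index, "web") & lookup(index, "links")         # AND
crawler_or_engine = lookup(index, "crawler") | lookup(index, "engine")  # OR
web_not_index = lookup(index, "web") - lookup(index, "index")          # NOT
print(web_and_links, crawler_or_engine, web_not_index)
```

Because the query words go through the same stemmer as the indexed text, a search for "crawler" still finds the document that says "Crawlers".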
Ranked Output • Most SEs produce ranked lists by applying simple rules: • Early words are more important • Title is very important • Frequency of occurrence matters for some • Infrequent words matter more • Modification date • Google is different: • PageRank™ method based on popularity • Links as money
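The "links as money" idea can be sketched as a simplified power iteration in the spirit of the published PageRank formulation. This is not Google's production ranking: the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page link graph are all assumptions made for the example, and every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: each page splits its rank evenly among its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # "Links as money": the page's rank is divided by its outdegree.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Assumed toy link graph: page -> pages it links to.
links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home"],
    "orphan": ["home"],
}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

In this graph "home" ends up with the highest rank because three pages link to it, while "orphan", which nothing links to, keeps only the baseline share.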
Googlebombing • Google spoofed from the lecture list • first hit from 1992 • Official GoogleBlog explanation
What about the Invisible Web? • Also known as the Deep Web • Documents that are on the WWW but not indexed by Search Engines • Some are available only by submitting forms • Some are not generally accessible (in subnets) • Some are not in (X)HTML format
The Invisible Web Isn't So Invisible Anymore… • More search engines parse non-(X)HTML now than before • Aware of the problem, companies are making more content available using • Stable URLs • Robot-friendly sitemaps • But much content is still not indexed
But there are still plenty of important yet invisible docs • How to find them? • Many of them are in databases • No one search engine covers everything • Use database tools from the university's library • Especially for research articles • Use multiple search engines or a meta-crawler • Dogpile is the most famous
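What a meta-crawler does with the answers from several engines can be sketched as a merge of ranked lists. The scoring rule here (summing the reciprocal of each result's position) is just one simple choice and is not claimed to be what Dogpile or any other service uses; the engines, their result lists, and the URLs are invented, and no real search API is called.

```python
from collections import defaultdict


def merge_results(result_lists):
    """Merge ranked URL lists: a URL scores 1/position in each list where it appears."""
    scores = defaultdict(float)
    for results in result_lists:
        for position, url in enumerate(results, start=1):
            scores[url] += 1.0 / position
    # Highest combined score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Assumed ranked results from three hypothetical engines for the same query.
engine_a = ["https://example.org/a", "https://example.org/b", "https://example.org/c"]
engine_b = ["https://example.org/b", "https://example.org/d"]
engine_c = ["https://example.org/b", "https://example.org/a"]

merged = merge_results([engine_a, engine_b, engine_c])
print(merged)
```

A page that several engines return near the top rises in the merged list, and pages that only one engine knows about still appear, which is the point of querying more than one engine.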
Search Engines: A Summary of Practical Advice
How To Succeed With SEs • As a surfer: • If you don't know what you are looking for • Use multiple SEs, or a meta-crawler • Search within results • If you know what you are looking for • Use multiple SEs, or a meta-crawler • Use Boolean expressions or search within results • Consider specialized engines
How To Succeed With SEs • As a creator: • HTML level • Always use ALT attributes with <IMG>, etc. • Avoid frames • Make it easier to index • Don't expect SEs to find your pages • Make links between your pages • Use metadata • Informal: <meta name="description" …> • Formal: Dublin Core and others • Increase your pages' popularity • Don't use systematic reciprocal linking: rings, exchanges, lists • The PageRank™ a page passes through each link is inversely proportional to its outdegree
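A page author can check some of the HTML-level advice automatically. The checker below is a rough sketch using Python's standard `html.parser`; the `IndexabilityChecker` class and the sample page are hypothetical, and it only covers three of the points above (ALT attributes, a description meta tag, and frames).

```python
from html.parser import HTMLParser


class IndexabilityChecker(HTMLParser):
    """Flags a few HTML-level problems that make a page harder to index."""

    def __init__(self):
        super().__init__()
        self.images_missing_alt = 0
        self.has_description = False
        self.uses_frames = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "alt" not in attrs:
            self.images_missing_alt += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            # Only count the tag if it actually carries some content.
            self.has_description = self.has_description or bool(attrs.get("content"))
        elif tag in ("frame", "frameset"):
            self.uses_frames = True


# Assumed sample page with one image that lacks an ALT attribute.
page = """
<html><head><title>Lab</title>
<meta name="description" content="Research lab home page"></head>
<body>
<img src="logo.png" alt="Lab logo">
<img src="photo.jpg">
<a href="people.html">People</a>
</body></html>
"""

checker = IndexabilityChecker()
checker.feed(page)
print("images missing ALT:", checker.images_missing_alt)
print("has description:", checker.has_description)
print("uses frames:", checker.uses_frames)
```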
How To Succeed With SEs • As a creator (cont.) • For surfers: • Use <meta name="description" …> • Don't expect surfers to start at top of your hierarchy • Don't rely on a hierarchy • Include a context map near the top of each page • Don't use frames • Think through dynamic content implications • Stickiness… is for another day