Chapter 5 Introduction to WWW Application
WWW Applications
• Search Engine / Meta-Search Engine
• Web Data Mining
• Bots and Internet Intelligent Agents
• Electronic Commerce
• Web Titles
• e-Learning
Section 5-1 Search Engine / Meta-Search Engine
What is a Search Engine?
• A mechanism that helps users find online resources quickly.
[Figure: User, Browser, Internet, Search Engine, Database]
Popular Search Engines
• AltaVista (http://www.altavista.com)
• Excite (http://www.excite.com)
• Google (http://www.google.com)
• HotBot (http://www.hotbot.com)
• Lycos (http://www.lycos.com)
• Yahoo! (http://www.yahoo.com)
• WebCrawler (http://www.webcrawler.com)
• Openfind, GAIS, Yam, etc.
Types of Search Tools
• Search Engines & Meta-Search Engines
  • Search Engine: Google
  • Meta-Search Engine: MetaCrawler, SavvySearch
• Subject Directories
  • Yahoo!
• Specialized Databases (The Invisible Web)
  • Librarian's Index
How to choose a starting point?
• Search Engines
  • Advantage: Can be fast.
  • Disadvantage: Irrelevant information can overwhelm useful information. (A good choice of keywords helps here.)
• Specialized Web Site
  • Advantage: Leads to information inaccessible to search engines.
  • Disadvantage: May not exist for your topic.
How to choose a starting point? (Cont.)
• FAQ
  • Advantage: A great place to start.
  • Disadvantage: Not all topics have FAQs.
• Guess
  • Advantage: Can be very fast.
  • Disadvantage: Requires experience, intuition.
• Discussion group
  • Advantage: Reaches a community of experts.
  • Disadvantage: Relatively slow. Experts may tire of beginner questions.
Search Engine & Catalog
• A catalog is the set of Web pages that a search engine knows how to find. Also called a database or index.
• A search engine can find only the Web pages in its catalog.
• No catalog covers the entire Internet. Because the Internet keeps changing, catalogs are never completely up to date.
Give a query, get a hit
• A keyword is a word, partial word, or phrase that you can give to a search engine. Also called a search term.
• A query is one or more keywords that, together, represent the concept that you want to find on the Net. Also called a search string.
• A hit is a Web page in the catalog that matches your query. Also called a match.
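
The relationship between catalog, query, and hit can be sketched in a few lines of Python. This is only an illustration: the catalog contents, the function names, and the rule that a hit must contain every keyword of the query are assumptions made for the example, not something the slides specify.

```python
from collections import defaultdict

# Hypothetical catalog: URL -> page text (made-up example data).
CATALOG = {
    "http://example.com/a": "search engines index web pages",
    "http://example.com/b": "a spider crawls the web",
    "http://example.com/c": "cooking recipes for pasta",
}

def build_index(catalog):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in catalog.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the hits: catalog pages that contain every keyword in the query."""
    keywords = query.lower().split()
    if not keywords:
        return set()
    # Start from the pages matching the first keyword, then narrow down.
    hits = set(index.get(keywords[0], set()))
    for keyword in keywords[1:]:
        hits &= index.get(keyword, set())
    return hits

INDEX = build_index(CATALOG)
print(sorted(search(INDEX, "web pages")))
```

A page that is not in `CATALOG` can never be returned, which mirrors the point that a search engine finds only the pages in its catalog.
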
Techniques to build catalogs
• Active Search Engine
  • Collects Web page information by itself.
  • Uses a program called a spider (also called a robot, wanderer, or crawler) that travels around the Net, locates Web pages, and adds entries to the catalog.
  • Some spiders run all the time, adding information to the catalog on a regular basis. Others run less frequently, perhaps updating the catalog weekly or monthly.
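
A spider's basic loop can be sketched as follows. To keep the example runnable without network access, the Web is replaced by a hypothetical in-memory link graph (`LINKS`); the breadth-first visiting order, the `max_pages` limit, and all names are assumptions chosen for illustration.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: URL -> URLs it links to.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c"],
    "http://example.com/b": ["http://example.com/a"],
    "http://example.com/c": [],
}

def crawl(start, links, max_pages=100):
    """Visit pages breadth-first from start; return URLs in catalog order."""
    catalog = []
    seen = {start}
    queue = deque([start])
    while queue and len(catalog) < max_pages:
        url = queue.popleft()
        catalog.append(url)  # add an entry to the catalog
        for target in links.get(url, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return catalog

print(crawl("http://example.com/", LINKS))
```

A real spider would fetch each page over HTTP and extract its links instead of reading them from a dictionary; the queue-and-seen-set structure stays the same.
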
[Figure: a Passive Search Engine receives URLs registered for My Web Page; an Active Search Engine collects URLs from the WWW]
Techniques to build catalogs (Cont.)
• Passive Search Engine
  • Does not seek out Web pages by itself.
  • Allows people to register their Web pages, usually by filling out a form online. Once a page is registered with the search engine, the page can be found by queries.
• Some search engines have both active and passive features. They use a spider to gather information, but also allow users to register pages.
Techniques to build catalogs (Cont.)
• Meta-Search Engine
  • Does not catalog any Web pages itself.
  • Forwards the user's queries to other search engines, which do the actual work.
  • When results come back from the other search engines, the meta-search engine presents them to the user, possibly summarizing them or at least giving them a consistent appearance.
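
The forwarding-and-merging behavior of a meta-search engine can be sketched as below. The two engine functions are hypothetical stand-ins that return canned results; a real meta-search engine would send the query to remote services. Dropping duplicate URLs and tagging each hit with its source are assumptions about what "a consistent appearance" might involve.

```python
def engine_one(query):
    """Hypothetical search engine returning (url, title) pairs."""
    return [
        ("http://example.com/a", "Page A about " + query),
        ("http://example.com/b", "Page B about " + query),
    ]

def engine_two(query):
    """A second hypothetical engine with partly overlapping results."""
    return [
        ("http://example.com/b", "B: " + query),
        ("http://example.com/c", "C: " + query),
    ]

def meta_search(query, engines):
    """Forward the query to every engine and merge the hits.

    engines is a list of (name, function) pairs. Each hit is returned
    in one uniform shape, and a URL already seen is not repeated.
    """
    merged = []
    seen = set()
    for name, engine in engines:
        for url, title in engine(query):
            if url in seen:
                continue
            seen.add(url)
            merged.append({"url": url, "title": title, "source": name})
    return merged

RESULTS = meta_search("spiders", [("one", engine_one), ("two", engine_two)])
for hit in RESULTS:
    print(hit["source"], hit["url"], hit["title"])
```

Because the engines are queried one after another here, total search time grows with the number of engines, which is consistent with the longer search time usually attributed to meta-search engines.
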
[Figure: MetaCrawler forwards the user's query to AltaVista, Lycos, and Yahoo, and returns their hits to the user]
Comparison of Search Engines
• Active Search Engine
  • Advantage: Large catalog.
  • Disadvantage: Too many hits.
• Passive Search Engine
  • Advantage: Possibly more organized.
  • Disadvantage: Smaller catalog; items may be cataloged in unexpected places.
• Meta-Search Engine
  • Advantage: One query goes a long way.
  • Disadvantage: Longer search time.
Choose keywords with care
• The success of a Web search depends heavily on the keywords you choose. Be sure to watch out for:
  • Misspellings
  • Alternate spellings
  • Synonyms
  • Word forms
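
As a rough illustration of how these pitfalls might be handled in software, the sketch below suggests a correction for a misspelled keyword and expands a keyword with alternate spellings and synonyms. The vocabulary and both lookup tables are made-up example data, and the similarity cutoff of 0.8 is an arbitrary assumption; only `difflib.get_close_matches` is a standard-library call.

```python
import difflib

# Hypothetical example data, not taken from any real search engine.
VOCABULARY = ["search", "engine", "catalog", "color", "spider"]
ALTERNATE_SPELLINGS = {"color": ["colour"], "catalog": ["catalogue"]}
SYNONYMS = {"catalog": ["index", "database"], "spider": ["robot", "crawler"]}

def suggest_spelling(word, vocabulary):
    """Return the closest known word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else None

def expand_keyword(word):
    """Return the keyword plus its alternate spellings and synonyms."""
    word = word.lower()
    variants = [word]
    variants += ALTERNATE_SPELLINGS.get(word, [])
    variants += SYNONYMS.get(word, [])
    return variants

print(suggest_spelling("serch", VOCABULARY))
print(expand_keyword("catalog"))
```

Word forms (for example plural versus singular) are not handled here; they would need a stemming step, which is beyond this sketch.
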
The forms of advanced query
Search Strategies
• General search: When you know little about your topic.
• Specific search: When you know a lot about your topic.
• Incremental search: Zeroing in on your topic.
• Substring search: Matching several similar keywords at once.
• Search-and-jump: A speedy, two-part search technique.
• Category search: Convenient browsing of a topic area.
• Search-and-rank: Locating the most relevant hits first.
Comparison of search strategies
Some Meta-Search Engines
• WebCrawler
  • Characteristics
    • It uses a content-based, full-text indexing system to provide a high-quality index.
    • It uses a breadth-first search strategy to create a broad index.
    • It tries to include as many Web servers as possible.
Some Meta-Search Engines (Cont.)
• Architecture
  • The search engine.
  • The agents.
  • The database.
  • The query server.
[Figure: Agents, Internet (Webspace), Query Server, Search Engine, Database]
Some Meta-Search Engines (Cont.)
• Lycos
  • It extracts the following pieces of information from each document that it retrieves:
    • Title
    • Headings and subheadings
    • 100 most important words
    • First 20 lines
    • Size in bytes
    • Number of words
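
A per-document summary of this kind could be extracted roughly as follows. This is a sketch under several assumptions, not Lycos's actual implementation: the documents are taken to be HTML, headings are taken to be the `h1` to `h6` elements, the word count covers text outside the title, and the class and function names are invented for the example. The 100 most important words are left out because they depend on the whole collection, not on one document.

```python
from html.parser import HTMLParser

class SummaryParser(HTMLParser):
    """Collect the title, the heading texts, and the body text of a page."""
    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.text_parts = []
        self._current = None  # title/heading tag currently being captured

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADING_TAGS:
            self._current = tag
            if tag != "title":
                self.headings.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
            return
        if self._current is not None:
            self.headings[-1] += data
        self.text_parts.append(data)

def summarize(html):
    """Return the per-document fields listed above, except the top words."""
    parser = SummaryParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "headings": [h.strip() for h in parser.headings],
        "first_lines": html.splitlines()[:20],
        "size_bytes": len(html.encode("utf-8")),
        "word_count": len(" ".join(parser.text_parts).split()),
    }

PAGE = ("<html><head><title>Test Page</title></head><body>"
        "<h1>Intro</h1><p>hello web world</p><h2>More</h2></body></html>")
SUMMARY = summarize(PAGE)
print(SUMMARY["title"], SUMMARY["headings"], SUMMARY["word_count"])
```
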
Some Meta-Search Engines (Cont.)
• The 100 most important words are selected using the Tf * IDf weighting algorithm.
  • Tf (Term Frequency) is the number of occurrences of a particular term in a document.
  • Df (Document Frequency) is the number of documents in the collection in which that term occurs.
  • IDf (Inverse Document Frequency)
    • N: the number of documents in the collection
    • IDf = log(N / Df)
  • weight = Tf * IDf = Tf * log(N / Df)
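
The weighting formula can be tried out on a small collection. The sketch below assumes whitespace tokenization, lowercase folding, and the natural logarithm (the slide does not state a base); the function names and the sample documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return one {term: weight} dict per document.

    weight = Tf * log(N / Df), where Tf counts the term in that document,
    Df counts the documents containing the term, and N is the number of
    documents in the collection.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # each document counts once per term
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

def top_terms(weight_map, k=100):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    return sorted(weight_map, key=lambda t: (-weight_map[t], t))[:k]

DOCS = ["web search web", "web page", "search engine"]
WEIGHTS = tf_idf_weights(DOCS)
print(top_terms(WEIGHTS[0], k=1), top_terms(WEIGHTS[1], k=1))
```

In the second document, "page" outranks "web" because "page" occurs in only one of the three documents while "web" occurs in two; a term that occurs in every document gets weight 0, since log(N / N) = 0.
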
Some Meta-Search Engines (Cont.)
• Harvest
  • It is an integrated tool that provides a scalable, customizable architecture for gathering, indexing, caching, replicating, and accessing Internet information.
[Figure: WWW Robots and Web Servers]
Some Meta-Search Engines (Cont.)
• Subsystems
  • Gatherer collects indexing information
  • Broker provides a flexible interface to gathered information
  • Index/Search subsystem allows the information space to be flexibly indexed and searched in a variety of ways
  • Object Cache stores contents of retrieved objects to alleviate access bottlenecks to popular data
  • Replicator mirrors index information of Brokers to alleviate server bottlenecks
[Figure: Brokers (Index), Filters, Gatherers, Web Servers]
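
The division of labor between Gatherer and Broker can be illustrated with a toy model. This is not Harvest's actual interface: the class and method names, the in-memory page dictionaries, and the choice of a word set as "indexing information" are all assumptions made for the example. The Object Cache and Replicator subsystems are omitted.

```python
class Gatherer:
    """Collects indexing information from the pages of one server."""

    def __init__(self, pages):
        self.pages = pages  # hypothetical: URL -> page text

    def gather(self):
        """Return URL -> set of lowercase words found on that page."""
        return {url: set(text.lower().split()) for url, text in self.pages.items()}

class Broker:
    """Indexes what several gatherers collected and answers queries."""

    def __init__(self, gatherers):
        self.index = {}
        for gatherer in gatherers:
            self.index.update(gatherer.gather())

    def query(self, keyword):
        """Return the sorted URLs whose gathered words include keyword."""
        keyword = keyword.lower()
        return sorted(url for url, words in self.index.items() if keyword in words)

SERVER_A = Gatherer({"http://a.example/1": "Harvest gathers index data"})
SERVER_B = Gatherer({"http://b.example/1": "brokers index and answer queries"})
BROKER = Broker([SERVER_A, SERVER_B])
print(BROKER.query("index"))
```

Keeping one gatherer per server and letting a broker combine their output is what allows the gathering work to be spread out instead of having a single robot fetch everything.
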