Chapter 5 Introduction to WWW Application

WWW Applications
• Search Engine / Meta-Search Engine
• Web Data Mining
• Bots and Internet Intelligent Agents
• Electronic Commerce
• Web Titles
• e-Learning

Section 5-1 Search Engine / Meta-Search Engine

What is a Search Engine?
• A mechanism that helps users find online resources quickly.
[Figure labels: User, Browser, Internet, Search Engine, Database]

Popular Search Engines
• AltaVista (http://www.altavista.com)
• Excite (http://www.excite.com)
• Google (http://www.google.com)
• HotBot (http://www.hotbot.com)
• Lycos (http://www.lycos.com)
• Yahoo! (http://www.yahoo.com)
• WebCrawler (http://www.webcrawler.com)
• Openfind, GAIS, Yam, etc.

Types of Search Tools
• Search Engines & Meta-Search Engines
  • Search Engine: Google
  • Meta-Search Engine: MetaCrawler, SavvySearch
• Subject Directories
  • Yahoo!
• Specialized Databases (The Invisible Web)
  • Librarian's Index

How to choose a starting point?
• Search Engines
  • Advantage: Can be fast.
  • Disadvantage: Irrelevant information can overwhelm useful information. (A good choice of keywords can help here.)
• Specialized Web Site
  • Advantage: Leads to information inaccessible to search engines.
  • Disadvantage: May not exist for your topic.

How to choose a starting point? (Cont.)
• FAQ
  • Advantage: A great place to start.
  • Disadvantage: Not all topics have FAQs.
• Guess
  • Advantage: Can be very fast.
  • Disadvantage: Requires experience and intuition.
• Discussion group
  • Advantage: Reaches a community of experts.
  • Disadvantage: Relatively slow. Experts may tire of beginner questions.

Search Engine & Catalog
• A catalog is the set of Web pages that a search engine knows how to find. Also called a database or index.
• A search engine can find only the Web pages in its catalog.
• No catalog covers the entire Internet. Because the Internet keeps changing, catalogs are never completely up to date.

Give a query, get a hit
• A keyword is a word, partial word, or phrase that you can give to a search engine. Also called a search term.
• A query is one or more keywords that, together, represent the concept that you want to find on the Net. Also called a search string.
• A hit is a Web page in the catalog that matches your query. Also called a match.
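
The relationship between catalog, query, and hit can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the page data, the `build_catalog` and `search` names, and the rule that a hit must contain every keyword in the query are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy data: URL -> page text (illustrative only).
PAGES = {
    "http://example.com/a": "search engines help users find online resources",
    "http://example.com/b": "a spider travels around the net and locates web pages",
    "http://example.com/c": "meta search engines forward queries to other search engines",
}


def build_catalog(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    catalog = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            catalog[word].add(url)
    return catalog


def search(catalog, query):
    """Return the sorted URLs (hits) that contain every keyword in the query."""
    keywords = query.lower().split()
    if not keywords:
        return []
    # Start from the pages matching the first keyword, then narrow down.
    hits = set(catalog.get(keywords[0], set()))
    for keyword in keywords[1:]:
        hits &= catalog.get(keyword, set())
    return sorted(hits)


catalog = build_catalog(PAGES)
print(search(catalog, "search engines"))
print(search(catalog, "spider"))
```

A page that is not in `PAGES` can never be returned, which mirrors the point that a search engine finds only what is in its catalog.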

Techniques to build catalogs
• Active Search Engine
  • Collects Web page information by itself.
  • Uses a program called a spider (also called a robot, wanderer, or crawler) that travels around the Net, locates Web pages, and adds entries to the catalog.
  • Some spiders run all the time, adding information to the catalog on a regular basis. Others run less frequently, perhaps updating the catalog weekly or monthly.

[Figure: active and passive search engines. Labels: Passive Search Engine, Register URLs, My Web Page, URLs, WWW, Active Search Engine]

Techniques to build catalogs (Cont.)
• Passive Search Engine
  • Does not seek out Web pages by itself.
  • Allows people to register their Web pages, usually by filling out a form online. Once a page is registered with the search engine, the page can be found by queries.
• Some search engines have both active and passive features. They use a spider to gather information, but also allow users to register pages.

Techniques to build catalogs (Cont.)
• Meta-Search Engine
  • Does not catalog any Web pages itself.
  • Forwards the user's queries to other search engines, which do the actual work.
  • When results come back from the other search engines, the meta-search engine presents them to the user, possibly summarizing them or at least giving them a consistent appearance.

[Figure: MetaCrawler forwards the user's query to AltaVista, Lycos, and Yahoo, and passes their hits back to the user]
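
The forwarding step of a meta-search engine can be sketched as follows. This is a minimal illustration under stated assumptions: `engine_a` and `engine_b` are hypothetical stand-ins that return fixed lists instead of contacting real search engines, and merging by dropping duplicate URLs is just one simple way to give the results a consistent appearance.

```python
# Hypothetical stand-ins for real search engines: each takes a query and
# returns a ranked list of URLs. A real meta-search engine would send
# network requests here instead.
def engine_a(query):
    return ["http://example.com/1", "http://example.com/2"]


def engine_b(query):
    return ["http://example.com/2", "http://example.com/3"]


def meta_search(query, engines):
    """Forward the query to every engine and merge the hits.

    Each hit is reported once, in a uniform format, tagged with the
    first engine that returned it.
    """
    merged = []
    seen = set()
    for name, engine in engines.items():
        for url in engine(query):
            if url not in seen:
                seen.add(url)
                merged.append({"url": url, "source": name})
    return merged


results = meta_search("web mining", {"A": engine_a, "B": engine_b})
for hit in results:
    print(hit["source"], hit["url"])
```

The sketch calls the engines one after another, so the total time is the sum of their response times, which is consistent with the longer search time noted as a disadvantage of meta-search.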

Comparison of Search Engines
• Active Search Engine
  • Advantage: Large catalog.
  • Disadvantage: Too many hits.
• Passive Search Engine
  • Advantage: Possibly more organized.
  • Disadvantage: Smaller catalog; items may be cataloged in unexpected places.
• Meta-Search Engine
  • Advantage: One query goes a long way.
  • Disadvantage: Longer search time.

Choose keywords with care
• The success of a Web search depends heavily on the keywords you choose. Be sure to watch out for:
  • Misspellings
  • Alternate spellings
  • Synonyms
  • Word forms

The forms of advanced query

The forms of advanced query (Cont.)

Search Strategies
• General search: When you know little about your topic.
• Specific search: When you know a lot about your topic.
• Incremental search: Zeroing in on your topic.
• Substring search: Matching several similar keywords at once.
• Search-and-jump: A speedy, two-part search technique.
• Category search: Convenient browsing of a topic area.
• Search-and-rank: Locating the most relevant hits first.
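
As a small illustration of the substring search strategy, the sketch below matches a word fragment against every word on each page, so a single fragment such as "librar" covers "library", "libraries", and "librarian" at once. The page data and the `substring_search` name are hypothetical, and real search engines may support partial-word matching differently or not at all.

```python
# Hypothetical toy pages: identifier -> page text (illustrative only).
TOY_PAGES = {
    "u1": "the library opens early",
    "u2": "ask a librarian about libraries",
    "u3": "web data mining",
}


def substring_search(pages, fragment):
    """Return, per page, the distinct words that contain the fragment."""
    fragment = fragment.lower()
    hits = {}
    for url, text in pages.items():
        words = sorted({w for w in text.lower().split() if fragment in w})
        if words:
            hits[url] = words
    return hits


print(substring_search(TOY_PAGES, "librar"))
```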

Comparison of search strategies

Some Meta-Search Engines
• WebCrawler
  • Characteristics
    • It uses a content-based, full-text indexing system to provide a high-quality index.
    • It uses a breadth-first search strategy to create a broad index.
    • It tries to include as many Web servers as possible.
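
A breadth-first strategy visits all pages one link away from the starting point before any page two links away, which is why it tends to produce a broad index. The sketch below shows the idea on a hypothetical in-memory link graph. It is not WebCrawler's actual code: the `LINKS` data, the `crawl_breadth_first` name, and the `max_pages` limit are assumptions for the example, and a real spider would fetch pages over the network.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
LINKS = {
    "home": ["news", "docs"],
    "news": ["home", "story"],
    "docs": ["api"],
    "story": [],
    "api": ["home"],
}


def crawl_breadth_first(links, start, max_pages=100):
    """Visit pages level by level from start; return them in visit order."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        for target in links.get(page, []):
            # Skip pages already discovered and respect the page limit.
            if target not in seen and len(visited) < max_pages:
                seen.add(target)
                visited.append(target)
                queue.append(target)
    return visited


print(crawl_breadth_first(LINKS, "home"))
```

Starting from "home", the pages directly linked from it ("news", "docs") are visited before the deeper pages ("story", "api").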

Some Meta-Search Engines (Cont.)
• Architecture
  • The search engine.
  • The agents.
  • The database.
  • The query server.
[Figure: WebCrawler architecture. Labels: Agents, Internet Webspace, Query Server, Search Engine, Database]

Some Meta-Search Engines (Cont.)
• Lycos
  • It extracts the following pieces of information from each document that it retrieves:
    • Title
    • Headings and subheadings
    • 100 most important words
    • First 20 lines
    • Size in bytes
    • Number of words
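
To make the list above concrete, here is a rough sketch of extracting most of these fields from an HTML page with Python's standard `html.parser` module. It is not Lycos's actual method. The `SummaryParser` and `summarize` names and the `SAMPLE` page are hypothetical, "first 20 lines" is interpreted here as the first 20 lines of the raw source, and the 100 most important words are left out because they depend on a weighting over the whole collection.

```python
from html.parser import HTMLParser


class SummaryParser(HTMLParser):
    """Collect the title, headings (h1-h6), and visible text of a page."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.text_parts = []
        self._current = None  # tag whose text is being captured, if any

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADING_TAGS:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current == "title":
            self.title += text
        else:
            if self._current in self.HEADING_TAGS:
                self.headings.append(text)
            self.text_parts.append(text)


def summarize(html, max_lines=20):
    """Return a dict with the fields extracted from one HTML document."""
    parser = SummaryParser()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.text_parts).split()
    return {
        "title": parser.title,
        "headings": parser.headings,
        "first_lines": html.splitlines()[:max_lines],
        "size_bytes": len(html.encode("utf-8")),
        "word_count": len(words),
    }


# Hypothetical sample page used only to exercise the sketch.
SAMPLE = """<html><head><title>Web Search</title></head>
<body><h1>Search Engines</h1>
<p>A spider locates Web pages.</p>
<h2>Catalogs</h2>
<p>A catalog is an index.</p></body></html>"""

summary = summarize(SAMPLE)
print(summary["title"], summary["headings"], summary["word_count"])
```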

Some Meta-Search Engines (Cont.)
• The 100 important words are selected using the Tf * IDf weighting algorithm.
  • Tf (Term Frequency) is the number of occurrences of a particular term in a document.
  • Df (Document Frequency) is the number of documents in the collection in which a particular term occurs.
  • IDf (Inverse Document Frequency)
    • N: the number of documents in the collection
    • IDf = log(N / Df)
  • weight = Tf * IDf = Tf * log(N / Df)
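
The weighting formula can be tried out on a toy collection. The sketch below follows weight = Tf * log(N / Df) directly, with a few assumptions of its own: terms are lowercase whitespace-separated words, the logarithm is the natural log (the slide does not name a base), and the `tf_idf_weights` and `top_terms` names and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return one {term: weight} dict per document, weight = Tf * log(N / Df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Df: number of documents in which each term occurs at least once.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # Tf: occurrences of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def top_terms(weights, k=100):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical three-document collection.
DOCS = ["web search engine web", "web spider", "search catalog"]
doc_weights = tf_idf_weights(DOCS)
print(top_terms(doc_weights[0], k=2))
```

In the first document, "web" occurs twice but appears in two of the three documents, so its weight is 2 * log(3/2). "engine" occurs once but in only one document, so its weight is log(3), which is larger. A term that is rare across the collection can therefore outrank a term that is merely frequent in one document.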

Some Meta-Search Engines (Cont.)
• Harvest
  • It is an integrated tool that provides a scalable, customizable architecture for gathering, indexing, caching, replicating, and accessing Internet information.
[Figure labels: three WWW Robots and three Web Servers]

Some Meta-Search Engines (Cont.)
• Subsystems
  • Gatherer collects indexing information.
  • Broker provides a flexible interface to gathered information.
  • Index/Search subsystem allows the information space to be flexibly indexed and searched in a variety of ways.
  • Object Cache stores contents of retrieved objects to alleviate access bottlenecks to popular data.
  • Replicator mirrors index information of Brokers to alleviate server bottlenecks.
[Figure: Harvest architecture. Labels: Broker (Index), Filter, Gatherer, Web Server]
