WWW servers and search engines 2004, 劉震昌
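Web browser and server
• Tools to read HTML documents
• [Diagram: client (Web browser) and server (Web server, e.g. running IIS). The user clicks a link, the browser sends a request, the server finds the document and returns the HTML document, and the browser displays it.]
• Where is the web server?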
Probing the Internet (cont.)
• tracert, ping
• Packet: the unit of data transmission on the network
• [Diagram: a packet travels from the source through routers to the destination, www.yahoo.com.tw]
Probing the Internet (How do you know you are on the Internet?)
• ping www.yahoo.com.tw

  Pinging rc.tpe.yahoo.com [202.1.237.23] with 32 bytes of data:

  Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
  Reply from 202.1.237.23: bytes=32 time=5ms TTL=246
  Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
  Reply from 202.1.237.23: bytes=32 time=4ms TTL=246

  Ping statistics for 202.1.237.23:
      Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
  Approximate round trip times in milli-seconds:
      Minimum = 4ms, Maximum = 5ms, Average = 4ms
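
The summary at the end of the ping output can be recomputed from the reply lines. The following is a minimal sketch, not how ping itself is implemented: the function names are made up for illustration, and using integer division for the average is an assumption that happens to agree with the "Average = 4ms" shown above.

```python
import re

# Reply lines copied from the ping output on this slide.
PING_OUTPUT = """\
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
Reply from 202.1.237.23: bytes=32 time=5ms TTL=246
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
"""

def round_trip_times(text):
    """Return the round-trip time in ms found on every reply line."""
    return [int(ms) for ms in re.findall(r"time[=<](\d+)ms", text)]

def summarize(times, sent):
    """Recompute the statistics block from the individual replies."""
    if not times:
        raise ValueError("no replies received")
    received = len(times)
    return {
        "sent": sent,
        "received": received,
        "lost": sent - received,
        "min_ms": min(times),
        "max_ms": max(times),
        "avg_ms": sum(times) // received,  # assumption: truncated average
    }

stats = summarize(round_trip_times(PING_OUTPUT), sent=4)
print(stats)
```

A lost packet would show up as a smaller "received" count, since a timed-out request produces no reply line to parse.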
The route from source to destination
• tracert www.yahoo.com.tw

  Tracing route to rc.tpe.yahoo.com [202.1.237.23]
  over a maximum of 30 hops:

    1    <1 ms    <1 ms    <1 ms  gateway.lan20.csie.ncnu.edu.tw [163.22.20.254]
    2    <1 ms    <1 ms    <1 ms  ip253.puli01.ncnu.edu.tw [163.22.1.253]
    3    <1 ms    <1 ms    <1 ms  ip090.puli255-64-203.ncnu.edu.tw [203.64.255.90]
    4     1 ms     1 ms     1 ms  140.128.251.38
    5    17 ms    74 ms     2 ms  tc-tanet-gw01.router.hinet.net [211.22.189.186]
    6     2 ms     1 ms     1 ms  211.22.189.190
    7     1 ms     1 ms     1 ms  tc-c12r2.router.hinet.net [211.22.189.74]
    8     4 ms     4 ms     4 ms  tp-s2-c12r2.router.hinet.net [210.65.200.30]
    9     4 ms     4 ms     4 ms  tp-s2-c6r8.router.hinet.net [211.22.35.181]
   10     9 ms     5 ms     6 ms  211.22.41.89
   11     5 ms     5 ms     5 ms  rc.tpe.yahoo.com [202.1.237.23]

  Trace complete.
Lab#5
• Use ping and tracert to probe www.google.com.tw
• Record your results in a text file
• Email it to me with the subject: Lab5 followed by your student ID
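How to host a site (WWW, ftp, …) with a dynamic IP?
• DHCP (Dynamic Host Configuration Protocol)
  • DHCP explanation
• [Diagram: the same host holds IP 163.22.123.111 at one time and IP 163.22.123.123 at another. If we want to communicate with him, what is the IP or domain name?]
• Run your own DNS (domain name server)
• Dynamically register the IP and domain name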
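www.no-ip.com
• [Diagram: a host with a dynamic IP (163.22.123.111) registers the mapping between its IP and its domain name, Kamiry.no-ip.com, with the www.no-ip.com DNS server]
• Reference: No-IP user documentation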
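
The idea of dynamic registration can be illustrated with a toy in-memory registry. This is only a sketch: the class and method names are hypothetical, and it does not use or imitate the real No-IP client or any real DNS protocol. It only shows that the domain name stays fixed while the IP behind it is overwritten.

```python
class DynamicDnsRegistry:
    """Toy stand-in for a dynamic DNS service (hypothetical, in memory only)."""

    def __init__(self):
        self._records = {}  # domain name -> most recently registered IP

    def register(self, name, ip):
        """Create or overwrite the mapping, e.g. after DHCP assigns a new IP."""
        self._records[name.lower()] = ip

    def resolve(self, name):
        """Return the most recently registered IP, or None if unknown."""
        return self._records.get(name.lower())

registry = DynamicDnsRegistry()

# The host registers its current address under a fixed name.
registry.register("Kamiry.no-ip.com", "163.22.123.111")
first = registry.resolve("kamiry.no-ip.com")

# Later DHCP hands out another address; the host re-registers the same name.
registry.register("Kamiry.no-ip.com", "163.22.123.123")
second = registry.resolve("kamiry.no-ip.com")

print(first, second)
```

Visitors only ever need the name; whichever address was registered last is the one they reach.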
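Installing IIS (Internet Information Server)
• On the Windows CD
• Installation instructions
• IIS configuration
• Microsoft IIS is so widespread, and has so many security vulnerabilities, that a non-Microsoft WWW server can be used instead
  • E.g. Apache, analogx, …
• Reference documents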
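HW#3
• Set up a WWW server on your own computer
• Email the server's domain name to me
• Put your personal web page on your own computer
• The server must be running during the hours specified by the teaching assistant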
Searching the Web Ref: Chapter 13 in “Modern Information Retrieval” Ricardo Baeza-Yates and Berthier Ribeiro-Neto
Outline
• Measuring the Web
• Methods for searching the Web
  • Search engines
  • Web directories
Searching the Web
• The WWW started in 1989
• The textual data alone is estimated to be on the order of one terabyte
• Goal: how to efficiently manage, retrieve, and filter information from the Web?
Challenges
• Distributed data
  • Data spans many computers interconnected without a predefined topology
• High percentage of volatile data
  • 40% of the Web changes every month
• Large volume
• Unstructured and redundant data
  • 30% of Web pages are (near) duplicates
• Heterogeneous data
  • Different languages
Measuring the Web
• 1998: 3M (3 million) Web servers
• No. of Web servers = 1/10 of the no. of computers on the Internet
• [Diagram labels: URLs, WWW, Web server, Internet]
Measuring the Web (cont.)
• 1998
  • 5 KB per Web page on average
  • 300M Web pages (300 million)
  • 300M * 5 KB = 1.5 terabytes
  • Growing at a rate of 20M pages per month
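
The size estimate on this slide is simple arithmetic and can be checked directly. The sketch below assumes decimal units (1 KB = 1000 bytes, 1 TB = 10**12 bytes) and, for the one-year projection, that growth stays constant at 20M pages per month; the projection is an illustration, not a figure from the slides.

```python
PAGES_1998 = 300_000_000        # 300M Web pages
BYTES_PER_PAGE = 5_000          # 5 KB per page, taking 1 KB = 1000 bytes
GROWTH_PER_MONTH = 20_000_000   # 20M new pages per month

def text_volume_tb(pages, bytes_per_page=BYTES_PER_PAGE):
    """Total text volume in terabytes (1 TB = 10**12 bytes)."""
    return pages * bytes_per_page / 10**12

size_now = text_volume_tb(PAGES_1998)

# Assumption: constant growth for 12 months at the same average page size.
size_after_year = text_volume_tb(PAGES_1998 + 12 * GROWTH_PER_MONTH)

print(size_now, size_after_year)
```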
Growth of the Web
• [Figure: Web pages and Web sites by year, 1996 to 1998; vertical axis in millions, with ticks at 100, 200, and 300]
Methods for searching the Web
• Search engines
  • Index the Web documents as a full-text database
  • AltaVista, Google, …
• Web directories (portal site directories)
  • Classify selected Web documents by subject
  • Yahoo!
Search engines concept
• Model the Web as a database
• All queries must be answered without accessing the Web pages
• [Diagram: user queries are answered from the database]
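
One common way to answer queries without touching the pages is an inverted index built ahead of time. The sketch below is a minimal illustration under that assumption; the page contents and URLs are invented, and real engines store far more than a word-to-URL map.

```python
from collections import defaultdict

# Hypothetical page contents, fetched once at indexing time.
PAGES = {
    "http://a.example/": "web servers return html documents",
    "http://b.example/": "search engines index web documents",
    "http://c.example/": "directories classify web sites by subject",
}

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, word):
    """Answer a one-word query from the index alone; no page is fetched."""
    return sorted(index.get(word.lower(), set()))

INDEX = build_index(PAGES)
print(search(INDEX, "documents"))
```

Once the index exists, PAGES is never consulted again at query time, which is the point of modelling the Web as a database.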
Search engines (cont.)
• AltaVista (www.altavista.com)
  • 20 multi-processor machines
  • 130 GB of RAM each
  • Over 500 GB of disk space each
  • 75% of the resources are used by the query engine
The top search engines
• Foreign
  • Google (www.google.com)
  • www.yahoo.com
  • www.altavista.com
  • Inktomi (www.inktomi.com)
• Statistics on search engines
  • www.searchenginewatch.com
  • http://imt.net/~notess/search
• Taiwan
  • Yahoo!/Kimo uses Google
  • Openfind (www.openfind.com.tw) (Prof. 吳昇, National Chung Cheng University)
  • Yam (www.yam.com.tw)
Search engines (cont.)
• Centralized crawler-indexer architecture
• [Diagram components: users, User Interface, Query Engine, Index database, Indexer, Crawler, Web]
User Interface
• Query interface
  • Keywords
  • Boolean operators
• Answer interface
  • Ranks the retrieved pages using:
    • Statistics about term occurrences within the document
    • Popularity
    • Hyperlink information
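
The two halves of the interface can be sketched together: a Boolean operator decides which documents match, and term-occurrence counts decide their order. This is a simplified illustration with invented documents; popularity and hyperlink information are left out, and the scoring rule (sum of term counts) is an assumption rather than any particular engine's method.

```python
from collections import Counter

# Hypothetical documents.
DOCS = {
    "doc1": "web search engines index the web",
    "doc2": "a web directory classifies web sites and web pages",
    "doc3": "search engines rank pages",
}

# Term counts per document, computed once.
COUNTS = {name: Counter(text.split()) for name, text in DOCS.items()}

def matches(counts, terms, operator):
    """Boolean match: AND needs every term; anything else is treated as OR."""
    present = [counts[t] > 0 for t in terms]
    return all(present) if operator == "AND" else any(present)

def ranked_search(terms, operator="AND"):
    """Return matching documents, highest total term count first."""
    hits = [
        (sum(c[t] for t in terms), name)
        for name, c in COUNTS.items()
        if matches(c, terms, operator)
    ]
    # Sort by descending score, then by name so ties are deterministic.
    return [name for score, name in sorted(hits, key=lambda h: (-h[0], h[1]))]

print(ranked_search(["web", "pages"], "AND"))
print(ranked_search(["web", "pages"], "OR"))
```

With AND only doc2 contains both terms; with OR all three documents match, and doc2 still ranks first because the terms occur there most often.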
Crawler
• Also called robots, spiders, wanderers, walkers, and knowbots
• Despite the name, a crawler does not move across the Web: it runs on a local system and sends requests to remote Web servers
• Method: start with a set of URLs, and from there extract other URLs
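
The method on this slide (start from a set of URLs and extract further URLs from each page) can be sketched as follows. To keep the example self-contained, a dictionary stands in for the remote Web servers, so no real requests are sent; the URLs are invented, and breadth-first order is an assumption, since the slide does not say in which order pages are visited.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "Web": URL -> HTML, standing in for HTTP requests.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a>',
    "http://b.example/": '<a href="http://a.example/">A</a> <a href="http://d.example/">D</a>',
    "http://c.example/": "<p>no links here</p>",
    # http://d.example/ is deliberately missing: an invalid link.
}

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, fetch):
    """Breadth-first traversal: visit each URL once and queue the links found."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited, broken = [], []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # the "server" has no such page
            broken.append(url)
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited, broken

visited, broken = crawl(["http://a.example/"], FAKE_WEB.get)
print(visited, broken)
```

The crawler itself never leaves the local machine; only the fetch function would talk to remote servers in a real setting.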
Crawler (cont.)
• Because of how the Web is traversed, the index of a search engine can be thought of as analogous to the stars in the sky: each page was seen at a different time and may since have changed or disappeared
• Invalid links in search engines vary from 2% to 9%
• The fastest crawlers can traverse up to 10M Web pages per day (1998)
• 300M pages / 10M pages per day = 30 days for one full traversal
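
The traversal time on this slide, and why the index resembles starlight, can be made concrete with a small calculation. The average-age figure below is an illustration under stated assumptions (pages are fetched at a uniform rate and the index is inspected the moment a pass ends); it is not a number from the slides.

```python
def days_to_traverse(total_pages, pages_per_day):
    """Days for one full pass over the Web, rounded up to whole days."""
    return -(-total_pages // pages_per_day)  # ceiling division on integers

WEB_PAGES = 300_000_000   # 300M pages
CRAWL_RATE = 10_000_000   # 10M pages per day

full_pass = days_to_traverse(WEB_PAGES, CRAWL_RATE)

# Assumption: uniform crawl rate, so an entry's age at the end of the pass
# ranges from 0 to full_pass days and averages half of that.
average_age_days = full_pass / 2

print(full_pass, average_age_days)
```

A page fetched on the first day is already a month old when the pass finishes, which is how an index ends up pointing at pages that have changed or no longer exist.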
Web directories
• Classify Web pages by category
• Directories are hierarchical taxonomies that classify human knowledge
• Yahoo! has close to 1M pages classified
• How are pages classified?
  • Pages have to be submitted to the Web directories
  • Classification is done manually by a few people
  • Automatic classification is not yet mature
• Not every page is classified
Some Web directories
Web directory    URL                 Web sites (K)   Categories
Yahoo!           www.yahoo.com       750             -
LookSmart        www.looksmart.com   300             24
Lycos Subjects   a2z.lycos.com       50              -
eBLAST           www.eblast.com      125             -
NewHoo           www.newhoo.com      100             23
Magellan         www.mckinley.com    60              -
Netscape         www.netscape.com    -               -
Snap             www.snap.com        -               -
(- = not given)
Lab on search engines
• Today, 1:00 to 3:00
Final typing test
• 10/20
• Students who do not meet the standard lose 10 points from their semester total score