Tiny Search Engine: Design and implementation
YAN Hongfei (闫宏飞) yhf@net.cs.pku.edu.cn
Network Group, Oct. 2003
Outline
• analysis, which deals with the design requirements and overall architecture of a system;
• design, which translates a system architecture into programming constructs (such as interfaces, classes, and method descriptions);
• programming, which implements these programming constructs.
Defining System Requirements and Capabilities
• Supports multi-threaded page crawling
• Supports persistent HTTP connections
• Supports DNS caching
• Supports IP blocking
• Supports filtering of unreachable sites
• Supports link parsing
• Supports recursive crawling
• Supports Tianwang-format output
• Supports ISAM output
• Supports enumerating a page in the depot according to its URL
• Supports searching for a keyword in the depot
The Web as a directed graph
• Pages are the nodes
• HTML hyperlink references (<href …>) are the directed edges
Connectivity of directed graphs
• Strong connectivity: there is a directed path between any two nodes
• "Root connectivity": there is a node from which every other node can be reached by a directed path
• Theorem: a strongly connected directed graph is also root-connected
Properties of the Web graph
• The Web graph at any instant of time contains k connected subgraphs (but we do not know the value of k, nor do we know a priori the structure of the Web graph).
• If we knew every connected web subgraph, we could build a k-web-spanning forest, but this is a very big "IF".
• More precisely, what we actually care about are the root-connected subgraphs, but finding those "roots" is not easy.
Three main components of the Web
• HyperText Markup Language (HTML): a language for specifying the contents and layout of pages
• Uniform Resource Locators (URLs): identify documents and other resources
• A client-server architecture with HTTP: by which browsers and other clients fetch documents and other resources from web servers
HTML
<IMG SRC = "http://www.cdk3.net/WebExample/Images/earth.jpg">
<P>
Welcome to Earth! Visitors may also be interested in taking a look at the
<A HREF = "http://www.cdk3.net/WebExample/moon.html">Moon</A>.
</P>
(etcetera)
• HTML text is stored in a file on a web server.
• A browser retrieves the contents of this file from the web server.
• The browser interprets the HTML text.
• The server can infer the content type from the filename extension.
URL
scheme: scheme-specific-location
e.g.: mailto:joe@anISP.net
ftp://ftp.downloadIt.com/software/aProg.exe
http://net.pku.cn/
….
• HTTP URLs are the most widely used
• An HTTP URL has two main jobs to do:
• to identify which web server maintains the resource
• to identify which resource at that server
HTTP URLs
• http://servername[:port][/pathNameOnServer][?arguments]
• e.g.
http://www.cdk3.net/
http://www.w3c.org/Protocols/Activity.html
http://e.pku.cn/cgi-bin/allsearch?word=distributed+system

Server DNS name    Pathname on server          Arguments
www.cdk3.net       (default)                   (none)
www.w3c.org        Protocols/Activity.html     (none)
e.pku.cn           cgi-bin/allsearch           word=distributed+system
HTTP
• Defines the ways in which browsers and other types of client interact with web servers (RFC 2616)
• Main features:
• Request-reply interaction
• Content types: the strings that denote the type of content are called MIME types (RFC 2045, 2046)
• One resource per request (HTTP version 1.0)
• Simple access control
More features: services and dynamic pages
• Dynamic content
• Common Gateway Interface (CGI): a program that the web server runs to generate content for its clients
• Downloaded code
• JavaScript
• Applets
Web Graph-Search Algorithms I

PROCEDURE SPIDER1(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
    While STACK is not empty,
        URLcurr := pop(STACK)
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION

What is wrong with the above algorithm?
Depth-first Search (figure: nodes numbered 1 through 7; the numbers give the order in which the nodes are visited)
SPIDER1 is incorrect
• If the web graph has cycles => the algorithm will not halt
• If the graph has converging link structures => pages will be replicated in the collection => the index becomes inefficiently large => duplicates annoy the user
SPIDER1 is incomplete
• The Web graph has k connected subgraphs.
• SPIDER1 only reaches pages in the connected web subgraph where the ROOT page lives.
A Correct Spidering Algorithm

PROCEDURE SPIDER2(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
    While STACK is not empty,
|       Do URLcurr := pop(STACK)
|       Until URLcurr is not in COLLECTION
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION
A More Efficient Correct Algorithm

PROCEDURE SPIDER3(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
|   Initialize VISITED <big hash-table>
    While STACK is not empty,
|       Do URLcurr := pop(STACK)
|       Until URLcurr is not in VISITED
|       insert-hash(URLcurr, VISITED)
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION
A More Complete Correct Algorithm

PROCEDURE SPIDER4(G, {SEEDS})
|   Initialize COLLECTION <big file of URL-page pairs>
|   Initialize VISITED <big hash-table>
|   For every ROOT in SEEDS
|       Initialize STACK <stack data structure>
|       Let STACK := push(ROOT, STACK)
        While STACK is not empty,
            Do URLcurr := pop(STACK)
            Until URLcurr is not in VISITED
            insert-hash(URLcurr, VISITED)
            PAGE := look-up(URLcurr)
            STORE(<URLcurr, PAGE>, COLLECTION)
            For every URLi in PAGE,
                push(URLi, STACK)
    Return COLLECTION
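As a concrete illustration, here is a minimal C++ sketch of SPIDER4: a seed list, an explicit stack of URLs, and a VISITED set so the crawl terminates and stores each page once. The LookUp and ExtractLinks stubs only stand in for the fetching and link-parsing steps described later in these slides; they are not the actual TSE functions.

#include <map>
#include <set>
#include <stack>
#include <string>
#include <vector>

// Stubs standing in for the fetch and link-extraction steps (HttpFetch and
// the CPage link parsing described later); placeholders, not the TSE code.
static std::string LookUp(const std::string& /*url*/) { return std::string(); }
static std::vector<std::string> ExtractLinks(const std::string& /*page*/)
{ return std::vector<std::string>(); }

// SPIDER4 in C++: seeds, an explicit stack, and a VISITED set.
std::map<std::string, std::string> Spider4(const std::vector<std::string>& seeds)
{
    std::map<std::string, std::string> collection;   // COLLECTION: URL -> page
    std::set<std::string> visited;                   // stands in for the big hash table
    std::stack<std::string> todo;                    // STACK

    for (size_t i = 0; i < seeds.size(); ++i)
        todo.push(seeds[i]);

    while (!todo.empty()) {
        std::string url = todo.top();
        todo.pop();
        if (!visited.insert(url).second)             // already visited: skip it
            continue;
        std::string page = LookUp(url);              // fetch the page
        collection[url] = page;                      // STORE(<URL, PAGE>, COLLECTION)
        std::vector<std::string> links = ExtractLinks(page);
        for (size_t i = 0; i < links.size(); ++i)    // push every URL found in the page
            todo.push(links[i]);
    }
    return collection;
}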
Block diagram of one possible crawler architecture (figure)
What do we need?
• Intel x86/Linux (Red Hat Linux) platform
• C++
• ….
(Linus Torvalds)
Get the homepage of the PKU site
[webg@BigPc ]$ telnet www.pku.cn 80            connect to port 80 of the server
Trying 162.105.129.12...                       printed by the telnet client
Connected to rock.pku.cn (162.105.129.12).     printed by the telnet client
Escape character is '^]'.                      printed by the telnet client
GET /                                          the only line we type
<html>                                         first line of the web server's output
<head>
<title>北京大学</title>
……                                             many lines of output omitted here
</body>
</html>
Connection closed by foreign host.             printed by the telnet client
Outline
• analysis, which deals with the design requirements and overall architecture of a system;
• design, which translates a system architecture into programming constructs (such as interfaces, classes, and method descriptions);
• programming, which implements these programming constructs.
Defining system objects
• URL (RFC 1738)
• <scheme>://<net_loc>/<path>;<params>?<query>#<fragment>
• Apart from the scheme, the other components need not all appear in a URL at the same time.
• scheme ":" ::= protocol name
• "//" net_loc ::= network location / host name, login information
• "/" path ::= URL path
• ";" params ::= object parameters
• "?" query ::= query information
• Page
• ….
Class URL

class CUrl {
public:
    string m_sUrl;               // the URL string
    enum url_scheme m_eScheme;   // URL scheme (protocol name)
    string m_sHost;              // host string
    int m_nPort;                 // port number
    /* URL components (URL-quoted). */
    string m_sPath, m_sParams, m_sQuery, m_sFragment;
    /* Extracted path info (unquoted). */
    string m_sDir, m_sFile;
    /* Username and password (unquoted). */
    string m_sUser, m_sPasswd;

public:
    CUrl();
    ~CUrl();
    bool ParseUrl( string strUrl );

private:
    void ParseScheme ( const char *url );
};
CUrl::CUrl()

CUrl::CUrl()
{
    this->m_sUrl = "";
    this->m_eScheme = SCHEME_INVALID;
    this->m_sHost = "";
    this->m_nPort = DEFAULT_HTTP_PORT;
    this->m_sPath = "";
    this->m_sParams = "";
    this->m_sQuery = "";
    this->m_sFragment = "";
    this->m_sDir = "";
    this->m_sFile = "";
    this->m_sUser = "";
    this->m_sPasswd = "";
}
CUrl::ParseUrl

bool CUrl::ParseUrl( string strUrl )
{
    string::size_type idx;

    this->ParseScheme( strUrl.c_str( ) );
    if( this->m_eScheme != SCHEME_HTTP )
        return false;

    // get host name: skip the "http://" prefix, then cut at the first '/'
    this->m_sHost = strUrl.substr(7);
    idx = m_sHost.find('/');
    if( idx != string::npos ){
        m_sHost = m_sHost.substr( 0, idx );
    }

    this->m_sUrl = strUrl;
    return true;
}
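ParseUrl above keeps only the host. Below is a hedged sketch of how the port and path could additionally be split out; this helper is an illustration only, not part of the original TSE code.

#include <cstdlib>
#include <string>

// Sketch: split "http://host[:port]/path..." into host, port and path.
// Assumes the scheme has already been checked to be HTTP, as in ParseUrl above.
bool SplitHostPortPath(const std::string& strUrl,
                       std::string& host, int& port, std::string& path)
{
    const std::string prefix = "http://";
    if (strUrl.compare(0, prefix.size(), prefix) != 0)
        return false;                                  // only HTTP URLs are handled

    std::string rest = strUrl.substr(prefix.size());
    std::string::size_type slash = rest.find('/');
    host = (slash == std::string::npos) ? rest : rest.substr(0, slash);
    path = (slash == std::string::npos) ? "/" : rest.substr(slash);

    port = 80;                                         // DEFAULT_HTTP_PORT
    std::string::size_type colon = host.find(':');
    if (colon != std::string::npos) {
        port = std::atoi(host.c_str() + colon + 1);    // "host:8080" -> 8080
        host = host.substr(0, colon);
    }
    return true;
}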
Defining system objects
• URL
• <scheme>://<net_loc>/<path>;<params>?<query>#<fragment>
• Apart from the scheme, the other components need not all appear in a URL at the same time.
• scheme ":" ::= protocol name
• "//" net_loc ::= network location / host name, login information
• "/" path ::= URL path
• ";" params ::= object parameters
• "?" query ::= query information
• Page
• ….
Class Page

class CPage {
public:
    string m_sUrl;
    string m_sLocation;
    string m_sHeader;
    int m_nLenHeader;
    string m_sCharset;
    string m_sContentEncoding;
    string m_sContentType;
    string m_sContent;
    int m_nLenContent;
    string m_sContentLinkInfo;
    string m_sLinkInfo4SE;
    int m_nLenLinkInfo4SE;
    string m_sLinkInfo4History;
    int m_nLenLinkInfo4History;
    string m_sContentNoTags;
    int m_nRefLink4SENum;
    int m_nRefLink4HistoryNum;
    enum page_type m_eType;
    RefLink4SE m_RefLink4SE[MAX_URL_REFERENCES];
    RefLink4History m_RefLink4History[MAX_URL_REFERENCES/2];
    map<string,string,less<string> > m_mapLink4SE;
    vector<string > m_vecLink4History;
Class Page …continued

public:
    CPage();
    CPage(string strUrl, string strLocation, char* header, char* body, int nLenBody);
    ~CPage();

    int GetCharset();
    int GetContentEncoding();
    int GetContentType();
    int GetContentLinkInfo();
    int GetLinkInfo4SE();
    int GetLinkInfo4History();
    void FindRefLink4SE();
    void FindRefLink4History();

private:
    int NormallizeUrl(string& strUrl);
    bool IsFilterLink(string plink);
};
Sockets used for streams

Requesting a connection (client):
    s = socket(AF_INET, SOCK_STREAM, 0)
    connect(s, ServerAddress)
    write(s, "message", length)

Listening and accepting a connection (server):
    s = socket(AF_INET, SOCK_STREAM, 0)
    bind(s, ServerAddress);
    listen(s, 5);
    sNew = accept(s, ClientAddress);
    n = read(sNew, buffer, amount)

ServerAddress and ClientAddress are socket addresses
Issues to consider when establishing a connection to the server
• DNS caching
• There are hundreds of millions of URLs, but only millions of hosts.
• Is the site within the allowed crawling scope?
• Some sites do not want crawlers to take away their resources.
• Crawls targeted at specific content, e.g. campus-network search or news-site search.
• Billing rules may apply, e.g. domestic sites connected through CERNET are free of charge.
• Is the site reachable at all?
• Use a non-blocking connect when connecting to the server (a sketch follows below).
• Give up once a timeout expires.
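A minimal sketch of the non-blocking connect with a timeout mentioned in the last bullet; the timeout handling shown here is illustrative, not the actual TSE settings.

#include <cerrno>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

// Try to connect within `seconds`; return the socket on success, -1 otherwise.
int ConnectWithTimeout(const sockaddr_in& server, int seconds)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0)
        return -1;
    fcntl(s, F_SETFL, fcntl(s, F_GETFL, 0) | O_NONBLOCK);   // switch to non-blocking mode

    int rc = connect(s, (const sockaddr*)&server, sizeof(server));
    if (rc < 0 && errno != EINPROGRESS) {                    // immediate failure
        close(s);
        return -1;
    }
    if (rc != 0) {                                           // connection still in progress
        fd_set wset;
        FD_ZERO(&wset);
        FD_SET(s, &wset);
        timeval tv = { seconds, 0 };
        if (select(s + 1, 0, &wset, 0, &tv) <= 0) {          // timed out or select failed
            close(s);
            return -1;                                       // treat the site as unreachable
        }
        int err = 0;
        socklen_t len = sizeof(err);
        getsockopt(s, SOL_SOCKET, SO_ERROR, &err, &len);     // check the deferred result
        if (err != 0) {
            close(s);
            return -1;
        }
    }
    return s;   // connected; the caller may restore blocking mode before use
}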
Building the request message and sending it to the server (1/3)
• Implementation:
• int HttpFetch(string strUrl, char **fileBuf, char **fileHeadBuf, char **location, int* nPSock)
• Modeled on int http_fetch(const char *url_tmp, char **fileBuf) from http://fetch.sourceforge.net
• Allocate memory, assemble the request message, and send it (see the sketch below)
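A sketch of the kind of request HttpFetch assembles and sends over an already-connected socket; the header fields shown (User-Agent, Accept, Connection) are assumptions based on the HTTP description earlier, not the exact TSE message.

#include <string>
#include <sys/socket.h>
#include <sys/types.h>

// Build a minimal HTTP/1.1 GET request for the given host and path,
// then send it on an already-connected socket.
bool SendGetRequest(int sock, const std::string& host, const std::string& path)
{
    std::string request =
        "GET " + path + " HTTP/1.1\r\n"
        "Host: " + host + "\r\n"
        "User-Agent: TSE-spider\r\n"          // illustrative robot name
        "Accept: text/html\r\n"
        "Connection: close\r\n"
        "\r\n";                               // blank line ends the request header
    ssize_t sent = send(sock, request.c_str(), request.size(), 0);
    return sent == (ssize_t)request.size();
}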
Retrieving the header (2/3)
• Implementation:
• int HttpFetch(string strUrl, char **fileBuf, char **fileHeadBuf, char **location, int* nPSock)
• e.g.
HTTP/1.1 200 OK
Date: Tue, 16 Sep 2003 14:19:15 GMT
Server: Apache/2.0.40 (Red Hat Linux)
Last-Modified: Tue, 16 Sep 2003 13:18:19 GMT
ETag: "10f7a5-2c8e-375a5cc0"
Accept-Ranges: bytes
Content-Length: 11406
Connection: close
Content-Type: text/html; charset=GB2312
Retrieving the body (3/3)
• Implementation:
• int HttpFetch(string strUrl, char **fileBuf, char **fileHeadBuf, char **location, int* nPSock)
• e.g.
<html>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>Computer Networks and Distributed System</title>
</head>
….
Multiple crawler programs working in parallel
• LAN latency is 1-10 ms, with 10-1000 Mbps of bandwidth
• Internet latency is 100-500 ms, with 0.010-2 Mbps of bandwidth
• Use several machines on the same LAN, each running multiple concurrent processes:
• on one hand, the LAN's high bandwidth and low latency let the nodes exchange data freely;
• on the other hand, multi-process concurrency reduces the side effect of the Internet's high latency.
How many nodes should crawl in parallel, and how many robots should each node start? (1/2)
• A theoretical estimate:
• The average plain-text page is 13 KB.
• On a 100 Mbps Fast Ethernet, assuming 100% line utilization, at most (1.0e+8 b/s) / (1500 B * 8 b/B) ≈ 8333 data frames can be transmitted at once, i.e. roughly 8333 pages at the same time.
• If the LAN is connected to the Internet at 100 Mbps and the Internet link utilization stays below 50% (performance tends to degrade once network load exceeds 80%, and routing adds overhead), then fewer than 4000 pages can be transmitted concurrently on average.
• So in a crawling system of n nodes, each node should start fewer than 4000/n robots.
How many nodes should crawl in parallel, and how many robots should each node start? (2/2)
• Empirical values:
• In a real distributed crawling node, CPU and disk utilization must also be considered: CPU usage should normally stay below 50% and disk usage below 80%, otherwise the machine responds very slowly and the crawler cannot run properly.
• In the actual Tianwang system the LAN is 100 Mbps Ethernet; assuming the link from the LAN to the Internet is also 100 Mbps (this figure is not available, so it is our estimate), fewer than 1000 robots are started.
• This number of robots is sufficient for a search engine on the scale of hundreds of millions of pages (http://e.pku.cn/).
Crawling efficiency of a single node
• The physical characteristics of Ethernet require a data frame to be between 46 and 1500 bytes long.
• In a wide-area network with a round-trip time (RTT) of 200 ms and a server processing time (SPT) of 100 ms, a TCP transaction takes about 500 ms (2 RTT + SPT).
• A page is sent as a series of frames, so sending one page takes at least (13 KB / 1500 B) * 500 ms ≈ 4 s.
• If a single node runs 100 robot programs, each node should collect (24 * 60 * 60 s / 4 s) * 100 = 2,160,000 pages per day.
• Considering timeouts and invalid pages encountered during actual crawling, the real throughput per node is below 2,160,000 pages per day.
Multi-threading in TSE
• Multiple crawler threads concurrently take tasks from the queue of URLs waiting to be crawled (see the sketch below)
• The number of concurrent crawlers per site must be limited:
• A machine providing WWW service can handle only a limited number of pending TCP connections; pending connection requests wait in a backlog queue.
• If many crawler programs work in parallel without such control, the effect on the crawled site resembles a denial-of-service attack.
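A minimal pthreads sketch of crawler threads taking tasks from a shared URL queue, as described above; the shared queue, the lock and the CrawlOnePage stub are illustrative, not the actual TSE data structures.

#include <pthread.h>
#include <queue>
#include <string>
#include <vector>

std::queue<std::string> g_unvisited;                 // URLs waiting to be crawled
pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;  // protects g_unvisited

static void CrawlOnePage(const std::string& /*url*/)
{
    // Stub: a real crawler would fetch the page, store it, and push newly
    // parsed links back onto g_unvisited (under g_lock).
}

static void* CrawlerThread(void*)
{
    for (;;) {
        pthread_mutex_lock(&g_lock);
        if (g_unvisited.empty()) {                   // nothing left to do
            pthread_mutex_unlock(&g_lock);
            break;
        }
        std::string url = g_unvisited.front();
        g_unvisited.pop();
        pthread_mutex_unlock(&g_lock);
        CrawlOnePage(url);                           // network I/O happens outside the lock
    }
    return 0;
}

// Start n crawler threads and wait for all of them to finish.
void RunCrawlers(int n)
{
    std::vector<pthread_t> tids(n);
    for (int i = 0; i < n; ++i)
        pthread_create(&tids[i], 0, CrawlerThread, 0);
    for (int i = 0; i < n; ++i)
        pthread_join(tids[i], 0);
}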
How to avoid crawling pages repeatedly?
• Record which URLs have and have not been visited
• ISAM storage:
• When a new URL is parsed out, look it up in WebData.idx; if it is already there, discard the URL.
• .md5.visitedurl (one MD5 digest per visited URL; see the sketch below)
• E.g. 0007e11f6732fffee6ee156d892dd57e
• .unvisit.tmp (URLs still to be visited)
• E.g.
http://dean.pku.edu.cn/zhaosheng/北京大学2001年各省理科录取分数线.files/filelist.xml
http://mba.pku.edu.cn/Chinese/xinwenzhongxin/xwzx.htm
http://mba.pku.edu.cn/paragraph.css
http://www.pku.org.cn/xyh/oversea.htm
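A hedged sketch of the visited-URL check based on MD5 digests like those kept in .md5.visitedurl; the in-memory set is an illustration, since TSE keeps the digests in a file.

#include <openssl/md5.h>   // link with -lcrypto
#include <cstdio>
#include <set>
#include <string>

std::set<std::string> g_visited;   // hex MD5 digests of URLs already crawled

// Hash a URL to the 32-character hex digest format used in .md5.visitedurl.
std::string Md5Hex(const std::string& url)
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5(reinterpret_cast<const unsigned char*>(url.data()), url.size(), digest);
    char hex[2 * MD5_DIGEST_LENGTH + 1];
    for (int i = 0; i < MD5_DIGEST_LENGTH; ++i)
        std::sprintf(hex + 2 * i, "%02x", digest[i]);
    return std::string(hex, 2 * MD5_DIGEST_LENGTH);
}

// Return true if the URL has not been seen before, recording it as visited.
bool MarkIfUnvisited(const std::string& url)
{
    return g_visited.insert(Md5Hex(url)).second;
}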
Mapping between domain names and IP addresses
• Four cases exist: one-to-one, one-to-many, many-to-one, and many-to-many.
• One-to-one mappings do not cause duplicate crawling;
• the other three cases may all cause pages to be crawled more than once:
• virtual hosting (many names on one machine)
• round-robin DNS
• one site having several domain names pointing to it
ISAM
• Crawled pages are stored in ISAM-style files:
• a data file (WebData.dat)
• and an index file (WebData.idx)
• The index file stores, for each page, the offset of its start position in the data file together with its URL (see the sketch below).
• Function prototype:
• int isamfile(char * buffer, int len);
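A hedged sketch of appending one record to WebData.dat while writing its start offset and URL to WebData.idx; the actual index layout is not given in the slides, so the text format used here is an assumption, and the real interface is the isamfile() prototype above.

#include <cstdio>
#include <string>

// Append one page to the data file and record (offset, URL) in the index file.
bool AppendRecord(const std::string& url, const char* buf, int len)
{
    std::FILE* dat = std::fopen("WebData.dat", "ab");
    std::FILE* idx = std::fopen("WebData.idx", "ab");
    if (!dat || !idx) {
        if (dat) std::fclose(dat);
        if (idx) std::fclose(idx);
        return false;
    }

    std::fseek(dat, 0, SEEK_END);
    long offset = std::ftell(dat);                        // where this record starts
    std::fwrite(buf, 1, len, dat);                        // raw page data

    std::fprintf(idx, "%ld %s\n", offset, url.c_str());   // offset + URL, one line per page

    std::fclose(dat);
    std::fclose(idx);
    return true;
}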
Enumerate a page according to a URL
• Using WebData.dat and WebData.idx, look up the specified URL and display the first part of that page on the screen.
• Function prototype:
• int FindUrl(char * url, char * buffer, int buffersize);
Search a key word in the depot
• Search WebData.dat for pages containing the given keyword and print the context around each match.
• Function prototype:
• void FindKey(char *key);
• The function prints each matching URL together with the text surrounding the key; after each match it prompts the user to continue, to quit, or to display the whole page file.
Tianwang format output
• A raw page depot consists of records; every record holds the raw data of one page. Records are stored sequentially, with no delimiter between records.
• A record consists of a header (HEAD), a data part (DATA) and a line feed ('\n'), i.e. HEAD + blank line + DATA + '\n' (see the sketch below).
• A header consists of several properties. Each property is a non-blank line; blank lines are forbidden inside the header.
• A property consists of a name and a value, separated by ":".
• The first property of the header must be the version property, e.g.: version: 1.0
• The last property of the header must be the length property, e.g.: length: 1800
• For simplicity, all property names are lowercase.
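A short sketch of writing one record in this format, with version as the first property and length as the last; the url property in between is an illustrative extra, not specified by the slides.

#include <cstdio>
#include <string>

// Write one Tianwang-format record: HEAD + blank line + DATA + '\n'.
void WriteTianwangRecord(std::FILE* out, const std::string& url, const std::string& data)
{
    char lenbuf[32];
    std::sprintf(lenbuf, "%lu", (unsigned long)data.size());

    std::string head;
    head += "version: 1.0\n";                 // version must come first
    head += "url: " + url + "\n";             // illustrative extra property
    head += "length: ";                       // length must come last
    head += lenbuf;
    head += "\n";

    std::fwrite(head.data(), 1, head.size(), out);
    std::fputc('\n', out);                    // blank line separates HEAD from DATA
    std::fwrite(data.data(), 1, data.size(), out);
    std::fputc('\n', out);                    // record terminator
}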
Summary
• Supports multi-threaded page crawling
• Supports persistent HTTP connections
• Supports DNS caching
• Supports IP blocking
• Supports filtering of unreachable sites
• Supports link parsing
• Supports recursive crawling
• Supports Tianwang-format output
• Supports ISAM output
• Supports enumerating a page in the depot according to its URL
• Supports searching for a keyword in the depot
TSE package
• http://net.pku.edu.cn/~webg/src/TSE/
• nohup ./Tse -c seed.pku &
• To stop the crawling process:
• ps -ef
• kill ???   (??? = the PID of the Tse process found with ps)