Advanced Artificial Intelligence, Chapter 16

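Advanced Artificial Intelligence, Chapter 16. Web Intelligence. Zhongzhi Shi (史忠植), Institute of Computing Technology, Chinese Academy of Sciences. Outline. 16.1 Overview 16.2 Semantic Web 16.3 Ontology-based knowledge management 16.4 Web mining 16.5 Search engines 16.6 Evolution of Web technology 16.7 Collective intelligence 16.8 Artificial life. Web language layers. Attribution. Explanation. Rules & Inference. Ontologies. Metadata annotations. Standard Syntax. Identity.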


Presentation Transcript


  1. Advanced Artificial Intelligence, Chapter 16: Web Intelligence. Zhongzhi Shi (史忠植), Institute of Computing Technology, Chinese Academy of Sciences

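  2. Outline • 16.1 Overview • 16.2 Semantic Web • 16.3 Ontology-based knowledge management • 16.4 Web mining • 16.5 Search engines • 16.6 Evolution of Web technology • 16.7 Collective intelligence • 16.8 Artificial life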

  3. Web language layers: Attribution; Explanation; Rules & Inference; Ontologies; Metadata annotations; Standard Syntax; Identity. Zhongzhi Shi: Intelligent Acquisition

  4. KMSphere Layers (diagram labels: Knowledge Application; Knowledge Distribution; Knowledge Organization; Ontology Acquisition; sources: Email, Document, File, Image, Video, Web). Zhongzhi Shi: Semantic Web Services

  5. KMSphere Architecture

  6. KMSphere Workflow

  7. KMSphere Demo: Create ontology by hand

  8. KMSphere Demo: Ontology acquisition from databases

  9. KMSphere Demo: Ontology acquisition from text

  10. KMSphere Demo: Edit ontology

  11. KMSphere Demo: Ontology consistency check

  12. KMSphere Demo: RDQL (RDF Data Query Language)

  13. Classification of Web mining

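  14. Web content mining • The process of extracting knowledge from the content of Web pages or from their descriptions. • Web content mining mainly comprises text mining and multimedia mining; the objects mined include text, images, audio, video, and other types of data.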

  15. Log preprocessing. Sample Web server log with nine requests, listed column by column:
      • IP Address: 202.120.224.4 (all nine entries)
      • Time/Date (all on 2-Jan-01): 15:30:01, 15:30:01, 15:35:11, 15:30:01, 15:35:11, 15:37:09, 15:33:04, 15:33:04, 15:33:04
      • Method/URI: GET A.htm, GET 1.htm, GET C.htm, GET Index.htm, GET E.htm, GET Index.htm, GET A.htm, GET B.htm, GET 1.htm
      • Referrer: http://ex.edu/index.htm, http://ex.edu/index.htm, http://ex.edu/C.htm, http://ok.edu/A.htm, http://ok.edu/link.htm, http://ok.edu/res.php, http://ex.edu/index.htm, http://ex.edu/A.htm, http://ex.edu/index.htm
      • Agent: Mozilla/4.0(IE4.0NT), Mozilla/4.0(IE4.0NT), Mozilla/4.0(IE5.0W98), Mozilla/4.0(IE5.0W98), Mozilla/4.0(IE4.0NT), Mozilla/4.0(IE4.0NT), Mozilla/4.0(IE5.0W98), Mozilla/4.0(IE5.0W98), Mozilla/4.0(IE5.0W98)
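The slide does not say which preprocessing steps are applied to the log. One common step is user identification: requests that share an IP address but come from different browser agents are treated as different users. The sketch below illustrates that idea only; the function name `identify_users` and the three sample records are hypothetical, written in the same field layout as the log above.

```python
from collections import defaultdict
from datetime import datetime

def identify_users(records):
    """Group log records by (IP, agent) and sort each user's requests by time.

    Each record is (ip, "HH:MM:SS/D-Mon-YY", request, referrer, agent).
    Treating one (IP, agent) pair as one user is an assumption of this sketch.
    """
    users = defaultdict(list)
    for ip, stamp, request, referrer, agent in records:
        when = datetime.strptime(stamp, "%H:%M:%S/%d-%b-%y")
        users[(ip, agent)].append((when, request, referrer))
    return {key: sorted(visits) for key, visits in users.items()}

# Hypothetical records in the layout of the sample log.
records = [
    ("202.120.224.4", "15:35:11/2-Jan-01", "GET C.htm",
     "http://ex.edu/A.htm", "Mozilla/4.0(IE4.0NT)"),
    ("202.120.224.4", "15:30:01/2-Jan-01", "GET A.htm",
     "http://ex.edu/index.htm", "Mozilla/4.0(IE4.0NT)"),
    ("202.120.224.4", "15:33:04/2-Jan-01", "GET B.htm",
     "http://ex.edu/index.htm", "Mozilla/4.0(IE5.0W98)"),
]
users = identify_users(records)
for (ip, agent), visits in users.items():
    print(ip, agent, [request for _, request, _ in visits])
```

A fuller pipeline would typically go on to split each user's requests into sessions, for instance with an inactivity timeout; that step is left out here.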

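  16. Web text mining • Web text mining applies data mining methods to Web data of all kinds, including page content, page structure, and user access information, to discover useful knowledge and help people find hidden patterns in large collections of Web documents.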

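  17. Methods of Web text mining • Text summarization: extract the key information from a text (or text collection) and summarize its topic in concise form. For example, a search engine usually presents a text summary with each query result it returns. • Text classification: use a set of labeled texts as the training set, learn a model of the relationship between text attributes and text categories, and then use that model to determine the category of new texts. Quality measures: recall and precision. • Text clustering: partition texts into classes according to their features. • Association analysis, which discovers co-occurrence patterns of word pairs in large document collections, and prediction of how particular data will behave in the future.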
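Recall and precision are named above as measures for text classification. As a minimal sketch of how they are computed for one category, assuming a hypothetical helper `precision_recall` and made-up labels:

```python
def precision_recall(predicted, actual, positive):
    """Precision and recall of one category, given parallel label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0   # correct share of predictions
    recall = tp / (tp + fn) if tp + fn else 0.0      # found share of true members
    return precision, recall

# Hypothetical classifier output versus true labels.
predicted = ["spam", "spam", "ham", "spam", "ham", "ham"]
actual = ["spam", "ham", "ham", "spam", "spam", "spam"]
precision, recall = precision_recall(predicted, actual, "spam")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here two of the three texts labeled "spam" really are spam (precision 2/3), while only two of the four true spam texts were found (recall 1/2).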

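  18. Applications of Web text mining • Search engines: Web text mining allows search results to be organized more sensibly, by grouping pages into clusters according to their similarity. • Natural language understanding: combining natural language processing techniques with Web text mining techniques.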

  19. Application of text mining to spam filtering

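  20. Web structure mining • Useful knowledge resides not only in the link structure between Web pages and in the internal structure of pages, but also in the directory path structure of URLs (the directory relationships between pages). • Web structure mining means mining the patterns of the Web's link structure: analyzing the number and targets of page links in order to build a model of that link structure.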

  21. Main methods of Web structure mining • HITS algorithm • PageRank algorithm (diagram labels: Web, Spider, Index, Log, Browser, SE; 20M queries/day; 800M pages?; Freshness; 24x7; Quality results; Spam)

  22. HITS: Kleinberg's Algorithm • HITS: Hypertext Induced Topic Selection • For each vertex v in a subgraph of interest: • a(v): the authority of v • h(v): the hubness of v • A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites • Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites "Web Search and Mining" Course @ USTC, 2005

  23. Authority and Hubness (authority score and hub score). Diagram: nodes 2, 3, and 4 link to node 1, and node 1 links to nodes 5, 6, and 7, so that a(1) = h(2) + h(3) + h(4) and h(1) = a(5) + a(6) + a(7)

  24. Convergence of Authority and Hubness • Recursive dependency: • $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$ • $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$ • Here pa[v] denotes the parents of v (pages that link to v) and ch[v] its children (pages that v links to) • Using linear algebra, we can prove that a(v) and h(v) converge

  25. HITS Example. Find a base subgraph: • Start with a root set R = {1, 2, 3, 4} • {1, 2, 3, 4}: nodes relevant to the topic • Expand the root set R to include all the children and a fixed number of parents of nodes in R → a new set S (the base subgraph)

  26. HITS Example. Hubs and authorities: two n-dimensional vectors a and h
      HubsAuthorities(G)
        $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
        $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
        $t \leftarrow 1$
        repeat
          for each v in V
            do $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
               $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
          $a_t \leftarrow a_t / \|a_t\|$
          $h_t \leftarrow h_t / \|h_t\|$
          $t \leftarrow t + 1$
        until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \varepsilon$
        return $(a_t, h_t)$
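The iteration above can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the course's implementation: the function name `hits`, the edge-list input, the Euclidean norm, and the three-edge example graph are assumptions. The convergence test compares each new vector with the previous one before moving on.

```python
import math

def hits(edges, eps=1e-10, max_iter=1000):
    """Return (authority, hubness) dicts for directed (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    parents = {v: [] for v in nodes}    # pa[v]: pages that link to v
    children = {v: [] for v in nodes}   # ch[v]: pages that v links to
    for src, dst in edges:
        children[src].append(dst)
        parents[dst].append(src)

    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec.values()))
        return {v: x / norm for v, x in vec.items()} if norm else vec

    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # Both updates read the previous iteration's values, as in the pseudocode.
        new_a = normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        new_h = normalize({v: sum(a[w] for w in children[v]) for v in nodes})
        change = (math.sqrt(sum((new_a[v] - a[v]) ** 2 for v in nodes))
                  + math.sqrt(sum((new_h[v] - h[v]) ** 2 for v in nodes)))
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h

# Hypothetical graph: pages 1 and 2 act as hubs, pages 3 and 4 as authorities.
authority, hubness = hits([(1, 3), (1, 4), (2, 3)])
print(authority, hubness)
```

In this small graph page 3 is cited by both hubs and therefore ends up with the highest authority, while page 1, which links to both authorities, is the stronger hub. The iteration settles only when the dominant singular value of the adjacency matrix is unique; otherwise the loop stops at `max_iter`.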

  27. HITS Example: Results (figure: authority and hubness weights for nodes 1 to 15)

  28. Matrix Notation of HITS • The authority and hubness vectors computed by the algorithm above are the principal singular vectors of the adjacency matrix A of the base subgraph: with $A_{ij} = 1$ when page i links to page j, a is the principal eigenvector of $A^T A$ (a right singular vector of A) and h is the principal eigenvector of $A A^T$ (a left singular vector of A).

  29. PageRank • Introduced by Page et al. (1998) • The PageRank of a page is proportional to its parents' ranks but inversely proportional to its parents' out-degrees • $PR_i$: the PageRank value of page i • $PR_j$: the PageRank value of page j • $k_j$: the number of pages that page j links to • d: a parameter in the range [0, 1]

  30. PageRank: Formula • Given page A, and pages T1 through Tn linking to A, PageRank is defined as: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) • C(P) is the cardinality (out-degree) of page P • d is the damping (“random URL”) factor

  31. PageRank: Intuition • Calculation is iterative: $PR_{i+1}$ is based on $PR_i$ • Each page distributes its $PR_i$ to all pages it links to. Linkees add up their awarded rank fragments to find their $PR_{i+1}$ • d is a tunable parameter (usually 0.85) encapsulating the “random jump factor” • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
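The iterative calculation can be sketched as below, using the formula PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) as stated. The function name `pagerank`, the dictionary input format, the stopping tolerance, and the three-page graph are assumptions made for illustration; pages without outgoing links simply distribute nothing in this sketch.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    `links` maps each page to the list of distinct pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(max_iter):
        new_pr = {p: 1.0 - d for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page splits its current rank evenly among its linkees.
                share = d * pr[page] / len(targets)
                for target in targets:
                    new_pr[target] += share
        delta = sum(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if delta < tol:
            break
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(ranks)
```

With this form of the formula and no dangling pages, the ranks sum to the number of pages rather than to 1. In the example, C receives links from both A and B and ranks highest, while B, which gets only half of A's rank, ranks lowest.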

  32. PageRank vs. HITS: Algorithm

  33. Applications of Web structure mining • Information retrieval • Community identification • Website optimization

  34. Ranking for the Search Results • Today’s search engines may return millions of pages for a certain query • It is definitely not possible for the user to preview all these results • An appropriate ranking will be very helpful. • Ranking on relevance • Ranking on importance

  35. Ranking in traditional information retrieval • A ranking purely on relevance • Term frequency (tf) • Inverse Document Frequency (idf) • Okapi … • Many other aspects that Dr. Shuming Shi will mention in the next course.
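A minimal sketch of relevance-only ranking with tf and idf follows. The scoring function (raw term count times log(N/df)) is one simple variant chosen for illustration, and the function name `tf_idf_rank` and the three documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_rank(docs, query):
    """Rank document indices by the summed tf * idf of the query terms."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter()                      # number of documents containing each term
    for tokens in tokenized:
        df.update(set(tokens))
    scored = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = sum(tf[term] * math.log(n / df[term])
                    for term in query.lower().split() if df[term])
        scored.append((score, idx))
    # Highest score first; ties broken by document order.
    return [idx for score, idx in sorted(scored, key=lambda p: (-p[0], p[1]))]

# Hypothetical documents.
docs = [
    "harvard university official home page",
    "harvard harvard harvard fan page about harvard",
    "search tips",
]
ranking = tf_idf_rank(docs, "harvard")
print(ranking)
```

Because the score depends only on how often the query term occurs, the page that merely repeats "harvard" is ranked above the official home page.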

  36. Limitations of Traditional IR • Text-based ranking function • www.harvard.edu can hardly be recognized as one of the most authoritative pages for the query “harvard”, since many other web pages contain “harvard” more often. • The number of pages with the same relevance is still too large for the users to preview. • Pages are not sufficiently self-descriptive • Usually the term “search engine” doesn't appear on the web pages of search engines.

  37. What’s More for Web Search • In order to solve these problems • We must leverage other information on the Web • We must distinguish those pages with the same amount of relevance • Link Analysis • The web is not just a collection of pure-text documents • The hyperlinks are also very important! • A link from page A to page B may indicate: • A is related to B, or • A is recommending, citing, voting for or endorsing B • Links affect the ranking of web pages and thus have commercial value.

  38. Typical Search Engine Architecture (diagram components: crawler, user query, result display, retriever, index database)

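  39. The intelligent search engine GHunt. GHunt is a system for the intelligent acquisition and processing of network information. It supports distributed, parallel search of network information with content filtering; uses text mining to build a conceptual semantic space and the full course of events automatically; and provides efficient semantics-based text retrieval, content-based image retrieval, and personalized push delivery of topic-specific information. Zhongzhi Shi, Qing He, Ziyan Jia and Jiayou Li. Intelligence Chinese Document Semantic Indexing System. International Journal of Information Technology and Decision Making, Vol. 2, No. 3, 2003: 407-424.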

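  40. Web Spider (diagram labels: Internet, Facilitator, Spider1, Spider2, Spidern, Database) • Proposes a Spider, based on intelligent agents, for large-scale, distributed, parallel intelligent information search. It collects documents of all kinds from the Internet, performs the associated information filtering and URL resolution, and returns the collected information to the server. • DONG Mingkai, SHI Zhongzhi. Distributed Web Spider Based on Intelligent Agent. World Wide Web Technologies in China: Research, Development, and Applications, 2002, pp. 148-162 • SHI Zhongzhi, DONG Mingkai, JIANG Yuncheng, ZHANG Haijun. Logical Foundations of the Semantic Web. Science in China, Series E: Information Sciences, 2004, 34(10): 1123-1138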

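  41. Web Spider. Users can customize information gathering in many ways, for example: • IP range: 202.96.*.*, or 202.96.100.0-202.98.255.255 • Site type: for example .com or .edu, or a requirement that the URL contain a feature string such as sports or peopledaily • Keywords: two filtering modes, "contains keyword" and "does not contain keyword" • Mode: simple mode, standard mode, and website mode • Count: the number of Spiders running in parallel • Time: the start and end times of a Spider run • Period: the interval at which the Spider reruns to update its results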
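The slide does not show how these settings are represented or applied. As a rough sketch of how the IP range, site type, and keyword options could act as a page filter, assuming a hypothetical configuration dictionary `cfg` and hypothetical function names (none of which come from GHunt or its Spider):

```python
import fnmatch
import ipaddress
from urllib.parse import urlparse

def ip_allowed(ip, rule):
    """Match an IP against a wildcard rule ('202.96.*.*') or a 'start-end' range."""
    if "-" in rule:
        start, end = (ipaddress.ip_address(part.strip()) for part in rule.split("-"))
        return start <= ipaddress.ip_address(ip) <= end
    return fnmatch.fnmatch(ip, rule)

def page_allowed(url, ip, text, cfg):
    """Apply the IP, site type, and keyword settings to one fetched page."""
    if not any(ip_allowed(ip, rule) for rule in cfg["ip_rules"]):
        return False
    host = urlparse(url).hostname or ""
    site_ok = (host.endswith(tuple(cfg["domain_suffixes"]))
               or any(word in url for word in cfg["url_words"]))
    if not site_ok:
        return False
    if cfg["include"] and not any(word in text for word in cfg["include"]):
        return False                      # "contains keyword" mode
    return not any(word in text for word in cfg["exclude"])  # "does not contain"

# Hypothetical settings using the example values from the slide.
cfg = {
    "ip_rules": ["202.96.*.*", "202.96.100.0-202.98.255.255"],
    "domain_suffixes": [".edu"],
    "url_words": ["sports", "peopledaily"],
    "include": ["earthquake"],
    "exclude": ["advert"],
}
print(page_allowed("http://news.example.com/sports/1.htm",
                   "202.97.1.1", "earthquake report", cfg))
```

The mode, count, time, and period options concern scheduling rather than filtering and are not modeled here.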

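  42. Conceptual semantic space • By building a concept-semantic index, text and image information from the network is self-organized, which enables retrieval of text and images by conceptual semantics. Results include not only the documents directly related to the query concept but also the categories those documents belong to, so the user can pick a category and search within it, which improves precision. The system also offers concepts semantically related to the query concept, overcoming mismatches between the user's query terms and the wording used in relevant texts; this supports interactive retrieval based on conceptual semantics and improves recall. • Zhongzhi Shi, Qing He, Ziyan Jia and Jiayou Li. Intelligence Chinese Document Semantic Indexing System. International Journal of Information Technology and Decision Making, Vol. 2, No. 3, 2003: 407-424.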

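  43. Generating the conceptual semantic space

  44. Concept-semantic retrieval results

  45. Selecting event query parameters

  46. Retrieving the full course of the Iran earthquake event

  47. Iran earthquake topic: related picture reports

  48. Shenzhou 5 topic: related picture reports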

  49. Search based on complex networks
