北京大学计算机科学技术研究所 (Institute of Computer Science and Technology of Peking University)
Graph Data Management
Instructor: ZOU Lei, zoulei@pku.edu.cn
Outline
• Applications and Challenges of Graph Data
• No-SQL Systems
• Existing Graph Database Systems
• About the Course
Graph Data
[Figure: (a) a protein network; (b) a social network]
Some Challenges in Large Graph Data Management
• An example: consider a social networking site (SNS) with more than 1 billion active users.
• Query: is Tom a friend of Jack, or a friend of one of Jack's friends, and so on?
• Possible solution:
  - (Storage) Store the connections between individuals in a relational table.
  - (Query) Perform self-joins recursively. Each additional hop requires another join over the whole table, which is costly at this scale.
Some Challenges in Large Graph Data Management
[Figure: recursive queries]
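To make the cost of the relational approach concrete, here is a minimal sketch of the recursive self-join, emulated in memory. It assumes a hypothetical `Friend(src, dst)` table held as an array of rows, with one row per direction of a friendship. The class and method names are illustrative only and do not come from the course material.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch: emulating the recursive self-join over a Friend(src, dst) table. */
final class FriendJoinSketch {

    /**
     * Returns how many join rounds are needed before target shows up,
     * or -1 if target is never reached. Every round scans the whole
     * table once, just as one more self-join would.
     */
    static int joinRoundsToReach(String[][] friend, String start, String target) {
        Set<String> reached = new HashSet<>();
        reached.add(start);
        Set<String> frontier = new HashSet<>(reached);
        int rounds = 0;
        while (!frontier.isEmpty()) {
            if (frontier.contains(target)) {
                return rounds;
            }
            Set<String> next = new HashSet<>();
            // Emulates: SELECT f.dst FROM frontier JOIN Friend f ON frontier.id = f.src
            for (String[] row : friend) {
                if (frontier.contains(row[0]) && !reached.contains(row[1])) {
                    next.add(row[1]);
                }
            }
            reached.addAll(next);
            frontier = next;
            rounds++;
        }
        return -1;
    }
}
```

With the rows `Tom→Ann`, `Ann→Bob`, `Bob→Jack`, the call `joinRoundsToReach(rows, "Tom", "Jack")` needs three rounds, that is, three full passes over the table. On a table with billions of rows, the number of passes grows with the distance between the two users, and that is the difficulty the slide points at.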
Network Motifs: Simple Building Blocks of Complex Networks (R. Milo et al., Science 2002)
• Network motifs are patterns (subgraphs) that recur within a network much more often than expected at random.
• Network motifs often correspond to functional patterns in different networks.
Questions:
• How can such motifs be found efficiently?
• Given a motif, how can all embeddings of this motif be found efficiently?
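As a small illustration of the second question, the sketch below enumerates all embeddings of one specific three-vertex motif, the feed-forward loop (edges a→b, b→c, a→c), in a directed graph. It is a brute-force sketch, not an efficient algorithm. The class name and the edge-array input format are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: brute-force enumeration of feed-forward loop embeddings. */
final class MotifSketch {

    /** Builds out-neighbour sets from rows of the form {source, target}. */
    static Map<Integer, Set<Integer>> adjacency(int[][] edges) {
        Map<Integer, Set<Integer>> out = new HashMap<>();
        for (int[] e : edges) {
            out.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            out.computeIfAbsent(e[1], k -> new HashSet<>());
        }
        return out;
    }

    /** Returns every triple {a, b, c} with edges a->b, b->c and a->c. */
    static List<int[]> feedForwardLoops(int[][] edges) {
        Map<Integer, Set<Integer>> out = adjacency(edges);
        List<int[]> found = new ArrayList<>();
        for (int a : out.keySet()) {
            for (int b : out.get(a)) {
                if (b == a) {
                    continue; // ignore self-loops
                }
                for (int c : out.get(b)) {
                    // The three vertices must be distinct and a->c must close the loop.
                    if (c != a && c != b && out.get(a).contains(c)) {
                        found.add(new int[] {a, b, c});
                    }
                }
            }
        }
        return found;
    }
}
```

This counts non-induced embeddings, so extra edges among the three vertices are tolerated. Deciding whether the pattern is really a motif would also require comparing its count with the counts in randomized networks, which the sketch leaves out.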
Frequent Subgraph Pattern Mining
[Figure: a graph dataset containing graphs (A), (B) and (C), and the frequent patterns (1) and (2) found with a minimum support of 2]
Subgraph Search
Query: Which compounds contain a “benzene ring”?
[Figure: a query graph matched against a graph database]
Reachability Query
[Figure: a directed graph with vertices numbered 1 to 15]
• Query(1, 11)? Yes
• Query(3, 9)? No
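A reachability query asks whether a directed path exists from one vertex to another. The sketch below answers it with a plain depth-first search and no index. The graph in the usage note is hypothetical, because the edges of the figure are not available in the text. The class and method names are likewise illustrative.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Sketch: index-free reachability test by depth-first search. */
final class ReachabilitySketch {

    /** Builds adjacency lists for vertices 0..n-1 from rows {source, target}. */
    static List<List<Integer>> directed(int n, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            adj.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]);
        }
        return adj;
    }

    /** Returns true if target can be reached from source along directed edges. */
    static boolean reachable(List<List<Integer>> adj, int source, int target) {
        boolean[] seen = new boolean[adj.size()];
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(source);
        seen[source] = true;
        while (!stack.isEmpty()) {
            int u = stack.pop();
            if (u == target) {
                return true;
            }
            for (int v : adj.get(u)) {
                if (!seen[v]) {
                    seen[v] = true;
                    stack.push(v);
                }
            }
        }
        return false;
    }
}
```

For the hypothetical edges `0→1`, `1→2`, `2→3`, `4→2`, vertex 3 is reachable from 0 but 0 is not reachable from 3. In the worst case each query touches the whole graph. Reachability indexes avoid that cost by doing work in advance.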
Shortest Path Distance Query
What is the distance between two specified individuals?
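If every edge has the same weight, the distance between two individuals is the number of hops on a shortest path, and a breadth-first search can compute it. The sketch below assumes an undirected, unweighted graph with vertices numbered from 0. Those assumptions and all names are illustrative rather than taken from the slides.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Sketch: hop distance in an unweighted graph by breadth-first search. */
final class DistanceSketch {

    /** Builds adjacency lists for vertices 0..n-1, adding each edge in both directions. */
    static List<List<Integer>> undirected(int n, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            adj.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]);
            adj.get(e[1]).add(e[0]);
        }
        return adj;
    }

    /** Returns the number of edges on a shortest path, or -1 if none exists. */
    static int distance(List<List<Integer>> adj, int source, int target) {
        int[] dist = new int[adj.size()];
        Arrays.fill(dist, -1); // -1 marks "not visited yet"
        Deque<Integer> queue = new ArrayDeque<>();
        dist[source] = 0;
        queue.add(source);
        while (!queue.isEmpty()) {
            int u = queue.remove();
            if (u == target) {
                return dist[u];
            }
            for (int v : adj.get(u)) {
                if (dist[v] == -1) {
                    dist[v] = dist[u] + 1;
                    queue.add(v);
                }
            }
        }
        return -1;
    }
}
```

Weighted graphs would need a different algorithm, such as Dijkstra's. On very large graphs, a search per query is typically too slow, which is why distance indexing is treated as a topic of its own.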
RDF Data Management
The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model.
[Figure: the WWW as a Web of Pages versus the Semantic Web as a Web of Data]
An RDF Data Example: the YAGO Project
[Figure: structured data]
SPARQL Query
Query: Find all individuals who were born on February 12, 1809 and died on April 15, 1865.
SPARQL syntax:
    SELECT ?name
    WHERE {
      ?m <hasName> ?name .
      ?m <BornOnDate> "1809-02-12" .
      ?m <DiedOnDate> "1865-04-15" .
    }
[Figure: the corresponding query graph]
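The three triple patterns share the variable `?m`, so an answer is a name whose subject also satisfies both date patterns. The sketch below evaluates exactly this query by scanning an in-memory array of (subject, predicate, object) triples. It is a naive nested-loop evaluation, not how an RDF engine works. The predicate names come from the slide, while the class, the method names and the subject identifiers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: naive evaluation of the three-pattern query over a triple array. */
final class TripleMatchSketch {

    /** Returns every ?name whose subject ?m has the given birth and death dates. */
    static List<String> namesBornAndDied(String[][] triples, String born, String died) {
        List<String> names = new ArrayList<>();
        for (String[] t : triples) {
            // Pattern 1: ?m <hasName> ?name binds ?m = t[0] and ?name = t[2].
            if (!t[1].equals("hasName")) {
                continue;
            }
            String m = t[0];
            // Patterns 2 and 3 must hold for the same binding of ?m.
            if (contains(triples, m, "BornOnDate", born)
                    && contains(triples, m, "DiedOnDate", died)) {
                names.add(t[2]);
            }
        }
        return names;
    }

    /** Returns true if the exact triple (s, p, o) is present. */
    static boolean contains(String[][] triples, String s, String p, String o) {
        for (String[] t : triples) {
            if (t[0].equals(s) && t[1].equals(p) && t[2].equals(o)) {
                return true;
            }
        }
        return false;
    }
}
```

If the data holds two people born on 1809-02-12 and only one of them died on 1865-04-15, only that person's name is returned. Every pattern check rescans the whole array here, and avoiding that is one reason RDF stores build indexes over triples.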
NO-SQL Databases
• Key-value Store -- e.g., Berkeley DB
• Column Family Store -- e.g., Hadoop/HBase, Cassandra, Hypertable. This is an evolution of the key-value model [1].
• Document Store -- e.g., MongoDB
[1] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, Robert Gruber: Bigtable: A Distributed Storage System for Structured Data. OSDI 2006: 205-218
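To show how a column family store extends the key-value idea, the sketch below models the map described in the Bigtable paper, in which a row key, a column name and a timestamp together identify one value. It is a single-machine, in-memory sketch built from nested sorted maps. It makes no claim about how HBase, Cassandra or Hypertable are implemented, and the class and method names are assumptions.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: (row, column, timestamp) -> value as nested sorted maps. */
final class ColumnFamilySketch {

    // row key -> "family:qualifier" -> timestamp (newest first) -> value
    private final Map<String, Map<String, TreeMap<Long, String>>> rows = new TreeMap<>();

    /** Stores one version of a cell. */
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the most recent version of a cell, or null if the cell is absent. */
    String getLatest(String row, String column) {
        Map<String, TreeMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        TreeMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }
}
```

A plain key-value store corresponds to the outermost map alone. The column and timestamp levels are what let a single row hold many named, versioned cells.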
Existing Graph Database Systems
The following is a list of several well-known graph database projects:
• HyperGraphDB - an open-source (LGPL) graph database supporting generalized hypergraphs, where edges can point to other edges
• InfoGrid - an open-source / commercial (AGPLv3, free for small entities) graph database with a web front end and configurable storage engines (MySQL, PostgreSQL, files, Hadoop)
• Neo4j - an open-source / commercial (AGPLv3) graph database
• DEX - a high-performance graph database
and so on.
International graph database workshops:
http://www.icst.pku.edu.cn/IWGD2010/index.html
http://www.cse.unsw.edu.au/~gdm2011/
An Example of Neo4j
Finding the friends of “Thomas Anderson”, and the friends of those friends
• Neo4j: http://wiki.neo4j.org/content/The_Matrix
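Neo4j API: An Example
    // Node, Traverser, StopEvaluator, ReturnableEvaluator and Direction come from
    // org.neo4j.graphdb; MyRelationshipTypes is the application's own enum.
    private void printFriends( Node person )
    {
        Traverser traverser = person.traverse(
            Order.BREADTH_FIRST,                     // traversal order
            StopEvaluator.END_OF_GRAPH,              // when the traversal stops
            ReturnableEvaluator.ALL_BUT_START_NODE,  // which nodes are returned
            MyRelationshipTypes.KNOWS,               // which relationship type to follow
            Direction.OUTGOING );                    // direction of the traversal
        for ( Node friend : traverser )
        {
            System.out.println( friend.getProperty( "name" ) );
        }
    }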
Course Content
• Graph Mining
  - frequent subgraph mining
• Indexing & Query Processing
  - reachability query
  - shortest path query
  - subgraph query
  - keyword search
• RDF Data Management
  - indexing & SPARQL query processing
  - RDF dataset construction
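Course Website
• URL: http://www.icst.pku.edu.cn/course/Graphdb/index.html
• Textbooks (author, title, publisher, and year):
  1. Data Mining: Concepts and Techniques (《数据挖掘概念与技术》), by Jiawei Han & Micheline Kamber, translated by Fan Ming & Meng Xiaofeng, China Machine Press (2nd edition)
  2. Managing and Mining Graph Data, edited by Charu C. Aggarwal and Haixun Wang, Kluwer Academic Publishers, 2009
  3. A Semantic Web Primer (《语义网基础》), by Grigoris Antoniou and Frank van Harmelen, China Machine Press, 2008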
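Course Assessment
• In-class presentation (30%): each student presents one top-tier paper from the database field (including related areas such as data mining and information retrieval), with 20 minutes for the talk and 5 minutes for questions.
• Assignments (30%): 3 assignments, each completing one project.
• Class participation (10%)
• Course research report (30%): the report takes one of two forms, chosen by the student:
  1) Literature survey: introduce the research background and related prior work on the topic, and comment on the different existing results.
  2) Research paper: students are encouraged to carry out original research on a specific topic and write it up as a paper.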
Self-Study Material
• Neo4j, http://neo4j.org/
• Freebase, http://www.freebase.com/
  Freebase is a large collaborative knowledge base consisting of metadata composed mainly by its community members. It is an online collection of structured data harvested from many sources, including individual 'wiki' contributions.
  http://wiki.freebase.com/wiki/Freebase_API
  http://wiki.freebase.com/wiki/Libraries
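Course Objectives
• Master several basic query and mining algorithms for graph databases.
• Understand how graph database technology is applied in different fields.
• Develop students' ability to think independently and carry out research.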
Contact: zoulei@pku.edu.cn
Teaching assistants: Qu Cheng (曲丞), qucheng@pku.edu.cn; He Binbin (贺斌斌), hebinbin@pku.edu.cn
Let’s begin!
References
• R. Milo, et al.: Network Motifs: Simple Building Blocks of Complex Networks. Science 298, 824 (2002)
• Tim Berners-Lee, Lalana Kagal: The Fractal Nature of the Semantic Web. AI Magazine 29(3): 29-34 (2008)
• Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, Robert Gruber: Bigtable: A Distributed Storage System for Structured Data (awarded Best Paper). OSDI 2006: 205-218
• Neo4j, http://neo4j.org/
• Kurt D. Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, Jamie Taylor: Freebase: a collaboratively created graph database for structuring human knowledge. SIGMOD Conference 2008: 1247-1250
• Renzo Angles, Claudio Gutiérrez: Survey of graph database models. ACM Comput. Surv. 40(1) (2008)