Web Data Management

Web Data Management • Meng Xiaofeng (孟小峰), School of Information, Renmin University of China • xfmeng@public.bta.net.cn

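Major Milestones in Databases (1) • 1961 - North American Rockwell, a contractor for the Apollo program, commissioned IBM to develop a system for handling large volumes of data • 1964 - IBM developed the Generalized Update Access Method (GUAM) • 1965 - General Electric developed the Integrated Data Store, the forerunner of CODASYL systems

Major Milestones in Databases (2) • 1969 - IBM developed IMS (Information Management System), a database management system based on the hierarchical model. • Late 1960s to early 1970s - The Data Base Task Group (DBTG), part of CODASYL (the Conference on Data Systems Languages), issued a series of reports known as the DBTG reports. These reports defined and established many of the concepts, methods, and techniques of database systems. • 1970 - Dr. E. F. Codd published the paper "A Relational Model of Data for Large Shared Data Banks", which proposed the relational model, opened up research on relational methods and relational data theory, and laid the theoretical foundation for relational database technology. • 1980 - IBM developed the RDB prototype System R.

Major Milestones in Databases (3) • 1981 - DB2, the commercial version of System R, was developed successfully • From 1982 on - many commercial RDBs emerged, chiefly Oracle, Ingres, Informix, Sybase, SQL Server, and others • Database technology has produced three Turing Award winners, C. W. Bachman, E. F. Codd, and James Gray, and has driven the rapid, wide-ranging growth of a huge software industry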

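Where the mountains and rivers end, there seems to be no road • Where is the mountain? • Relational • Object-oriented • Date: Is the OO DBMS a DBMS

Past the shady willows and bright flowers, another village appears • Where is the village? • Broadening the Database Field for VLDB2000 • First, fundamental data management assumptions do not apply to the data requirements of the next generation of applications. • all forms of data in all applications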

[Figure: stages of data management. OS FILE; CORE DB with CONVENTIONAL APPLICATIONs; EXTENDED DB; ADVANCED APPLICATIONs served by several DBs and a "?DB"]

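Databases: A Changing World (1) • Changes in data · Data volumes are growing rapidly (databases larger than 10^12 bytes will become common) · Satellites return 10^15 bytes of data per year · High-energy physics experiments produce 10,000 30-GB tapes of data per year · Digital library data grows by 10^12 bytes per year · Data content is changing quickly (multimedia data is already widely familiar) · Complex structured data such as time series and matrices · Multimedia data such as text, graphics, images, video, and audio · Behavioral data such as procedures and programs

Databases: A Changing World (2) • Changes in computer systems · Changes in system architecture: parallel computers, client/server · Main memory capacity and price · Secondary storage capacity, price, and performance (tertiary storage) · Communication speed and price (Internet/Web, mobile)

Databases: A Changing World (3) • Changes in database applications · Data warehousing, OLAP, data mining · Digital libraries, electronic publications · E-commerce, Web hospitals, distance education, virtual reality · Workflow management · Integrating distributed information resources · Mobile databases · ...

ADVANCED APPLICATIONS • Diverse forms of data • data in the Web, e-commerce, digital libraries, and knowledge management/discovery • Diverse forms of computing • Internet, • gizmos, • e-applications, • ubiquitous computing, • P2P computing, • wearable computing, • and other trends revolutionize computing • All of this is accompanied by an explosive growth of data and transaction volumes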

The Development of Database Technology • [Figure: three axes. Data Type: simple, complex structured, complex unstructured. Data Location: single server (known), distributed servers (known), distributed servers (unknown). Applications: query only, query and updates, complex data analysis.]

The Development of Database Technology • [Figure: the same three axes (Data Type, Data Location, Applications), with these systems placed in the space: relational systems, warehousing, spatial DBs, text DBs, distributed DBs, multi-media DBs, WebDBS, emerging efforts.]

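Databases and the WWW: what do they have in common? • The Web was originally an interface for access to distributed documents • Now it is a platform for information systems (IS) of all types: Web information systems (WIS) coupled to DBMSs

DBMS • A burden too heavy for databases to bear?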

The Asilomar Report on Database Research [BBC+98] • Phil Bernstein, Michael Brodie, Stefano Ceri, David DeWitt, Mike Franklin, Hector Garcia-Molina, Jim Gray, Jerry Held, Joe Hellerstein, H. V. Jagadish, Michael Lesk, Dave Maier, Jeff Naughton, Hamid Pirahesh, Mike Stonebraker, and Jeff Ullman, J Widom • 2.1 Web changes everything • This is good news for database systems research: the Web is one huge database. However, the database research community has contributed little to the Web thus far. Rather than being an integral part of the fabric of the Web, database systems appear in peripheral roles. … However, the largest of the web sites, especially those run by portal and search engine companies, have not adopted database technology.

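The WWW provides the following technologies • A global network infrastructure and a set of protocols supporting text exchange • HTML, a formatting language for hyperlinked documents • Document retrieval techniques and their user interfaces • Multi-tier Web application architectures • Web information retrieval: keyword-based search engines • XML, a new format standard for data exchange

Mature technologies provided by the database field: • Data storage, and query languages for accessing large volumes of highly structured data • Data models and methods for structuring data • Mechanisms for maintaining data integrity and consistency • Client/server database applications • New semistructured data models that relax the structural constraints of traditional database systems

Combining the two • View 1: There are no major new problems in DBs due to the introduction of WIS; there are NEW problems in WIS design and management • View 2: The Web is one huge database • View 3: The combination of XML and semistructured data is a brand-new topic in Web data research and offers a new solution for Web data management and applications.

Web Data Management • In the broad sense of the term "database", the Web is a database: a collection of related, useful information. In the narrow sense, the Web is not a database, because it is not a collection of data organized according to a data model. • Definition: Web data management builds on the broad sense of "database". It means effectively organizing and integrating complex information in the Web environment, and querying and publishing that information conveniently and accurately.

Web Data Management • Data organization in Web data management studies the characteristics of Web information and looks for organizational schemes suited to it; current results are mainly in research on semistructured data schemas. • Information integration on the Web is the most practical problem in Web data management. How to combine the information in the many data sources on the Web into a whole that users can work with is a problem many applications urgently need solved.

Web Data Management • Web querying means finding more accurate information by exploiting richer semantic information under an effective data organization scheme. • Web information publishing is a new problem that sets Web data management apart from traditional data management. It concerns how to deliver data on the Web automatically to target users according to their needs. • Technically, Web data management combines WWW technology, database technology, information retrieval, mobile computing, multimedia, and data mining; it is a highly interdisciplinary, emerging research field.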

Semistructured Data Origins: • integration of heterogeneous sources • data sources with non-rigid structure • biological data • Web data

The Semistructured Data Model • Object Exchange Model (OEM) • [Figure: OEM graph. The root Bib (&o1) is a complex object with two paper edges and one book edge (&o12, &o24, &o29); further edges are labeled author, title, year, publisher, page, http, references, firstname, lastname, first, and last; atomic objects at the leaves include “Serge”, “Abiteboul”, “Victor”, “Vianu”, 1997, 122, and 133]

Syntax for Semistructured Data
Bib: &o1 {
  paper: &o12 { … },
  book: &o24 { … },
  paper: &o29 {
    author: &o52 “Abiteboul”,
    author: &o96 { firstname: &o243 “Victor”, lastname: &o206 “Vianu” },
    title: &o93 “Regular path queries with constraints”,
    references: &o12,
    references: &o24,
    pages: &o25 { first: &o64 122, last: &o92 133 }
  }
}

Syntax for Semistructured Data • May omit oid’s:
{ paper: {
    author: “Abiteboul”,
    author: { firstname: “Victor”, lastname: “Vianu” },
    title: “Regular path queries …”,
    page: { first: 122, last: 133 }
  }
}
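As a rough illustration, the oid-free syntax above could be held in memory as nested lists of (label, value) pairs. This is only a sketch: the names `bib` and `children` are made up here and are not part of OEM or of these slides. A list of pairs is used instead of a dictionary because a label such as author may occur more than once.

```python
# Hypothetical in-memory form of the ssd expression above.
# A complex object is a list of (label, value) pairs, not a dict,
# because the same label (e.g. "author") may occur more than once.
bib = [
    ("paper", [
        ("author", "Abiteboul"),
        ("author", [("firstname", "Victor"), ("lastname", "Vianu")]),
        ("title", "Regular path queries ..."),
        ("page", [("first", 122), ("last", 133)]),
    ]),
]


def children(obj, label):
    """Return the values of all edges named `label` leaving a complex object."""
    if not isinstance(obj, list):  # atomic objects have no outgoing edges
        return []
    return [value for (name, value) in obj if name == label]


paper = children(bib, "paper")[0]
authors = children(paper, "author")
print(len(authors))  # 2: one atomic author, one complex author
```

Note that the two author values have different types (a string and a complex object), which a fixed relational schema would not allow.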

Characteristics of Semistructured Data • missing or additional attributes • multiple attributes • different types in different objects • heterogeneous collections • In short: self-describing, irregular data, no a priori structure

Comparison with Relational Data • { row: { name: “John”, phone: 3634 }, row: { name: “Sue”, phone: 6343 }, row: { name: “Dick”, phone: 6363 } } • [Figure: the same data drawn as a tree with three row edges, each leading to an object with a name leaf and a phone leaf: (“John”, 3634), (“Sue”, 6343), (“Dick”, 6363)]
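The encoding on this slide can be sketched as a small function that turns a table into row edges. The function name `table_to_ssd` and the list-of-pairs representation are assumptions made for illustration.

```python
def table_to_ssd(columns, rows):
    """Encode a relational table as a list of ("row", [...]) edges."""
    return [("row", list(zip(columns, row))) for row in rows]


phone_book = table_to_ssd(
    ["name", "phone"],
    [("John", 3634), ("Sue", 6343), ("Dick", 6363)],
)
print(phone_book[0])  # ('row', [('name', 'John'), ('phone', 3634)])
```

Relational data is thus a regular special case: every row object carries exactly the same labels.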

XML • a W3C standard to complement HTML • origins: structured text SGML • motivation: • HTML describes presentation • XML describes content • http://www.w3.org/TR/REC-xml (2/98)

XML Data Model • does not exist • Document Object Model (DOM): • http://www.w3.org/TR/REC-DOM-Level-1 (10/98) • class hierarchy (node, element, attribute,…) • objects have behavior • defines API to inspect/modify the document

XML Parsers • traditional: return data structure (DOM?) • event based: SAX (Simple API for XML) • http://www.megginson.com/SAX • write handler for start tag and for end tag
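The two parser styles can be compared with Python's standard library, which ships both a DOM implementation (`xml.dom.minidom`) and a SAX one (`xml.sax`). The sample document and the handler class `TagLogger` are made up for this sketch.

```python
import xml.dom.minidom
import xml.sax

DOC = "<person><name>Alan</name><age>42</age></person>"

# Traditional parser: build the whole tree in memory, then navigate it.
dom = xml.dom.minidom.parseString(DOC)
name_node = dom.getElementsByTagName("name")[0]
dom_name = name_node.firstChild.data


class TagLogger(xml.sax.ContentHandler):
    """Event-based parsing: record start and end tags as they stream by."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


handler = TagLogger()
xml.sax.parseString(DOC.encode("utf-8"), handler)

print(dom_name)           # Alan
print(handler.events[0])  # ('start', 'person')
```

The DOM route keeps the whole document available for inspection and modification; the SAX route keeps nothing unless the handler stores it, which suits large documents read once.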

XML vs. Semistructured Data • both described best by a graph • both are schema-less, self-describing

Similarities and Differences • <person id="o123"> <name> Alan </name> <age> 42 </age> <email> ab@com </email> </person> corresponds to { person: &o123 { name: “Alan”, age: 42, email: “ab@com” } } • <person father="o123"> … </person> corresponds to { person: { father: &o123 …} } • [Figure: both forms drawn as graphs; in each, a person node has name, age, and email edges to Alan, 42, and ab@com, and a father edge connects a person to another person] • similar on trees, different on graphs
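A minimal sketch of the tree case, using Python's `xml.etree.ElementTree`: each element becomes a (label, value) edge, and attributes are kept as ordinary edges. The function name `to_ssd` is hypothetical. The sketch does not resolve a reference attribute such as father="o123" to the object carrying that id, which is exactly the step that turns the tree into a graph; it also leaves 42 as the string "42" and ignores text mixed in between child elements.

```python
import xml.etree.ElementTree as ET


def to_ssd(elem):
    """Map an element to a (label, value) edge; attributes and children become edges."""
    edges = list(elem.attrib.items())
    edges += [to_ssd(child) for child in elem]
    if not edges:  # leaf element: its text is an atomic value
        return (elem.tag, (elem.text or "").strip())
    return (elem.tag, edges)


person = ET.fromstring(
    '<person id="o123"><name> Alan </name><age> 42 </age>'
    "<email> ab@com </email></person>"
)
print(to_ssd(person))
```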

More Differences • XML is ordered, ssd is not • XML can mix text and elements: <talk> Making Java easier to type and easier to type <speaker> Phil Wadler </speaker> </talk> • XML has lots of other stuff: entities, processing instructions, comments
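The mixed-content example can be inspected with Python's `xml.etree.ElementTree`, which stores the text before the first child in `.text` and the text following a child in that child's `.tail`. This is a sketch of one library's view of the document, not a definition of the XML data model.

```python
import xml.etree.ElementTree as ET

talk = ET.fromstring(
    "<talk> Making Java easier to type and easier to type "
    "<speaker> Phil Wadler </speaker> </talk>"
)
speaker = talk[0]  # children are ordered, so positional access is meaningful

leading_text = talk.text.strip()      # text before the <speaker> element
speaker_name = speaker.text.strip()   # text inside <speaker>
print(leading_text)
print(speaker_name)
```

A label/value model of semistructured data has no natural slot for the leading text, which sits beside the speaker element rather than under a label of its own.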

Schemas for Semistructured & XML Data Motivations for considering schema: • Optimize query evaluation • Improve storage efficiency • Support index construction • Facilitate the description of database content • Facilitate query formulation • Facilitate data integration.

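Schemas for Semistructured Data • Forms of schema description • Logic-based descriptions • Graph-based descriptions • Schema extraction • Given a data instance, automatically compute the corresponding schema without any prior knowledge; if several schemas are possible, choose the one that best describes the given data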

Schemas: An Example • Some database: • [Figure: data graph rooted at &r1, with person objects (&p1, &p2, &p3) and company objects (&c1, &c2); edge labels include person, company, works-for, manages, employee, c.e.o., name, address, position, phone, url, description, eval, task, salesrep, procurement, and contact; atomic values (&s0 to &s10, &a1 to &a7) include “Smith”, “Jones”, “Dupont”, “Widget”, “Gadget”, “Trenton”, “Paris”, “Sales”, “Manager”, “5552121”, “www.gp.fr”, 1997, 1998, “on target”, and “below target”]

Lower-Bound Schemas • [Figure: schema graph with nodes Root, Employee, Company, and string; edge labels person, company, works-for, managed-by, c.e.o., name, and address]

Application 1: Improve Secondary Storage • Lower-bound schema • Store rest in overflow graph • [Figure: the lower-bound schema, with nodes Root, Employee, Company, and string and edge labels person, company, works-for, managed-by, c.e.o., name, and address]

Application 2: Query Optimization • Upper-bound schema [Fernandez, Suciu 1998] • [Figure: schema graph rooted at Bib with paper and book edges; further edge labels are title, year, journal, author, address, last name, first name, zip, street, and city; leaves are typed string or int] • Original query: select X.title from Bib._ X where X.*.zip = “12345” • Rewritten using the upper-bound schema: select X.title from Bib.book X where X.address.zip = “12345”
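The two queries on this slide can be compared on a toy instance. Everything below is an assumption made for illustration: the data, the list-of-pairs representation, and the helper names `children` and `reachable`. The data is tree-shaped, so the traversal for `*` needs no cycle detection; a graph with reference cycles would. It also follows the premise of the slide that zip occurs only under book.address.

```python
def children(obj, label=None):
    """Values of edges leaving obj; label=None plays the role of the wildcard `_`."""
    if not isinstance(obj, list):
        return []
    return [v for (name, v) in obj if label is None or name == label]


def reachable(obj):
    """obj and everything below it: the objects matched by the path `*`."""
    yield obj
    for child in children(obj):
        yield from reachable(child)


# Hypothetical data, made up for illustration.
bib = [
    ("paper", [("title", "P1"), ("year", 1998)]),
    ("book", [("title", "B1"), ("address", [("city", "X"), ("zip", "12345")])]),
    ("book", [("title", "B2"), ("address", [("zip", "99999")])]),
]

# select X.title from Bib._ X where X.*.zip = "12345"
generic = [t for x in children(bib)
           if any("12345" in children(y, "zip") for y in reachable(x))
           for t in children(x, "title")]

# select X.title from Bib.book X where X.address.zip = "12345"
specific = [t for x in children(bib, "book")
            if any("12345" in children(a, "zip") for a in children(x, "address"))
            for t in children(x, "title")]

print(generic == specific)  # True on this data
```

The second form visits only book objects and looks one fixed path deep, while the first explores every subtree of every child of Bib; both return the same titles whenever the data conforms to the upper-bound schema.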

Schema Extraction (From Data) • Problem statement • given data instance D of semistructured data • find the “most specific” schema S for D
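As a very simple stand-in for a schema, one can collect the set of label paths that occur in a tree-shaped instance. This is not the extraction algorithm of these slides, which is not spelled out here; it only illustrates computing a structural summary from data alone. The function name `label_paths` and the sample data are hypothetical.

```python
def label_paths(obj, prefix=()):
    """Collect every label path from the root of a tree-shaped instance."""
    paths = set()
    if isinstance(obj, list):  # complex object: list of (label, value) pairs
        for name, value in obj:
            path = prefix + (name,)
            paths.add(path)
            paths |= label_paths(value, path)
    return paths


# Hypothetical instance, made up for illustration.
data = [
    ("employee", [("name", "A"), ("worksfor", [("name", "C")])]),
    ("employee", [("name", "B"), ("manages", [("name", "A")])]),
]

summary = sorted(".".join(p) for p in label_paths(data))
for path in summary:
    print(path)
```

The summary merges the two employee objects even though one has worksfor and the other manages, which hints at why choosing the "most specific" schema is a real design question.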

Schema Extraction: Sample Data • [Figure: data graph with root &r, eight employee edges to &p1 through &p8, and one company edge to &c; the employee objects are linked to one another by five manages and five managedby edges, and eight worksfor edges lead to &c]
