Basi di Dati e Sistemi Informativi su Web

UNICAL - A.A. 2008-2009. Basi di Dati e Sistemi Informativi su Web. Prof. Massimo Ruffolo Ing. Ermelinda Oro. DataBase and Information System … on Web.

Presentation Transcript


  1. UNICAL, academic year 2008-2009. Basi di Dati e Sistemi Informativi su Web (Databases and Information Systems on the Web). Prof. Massimo Ruffolo, Ing. Ermelinda Oro

  2. Database and Information System … on Web. The term information system refers to a system of people, data records, and activities that process the data and information in an organization; it includes the organization's manual and automated processes. A database is a structured collection of records or data stored in a computer system. The structure is achieved by organizing the data according to a database model. The model most commonly used today is the relational model.

  3. Querying unstructured sources

  4. Querying unstructured sources
     • Structured query over an unstructured document:
       Extract/Select/Annotate politicianNews
       From http://...
       Where politicianNews(X,Y,Z), Z:politician(name:N), N=hillaryClinton
       [Fill database uri]
     • This kind of query can be executed over a database or over an unstructured document. Only the rewriting strategy changes.

  5. Information extraction and Annotation. An ontology-based system for information extraction from semi-structured and unstructured Web documents
     • Information extraction (IE) acquires information contained in unstructured documents and stores it in structured form
     • Turning the current Web into a Semantic Web requires automatic approaches for annotating existing data, since manual annotation approaches will not scale in general. More scalable semi-automatic approaches known from ontology learning deal with the extraction of ontologies from texts (also in tabular form).

  6. Motivations
     • Existing IE approaches mainly exploit the syntactic structure of information and not its actual semantics
     • Much work on IE from HTML documents:
       • There is no single winning approach
       • Extraction rules can identify tabular information only when the table structure is explicitly declared
       • The variability of HTML and the use of Cascading Style Sheets make classic HTML-based approaches not robust
     • Too little work on IE from PDF documents:
       • No ontology-based approaches
       • Existing table recognition and information extraction approaches pursue separate goals

  7. State of the Art [Diagram: Existing Approaches and Systems]
     • Category labels in the diagram: Manual approaches, NLP-oriented systems, PDF-oriented approaches, HTML, Supervised approaches, Unsupervised approaches
     • Systems and works named in the diagram: TSIMMIS, Gottlob et al. 06, GATE, Flesca et al. (Fuzzy System) 06, Minerva, Document Understanding techniques, RAPIER, W4F, SRV, XWRAP, WHISK, JEDI, TextRunner, FLORID, SnowBall, STAVIES, DeLa, NoDoSe, RoadRunner, WIEN, DEByE, EXALG, STALKER, LixTo, DEPTA, SoftMealy

  8. PDF Document: the standard format for document publication, sharing and exchange
     • IE from Adobe Portable Document Format (PDF)
     • One of the most widespread unstructured document formats
     • PDF documents are completely unstructured and their internal encoding is visualization-oriented
     • The PDF document description language represents a PDF document as a collection of 2-dimensional typographic elements contained in content streams
     • Traditional wrapping/IE systems cannot be applied

  9. Goals
     • Information extraction from documents by means of extraction rules that:
       • Exploit a human-oriented document representation: the 2-dimensional representation
       • Exploit the semantics of the information represented in a Knowledge Base
       • Directly populate (enrich) the Knowledge Base with the extracted information
       • Handle both natural language and document structures (by exploiting an embedded table recognition approach)
       • Allow (semantic) annotation of unstructured sources to enable semantic classification and search

  10. Proposed Approach
     • Exploit the semantics represented in a Knowledge Base
     • Recognize information organized in both textual and tabular form
     • Store the extracted information directly in the Knowledge Base

  11. 2-Dimensional Document Representation [Figure annotations: a value for "Operating revenues", obtained in the year 2007; the semantics is given by the position]

  12. Internal Document Representation: Input Document

  13. 2-Dimensional Document Representation: Document Portion [Figure: X and Y axes with origin (0,0)]

  14. 2-Dimensional Document Representation: Document Portion [Figure: X axis marked at 1 and 4, Y axis marked at 32 and 33, with the points (1,32) and (4,33)]

  15. 2-Dimensional Document Representation: Document Portion [Figure: portioning process, with the points (1,32) and (4,33)]
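The slides only show a coordinate system and two points, (1,32) and (4,33), so the following is a minimal sketch of how a document portion could be modeled, not the authors' data structure. It assumes, hypothetically, that a portion is the rectangle spanned by two corner points with inclusive bounds and that typographic elements are keyed by integer (x, y) positions; the names `Portion` and `elements_in` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Portion:
    """Rectangle on the 2-D document grid (assumed: inclusive bounds)."""
    x1: int
    y1: int
    x2: int
    y2: int

    def contains(self, x, y):
        # A point belongs to the portion if it lies inside the rectangle.
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2


def elements_in(portion, elements):
    """Select the elements whose (x, y) position falls inside the portion.

    `elements` is a hypothetical dict mapping (x, y) -> text.
    """
    return {pos: text for pos, text in elements.items()
            if portion.contains(*pos)}


# The two points shown on the slides, used as opposite corners.
portion = Portion(1, 32, 4, 33)
```

With this representation, selecting the content of a portion is a filter over element positions, which is the kind of operation a 2-dimensional matcher would need.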

  16. Attribute Grammars. Example: math expression
       E → [+ | −] T [ (+ | −) T ]*
       T → F [ (* | /) F ]*
       F → NUM | (E)
     Each symbol of the grammar gets an attribute, and local attributes are used as an aid (ris = result, segno = sign). The semantic actions then allow the value of the expression to be computed:
       E → {double E.ris; int segno = 1;} [+ | − {segno = −1;}] T1 {E.ris = segno*T1.ris;} [ (+ {segno = 1;} | − {segno = −1;}) T2 {E.ris = E.ris + segno*T2.ris;} ]*
       T → {double T.ris; int oper;} F1 {T.ris = F1.ris;} [ (* {oper = 1;} | / {oper = 2;}) F2 {T.ris = (oper == 1) ? T.ris*F2.ris : T.ris/F2.ris;} ]*
       F → {double F.ris;} NUM {F.ris = NUM.val;} | (E) {F.ris = E.ris;}
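The semantic actions of this grammar map naturally onto a recursive-descent parser in which each nonterminal becomes a function returning its `ris` attribute. The sketch below is one possible rendering in Python, not code from the course: the tokenizer, the `Parser` class, and `evaluate` are assumptions added for illustration, and the ASCII `-` stands in for the `−` used on the slide.

```python
import re

# Numbers, operators, and parentheses; leading whitespace is skipped.
TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|[-+*/()])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse_E(self):
        # E -> [+ | -] T [ (+ | -) T ]*
        segno = 1
        if self.peek() in ("+", "-"):
            if self.take() == "-":
                segno = -1
        ris = segno * self.parse_T()
        while self.peek() in ("+", "-"):
            segno = 1 if self.take() == "+" else -1
            ris = ris + segno * self.parse_T()
        return ris

    def parse_T(self):
        # T -> F [ (* | /) F ]*
        ris = self.parse_F()
        while self.peek() in ("*", "/"):
            oper = self.take()
            f2 = self.parse_F()
            ris = ris * f2 if oper == "*" else ris / f2
        return ris

    def parse_F(self):
        # F -> NUM | (E)
        tok = self.take()
        if tok == "(":
            ris = self.parse_E()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return ris
        if tok is None or not tok[0].isdigit():
            raise SyntaxError(f"expected a number, got {tok!r}")
        return float(tok)


def evaluate(text):
    parser = Parser(tokenize(text))
    ris = parser.parse_E()
    if parser.peek() is not None:
        raise SyntaxError("unexpected trailing input")
    return ris
```

For example, `evaluate("-2 + 3 * (4 - 1)")` applies the leading sign to the first term only, as the action `E.ris = segno*T1.ris` prescribes, and then adds the product.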

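  17. Simple Extraction Patterns: regex
     • Recognize a number with an optional two-digit decimal part:
       \d+(\.\d{2})?
     • Mail address (upper-case character classes, so it needs case-insensitive matching for typical addresses):
       ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
     • The word "città" (Italian for "city"), capitalized or not:
       (C|c)ittà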
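As a quick check of what these three patterns accept, here is a small sketch using Python's standard `re` module. The helper names are hypothetical, and two choices are assumptions not stated on the slide: the number pattern is applied with `fullmatch` so that the whole string must match, and the mail pattern is compiled with `re.IGNORECASE`.

```python
import re

NUMBER = re.compile(r"\d+(\.\d{2})?")
EMAIL = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)
CITTA = re.compile(r"(C|c)ittà")


def is_number(text):
    # fullmatch: the entire string must be digits plus an optional ".dd".
    return NUMBER.fullmatch(text) is not None


def is_email(text):
    # The pattern is already anchored with ^ and $.
    return EMAIL.match(text) is not None


def find_citta(text):
    # finditer + group(0) returns the whole match, not just the (C|c) group.
    return [m.group(0) for m in CITTA.finditer(text)]
```

Note that the number pattern rejects a value such as `12.5`, because the decimal part, when present, must have exactly two digits.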

  18. Knowledge Representation Formalism

  19. Self-Describing/Populating Ontology (SDO)
     • An SDO is an ontology in which objects and classes can be equipped with a set of rules named descriptors.
     • Descriptors are object-oriented grammatical rules that:
       • Allow objects to be recognized in and extracted from documents, and populate classes with the newly extracted objects
       • Exploit the knowledge contained in the OOKB for the extraction
       • Can exploit each other to describe more complex objects

  20. Descriptors
     General or domain-specific knowledge:
       class weatherRecord( wCity:city, wWarns:warnings, Temp:temperature, wHumid:percentage, wPress:pressure, wDescr:weatherDescription, wWind:wind).
     Class descriptors that handle 2-D capabilities:
       <weatherRecord(C,Wa,T,H,P,D,Wi)> ->
         <X:city()>{C:=X;}
         (<X:warnings()>{Wa:=X;})?
         <X:temperature()>{T:=X;}
         <X:percentage()>{H:=X;}
         <X:pressure()>{P:=X;}
         <X:wind()>{Wi:=X;}
         2D-BOTH.
         <X:weatherDescription()>{D:=X;}
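To make the reading of this descriptor concrete, the sketch below matches it against a flat sequence of already-recognized objects. It is a deliberate simplification and not the system's matcher: it assumes, hypothetically, that input arrives as `(class_name, value)` pairs in reading order, it treats the `2D-BOTH` direction marker as plain sequencing, and the function name `match_weather_record` is invented. Only the attribute and class names come from the slide.

```python
def match_weather_record(tokens):
    """Match the weatherRecord descriptor against (class_name, value) pairs.

    Returns a dict of attribute values, or None if a mandatory part is missing.
    """
    # (attribute, class, optional) in the order given by the descriptor.
    pattern = [
        ("wCity", "city", False),
        ("wWarns", "warnings", True),   # corresponds to the ( ... )? part
        ("Temp", "temperature", False),
        ("wHumid", "percentage", False),
        ("wPress", "pressure", False),
        ("wWind", "wind", False),
        ("wDescr", "weatherDescription", False),
    ]
    record, i = {}, 0
    for attr, cls, optional in pattern:
        if i < len(tokens) and tokens[i][0] == cls:
            record[attr] = tokens[i][1]   # semantic action, e.g. {C:=X;}
            i += 1
        elif optional:
            record[attr] = None
        else:
            return None
    return record
```

The optional `warnings` component is skipped when the next recognized object is not of that class, mirroring the `?` in the descriptor.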

  21. The system architecture: Attribute Transition Network (ATN) implemented as logic programs in the OntoDLP language

  22. The system architecture: direct use of chart parsing algorithms for attribute grammar (AG) parsing

  23. The system architecture, 2-D matcher: direct use of chart parsing algorithms for attribute grammar (AG) parsing
