edot expertise and interests of IASI
General goal of edot
• Integration of data from the web with existing data (about a given topic)
• Target topic: risk assessment in food
  • existing databases managed by BIA/INRA
  • a hot topic
  • a lot of useful public data distributed over the web
Overview of the programme
• Specification of a data warehouse on risk assessment in food
• Data acquisition from the Web
• Structuring the warehouse
• Validation
Data acquisition from the Web
• Related work (Djalil Mezaour’s PhD):
  • Focused crawling guided by a declarative specification of the needs
  • Design and evaluation of a form-based query language
    • Each item of the form is a keyword query specifying an aspect of the searched documents (title, anchor, text neighborhood of incoming or outgoing links)
  • Experimented on three domains
    • good precision can be reached, but not enough answers to populate the warehouse
• Use of machine learning techniques to learn a crawling strategy to find more data
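The form-based query idea on this slide can be illustrated with a minimal sketch. It is not the query language from the cited PhD work: the `Page` fields, the `matches` function, and the matching rule (every filled-in item must match, any one keyword of an item is enough) are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Aspect names follow the slide's list; the identifiers themselves are hypothetical.
ASPECTS = ("title", "anchor", "link_neighborhood")


@dataclass
class Page:
    title: str = ""
    anchor: str = ""             # anchor text of a link pointing to the page
    link_neighborhood: str = ""  # text around incoming or outgoing links


def matches(form: Dict[str, List[str]], page: Page) -> bool:
    """Return True if every filled-in form item matches its aspect of the page.

    Assumed semantics: items are combined with AND, and the keywords
    inside one item are combined with OR (case-insensitive substring test).
    """
    for aspect, keywords in form.items():
        if aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {aspect}")
        text = getattr(page, aspect).lower()
        if not any(keyword.lower() in text for keyword in keywords):
            return False
    return True


# Example form: one keyword query per aspect of the searched documents.
form = {"title": ["risk"], "anchor": ["food", "contamination"]}
relevant = Page(title="Risk assessment report", anchor="food safety")
irrelevant = Page(title="Cooking recipes", anchor="food")
```

A crawler built on such a test would only follow or keep pages for which `matches` returns `True`, which is consistent with the slide's remark that precision is good but the number of answers is low.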
Structuring the warehouse
• Design of a global schema
  • pivot between existing databases and the specification of the searched web data
  • existing databases: fixed schemas
  • specification of the searched data: ??
    • Keywords
    • Hierarchies of keywords (ontology)
    • URLs
• Classification and integration of new web data
Problem: bridging the structure gap
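One of the options listed on this slide, a hierarchy of keywords, could drive the classification of new web data roughly as follows. This is only a sketch under stated assumptions: the concept names, the keywords, and the rule of picking the deepest matching concept are all hypothetical placeholders, not the project's actual schema or method.

```python
from typing import Dict, List, Optional

# Hypothetical keyword hierarchy: concept -> parent concept (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "food_risk": None,
    "contaminant": "food_risk",
    "mycotoxin": "contaminant",
}

# Hypothetical keywords attached to each concept.
KEYWORDS: Dict[str, List[str]] = {
    "food_risk": ["risk", "food"],
    "contaminant": ["contaminant", "contamination"],
    "mycotoxin": ["mycotoxin", "aflatoxin"],
}


def depth(concept: str) -> int:
    """Number of edges between a concept and the root of the hierarchy."""
    d = 0
    while PARENT[concept] is not None:
        concept = PARENT[concept]
        d += 1
    return d


def classify(text: str) -> List[str]:
    """Return the root-to-concept path of the deepest concept whose keywords
    occur in the text, or an empty list if no keyword occurs."""
    lowered = text.lower()
    hits = [c for c, kws in KEYWORDS.items() if any(k in lowered for k in kws)]
    if not hits:
        return []
    node: Optional[str] = max(hits, key=depth)
    path: List[str] = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]
```

Returning the whole path rather than a single label keeps the link to the more general concepts, which is one way a document found on the web could be attached to a slot of a global schema.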