The Web-DL Environment for Building Digital Libraries from the Web
P. Calado¹, M. Gonçalves², E. Fox², B. Ribeiro-Neto¹, A. Laender¹, A. da Silva³, D. Reis¹, P. Roberto¹, M. Vieira¹, J. Lage¹
¹ Federal University of Minas Gerais
² Virginia Tech
³ Federal University of Amazonas
Project I3DL: Integrating Network Representations and Inference Models to Generate Tailored and Interoperable Digital Libraries from Specifications
Introduction
• The Web
  • Largely unstructured
  • No assumptions about users
  • Huge volume of information
• Digital Libraries
  • Information is explicitly organized, described, and managed
  • Users have broad interests but are more specialized
  • Controlled environment
Moving From the Web to a Digital Library
[Figure: the Web, Digital Libraries, and Databases plotted by knowledge of users/tasks (vertical axis, low to high) against structure of data (horizontal axis, low to high)]
The Web-DL Environment
• Combines data extraction and DL tools to:
  • Collect data from the Web
  • Normalize the data
  • Make the data publicly available
• Thus providing:
  • Services and organization of a DL
  • Access to the breadth of Web contents
Context
• Federated digital library systems
  • Autonomous and heterogeneous
  • Different systems/protocols
  • Challenges: interoperability, data fusion, etc.
• Networked Digital Library of Theses and Dissertations (NDLTD)
  • 176 members (and growing)
  • Continuous flow of new data: electronic theses and dissertations (ETDs)
  • Not all members support the same protocols/standards
Web-DL Architecture
• From the Web to a DL in 3 steps:
  • Crawl Web sites to collect pages
  • Parse the pages to extract the data, mapping it to a standard format
  • Make the data available through a standard protocol (to a DL service provider)
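
The three steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the Web-DL implementation: the `Record` fields, the HTML patterns, and the function names `crawl`, `parse`, `list_records`, and `build_library` are all hypothetical, the "site" is an in-memory dictionary rather than a live Web site, and `list_records` is a stand-in for whichever standard protocol is used (the slides do not name it).

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Record:
    """One extracted item in a common (hypothetical) metadata format."""
    identifier: str
    title: str
    author: str


def crawl(site: Dict[str, str]) -> Iterable[Tuple[str, str]]:
    """Step 1: collect pages. The 'site' here is an in-memory URL -> HTML map."""
    return site.items()


def parse(url: str, html: str) -> Record:
    """Step 2: extract fields from a page and map them to the common format."""
    title = re.search(r"<h1>(.*?)</h1>", html, re.S)
    author = re.search(r'<span class="author">(.*?)</span>', html, re.S)
    return Record(
        identifier=url,
        title=title.group(1).strip() if title else "",
        author=author.group(1).strip() if author else "",
    )


def list_records(records: List[Record]) -> List[dict]:
    """Step 3: expose the records to a service provider (protocol stand-in)."""
    return [vars(record) for record in records]


def build_library(site: Dict[str, str]) -> List[dict]:
    """Run the three steps in order."""
    return list_records([parse(url, html) for url, html in crawl(site)])
```

Keeping each step behind its own function mirrors the slide's point that different tools handle collection, extraction, and publication, so any one of them can be replaced without touching the others.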
Collecting Web pages: the ASByE tool
• Generates agents for collecting Web pages
• Provides a visual metaphor for navigation examples
• Able to collect both static and dynamic pages
Extracting the data: the DEByE tool
• Provides a visual paradigm to specify data examples
• Generates wrappers based on context
• Implemented as a Web service
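
To make the idea of a context-based wrapper concrete, here is a deliberately simple sketch. It is not DEByE's algorithm, which the slides do not describe. It only assumes that a user marks one example value on a page, that the text immediately around the example is remembered, and that the same surrounding text is used to locate the value on other pages. The function names `learn_wrapper` and `apply_wrapper` and the `width` parameter are hypothetical.

```python
from typing import Optional, Tuple


def learn_wrapper(page: str, example: str, width: int = 12) -> Tuple[str, str]:
    """Return the text just before and just after the user's example value."""
    start = page.index(example)  # raises ValueError if the example is absent
    end = start + len(example)
    return page[max(0, start - width):start], page[end:end + width]


def apply_wrapper(page: str, wrapper: Tuple[str, str]) -> Optional[str]:
    """Extract the value that sits between the learned left and right context."""
    left, right = wrapper
    start = page.find(left)
    if start == -1:
        return None  # left context not found on this page
    start += len(left)
    if right == "":
        return page[start:]  # the example ended the page, so take the rest
    end = page.find(right, start)
    if end == -1:
        return None  # right context not found after the left context
    return page[start:end]
```

For instance, a wrapper learned from the title cell of one page can be applied to a second page with the same layout:

```python
page1 = '<td class="title">Web Crawling Agents</td>'
page2 = '<td class="title">Digital Library Services</td>'
wrapper = learn_wrapper(page1, "Web Crawling Agents", width=8)
print(apply_wrapper(page2, wrapper))  # Digital Library Services
```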
Normalizing Web data
• Web data is far from standardized
  • Missing mandatory information
  • Information only implicitly present
  • Data in wrong format
  • Data in wrong encoding
• A solution: data conversion modules
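
A data conversion module of the kind mentioned above might look like the following sketch. It is an assumption-laden illustration rather than the project's code: the mandatory field list, the accepted date formats, and the names `fix_encoding`, `fix_date`, and `normalize` are hypothetical. It touches each of the four problems listed: it repairs one common encoding error, rewrites dates into ISO format, fills in values that are only implicit (for example, the institution, known from the site being crawled), and reports mandatory fields that are still missing.

```python
from datetime import datetime
from typing import Dict, List, Tuple

MANDATORY = ("title", "author", "date")  # hypothetical mandatory fields
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # hypothetical accepted input formats


def fix_encoding(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Latin-1 (e.g. 'GonÃ§alves')."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # already correct, or not repairable this way


def fix_date(text: str) -> str:
    """Rewrite a date in any accepted format as ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return text  # unknown format: leave unchanged for manual review


def normalize(record: Dict[str, str],
              defaults: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Return the cleaned record and the mandatory fields still missing."""
    out = {key: fix_encoding(value).strip() for key, value in record.items()}
    for field, value in defaults.items():
        out.setdefault(field, value)  # implicit information made explicit
    if "date" in out:
        out["date"] = fix_date(out["date"])
    missing = [field for field in MANDATORY if not out.get(field)]
    return out, missing
```

Returning the list of missing fields, instead of raising an error, lets a caller decide per site whether an incomplete record should be dropped or kept.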
Providing DL services: the MARIAN system
• Configurable from specifications
• Semantic networks allow a unified representation of complex digital objects and relationships
• Extensible API
An example Web DL
• ETDs were collected from NDLTD member sites
  • 21 different institutions
  • 9595 ETDs
• Work required:
  • 2 to 3 examples per field, 9 min. per site
• Major problems found:
  • Sites offline
  • The hidden Web problem
  • Data normalization
Conclusions
• Moving from the Web to a DL is not trivial
  • Site availability, hidden Web
  • Data extraction
  • Data normalization
• A general solution depends on solutions for these tasks and on integrating these solutions
• Web-DL provides an environment for such integration
Future Work
• Data integration
• Data quality estimation
• Automatic example generation