Extracting Semistructured Information from the Web
J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, A. Crespo (Stanford University)
Presented by: Wei Mao
Introduction:
• Background
  • Rapid growth of the WWW
  • Web pages contain semistructured data
  • Web data is difficult to manipulate
• One solution
  • A configurable extraction program
  • Extraction results are represented in OEM (Object Exchange Model)
  • A wrapper is used for querying
A detailed example:
• Weather table
• Can we ask “What is the forecast for Vienna for Jan. 28, 1997?”
Extraction process:
• HTML file
• Specification file
  • Commands
  • [ variables, source, pattern ]
• Package the result into an OEM object
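The extraction process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' extractor: the slide does not show the pattern syntax, so ordinary Python regular expressions are substituted for it; the names `SAMPLE_HTML`, `COMMANDS`, `run_commands`, and `package` are hypothetical; the page fragment and its forecast values are invented; and the nested dictionaries with `label`, `type`, and `value` fields are only loosely modeled on OEM objects.

```python
import re

# Invented page fragment for illustration only; not the page used in the talk.
SAMPLE_HTML = """<html><body><h1>Forecast</h1>
<table>
<tr><td>Vienna</td><td>Jan. 28, 1997</td><td>snow</td></tr>
<tr><td>Paris</td><td>Jan. 28, 1997</td><td>rain</td></tr>
</table></body></html>"""

# Each command mirrors the slide's [ variable, source, pattern ] triple.
# Assumption: patterns are Python regexes with one capture group.
COMMANDS = [
    ("table", "root", r"<table>(.*?)</table>"),
    ("rows", "table", r"<tr>(.*?)</tr>"),
]


def run_commands(commands, html):
    """Apply the commands in order; each binds a variable to the list of
    captures found in every text already bound to its source variable."""
    env = {"root": [html]}
    for variable, source, pattern in commands:
        regex = re.compile(pattern, re.DOTALL)
        env[variable] = [hit for text in env[source] for hit in regex.findall(text)]
    return env


def package(env):
    """Wrap the extracted rows in nested label/type/value dictionaries
    (a rough stand-in for an OEM object, not the real format)."""
    forecasts = []
    for row in env["rows"]:
        # Assumes every row has exactly three cells: city, date, outlook.
        city, date, outlook = re.findall(r"<td>(.*?)</td>", row, re.DOTALL)
        forecasts.append({
            "label": "forecast",
            "type": "set",
            "value": [
                {"label": "city", "type": "string", "value": city},
                {"label": "date", "type": "string", "value": date},
                {"label": "outlook", "type": "string", "value": outlook},
            ],
        })
    return {"label": "weather", "type": "set", "value": forecasts}


env = run_commands(COMMANDS, SAMPLE_HTML)
weather = package(env)
```

Keeping the commands as data rather than code is what makes such an extractor configurable: supporting a different page means writing a new command list, not a new program.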
Additional capabilities
• Extract_table construct
• Case operator
• Get(url) operator

Query the extracted result
• Use an existing wrapper generation tool
• Only a simple interface is required
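To make the table and query ideas concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the paper's Extract_table construct or its wrapper interface: `TableRows`, `table_to_records`, and `lookup` are hypothetical names, the sample table and its forecast values are invented, and the query is a plain Python filter rather than a Lorel query. It uses only the standard-library `html.parser` module.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Turn a table into a list of dicts, using the first row as field names."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def lookup(records, **conditions):
    """Return the records whose fields equal all the given conditions."""
    return [r for r in records
            if all(r.get(k) == v for k, v in conditions.items())]


# Invented sample data for illustration only.
SAMPLE_TABLE = (
    "<table>"
    "<tr><th>city</th><th>date</th><th>outlook</th></tr>"
    "<tr><td>Vienna</td><td>Jan. 28, 1997</td><td>snow</td></tr>"
    "<tr><td>Paris</td><td>Jan. 28, 1997</td><td>rain</td></tr>"
    "</table>"
)

records = table_to_records(SAMPLE_TABLE)
answer = lookup(records, city="Vienna", date="Jan. 28, 1997")
```

Once the table is a list of labeled records, a question about one city and one date reduces to a field-equality filter, which is why only a simple query interface is needed on top of the extracted result.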
Advantages
• Manipulates web data efficiently
• Flexible
• Easy to use
• Reuses existing systems (OEM, Lorel, HTML parser)
Disadvantages
• Depends on outside input
• Requires prior knowledge of the structure of the HTML file
• Requires a specification file