PAIR project progress report
Yi-Ting Chou, Shui-Lung Chuang, Xuanhui Wang
Motivation
• A great deal of information on the Web is distributed across pages and unstructured
• Web IE (information extraction): extract and organize such information into a structured format
• E.g., Person (name, contact (email, phone, address), research interests, …)
• E.g., Book (title, authors, price, ISBN, …)
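As a rough illustration of what "structured format" could mean for the two examples above, here is a minimal Python sketch. The class names, field names, and types are assumptions made for illustration; the slides list only the attribute names and leave the rest of each schema open ("…").

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Contact:
    # Sub-attributes of "contact" as listed on the slide; all optional
    # because a given page may expose only some of them.
    email: Optional[str] = None
    phone: Optional[str] = None
    address: Optional[str] = None


@dataclass
class Person:
    name: str
    contact: Contact = field(default_factory=Contact)
    research_interests: List[str] = field(default_factory=list)


@dataclass
class Book:
    title: str
    authors: List[str] = field(default_factory=list)
    price: Optional[float] = None  # currency is not specified on the slide
    isbn: Optional[str] = None


# Hypothetical record, filled in piece by piece as attributes are found.
person = Person(name="Jane Doe")
person.contact.email = "jane@example.org"
person.research_interests.append("information extraction")
```

Keeping every attribute optional reflects the motivation: the values for one person may be scattered over several pages and arrive incrementally.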
Example
Person (name, contact (email, phone, address), research interests, …)
[Figure showing Page 1, Page 2, Page 3]
Motivation (cont.)
• Direct Web IE is very hard
• E.g., the information is distributed and unstructured
• This project provides an instance-attribute retrieval engine as a step toward this problem
• In this project, we focus on personal information
• The attribute must be given as input (e.g., contact)
Flow Chart
[Diagram; box labels in extracted order: Pages, PageCollector, SegmentTool, Name, Trees, AttributeExpansion, Attribute*, Attribute Retrieval, Rank List]
Why a tree structure for page segmentation?
• The parameter that controls the size of a leaf block is difficult to tune
• Our solution: score every node of the tree, not only the leaf blocks, then select the appropriate nodes to rank
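The idea of scoring every node rather than fixed-size leaf blocks can be sketched as follows. This is only an illustration under stated assumptions: the `Node` class, the tokenizer, and the scoring formula (attribute-term hits divided by the square root of the subtree length) are hypothetical choices, not the scoring function used in the project.

```python
import math
import re
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Node:
    text: str = ""  # text directly inside this node
    children: List["Node"] = field(default_factory=list)


def tokens(s: str) -> List[str]:
    # Assumed tokenizer: lowercase runs of letters, digits, and "@".
    return re.findall(r"[a-z0-9@]+", s.lower())


def subtree_tokens(node: Node) -> List[str]:
    # A node's content is its own text plus the text of all descendants.
    toks = tokens(node.text)
    for child in node.children:
        toks.extend(subtree_tokens(child))
    return toks


def score(node: Node, terms: Set[str]) -> float:
    # Assumed heuristic: hits / sqrt(length). Large nodes are penalized for
    # diluting the match, tiny nodes for containing few hits.
    toks = subtree_tokens(node)
    if not toks:
        return 0.0
    hits = sum(1 for t in toks if t in terms)
    return hits / math.sqrt(len(toks))


def rank_nodes(root: Node, terms: Set[str]) -> List[Tuple[float, Node]]:
    # Score internal nodes and leaves alike, then sort by score, best first.
    scored = []
    stack = [root]
    while stack:
        n = stack.pop()
        scored.append((score(n, terms), n))
        stack.extend(n.children)
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical page: a navigation block and a contact block under one root.
nav = Node("Home Research Teaching Publications Links")
contact = Node("Email alice@example.org Phone 555 0100 Address")
root = Node(children=[nav, contact])
ranking = rank_nodes(root, {"email", "phone", "address"})
```

Because every node is a candidate, no leaf-size parameter has to be tuned: the ranking itself decides whether a small block or a larger enclosing region is the better answer.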
Current Progress
[Diagram; box labels in extracted order: Pages, PageCollector, SegmentTool, Name, Trees, AttributeExpansion, Attribute*, Attribute Retrieval, Rank List]
Remaining tasks
1. Improve accuracy on a single page
2. Extend to multiple pages:
• INPUT: a person name (instead of a URL) and an attribute name
• OUTPUT: a ranked list of blocks
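The multi-page input/output contract above might look like the sketch below. It assumes the pages for the person have already been collected and segmented into text blocks; the function names, the `pages` dictionary layout, and the scoring rule (attribute-term hits plus a bonus when the name appears) are hypothetical and only illustrate the interface.

```python
from typing import Dict, List, Set, Tuple


def block_score(block: str, name: str, terms: Set[str]) -> float:
    # Assumed rule: count attribute-term hits, add 1 if the name is present.
    words = [w.strip(".,:;()") for w in block.lower().split()]
    hits = sum(1 for w in words if w in terms)
    bonus = 1.0 if name.lower() in block.lower() else 0.0
    return hits + bonus


def rank_blocks(
    pages: Dict[str, List[str]], name: str, terms: Set[str]
) -> List[Tuple[float, str, str]]:
    # Merge blocks from all pages into one list of (score, url, block),
    # dropping blocks that match nothing, best score first.
    ranked = []
    for url, blocks in pages.items():
        for block in blocks:
            s = block_score(block, name, terms)
            if s > 0:
                ranked.append((s, url, block))
    ranked.sort(key=lambda r: r[0], reverse=True)
    return ranked


# Hypothetical input: two already-segmented pages for one person.
pages = {
    "http://example.org/a": [
        "Jane Doe Email: jane@example.org Phone: 555-0100",
        "Home News",
    ],
    "http://example.org/b": ["Office address: Room 1"],
}
result = rank_blocks(pages, "Jane Doe", {"email", "phone", "address"})
```

Each entry keeps its source URL so that blocks from different pages can later be compared or de-duplicated.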
Issues for discussion
• Possible problems with our method
  • E.g., how to effectively score and rank the nodes of the page tree?
• Ways to improve and extend our method
  • E.g., how to combine it with NLP / named-entity extraction on the retrieved blocks
  • E.g., how to deal with multiple pages and duplicated information
• Suggestions for evaluating our method
  • E.g., a user study; anything more?
• The relation to Entity Retrieval
  • ??