
PAIR project progress report

Yi-Ting Chou, Shui-Lung Chuang, Xuanhui Wang. Motivation: a lot of information on the Web is distributed and unstructured. Web IE: to extract and organize such information into a structured format.


Presentation Transcript


  1. PAIR project progress report. Yi-Ting Chou, Shui-Lung Chuang, Xuanhui Wang

  2. Motivation
     • A lot of information on the Web is distributed and unstructured.
     • Web information extraction (Web IE): extract and organize such information into a structured format.
     • E.g., Person (name, contact (email, phone, address), research interests, …)
     • E.g., Book (title, authors, price, ISBN, …)
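As a rough illustration of the structured format mentioned on this slide, the Person example could be written as a record type. This is only a sketch: the class names, field names, and the sample values are assumptions taken from the slide's attribute list, not from the PAIR code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Contact:
    # Sub-attributes of "contact" as listed on the slide; all optional,
    # since a given page may expose only some of them.
    email: Optional[str] = None
    phone: Optional[str] = None
    address: Optional[str] = None


@dataclass
class Person:
    name: str
    contact: Contact = field(default_factory=Contact)
    research_interests: List[str] = field(default_factory=list)


# Hypothetical record, for illustration only.
person = Person(
    name="Jane Doe",
    contact=Contact(email="jane@example.org"),
    research_interests=["information extraction"],
)
```

Attributes that were not found on any page simply stay at their empty defaults.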

  3. Example: Person (name, contact (email, phone, address), research interests, …) [Figure: the information is spread over Page 1, Page 2, and Page 3]

  4. Motivation (cont.)
     • Direct Web IE is very hard.
       • E.g., the information is distributed and unstructured.
     • This project provides an instance-attribute retrieval engine as a step towards this problem.
     • In this project, we focus on personal information.
     • The attribute must be given (e.g., contact).

  5. Flow Chart [Figure; labelled elements: Name, PageCollector, Pages, SegmentTool, Trees, Attribute, AttributeExpansion, Attribute*, Retrieval, Rank List]

  6. Why a tree structure for page segmentation?
     • The parameter that controls the size of the leaf blocks is difficult to tune.
     • Our solution: score each node of the tree instead of only the leaf blocks, then select the appropriate node to rank.
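The idea of scoring every node rather than only the leaves can be sketched as follows. This is a minimal illustration, not the PAIR implementation: the `Node` class, the term-overlap score, and the square-root length normalization are all assumptions chosen for the example, and how to score nodes effectively is still listed as an open issue later in the deck.

```python
import math
import re
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Node:
    """One node of a page segmentation tree (hypothetical structure)."""
    text: str = ""                                  # text held directly by this node
    children: List["Node"] = field(default_factory=list)

    def tokens(self) -> List[str]:
        # Lowercase word tokens of this node and its whole subtree.
        toks = re.findall(r"\w+", self.text.lower())
        for child in self.children:
            toks.extend(child.tokens())
        return toks


def score(node: Node, terms: Set[str]) -> float:
    # Placeholder score (an assumption): count of attribute-term hits,
    # damped by subtree length so large, diluted blocks do not win.
    toks = node.tokens()
    if not toks:
        return 0.0
    hits = sum(1 for t in toks if t in terms)
    return hits / math.sqrt(len(toks))


def rank_nodes(root: Node, terms: Set[str]) -> List[Tuple[float, Node]]:
    # Score every node, internal and leaf, then sort best first.
    ranked = []
    stack = [root]
    while stack:
        node = stack.pop()
        ranked.append((score(node, terms), node))
        stack.extend(node.children)
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


# Hypothetical page with two blocks and an expanded "contact" attribute.
contact = Node(text="Email: jane@example.org Phone: 555-0100")
news = Node(text="Department news and upcoming seminar announcements for this term")
root = Node(children=[contact, news])
ranked = rank_nodes(root, {"email", "phone", "address"})
```

Because internal nodes compete with their descendants, no block-size parameter has to be fixed in advance: on this toy page the contact block ranks above the root, which in turn ranks above the unrelated news block.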

  7. Current Progress [Figure: the flow chart from slide 5; labelled elements: Name, PageCollector, Pages, SegmentTool, Trees, Attribute, AttributeExpansion, Attribute*, Retrieval, Rank List]

  8. Demo

  9. Remaining tasks
     1. Improve the accuracy for a single page.
     2. Extend to multiple pages:
        • INPUT: a person name (instead of a URL) and an attribute name.
        • OUTPUT: a ranked list of blocks.

  10. Issues for discussion
     • Possible problems with our method
       • E.g., how to effectively score and rank the "nodes" of the page "tree"?
     • Ways to improve and extend our method
       • E.g., how to combine it with NLP / named-entity extraction on the retrieved blocks?
       • E.g., how to deal with multiple pages and duplicated information?
     • Suggestions for evaluating our method
       • E.g., a user study; anything more?
     • The relation to Entity Retrieval
       • ?
