150 likes | 166 Views
Automatically Annotating Web Pages Using Google Rich Snippets.
E N D
Automatically Annotating Web Pages Using Google Rich Snippets This talk is based on the paper A Framework for Automatic Annotation of Web Pages Using the Google Rich Snippets Vocabulary. Meer, J. van der, Boon, F., Hogenboom, F.P., Frasincar, F. & Kaymak, U. (2011). In 26th Symposium on Applied Computing (SAC 2011) (pp. 763-770). ACM. 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Introduction (1) • Semantically annotating Web pages enhances machine interpretation • Google Rich Snippets (RDFa) enable Web page owners to add semantics to their pages • The vocabulary enables interesting applications 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Introduction (2) • Automating annotation for static and 3rd party Web sites is deemed necessary • Hence, we propose the Automatic Review Recognition and annOtation of Web pages (ARROW) framework 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Framework (1) • Four main stages: • Hotspot identification • Subjectivity analysis • Information extraction • Page annotation • Web pages are converted to DOM trees in order to enable easy processing 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Framework (2) RDFa 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Framework (3): Hotspots • Reviews are characterized by large blocks of text: hotspots • Headers, navigation elements, footers, etc., do not contain these blocks • Text blocks have few HTML elements • For each element in the DOM tree, we compute the text-to-content-ratio (TTCR): , with = # textual characters, and = total # characters in DOM 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Framework (4): Hotspots • Illustrative example: • The h1 element contains 64/73 × 100% ≈ 88% text • However, the div element merely contains 34/116 × 100% ≈ 29% text due to its span elements <h1> Intel Core i7-975 Extreme And i7-950 Processors Reviewed </h1> <div> <p> Page <span class="page-number">1</span> of <span class="num-pages">15</span> </p> </div> 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Framework (5): Subjectivity • Hotspots are verified as reviews whenever they are subjective enough • We utilize an updated version of the LightWeight subjectivity Detection mechanism (LWD) of Barbosa et al. (2009): • Original: check if document has ≥ ksentences that contain ≥ nsubjectivity words each • Modification: check if document has ≥ m percent of all sentences that contain ≥ nsubjectivity words each 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Framework (6): IE • Various information is extracted: • Authors: • Named entities are detected in the vicinity of hotspots • Named Entity Recognizer (NER) • Dates: • Many different date formats are easily parsed • Regular expressions • Products: • Name often found in title and h1 elements • Overlapping words • Ratings: • Many formats, e.g., images (90%), which can be numerical (80%), descriptors (15%), or letters (5%) • We focus on numerical ratings • Regular expressions on plain text or alt text of images (\w)\s(\d{1,2})(th|,)?\s(\d{2,4}) MM ddyyyy ([0-9.,]+)\s?/\s?([0-9.,]+) 4/5 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Framework (7): Annotation • Key elements are tagged using Google Rich Snippets • A new annotated Web page is returned <div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review"> <span property="v:itemreviewed"> Tango Hotel Taichung </span> <span property="v:reviewer">Sarah Lee</span> <span property="v:rating">4 stars</span> <span property="v:dtreviewed">18th December 2008</span> <p property="v:summary"> Boutique like hotel without the boutique price </p> </div> 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Implementation (1) • We have implemented the ARROW framework as a Web application: • Java-based • Apache Tomcat server • Input: • URL • Preferred output: • Visualizer • Annotated document 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Implementation (2) 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Evaluation • Test set: 100 review, 100 non-review Web pages • Sub-second performance • Precision and specificity are good (both ± 90%), while accuracy and recall are varying (± 40% – 60%) • Main problems related to detecting authors, likely caused by the use of nicknames • Dependency on Web site structures 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Conclusions • We presented ARROW, a framework for automatically annotating reviews with Google Rich Snippets • Framework not bound to vocabulary • Proof-of-concept implementation shows promising results • Future work: • Improve heuristics • Add intelligent (semantically enabled) text parsers • Extend to other domains, e.g., recipes, videos, etc. 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Questions http://www.arrow-project.com/ 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)