Automatically Annotating Web Pages Using Google Rich Snippets

This talk is based on the paper A Framework for Automatic Annotation of Web Pages Using the Google Rich Snippets Vocabulary. Meer, J. van der, Boon, F., Hogenboom, F.P., Frasincar, F. & Kaymak, U. (2011). In 26th Symposium on Applied Computing (SAC 2011) (pp. 763-770). ACM.
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

Introduction (1)
• Semantically annotating Web pages enhances machine interpretation
• Google Rich Snippets (RDFa) enable Web page owners to add semantics to their pages
• The vocabulary enables interesting applications

Introduction (2)
• Automating annotation is deemed necessary for static and third-party Web sites
• Hence, we propose the Automatic Review Recognition and annOtation of Web pages (ARROW) framework

Framework (1)
• Four main stages:
  • Hotspot identification
  • Subjectivity analysis
  • Information extraction
  • Page annotation
• Web pages are converted to DOM trees to enable easy processing
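The DOM conversion step can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` module rather than whatever parser the ARROW implementation uses, and the `Node`, `TreeBuilder`, and `parse_dom` names are hypothetical helpers introduced only for this illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """A very small DOM-like node: tag name, parent, children, direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects while the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, then leave it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse_dom(html):
    """Return the root node of a simple tree for the given HTML string."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Once a page is in tree form, each later stage can visit elements one by one instead of scanning raw markup.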

Framework (2) RDFa

Framework (3): Hotspots
• Reviews are characterized by large blocks of text: hotspots
• Headers, navigation elements, footers, etc., do not contain these blocks
• Text blocks have few HTML elements
• For each element in the DOM tree, we compute the text-to-content ratio (TTCR): TTCR = t / c × 100%, with t = # textual characters, and c = total # characters of the element in the DOM

Framework (4): Hotspots
• Illustrative example:
  • The h1 element contains 64/73 × 100% ≈ 88% text
  • However, the div element merely contains 34/116 × 100% ≈ 29% text due to its span elements

    <h1> Intel Core i7-975 Extreme And i7-950 Processors Reviewed </h1>
    <div>
      <p>
        Page <span class="page-number">1</span> of <span class="num-pages">15</span>
      </p>
    </div>
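The ratio from the hotspot slides can be sketched in a few lines. This is a simplification under stated assumptions: tags are stripped with a regular expression instead of walking a DOM tree, whitespace is counted as-is, and the function names `ttcr` and `ttcr_from_counts` are hypothetical.

```python
import re

# Assumption: anything between '<' and '>' is markup; everything else is text.
TAG_PATTERN = re.compile(r"<[^>]*>")


def ttcr_from_counts(text_chars, total_chars):
    """Text-to-content ratio in percent, given the two character counts."""
    if total_chars == 0:
        return 0.0
    return 100.0 * text_chars / total_chars


def ttcr(element_html):
    """Text-to-content ratio in percent for one serialized element."""
    text = TAG_PATTERN.sub("", element_html)
    return ttcr_from_counts(len(text), len(element_html))


# The counts quoted on the slide: 64 of 73 characters and 34 of 116 characters.
print(round(ttcr_from_counts(64, 73)))   # h1 element
print(round(ttcr_from_counts(34, 116)))  # div element
```

Elements whose ratio is high are hotspot candidates; the threshold is a tuning parameter that the slides do not specify.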

Framework (5): Subjectivity
• Hotspots are verified as reviews whenever they are subjective enough
• We utilize an updated version of the LightWeight subjectivity Detection mechanism (LWD) of Barbosa et al. (2009):
  • Original: check if the document has ≥ k sentences that contain ≥ n subjectivity words each
  • Modification: check if ≥ m percent of all sentences in the document contain ≥ n subjectivity words each
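The difference between the original and the modified check can be made concrete with a small sketch. The sentence splitter, the tokenizer, and the idea of passing the subjectivity lexicon as a plain set are assumptions made for illustration; the slides do not say which lexicon or splitter is used.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def count_subjective_words(sentence, lexicon):
    """Number of tokens in the sentence that occur in the lexicon."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(1 for word in words if word in lexicon)


def subjective_sentence_count(text, lexicon, n):
    """Return (sentences with >= n subjectivity words, total sentences)."""
    sentences = split_sentences(text)
    hits = sum(1 for s in sentences if count_subjective_words(s, lexicon) >= n)
    return hits, len(sentences)


def is_subjective_original(text, lexicon, k, n):
    """Original rule: at least k sentences with at least n subjectivity words."""
    hits, _total = subjective_sentence_count(text, lexicon, n)
    return hits >= k


def is_subjective_modified(text, lexicon, m, n):
    """Modified rule: at least m percent of sentences have >= n such words."""
    hits, total = subjective_sentence_count(text, lexicon, n)
    if total == 0:
        return False
    return 100.0 * hits / total >= m
```

The percentage-based variant scales with document length, so a long page with a handful of opinionated sentences is not accepted as readily as a short, consistently opinionated one.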

Framework (6): IE
• Various information is extracted:
  • Authors:
    • Named entities are detected in the vicinity of hotspots
    • Named Entity Recognizer (NER)
  • Dates:
    • Many different date formats are easily parsed
    • Regular expressions, e.g., (\w)\s(\d{1,2})(th|,)?\s(\d{2,4}) for dates of the form MM dd yyyy
  • Products:
    • Name often found in title and h1 elements
    • Overlapping words
  • Ratings:
    • Many formats, e.g., images (90%), which can be numerical (80%), descriptors (15%), or letters (5%)
    • We focus on numerical ratings
    • Regular expressions on plain text or alt text of images, e.g., ([0-9.,]+)\s?/\s?([0-9.,]+) for ratings such as 4/5
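The two regular expressions from the slide can be tried out directly. In the sketch below, the date pattern is kept verbatim in `DATE_PATTERN`; because its first group `(\w)` captures only a single character, a widened variant with `(\w+)` is added as an assumption so that the whole month word is captured. The helper names and the comma-to-dot normalization are likewise assumptions.

```python
import re

# Patterns as shown on the slide.
DATE_PATTERN = re.compile(r"(\w)\s(\d{1,2})(th|,)?\s(\d{2,4})")
RATING_PATTERN = re.compile(r"([0-9.,]+)\s?/\s?([0-9.,]+)")

# Assumption: widening (\w) to (\w+) so the full month word is captured.
DATE_PATTERN_FULL_MONTH = re.compile(r"(\w+)\s(\d{1,2})(th|,)?\s(\d{2,4})")


def find_date(text):
    """Return (month, day, year) for the first date-like match, else None."""
    match = DATE_PATTERN_FULL_MONTH.search(text)
    if match is None:
        return None
    month, day, _suffix, year = match.groups()
    return month, int(day), int(year)


def find_rating(text):
    """Return (value, best) for the first 'value/best' match, else None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    try:
        # Assumption: a comma is a decimal separator, as in '8,5 / 10'.
        value = float(match.group(1).replace(",", "."))
        best = float(match.group(2).replace(",", "."))
    except ValueError:
        return None  # e.g. the groups matched only punctuation
    return value, best
```

The same `find_rating` call works on the alt text of a rating image, since that is plain text once it has been read from the attribute.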

Framework (7): Annotation
• Key elements are tagged using Google Rich Snippets
• A new annotated Web page is returned

    <div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
      <span property="v:itemreviewed">Tango Hotel Taichung</span>
      <span property="v:reviewer">Sarah Lee</span>
      <span property="v:rating">4 stars</span>
      <span property="v:dtreviewed">18th December 2008</span>
      <p property="v:summary">Boutique like hotel without the boutique price</p>
    </div>
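As a rough illustration of the annotation stage, the sketch below renders extracted fields as an RDFa block using the property names from the example. It is a simplification: it builds a standalone snippet, whereas the framework tags elements inside the existing page, and the function name `annotate_review` is hypothetical.

```python
from html import escape

RDFA_NAMESPACE = "http://rdf.data-vocabulary.org/#"


def annotate_review(item, reviewer, rating, date, summary):
    """Render extracted review fields as a Rich Snippets RDFa block."""
    fields = [
        ("span", "v:itemreviewed", item),
        ("span", "v:reviewer", reviewer),
        ("span", "v:rating", rating),
        ("span", "v:dtreviewed", date),
        ("p", "v:summary", summary),
    ]
    lines = [f'<div xmlns:v="{RDFA_NAMESPACE}" typeof="v:Review">']
    for tag, prop, value in fields:
        # Escape the extracted text so it cannot break the surrounding markup.
        lines.append(f'  <{tag} property="{prop}">{escape(value)}</{tag}>')
    lines.append("</div>")
    return "\n".join(lines)


print(annotate_review(
    "Tango Hotel Taichung",
    "Sarah Lee",
    "4 stars",
    "18th December 2008",
    "Boutique like hotel without the boutique price",
))
```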

Implementation (1)
• We have implemented the ARROW framework as a Web application:
  • Java-based
  • Apache Tomcat server
• Input:
  • URL
  • Preferred output:
    • Visualizer
    • Annotated document

Implementation (2)

Evaluation
• Test set: 100 review and 100 non-review Web pages
• Sub-second performance
• Precision and specificity are good (both approximately 90%), while accuracy and recall vary (approximately 40% to 60%)
• The main problems relate to detecting authors, likely caused by the use of nicknames
• Dependency on Web site structures
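For reference, the four measures named above follow from a confusion matrix in the standard way. The sketch below uses made-up counts purely to show the arithmetic; they are not the results of the ARROW evaluation, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard measures from true/false positive/negative counts."""

    def ratio(numerator, denominator):
        # Avoid division by zero when a class is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),      # correct among predicted reviews
        "recall": ratio(tp, tp + fn),         # found among actual reviews
        "specificity": ratio(tn, tn + fp),    # correct among actual non-reviews
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Illustrative counts only (assumption), not taken from the paper.
metrics = classification_metrics(tp=8, fp=2, tn=6, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

A classifier that rarely labels a page as a review can score high on precision and specificity while still missing many actual reviews, which is the pattern the slide reports.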

Conclusions
• We presented ARROW, a framework for automatically annotating reviews with Google Rich Snippets
• The framework is not bound to this vocabulary
• The proof-of-concept implementation shows promising results
• Future work:
  • Improve heuristics
  • Add intelligent (semantically enabled) text parsers
  • Extend to other domains, e.g., recipes and videos

Questions
http://www.arrow-project.com/
