Using Events for Content Appraisal and Selection in Web Archives
Thomas Risse¹, Stefan Dietze¹, Diana Maynard², Nina Tahmasebi¹, and Wim Peters²
¹ L3S Research Center, Hanover, Germany
² University of Sheffield, Sheffield, UK
Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011)
The Web is a core part of our daily life
World Wide Web = 50 billion pages + 1 billion users
• The Web plays a crucial role
• Information and services for all domains
• Reflects all types of events, opinions, and developments within society, science, politics, environment, business, …
• Gives room for a multitude of stakeholders to articulate their views
• The Social Web allows contributions by every citizen
The Web is Changing and Forgetting
• We face great economic and environmental challenges
• Understand the past to help us navigate to a sustainable future
• The Web is a quickly changing, ever-growing information space [1]
• It is growing by more than 8% per week
• After one year, only 40% of the pages are still accessible, while 60% of the pages are new
• A Web archive as a collective memory is a cultural necessity for the future
[1] A. Ntoulas, J. Cho, and C. Olston. What's new on the web?: the evolution of the web from a search engine perspective. In Proceedings of the 13th International Conference on World Wide Web (WWW '04).
An Event to Crawl: The Olympic Games London 2012
The ARCOMEM Project
• To transform Web archives into community memories that are much more tightly integrated with their community of current and future users
• To develop methods and tools based on novel socially-aware and socially-driven Web preservation models
• Three dimensions
  • Intelligent and collaborative content acquisition support for archives
  • Social Web analysis: leverage Social Web information, relying on the Wisdom of the Crowds for intelligent content appraisal, selection, contextualization and preservation
  • Archive enrichment: extract information about entities, events, topics, and opinions
General Approach for Crawling
[Figure: the approach has two phases, a Crawling Phase (online) and a Learning Phase (offline)]
Extraction and Detection of Events
• Extraction of events is performed by GATE tools
• Uses an entity-centric approach, based on named entity recognition (ANNIE+) and term recognition (TermRaider)
• Employs a combination of two NLP-based methods:
  • top-down (template-based recognition of pre-defined events)
  • bottom-up (open IE approach based on semantic clustering to find new, unknown events)
• Both approaches use ontology-based IE, detection and clustering of verbal relations, and shallow parsing
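The top-down idea (template-based recognition of pre-defined events) can be illustrated with a small sketch. This is plain Python with regular expressions, not GATE or its pattern language; the template, the event type name `Performance`, the slot names and the function `match_templates` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical template list: (event type, pattern with named slots).
# A real system would derive such templates from an ontology of event types.
TEMPLATES = [
    (
        "Performance",
        re.compile(
            r"(?P<performer>[A-Z]\w+(?: [A-Z]\w+)*) performance at "
            r"(?P<venue>[A-Z][\w ]*?\d{4})"
        ),
    ),
]


def match_templates(sentence):
    """Return (event_type, slots) pairs for every template that matches."""
    found = []
    for event_type, pattern in TEMPLATES:
        for match in pattern.finditer(sentence):
            found.append((event_type, match.groupdict()))
    return found


mentions = match_templates(
    "Man gets crushed to death during Prodigy performance at Rock am Ring 2011"
)
for event_type, slots in mentions:
    print(event_type, slots)
```

A template only finds events of a kind that was defined in advance, which is why the slides pair it with a bottom-up, clustering-based method for unknown events.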
Events are hierarchical
• Rock am Ring 2011
  • Prodigy performance at Rock am Ring 2011
    • Man gets crushed to death during Prodigy performance at Rock am Ring 2011
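One possible way to represent such an event hierarchy is a simple tree in which each event holds its sub-events. The sketch below is an assumption about a suitable data structure, not the representation used in the project; the class `Event` and the helper `walk` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Event:
    """An event with an ordered list of more specific sub-events."""
    label: str
    sub_events: List["Event"] = field(default_factory=list)

    def add(self, label: str) -> "Event":
        """Create a sub-event, attach it, and return it."""
        child = Event(label)
        self.sub_events.append(child)
        return child


def walk(event: Event, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, label) pairs in depth-first order."""
    yield depth, event.label
    for sub in event.sub_events:
        yield from walk(sub, depth + 1)


festival = Event("Rock am Ring 2011")
performance = festival.add("Prodigy performance at Rock am Ring 2011")
performance.add(
    "Man gets crushed to death during Prodigy performance at Rock am Ring 2011"
)

levels = list(walk(festival))
for depth, label in levels:
    print("  " * depth + label)
```

With this structure, a page about the incident can be related to the performance and to the festival by following the parent chain.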
Challenges
• Extraction of events in the learning phase
  • Unstructured and heterogeneous Web data
  • Good context information needs to be extracted
  • Must be reasonably fast to make results available for crawling
• Detection of events during crawling
  • Large amounts of Web data need to be analysed in a very short time (>1000 pages/sec)
  • No real event extraction possible
  • Only event detection based on learned information
• Prioritization and decision making for crawling
  • Integrate everything to prioritize the crawling
  • Identification of “stop pages”
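The split between cheap detection during crawling and prioritization of the crawl can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's crawler: the term weights in `LEARNED_TERMS` stand in for whatever the offline learning phase produces, `STOP_THRESHOLD` is an invented cut-off, "stop pages" are read here as pages too weakly related to the event for their links to be worth following (the slides do not define the term), and the URLs are placeholders.

```python
import heapq
import re

# Hypothetical output of the offline learning phase: event terms and weights.
LEARNED_TERMS = {"olympic": 1.0, "london": 0.5, "2012": 0.5, "medal": 0.75}
STOP_THRESHOLD = 0.5  # assumed cut-off below which a page counts as a stop page

TOKEN = re.compile(r"\w+")


def event_score(text):
    """Cheap detection: sum the weights of learned terms present in the text."""
    tokens = set(TOKEN.findall(text.lower()))
    return sum(weight for term, weight in LEARNED_TERMS.items() if term in tokens)


def is_stop_page(text):
    """Treat pages with a low event score as stop pages (assumed meaning)."""
    return event_score(text) < STOP_THRESHOLD


class Frontier:
    """Crawl frontier that always returns the highest-scoring URL first."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps insertion order for equal scores

    def push(self, url, score):
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
frontier.push("http://example.org/a", event_score("Weather forecast for Hanover"))
frontier.push("http://example.org/b", event_score("London 2012 Olympic medal table"))
first_url, first_score = frontier.pop()
print(first_url, first_score)
```

Term lookup is used because it costs only one pass over the tokens of a page, which fits the throughput requirement better than full event extraction would.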
Summary & Conclusions
• There is a need to preserve parts of the Web (e.g. events) for future generations
• Crawlers need more guidance
• “Collect all” is not a solution
• Need to understand how well a page covers an event
• Analysis needs to be performed very efficiently
Thank you!