"Stories" in data and the roles of crowdsourcing – views of a Web miner

Bettina Berendt
Dept. of Computer Science, KU Leuven, Belgium
http://people.cs.kuleuven.be/~bettina.berendt/
Thanks to: Ilija Subašić, Markus Luczak-Rösch, and Laura Drăgan

A story

Story structure

One case of provenance

Another case of provenance

Formalizing provenance: a high-level view

Challenge 1: Many voices

Challenge 2

Challenge 3: Subjectivity

The STORIES Tool

Uncover (1)

Uncover (2)

Scan (over time)

Uncover

Zoom

Search: formulating ad-hoc concepts

Track (2)

Textual summarization

Challenge 4

Crowd-sourcing the truth? Wikipedia (here: the Gaza Flotilla Raid)

Challenge 5

Challenge 5: vagueness - reprise Challenge 4: More specifically

The "live crowdsourcing activity"
• Goal: crowdsource data citation metadata
• Motivation 1 / possible extension
• Motivation 2 / case study

http://prov.usewod.org

The data
• Datasets
• Publications
• [People]

The datasets
Preloaded: USEWOD datasets
• DBpedia
• SWDF
• Bio2RDF
• LinkedGeoData
• BioPortal
• OpenBioMed

The datasets
Preloaded:
• Generic (!)
• Versions/releases
• References

The datasets
Add new:
• Name*
• Version
• Release date
• URL

The publications
Preloaded: USEWOD workshop papers

The publications
Add new:
• Title*
• Authors
• Year
• URL
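The two "Add new" forms can be pictured as simple records. The following is a minimal sketch, not the tool's actual data model: the class names, the field types, and the reading of the asterisk as "required field" are assumptions, and the example values are made up.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Dataset:
    name: str                           # starred on the form, assumed to mean "required"
    version: Optional[str] = None
    release_date: Optional[str] = None  # kept as free text; the form's date format is not given
    url: Optional[str] = None

    def __post_init__(self):
        # Reject empty or whitespace-only names.
        if not self.name.strip():
            raise ValueError("Dataset name is required")


@dataclass(frozen=True)
class Publication:
    title: str                          # starred on the form, assumed to mean "required"
    authors: Tuple[str, ...] = ()
    year: Optional[int] = None
    url: Optional[str] = None

    def __post_init__(self):
        # Reject empty or whitespace-only titles.
        if not self.title.strip():
            raise ValueError("Publication title is required")


# Illustrative values only; they are not taken from the preloaded data.
dbpedia = Dataset(name="DBpedia", version="example-version")
paper = Publication(title="An example workshop paper", year=2014)
```

Keeping every field except the starred one optional matches the later lesson of minimising data input.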

The data

The task
Capture which dataset is used in which publication, and how

Data representation
• Datasets
• Publications
• Connections between them
• schema.org
• prov:Entity
• ?

Data representation
• Datasets
• Publications
• Connections between them
• schema.org
• prov:Entity
• prov:Derivation

Connections
• Publication – Publication
• Publication – Dataset
• Dataset – Publication
• Dataset – Dataset

Connections
• Publication – Publication: citation

Connections
• Publication – Dataset, Dataset – Publication: mentions, describes, evaluates, analyses, compares

Connections
• Dataset – Dataset: extends, includes, overlaps, transformation of, generalisation of
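The connection vocabulary listed on these slides can be written down as a lookup table and used to check a contributed connection. This is a sketch under assumptions: the function name and the kind labels "publication" and "dataset" are hypothetical, and the direction of a connection is ignored, so Publication – Dataset and Dataset – Publication share one set of relations.

```python
# Allowed relations per unordered pair of endpoint kinds, as listed on the slides.
ALLOWED_RELATIONS = {
    frozenset(["publication"]): {"citation"},
    frozenset(["publication", "dataset"]): {
        "mentions", "describes", "evaluates", "analyses", "compares",
    },
    frozenset(["dataset"]): {
        "extends", "includes", "overlaps",
        "transformation of", "generalisation of",
    },
}


def is_valid_connection(source_kind: str, relation: str, target_kind: str) -> bool:
    """Return True if `relation` is listed for this pair of endpoint kinds."""
    allowed = ALLOWED_RELATIONS.get(frozenset([source_kind, target_kind]), set())
    return relation in allowed


print(is_valid_connection("publication", "evaluates", "dataset"))  # True
print(is_valid_connection("dataset", "citation", "dataset"))       # False
```

Rejecting unlisted relations at input time is one way to keep crowdsourced data cleaner, at the cost of not capturing relations the vocabulary did not anticipate.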

Data representation
Subclasses of prov:Derivation (inverse of Publication – Dataset)
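One way to read "subclasses of prov:Derivation" is the PROV-O qualified-derivation pattern, in which the connection type is the class of the derivation node. The sketch below emits Turtle as plain strings. The `ex:` namespace, the class names derived from the relation labels, the helper names, and the choice of which endpoint counts as the derived entity are all assumptions, not the vocabulary actually used at prov.usewod.org.

```python
PREFIXES = (
    "@prefix prov: <http://www.w3.org/ns/prov#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix ex: <http://example.org/usewod#> .\n"  # hypothetical namespace
)


def relation_class(relation: str) -> str:
    """Map a relation label to a hypothetical class name, e.g. 'transformation of' -> 'ex:TransformationOf'."""
    return "ex:" + "".join(word.capitalize() for word in relation.split())


def declare_relation(relation: str) -> str:
    """Declare the relation as a subclass of prov:Derivation."""
    return f"{relation_class(relation)} rdfs:subClassOf prov:Derivation .\n"


def connect(source: str, relation: str, target: str) -> str:
    """Describe `source` as derived from `target` via a typed, qualified derivation."""
    return (
        f"ex:{source} a prov:Entity ;\n"
        f"    prov:qualifiedDerivation [\n"
        f"        a {relation_class(relation)} ;\n"
        f"        prov:entity ex:{target}\n"
        f"    ] .\n"
    )


# Illustrative identifiers only.
turtle = PREFIXES + declare_relation("evaluates") + connect("paper1", "evaluates", "dbpedia")
print(turtle)
```

Typing the derivation node, rather than minting one property per relation, keeps each connection a resource that can carry further metadata such as who contributed it.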

Data representation

Bundles

Live crowdsourcing activity 2014: outcomes

Lessons learned
• Data is dirty, even when it comes from experts
• Focus on the task
• Make everything else simpler
• Minimise data input

Questionnaire results
• Inconclusive results on the suitability of the vocabulary
• But interesting answers to "What questions would this information answer for you?":
• "What are popular datasets?"
• "Which datasets are facilitators for research on X?"
• "What publications are related through a dataset (but don't mention each other)?"
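Two of the questions raised in the questionnaire can be answered directly from the collected connections. The following is a minimal sketch over made-up data: the triple layout (source, relation, target), the identifiers, and the function names are assumptions, and "popular" is read simply as "connected to the most publications".

```python
from collections import Counter
from itertools import combinations

# Illustrative connections only; publication-dataset connections point from publication to dataset.
connections = [
    ("paper_a", "evaluates", "dataset_x"),
    ("paper_b", "analyses", "dataset_x"),
    ("paper_c", "mentions", "dataset_y"),
    ("paper_c", "evaluates", "dataset_x"),
    ("paper_a", "citation", "paper_c"),
]


def popular_datasets(connections):
    """Count distinct publications per dataset, most-used first."""
    counts = Counter()
    seen = set()
    for source, relation, target in connections:
        if relation != "citation" and (source, target) not in seen:
            seen.add((source, target))
            counts[target] += 1
    return counts.most_common()


def linked_without_citation(connections):
    """Pairs of publications that share a dataset but do not cite each other."""
    users = {}     # dataset -> set of publications connected to it
    cites = set()  # unordered pairs of publications with a citation between them
    for source, relation, target in connections:
        if relation == "citation":
            cites.add(frozenset((source, target)))
        else:
            users.setdefault(target, set()).add(source)
    pairs = set()
    for papers in users.values():
        for a, b in combinations(sorted(papers), 2):
            if frozenset((a, b)) not in cites:
                pairs.add((a, b))
    return sorted(pairs)


print(popular_datasets(connections))
# [('dataset_x', 3), ('dataset_y', 1)]
print(linked_without_citation(connections))
# [('paper_a', 'paper_b'), ('paper_b', 'paper_c')]
```

The third question, which datasets facilitate research on a topic X, would additionally need topic information about the publications, which the fields shown on the forms do not include.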

Outlook (1): Dimensions of crowdsourcing
• What is outsourced
• Who is the crowd
• How is the task designed
• How are the results validated
• How can the process be optimised
[Quinn & Bederson, 2012]
