BiographyNet Project review, year 1, September 18th, 2013

  BiographyNet Project review, year 1, September 18th, 2013. eScience Center, 18 September 2013

  Agenda • Project objectives and first year results (Piek) • Methodology and historian perspective (Serge) • Model, conversions and interface (Niels) • NLP tools and research (Antske) • Discussion

  3. Starting point • http://www.biografischportaal.nl • Academic discipline of writing histories: • computational tools marginally used, • long scholarly tradition of study by reading, • single-authored historical narratives, • while more and more historical sources are digitally available. • Project challenges: • “Computational thinking in history”: • Narrative historians are not used to framing research problems in computational terms, while computer-science researchers understand little of the subtleties of historical analysis • Strong multi-disciplinary cooperation of front runners in both fields & demonstrator development to achieve common understanding. • Methodological and tool support BiographyNet Review Meeting, eScience centre, September 18th, 2013

  4. Contribution to historical research • New research on Dutch nation building and a revaluation of biographical information. • Bridging a gap between life histories, qualitative historical research, and quantitative historical research. • Open research on less static objects and relations such as events: • most important pieces of information capturing changes and processes that matter. • Capture historiographic perspective: • Requires a model that takes different framings of the same event into account. • Adds to the who-knows-who: when, where and how did the lives of people cross; how did they affect each other’s lives and the world they lived in? • How do and did we conceive historic events, how are different narratives created around the same history?

  5. Expected outcome • Demonstrator on top of the Biography Portal. • Cyclic development. • Links within the Biography Portal among the various (textual and visual) datasets • Open-source release of the e-science platform for analyzing biographical texts about people. • Adherence to all relevant Web standards and APIs, maximizing reusability. • Proposal for methodology for extraction of a network of relations between people and (historic) events.

  6. Short term goals • Building a richer data repository by connecting different distributed sources of data through formalized links and metadata. • Detection of (co-referenced) named entities (persons, places and dates) and events. • Harmonizing the texts, which vary from 19th century Dutch to contemporary Dutch and, where OCR-ed, also contain errors. • Development of visualization, analytic tools, as well as computational historiographical methods on the structured data generated by the first three goals.

  7. Results first year • Methodology: • Use cases and the anticipation of data- and process-driven biases • Formal modeling of provenance • Sustainability, replication, reproducibility • Software: • Design of interfaces and analytic tools • Text mining and evaluation • Linked Data conversion scripts • Data: • Linked Data version of the Portal • Linking to Agora • Discussions with Wikimedia/Wikipedia/DBpedia & Bibliotheek.nl • Verrijkt Koninkrijk • Huygens ING exploitation to extend the Portal with enriched data produced • 6 accepted papers

  8. BiographyNet and historical approaches to ‘big’ and heterogeneous data. eScience Center, 18 September 2013

  9. The historian’s role • Methodology: Work on a methodology to extract information, relationships and events from short biographical texts • Question the data: develop use cases • Contribute to the design of a user interface that challenges historians to dig deeper into the data • Sensitize target user groups (historians) to both the possibilities and the limitations of computational methods in historical research.

  10. 1: Methodology • Year 1 - Historian’s focus: how reliable and representative are the texts from this particular dataset? Which questions can and cannot be answered? How well do ‘tools’ perform, as compared to the performance of a ‘real’ historian? See also publications (below). • Year 1 - Interdisciplinary focus: what is the provenance of the information, how is it manipulated in order to arrive at the answer to a query, and who is responsible for the tools that manipulate those data?

  11. 2: Use Cases • 12 cases developed, ranging from ‘simple’ to ‘highly complex’ • Simple: Group analysis of Governors-General of the Dutch Indies • More complex: when did Dutch elites get involved with the ‘New World’? • Complex: What can we say about nationalism in biographical dictionaries from the nineteenth and twentieth centuries?

  12. Governors-General of the Dutch Indies • Highest official in the Dutch Indies, 1610-1949 • 71 men • What can we say about these men as a group? • Who was appointed and what qualities did he have to have? • Etc. …

  13. 3: User friendly interface • Mainly work in progress • Discussion about the impact of a ‘design metaphor’ (like “time line…”, “house of…”, “building blocks for…”, “family tree…”) on the type of questions raised by the user • … presentation Niels.

  14. The House of History

  15. Time line

  16. Family Tree

  17. 4: Sensitize target user groups • Publication in Tijdschrift voor Biografie (reaching the nearest target user group of the demonstrator): Serge ter Braake, ‘Het individu en zijn tijdgenoten. Wat een biograaf kan doen met prosopografie en biografische woordenboeken’, Tijdschrift voor Biografie 2 (summer 2013) vol. 2, 52-61. • ‘Biography and Computational Methods’, joint paper in preparation (to be submitted before the end of the month to Journal for Historical Biography; Ter Braake, Ockeloen and Fokkens) • Research on nationalism and national biographies, to be published in 2014

  18. 4: Sensitize target user groups • Presentation at Huygens ING, 10 October 2013 (for circa 50 professional historians) • Presentation on provenance at KNAW Digital Humanities Workshop, 14-15 November 2013 • Introduction to e-Humanities in the current curriculum of BA1 students at the Vrije Universiteit (what is e-Humanities, how does one use a source like the Oxford Dictionary of National Biography?) • Design and development of a series of electives and a minor on e-history and e-humanities (BA 2-3; starting 2014/2015). The BiographyNet dataset will be used in a lab for history bachelor students.

  19. BiographyNet: Towards the demonstrator. eScience Center, 18 September 2013

  20. Overview Main components of the demonstrator • Schema to structure the data • Conversion of the BP to Linked Data • NLP system setup • Interface

  21. A crash course on Linked Data • Online machine-readable data with links • Simple facts called ‘RDF Triples’ • Thorbecke > hasBirthPlace > Zwolle • Some technology concepts: • Schemas: To structure LD • RDF Stores: To store LD • SPARQL: To access LD • Huge growth in the past years: • More than 300 data sources • More than 30 billion triples
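
To make the triple idea concrete, here is a minimal sketch that stores triples as plain Python tuples and matches them against a pattern, loosely in the way a SPARQL triple pattern binds variables. It is an illustration only, not the project's RDF store: apart from the Thorbecke birth place and year taken from the slides, the predicate names (`hasBirthYear`, `locatedIn`) and the `match` helper are assumptions.

```python
# Triples as (subject, predicate, object) tuples.
# Only the first fact appears literally on the slide; the other
# predicate names are hypothetical and used for illustration.
TRIPLES = [
    ("Thorbecke", "hasBirthPlace", "Zwolle"),
    ("Thorbecke", "hasBirthYear", "1798"),
    ("Zwolle", "locatedIn", "Netherlands"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples that fit the pattern.

    None acts as a wildcard, loosely like a variable in a
    SPARQL triple pattern.
    """
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Where was Thorbecke born?"
birth_places = [o for _, _, o in match(TRIPLES, s="Thorbecke", p="hasBirthPlace")]
print(birth_places)
```

A real setup would use an RDF store and SPARQL instead of a list scan; the tuple form only shows what a single fact looks like.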

  22. The conversion process: Purely syntactic conversion • Preserve the original structure of the data • Prevent loss of information • Allow for reinterpretation of the original data in the future • Data Preservation

  23. The conversion process: Conversion steps • Retrieval of XML dump of the Biography Portal • Initial conversion to ‘crude’ RDF • Using ClioPatria and the XMLRDF tool for ClioPatria • RDF restructuring • Linking to other sources • Essential step in the ‘Linked Data’ philosophy
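
The ‘crude’ conversion step can be sketched as follows, assuming a purely syntactic mapping in which every XML element becomes a node and every attribute or text value becomes a triple. This is not the XMLRDF tool for ClioPatria; the XML fragment, the `crude_triples` function and the `value` predicate are hypothetical, and the real Biography Portal dump has its own structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment, invented for illustration.
XML_DUMP = """
<biography id="bio1">
  <person>
    <name>Johan Rudolph Thorbecke</name>
    <birth when="1798" place="Zwolle"/>
  </person>
</biography>
"""


def crude_triples(element, subject, counter):
    """Emit one triple per attribute, text value and child element.

    Nothing is interpreted or dropped, so the original structure
    can still be recovered from the triples.
    """
    triples = []
    for name, value in element.attrib.items():
        triples.append((subject, name, value))
    text = (element.text or "").strip()
    if text:
        triples.append((subject, "value", text))
    for child in element:
        counter[0] += 1
        node = f"_:n{counter[0]}"  # blank-node style identifier
        triples.append((subject, child.tag, node))
        triples.extend(crude_triples(child, node, counter))
    return triples


root = ET.fromstring(XML_DUMP.strip())
triples = crude_triples(root, "_:n0", [0])
for triple in triples:
    print(triple)
```

Restructuring and linking to other sources would then operate on these triples, leaving the crude version available for later reinterpretation.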

  24. The conversion process: Data schema • Based on the structure of the original XML files • Needs to facilitate the coupling of different biographies of the same person, without compromising the original data • Needs to facilitate the incorporation of several enrichments, following from NLP, Entity Reconciliation, etc. • Compatible with existing schemas such as the Europeana Data Model, PROV, P-PLAN, DC terms, etc.

  25. BiographyNet: Schema illustration http://www.biographynet.nl/schema

  26. Provenance: What is it? Provenance information is information on how Entities come into existence • What are entities? • Documents, Articles, Pictures, etc. • Basically anything that can be ‘produced’ by something or someone • What kind of information? • Who did what? • Using which entities? • In which processes?

  27. Provenance in BiographyNet For the demonstrator, provenance needs to be modeled: • From several perspectives: • Information involved • Processes involved • People involved • At multiple levels: • An aggregated level, i.e. per enrichment • Detailed level, i.e. all individual processes
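
The two levels can be sketched as one detailed log of activities plus a function that summarizes it per enrichment. This is a simplified illustration and not the BiographyNet schema: the `Activity` fields, the step names and the `aggregate` function are assumptions, chosen to echo the entity, activity and agent notions of provenance modeling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """Detailed level: one processing step with its agent and entities."""
    name: str
    agent: str
    used: tuple
    generated: tuple
    enrichment: str


# Hypothetical pipeline steps belonging to one enrichment.
LOG = [
    Activity("tokenize", "NLP Tool", ("bio_text",), ("tokens",), "thorbecke_enrichment"),
    Activity("extract_birth", "NLP Tool", ("tokens",), ("birth_event",), "thorbecke_enrichment"),
]


def aggregate(log, enrichment):
    """Aggregated level: summarize all steps of one enrichment."""
    steps = [a for a in log if a.enrichment == enrichment]
    used = {e for a in steps for e in a.used}
    generated = {e for a in steps for e in a.generated}
    return {
        "enrichment": enrichment,
        "agents": sorted({a.agent for a in steps}),
        # Inputs that no step in the pipeline produced: the original sources.
        "sources": sorted(used - generated),
        # Outputs that no later step consumed: the final results.
        "results": sorted(generated - used),
    }


summary = aggregate(LOG, "thorbecke_enrichment")
print(summary)
```

The aggregated view hides intermediate entities such as the tokens, while the detailed log keeps every step.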

  28. Why is provenance info important for BiographyNet? Needed to ensure credibility of the demonstrator, to evaluate its performance and to improve the academic status of the tool • Historians need to be able to validate results • Replication: Retrieving the same results later using the demonstrator • Reproducibility: Manually by the historian • The aggregated level – Targeted at the historian • Which original sources were involved? • Who to contact in case results are called into question? • The detailed level – Targeted at the computer scientist • Detailed information on each individual step • Allows for debugging the internal processing pipeline

  29. BiographyNet Enrichment example [Diagram] • Source: NNBW “Thorbecke”, with Provenance Meta Data and a Biographical Description (Person Meta Data; Birth Event: 1798) • Biography Parts (source text): “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…” (“Johan Rudolph Thorbecke was born in 1798 on 14 January in Zwolle and comes from a half-German…”) • Thorbecke Enrichment by the NLP Tool: Biographical Description (Person Meta Data; Birth Event: 1798-01-14, Zwolle)

  30. More than just Provenance… P-PLAN is not only used to model what actually happened, but also what was supposed to happen • ‘Plans’ describe the original idea behind an activity • Describe what should happen in a certain activity • Each ‘Plan’ corresponds with an ‘Activity’ • ‘Variables’ describe the input/output of an activity • Structure, format, quantity, etc. • Each ‘Variable’ corresponds with an input/output ‘Entity’ of an ‘Activity’ • ‘Plans’ have their own provenance info • E.g. who was responsible for the creation of a plan?

  31. Why model plans besides provenance? The benefits of modeling plans: • Forces the recording of what an activity and its input/output should look like • Provides information on the original idea behind an activity • As such, can provide info on possible assumptions and biases • Allows for comparing the actual activity and its input/output with the original plan and its variables • Do they differ from each other and to what extent? • Makes finding errors much easier, as more information is available about what the input/output should look like
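
Comparing a plan with what actually happened might look like the sketch below. It assumes that each planned ‘Variable’ states an expected format and a minimum quantity, and that each actual ‘Entity’ records what was really produced. The field names and the `deviations` function are hypothetical and do not come from P-PLAN itself.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    """What the plan expects of an input or output (hypothetical fields)."""
    name: str
    fmt: str
    min_items: int


@dataclass
class Entity:
    """What an activity actually consumed or produced."""
    name: str
    fmt: str
    items: int


def deviations(variables, entities):
    """List every way the actual entities differ from the planned variables."""
    by_name = {e.name: e for e in entities}
    problems = []
    for v in variables:
        e = by_name.get(v.name)
        if e is None:
            problems.append(f"{v.name}: missing")
            continue
        if e.fmt != v.fmt:
            problems.append(f"{v.name}: format {e.fmt}, planned {v.fmt}")
        if e.items < v.min_items:
            problems.append(f"{v.name}: {e.items} items, planned at least {v.min_items}")
    return problems


plan = [Variable("biography_text", "plain text", 1), Variable("birth_event", "RDF", 1)]
actual = [Entity("biography_text", "plain text", 1), Entity("birth_event", "RDF", 0)]
report = deviations(plan, actual)
print(report)
```

Here the activity ran, but the planned birth event was never produced, which is the kind of error that is hard to spot from provenance alone.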

  32. BiographyNet: Schema illustration

  33. [Diagram: schema illustration with the nodes Variable, Plan, Agent (Person), Association, Agent (NLP Tool), Entity and Activity]

  34. Recap / Current Status Main components of the demonstrator • Initial schema available (publication LISC @ISWC 2013) • Schema models enrichments and aggregations alongside original sources • Allows for storing various levels of provenance information • Model will be adapted while progressing with building the demonstrator • Initial conversion to Linked Data available • Structure according to schema presented • Next step is linking to external sources • NLP system setup available (Antske) • Interface • Presentation of general outline and ideas

  35. Interface: Focus • The interface should be easy to use • The demonstrator should inspire historians to undertake new research and give direction, rather than being the ‘closing factor’ in their research • The interface should allow users to ‘fine tune’ results returned upon an initial action

  36. Interface: Options • Query composition • Faceted browsing • A combination

  37. Interface: Query composition • Drop down boxes to select ‘Verbs’, data elements and relations

  38. Interface: Faceted browsing • No explicit querying, but convergence of the data through browsing and selecting • Provides better feedback to the user • Allows for more direct and easier adjustment of the selected data

  39. Interface: Faceted browsing

  40. Interface: A combination • Query composition combined with faceted browsing • Create new facets by defining a query • The result of the query is available as a subset of the data by selecting the defined facet • As such, combinable with other facets • Method to integrate ‘open’ querying of the data into a general interface and visualization
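
One way to read this combination is that a facet is simply a named predicate over records, so a user-defined query can be registered as a new facet and then combined with the existing ones. The sketch below assumes that reading. Apart from Thorbecke's birth place and year, the records are invented placeholders, and `define_facet`, `field_facet` and `select` are hypothetical names.

```python
# Placeholder records; "Person B" and "Person C" are invented.
RECORDS = [
    {"name": "Thorbecke", "birth_place": "Zwolle", "birth_year": 1798},
    {"name": "Person B", "birth_place": "Zwolle", "birth_year": 1850},
    {"name": "Person C", "birth_place": "Leiden", "birth_year": 1790},
]

facets = {}


def define_facet(name, predicate):
    """Register a facet: a named test that a record passes or fails."""
    facets[name] = predicate


def field_facet(field, value):
    """Ordinary facet: select records whose field equals the value."""
    return lambda record: record.get(field) == value


def select(records, *facet_names):
    """Return the records that satisfy all selected facets."""
    return [r for r in records if all(facets[n](r) for n in facet_names)]


# A standard facet from browsing, and a new facet defined by a query.
define_facet("born in Zwolle", field_facet("birth_place", "Zwolle"))
define_facet("born before 1800", lambda record: record["birth_year"] < 1800)

names = [r["name"] for r in select(RECORDS, "born in Zwolle", "born before 1800")]
print(names)
```

Because the query result is exposed as a facet, it can be switched on and off like any other selection in the interface.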

  41. Interface: A combination [Diagram with the elements Facets, Results, Selection, Process, Data, Question and Analysis]

  42. Interface: Demonstrator • Time and place are primary elements • Results?

  43. BiographyNet Text Mining. eScience Center, 18 September 2013

  44. First year goals for Text Mining • Methodology • Requirements • Approach • Basic System for data enrichment in text • Identify metadata in text • Setup that can easily be improved and extended • (co-referenced) named entities, events • Deal with alternative spelling

  45. Methodology Requirements • Reproducing results in Natural Language Processing is non-trivial • Details in implementations or experimental setup can influence results up to a point where they tell a different story

  46. Reproducing results • Example: Performance of WordNet similarity scores compared to human ranking:

  47. Reproducing results • Clear registration of all steps involved and storage of (intermediate) system output can improve reproducibility • Systematic testing can help to gain insight into the variation of the outcome of our systems and hence lead to more insight into their performance • Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen and Nuno Freire (2013) Offspring from Reproduction Problems: What Replication Failure Teaches Us. In: Proceedings of ACL 2013, Sofia, Bulgaria, August 2013.
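
Registering steps and storing intermediate output could be sketched as a small pipeline runner that logs each step's name, parameters, output and a hash of that output, so that two runs can be compared step by step. This is an assumption about how such registration might be done, not the project's setup; the `run_pipeline`, `tokenize` and `keep_years` functions are hypothetical.

```python
import hashlib
import json


def run_pipeline(text, steps):
    """Apply the steps in order and log every intermediate output."""
    log, data = [], text
    for name, func, params in steps:
        data = func(data, **params)
        # Hash a canonical JSON form of the output for easy comparison.
        digest = hashlib.sha256(
            json.dumps(data, sort_keys=True).encode("utf-8")
        ).hexdigest()
        log.append({"step": name, "params": params, "output": data, "sha256": digest})
    return data, log


def tokenize(text, lowercase):
    """Whitespace tokenizer; the lowercase flag is a recorded parameter."""
    tokens = text.split()
    return [t.lower() for t in tokens] if lowercase else tokens


def keep_years(tokens, min_year, max_year):
    """Keep tokens that look like a year within the given range."""
    return [t for t in tokens if t.isdigit() and min_year <= int(t) <= max_year]


STEPS = [
    ("tokenize", tokenize, {"lowercase": True}),
    ("keep_years", keep_years, {"min_year": 1000, "max_year": 2013}),
]

result, log = run_pipeline("Thorbecke werd in 1798 geboren", STEPS)
rerun, relog = run_pipeline("Thorbecke werd in 1798 geboren", STEPS)

# Identical hashes at every step indicate that the rerun reproduced the first run.
same = [a["sha256"] == b["sha256"] for a, b in zip(log, relog)]
print(result, same)
```

If a later run differs, the first step with a mismatching hash shows where the two runs diverged, and the logged parameters show whether a setting changed.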

  48. Methodology requirements • The method used to extract information may introduce a bias that has unintended influence on the outcome of the historian’s questions • For example: location identification with GeoNames • Heuristic: when multiple locations have the same name, take the one in or closest to the Netherlands • High precision, but ‘America’, ‘Willemstad’: what if the historian investigates trips to the Netherlands by officials overseas?
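
The heuristic and its bias can be illustrated with a small sketch. It does not call GeoNames: the candidate list is hand-written, the coordinates are approximate, and the reference point for ‘the Netherlands’ and the `resolve` function are assumptions made for this example.

```python
from math import asin, cos, radians, sin, sqrt

# Rough reference point inside the Netherlands (assumption).
NL_REFERENCE = (52.1, 5.3)

# Hand-written candidates with approximate coordinates, standing in
# for what a gazetteer lookup might return for an ambiguous name.
CANDIDATES = {
    "Willemstad": [
        {"country": "NL", "lat": 51.69, "lon": 4.44},    # North Brabant
        {"country": "CW", "lat": 12.11, "lon": -68.93},  # Curaçao
    ],
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))


def resolve(name):
    """Heuristic from the slide: pick the candidate closest to the Netherlands."""
    return min(
        CANDIDATES[name],
        key=lambda c: haversine_km(NL_REFERENCE, (c["lat"], c["lon"])),
    )


choice = resolve("Willemstad")
print(choice["country"])
```

The heuristic always returns the Dutch Willemstad, so a biography of an official stationed on Curaçao would be linked to the wrong place, which is exactly the kind of bias a historian studying overseas careers needs to know about.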

  49. Methodology requirements • Maximize reuse of existing tools for BiographyNet • Maximize reuse of tools developed within BiographyNet by other researchers • How can we create a setup that facilitates this?

