
BiographyNed

BiographyNed. eScience Center 21 March 2013. Methodological Issues. How telling is the output of our tools? Selection made by (editors of) dictionaries. Reliability of automated text analysis. Introduction of biases in the methodology.


Presentation Transcript


  1. BiographyNed eScience Center 21 March 2013

  2. Methodological Issues How telling is the output of our tools? • Selection made by (editors of) dictionaries • Reliability of automated text analysis • Introduction of biases in the methodology Careful evaluation and detailed communication are required…

  3. Statistics on available information

  4. Textual Information per person

  5. Availability of Information in the portal

  6. Presence of information for governors of the Dutch Indies (% of 71 individuals)

  7. Biography Portal of the Netherlands. The Sources

  8. Overview • History and Biography • Where do eScience and History meet? • Use Cases

  9. Historical Research The Art and Science of History: Drawing up a narrative from primary and secondary sources which approximates historical reality as well as possible.

  10. Building Blocks and Concrete • Building blocks: facts derived mainly from archival findings and existing literature • Concrete: the methods historians use to put them together into a narrative/synthesis. • The Narrative: a historical synthesis which cannot be scientifically proven (only made likely) based on facts which can be proven or falsified. There is necessarily a creative element in drawing up a narrative

  11. Example: Grand Pensionary Johan de Witt (1625-1672) • Building blocks: born in 1625; son of Jacob and Anna van den Corput; appointed grand pensionary in 1653; murdered in The Hague in 1672; enemy of William (III) of Orange; William of Orange rewarded one of the instigators of the murder • Concrete: (logic) Based on these last data it is likely that William ordered the death of Johan • Narrative: William probably ordered the death of Johan <= proposition based on facts and reasoning

  12. The House of History

  13. The Importance of Provenance The only way to falsify presented historical facts is by going back to the original source(s) and looking at those sources critically. It is highly important to be able to know what information comes from where exactly.

  14. Our Sources Here • The Metadata: building blocks • The entries in biographical dictionaries themselves: short historical narratives

  15. Status of Biography in Academia and Society • Despite improved efforts this century to embed biography in academic theories and methods, some (e.g. some social historians) still do not consider it a worthy academic discipline, finding it too anecdotal and limited. • Biography is the most popular non-fiction genre in bookstores (from both academic and lay authors)

  16. Where do eScience and History meet? (I) “And when the capsule biography of an individual is combined with 50,000 others, many of them relatively obscure, […] and when they are all powerfully searchable online, the social historian’s grumbles about biography’s limitations as an approach to historical study dissolves into nothingness.” (Brian Harrison, 2004, former editor of the Oxford Dictionary of National Biography)

  17. Where do eScience and History meet? (II) A. Quantitative analyses of a larger group of people (prosopography). Surpassing the anecdotal. B. Finding relations/networks between people which are otherwise hard to detect

  18. Where do eScience and History meet? (III) C. Insight in Historiography and historical selectivity. Who was described/included and why? “Undoubtedly I have deprived many interesting women by not including them. The only thing I can say to defend myself is this: history writing is also a process of ruthless selection.” (Els Kloek, Head Biography portal and main author 1001 vrouwen) D. Thematic research. E.g.: When did the discovery of America start to influence people’s lives?

  19. BiographyNed Use Cases In the initial stages of the research a list of possible historical questions within one of those four themes was drawn up (subject to change), which the demonstrator should be able to give us an answer to, or at least point in a direction/trend.

  20. Case I: Making life easier: Group portrait of the Governors-General • Highest Official in the Dutch Indies 1610-1949 • 71 men (still a relatively small group) • What can we say about these men as a group? • Who was appointed and what qualities did he have to have? • Etc.

  21. Case I: data mining • Family connections (parents/wife/children, other relevant connections <= patronage) • Place of Birth • Education • Religion • Career (patterns) • Age at appointment • Duration of holding the office • Reason for leaving the office • Place of Death
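Two of the items listed for Case I, age at appointment and duration of holding the office, are simple arithmetic once the years have been mined. The sketch below is only an illustration: the `Governor` record, its field names and the three placeholder rows are assumptions, not the portal's data model or real governors-general. It also reports how many records contributed to each figure, since the portal's metadata is incomplete.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Governor:
    """One row of the group portrait; all field names are illustrative."""
    name: str
    birth_year: Optional[int]
    appointed_year: Optional[int]
    left_office_year: Optional[int]


def age_at_appointment(g: Governor) -> Optional[int]:
    # Year-level approximation; None when a value is missing.
    if g.birth_year is None or g.appointed_year is None:
        return None
    return g.appointed_year - g.birth_year


def years_in_office(g: Governor) -> Optional[int]:
    if g.appointed_year is None or g.left_office_year is None:
        return None
    return g.left_office_year - g.appointed_year


def median_of_known(values):
    """Median over the known values, plus how many records contributed."""
    known = [v for v in values if v is not None]
    return (median(known) if known else None, len(known))


# Placeholder records, not real governors-general.
group = [
    Governor("Person A", 1600, 1640, 1645),
    Governor("Person B", 1650, 1700, 1704),
    Governor("Person C", None, 1750, 1753),
]

median_age, n_age = median_of_known(age_at_appointment(g) for g in group)
median_term, n_term = median_of_known(years_in_office(g) for g in group)
print(f"median age at appointment: {median_age} (n={n_age})")
print(f"median years in office: {median_term} (n={n_term})")
```

Reporting `n` next to each statistic keeps missing metadata visible instead of silently shrinking the group.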

  22. Case I: Time and Effort More than 1 full week to manually mine this information from the Biography Portal. Can a historian do this with (almost) the same results in under one hour if helped by the demonstrator?

  23. Case II: Making things possible: The Dutch Nation & Identity • Who were selected to be included in National Biographical Dictionaries and why? (what was their claim to fame?) • Are there different perspectives on the same person over time and how can this be explained? • Who was deemed most important? (based on the length of the entries) • What time periods are most represented? • Is there a difference in claim to fame for people from different periods in history, or between men and women? • Which words are used most often and can we link them to national identities?
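Two of these questions, importance measured by entry length and the most frequently used words, reduce to counting tokens. The following is a minimal sketch under stated assumptions: the entries are placeholders rather than portal texts, the tokenizer is a plain regular expression, and the stop list is a tiny English stand-in for the Dutch one a real analysis would need.

```python
import re
from collections import Counter

# Placeholder entries: (person, source dictionary, text). Not real portal data.
entries = [
    ("Person A", "Dictionary X", "A statesman and a writer. He served the city."),
    ("Person A", "Dictionary Y", "A statesman."),
    ("Person B", "Dictionary X", "A painter of the city."),
]

# Tiny illustrative stop list; a real analysis would need a Dutch one.
STOPWORDS = {"a", "and", "he", "of", "the"}


def tokens(text):
    """Lowercased word tokens of a text."""
    return re.findall(r"\w+", text.lower())


def length_by_person(rows):
    """Total number of tokens per person across all dictionaries."""
    totals = Counter()
    for person, _source, text in rows:
        totals[person] += len(tokens(text))
    return totals


def frequent_words(rows, n=3):
    """The n most frequent non-stopword tokens over all entries."""
    counts = Counter(
        tok
        for _person, _source, text in rows
        for tok in tokens(text)
        if tok not in STOPWORDS
    )
    return counts.most_common(n)


print(length_by_person(entries).most_common())
print(frequent_words(entries, 2))
```

Entry length is a crude proxy: it also reflects the house style of each dictionary, so comparisons are safer within one source than across sources.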

  24. Case II: More Questions … • What events are mentioned most often and what does that say about the status quaestionis of how the Dutch see/saw themselves? • What are the differences in the answers to these questions between several national biographical dictionaries? • Are people and events described or appreciated differently over time? Does the perspective change? • How does this relate to biographical dictionaries, nations and identities elsewhere in Europe?

  25. Conversion to Linked Data

  26. A crash course on Linked Data Online machine readable data with links • Simple facts called ‘RDF Triples’ • Thorbecke > hasBirthPlace > Zwolle Some technology concepts: • Schemas: To structure LD • RDF Stores: To store LD • SPARQL: To access LD Huge growth in the past years: • More than 300 data sources • More than 30 billion triples
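To make the triple idea concrete, here is a toy in-memory store in plain Python. It is not an RDF library and not SPARQL; a `None` in a pattern merely plays the role of a SPARQL variable. Only the `hasBirthPlace` fact comes from the slide; the other two predicate names are assumptions added so that a two-link query can be shown.

```python
# A triple is (subject, predicate, object). None in a pattern is a wildcard,
# playing the role of a SPARQL variable such as ?place.
triples = {
    ("Thorbecke", "hasBirthPlace", "Zwolle"),  # the fact from the slide
    ("Thorbecke", "hasBirthYear", "1798"),     # hypothetical predicate name
    ("Zwolle", "locatedIn", "Overijssel"),     # hypothetical predicate name
}


def match(store, s=None, p=None, o=None):
    """Return all triples that agree with the non-wildcard positions."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def born_in_region(store, region):
    """Follow two links: person -> birth place -> region."""
    places = {t[0] for t in match(store, p="locatedIn", o=region)}
    return sorted(t[0] for t in match(store, p="hasBirthPlace") if t[2] in places)


# Roughly: SELECT ?place WHERE { Thorbecke hasBirthPlace ?place }
print(match(triples, s="Thorbecke", p="hasBirthPlace"))
print(born_in_region(triples, "Overijssel"))
```

The second query shows why links matter: the birth place stored for a person can be joined with facts about that place that come from a different data source.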

  27. The conversion process Purely syntactic conversion • Preserve the original structure of the data • Prevent loss of information • Allow for reinterpretation of the original data in the future Data Preservation

  28. The conversion process Conversion steps: • Retrieval of XML dump of the Biography Portal • Initial conversion to ‘crude’ RDF • Using ClioPatria and the XMLRDF tool for ClioPatria • RDF restructuring • Linking to other sources • Essential step in the ‘Linked Data’ philosophy
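The "initial conversion to crude RDF" step can be pictured as a walk over the XML tree that turns every element into a node and every attribute or text value into a literal, without interpreting anything. The sketch below uses Python's standard library rather than ClioPatria or its XMLRDF tool, whose actual behaviour is not shown here; the XML layout, the `BASE` namespace and the `type`/`value` predicate names are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real Biography Portal dump may differ.
SAMPLE_XML = """
<biography id="bio1" source="NNBW">
  <person>
    <name>Johan Rudolph Thorbecke</name>
    <birth><date>1798</date></birth>
  </person>
</biography>
"""

BASE = "http://example.org/bio/"  # placeholder namespace


def crude_triples(xml_text):
    """Mirror the XML tree one-to-one as (subject, predicate, object) tuples.

    Element names become predicates, attributes and text become literals.
    Mixed content (text after a child element) is not handled in this sketch.
    """
    root = ET.fromstring(xml_text)
    out = []
    counter = 0

    def visit(elem, parent_uri):
        nonlocal counter
        counter += 1
        uri = f"{BASE}node{counter}"
        if parent_uri is None:
            # The root has no incoming link, so record its element name.
            out.append((uri, BASE + "type", elem.tag))
        else:
            out.append((parent_uri, BASE + elem.tag, uri))
        for key, value in elem.attrib.items():
            out.append((uri, BASE + key, value))
        text = (elem.text or "").strip()
        if text:
            out.append((uri, BASE + "value", text))
        for child in elem:
            visit(child, uri)

    visit(root, None)
    return out


for triple in crude_triples(SAMPLE_XML):
    print(triple)
```

Because nothing is interpreted at this stage, later restructuring and linking can be redone at any time from the same crude triples.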

  29. The conversion process Data schema: • Based on the structure of the original XML files • Needs to facilitate the coupling of different biographies of the same person, without compromising the original data • Needs to facilitate the incorporation of several enrichments, following from NLP, Entity Reconciliation, etc. • Compatible with existing schemas such as the Europeana Data Model, PROV, RDAgr2, FOAF, DC terms

  30. BiographyNed schema (diagram) • Person “Thorbecke” • Biographical Description from the NNBW, with Provenance Meta Data, Person Meta Data (Birth Event: 1798) and Biography Parts (text: “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…”, i.e. “Johan Rudolph Thorbecke was born in 1798 on 14 January in Zwolle and comes from a half-German…”) • Enrichment by an NLP Tool: a further Biographical Description with Person Meta Data (Birth Event: 1798-01-14, Zwolle)
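One way to read the schema slide is that a person aggregates several biographical descriptions, each carrying its own provenance, so that an enrichment produced by a tool sits next to the original entry instead of overwriting it. The dataclasses below are a sketch of that reading; the class and field names are assumptions and do not reproduce the project's RDF vocabulary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    kind: str                    # e.g. "birth"
    date: Optional[str] = None   # as given by the source, e.g. "1798"
    place: Optional[str] = None


@dataclass
class BiographicalDescription:
    provenance: str              # dictionary name or tool that produced it
    events: List[Event] = field(default_factory=list)
    parts: List[str] = field(default_factory=list)  # biography text


@dataclass
class Person:
    label: str
    descriptions: List[BiographicalDescription] = field(default_factory=list)

    def claims(self, kind):
        """All statements about one event kind, each with its provenance."""
        return [
            (d.provenance, e.date, e.place)
            for d in self.descriptions
            for e in d.events
            if e.kind == kind
        ]


original = BiographicalDescription(
    provenance="NNBW",
    events=[Event("birth", date="1798")],
    parts=["Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle ..."],
)
enrichment = BiographicalDescription(
    provenance="NLP tool",
    events=[Event("birth", date="1798-01-14", place="Zwolle")],
)
thorbecke = Person("Thorbecke", [original, enrichment])

print(thorbecke.claims("birth"))
```

Keeping both claims side by side is what lets a historian trace each value back to either the dictionary or the tool that produced it.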

  31. Retrieving Information from Text

  32. The texts in the Biography Portal • Collection of biographical dictionaries • Dutch, including from the 19th and early 20th century and even older quotes • Sources (different dictionaries/collections) have their own style • Metadata available (though large differences in completeness)

  33. Challenges and Advantages • Challenges: • Little work on NLP and biographies • Performance of Dutch NLP tools on variations of Dutch • Advantages: • High quality metadata covering several categories of information (supervised machine learning) • Within sources, clear and similar structure of texts

  34. General Approach • Start by using advantages: • Use metadata to label information • A basic IR system can be built using sentence number and lemmas as features • Enhance performance with NLP tools • Build upon information retrieved in the first steps to tackle more challenging tasks
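The first two bullets of the approach, labeling with metadata and using sentence number and lemmas as features, can be sketched as follows. This is an illustration under assumptions, not the project's code: the sentence splitter is naive, lowercased tokens stand in for lemmas (which would need a Dutch lemmatizer), and the text and metadata value are placeholders.

```python
import re


def split_sentences(text):
    # Naive splitter; a real pipeline would use a Dutch sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def features(sentence, index):
    """Sentence number plus a bag of lowercased tokens.

    Lowercased tokens stand in for lemmas, which need a lemmatizer.
    """
    feats = {"sentence_number": index}
    for tok in re.findall(r"\w+", sentence.lower()):
        feats["lemma=" + tok] = 1
    return feats


def label_with_metadata(text, metadata_value):
    """Mark a sentence as positive when it contains the metadata value."""
    rows = []
    pattern = re.compile(rf"\b{re.escape(metadata_value)}\b")
    for i, sent in enumerate(split_sentences(text)):
        rows.append((features(sent, i), pattern.search(sent) is not None))
    return rows


# Placeholder biography text and metadata (birth place), for illustration only.
text = "Hij werd geboren in Zwolle. Hij studeerde in Leiden. Hij stierf in Den Haag."
training_rows = label_with_metadata(text, "Zwolle")

for feats, is_positive in training_rows:
    print(feats["sentence_number"], is_positive)
```

Rows labeled this way could be fed to any supervised classifier. The labels are only as complete as the metadata, and a sentence may mention the value for a reason other than the labeled category.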

  35. A Basic System • Supervised Machine Learning • Two step identification process (Wu and Weld 2007; 2010, Fader et al. 2011) • Identify sentence that contains information • Sequence tagging to identify information within the sentence

  36. Adding NLP • Location & Date recognition (GeoNames) • (other) Named Entities (VIAF enhanced with names from metadata) • Depending on performance of the system, we’ll work on: • Chunking, multiword recognition • Parsing • Word Sense Disambiguation
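As an illustration of the date half of "Location & Date recognition", the sketch below finds Dutch day-month(-year) expressions with a regular expression and normalizes them. The slides do not say how the project recognizes dates, so this rule-based approach is an assumption; location lookup against GeoNames is not covered.

```python
import re

MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4, "mei": 5, "juni": 6,
    "juli": 7, "augustus": 8, "september": 9, "oktober": 10,
    "november": 11, "december": 12,
}

# Day, Dutch month name, optional four-digit year.
DATE_RE = re.compile(
    r"\b(\d{1,2})\s+(" + "|".join(MONTHS) + r")(?:\s+(\d{4}))?\b",
    re.IGNORECASE,
)


def find_dates(text):
    """Return (matched text, ISO-like string) pairs.

    The year is '????' when the text gives only day and month.
    """
    found = []
    for m in DATE_RE.finditer(text):
        day = int(m.group(1))
        month = MONTHS[m.group(2).lower()]
        year = m.group(3) or "????"
        found.append((m.group(0), f"{year}-{month:02d}-{day:02d}"))
    return found


print(find_dates("geboren op 14 januari 1798 in Zwolle"))
print(find_dates("werd in 1798 geboren op 14 januari in Zwolle"))
```

The second call shows a limit of pattern matching alone: when the year stands elsewhere in the sentence, combining it with the day and month takes a further step.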

  37. Metadata & Project Goals • Duplicate detection (metadata and text) • Events/Network discovery • Education (begin, end, location) • Occupation (begin, end, location) • Relations (parents, partners) • Temporal relations between events

  38. Output first system • Better coverage of categories mentioned above • A timeline for a person’s life (birth, education, occupation, locations, death) • Named Entities in text (dates, locations, persons)

  39. Beyond the first system The information provided by the first system can be used to: • Identify alternative descriptions of events (same time, location and/or participants) • Identify relations between events (same locations & time, consequent events, same participants, etc.) • Build initial networks of people
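The first bullet, finding alternative descriptions of the same event, can be approximated by comparing time, location and participants across sources. The matching rule below is an assumption made for illustration (same year, overlapping participants, places that do not contradict each other), and the event tuples are placeholders apart from the Thorbecke birth facts shown on the schema slide.

```python
from itertools import combinations

# Extracted events: (source, kind, year, place, participants).
# "Dictionary X/Y" and "Person B" are placeholders.
events = [
    ("Dictionary X", "birth", 1798, "Zwolle", {"Thorbecke"}),
    ("Dictionary Y", "birth", 1798, None, {"Thorbecke"}),
    ("Dictionary Y", "birth", 1798, "Zwolle", {"Person B"}),
]


def same_event(a, b):
    """Candidate match: same year, overlapping participants, and places
    that do not contradict each other (a missing place matches anything)."""
    _, _, year_a, place_a, people_a = a
    _, _, year_b, place_b, people_b = b
    places_agree = place_a is None or place_b is None or place_a == place_b
    return year_a == year_b and places_agree and bool(people_a & people_b)


def alternative_descriptions(rows):
    """Pairs of events from different sources that may describe the same thing."""
    return [
        (a, b)
        for a, b in combinations(rows, 2)
        if a[0] != b[0] and same_event(a, b)
    ]


for a, b in alternative_descriptions(events):
    print(a[0], "<->", b[0])
```

Pairs found this way are candidates only; whether two descriptions really refer to one event remains a judgment that should be traceable to the sources.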

  40. Methodological issues and text interpretation • Results should be reproducible • Code release (including scripts, configurations, …) • Documentation • Open source data • The setup should be modular • Combine output of different tools • Flexible choice of methods used

  41. Evaluation Challenges (1/2) • How to evaluate the extraction tools? • Partial evaluation using metadata (10-fold cross-validation), but: • No precise indication of precision or recall (incomplete metadata…) • Biographies with rich metadata are not necessarily representative Manually annotated data needed!
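The point about incomplete metadata can be shown with a few lines of arithmetic. The sketch below gives a plain 10-fold split (without shuffling, for brevity) and a precision/recall function over sets of statements; the statements themselves are placeholders. When the metadata used as gold standard lacks a fact that the extractor found correctly, that fact is counted as an error.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; every item is tested exactly once."""
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def precision_recall(predicted, gold):
    """Both arguments are sets of (person, field, value) statements."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Placeholder statements. The extractor finds two correct facts, but the
# metadata used as gold standard records only one of them.
predicted = {("p1", "birth_place", "Zwolle"), ("p1", "education", "Leiden")}
metadata_gold = {("p1", "birth_place", "Zwolle")}
complete_gold = metadata_gold | {("p1", "education", "Leiden")}

print(precision_recall(predicted, metadata_gold))  # scored against metadata only
print(precision_recall(predicted, complete_gold))  # scored against full annotation
```

Against the incomplete metadata the measured precision is too low, which is one reason manually annotated data is needed for a reliable figure.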

  42. Evaluation Challenges (2/2) • How to compare performance of NLP tools? • Little work on biographies, little or none on Dutch ones… • How hard are older texts? Can we quantify? Systematic comparison: • English biographies (Wikipedia) • Dutch biographies (Wikipedia) • Biographies from the portal

  43. Reproducibility/Replication • What do results mean if they cannot be reproduced? • What variation in results can be expected based on details not mentioned in papers? • Which information is needed to replicate results or find the origin of differences? Paper submitted to ACL 2013 (joint work with Marieke van Erp and others)

  44. Representations (tools) • How to represent and combine output of different tools? • Compatibility (easy to convert output of external NLP tools) • Flexibility (be able to contain alternative representations and interpretations) Integrate representations in NIF (joint work with Jesper Hoeksema and Willem van Hage)

  45. Representation (events) • How to combine knowledge from the NLP community and Linked Data community? • Combination of textual information with external resources • Complete representation of information from text (location, retrieval method) Paper submitted to workshop on Events: Definition, detection, coreference and representation (joint work with Marieke van Erp, Willem van Hage, Sara Tonelli, and others)

  46. Current state of affairs • Basic system using sentence number and lemmas for main categories of metadata (evaluation ongoing) • Module for labeling locations and dates in text (adaptations to be made for modularity) • Annotation effort started for evaluation (selection of approximately 700 texts)

  47. Demonstrator

  48. Interface: Focus • The interface should be easy to use • The demonstrator should inspire historians to undertake new research and give direction, rather than being the ‘closing factor’ in their research • The interface should allow users to ‘fine tune’ the results returned upon an initial action

  49. Interface: Options • Query composition • Faceted browsing • A combination

  50. Interface: Query composition • Drop down boxes to select ‘Verbs’, data elements and relations
