Explore challenges in extracting data across multiple text datasets for arts and humanities research, using semantic web technologies and historical sources. Discuss user requirements analysis and interaction problem solving methods.
Armadillo Data Extraction Across Multiple Text Datasets for Arts and Humanities Research Mark Greengrass University of Sheffield 15 July 2007 (c) M.Greengrass
Response to the RePAH questionnaire (2005-6), aggregate of all Arts and Humanities respondents (RePAH: A User Requirements Analysis Report (2006), p. 102).
RePAH, A user requirements analysis… (2006), p. 109
Some Distinctive Features of Historians’ Approach to Their Evidence
• Promiscuous range of sources consulted
• Firm distinction between primary and secondary sources
• Complex dialogue between existing historiography and constitutive source materials
• Reiterative process of open interrogation of source materials
• A ‘coherent’ narrative is (generally) composed from more than one source
Historians’ Database Challenge
• Growing number of (mainly text-based) historical datasets in electronic media, furnished by a wide variety of providers
• These datasets utilise a variety of different historical sources
• They contain varying amounts of encoded information (dependent on the historical question being asked by the PI, and on the constraints of the particular source being used)
• The information is encoded in different ways
• The delivery formats used also vary widely
Sources
The Marine Society Registers The Westminster Historical Database Eighteenth Century Fire Insurance Policies Prerogative Court of Canterbury Wills The Proceedings of the Old Bailey AHDS Deposits St. Martin’s Settlement Exams Index WESTCAT Collage image database Guildhall Library Harben’s Dictionary of London John Strype’s “Survey…” Metropolitan London in the 1690s IHR Selected Criminal Records TNA http://www.motco.com House of Lords Journals BOPCRIS
The Old Bailey Proceedings: XML
<trial>
<p> <person> <defend gender="m"><given>William</given><surname>Mawn</surname></defend> </person> was Tryed for <off> <theft type="animals">stealing a Bay Gelding price 20 l.</theft> </off> from one <victim gender="m"><given>Thomas</given><surname>Lane</surname></victim> out of Berkshire on the <cd>25th of April</cd>. The Witness swore that the Horse was found in the Prisoner's custody in Smithfield, which the Prosecutor owned to be his. The Prisoner could not produce any Evidence to prove that he came honestly by the Horse only produc'd a Felonious person, that was no stranger to Newgate, who went under the Notion of his Man, he declared that the Prisoner bought the Horse upon the Road beyond Uxbridge. The Prisoners being found in several faultering stories, he was found <verdict> <guilty>Guilty</guilty> </verdict>.</p>
<p> <punish><death><note type="editorial">[Death. See summary.]</note></death></punish> </p>
</trial>
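To illustrate how much structured information this markup already carries, here is a minimal Python sketch that reads the trial with the standard-library `xml.etree.ElementTree` module. The narrative text in `TRIAL_XML` is abridged from the slide, and the helper names `full_name` and `summarise_trial` are hypothetical, introduced only for this example; they are not part of the Old Bailey project or of Armadillo.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the trial shown on the slide (narrative text shortened).
TRIAL_XML = (
    '<trial><p><person><defend gender="m"><given>William</given>'
    '<surname>Mawn</surname></defend></person> was Tryed for '
    '<off><theft type="animals">stealing a Bay Gelding price 20 l.</theft></off> '
    'from one <victim gender="m"><given>Thomas</given><surname>Lane</surname></victim> '
    'out of Berkshire on the <cd>25th of April</cd>. He was found '
    '<verdict><guilty>Guilty</guilty></verdict>.</p>'
    '<p><punish><death><note type="editorial">[Death. See summary.]</note>'
    '</death></punish></p></trial>'
)


def full_name(element):
    """Join the <given> and <surname> children of a person-like element."""
    given = element.findtext("given", default="")
    surname = element.findtext("surname", default="")
    return f"{given} {surname}".strip()


def summarise_trial(xml_text):
    """Pull the encoded facts out of one <trial> element."""
    trial = ET.fromstring(xml_text)
    offence = trial.find(".//off/*")        # first child of <off>, e.g. <theft>
    verdict = trial.find(".//verdict/*")    # e.g. <guilty>
    punishment = trial.find(".//punish/*")  # e.g. <death>
    return {
        "defendant": full_name(trial.find(".//defend")),
        "victim": full_name(trial.find(".//victim")),
        "offence_category": offence.tag,
        "offence_type": offence.get("type"),
        "date": trial.findtext(".//cd"),
        "verdict": verdict.tag,
        "punishment": punishment.tag,
    }


summary = summarise_trial(TRIAL_XML)
```

Here the element names themselves (`theft`, `guilty`, `death`) act as the encoded categories, which is why the sketch reads tags rather than text for offence, verdict and punishment.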
Canterbury Wills: Delimited Text
2530553 W Agnes Kervill or Kervytt
2530553 W Andrew Bridham London
2530553 W Andrew Pykeman London
2530553 W Austin Hawkyns
2530553 W Cecilia Foster
2530553 W Christian Chepman
2530553 W Christian Cust
2530553 W David Syadine Bristol,
2530553 W Edmund Bybbesworth
2530553 W Edward Wellys Hadley,
2530553 W Ellen Lacy Widow Saint Pe
2530553 W Gerard Heshull
2530553 W Guy Shuldham
2530553 W Helmingus Leget
2530553 W Henry Porter
2530553 W Henry Warlegh Keynesha
2530553 W Henry Wellis
2530553 W Hugh Caundyssh
2530553 W Hugh Geynesburgh Rector
2530553 W Isabelle Woodhill
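By contrast with the XML sample, a delimited record carries its structure only in field position. The sketch below is a hedged illustration in Python: the original delimiter is not visible on the slide, so it assumes whitespace separation and only claims the first three fields (identifier, code, forename). Everything after that is kept as an unparsed remainder, because surname, status and place cannot be told apart without the real delimiters. `WillRecord` and `parse_will_line` are hypothetical names.

```python
from typing import NamedTuple


class WillRecord(NamedTuple):
    record_id: str
    code: str
    forename: str
    remainder: str  # surname plus any status/place text, left unparsed


def parse_will_line(line):
    """Split one record on whitespace (an assumption, see the lead-in)."""
    parts = line.split(maxsplit=3)
    if len(parts) < 4:
        raise ValueError(f"unexpected record layout: {line!r}")
    return WillRecord(*parts)


record = parse_will_line("2530553 W Andrew Bridham London")
```

With the real delimiters available, the remainder could be split into separate surname and place fields in the same way.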
The Issues
Can the technologies developed for the ‘semantic web’ help us:
• To structure the (different) encoded information across varying sources in a way that the user community will find fruitful for research?
• To understand the way in which these different sources relate to one another, such that they can be used in an intelligent fashion?
• To ‘bootstrap’ relevant historical/semantic information from one source, by using another?
Data ‘Sharing’ and Data ‘Re-use’
• Re-use means building new applications by assembling components that have already been built
• Sharing is when different applications use the same resources
(c) Oscar Corcho (with acknowledgement)
Interaction Problem
Representing knowledge for the purpose of solving some problem is strongly affected by the nature of the problem and the inference strategy to be applied to the problem.
• Ontologies: describe domain knowledge in a generic way and provide an agreed understanding of a domain
• Problem Solving Methods: describe the reasoning process of a dataset (‘Knowledge-Based System’) in a domain-independent manner
Bylander, T. and Chandrasekaran, B., Generic tasks in knowledge-based reasoning: the right level of abstraction for knowledge acquisition. In B. R. Gaines and J. H. Boose (eds), Knowledge Acquisition for Knowledge-Based Systems, 65-77. London: Academic Press, 1988.
(c) O. Corcho (with acknowledgement)
Definitions of an Ontology
1. “An ontology defines the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary”
Neches R, Fikes RE, Finin T, Gruber TR, Senator T, Swartout WR (1991) Enabling technology for knowledge sharing. AI Magazine 12(3):36–56
2. “An ontology is an explicit specification of a conceptualization”
Gruber TR (1993a) A translation approach to portable ontology specification. Knowledge Acquisition 5(2):199–220
3. “An ontology is a formal, explicit specification of a shared conceptualization”
Studer R, Benjamins VR, Fensel D (1998) Knowledge Engineering: Principles and Methods. IEEE Transactions on Data and Knowledge Engineering 25(1-2):161–197
4. “A logical theory which gives an explicit, partial account of a conceptualization”
Guarino N, Giaretta P (1995) Ontologies and Knowledge Bases: Towards a Terminological Clarification. In: Mars N (ed) Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing (KBKS’95). University of Twente, Enschede, The Netherlands. IOS Press, Amsterdam, The Netherlands, pp 25–32
5. “A set of logical axioms designed to account for the intended meaning of a vocabulary”
Guarino N (1998) Formal Ontology in Information Systems. In: Guarino N (ed) 1st International Conference on Formal Ontology in Information Systems (FOIS’98). Trento, Italy. IOS Press, Amsterdam, pp 3–15
(c) O. Corcho (with acknowledgement)
Key Components of an Ontology
• Concepts: organized in taxonomies
• Relations: R: C1 x C2 x ... x Cn-1 x Cn
Subclass-of: Concept1 x Concept2
Connected to: Component1 x Component2
• Functions: F: C1 x C2 x ... x Cn-1 --> Cn
Mother-of: Person --> Woman
Price of a used car: Model x Year x Kilometers --> Price
• Instances: elements
• Axioms: sentences which are always true
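The five components listed above can be made concrete with a small Python sketch. It is a toy model rather than a real ontology language such as OWL: the class `Ontology`, its method names, and the individuals `anne` and `mary` are all hypothetical and chosen only to mirror the slide's Subclass-of and Mother-of examples.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    parents: dict = field(default_factory=dict)    # concept -> parent (Subclass-of relation)
    instances: dict = field(default_factory=dict)  # instance -> concept
    mother_of: dict = field(default_factory=dict)  # function Person --> Woman

    def add_concept(self, concept, parent=None):
        self.parents[concept] = parent

    def add_instance(self, name, concept):
        self.instances[name] = concept

    def is_a(self, name, concept):
        """Walk up the taxonomy from the instance's own concept."""
        current = self.instances[name]
        while current is not None:
            if current == concept:
                return True
            current = self.parents.get(current)
        return False

    def axiom_mothers_are_women(self):
        """Axiom: every value of Mother-of must be a Woman."""
        return all(self.is_a(m, "Woman") for m in self.mother_of.values())


onto = Ontology()
onto.add_concept("Person")
onto.add_concept("Woman", parent="Person")
onto.add_instance("anne", "Woman")  # hypothetical individuals
onto.add_instance("mary", "Woman")
onto.mother_of["mary"] = "anne"     # Mother-of(mary) = anne
```

Storing Mother-of as a dictionary keeps it single-valued, which is what distinguishes a function from a general relation in the slide's notation.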
Semantic Continuum and Formality
• Implicit: shared human consensus (e.g. language)
• Informal [explicit]: text descriptions (e.g. dictionaries)
• Formal (for humans): semantics hardwired; used at runtime (e.g. library catalogues)
• Formal [for machines]: semantics processed and used at runtime (e.g. see below)
(c) M.Greengrass, after Corcho
http://www.vicodi.org
• Web-based ‘secondary’ historical writing: ‘top-down ontologies’ (generated from discipline-accepted taxonomies)
• ‘Middle-out ontologies’ (generated by intelligent iteration)
• Primary sources (historical documents; images; artefacts) in electronic media: ‘bottom-up ontologies’ (generated from a representative sample of canonical data)
John Wilkins, An Essay towards a Real Character and a Philosophical Language (1668)
Armadillo – a Semantic Agent
• Retrieves information according to pre-agreed ontologies
• Takes account of deviations in spelling, typographic formatting and contextual information
• Makes use of delimited fields and tagged data as ‘oracles’ to provide firm instantiations of elements in an ontology to apply to electronic materials which have no such structure
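The ‘oracle’ idea in the last bullet can be sketched in a few lines of Python. This is not Armadillo's actual method, which the slides do not describe in detail; it is a stand-in that uses the standard-library `difflib` module for spelling tolerance. The oracle surnames come from the Old Bailey XML sample earlier in the deck, while the untagged sentence is invented for the example, and `build_gazetteer` and `tag_names` are hypothetical names.

```python
import difflib
import re


def build_gazetteer(tagged_records):
    """Collect surnames from an already-structured source (the 'oracle')."""
    return sorted({surname.lower() for _given, surname in tagged_records})


def tag_names(text, gazetteer, cutoff=0.8):
    """Return (token, matched surname) pairs for tokens close to an oracle name."""
    matches = []
    for token in re.findall(r"[A-Za-z]+", text):
        close = difflib.get_close_matches(token.lower(), gazetteer, n=1, cutoff=cutoff)
        if close:
            matches.append((token, close[0]))
    return matches


# Names taken from the tagged Old Bailey trial shown earlier.
ORACLE = [("William", "Mawn"), ("Thomas", "Lane")]
# Invented, untagged sentence with variant spellings (assumption for illustration).
UNTAGGED = "The said Wm. Mawne was seen with the horse of Thos. Layne"

found = tag_names(UNTAGGED, build_gazetteer(ORACLE))
```

The `cutoff` value trades recall against false matches; a real system would also need the contextual cues mentioned in the second bullet, which plain string similarity cannot supply.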
Automated Text-Mining, used for tagging purposes in Central Criminal Court records
<p>CENTRAL CRIMINAL COURT,</p>
<p>Held on Monday, December 17th, 1866, and following days,</p>
<p><sc>BEFORE THE RIGHT HON.</sc> <lc><name role="judiciary" given="THOMAS" surname="GABRIEL" sex="m" age="na">THOMAS GABRIEL</name>, LORD MAYOR</lc> of the City of London; Sir <sc><name role="judiciary" given="JOHN" surname="MELLOR" sex="m" age="na">JOHN MELLOR</name></sc>, Knt., one of the Justices of Her Majesty's Court of Queen's Bench; <sc><name role="judiciary" given="WILLIAM TAYLOR" surname="COPELAND" sex="m" age="na">WILLIAM TAYLOR COPELAND</name></sc>, Esq., <sc><name role="judiciary" given="THOMAS" surname="CHALLIS" sex="m" age="na">THOMAS CHALLIS</name></sc>, Esq., <sc>THOMAS QUESTED FINNIS</sc>, Esq., Sir <sc><name role="judiciary" given="ROBERT WALTER" surname="CARDEN" sex="m" age="na">ROBERT WALTER CARDEN</name></sc>, Knt., and <sc><name role="judiciary" given="WILLIAM" surname="LAWRENCE" sex="m" age="na">WILLIAM LAWRENCE</name></sc>, Esq., Aldermen of the said City;
Automated Text-Mining, used for tagging purposes in Central Criminal Court records – with less success!
<p>CENTRAL CRIMINAL COURT,</p>
<p>Held on Monday, July 22nd, 1912, and following days.</p>
<p>Before the Right Hon. Sir <lc>THOMAS BOOR CROSBY, M.D., LORD MAYOR</lc> of the said City of London; the Right Hon. Lord <sc>COLERIDGE</sc>, one of the Justices of His Majesty's High Court; Sir <sc><name role="judiciary" given="HENRY" surname="KNIGHT" sex="m" age="na">HENRY KNIGHT</name></sc>, Knight; Sir <sc><name role="judiciary" given="HORATIO" surname="DAVIES" sex="m" age="na">HORATIO DAVIES</name></sc>, K.C.M.G.; Sir <sc><name role="judiciary" given="JOHN" surname="POUND" sex="m" age="na">JOHN POUND</name></sc>, Bart.; Sir <sc>GEORGE W. TRUSCOTT</sc>, Bart.; Sir <sc><name role="judiciary" given="CHARLES" surname="JOHNSTON" sex="m" age="na">CHARLES JOHNSTON</name></sc>, Knight; and Sir <sc>HORACE B. MARSHALL</sc>, Knight, LL.D., Aldermen of the said City; Sir <sc>FORREST FULTON</sc>, Knight, K.C., Recorder of the said City; Sir <sc>FK. ALBERT BOSANQUET</sc>, K.C., Common Serjeant of the said City;
[Slide annotations: Not identified; Not identified]
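One simple way to audit output like this is to list the small-caps spans that the tagger left without a `<name>` element. The Python sketch below does that with a regular expression, since the excerpt is a fragment and not a complete XML document. `SAMPLE` is a shortened extract of the 1912 header above, and `untagged_small_caps` is a hypothetical helper, not part of the project's tooling.

```python
import re

# Shortened extract from the 1912 header shown above.
SAMPLE = (
    'Sir <sc><name role="judiciary" given="HENRY" surname="KNIGHT" sex="m" '
    'age="na">HENRY KNIGHT</name></sc>, Knight; '
    'Sir <sc>GEORGE W. TRUSCOTT</sc>, Bart.; '
    'Sir <sc>FK. ALBERT BOSANQUET</sc>, K.C., Common Serjeant of the said City;'
)


def untagged_small_caps(markup):
    """Return the text of every <sc> span that contains no <name> element."""
    spans = re.findall(r"<sc>(.*?)</sc>", markup, flags=re.DOTALL)
    return [span for span in spans if "<name" not in span]


missed = untagged_small_caps(SAMPLE)
```

A list like this only shows where the tagger stopped short; deciding why a given name was missed still requires looking at each case.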