SMART QUALITATIVE DATA: METHODS AND COMMUNITY TOOLS FOR DATA MARK-UP
quads.esds.ac.uk/squad

THE PROJECT
SQUAD aims to explore methodological and technical solutions for ‘exposing’ digital qualitative data to make them fully shareable and exploitable. The main objectives are to:
• specify, test and propose an eXtensible Markup Language (XML) schema for storing and marking up qualitative data
• investigate requirements for contextualising qualitative data and developing standards for data documentation
• develop semi-automated tools, using natural language processing (NLP), for preparing marked-up qualitative data for sharing
• research tools for publishing and interrogating data via the web: Qualitative Data Mark-Up Tools (QDMT)

WHAT FEATURES OF TEXT CAN BE MARKED UP?
Spoken interview texts provide the clearest and most common example of the types of encoding features that can be marked up. There are three basic groups of structural features:
• utterance, specific turn taker, defining idiosyncrasies in transcription
• links to analytic annotation and other data types (e.g. thematic codes, concepts, audio or video links, researcher annotations)
• identifying information such as real names, company names, place names, occupations, temporal information
Example:
Italy's business world was rocked by the announcement last Thursday that Mr. Verdi would leave his job as vice-president of Music Masters of Milan, Inc to become operations director of Arthur Anderson.

DEFINING CONTEXT
Rich context enables informed re-use of data. But defining how to provide context for raw data to make it more ‘usable’ is complex. ESDS Qualidata has done much to establish informal ways of documenting raw data. Micro and macro level features should be considered, including:
• how the research question was framed
• the research application process
• project progress
• fieldwork situations
• analysis processes
Fieldwork observations are useful, as are timelines and political chronologies. Equally, when undertaking a replication or restudy, detailed information on sampling procedures, fieldwork approaches and question guides will be essential. SQUAD has identified a minimal generic set of elements that represent a baseline for contextualising data.

USING NLP TOOLS
Information Extraction (IE) is a sub-field of NLP which aims to identify key pieces of information in texts using 'shallow' analysis techniques. A typical IE system will perform Named Entity Recognition, where particular kinds of proper names and terms are identified, classified and marked up. This is a means of annotating documents with semantic metadata, enabling resource discovery and data exploration. The Edinburgh LT-XML and CME tools have been used to process the data.
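To make the idea of marking up named entities concrete, the following is a minimal sketch in Python. It is not the Edinburgh LT-XML or CME pipeline: it assumes a small hand-written lookup table (a hypothetical gazetteer) in place of a real recogniser, and the tag names persName, orgName, placeName and date are TEI element names chosen here for illustration. It wraps each known name in the poster's example sentence in an inline tag.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical gazetteer (an assumption for this sketch): surface string -> tag.
# A real IE system would use trained models or grammar rules, not a fixed table.
GAZETTEER = {
    "Mr. Verdi": "persName",
    "Music Masters of Milan, Inc": "orgName",
    "Arthur Anderson": "orgName",
    "Italy": "placeName",
    "last Thursday": "date",
}


def mark_up(text, gazetteer=GAZETTEER):
    """Wrap every gazetteer entry found in text in an inline XML tag."""
    if not gazetteer:
        return escape(text)
    # Try longer names first so a long entry wins over a shorter one inside it.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    pieces, position = [], 0
    for match in pattern.finditer(text):
        # Copy the untagged text before the match, escaping XML specials.
        pieces.append(escape(text[position:match.start()]))
        tag = gazetteer[match.group(0)]
        pieces.append(f"<{tag}>{escape(match.group(0))}</{tag}>")
        position = match.end()
    pieces.append(escape(text[position:]))
    return "".join(pieces)


sentence = (
    "Italy's business world was rocked by the announcement last Thursday "
    "that Mr. Verdi would leave his job as vice-president of Music Masters "
    "of Milan, Inc to become operations director of Arthur Anderson."
)
print(mark_up(sentence))
```

A lookup table only finds names it already knows, so it cannot classify unseen names or resolve ambiguous ones; the sketch shows the shape of the marked-up output rather than how recognition itself is done.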
METADATA STANDARDS
The XML schema will specify a ‘reduced’ set of Text Encoding Initiative (TEI) elements:
• core tag set for transcription
• names, numbers, dates <persName>
• links and cross references <ref>
• notes and annotations <note>
• text structure <body>
• unique to spoken texts <kinesic>
• linking, segmentation and alignment <link>
• advanced pointing - XPointer framework
• text and AV synchronisation
• contextual information (participants, setting, text)

ANONYMISING DATA TOOL
This tool imports marked-up data from the Edinburgh pipeline system. Named entities are highlighted and co-reference chains (e.g. numerous references to a single person) are identified. Names can be anonymised with chosen pseudonyms. The mapping of names to pseudonyms is saved. Annotations are explored in an XML format in the NITE NXT model. NXT uses ‘stand off’ annotation, where annotation is linked to or referenced by words.
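The pseudonym step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SQUAD tool itself: the function name anonymise, the sample text and the chosen pseudonyms are all made up for the example, and the co-reference chain is represented simply by listing each surface form of a name ("Mr. Verdi", "Verdi") with its own replacement.

```python
import json
import re


def anonymise(text, pseudonyms):
    """Replace real names with chosen pseudonyms.

    Returns the anonymised text and the name -> pseudonym pairs actually used,
    so that the mapping can be saved alongside the data.
    """
    used = {}
    if not pseudonyms:
        return text, used
    # Longest names first, so "Mr. Verdi" is matched before plain "Verdi".
    names = sorted(pseudonyms, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # Lookarounds stop a name from matching inside a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")

    def substitute(match):
        real_name = match.group(0)
        used[real_name] = pseudonyms[real_name]
        return pseudonyms[real_name]

    return pattern.sub(substitute, text), used


# Illustrative input only; both the text and the pseudonyms are invented.
text = "Mr. Verdi left Music Masters of Milan. Verdi starts his new job in May."
chosen = {
    "Mr. Verdi": "Mr. Rossi",
    "Verdi": "Rossi",
    "Music Masters of Milan": "Company A",
}
anonymised, mapping = anonymise(text, chosen)
print(anonymised)
print(json.dumps(mapping, indent=2))  # the saved name-to-pseudonym record
```

Keeping the mapping separate from the anonymised text means the substitution can be audited or reversed by whoever holds the record, while the shared transcript carries only pseudonyms.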
Interview text with XML tags embedded:
<u who="#interviewer" xml:id="u1">There's just one or two factual things first of all do you mind my asking how old you are?</u>
<u who="#subject" xml:id="u2">49.</u>
<u who="#interviewer" xml:id="u3">And what schools did you go to?</u>
<u who="#subject" xml:id="u4">
<orgName>King Street</orgName>

TOOLS PROGRESS
• defined header metadata for a standardised transcript
• defined and tested generic XML models for qualitative data
• tested and refined NLP tools for qualitative data
• built front end to NLP named entity tools
• chosen software to enable annotation of data
• explored export formats for longer-term archiving
• investigated powerful XML based indexing tools for searching and retrieving data
• investigated web display of multimedia data and pointers to other resources using XML, extending the functionality of ESDS Qualidata

DATA EXCHANGE STANDARDS
• A uniform format for richly encoding qualitative research is necessary as it: enables preservation and re-use of metadata, data and annotation; ensures consistency of presentation and description of data; supports the development of common web-based publishing and search tools; and facilitates data interchange and comparison among datasets.
• SQUAD has produced a limited formal definition of a common XML vocabulary and DTD based on the TEI and tested a new Qualitative Data Interchange Format (QDIF).

THE PROJECT TEAM
Claire Grover, Maria Milosavljevic, Louise Corti, Libby Bishop, Mijail Alexandrov Kabadjov

CONTACT
Louise Corti and Claire Grover
UK Data Archive
University of Essex
Colchester, Essex CO4 3SQ
Email: quads@esds.ac.uk
Tel: +44 (0)1206 872145
URL: quads.esds.ac.uk/squad