Data Management David Nathan & Peter Austin & Robert Munro
This section • Data management • Properties of data • Relational data model • XML • Example
Workflows - description vs documentation • Description: something happened; something inscribed; cleaned up, selected, analysed; representations, lists, summaries, analyses; archived, presented, published. The stage where you applied knowledge and made decisions is NOT OF INTEREST! • Documentation: something happened; recording; representations, eg transcription, annotation (recapitulates); archived & ... ?? The stage where you applied knowledge and techniques, made decisions, and applied linguistic knowledge is the FOCUS OF INTEREST!
Data? • What is data? • Documentation data?
What is data management? • using appropriate and shared-standard data encoding methods (e.g. Unicode) • modelling the data domains (units, processes) • using appropriate and standard data structure methods (= knowledge representation) • capturing and documenting steps, decisions, conventions, structures • consistency (= machine readable) • awareness of planning and flow of data • working with others and across systems • catering for archiving
Example • eg if you need to collect/compare linguistic material according to speaker, then you need to not only make suitable recordings, but also create suitable labels, metadata, annotation etc
Choosing values/priorities • Standards & compliance • Adeptness with tools • Modelling of phenomena, architecture of data • Dissemination/publishing • Preserving • Ethics, responsibility, protocol • Range, comprehensiveness • Intellectual rigour • Which are priorities? • Which are dispensable?
A (thought) provoking example • \Indigenous title < > • \English title <The angry daughter> • \Language <Betta Kurumba> • \Duration <0:11:56> • \Description <A story about a king who had six sons and one daughter. The daughter, who has supernatural powers, gets angry with the family over an injust act done to her and runs away from the family, concealing herself as a spirit living in a well.> • \Rec_date <1999-11-20> • \Rec_location <Theppakkadu, Tamil Nadu, India> • \Indigenous speaker <B. Badsi, wife of BNHS Bomman> • \Collector <Gail Coelho> • \Genre <Folktale> • \Tape_medium <DAT> • \media_file <AngryD-Bsi.wav> • \annotation_file <AngryD-Bsi.pdf>
A (thought) provoking example • 1. What is this? • 2. Where would it be found? • 3. What is it for? • 4. Why does it look like this? • 5. List two good points about it. • 6. List at least 2 bad points about it
Data should be: • explicit • consistent • robust • meaningful • conventional • adaptable, convertible, machine readable etc • useful!
Where do word processors fit in? • MS Word is not a good data management tool - although if used well, it can play a role. • Displayed text: Fragment of an attention-seeking Look at me! MS Word file (with “Look at me!” in bold) • Underlying RTF: \pard\plain \s55\widctlpar \f4 Fragment of an attention-seeking {\b Look at me!} MS Word file \par \pard\plain • dual representations • WYSIWYG or WYSIAYG • distinguish structure and representation from presentation • ambiguities of typography: possible solution with styles • how would styles work in an RTF doc?
“Portability” • Bird and Simons 2003: language documentation data needs to have integrity, flexibility, longevity
“Portability” • complete • explicit • documented • preservable • transferable • accessible • adaptable • not technology-specific • (also appropriate, accurate, useful etc!!)
Data management • the way that data is structured is itself information, which may be complex • properly structured data allows: • usage including manipulation, conversion, derivation • preservation • machine readability
Data management systems • a data management system is a system you design for storing data and metadata: • information about content and structures • relationship between units of information • it is not necessarily tied to any particular software, or even a computer
Naive management using filenames • a (too) simple management system: • information about a recording is captured in the filenames: 1st_int_john_5Aug.wav market_conv_mj.wav …. • what does ‘int’ mean? • what information about the recording is missing?
Data modelling • World/universe • Domain • Relevant: • entities • properties • relationships • We also need formal ways to represent these
Data modelling • data modelling is the process of designing your data management system: • what information do you need to record? • what are the units of information? • what are their properties (attributes)? • what are the relationships between the units of information? • how is the information etc likely to change in the future? • how can all this be represented?
Data management • two well-known formats for structured data: • relational database • eXtensible Markup Language (XML) • these are methods, not software or hardware • any other system for well-structured data could be OK, but generally such systems have: • a smaller community of users, so fewer tools and less support • ... so errors are more likely
Databases • Note that database has 3 senses: • a body of related information • type of software (eg Oracle, Access, Filemaker) • a model for the domain of information (ie. formulation of entities and relationships)
Relational format • Uses tables • Table rows represent entities in a domain • Table columns represent properties/attributes of entities • Each cell represents one atomic unit of data • The order of rows and columns has no significance
Representing a relational design • simplest example [diagram: a table labelled TABLE NAME containing a single field, "field name"]
Representing a relational design • less trivial entity [diagram: a table labelled TABLE NAME containing "field 1" and "field 2"]
Representing a relational design • less trivial domain = one to many [diagram: table CONTINENT (name) linked to table COUNTRY (name)]
Non-trivial domains • non-trivial domains have many-to-many relationships [diagram: table AUTHOR (name, .....) linked to table SUBJECT (name, .....)]
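A many-to-many relationship such as the one between AUTHOR and SUBJECT is commonly implemented with a third, linking table. The following is a minimal sketch using Python's built-in sqlite3 module. The linking table name author_subject, the id columns, and all sample rows are assumptions made for illustration; they do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE subject (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Linking table (assumed name): one row per author/subject pairing.
    CREATE TABLE author_subject (
        author_id  INTEGER NOT NULL REFERENCES author(id),
        subject_id INTEGER NOT NULL REFERENCES subject(id),
        PRIMARY KEY (author_id, subject_id)
    );
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO author (id, name) VALUES (?, ?)",
                 [(1, "Author A"), (2, "Author B")])
conn.executemany("INSERT INTO subject (id, name) VALUES (?, ?)",
                 [(1, "Phonology"), (2, "Syntax")])
conn.executemany("INSERT INTO author_subject (author_id, subject_id) VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2)])


def subjects_of(author_name):
    """Return the subjects linked to one author, sorted by name."""
    rows = conn.execute(
        """SELECT subject.name
           FROM subject
           JOIN author_subject ON author_subject.subject_id = subject.id
           JOIN author ON author.id = author_subject.author_id
           WHERE author.name = ?
           ORDER BY subject.name""",
        (author_name,))
    return [name for (name,) in rows]


def authors_of(subject_name):
    """Return the authors linked to one subject, sorted by name."""
    rows = conn.execute(
        """SELECT author.name
           FROM author
           JOIN author_subject ON author_subject.author_id = author.id
           JOIN subject ON subject.id = author_subject.subject_id
           WHERE subject.name = ?
           ORDER BY author.name""",
        (subject_name,))
    return [name for (name,) in rows]


print(subjects_of("Author A"))
print(authors_of("Syntax"))
```

Neither the author table nor the subject table needs a repeating column: each pairing is simply one more row in the linking table.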
From model to implementation • implementing table relationships [diagram: table CONTINENT (id, name); table COUNTRY (id, name, continent_id)]
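The CONTINENT/COUNTRY design can be sketched with Python's built-in sqlite3 module. The table and field names (id, name, continent_id) follow the slide; the SQL column types, the constraints, and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Ask SQLite to check REFERENCES constraints (off by default).
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE continent ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL UNIQUE)")
conn.execute(
    "CREATE TABLE country ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    # Each country row points at exactly one continent row.
    " continent_id INTEGER NOT NULL REFERENCES continent(id))")

# Hypothetical sample data.
conn.executemany("INSERT INTO continent (id, name) VALUES (?, ?)",
                 [(1, "Asia"), (2, "Africa")])
conn.executemany("INSERT INTO country (name, continent_id) VALUES (?, ?)",
                 [("India", 1), ("Japan", 1), ("Kenya", 2)])


def countries_in(continent_name):
    """Follow continent_id back to the continent's name."""
    rows = conn.execute(
        """SELECT country.name
           FROM country
           JOIN continent ON continent.id = country.continent_id
           WHERE continent.name = ?
           ORDER BY country.name""",
        (continent_name,))
    return [name for (name,) in rows]


print(countries_in("Asia"))
```

The "many" side (COUNTRY) carries the key of the "one" side (CONTINENT), so a continent's name is stored once however many countries refer to it.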
Designing a database • Determine the domain, entities and relationships • Experiment with scenarios • Any non-trivial model will evolve as it is thought out and tested • Normalisation is the process of refining models
Practical example • Create a database model to record bicycle owners • Populate your database with 3 bicycle owners: • Alf • Betty • Cherie
Extending ... • Cater for the brands of bicycle they own: • Alf Dawes • Betty Giant • Cherie Malvern Star
Testing ... • Dennis also has a Malvern Star
Testing ... • Alf has two bicycles
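One possible model that survives both tests is sketched below with Python's built-in sqlite3 module; it is only one of several workable answers to the exercise. The table names owner, brand and bicycle are assumptions, and so is the brand of Alf's second bicycle (Giant), which the exercise does not specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE owner (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE brand (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- One row per physical bicycle, so an owner can have several
    -- bicycles and a brand can be shared by several owners.
    CREATE TABLE bicycle (
        id       INTEGER PRIMARY KEY,
        owner_id INTEGER NOT NULL REFERENCES owner(id),
        brand_id INTEGER NOT NULL REFERENCES brand(id)
    );
""")

conn.executemany("INSERT INTO owner (id, name) VALUES (?, ?)",
                 [(1, "Alf"), (2, "Betty"), (3, "Cherie"), (4, "Dennis")])
conn.executemany("INSERT INTO brand (id, name) VALUES (?, ?)",
                 [(1, "Dawes"), (2, "Giant"), (3, "Malvern Star")])
conn.executemany("INSERT INTO bicycle (owner_id, brand_id) VALUES (?, ?)", [
    (1, 1),  # Alf: Dawes
    (2, 2),  # Betty: Giant
    (3, 3),  # Cherie: Malvern Star
    (4, 3),  # Dennis: Malvern Star (brand row is reused, not duplicated)
    (1, 2),  # Alf's second bicycle; the brand here is an assumption
])


def brands_owned_by(owner_name):
    """Return the brand of every bicycle belonging to one owner."""
    rows = conn.execute(
        """SELECT brand.name
           FROM bicycle
           JOIN brand ON brand.id = bicycle.brand_id
           JOIN owner ON owner.id = bicycle.owner_id
           WHERE owner.name = ?
           ORDER BY brand.name""",
        (owner_name,))
    return [name for (name,) in rows]


def owners_of_brand(brand_name):
    """Return every owner who has at least one bicycle of this brand."""
    rows = conn.execute(
        """SELECT DISTINCT owner.name
           FROM bicycle
           JOIN owner ON owner.id = bicycle.owner_id
           JOIN brand ON brand.id = bicycle.brand_id
           WHERE brand.name = ?
           ORDER BY owner.name""",
        (brand_name,))
    return [name for (name,) in rows]


print(brands_owned_by("Alf"))
print(owners_of_brand("Malvern Star"))
```

A single table with one "brand" column per owner would pass the first two steps but fail the last one, which is the kind of evolution the design slide describes.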
Simple relational example • don’t need to pack information into filenames: 1st_int_john_5Aug.wav market_conv_mj.wav • use a table in MS Word, Excel, Filemaker etc
Structured data management • some information is about the data • some is about relationships between data
Structured data management • a separate table should define these codes
Structured data management • formalise the relationships within the data: • need unique identifiers
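The filename example can be restated as tables with unique identifiers. This is a hedged sketch using Python's built-in sqlite3 module: the table layout, the expansions of the codes ("int" as interview, "conv" as conversation) and the speaker names are all assumptions for illustration, since the filenames themselves do not say what the codes mean.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- A separate table defines what each code means (meanings assumed).
    CREATE TABLE genre (code TEXT PRIMARY KEY, meaning TEXT NOT NULL);
    CREATE TABLE speaker (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE recording (
        id         INTEGER PRIMARY KEY,          -- unique identifier
        filename   TEXT NOT NULL UNIQUE,
        genre_code TEXT NOT NULL REFERENCES genre(code),
        speaker_id INTEGER NOT NULL REFERENCES speaker(id),
        rec_date   TEXT                          -- ISO date, if known
    );
""")

conn.executemany("INSERT INTO genre (code, meaning) VALUES (?, ?)",
                 [("int", "interview"), ("conv", "conversation")])
conn.executemany("INSERT INTO speaker (id, name) VALUES (?, ?)",
                 [(1, "John"), (2, "MJ")])  # hypothetical speakers
conn.executemany(
    "INSERT INTO recording (id, filename, genre_code, speaker_id, rec_date)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        # rec_date stays NULL: the filename gives no year, or no date at all.
        (1, "1st_int_john_5Aug.wav", "int", 1, None),
        (2, "market_conv_mj.wav", "conv", 2, None),
    ])


def describe(recording_id):
    """Resolve a recording's codes into explicit values."""
    return conn.execute(
        """SELECT recording.filename, genre.meaning, speaker.name
           FROM recording
           JOIN genre ON genre.code = recording.genre_code
           JOIN speaker ON speaker.id = recording.speaker_id
           WHERE recording.id = ?""",
        (recording_id,)).fetchone()


print(describe(1))
print(describe(2))
```

The filename is now just one property of a recording; what it encodes is stated explicitly elsewhere, and missing information shows up as an empty field instead of going unnoticed.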
DBMS software also handles • entry • value checking • deletion • manipulation • querying
DBMS software • most use the ‘tables and keys’ model described here: • MS Access, Oracle, MySQL, Filemaker • they differ in what they additionally offer: • user interfaces (MS Access) • scalability, enforcement of data integrity (Oracle) • free of cost (MySQL) • ease of manipulation (Filemaker)
What does all this achieve? • conceptual/intellectual validity • scalable, searchable, modular • machine readable • in fact, portable: • complete • explicit • documented • preservable • transferable • accessible • adaptable • not technology-specific
XML history • XML came out of SGML - a system for incremental and collaborative “enrichment” of texts • XML design principles • 1. XML shall be straightforwardly usable over the Internet. • 2. XML shall support a wide variety of applications. • 3. XML shall be compatible with SGML. • 4. It shall be easy to write programs which process XML documents. • 5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero. • 6. XML documents should be human-legible and reasonably clear. • 7. The XML design should be prepared quickly. • 8. The design of XML shall be formal and concise. • 9. XML documents shall be easy to create. • 10. Terseness is of minimal importance.
XML • An in-line markup system • Single sequence of text only (but can be Unicode) • Reserved characters < > & " ' • Tag syntax • Entity syntax • Elements
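Because some characters are reserved for markup, they have to be written as entities when they occur in the text itself. The sketch below uses Python's standard xml.sax.saxutils and xml.etree.ElementTree modules to show the round trip; the element name note, the attribute name text, and the sample string are assumptions for illustration.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

raw = 'Smith & Jones say "a < b"'

# escape() replaces &, < and > with entities for use as element content.
escaped = escape(raw)
element_doc = "<note>" + escaped + "</note>"

# quoteattr() escapes the value and wraps it in suitable quotes
# for use as an attribute value.
attr_doc = "<note text=" + quoteattr(raw) + "/>"


def parses(text):
    """Return True if the text is accepted by the XML parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(escaped)
print(ET.fromstring(element_doc).text == raw)         # parser restores the original
print(ET.fromstring(attr_doc).get("text") == raw)
print(parses("<note>" + raw + "</note>"))             # unescaped text is rejected
```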
Definitions • What is an XML document? An XML document consists of sequences and hierarchies of elements and text • What does XML do? XML is a method for expressing languages (knowledge representation languages)
Like HTML, except • emphasis on logical structure, not display properties • encourages human readability • documents must be well formed • no predefined elements - open and extensible
XML concepts • XML can be thought of: • as a stream (eg: a stream of text) and/or • as a tree structure
Elements • XML is a way of creating structures or “elements” using only plain text • elements are written via tags in angle brackets: eg: <noun> • tags are usually in pairs: • a start/open tag, and an end/close tag: the <noun> dog </noun> chased ... • but can also be single and closed: the dog <pause /> sat down
Attributes • tags can have attributes with values: the <noun num="1"> dog </noun> sat down • you can name your tags, attributes or values (almost) anything • there are some restrictions: • you can have hierarchies, but not overlaps: • well formed: <a>the<b><c>cat</c> sat</b> on the mat</a> • not well formed: <a>the<b><c>cat</b> sat</c> on the mat</a>
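The hierarchy-versus-overlap restriction can be checked mechanically. The following sketch uses Python's standard xml.etree.ElementTree parser on the two "cat sat on the mat" strings, then reads an attribute from a small document; the wrapping element s in that last document is an assumption, added because an XML document needs a single root element.

```python
import xml.etree.ElementTree as ET

nested = "<a>the<b><c>cat</c> sat</b> on the mat</a>"
overlapping = "<a>the<b><c>cat</b> sat</c> on the mat</a>"


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(nested))
print(is_well_formed(overlapping))

# Elements and attributes seen as a tree. The <s> wrapper is assumed.
root = ET.fromstring('<s>the <noun num="1">dog</noun> sat <pause/> down</s>')
noun = root.find("noun")
print(noun.text, noun.get("num"))
print([child.tag for child in root])
```

The parser rejects the overlapping version because the close tag </b> arrives while <c> is still open, so the elements cannot be arranged as a tree.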
Creating XML documents • You need to design/define/model the domain • Your design is a grammar of a particular XML document • The grammar can be expressed: • with the data representation • independently, using a DTD or an XML schema