“Reverse Engineering” Statistical Metadata through User Studies. Carol A. Hert Syracuse University January 23, 2003. Presentation Overview. Defining metadata (yet again) Rationale for user studies—reverse engineering of metadata Two studies of users Users of statistical tables
“Reverse Engineering” Statistical Metadata through User Studies Carol A. Hert Syracuse University January 23, 2003
Presentation Overview • Defining metadata (yet again) • Rationale for user studies—reverse engineering of metadata • Two studies of users • Users of statistical tables • Users during statistical integration tasks • Implications for system design
Definition of Metadata • Metadata are information entities preserved in artifacts that perform the task of providing context designed to help the user create, locate, understand, and use* the entities/data to which the metadata refer • *That is, help the user manipulate the entity throughout the entity’s lifecycle
The Metadata Challenge • What information entities are metadata (and what aren’t)? • Which metadata are necessary, essential, optimal for which tasks (and can we acquire them)? • How can we understand metadata use and creation to improve our metadata systems (and other tools for user understanding)?
A Viable Approach • Reverse engineer metadata elements by investigating how users interact with statistical information and determining what information is necessary to support them
The Viability of User Metadata Studies • Plethora of potential metadata • Cost of creating or harvesting, maintaining metadata and metadata systems • Uncertain utility of some metadata
Rationale for User Studies • Examination of users in situ can provide insight into which metadata are used, when, in what formats, etc. • Accepted strategy in social informatics, sociology of technology and work
The User Studies • Study 1: Metadata needs during usage of statistical tables • Study 2: Metadata needs during tasks requiring integration of statistical information Both funded by U.S. National Science Foundation and Bureau of Labor Statistics
Exploring Metadata for Understanding Statistical Tables • Task concerned understanding statistical tables • Identified user questions/uncertainties about specific tables • Yielding potential metadata elements • Searched for answers in existing metadata sources • Investigating potential for harvesting metadata
Exploring Metadata for Understanding Statistical Tables • 11 respondents, each worked with 3 tables (mix of electronic and paper) • total 170 uncertainties categorized into 5 major categories
Findings about Metadata for Tables • Most common questions concerned definitions, followed by rationales • Questions related to statistical domain, general table structure, and interface • Rationale questions difficult to answer with existing metadata
Types of Uncertainties • Definitions (of terms, categories, abbreviations, universe) (97 of 170) • Rationales (28 of 170) • Table structure (e.g., format, layout, link structure) (24 of 170) • Lack of information on: data collection and sources (4 of 170); computational methods (4 of 170); comparability/relationship of information (6 of 170); others (5 of 170) • Other (2 of 170)
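The counts on this slide can be checked and turned into shares with a short tally. This is a minimal sketch, not part of the original study: the grouping of the four "lack of information" items under one heading is an assumption based on the "5 major categories" mentioned earlier, and the names `uncertainties` and `category_totals` are illustrative.

```python
# Counts as reported on the slide (170 uncertainties in total).
# Assumption: the four "lack of information" items form one major category.
uncertainties = {
    "Definitions": 97,
    "Rationales": 28,
    "Table structure": 24,
    "Lack of information": {
        "Data collection and sources": 4,
        "Computational methods": 4,
        "Comparability/relationship of information": 6,
        "Others": 5,
    },
    "Other": 2,
}


def category_totals(counts):
    """Collapse nested sub-categories into one total per major category."""
    totals = {}
    for name, value in counts.items():
        totals[name] = sum(value.values()) if isinstance(value, dict) else value
    return totals


totals = category_totals(uncertainties)
grand_total = sum(totals.values())

# Share of all uncertainties per major category, in percent.
shares = {name: round(100 * n / grand_total, 1) for name, n in totals.items()}

for name, n in totals.items():
    print(f"{name}: {n} of {grand_total} ({shares[name]}%)")
```

Under this grouping the sub-counts add up to the reported total of 170, and definitional questions account for more than half of all uncertainties.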
Insights about Metadata • Metadata often difficult to retrieve (due to unstructured format) • Metadata duplicated in multiple places (often manually and with editorial changes) • Metadata needed were agency-, table-, or statistics-specific
And a Tension • What is the relationship among metadata and other types of information, and when and how do these sources interact to support particular tasks? (a.k.a. what are metadata?)
Metadata During Integration Tasks • What problems/uncertainties do specific types of users have during tasks involving integration of statistical data? • For the same tasks, what problems/uncertainties do experts perceive as being relevant to usage of the data by the user populations? • How do problems experienced by end-users compare to those identified by experts? • What metadata or other information can be identified to resolve user problems?
Metadata During Integration Tasks • Goals of Study • Extend our knowledge of metadata usage • Inform design of tools that incorporate metadata • Consider metadata tools in conjunction with larger set of statistical literacy tools
Metadata During Integration Tasks • Methodology • Five tasks requiring integration across sources • Users did 1-2 of the tasks • Think aloud protocols used with follow-up interview • To date, 14 expert users, second round of data collection about to begin
The Tasks • 3 variants of “Find 4-6 economic indicators for a particular county and compare the county’s economic status to its state and the United States as a whole” • While looking at the economic indicators for Nebraska you notice that the unemployment numbers are not the same at the BLS site and at the Nebraska site—try to determine why.
The Tasks • You are interested in building a soybean crushing plant in either Nebraska or South Dakota. Examine natural gas and electricity prices in the states to determine an appropriate location.
The Tasks • You have become increasingly concerned about urban sprawl in North Carolina. You are looking for statistics on loss of farming lands and farming income in Orange, Durham, and Wake counties. Has the loss of farmland in these counties been greater than 50% since 1992? How does the loss of farmland and farm income in the Raleigh-Durham area compare to the loss of farmland and farm income across the nation as a whole?
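The farmland question in this task comes down to a percent-change calculation against a 1992 baseline, summed over the three counties. The sketch below shows that arithmetic only; the acreage figures are made up for illustration and are not real data, and `percent_loss` is a hypothetical helper name.

```python
def percent_loss(baseline, current):
    """Percent lost relative to the baseline value (positive means a loss)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - current) / baseline


# Made-up acreage per county: (1992 value, latest value). Not real data.
acreage = {
    "Orange": (400.0, 150.0),
    "Durham": (300.0, 120.0),
    "Wake": (300.0, 130.0),
}

# Sum the counties first, then compute the loss for the area as a whole.
baseline_total = sum(b for b, _ in acreage.values())
current_total = sum(c for _, c in acreage.values())
area_loss = percent_loss(baseline_total, current_total)

print(f"Area loss since baseline: {area_loss:.1f}%")
print("Greater than 50%?", area_loss > 50.0)
```

Summing before dividing matters: averaging the three county percentages would weight a small county the same as a large one.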
Findings to Date • Integrating activities of users • Making comparisons • Noting discrepancies (between data, in presentation approach, etc.) and/or asking what the difference is due to • Manipulations (e.g., mathematical, exporting to spreadsheets) • Barriers to integration
More findings • Strategies used to find and integrate sources, data, to understand scope of task • Knowledge used • Types of questions/uncertainties • Terminology used • Aspects of data that matter to the user during the task
Findings to Date • Comparisons are a critical aspect of integration • Comparison types identified: • Geographic units • Definitional differences in concepts and variables • Across time • Data from different sources (websites, surveys) • Index value comparisons
Barriers to Successful Integration • Definition, source information lacking • User lack of knowledge of appropriate strategies (e.g., using time series data, types of calculations to perform) • User lack of knowledge about usage of index values, statistical activity purpose and approach • Interface design problems (such as scrolling row and column headers)
Further Barriers • Inconsistent data across sources • Inconsistent interfaces • Inability to determine whether data wanted for comparison are available • Lack of domain knowledge • Lack of knowledge of how to handle inflation, seasonal adjustment • Terminology differences
Other Findings • Terminological variants within/across agencies and between users and agencies • Different approaches suggest different statistics to users • Experts use agency and domain knowledge extensively
Using the Results • Incorporate specific metadata into a variety of tools • Provide answers from metadata sources for specific presentations, tasks, etc. • Issues are specificity of answer, uniqueness of answer • Identifying metadata elements and sources of metadata • Determine tools appropriate to a particular user situation
Tools/Approaches under Development • Glossary lookup • Ontology for cross walking • Relationship browser • Enables a person to preview website, datasets by specifying particular relationships (e.g. show me datasets that include unemployment variables and come from surveys of households)
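The relationship browser's example query ("show me datasets that include unemployment variables and come from surveys of households") can be read as a filter over dataset metadata. The following is a rough sketch of that idea under stated assumptions, not the project's implementation: the `Dataset` record, its fields, the `browse` function, and the catalog entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical metadata record for one dataset."""
    name: str
    variables: frozenset
    source_type: str


def browse(datasets, variable=None, source_type=None):
    """Return datasets matching every relationship the user specified."""
    return [
        d for d in datasets
        if (variable is None or variable in d.variables)
        and (source_type is None or d.source_type == source_type)
    ]


# Illustrative catalog; names and contents are invented.
catalog = [
    Dataset("Example household survey",
            frozenset({"unemployment", "earnings"}), "household survey"),
    Dataset("Example establishment survey",
            frozenset({"employment", "earnings"}), "establishment survey"),
    Dataset("Example administrative file",
            frozenset({"unemployment"}), "administrative records"),
]

matches = browse(catalog, variable="unemployment",
                 source_type="household survey")
for d in matches:
    print(d.name)
```

Leaving a criterion unset widens the preview, which is how a user could explore what is available before narrowing down.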
Tools/Approaches Under Development • Relationship browser that will modify itself based on the underlying object classes/variables available • Embedded help via “sticky notes” • Online communities of interest (via communication tools) • Tutorials, scenarios of use
Mapping Needs to Tools • Definitional information: glossary, mappings of agency terminology to user terminology, ontologies • Scoping problem (e.g., what is an economic indicator): example indicators, general definitions • Non-linked explanatory information—mouse-overs at point of linkage, additional linkings
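A glossary combined with a mapping from user terminology to agency terminology could look roughly like the sketch below. It is an illustration only: the term pairs are invented examples, the definitions are placeholders rather than real agency text, and `USER_TO_AGENCY`, `GLOSSARY`, and `lookup` are hypothetical names.

```python
# Hypothetical mapping from terms users say to terms an agency uses.
USER_TO_AGENCY = {
    "jobless rate": "unemployment rate",
    "pay": "earnings",
}

# Placeholder glossary; real entries would come from agency metadata.
GLOSSARY = {
    "unemployment rate": "<agency definition goes here>",
    "earnings": "<agency definition goes here>",
}


def lookup(term):
    """Map a user term to the agency term, then fetch its definition.

    Returns (agency_term, definition); definition is None when the
    glossary has no entry for the term.
    """
    key = term.strip().lower()
    agency_term = USER_TO_AGENCY.get(key, key)
    return agency_term, GLOSSARY.get(agency_term)


print(lookup("Jobless rate"))
print(lookup("seasonal adjustment"))
```

Returning the agency term alongside the definition lets the interface show users which official term their own wording was mapped to.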
Mapping Needs to Tools • Managing data collected: access to table builders, word processing, spreadsheets • Finding comparable numbers: relationship browser (e.g., geographic, time unit by indicator) • Confusion caused by a large number of text links: relationship browser (e.g., show me pages/parts of the site that have economic indicators)
Integrating Metadata Systems with Other Tools • Metadata are one component of a statistical information network • Metadata systems important • Metadata as “organizers, content” of other systems • Metadata systems need to pass metadata to other tools and vice versa • A New Question: How do our metadata systems and repositories interact with other tools?
Further Information Carol A. Hert cahert@syr.edu The overall project: http://ils.unc.edu/govstat