“Reverse Engineering” Statistical Metadata through User Studies

Carol A. Hert, Syracuse University, January 23, 2003

Presentation Overview • Defining metadata (yet again) • Rationale for user studies—reverse engineering of metadata • Two studies of users • Users of statistical tables • Users during statistical integration tasks • Implications for system design

Definition of Metadata • Metadata are information entities preserved in artifacts that provide context designed to help the user create, locate, understand, and use* the entities/data to which the metadata refer • *That is, help the user manipulate the entity throughout the entity’s lifecycle

The Metadata Challenge • What information entities are metadata (and what aren’t)? • Which metadata are necessary, essential, optimal for which tasks (and can we acquire them)? • How can we understand metadata use and creation to improve our metadata systems (and other tools for user understanding)?

A Viable Approach • Reverse engineer metadata elements by investigating how users interact with statistical information and determining what information is necessary to support them

The Viability of User Metadata Studies • Plethora of potential metadata • Cost of creating or harvesting, maintaining metadata and metadata systems • Uncertain utility of some metadata

Rationale for User Studies • Examination of users in situ can provide insight into which metadata are used, when, in what formats, etc. • Accepted strategy in social informatics, sociology of technology and work

The User Studies • Study 1: Metadata needs during usage of statistical tables • Study 2: Metadata needs during tasks requiring integration of statistical information • Both studies funded by the U.S. National Science Foundation and the Bureau of Labor Statistics

Exploring Metadata for Understanding Statistical Tables • Task concerned understanding statistical tables • Identified user questions/uncertainties about specific tables • Yielding potential metadata elements • Searched for answers in existing metadata sources • Investigating potential for harvesting metadata

Exploring Metadata for Understanding Statistical Tables • 11 respondents, each worked with 3 tables (mix of electronic and paper) • total 170 uncertainties categorized into 5 major categories

Findings about Metadata for Tables • Most common questions concerned definitions, followed by rationales • Questions related to statistical domain, general table structure, and interface • Rationale questions difficult to answer with existing metadata

Types of Uncertainties • Definitions (of terms, categories, abbreviations, universe) (97 of 170) • Rationales (28 of 170) • Table structure (e.g., format, layout, link structure) (24 of 170) • Lack of information on: data collection and sources (4 of 170); computational methods (4 of 170); comparability/relationship of information (6 of 170); others (5 of 170) • Other (2 of 170)
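The counts on this slide can be expressed as shares of the 170 uncertainties. The following is a minimal Python sketch of that arithmetic. The counts are those reported above; the shortened category labels are chosen here for illustration.

```python
# Counts of user uncertainties by category, as reported on the slide.
# The labels are shortened for illustration.
counts = {
    "Definitions": 97,
    "Rationales": 28,
    "Table structure": 24,
    "Data collection and sources": 4,
    "Computational methods": 4,
    "Comparability/relationship": 6,
    "Others (lack of information)": 5,
    "Other": 2,
}

total = sum(counts.values())  # matches the 170 uncertainties reported

# Share of each category as a percentage, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Print categories from most to least frequent.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {counts[name]} of {total} ({pct}%)")
```

Computed this way, definitions account for more than half of all recorded uncertainties.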

Insights about Metadata • Metadata often difficult to retrieve (due to unstructured format) • Metadata duplicated in multiple places (often manually and with editorial changes) • Metadata needed were agency-, table-, or statistics-specific

And a Tension • What is the relationship among metadata and other types of information, and when and how do these sources interact to support particular tasks? (a.k.a. what are metadata?)

Metadata During Integration Tasks • What problems/uncertainties do specific types of users have during tasks involving integration of statistical data? • For the same tasks, what problems/uncertainties do experts perceive as being relevant to usage of the data by the user populations? • How do problems experienced by end-users compare to those identified by experts? • What metadata or other information can be identified to resolve user problems?

Metadata During Integration Tasks • Goals of Study • Extend our knowledge of metadata usage • Inform design of tools that incorporate metadata • Consider metadata tools in conjunction with larger set of statistical literacy tools

Metadata During Integration Tasks • Methodology • Five tasks requiring integration across sources • Users did 1-2 of the tasks • Think-aloud protocols used with follow-up interview • To date: 14 expert users; second round of data collection about to begin

The Tasks • 3 variants of “Find 4-6 economic indicators for a particular county and compare the county’s economic status to its state and the United States as a whole” • While looking at the economic indicators for Nebraska you notice that the unemployment numbers are not the same at the BLS site and at the Nebraska site—try to determine why.

The Tasks • You are interested in building a soybean crushing plant in either Nebraska or South Dakota. Examine natural gas and electricity prices in the states to determine an appropriate location.

The Tasks • You have become increasingly concerned about urban sprawl in North Carolina. You are looking for statistics on loss of farming lands and farming income in Orange, Durham, and Wake counties. Has the loss of farmland in these counties been greater than 50% since 1992? How does the loss of farmland and farm income in the Raleigh-Durham area compare to the loss of farmland and farm income across the nation as a whole?

Findings to Date • Integrating activities of users • Making comparisons • Noting discrepancies (between data, in presentation approach, etc.) and/or asking what the difference is due to • Manipulations (e.g., mathematical, exporting to spreadsheets) • Barriers to integration

More findings • Strategies used to find and integrate sources, data, to understand scope of task • Knowledge used • Types of questions/uncertainties • Terminology used • Aspects of data that matter to the user during the task

Findings to Date • Comparisons are a critical aspect of integration • Comparison types identified: • Geographic units • Definitional differences in concepts and variables • Across time • Data from different sources (websites, surveys) • Index value comparisons

Barriers to Successful Integration • Definition, source information lacking • User lack of knowledge of appropriate strategies (e.g., using time series data, types of calculations to perform) • User lack of knowledge about usage of index values, statistical activity purpose and approach • Interface design problems (such as scrolling row and column headers)

Further Barriers • Inconsistent data across sources • Inconsistent interfaces • Inability to determine whether data wanted for comparison are available • Lack of domain knowledge • Lack of knowledge of how to handle inflation, seasonal adjustment • Terminology differences

Other Findings • Terminological variants within/across agencies and between users and agencies • Different approaches suggest different statistics to users • Experts use agency and domain knowledge extensively

Using the Results • Incorporate specific metadata into a variety of tools • Provide answers from metadata sources for specific presentations, tasks, etc. • Issues are specificity of answer, uniqueness of answer • Identifying metadata elements and sources of metadata • Determine tools appropriate to a particular user situation

Tools/Approaches under Development • Glossary lookup • Ontology for cross walking • Relationship browser • Enables a person to preview website, datasets by specifying particular relationships (e.g. show me datasets that include unemployment variables and come from surveys of households)
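The relationship-browser query in the example above can be thought of as a filter over dataset descriptions. The sketch below is illustrative only: the `Dataset` record, its field names, the `browse` function, and the sample catalog entries are all hypothetical and do not describe the project's actual tool or data.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    # Hypothetical metadata record; field names are assumptions for illustration.
    name: str
    survey_type: str  # e.g. "household" or "establishment"
    variables: set = field(default_factory=set)


def browse(datasets, variable=None, survey_type=None):
    """Return the datasets that satisfy every relationship that was specified."""
    matches = []
    for ds in datasets:
        if variable is not None and variable not in ds.variables:
            continue
        if survey_type is not None and ds.survey_type != survey_type:
            continue
        matches.append(ds)
    return matches


# Made-up catalog entries, used only to exercise the filter.
catalog = [
    Dataset("Example household survey", "household", {"unemployment", "earnings"}),
    Dataset("Example establishment survey", "establishment", {"employment", "earnings"}),
    Dataset("Example price survey", "establishment", {"prices"}),
]

# "Show me datasets that include unemployment variables
# and come from surveys of households."
for ds in browse(catalog, variable="unemployment", survey_type="household"):
    print(ds.name)
```

Leaving a relationship unspecified (`None`) skips that filter, so a user can preview a collection broadly and narrow it one relationship at a time.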

Tools/Approaches Under Development • Relationship browser that will modify itself based on the underlying object classes/variables available • Embedded help via “sticky notes” • Online communities of interest (via communication tools) • Tutorials, scenarios of use

Mapping Needs to Tools • Definitional information: glossary, mappings of agency terminology to user terminology, ontologies • Scoping problem (e.g., what is an economic indicator): example indicators, general definitions • Non-linked explanatory information: mouse-overs at point of linkage, additional linkings
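A glossary lookup combined with a user-to-agency terminology mapping could look roughly like the sketch below. Everything in it is an assumption made for illustration: the term pairs are invented examples, the definitions are placeholders, and `lookup` is a hypothetical helper rather than part of any tool named in the presentation.

```python
# Hypothetical mapping from user terminology to agency terminology.
USER_TO_AGENCY = {
    "jobless rate": "unemployment rate",
    "out of work": "unemployed",
}

# Hypothetical glossary keyed by agency term; definitions are placeholders.
GLOSSARY = {
    "unemployment rate": "<agency definition of unemployment rate>",
    "unemployed": "<agency definition of unemployed>",
}


def lookup(term):
    """Map a user's term to the agency term, then fetch its glossary entry."""
    key = term.strip().lower()
    # Fall back to the user's own wording when no mapping is known.
    agency_term = USER_TO_AGENCY.get(key, key)
    definition = GLOSSARY.get(agency_term)  # None when no entry exists
    return agency_term, definition


print(lookup("Jobless rate"))
print(lookup("soybeans"))
```

Returning `None` for a missing entry keeps the gap visible, which matters here because missing definitions were the most common uncertainty in the table study.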

Mapping Needs to Tools • Managing data collected: access to table builders, word processing, spreadsheets • Finding comparable numbers: relationship browser (e.g., geographic, time unit by indicator) • Confusion caused by large number of text links: relationship browser (show me pages/parts of site that have economic indicators)

Integrating Metadata Systems with Other Tools • Metadata are one component of a statistical information network • Metadata systems important • Metadata as “organizers, content” of other systems • Metadata systems need to pass metadata to other tools and vice versa • A New Question: How do our metadata systems and repositories interact with other tools?

Further Information Carol A. Hert cahert@syr.edu The overall project: http://ils.unc.edu/govstat
