Linking Multi-faceted Complex Data: From Vision to Reality
PNC 2013, Kyoto, Japan, December 10th 2013
Jeanette Zerneke, Electronic Cultural Atlas Initiative (ECAI), School of Information, UC Berkeley
Starting with a Simplified Vision
• Analysis of whole corpora and collections of text
• Linking multiple types of information
• Sharing data from different sources
• Mapping multiple layers of complex data
• Complex yet simple interactive visualizations
• Network analysis of everything
• Open Linked Data… BIG Data
Initial Steps and Inspiration
• Text Analysis: word counts and networks
• Space and Time: maps, GIS, and historical GIS
• Science visualizations
Early Experiments
• Alice in Wonderland
• Interactive word cloud generator
Word Cloud: Alice in Wonderland (TextArc, ~2002)
This visualization includes all the words in the text and multiple interactive viewing modes, and it still works.
Static Word Clouds – Single Dataset
The image shows the 300 most commonly occurring words in the text; full technical details are provided at the source.
Image credits: 123rf royalty-free stock photos; http://willsfamily.org/gwills/books/
The I-Ching
These word clouds are "authored" with specific cartographic choices.
When Physicists do Linguistics • …a question that’s becoming increasingly central as the social sciences embrace new quantitative tools: Can number-crunchers outside these fields use the data to make bold, useful contributions? Or do they need more specialized knowledge to be able to ask the right questions in the first place? • …the unprecedented scale of the Google dataset, encompassing millions of books, has enticed scholars with the promise of a new, quantitative approach to language and culture—“culturomics,” as a pivotal Science paper dubbed it. • A language isn’t simply a set of words; it encompasses structures all the way from individual sounds to the combination of words and phrases into syntactic patterns. “This ‘language is words’ axiom is part of most people’s folk linguistics that we have to train people out of when they take Intro to Linguistics,” Fruehwald said. “That’s why it’s a little hard to take the work of these physicists seriously at first glance.” Ben Zimmer, Boston Globe 2013
Inspired by The Internet
http://en.wikipedia.org/wiki/Internet
Partial map of the Internet based on the January 15, 2005 data found on opte.org.
Inspired by The Internet
• This image of the internet shows a beautiful, complex structure that appears almost holographic
• It resonates with us
• It has inspired us to look for these structures in the rest of our world
Great Data Visualizations
• These examples show many creative ways to present data
• They are 'authored' – a human decides how they look
• Generally they present a static view of one type of data and/or relationship
• The visualization expresses something about:
  • the characteristics of the data elements within the data universe, or
  • the quality or pattern of the relationships between elements
Challenges
• There is a temptation to make the data fit the box so you can do amazing visualizations:
  • to cleanse the data so it fits
  • to forge ahead and forget the pedigree and links to sources
  • to believe that any text should be graph-able with a network graph, so it must be useful
• But we know:
  • analysis requires understandable algorithms
  • evaluation requires judgment, not just visualization
  • there must be feedback between visual inspection, analysis, evaluation, and human interaction and judgment
Zooming in to what works
• We noticed that all the highly valued visualizations are authored visualizations
• They are designed by a person or group of people to express a specific point or an understanding of the data
• A person has chosen both the data and the visualization methodology
• A person has chosen the color, size, line width, etc.
• This is quite similar to making maps, and there is a long tradition of study on how to make maps as accurate and readable as possible
Mapping & Spatio-temporal Visualization
• Digital mapping systems inherently have ways of handling data from different sources and of different data types
• Three primary mappable data types: point, line, and shape (see the sketch below)
• Attributes & metadata are linkable at the dataset and element levels
• They were originally built both for user interaction with the data and for authoring views / maps
• Initial ECAI projects took advantage of these features: Sasanian Seals, Missions of North America, Silk Road
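As an illustration of these three data types, here is a minimal Python sketch (not ECAI's actual data model) that expresses a point, a line, and a shape as GeoJSON-style features, each carrying element-level attributes, with dataset-level metadata on the collection. All names, coordinates, and attribute values are hypothetical placeholders.

```python
# Minimal sketch: the three primary mappable data types as GeoJSON-style
# features, each with its own attribute dictionary. Coordinates and values
# are hypothetical placeholders, not project data.

mission_point = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-121.635, 36.771]},
    "properties": {"name": "example mission site", "founded": 1797},
}

route_line = {
    "type": "Feature",
    "geometry": {
        "type": "LineString",
        "coordinates": [[-121.64, 36.77], [-121.90, 36.60], [-122.03, 36.97]],
    },
    "properties": {"route": "illustrative explorer route", "source": "printed map"},
}

rancho_shape = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-121.8, 36.6], [-121.7, 36.6], [-121.7, 36.7],
                         [-121.8, 36.7], [-121.8, 36.6]]],
    },
    "properties": {"name": "illustrative rancho extent"},
}

# Dataset-level metadata sits alongside the element-level properties.
layer = {
    "type": "FeatureCollection",
    "metadata": {"creator": "example project", "license": "unspecified"},
    "features": [mission_point, route_line, rancho_shape],
}
```

Because attributes travel with each feature and metadata travels with the layer, a viewer can both author a map from the collection and let users click through from an element to its documentation.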
ECAI Silk Road Routes
• Color-coded routes of various explorers are displayed
• Options on the left enable overlay of printed maps and links to videos, images, manuscripts, and source documentation
Blue Dots Project
• Use the same methodologies as mapping to explore a complete text corpus
• Three-dimensional representation of data
• Linkage to metadata, attributes, and source information
• Multiple views; search within the dataset
• Creating a customized work environment for the scholar's analysis, visualization, and publication of results
A custom work environment for the scholar: Blue Dots linked to source data (sketched below)
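To make the idea of "dots linked to source data" concrete, here is a minimal sketch of one possible data structure. This is an assumption for illustration, not the actual Blue Dots implementation; all identifiers, attribute names, and the example URL are hypothetical.

```python
# Minimal sketch (not the actual Blue Dots data model): each text unit in a
# corpus becomes a "dot" with coordinates within the corpus plus links back
# to its metadata and source, so map-style selection and search can be reused.
corpus_dots = [
    {
        "dot_id": "T0001-0042",                       # hypothetical identifier
        "position": {"volume": 1, "section": 3, "line": 42},
        "term": "example term",
        "attributes": {"language": "Classical Chinese", "genre": "example genre"},
        "source": {"edition": "hypothetical digital edition",
                   "page_image": "http://example.org/img/42"},
    },
]

# Map-style query: select every dot whose attributes match a condition,
# then follow the source link from any hit.
hits = [d for d in corpus_dots if d["attributes"].get("genre") == "example genre"]
for dot in hits:
    print(dot["dot_id"], dot["source"]["page_image"])
```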
ECAI Religious Atlas of Asia
'30,000-foot' view: all religions, random overlay – an imprecise tool
Linked Open Data – the details
Required functions to ingest data
Figure 6.7: Architecture of a Linked Data application that implements the crawling pattern
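For readers who want to see the crawling pattern in code, here is a minimal sketch, assuming Python with the rdflib library: dereference a seed URI, merge the returned triples into a local graph, and follow rdfs:seeAlso links for a fixed number of hops. The seed URI is a hypothetical example; a real ingest pipeline would add content negotiation, politeness delays, and provenance tracking.

```python
# Minimal sketch of the Linked Data crawling pattern with rdflib:
# dereference a seed URI, merge the returned triples into a local graph,
# then follow rdfs:seeAlso links for a fixed number of hops.
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

def crawl(seed_uri: str, max_hops: int = 1) -> Graph:
    graph = Graph()
    frontier = {URIRef(seed_uri)}
    seen = set()
    for _ in range(max_hops + 1):
        next_frontier = set()
        for uri in frontier - seen:
            seen.add(uri)
            try:
                graph.parse(str(uri))      # HTTP dereference + RDF parse
            except Exception:
                continue                   # skip unreachable or non-RDF URIs
            # queue linked resources for the next hop
            next_frontier.update(o for o in graph.objects(uri, RDFS.seeAlso)
                                 if isinstance(o, URIRef))
        frontier = next_frontier
    return graph

# local_graph = crawl("http://example.org/resource/silk-road")  # hypothetical URI
```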
Identifying Requirements for Data Reuse & Linking
The use of data sets generated by others in the past can be impeded in many different ways – the hard drive crashed and there was no back-up; the person who could give permission cannot be found; and so on. There are some clearly distinct barriers to be overcome. Here is one typology:
1. Discovery: Does a suitable data set exist?
2. Location: Where is a copy?
3. Deterioration: Is the copy too deteriorated and/or obsolete to be usable?
4. Permission: May it be used?
5. Interoperability: Is it standardized enough to be usable with acceptable effort?
6. Description: Is it clear enough what the data represent?
7. Trust: Are the lineage, version, and error rate understood and acceptable?
8. Use: Should I use it for my purpose?
Data Management as Bibliography, Michael Buckland, 2011
Data Interoperability & Reuse Buckland, Zerneke 2011
Should you use it?
• Fitness for purpose:
  • Is the data adequately documented to allow confidence in the content for your purpose?
  • What is the uncertainty / ambiguity of the data?
  • Are the sources of the data adequately documented?
Editor's Notes – Supporting Scholars
• Liberate the notes!
• Expand 'publication' to notes
• Notes as a primary resource
• Expand 'library'
• Published volumes as derivative
• Modernize bibliography
• Preserve the 'workshop'
• Make the work environment closer to a shared office (convergence), enable collaboration, and support creativity
ecai.org/mellon2010 and ecai.org/KnowledgeUnix
More in Ryan Shaw's presentation later in this session
Early California Cultural Atlas
UC Riverside and the Electronic Cultural Atlas Initiative, with the California Center for Native Nations at UCR, the Stanford Spatial History Lab, and the National Center for the Teaching of History in the Schools
Images: Map of North America, 1685; ECPP baptism record for Mission San Carlos; Ranchos San Bonito and El Pescadero
ECCA dataset sources
• Most of the data layers come from institutionally supported data sources:
  • California Digital Library
  • State of California
  • Huntington Library – Early California Population Project
  • Library of Congress
  • David Rumsey Map Collection
  • ECAI ePublication – North American Missions
• One new dataset had to be compiled from historical sources: estimated locations of Indian villages
ECCA Data Evaluation
• Usability evaluation
  • Discoverable, physically and legally available
  • Functional, with cost-effective methods of ingest
  • Semantic compatibility, acceptable level of uncertainty
• ECCA Typology of Uncertainty & Ambiguity
  • We defined uncertainty as a combination of multiple factors which affect the accuracy and precision of data
  • Ambiguity is uncertainty whose source is differences of opinion, perception, or understanding of the data; in humanities projects, ambiguity is accepted and is not expected to be eliminated
  • For ECCA we divided uncertainty characteristics into two dimensions: source and type
Sources of Uncertainty
In the ECCA project, each data layer has unique uncertainty and ambiguity. We have characterized the sources of this uncertainty below.
• Spatio-temporal paradigm diversity (ambiguity / subjectivity)
  • The perception of time and place in different communities affects how place and 'land use' are documented
• Data recording and collection (kinds of ambiguity, accuracy, and precision)
  • What was recorded and what has been preserved
  • Cultural perspectives and technology have influenced what was recorded
  • Events that followed affect preservation
• Data characterization / categorization (generalization and interpretation)
  • Deciding how to convert the collected data into categories and objects which can be visualized and analyzed
  • Building / using ontologies with mapping
Types of Uncertainty
ECCA composite characterization of the types of uncertainty (a metadata sketch follows):
• Accuracy – Is there a knowable correct value? How close are we to it?
• Precision – exactness of measurement
• Lineage of the data – documenting sources & metadata
• Legal / protocol limitations – what data is available for use
• Credibility – reliability of the information source
• Completeness – data sample size / number of observations
  • What percent of the total items do we know?
  • Documentation of any known areas of missing data
• Scale – for maps and timelines, scale is important
  • What scale is appropriate for what we know or can represent about the data?
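One way to make this characterization operational is to attach it to each data layer as structured metadata. The sketch below is an illustrative Python data structure that follows the slide's categories; the field values describe a hypothetical village-locations layer and are not ECCA's actual records.

```python
# Minimal sketch: an ECCA-style uncertainty characterization recorded as
# structured metadata on a data layer. Field names follow the slide;
# the example values are hypothetical.
from dataclasses import dataclass

@dataclass
class UncertaintyRecord:
    source: str        # which source of uncertainty dominates this layer
    accuracy: str      # is there a knowable correct value, and how close are we?
    precision: str     # exactness of measurement
    lineage: str       # documentation of sources and metadata
    legal_limits: str  # what data is available for use
    credibility: str   # reliability of the information source
    completeness: str  # sample size / known gaps
    scale: str         # scale the data can support for display

village_layer_uncertainty = UncertaintyRecord(
    source="data characterization / categorization",
    accuracy="locations estimated from narrative sources",
    precision="roughly to the nearest few kilometres",
    lineage="compiled from mission registers and ethnographic literature",
    legal_limits="publicly available sources only",
    credibility="secondary compilation, cross-checked where possible",
    completeness="unknown fraction of villages recorded",
    scale="suitable for regional display, not parcel-level mapping",
)
```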
Role of Scale
• Datasets can be prepared for use at specific levels of 'certainty', i.e. digits of resolution
• For maps and timelines, scale is a crucial component of the visualization design. Precision implies scale in spatial and temporal data and affects the scale at which it is appropriate to represent the data. If data is presented at an incorrect scale it can appear either more or less specific than the data warrants. Other aspects of uncertainty can also affect the appropriate scale of data representation.
• Interactive maps display changes of scale seamlessly with zoom functions. At small scales, lines are generalized and labels are moved around or even dropped when they won't fit; 'zooming in' triggers display of data with greater precision. For some implementations we will need datasets customized for different scales of display (see the sketch below).
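A minimal sketch of the last point, assuming a tile-style zoom level as the measure of scale: the display chooses among datasets customized for different scales, so data is never drawn more precisely than its uncertainty warrants. The zoom thresholds and layer names are hypothetical.

```python
# Minimal sketch: pick which customized dataset to display at a given zoom
# level. Thresholds and layer names are hypothetical placeholders.
SCALE_LAYERS = [
    (0,  "regions_generalized"),   # small scale: generalized shapes, few labels
    (8,  "villages_approximate"),  # mid scale: approximate village locations
    (12, "villages_detailed"),     # large scale: full precision, all labels
]

def layer_for_zoom(zoom: int) -> str:
    """Return the most detailed layer whose minimum zoom is satisfied."""
    chosen = SCALE_LAYERS[0][1]
    for min_zoom, layer in SCALE_LAYERS:
        if zoom >= min_zoom:
            chosen = layer
    return chosen

print(layer_for_zoom(10))  # -> "villages_approximate"
```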
Native California Ethno-geography
A case study of ambiguity in the origin of Indians who were baptized at Mission San Juan Bautista
Villages and Networks Complex GIS Data: Individual villages, networks of villages that functioned together, and villages that changed locations.
The Synthesis – A Gazetteer with One Set of Locations for Linking to Data
Together, the detailed and synthesis gazetteers create a dataset that allows linking to objects at different scales (see the sketch below).
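As an illustration of how a detailed gazetteer and a synthesis gazetteer can work together, here is a minimal Python sketch: several recorded spellings of a village resolve to one synthesis location that baptism records can link to. All identifiers, names, and coordinates are hypothetical.

```python
# Minimal sketch of linking a detailed gazetteer to a synthesis gazetteer:
# multiple recorded variants of a village resolve to one synthesis entry
# with a single location. Identifiers and coordinates are hypothetical.
detailed_gazetteer = [
    {"id": "v-101a", "name_as_recorded": "Village A, spelling 1", "synthesis_id": "V-101"},
    {"id": "v-101b", "name_as_recorded": "Village A, spelling 2", "synthesis_id": "V-101"},
]

synthesis_gazetteer = {
    "V-101": {"preferred_name": "Village A",
              "location": (-121.4, 36.8),
              "location_type": "estimated village site"},
}

def locate(detailed_id: str):
    """Resolve a detailed-gazetteer entry to its synthesis location."""
    entry = next(d for d in detailed_gazetteer if d["id"] == detailed_id)
    return synthesis_gazetteer[entry["synthesis_id"]]["location"]

print(locate("v-101b"))  # -> (-121.4, 36.8)
```

A baptism record need only cite the detailed-gazetteer identifier it was transcribed with; the synthesis layer supplies a single mappable location at the appropriate scale.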
Visualization of Ethno-geographic Change over Time
Map legend: village location, active baptisms, depopulated village sites, mission site, rancho site
Reality – Creating Research Environments
• What we are doing is creating a custom research environment for each project
• How can we create sharable tools that implement this process?
• Can we create infrastructure that supports data collection and analysis, including authoring systems that are customizable to match your research and publication needs?
What we still need – Back to the Vision
• Research process support
  • Development of infrastructure that helps researchers, data creators, and catalogers construct the metadata needed to support decision-making about fitness of data for re-use – for specified quality levels, specific analysis methods, and appropriate visualization tools
• Data evaluation interface
  • Wouldn't it be great if we had a plug-and-play interactive context and quality visualization interface for scholars and visualization authors, so you could see whether the data is what you need?
• Visualization authoring systems
  • Easily customizable visualization authoring interfaces that can be deployed to match your research questions and allow publication of your discoveries
Thank you I hope we see much to inspire us in this session and the rest of the conference!