HIST*4170 Data: Big and Small 29 January 2013
Today’s Agenda • Blog Updates • A Short Introduction to Databases • A Big Data Project: People In Motion • Special Guest: Dr. Rebecca Lenihan
Blog Highlights • Ambition • Consider scalability • Consider source availability – local advantage? • Keep your eye on the academic value • What do you want to teach? Learn? • Themes: war, sport, family, mapping • Intellectual property/privacy • Resources: • Google Sketchup • To make 3D buildings
Data Deluge • Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte... • Library of Congress = 200 terabytes • Transferring "Libraries of Congress" of data • IP traffic is around 667 exabytes • It's a deluge... • Ian Milligan, "Preparing for the Infinite Archive: Social Historians and the Looming Digital Deluge" (Mar 23, Tri-U history conference) • "Big Data" • too large for current software to handle • Don't be intimidated • Not all DH sources (yet)
Introduction to Databases • Database – a system that allows for the efficient storage and retrieval of information • What we associate it with... • Computers changed things considerably • Problems: organization and efficient retrieval • Organization requires data structures • Efficient retrieval requires efficient algorithms • Potential for the Humanities? • ...new problems, questions, visualizations, and objects worthy of study and reflection.
Database Design • The purpose of a database is to store information about a particular domain and to allow one to ask questions about the state of that domain. • Relational databases are more efficient because they store each kind of information separately and connect it through relationships • Attributes • Relationships • The Quamen reading is a nice introduction • Not as complicated as you might think, but following the rules is important • We will apply...
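A minimal sketch of this kind of relational design, using Python's built-in sqlite3 module; the table names and columns are illustrative assumptions, not the course's actual schema:

```python
import sqlite3

# Relational design: attributes about a person are stored once, and census
# entries relate back to that person through a foreign key.
conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

cur.executescript("""
CREATE TABLE person (
    person_id  INTEGER PRIMARY KEY,
    last_name  TEXT NOT NULL,
    first_name TEXT NOT NULL,
    gender     TEXT,
    birthplace TEXT
);
CREATE TABLE census_record (
    record_id      INTEGER PRIMARY KEY,
    person_id      INTEGER REFERENCES person(person_id),
    census_year    INTEGER,
    age            INTEGER,
    marital_status TEXT
);
""")

cur.execute("INSERT INTO person VALUES (1, 'Smith', 'Mary', 'F', 'Ontario')")
cur.execute("INSERT INTO census_record VALUES (1, 1, 1871, 24, 'single')")
cur.execute("INSERT INTO census_record VALUES (2, 1, 1881, 34, 'married')")

# Ask a question about the state of the domain: each person's age at each census.
for row in cur.execute("""
    SELECT p.first_name, p.last_name, c.census_year, c.age
    FROM person p JOIN census_record c ON p.person_id = c.person_id
"""):
    print(row)
```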
New approach: Crowdsourcing • An "online, distributed problem-solving and production model." • Daren C. Brabham (2008), "Crowdsourcing as a Model for Problem Solving: An Introduction and Cases", Convergence: The International Journal of Research into New Media Technologies 14 (1): 75–90 • Cited in Wikipedia, where "Anyone with Internet access can write and make changes to Wikipedia articles..." • reCAPTCHA • Luis von Ahn • Others... • Google?
There are limitations... • Organization • Quality Control • Selection
A Database for Your Project? • Think about how you might use a database • but perhaps not too big! • Databases can be very small and still be DH-worthy • Are there public docs out there that you can digest? • Google Refine • Incorporate a search function into your website? • Resources • MS Excel (spreadsheet) • MS Access (relational database) • Google Refine • Cleaning data
Assignment for Next Week • Reading: TBD (3D guns?) • Help someone else out with their project • Read their blog • Comment and provide detailed feedback • Find a collaborator?
People in Motion: Creating Longitudinal Data from Canadian Historical Census
What we are working towards • 'Unbiased' links connecting individuals/households over several census years • A comprehensive infrastructure of longitudinal data • Census years: 1851, 1871, 1881, 1891, 1901, 1906, 1911, 1916; US 1880 and US 1900 Censuses
Current Work • Automatic linking of 100% of the 1871 Census (3,601,663 records) to 100% of the 1881 Census (4,277,807 records) • Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta
Existing (True) Links • Ontario Industrial Proprietors – 8429 links • Logan Township – 1760 links • St. James Church, Toronto – 232 links • Quebec City Boys – 1403 links • Bias concerns • family context • others? • (Map: Guelph and Logan Twp)
Attributes for Automatic Linking • Last Name – string • First Name – string • Gender – binary • Birthplace – code • Age – number • Marital status – single, married, divorced, widowed, unknown
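A minimal sketch of how these attributes might be represented in code; the field names and example values are illustrative assumptions, not the project's actual data model:

```python
from dataclasses import dataclass
from enum import Enum

class MaritalStatus(Enum):
    SINGLE = "single"
    MARRIED = "married"
    DIVORCED = "divorced"
    WIDOWED = "widowed"
    UNKNOWN = "unknown"

@dataclass
class CensusRecord:
    last_name: str                   # string
    first_name: str                  # string
    gender: str                      # binary: "M" or "F"
    birthplace: int                  # coded value
    age: int                         # number
    marital_status: MaritalStatus

# A hypothetical 1871 entry
rec = CensusRecord("Smith", "Mary", "F", 35, 24, MaritalStatus.SINGLE)
print(rec)
```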
Automatic Linkage • The challenges: 1) Identify the same person 2) Deal with attribute characteristics 3) Manage computational expense • The system: data cleaning and standardization, blocking to manage computational expense, and record comparison (detailed on the following slides)
Data Cleaning and Standardization • Cleaning • Names – remove non-alphanumeric characters; remove titles • Age – transform non-numerical representations to corresponding numbers (e.g. 3 months) • All attributes – deal with English/French notations (e.g. days/jours, married/mariée) • Standardization • Birthplace codes and granularity • Marital status
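A minimal sketch of this kind of cleaning in Python; the title list, marital-status map, and age rule are illustrative assumptions rather than the project's actual standardization tables:

```python
import re

TITLES = {"mr", "mrs", "miss", "dr", "rev"}          # illustrative title list
MARITAL_MAP = {                                      # English/French notations
    "married": "married", "mariee": "married",
    "single": "single", "celibataire": "single",
    "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
}

def clean_name(raw: str) -> str:
    """Remove non-alphanumeric characters and leading titles."""
    tokens = re.sub(r"[^A-Za-z0-9 ]", "", raw).lower().split()
    return " ".join(t for t in tokens if t not in TITLES)

def clean_age(raw: str) -> float:
    """Turn non-numeric ages such as '3 months' / '3 mois' into years."""
    m = re.match(r"(\d+)\s*(months?|mois)", raw.strip().lower())
    return int(m.group(1)) / 12.0 if m else float(raw)

def standardize_marital(raw: str) -> str:
    return MARITAL_MAP.get(raw.strip().lower(), "unknown")

print(clean_name("Mrs. O'Brien"))      # -> "obrien"
print(clean_age("3 months"))           # -> 0.25
print(standardize_marital("mariee"))   # -> "married"
```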
Computational Expense • Very expensive to compare all the possible pairs of records • Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census) • Run-time estimate: 3.5M x 4M = 1.4 x 10^13 record pairs; at 4M comparisons per second that is about 3.5M seconds, i.e. / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) ≈ 40.5 days per attribute compared, so comparing two attributes roughly doubles it. (Big Data)
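The same back-of-envelope estimate as a quick calculation (the 4 million comparisons per second figure is the slide's assumed throughput):

```python
# Cost of comparing every 1871 record against every 1881 record.
records_1871 = 3_500_000
records_1881 = 4_000_000
comparisons_per_second = 4_000_000      # assumed hardware throughput

pairs = records_1871 * records_1881     # 1.4e13 record pairs
seconds = pairs / comparisons_per_second
days = seconds / 60 / 60 / 24
print(f"{days:.1f} days per attribute compared")   # ~40.5 days
```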
Managing Computational Expense • Blocking • By first letter of last name • By birthplace • Using HPC • Running the system on multiple processors in parallel
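A minimal sketch of blocking, using plain dicts keyed by the slide's attribute names; only records that share a block key are ever compared, which avoids the full all-pairs comparison:

```python
from collections import defaultdict

def block_key(record: dict) -> tuple:
    """Block on the first letter of the last name plus the birthplace code."""
    return (record["last_name"][:1], record["birthplace"])

def build_blocks(records):
    """Group one census's records by their block key."""
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    return blocks

def candidate_pairs(records_a, blocks_b):
    """Yield only the pairs that share a block key."""
    for a in records_a:
        for b in blocks_b.get(block_key(a), []):
            yield a, b
```

Each block can also be handed to a separate processor, which is how running the system in parallel on HPC resources fits in.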
Record Comparison • Comparing strings: Jaro-Winkler, Edit Distance, Double Metaphone • Age: +/- 2 years • Exact matches: Gender, Birthplace
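A minimal sketch of a pairwise comparison rule using a plain edit-distance implementation for the names (in practice library implementations of Jaro-Winkler and Double Metaphone would be used as well); the thresholds and ten-year census gap are illustrative assumptions:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0..1 similarity score."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest

def records_match(r1: dict, r2: dict, years_apart: int = 10) -> bool:
    """Illustrative rule: exact gender and birthplace, age within +/- 2 years
    of the expected inter-census gap, and similar first/last names."""
    if r1["gender"] != r2["gender"] or r1["birthplace"] != r2["birthplace"]:
        return False
    if abs((r2["age"] - r1["age"]) - years_apart) > 2:
        return False
    return (name_similarity(r1["last_name"], r2["last_name"]) > 0.85 and
            name_similarity(r1["first_name"], r2["first_name"]) > 0.85)
```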