Google Books: Where we're going and how we got here. Jon Orwant, Engineering Manager, Google Books
Overview
• Why and how Google scans books
• The Google Books settlement
• From pages to ideas
Google Confidential and Proprietary
Why and How Google Scans Books
Google’s mission: To organize the world’s information and make it universally accessible and useful.
• Online content: billions of web pages
• Offline content: billions of items becoming indexed
Google Books in a nutshell
Vital stats
Scans
• Number of books scanned: 15M+
• Number of pages: 4B
• Number of words: 2T
• Libraries: 40+
• Publishers: 30K+
Metadata
• Number of books: 130M
• Number of records: 4B
• Number of metadata fields: 1T
Identifying the book (Library of Congress record vs. Books in Print record)
• title: Lord of the Rings, v.1 vs. The Fellowship of the Ring
• author: John Roland Reuel Tolkien vs. J.R.R. Tolkien
• publisher: Houghton Mifflin vs. Ballantine Books
• year: 1954 vs. 1994
How Google Handles Metadata
• Collect data from 100+ sources (libraries, commercial aggregators, union catalogs, publishers, retailers)
• Parse the records into our internal format
  • MARC, ONIX, others...
  • "UVA stores item data and call numbers in 955$a..."
• Cluster the records into expressions and manifestations
• Create a "best of" record for each cluster
• Index and display elements of that record on books.google.com
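The cluster and "best of" steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Google's pipeline: the record layout, the `cluster_key` heuristic (normalized title plus author surname), and the majority-vote rule in `best_of` are all hypothetical, and the sketch makes no attempt to separate expressions from manifestations.

```python
import re
from collections import Counter, defaultdict


def normalize(text):
    """Lowercase and drop punctuation so near-identical strings compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def cluster_key(record):
    # Hypothetical heuristic: normalized title plus the last word of the author name.
    surname = normalize(record["author"]).split()[-1]
    return (normalize(record["title"]), surname)


def cluster(records):
    """Group records that share the same cluster key."""
    clusters = defaultdict(list)
    for record in records:
        clusters[cluster_key(record)].append(record)
    return clusters


def best_of(cluster_records):
    """Build one record per cluster by taking the most common value of each field."""
    fields = sorted({field for record in cluster_records for field in record})
    best = {}
    for field in fields:
        values = [r[field] for r in cluster_records if r.get(field)]
        if values:
            best[field] = Counter(values).most_common(1)[0][0]
    return best


# Illustrative records only, as if already parsed from different sources.
records = [
    {"title": "The Fellowship of the Ring", "author": "J.R.R. Tolkien",
     "publisher": "Ballantine Books", "year": "1994"},
    {"title": "The fellowship of the ring.", "author": "John Roland Reuel Tolkien",
     "publisher": "Ballantine Books", "year": "1994"},
    {"title": "The Fellowship of the Ring", "author": "J.R.R. Tolkien",
     "publisher": "Ballantine", "year": "1994"},
    {"title": "The Hobbit", "author": "J.R.R. Tolkien",
     "publisher": "Allen & Unwin", "year": "1937"},
]

clusters = cluster(records)
merged = {key: best_of(group) for key, group in clusters.items()}
```

A real system would need fuzzier matching (inverted author names, subtitles, volume numbers) and per-source trust weights rather than a simple majority vote.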
478 languages
Kabardian: 16; Khasi: 78; Khoisan: 53; Khotanese: 21; Kikuyu, Gikuyu: 48; Kinyarwanda: 77; Kirghiz, Kyrgyz: 702; Kimbundu: 14; Konkani: 83; Komi: 48; Kongo: 134; Korean: 35905; Kosraean: 10; Kpelle: 6; Karachay-balkar: 17; Karelian: 28; Kru: 26; Kurukh: 30; Kuanyama: 9; Kumyk: 16; Kurdish: 220; Kutenai: 0; Klingon: 3; Kalmyk: 26; Kashubian: 14; Kara-kalpak: 102; Kabyle: 50; Kachin: 18; Kalaallisut: 82; Kamba: 29; Kannada: 2600; Karen: 50; Kashmiri: 289; Kanuri: 25; Kawi: 106; Kazakh: 1871
Material content & form
<datafield tag="245" ind1=" " ind2=" ">
  <subfield code="a">[Turkey probe]</subfield>
</datafield>
<datafield tag="260" ind1=" " ind2=" ">
  <subfield code="a">Syracuse : Betty Crocker Supplies, ca 1987</subfield>
</datafield>
<datafield tag="300" ind1=" " ind2=" ">
  <subfield code="a">1 pointy thing , 46 cm.</subfield>
</datafield>
<datafield tag="650" ind1=" " ind2=" ">
  <subfield code="a">Microwave cookery</subfield>
</datafield>
<datafield tag="650" ind1=" " ind2=" ">
  <subfield code="a">April Fool's Day</subfield>
</datafield>
Parsing Uncertain Dates
• 18??
• [196-?]
• 1957/8
• late 14th century
• finita quarto nonas Januarias [1490]
• mense Septembri: Anno Millesimo q[ui]ngentesimo decimonono
• mense iulio, anno M.D.XXXX
• התשנ״א (Hebrew year 5751 = Gregorian 1990/1 CE)
• ١٣٧٣ (either Islamic year 1373 AH = Gregorian 1953/4 CE or Persian year 1373 AP = Gregorian 1994/5 CE)
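To give a feel for what parsing these dates involves, here is a minimal sketch that maps a few of the patterns above to a range of possible Gregorian years. The function names and the rules are assumptions made for illustration, not the parser described in the talk. It deliberately returns nothing for spelled-out Latin years and for Hebrew or Arabic-script years, since those need calendar conversion and, in the ١٣٧٣ case, a decision about which calendar is meant.

```python
import re

ROMAN_VALUES = {"M": 1000, "D": 500, "C": 100, "L": 50, "X": 10, "V": 5, "I": 1}


def roman_to_int(numeral):
    """Convert a Roman numeral, tolerating additive forms such as XXXX."""
    values = [ROMAN_VALUES[ch] for ch in numeral]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (IV, XC, ...).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def parse_year_range(text):
    """Return (earliest, latest) Gregorian years, or None if not recognized."""
    # "1957/8": a date spanning two adjacent years (no century rollover handling).
    m = re.search(r"(?<![0-9])([0-9]{4})/([0-9]{1,2})(?![0-9])", text)
    if m:
        first, suffix = m.group(1), m.group(2)
        return (int(first), int(first[: 4 - len(suffix)] + suffix))
    # "18??" or "[196-?]": leading digits known, trailing digits unknown.
    m = re.search(r"(?<![0-9])([0-9]{2})(?:\?\?|--)|(?<![0-9])([0-9]{3})[?-]", text)
    if m:
        prefix = m.group(1) or m.group(2)
        span = 10 ** (4 - len(prefix))
        low = int(prefix) * span
        return (low, low + span - 1)
    # A plain four-digit year, possibly bracketed, as in "[1490]".
    m = re.search(r"(?<![0-9])([0-9]{4})(?![0-9])", text)
    if m:
        return (int(m.group(1)), int(m.group(1)))
    # "14th century": the whole century; qualifiers such as "late" are ignored.
    m = re.search(r"(?<![0-9])([0-9]{1,2})(?:st|nd|rd|th) century", text)
    if m:
        n = int(m.group(1))
        return ((n - 1) * 100 + 1, n * 100)
    # "M.D.XXXX": an upper-case Roman numeral starting with M, dots allowed.
    m = re.search(r"\b(M[MDCLXVI.]*[MDCLXVI])\b", text)
    if m:
        year = roman_to_int(m.group(1).replace(".", ""))
        return (year, year)
    return None
```

Returning a range rather than a single year keeps the uncertainty visible to whatever consumes the parsed value. The patterns match only ASCII digits on purpose, so that ١٣٧٣ is not silently read as Gregorian 1373.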
Google Books Settlement
• If approved, resolves lawsuit brought against Google by AAP & AG
• Benefits:
• Rightsholder control
• Snippets => 20%
• Library subscriptions
• Free terminal in every US public library building
• Downloadable books for purchase
• Access for the print-disabled
• Book Rights Registry: a non-profit organization to find and pay rightsholders
• Research corpus
Linguistic Analysis "Research that performs linguistic analysis over the Research Corpus to understand language, linguistic use, semantics and syntax as they evolve over time and across different genres or other classifications of Books."
Books as a corpus of human knowledge
• Understand one book
• Understand all books
• Understand relations between books
Insights into human progress
Old-fashioned trigrams: oxide of lead; may be thus; a heavy fire; a striking proof; miles distant from; terms of peace; presents the appearance; more than mortal; vexation of spirit; zeal and devotion
New-fangled trigrams: lesbian and gay; health care professionals; abuse and neglect; the overall process; shift away from; the power elite; a research project; the poor countries; probability of failure; increased awareness of
Source: Matthew Gray & Yuan K. Shen
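One plausible way to surface trigrams that are characteristic of one period is to compare smoothed relative frequencies between an older and a newer set of texts. The sketch below assumes that approach; the slide does not say how Gray and Shen produced their lists, and the function names and tiny sample texts are made up for illustration.

```python
import re
from collections import Counter


def trigrams(text):
    """Yield consecutive three-word sequences from lowercased text."""
    words = re.findall(r"[a-z']+", text.lower())
    return zip(words, words[1:], words[2:])


def trigram_counts(texts):
    """Count every trigram across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(" ".join(gram) for gram in trigrams(text))
    return counts


def distinctive(counts_a, counts_b, top=3, smoothing=1.0):
    """Rank trigrams by how much more frequent they are in A than in B."""
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)

    def ratio(gram):
        # Add-one style smoothing keeps unseen trigrams from dividing by zero.
        p_a = (counts_a[gram] + smoothing) / total_a
        p_b = (counts_b[gram] + smoothing) / total_b
        return p_a / p_b

    return sorted(vocab, key=lambda gram: (-ratio(gram), gram))[:top]


# Tiny made-up samples standing in for older and newer books.
old_counts = trigram_counts([
    "vexation of spirit and zeal and devotion",
    "vexation of spirit",
])
new_counts = trigram_counts([
    "health care professionals shift away from",
    "health care professionals",
])

old_fashioned = distinctive(old_counts, new_counts, top=1)
new_fangled = distinctive(new_counts, old_counts, top=1)
```

On a corpus of real size, a minimum-count threshold would be needed so that rare OCR errors do not dominate the ranking.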
Semantic Stack
Semantic Stack (video remix)
Reframing the Victorians (Cohen & Gibbs, GMU)
Victorian terms
Discipline-specific progress occurs by... ...moving up one level ...or improving the results at one level by creating a reusable data set ...or reasonably using one level as a proxy for a higher level
Reframing the Victorians: ...reasonably using one level as a proxy for a higher level
Interdisciplinary progress occurs by... ...moving up one level ...or improving the results at one level ...by creating infrastructure that can be used by others
Intralanguage translations (Efron, U. Illinois): "Meeting the Challenge of Language Change in Text Retrieval with Machine Translation Techniques"
Intralanguage translations: improving the results at one level ...by creating infrastructure that can be used by others
Grammar inference (Abney & Szymanski, Univ. Michigan): "Automatic Identification and Extraction of Structured Linguistic Passages in Texts"
Grammar inference: moving up one level ...by creating infrastructure that can be used by others
Thank You! Q&A