Big Data, Linked Data: Classification Research at the Junction. 24th ASIS&T SIG/CR Classification Research Workshop, 2 November 2013. Rebecca Green, OCLC, greenre@oclc.org; Michael Panzer, OCLC, panzerm@oclc.org. The Interplay of Big Data, WorldCat, and Dewey.
Roadmap
• Setting the stage
  • Big data
  • WorldCat as big data
  • Literary warrant and the DDC
• “Classification analytics”
  • Classified works
  • Access points
  • Trending topics
  • Structure of discipline
3 V’s of big data
• Volume
  • Terabytes (1000^4 bytes), petabytes (1000^5 bytes), exabytes (1000^6 bytes), . . .
  • Number of transactions vs. number of bytes
  • My big data is not your big data
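As a quick check on the unit arithmetic for volume, here is a minimal sketch. The helper name `to_bytes` is illustrative and does not come from the slides.

```python
# Decimal (SI) byte units named on the slide, expressed as powers of 1000.
UNITS = {
    "terabyte": 1000 ** 4,
    "petabyte": 1000 ** 5,
    "exabyte": 1000 ** 6,
}


def to_bytes(amount, unit):
    """Convert an amount in the given unit to bytes (hypothetical helper)."""
    return amount * UNITS[unit]


for name, size in UNITS.items():
    print(f"1 {name} = {size:,} bytes")
```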
3 V’s of big data – cont.
• Variety
  • Sources, perspectives, standards
  • Structured vs. unstructured data
  • Semantically related datasets
• Velocity
  • Data creation
  • Data analysis
WorldCat as big data
• Variety
  • Records in MARC Bibliographic Format
  • Records in MARC Holdings Format
  • Records in MARC Authority Format (e.g., LCSH, FAST, BISAC, MeSH, VIAF)
  • Vendor records
  • WorldCat knowledge base
  • Institutional registry data
  • Institution-specific acquisitions, circulation, ILL data
WorldCat as big data
• Volume
  • Bibliographic data: over 300 million records
  • Holdings data: over 2 billion records
  • Authority data
    • LCSH: 26.4 million headings
    • VIAF: 24.2 million clusters; 21 million links between records
Literary warrant and the DDC
• DDC editorial rules call for literary warrant to be taken into account for:
  • Expansions (i.e., development of new classes)
  • Reductions (i.e., discontinuing entire classes)
  • Form of name used in class descriptions
  • Order in which topics are listed in multitopic caption
  • Creation of and choice of examples in add instructions
  • Indexability of topics (print; WebDewey)
  • Form of name for index entries
Classified works
• Periodic profiles of distribution of classified works across the classification to identify:
  • Expansions: Disciplines/subjects with sufficient literary warrant
  • Reductions: Classes with insufficient literary warrant
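A profile of this kind could be sketched as a simple count of works per class number. The thresholds and function names below are assumptions made for illustration, since the slides do not state numeric cutoffs for sufficient or insufficient literary warrant.

```python
from collections import Counter

# Hypothetical cutoffs; the slides give no numeric thresholds.
EXPANSION_MIN = 5   # at least this many works suggests an expansion candidate
REDUCTION_MAX = 1   # at most this many works suggests a reduction candidate


def profile(class_numbers):
    """Count classified works per DDC number."""
    return Counter(class_numbers)


def flag(counts, schedule):
    """Return (expansion candidates, reduction candidates) among `schedule`."""
    expand = sorted(n for n in schedule if counts.get(n, 0) >= EXPANSION_MIN)
    reduce_ = sorted(n for n in schedule if counts.get(n, 0) <= REDUCTION_MAX)
    return expand, reduce_
```

Running such a profile periodically and comparing the flagged classes over time would show where warrant is building up or draining away.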
Classified works: Expansion warranted (1)
306.44 Language
    Including pragmatics
    Class here anthropological linguistics, ethnolinguistics, sociolinguistics
306.446 Bilingualism and multilingualism
306.449 Language planning and policy
306.449 4–.449 9 Specific continents, countries, localities in modern world
    Add to base number 306.449 notation 4–9 from Table 2, e.g., language policy of India 306.44954
Classified works: Expansion warranted (2)
• Records retrieved in WorldCat searches on dd:306.44* not dd:(306.440* or 306.446* or 306.449*)
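The search on this slide can be mimicked locally as a prefix filter. The sketch below assumes that a trailing * in the dd: index means simple prefix truncation and that class numbers are available as plain strings. The function name is hypothetical.

```python
def matches_expansion_query(ddc):
    """True if `ddc` matches 306.44* but none of 306.440*, 306.446*, 306.449*."""
    number = ddc.replace(" ", "")  # drop the spaces printed in long DDC numbers
    excluded = ("306.440", "306.446", "306.449")
    return number.startswith("306.44") and not number.startswith(excluded)
```

Records that pass the filter sit in 306.44 itself or in subdivisions that the schedule does not yet define, which is where evidence for a possible expansion would accumulate.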
Classified works: Reduction warranted (1)
006.33 *Knowledge-based systems
. . .
006.336 *Programming for knowledge-based systems
006.336 3 *Programming languages for knowledge-based systems
006.337 Programming for knowledge-based systems for specific types of computers, for specific operating systems, for specific user interfaces
006.338 *Programs for knowledge-based systems
Classified works: Reduction warranted (2)
• Records retrieved in WorldCat searches for disjunction of DDC class number and standard subdivisions of number
• Duplicates not filtered out of search results for 006.33
• Duplicates filtered out of all other search results
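The difference between filtered and unfiltered counts can be illustrated with a small sketch. It assumes each search returns a list of record identifiers; the identifiers and function names are invented for the example.

```python
def count_raw(result_sets):
    """Total hits across searches, counting a record once per search it appears in."""
    return sum(len(results) for results in result_sets)


def count_unique(result_sets):
    """Number of distinct records across searches (duplicates filtered out)."""
    seen = set()
    for results in result_sets:
        seen.update(results)
    return len(seen)
```

Comparing the two counts for the same class shows how much an unfiltered total overstates the number of distinct works.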
Access points
• Analysis of subject heading data in DDC categorized content to identify:
  • Areas where expansions (i.e., new classes) should be considered
  • Additional access points / mappings for DDC classes
  • Additional topics to be added to class description
Access points: Standing room topics and literary warrant
• DDC class
  004.678 *Internet
      Including extranets, virtual private networks
      Class here World Wide Web
• LCSH:
  010 ## $a sh 97006102
  150 ## $a Extranets (Computer networks)
  450 ## $a Virtual private networks (Computer networks)
• dd: 004.678* and (hl: extranets w computer w networks) retrieves 69 records
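The same kind of question (how often does a subject heading occur within a DDC class?) can be sketched over local data. The record layout with `ddc` and `headings` keys is an assumption for illustration and is not a WorldCat format.

```python
from collections import Counter


def heading_counts(records, class_prefix):
    """Count subject headings on records whose DDC number starts with `class_prefix`."""
    counts = Counter()
    for record in records:
        if record["ddc"].startswith(class_prefix):
            # Use a set so a heading repeated on one record is counted once.
            counts.update(set(record["headings"]))
    return counts
```

Headings with high counts in a class are candidates for new access points, or for topics to add to the class description.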
Access points: Topics added to class description
004.6 *Interfacing and communications
. . .
    Including sensor networks
. . .
006.22 *Embedded computer systems [formerly 004.1]
    Class here microcontrollers
    For a specific aspect of embedded computer systems, see the aspect, e.g., systems analysis and design of embedded computer systems 004.21, wireless sensor networks 004.6, software for embedded systems 005.3
Trending topics
• My trending topics are not your trending topics
  • Twitter: sudden high-magnitude spike in activity
  • DDC: “quick” achievement of literary warrant threshold + plateaus at steady rate
• Trending topic detection vs. new topic detection
  • Newly minted LCSHs
  • Chapter/paper titles
  • Conferences
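One way to make the DDC notion of a trend concrete is sketched below: a topic reaches a warrant threshold quickly and then keeps a steady, nonzero publication rate. The threshold, time limit, and tolerance are all hypothetical parameters; the slides do not specify how "quick" or "steady" would be measured.

```python
from itertools import accumulate


def years_to_warrant(yearly_counts, threshold):
    """Years until the cumulative count reaches `threshold`, or None if it never does."""
    for years, total in enumerate(accumulate(yearly_counts), start=1):
        if total >= threshold:
            return years
    return None


def is_plateau(yearly_counts, tolerance):
    """True if every year has some output and counts vary by at most `tolerance`."""
    if not yearly_counts or min(yearly_counts) <= 0:
        return False
    return max(yearly_counts) - min(yearly_counts) <= tolerance


def looks_like_ddc_trend(yearly_counts, threshold, max_years, tolerance):
    """Quick achievement of the threshold, followed by a steady rate."""
    years = years_to_warrant(yearly_counts, threshold)
    if years is None or years > max_years:
        return False
    return is_plateau(yearly_counts[years:], tolerance)
```

Under this definition a single burst of activity that then dies away, the Twitter-style spike, does not qualify, because the counts after the threshold year fall to zero.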
Trending topics: Conferences
• Big data: 29th British National Conference on Databases
• 1st Workshop on Architectures and Systems for Big Data
• Workshop on big data
• Big Data Analytics: First International Conference
• The Semantic Web: Semantics and Big Data: 10th International Conference
• 2012 workshop on Management of big data systems
• 2nd Workshop on Research in the Large: Using App Stores, Wide Distribution Channels and Big Data in UbiComp Research
• IEEE International Congress on Big Data
• Big Data 2 Knowledge (Workshop)
Trending topics: Chapter/paper titles
• Welcome to the big data age
• Big Brother and big data around the world
• How to make sense of big data?
• Business and social implications of big data
• Big data and health care
• How should big data abuses be addressed?
• What is big data?
• Does big-data equal big value?
• Big-data technologies
Structure of discipline
• Analysis of title data in DDC categorized content to identify facet structure of discipline
  • Retrieve bibliographic records from WorldCat for monographic literature
  • Isolate title data
  • Identify noun phrases in the titles
  • Use conceptual density measure of Agirre & Rigau
    • Disambiguate noun phrases
    • Identify appropriate generalizations
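The early steps of this pipeline (isolating titles and pulling out candidate phrases) can be roughed out as follows. This is a crude stand-in: it splits titles on a small hand-picked stopword list rather than using part-of-speech tagging, so it only approximates noun phrase identification, and it does not attempt the disambiguation or generalization steps based on conceptual density. The stopword list and function names are assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real analysis would need a fuller one.
STOPWORDS = {
    "a", "an", "and", "are", "around", "be", "does", "for", "how",
    "in", "is", "of", "on", "should", "the", "to", "what", "with",
}


def candidate_phrases(title):
    """Split a title into runs of consecutive non-stopwords."""
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())
    phrases, current = [], []
    for word in words:
        if word in STOPWORDS:
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases


def phrase_counts(titles):
    """Count candidate phrases across a list of titles."""
    counts = Counter()
    for title in titles:
        counts.update(candidate_phrases(title))
    return counts
```

Frequent phrases would then be the input to the disambiguation and generalization steps named above.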
The Interplay of Big Data, WorldCat, and Dewey That’s all, folks! -- Thank you = La fin -- Merci beaucoup