The Role of “Big Data” in Scientific Publishing Bradley P. Allen Chief Architect, Elsevier Presentation for panel on “Giving Voice to Content: Emerging Technologies” NFAIS 56th Annual Conference Philadelphia, PA, USA 2014-02-24
Why the scare quotes? Reference: http://ajharmony.tumblr.com/post/65901268958/mostlysignssomeportents-big-data-is-like, from a quote by Dan Ariely in https://www.facebook.com/dan.ariely/posts/904383595868
Audience poll: current data scales
How large is the amount of data your organization currently manages to produce its online products and services?
• Gigabytes
• Terabytes
• Petabytes
• Exabytes
What does big data mean to scientific publishing?
• Scientific publishing is the act of compressing a universe’s worth of data into small pieces of content that people can consume
• In essence, this is the ultimate big data problem
• But it is one in which until recently publishers have played a very simple role
• That is beginning to change
What are we beginning to do with big data?
• Create more useful content by enhancing it with data extracted from content
• Make the researcher’s life better by exploiting data about how content is used to improve her experience of using our online applications
• Enable research itself by supporting the care and feeding of experimental data at scale
Audience poll: big data use cases
Which of these uses of big data is most important for your organization?
• Extracting data from content
• Improving user experience through usage analytics
• Managing experimental data
• All of the above
• None of the above
Example: collaborative filtering in ScienceDirect
• When users look at articles on ScienceDirect (SD), they are provided links to other articles of interest
• Related Articles was originally implemented using bag-of-words similarity via a search engine query
• Goal: Increase click-through rate on Recommended Articles over the previous Related Articles offering; drive usage, engagement & revenue
• Pilot: Ran from March to July 2013, with 9 variants A/B tested on ~5% of SD traffic
• Production: Since Aug 2013
• Inputs
  • 5 years of SD usage data/events
  • All SD XML Articles
  • SNIP2 Journal Rankings
[Diagram labels: ~12M articles; 6 billion events; Thor; Roxie; Similarity; Co-download matrix; Attribute Ranking; pii-684259, pii_585346, pii_491635; pii-739156; Daily updates]
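To make the "co-download matrix" idea concrete, here is a minimal Python sketch of item-to-item collaborative filtering. It is not the ScienceDirect pipeline (which the slide places on Thor and Roxie and combines with attribute ranking); the cosine-style normalization, the function names, and the toy user and article IDs are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def co_download_counts(sessions):
    """Count how often each ordered pair of articles shares a downloader."""
    pair_counts = defaultdict(int)
    item_counts = defaultdict(int)
    for articles in sessions.values():
        unique = sorted(set(articles))
        for article in unique:
            item_counts[article] += 1
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return pair_counts, item_counts


def recommend(article, sessions, top_n=3):
    """Rank other articles by normalized co-download count (assumed scoring)."""
    pair_counts, item_counts = co_download_counts(sessions)
    scores = {}
    for (a, b), shared in pair_counts.items():
        if a == article:
            # Divide by popularity so heavily downloaded articles do not dominate.
            scores[b] = shared / sqrt(item_counts[a] * item_counts[b])
    return sorted(scores, key=lambda b: (-scores[b], b))[:top_n]


# Hypothetical usage data: user ID -> articles that user downloaded.
sessions = {
    "u1": ["A", "B", "C"],
    "u2": ["A", "B"],
    "u3": ["A", "C"],
    "u4": ["B", "D"],
}
recommended_for_a = recommend("A", sessions)
```

In this toy data, article C ranks above B for readers of A: both are co-downloaded with A twice, but C has fewer downloads overall, so its normalized score is higher.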
Audience poll: big data tools and platforms
Which big data tools/platforms are you currently using?
• Apache Hadoop
• A Hadoop distribution (Cloudera, MapR, Amazon EMR, …)
• LexisNexis HPCC
• Twitter Storm
• Rolling our own
• None of the above
How big data infrastructure works
• All of these tools and platforms basically make the following easy to do
  • Break data up into many chunks, each of which can fit into memory on a given machine
  • Send each chunk to a machine where it is processed into chunks containing intermediate results
  • Combine the intermediate results into a single aggregate data set
  • Lather, rinse, repeat…
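The chunk, process, combine pattern above can be sketched on a single machine in a few lines of Python. This is only an illustration of the pattern, not of any particular platform: the function names, the chunk size, and the toy download events are assumptions, and a real cluster would run each chunk on a separate machine.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Break the data into chunks small enough for one machine's memory."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Turn one chunk into an intermediate result: downloads per article."""
    return Counter(article_id for _user, article_id in chunk)


def combine(left, right):
    """Merge two intermediate results into one aggregate."""
    return left + right


def count_downloads(events, chunk_size=2):
    chunks = split_into_chunks(events, chunk_size)
    # Here the chunks are processed one after another in a single process;
    # a big data platform would distribute this step across machines.
    intermediate = map(process_chunk, chunks)
    return reduce(combine, intermediate, Counter())


# Hypothetical (user, article) download events.
events = [("u1", "a"), ("u2", "a"), ("u1", "b"), ("u3", "c"), ("u2", "b"), ("u3", "a")]
totals = count_downloads(events)
```

The result is the same whatever chunk size is chosen, which is the property that lets a platform spread the work over as many machines as are available.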
Big data technology issues (in no particular order)
• Talent acquisition
  • What training is needed to make big data platforms usable by our existing teams?
  • Who/what is a data scientist?
• Best practices and design patterns for big data
  • @nathanmarz’ Lambda Architecture
• The proliferation of big data platforms
  • HPCC, MapR, Cloudera…
• Cloud-based vs. hosted solutions
  • Amazon Elastic MapReduce, Redshift
• Data formats and practice for scaling ETL/ELT
  • Apache Avro, Google Protocol Buffers, zlib-compressed JSON
• Numerical computing frameworks for optimization
  • High-performance computing using GPUs
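Of the data formats listed, zlib-compressed JSON is the simplest to show. The sketch below uses only Python's standard `json` and `zlib` modules; the record layout and the `pack`/`unpack` helper names are hypothetical and are not taken from the talk.

```python
import json
import zlib


def pack(records):
    """Serialize records to JSON text, then compress the UTF-8 bytes with zlib."""
    text = json.dumps(records, sort_keys=True)
    return zlib.compress(text.encode("utf-8"))


def unpack(blob):
    """Reverse of pack: decompress, then parse the JSON text."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical usage events; repetitive field names compress well.
events = [
    {"user": f"u{i % 50}", "article": f"pii-{i % 200}", "action": "download"}
    for i in range(1000)
]
blob = pack(events)
raw_size = len(json.dumps(events, sort_keys=True).encode("utf-8"))
```

Unlike Avro or Protocol Buffers, this approach carries no schema, so each consumer has to know the record layout by convention.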
Can we use big data to enable new business models?
• These technologies can yield a wealth of infrastructure, tools, workflows and business models to clone and adapt to the special circumstances of scientific publishing
• Big data can open the door to optimizing the value exchange between author, publisher and reader
• This will require us to walk away from legacy preconceptions
  • Ask yourself: is it this way because it was done on paper?
• A thought experiment: gold open access as computational advertising
Big data is key to computational advertising Reference: S. Yuan, A.Z. Abidin, M. Sloan and J. Wang. Internet Advertising: An Interplay among Advertisers, Online Publishers, Ad Exchanges and Web Users. arXiv:1206.1754v1 [cs.IR] 8 Jun 2012.
Can big data enable computational publishing?
[Diagram labels: Authors; Researchers; Article exchanges; Publishers; flows labeled knowledge, credit, article inventories, time & focus, $$$, $$, ($)]
The simplified ecosystem of author-pays scientific publishing. Authors spend budget to buy article inventories from article exchanges and publishers; article exchanges serve as matchers for articles and journals; publishers provide valuable information to satisfy and keep researchers; researchers read articles and exchange credit for knowledge from the authors. Note that normally researchers would not receive cash from publishers.
Summary
• Big data can play a role in creating new value for researchers and institutions
• Ways in which big data is currently exploited in the consumer Internet provide guidance for its use by scientific publishers
Thank You Bradley P. Allen Chief Architect, Elsevier b.allen@elsevier.com @bradleypallen