Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928)
http://project-first.eu
Big data analytics
Miha Grčar (1, 2)
(1) Jožef Stefan Institute
(2) Sowa Labs GmbH
Outline
• What is big data? What caused it? Who should care?
• Solving big data problems
• Examples
What is big data?
• “How many terabytes?”
• We deliberately avoid being specific
• Big data refers to datasets that cannot be captured, stored, managed, and/or analyzed by the mainstream storage and processing devices
What caused big data? Storage capacity and processing power. Source: Hilbert and López, “The world’s technological capacity to store, communicate, and compute information,” Science, 2011
What caused big data? Data availability (industry). Source: IDC; US Bureau of Labor Statistics; McKinsey Global Institute analysis
What caused big data? Data availability (social media and mobile devices). Source: www.creotivo.com
What caused big data? Data availability (sensors). Source: Analyst interviews; McKinsey Global Institute analysis
What caused big data? Maturity of technologies & tools (figure labels: Emerging, hyped; Mature). Source: Gartner (July 2012)
Who should care about big data? Source: US Bureau of Labor Statistics; McKinsey Global Institute analysis
Solving big data problems
• Distributed infrastructure
  • Cloud: Amazon Elastic Compute Cloud (EC2)
• Distributed processing
  • MapReduce / batches: Hadoop
  • Distributed workflows / streams: Twitter Storm
• Distributed storage
  • Distributed FS/DB
  • NoSQL
Solving big data problems
• Distributed infrastructure
  • Cloud: Amazon Elastic Compute Cloud (EC2), Windows Azure, Google Cloud Platform, Cloudwatt…
• Distributed processing
  • MapReduce / batches: Hadoop, MS DryadLINQ, Disco, Misco, Phoenix, Cloud MapReduce, bashreduce, Qizmt…
  • Distributed workflows / streams: Storm (Twitter), S4 (Yahoo), “Real-time Hadoops”: Impala, HFlame, Spark…
• Distributed storage
  • Distributed FS/DB, NoSQL: Google File System, HDFS, Google Big Table, HBase, Cassandra, MongoDB, CouchDB, Hive…
Amazon EC2
• EC2 = ECC = Elastic Compute Cloud
• Central part of Amazon.com’s cloud computing service
• ~500,000 physical Linux machines
• Elastic: servers can be started and stopped in response to demand; you pay only for running servers
• Instances (several examples)
  • Micro: 1 ECU, 1 core, 613 MiB
  • High-Memory XL: 6.5 ECUs, 2 cores, 17.1 GiB
  • High-CPU XL: 20 ECUs, 8 cores, 7 GiB
• OS: Windows, Linux, FreeBSD
• Storage
  • Temporary instance storage
  • Persistent Elastic Block Store (EBS)
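The “elastic” point can be made concrete with a back-of-the-envelope sketch. It compares the server-hours consumed when servers are started and stopped every hour with the server-hours consumed when the peak capacity runs all the time. The demand figures and the per-server capacity are hypothetical, and no AWS API is involved; this only illustrates the idea of paying for running servers.

```python
import math


def servers_needed(requests_per_hour, capacity_per_server):
    """Smallest number of servers that covers the given hourly load."""
    return math.ceil(requests_per_hour / capacity_per_server)


def server_hours(hourly_demand, capacity_per_server):
    """Return (elastic, fixed) server-hours for a sequence of hourly loads."""
    per_hour = [servers_needed(d, capacity_per_server) for d in hourly_demand]
    elastic = sum(per_hour)                 # start/stop servers every hour
    fixed = max(per_hour) * len(per_hour)   # keep peak capacity running throughout
    return elastic, fixed


# Hypothetical load in requests per hour: quiet, then a peak, then quiet again.
DEMAND = [100, 100, 400, 1000, 400, 100]
ELASTIC, FIXED = server_hours(DEMAND, capacity_per_server=200)
print(ELASTIC, FIXED)
```

With these made-up numbers the elastic setup uses fewer than half the server-hours of the fixed one; the real saving depends entirely on how spiky the demand is.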
MapReduce (Hadoop)
• Input: a bunch of ballots, all mixed up…
• Map: still mixed up… (piles A B C, A B C, A B C)
• Reduce: election results: A: 321,015; B: 179,539; C: 201,734
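The ballot-counting picture can be sketched in a few lines of plain Python. This is a single-process illustration of the map, group, and reduce steps, not Hadoop code; the function names and the sample ballots are hypothetical.

```python
from collections import defaultdict


def map_ballot(ballot):
    """Map step: turn one ballot into a (candidate, 1) pair."""
    return (ballot, 1)


def group_by_candidate(pairs):
    """Shuffle step: collect all values that share the same key."""
    groups = defaultdict(list)
    for candidate, value in pairs:
        groups[candidate].append(value)
    return groups


def reduce_votes(candidate, values):
    """Reduce step: add up the votes of one candidate."""
    return candidate, sum(values)


def count_votes(ballots):
    pairs = [map_ballot(b) for b in ballots]
    groups = group_by_candidate(pairs)
    return dict(reduce_votes(c, v) for c, v in groups.items())


# Hypothetical mixed-up ballots.
print(count_votes(["A", "C", "B", "A", "B", "A"]))
```

In a real cluster the map calls run on many machines in parallel and the framework performs the grouping; the logic per ballot and per candidate stays this simple.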
MapReduce (Hadoop)
• Input data: 195005150700+0000, 195005151200+0022, 195005151800-0011, 194903241200+0111, 194903241800+0078
• Map output: (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)
• After sort, copy, and merge: (1950, [0, 22, -11]), (1949, [111, 78])
• Reduce output: (1950, 22), (1949, 111)
Source: Tom White, Hadoop: The Definitive Guide, 3rd ed., 2012 (O’Reilly & Yahoo! Press)
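The weather-record example can be reproduced with a minimal Python sketch. It assumes, based only on the five records shown on the slide, that the first four characters are the year and that everything from character 12 onward is a signed temperature. It runs in one process and is not Hadoop code.

```python
from itertools import groupby
from operator import itemgetter

RECORDS = [
    "195005150700+0000",
    "195005151200+0022",
    "195005151800-0011",
    "194903241200+0111",
    "194903241800+0078",
]


def map_record(record):
    """Map step: emit (year, temperature) for one record (layout is assumed)."""
    return record[:4], int(record[12:])


def max_per_year(records):
    # Sort by key so that equal years are adjacent, as the sort/merge phase does.
    pairs = sorted((map_record(r) for r in records), key=itemgetter(0))
    # Reduce step: keep the maximum temperature of each year.
    return {
        year: max(temperature for _, temperature in group)
        for year, group in groupby(pairs, key=itemgetter(0))
    }


print(max_per_year(RECORDS))  # {'1949': 111, '1950': 22}
```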
MapReduce (Hadoop). Source: Tom White, Hadoop: The Definitive Guide, 3rd ed., 2012 (O’Reilly & Yahoo! Press)
Twitter Storm (figure: a document workflow with the steps Produce report, Print, Collate & bind, Sign, Send, drawn as one spout and four bolts; labels: Data source, Data processors, Data sink)
Twitter Storm: Basic principle
• Spout (data source) emits records: 195005150700+0000, 195005151200+0022, 195005151800-0011, 194903241200+0111, 194903241800+0078
• Bolt (data processor): 194903241200+0111 → 111
• Bolt (data sink/writer): Received: 111; Current max: 22; New max: 111; Overwrite 22 with 111
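The spout-and-bolts flow can be imitated with chained Python generators. This is only a sketch of the streaming idea: it does not use the Storm API, the function names are hypothetical, and the record layout (signed temperature from character 12 onward) is assumed from the sample records.

```python
RECORDS = [
    "195005150700+0000",
    "195005151200+0022",
    "195005151800-0011",
    "194903241200+0111",
    "194903241800+0078",
]


def spout(records):
    """Data source: emit raw records one at a time."""
    for record in records:
        yield record


def parse_bolt(stream):
    """Data processor: extract the signed temperature from each record."""
    for record in stream:
        yield int(record[12:])


def max_bolt(stream):
    """Data sink/writer: keep a running maximum, overwriting it when beaten."""
    current_max = None
    for value in stream:
        if current_max is None or value > current_max:
            current_max = value
        yield current_max


running_max = list(max_bolt(parse_bolt(spout(RECORDS))))
print(running_max)  # [0, 22, 22, 111, 111]
```

Unlike the batch job, each record is processed as soon as it arrives, so the maximum is always up to date; the fourth record is the step at which 22 is overwritten with 111.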
Twitter Storm: Topology
Twitter Storm: Pipelining and parallelization (figure labels: Stream, Parallelization, Pipelining)
Examples
• Twitter sentiment and volume
  • Elections
  • Stock trading
• News cohesiveness, volume, and sentiment
  • Correlation with VIX, CDS
  • Correlation with big events
• Vocabulary in news & blogs
  • Pump & dump use case
Slovene elections
• 3 candidates, 3 live debates
• Sentiment analysis provider: Gama System & our team at JSI
• Streamed live, in real time, in prime time during the debates on POP TV
• During and after the debates (3 broadcasts), the sentiment chart was shown 5 times (with commentary)
Chart annotations: First live debate; Second live debate; Third live debate; Elections (first round)
Chart annotations: Candidates joined by their wives; Candidates justifying their wealth; Criticizing a questionable pardoning of a criminal; Criticizing the gov; Supporting the gov; Justifying it
“Democratic.”
Zver: “What kind of a political party leader were you if they (party members) didn’t follow your lead?”
Pahor: “Democratic.”
Polls vs. sentiment vs. outcome
• Actual outcome (November 11, 2012): Borut Pahor 40% (+4%), Danilo Türk 36%, Milan Zver 24%
• DeloStik (Delo, 9.11.): 44 / 31 / 25
• Mediana (Slovenske novice, 9.11.): 41.67 / 34.72 / 23.61
• Ninamedia (Mladina, 9.11.): 43.8 / 33.6 / 22.6
• Twitter sentiment: “Borut Pahor will win”
Twitter volume and election results
• There’s no such thing as bad publicity.
• “We believe that Twitter and other social media reflect the underlying trend in a political race that goes beyond a district’s fundamental geographic and demographic composition. If people must talk about you, even in negative ways, it is a signal that a candidate is on the verge of victory. The attention given to winners creates a situation in which all publicity is good publicity.” (DiGrazia, McKelvey, Bollen, Rojas: More tweets, more votes: Social media as a quantitative indicator of political behavior, February 2013)
Source: Smailović, Kranjc, Juršič, Grčar, Gačnik, Mozetič: Monitoring the Twitter sentiment during the Bulgarian elections (2013; to appear)
We’re looking at the stock of Amazon.com during 2012.
• The blue line shows the stock price.
• The red line shows the related Twitter sentiment.
• The black line is the 7-day moving average (MA).
• An MA zero cross-over serves as a buy or sell signal.
• The green-red line shows whether we profited (green) or not (red) from blindly following the social signals.
Source: Sowa Labs GmbH
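A moving-average zero cross-over signal can be sketched as follows. The sketch assumes that the moving average is taken over a daily sentiment score, that an upward crossing of zero means “buy” and a downward crossing means “sell”; the slides do not spell out these details, and the sample series is made up.

```python
def moving_average(values, window=7):
    """Trailing moving average; the first window-1 days produce no value."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def crossover_signals(ma):
    """Return (index into ma, 'buy' or 'sell') wherever the MA changes sign."""
    signals = []
    last_sign = 0  # sign of the most recent non-zero MA value
    for i, value in enumerate(ma):
        sign = (value > 0) - (value < 0)
        if sign == 0:
            continue  # sitting exactly on zero is not a crossing yet
        if last_sign != 0 and sign != last_sign:
            signals.append((i, "buy" if sign > 0 else "sell"))
        last_sign = sign
    return signals


# Hypothetical daily sentiment scores; a 3-day window keeps the example short.
SENTIMENT = [-1, -1, -1, 2, 2, 2, -3, -3, -3]
SIGNALS = crossover_signals(moving_average(SENTIMENT, window=3))
print(SIGNALS)  # [(2, 'buy'), (5, 'sell')]
```

The indices refer to positions in the moving-average series, which starts `window - 1` days after the first sentiment value.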
On April 26, 2012, Amazon announced financial results for its first quarter ended March 31, 2012. Amazon had been spending lots of money on expanding its operations, so analysts expected a huge drop in profit for this first quarter. However, Amazon blew analysts’ estimates away. Even though earnings did fall, they didn’t decline nearly as much as analysts had feared.
• Amazon earned $130 million, or 28 cents per share, for the quarter that ended March 31. That was a 35% decline from a year ago, but it was much better than the 7 cents per share forecast by analysts polled by Thomson Reuters.
• Based on this news, Amazon shares surged nearly 16% on Friday morning, April 27, 2012.
Chart labels: Q4/’11 results, Q1 results, Q2 results, Q3 results
Source: Sowa Labs GmbH
The sentiment MA cross-over happens well before the price jump. Source: Sowa Labs GmbH
We’re looking at the stock of Google during 2012.
• On October 18, 2012, Google’s shares plunged by 9% after the search giant’s third-quarter earnings came in considerably lower than expected.
• The results were accidentally released several hours earlier than expected, leading to a halt in the shares’ trading for a time.
Chart labels: Q4/’11 results, Q1 results, Q2 results, Q3 results
Source: Sowa Labs GmbH
The sentiment MA cross-over happens well before the price plunge. Source: Sowa Labs GmbH
Source: Sowa Labs GmbH
News cohesiveness and VIX. VIX: implied volatility of the S&P 500 (aka the fear index). Source: Rudjer Boskovic Institute, Boston University, Jožef Stefan Institute
News cohesiveness and CDS. CDS: Credit Default Swaps (insurance against default). Source: Rudjer Boskovic Institute, Boston University, Jožef Stefan Institute
Pump & dump. Source: b-next, Goethe Universität, JSI (FIRST)
Pump & dump (model diagram; node labels: Country Black List; Industry Black List; Black List; Company Black List; Company; Age; History; Bankrupt; Comp_FinInst; Pump & Dump; Market Segment; Market; Market Capitalization; Financial Instrument; Trading Volume; Trading; Number of Trades; Sentiment; News; Content). Source: b-next, Goethe Universität, JSI (FIRST)
Quick recap (1/3)
• Big data: volume, velocity, variety
• Enablers
  • Storage capacity & processing power
  • Maturity of technologies
  • Availability of data, e.g., social networks and mobile devices
  • Mindset
• Financial domain: one of the biggest gainers
Quick recap (2/3): Solving big data problems
• Distributed infrastructure
  • Amazon EC2
• Distributed processing capacity
  • MapReduce (Hadoop)
  • Twitter Storm
• Distributed storage
Quick recap (3/3): Examples
• Elections
  • No such thing as bad publicity
• Stock trading
  • Sentiment vs. price, Twitter volume vs. trading volume
• News & blogs
  • Volume & sentiment expose big events
  • Cohesiveness vs. VIX & CDS
  • Content and sentiment as inputs into a pump & dump detection model
Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928)
http://project-first.eu
http://www.sowalabs.de (coming really soon!)