Data acquisition and FIRST datasets. Miha Grčar, Jožef Stefan Institute. Activity in Y3: ontology evolution; data acquisition software (DacqPipe); FIRST dataset of news & blogs. Ontology evolution: dynamic part; (nearly) static part; semantic & lexical resources, IDMS API.
Data acquisition and FIRST datasets Miha Grčar, Jožef Stefan Institute FIRST Y3 Review Meeting
Activity in Y3
• Ontology evolution
• Data acquisition software (DacqPipe)
• FIRST dataset of news & blogs
Ontology evolution
[Diagram: the FIRST ontology, consisting of a dynamic part and a (nearly) static part. Labels: Semantic & lexical resources, IDMS API; Topic detection & tracking; Active learning; Indices, stocks, companies, geo-entities, actors…; Topic taxonomies; Sentiment vocabulary]
Ontology evolution
[Diagram: the FIRST ontology, consisting of a dynamic part and a (nearly) static part, alongside a “Knowledge base”. Labels: Semantic & lexical resources, IDMS API; Topic detection & tracking; Active learning*; Models for canyon flow visualization; Models for sentiment classification*]
*Smailović, Grčar, Lavrač, Žnidaršič: Stream-based active learning for sentiment analysis in the financial domain (to appear)
Data acquisition pipeline (DacqPipe)
[Diagram: RSS readers feed two parallel processing pipelines connected through a 0MQ channel and writing to a DB. Stages: Read & parse, Clean, Store, Emit. Component labels: RSS reader, HTML tokenizer, B'plate remover & duplicate detector, Language detector, Filter, NLP pipe, OBIE, DB writer. Group labels: Syntactic analysis, Semantic preprocessing]
• Resembles big data streaming architectures such as Twitter Storm
• Running continuously since April 2011
• Several scientific contributions
• Boilerplate remover & gold standard dataset
• Ontology & ontology-based information extractor
• Executable available at http://first.ijs.si/software/DacqPipeJun2013.zip
• Source code: https://github.com/project-first/dacqpipe
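The four stages named on the slide (read & parse, clean, store, emit) can be illustrated with a small generator chain. This is only a single-process sketch of the staging idea, not DacqPipe's code: the function names, the tag-stripping regex, the exact-match duplicate check, and the in-memory list standing in for the DB are all assumptions made for illustration.

```python
import html
import re


def read_and_parse(items):
    """Read (url, raw_html) pairs and reduce each page to plain text."""
    for url, raw_html in items:
        text = re.sub(r"<[^>]+>", " ", raw_html)   # crude tag stripping (assumption)
        text = html.unescape(text)
        yield {"url": url, "text": " ".join(text.split())}


def clean(docs):
    """Drop empty documents and exact text duplicates (stand-in for the real detector)."""
    seen = set()
    for doc in docs:
        if doc["text"] and doc["text"] not in seen:
            seen.add(doc["text"])
            yield doc


def store(docs, db):
    """Append each document to a list that plays the role of the DB."""
    for doc in docs:
        db.append(doc)
        yield doc


def emit(docs):
    """Hand the surviving documents downstream; here we just collect their URLs."""
    return [doc["url"] for doc in docs]


# Hypothetical feed items.
feed = [
    ("http://example.org/a", "<p>Stocks rise</p>"),
    ("http://example.org/b", "<p>Stocks   rise</p>"),
    ("http://example.org/c", "<div>Bond yields fall</div>"),
]
db = []
emitted = emit(store(clean(read_and_parse(feed)), db))
```

Because every stage is a generator, a document flows through all stages before the next one is read, which is the property that makes the design suitable for a continuous stream.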
Boilerplate removal
Streaming setting
Hypothesis
Web pages at similar Web addresses share common boilerplate, while the main content is unique.
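URL Tree
[Diagram: pages from the stream are placed into a tree built from their URLs, e.g. http://www.bbc.co.uk/sports/story2371.html. For a text block such as “About us”, the question asked is: how many times did I see “About us” in this part of the tree?]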
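A minimal sketch of the counting idea behind the URL tree follows. It assumes that a page has already been split into text blocks, that the tree is keyed by host and path segments with the page-specific last segment dropped, and that a block seen on more than `max_seen` pages of a subtree is boilerplate. The class names, the threshold, and the sample pages are hypothetical, not taken from the presented system.

```python
from collections import defaultdict
from urllib.parse import urlparse


class UrlTreeNode:
    def __init__(self):
        self.children = {}
        # text block -> number of pages in this subtree that contained it
        self.block_counts = defaultdict(int)


class UrlTree:
    def __init__(self):
        self.root = UrlTreeNode()

    @staticmethod
    def _path(url):
        parsed = urlparse(url)
        segments = [s for s in parsed.path.split("/") if s]
        # Host first, then the path without the final, page-specific segment.
        return [parsed.netloc] + segments[:-1]

    def insert(self, url, blocks):
        """Register a page's text blocks at every node along its URL path."""
        node = self.root
        for segment in self._path(url):
            node = node.children.setdefault(segment, UrlTreeNode())
            for block in set(blocks):
                node.block_counts[block] += 1

    def count(self, url, block):
        """How many pages in the deepest matching subtree contained this block?"""
        node = self.root
        for segment in self._path(url):
            if segment not in node.children:
                break
            node = node.children[segment]
        return node.block_counts.get(block, 0)

    def main_content(self, url, blocks, max_seen=1):
        """Keep blocks that are (nearly) unique within this part of the tree."""
        return [b for b in blocks if self.count(url, b) <= max_seen]


tree = UrlTree()
tree.insert("http://www.bbc.co.uk/sports/story1.html", ["About us", "Goal in extra time"])
tree.insert("http://www.bbc.co.uk/sports/story2.html", ["About us", "Record transfer fee"])

seen = tree.count("http://www.bbc.co.uk/sports/story3.html", "About us")
content = tree.main_content(
    "http://www.bbc.co.uk/sports/story2.html", ["About us", "Record transfer fee"]
)
```

In a streaming setting the tree is updated as each page arrives, so the counts for a site become useful only after a few pages from the same part of the tree have been seen.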
Evaluation
• Dataset
  • 569,583 time-stamped documents (stream)
  • 292,053 documents after URL normalization
  • Oct 24 – Dec 19, 2011; 31 Web sites
  • Part of the FIRST dataset of news & blogs
• Gold standard
  • 56,436 documents annotated with manually designed regexes tailored to specific Web sites
Evaluation
[Chart: evaluation results with a “Reset” marker; the plot itself is not recoverable]
Gold standard dataset: http://first.ijs.si/urltreedataset
Conclusion: Final results of WP3
• Data acquisition pipeline software (DacqPipe)
  • Since April 2011
  • https://github.com/project-first/dacqpipe
• FIRST dataset of news & blogs
  • 219 Web sites; ~15 million unique documents
  • http://first.ijs.si/FIRSTDataset
• FIRST ontology
  • Semantic + lexical part
  • Information extraction + sentiment analysis
  • http://first.ijs.si/FIRSTOntology/y3
Technical Presentations and Demos: Sentiment Analysis
Achim Klein (UHOH), 20 November, Luxembourg
Knowledge-based Sentiment Extraction
a) Direct sentiment
   Example: “I expect the S&P 500 to rise” → positive sentiment
   Addressed by rules
b) Indirect sentiment, using indicators
   Example: “I think U.S. interest rates will rise” → negative sentiment
   Addressed by ontology
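The difference between the two cases can be sketched in a few lines. In this toy version, direct sentiment follows the direction word for a known sentiment object, while indirect sentiment multiplies the direction by an indicator polarity looked up in a tiny stand-in for the ontology. The dictionaries, the word lists, and the `classify` function are hypothetical; only the two example sentences and their labels come from the slide.

```python
import re

# Hypothetical stand-ins for the rule base and the ontology.
DIRECT_OBJECTS = {"s&p 500"}                      # sentiment follows the direction
INDICATOR_POLARITY = {"interest rates": -1}       # a rise is bad news, a fall is good news
DIRECTION_WORDS = {"rise": 1, "rises": 1, "rising": 1,
                   "fall": -1, "falls": -1, "falling": -1}


def classify(sentence):
    text = sentence.lower()
    tokens = re.findall(r"[a-z]+", text)
    # First direction word found in the sentence, 0 if there is none.
    direction = next((DIRECTION_WORDS[t] for t in tokens if t in DIRECTION_WORDS), 0)

    if any(obj in text for obj in DIRECT_OBJECTS):
        score = direction                          # a) direct sentiment, rule-based
    else:
        polarity = next(
            (p for name, p in INDICATOR_POLARITY.items() if name in text), 0
        )
        score = direction * polarity               # b) indirect sentiment via indicator

    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


direct = classify("I expect the S&P 500 to rise")
indirect = classify("I think U.S. interest rates will rise")
```

The point of the indicator lookup is that the second sentence contains no sentiment object at all, so a purely rule-based extractor keyed on objects would miss it.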
UC Retail Brokerage/Market Surveillance: Economic Indicators
• Advance/Decline Ratio, Bear Flag, Break Out, Double Bottom, RSI, Support, Resistance, …
• Debt to Equity, Dividend Yield, Earnings to Price Ratio, New Products, Profit Margin, Sales, …
• Interest Rate, Inflation, M2 Change Rate, Durable Goods Orders, Unemployment, Private Housing New Building Permits, …
Example Insights: Unemployment Indicator
[Chart annotations: official US unemployment statistics release dates; record Greek unemployment numbers released]
UC Reputational Risk: Reputation Indicators (Y3)
Positive and negative sample indicators per reputation topic
[Indicator labels: Charity Donation Education Crime Bullying Slave Professional Talent Manpower Lay off Job cuts Wrongdoers Transparent Responsible Campaign Debt Foreclosure Price-fixing Accountable Tier 1 ratio AML Breach Shady funds Law suit Subsidy Liquidity Customers Subprime Mortgage CDS spread]
Total number of indicators: 1451
Reputation Sentiment Classification Performance
[Chart values: 71.2%, 67.7%, 44.9%, 23.7%; series and axis labels are not recoverable]
Higher recall of (indirect) sentiments by means of indicators
Reputational Insights: JPMorgan
• Corporate Governance, 19.09.2013: “Scandals cost JPMorgan $1 billion in fines” [REUTERS]
• Corporate Governance, 11.10.2013: “JPMorgan’s Dimon Posts First Loss on $7.2 Billion Legal Cost” [BLOOMBERG]
[Chart: volume of Corporate Governance]
Fuzzy Sentiment Classification
1. Extract sentiment objects
   “Apple’s earnings are rising”
   “Sales might decrease because of the financial crisis”
2. Classify sentiment per object in each sentence
3. Generate machine-learning input: sentiments and words of all sentences that refer to the same object
4. Two separate document-level machine-learning fuzzy classifiers with 5 degrees of … positive, (2) negative
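Step 3 above, collecting the sentence-level sentiments and words of all sentences that refer to the same object, might look roughly like the sketch below. The tuple layout, the object labels, the third sentence, and the `build_ml_input` function are assumptions for illustration; the fuzzy classifiers of step 4 are not modelled here.

```python
from collections import defaultdict

# Hypothetical output of steps 1-2: (object, sentence-level sentiment, sentence).
sentence_results = [
    ("Apple", "positive", "Apple's earnings are rising"),
    ("Sales", "negative", "Sales might decrease because of the financial crisis"),
    ("Apple", "negative", "Apple faces new lawsuits"),   # invented extra sentence
]


def build_ml_input(results):
    """Group sentiments and words per object, ready to be turned into features."""
    grouped = defaultdict(lambda: {"sentiments": [], "words": []})
    for obj, sentiment, sentence in results:
        grouped[obj]["sentiments"].append(sentiment)
        grouped[obj]["words"].extend(sentence.lower().split())
    return dict(grouped)


ml_input = build_ml_input(sentence_results)
```

Each entry of `ml_input` then corresponds to one object within a document and could be fed to a document-level classifier.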
Enhanced Gold Standard Corpus (Y3): Retail Brokerage/Market Surveillance
[Chart values: 69.0%, 62.6%, 62.4%, 54.2%; series and axis labels are not recoverable]
Improved hybrid sentiment classifier performance
Main Results
• Deep knowledge-based sentiment analysis
• Specific to a feature of an object using rules (e.g., reputation of a company)
• Economic and reputation indicators improve classifier performance and provide valuable insights for users
• Glass-box approach with drill-down capabilities
• Best paper award at IEEE
• Fuzzy classifier with 5 degrees of positivity and negativity for better decision making
• Fuzzy-level Gold Standard Corpus
• Analyzed >3 million documents
• Open source available: git://github.com/project-first/semanticinformationextraction.git
WP6 Technical Presentation & Demos
Marko Bohanec, Miha Grčar, Jan Muntermann, Michael Siering
Luxembourg, November 20th, 2013
WP6 Status: End of Y2
[Timeline: 19 to 36]
• Mainly presenting basic stand-alone prototypes
• Presentation of the first models
• First visualisation components
WP6 Achievements Y3
[Timeline: 19 to 36]
T 6.2 / T 6.3 Machine Learning & Qualitative Models
• Refinements of qualitative models based on domain experts’ feedback
• Highly scalable implementations
• FIRST pipeline integration
• Delivery of D6.3 in M33
T 6.4 Visualisation Components
• Development of additional and revised visualisation components based on domain experts’ feedback
• Highly scalable implementations
• Delivery of D6.4 in M34
Agenda
• Qualitative and quantitative models
• Reputational Risk Management
• Market Surveillance
• Visualizations
• Retail Brokerage
Reputational Risk Problem Formulation (1/2)
General Area: Production and distribution of investment products and services by banks and other financial institutions.
Specific Use Case: Assessment of reputational risk (RI) based on assessments of MPS counterparties.
Reputational Risk: Risk arising from negative perception on the part of customers, counterparties, shareholders, investors, debt-holders, market analysts, other relevant parties or regulators that can adversely affect a bank’s ability to maintain existing, or establish new, business relationships and continued access to sources of funding.
Reputational Risk Problem Formulation (2/2)
Goal: to develop
• a multi-criteria model for the assessment of MPS reputational risk (RIM)
• that serves as the main component of the corresponding DSS
Approach: expert modeling, qualitative multi-attribute modeling (method DEX)
[Callout label: novelties]
RIM: Main Components
RIM: Basic Data Processing
RIM: Qualitative Evaluation
Aim: qualitative assessment of Reputational Index for one customer and product
Model: qualitative hierarchical rule-based DEX model
RIM: Aggregation
Aim: gradually aggregate qRI1 into the overall Reputation Risk Index (RI):
• hierarchical aggregation: Customer → Product → Counterpart → Bank
• taking into account relative product volumes and relative customer numbers
[Diagram: qRI1 feeds the chain C/P → PRODUCT, PRODUCT → COUNTERPART, COUNTERPART → BANK]
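One way to picture the hierarchical aggregation is a chain of weighted averages. The sketch below assumes that the per-customer qRI1 values have already been mapped to numbers, that products are weighted by volume, and that counterparts are weighted by their number of customers. The slide does not state the actual aggregation functions, so the weighting scheme, the numeric scale, and all names and figures here are hypothetical.

```python
def weighted_mean(values, weights):
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


def product_score(product):
    """Customer -> Product: plain mean of the per-customer scores."""
    scores = product["customer_scores"]
    return sum(scores) / len(scores)


def counterpart_score(products):
    """Product -> Counterpart: weighted by relative product volume."""
    return weighted_mean(
        [product_score(p) for p in products],
        [p["volume"] for p in products],
    )


def bank_score(counterparts):
    """Counterpart -> Bank: weighted by relative number of customers."""
    scores = [counterpart_score(ps) for ps in counterparts.values()]
    weights = [sum(len(p["customer_scores"]) for p in ps) for ps in counterparts.values()]
    return weighted_mean(scores, weights)


# Hypothetical data: scores on an assumed 1 (low risk) to 5 (high risk) scale.
counterparts = {
    "counterpart_A": [
        {"volume": 3.0, "customer_scores": [4, 2]},
        {"volume": 1.0, "customer_scores": [5]},
    ],
    "counterpart_B": [
        {"volume": 2.0, "customer_scores": [1]},
    ],
}

score_a = counterpart_score(counterparts["counterpart_A"])
overall = bank_score(counterparts)
```

Keeping the intermediate scores at each level is what allows a report to drill down from the bank-level index to counterparts, products and customers.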
RIM Reports: Topmost Level (Bank)
RIM Summary
Developed and implemented a decision support model component for the assessment of bank reputational risk.
Approach: expert modeling using a variety of modeling methods (qualitative, quantitative, hierarchical, relational)
Novel aspects:
• taking into account sentiment assessments of counterparts
• advancing the present RI assessment model
Benefits for the users:
• obtaining a comprehensive RI as time series for different groups (customers, products, counterparts, bank)
• ability to analyse and explain assessments at different levels by drilling down through the RIM hierarchy
Agenda
• Qualitative and quantitative models
• Reputational Risk Management
• Market Surveillance
• Visualizations
• Retail Brokerage
Problem Formulation: Market Surveillance
Pump & Dump market manipulation: Manipulation of the share price by the dissemination of false positive information in order to take profit from an increased price level.
Pump & Dump Example (1/2)
• “Thursday's pick is a story straight out of Hollywood!”
• “Shares can multiply dramatically in value over short time periods.”
• “Could this company be the next blockbuster?”
• “SAPX - Wake Up, Put It On Your Screen NOW”
Source: http://newsletter.hotstocked.com/newsletters/view/Could_this_company_be_the_next_blockbuster_-92301
Pump & Dump Example (2/2)
Seven Arts Entertainment, Inc. (SAPX)
[Chart with the markers “Shares Purchased” and “Shares Sold”]
Pump & Dump campaign: July 24th – 28th, 2011; > 30 different recommendations
Source: Yahoo! Finance
How to address Pump & Dump Manipulations?
Qualitative Modeling
• Based on expert knowledge
• Qualitative attributes
• Decision problem divided into sub problems
• Goal: daily assessments
Quantitative Modeling
• Based on machine learning algorithms
• Quantitative attributes
• Goal: assessment of single documents
Qualitative Multi-Attribute Model Development
[Attribute tree; the hierarchy is not recoverable. Labels: Country Black List, Industry Black List, Black List, Company Black List, Company Age, History, Bankrupt, Comp_FinInst, Pump & Dump, Market Segment, Market, Market Capitalization, Financial Instrument, Trading Volume, Trading, Number of Trades, Sentiment, News, Content]
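In a qualitative multi-attribute model of this kind, each aggregate attribute is evaluated by a rule table over the values of its children. The sketch below shows that mechanism on a deliberately tiny two-level model. A few attribute names are borrowed from the slide, but the tree shape, the value scales, and every rule are invented for illustration and do not reproduce the presented model or the DEXi file format.

```python
# Hypothetical model: each aggregate attribute lists its children and a complete
# rule table mapping child-value combinations to its own qualitative value.
MODEL = {
    "News": {
        "children": ("Sentiment", "Content"),
        "rules": {
            ("positive", "promotional"): "suspicious",
            ("positive", "neutral"): "normal",
            ("other", "promotional"): "normal",
            ("other", "neutral"): "normal",
        },
    },
    "Pump & Dump": {
        "children": ("Black List", "News"),
        "rules": {
            ("yes", "suspicious"): "high",
            ("yes", "normal"): "medium",
            ("no", "suspicious"): "medium",
            ("no", "normal"): "low",
        },
    },
}


def evaluate(attribute, inputs):
    """Evaluate an attribute bottom-up; basic attributes are read from the inputs."""
    node = MODEL.get(attribute)
    if node is None:
        return inputs[attribute]
    key = tuple(evaluate(child, inputs) for child in node["children"])
    return node["rules"][key]


alert = evaluate(
    "Pump & Dump",
    {"Black List": "yes", "Sentiment": "positive", "Content": "promotional"},
)
calm = evaluate(
    "Pump & Dump",
    {"Black List": "no", "Sentiment": "other", "Content": "neutral"},
)
```

Because every intermediate value is itself a readable label, an assessment can be explained by walking back down the tree, which is the main appeal of the qualitative approach.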
From initial DEXi Model (M24) to Processing of Data Stream (M33)
• The initially developed model structure was distributed as DEXi files.
• At M24, the models could be applied within the DEXi environment only.
• To enable the models to process large-scale data streams, a Java-based prototype was implemented (M33).
Definition of Data Sources
[Diagram label: Regulatory Authorities web pages]
Model Configuration and Evaluation (M24)
Evaluation based on predefined configuration:
• Model configuration
  • V-high: 3 configurations
  • High: 9
  • Medium: 7
  • Low: 5
  • V-low: 1
• Evaluation
  • 1700 OTC-traded companies
  • Dataset: 01.2012 to 06.2013 (370 trading days)
  • on average 157 alerts per day for v-high and high
Reconfiguration of the Rules in Y3
Model Configuration and Evaluation (M33)
Evaluation results based on reconfigured model:
• Configuration
  • V-high: 3 configurations
  • High: 7
  • Medium: 8
  • Low: 6
  • V-low: 1
• Evaluation
  • 1700 OTC-traded companies
  • Dataset: 01.2012 to 09.2013 (435 trading days)
  • on average 53 alerts per day for v-high and high
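The effect of the reconfiguration can be read off the two evaluation slides with a little arithmetic. The snippet below uses only the figures reported for M24 and M33; the totals are rough estimates obtained by multiplying the average daily alerts by the number of trading days, and the dictionary layout is just a convenience.

```python
m24 = {
    "alerts_per_day": 157,
    "trading_days": 370,
    "configs": {"v-high": 3, "high": 9, "medium": 7, "low": 5, "v-low": 1},
}
m33 = {
    "alerts_per_day": 53,
    "trading_days": 435,
    "configs": {"v-high": 3, "high": 7, "medium": 8, "low": 6, "v-low": 1},
}


def estimated_total_alerts(evaluation):
    """Average daily alerts times trading days: an estimate, not a reported figure."""
    return evaluation["alerts_per_day"] * evaluation["trading_days"]


# Relative drop in average daily v-high/high alerts after the reconfiguration.
reduction_pct = round((1 - m33["alerts_per_day"] / m24["alerts_per_day"]) * 100, 1)

configs_m24 = sum(m24["configs"].values())
configs_m33 = sum(m33["configs"].values())
total_m24 = estimated_total_alerts(m24)
total_m33 = estimated_total_alerts(m33)
```

Both models use the same number of configurations in total; two of them moved out of the High class, which goes along with a drop of roughly two thirds in the average number of daily v-high and high alerts.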
Research Impact
Alic, I.; Siering, M.; Bohanec, M. (2013): Hot Stock or Not? A Qualitative Multi-Attribute Model to Detect Financial Market Manipulation. Proceedings of the 26th Bled eConference, Bled, Slovenia.
How to address Pump & Dump Manipulations?
Qualitative Modeling
• Based on expert knowledge
• Qualitative attributes
• Decision problem divided into sub problems
• Goal: daily assessments
Quantitative Modeling
• Based on machine learning algorithms
• Quantitative attributes
• Goal: assessment of documents