Data acquisition and FIRST datasets

Data acquisition and FIRST datasets. Miha Grčar, Jožef Stefan Institute. Activity in Y3: ontology evolution; data acquisition software (DacqPipe); FIRST dataset of news & blogs. Ontology evolution: dynamic part; (nearly) static part; semantic & lexical resources, IDMS API.

Presentation Transcript


  1. Data acquisition and FIRST datasets. Miha Grčar, Jožef Stefan Institute. FIRST Y3 Review Meeting.

  2. Activity in Y3
     • Ontology evolution
     • Data acquisition software (DacqPipe)
     • FIRST dataset of news & blogs

  3. Ontology evolution
     [Diagram labels: FIRST ontology; dynamic part; (nearly) static part; semantic & lexical resources, IDMS API; topic detection & tracking; active learning; indices, stocks, companies, geo-entities, actors…; topic taxonomies; sentiment vocabulary]

  4. Ontology evolution
     [Diagram labels: FIRST ontology; “Knowledge base”; dynamic part; (nearly) static part; semantic & lexical resources, IDMS API; topic detection & tracking; active learning*; models for canyon flow visualization; models for sentiment classification*]
     *Smailović, Grčar, Lavrač, Žnidaršič: Stream-based active learning for sentiment analysis in the financial domain (to appear)

  5. Data acquisition pipeline (DacqPipe)
     [Diagram labels: Read & parse, Clean, Store, Emit; syntactic analysis, semantic preprocessing; RSS reader, boilerplate remover & duplicate detector, language detector, filter, NLP pipe, HTML tokenizer, OBIE, DB writer, DB, 0MQ channel]
     • Resembles big data streaming architectures such as Twitter Storm
     • Running continuously since April 2011
     • Several scientific contributions
       • Boilerplate remover & gold standard dataset
       • Ontology & ontology-based information extractor
     • Executable available at http://first.ijs.si/software/DacqPipeJun2013.zip
     • Source code: https://github.com/project-first/dacqpipe
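The four stage names on the pipeline slide (Read & parse, Clean, Store, Emit) suggest a chain of streaming steps. The following is a minimal sketch of that idea using Python generators; it is not DacqPipe's code. All function names, the dict-based document type, and the in-memory list standing in for the database and the 0MQ channel are assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List

Doc = dict  # assumption: a document is just a dict in this sketch


def read_and_parse(feed: Iterable[str]) -> Iterator[Doc]:
    """Stand-in for the RSS reader: wrap each raw item in a document."""
    for raw in feed:
        yield {"raw": raw}


def clean(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Stand-in for cleaning: normalize whitespace and drop exact duplicates."""
    seen = set()
    for doc in docs:
        text = " ".join(doc["raw"].split())
        if text in seen:
            continue  # toy duplicate detector
        seen.add(text)
        yield {**doc, "text": text}


def store(docs: Iterable[Doc], db: List[Doc]) -> Iterator[Doc]:
    """Stand-in for the DB writer: persist each document, then pass it on."""
    for doc in docs:
        db.append(doc)
        yield doc


def emit(docs: Iterable[Doc], sink: Callable[[Doc], None]) -> int:
    """Hand each document to a consumer and return how many were emitted."""
    count = 0
    for doc in docs:
        sink(doc)
        count += 1
    return count


def run_pipeline(feed: Iterable[str], db: List[Doc], sink: Callable[[Doc], None]) -> int:
    """Chain the four stages; documents flow through one at a time."""
    return emit(store(clean(read_and_parse(feed)), db), sink)
```

Because every stage is a generator, a document is cleaned, stored and emitted before the next one is read, which is the property that makes the design suitable for a continuously running stream.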

  7. Boilerplate removal

  8. Streaming setting

  9. Hypothesis: Web pages at similar Web addresses share common boilerplate, while the main content is unique.

  10. URL Tree
      [Diagram: a stream of pages feeds a URL tree; example URL http://www.bbc.co.uk/sports/story2371.html; example text block “About us”]
      • How many times did I see “About us” in this part of the tree?
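The question on the URL Tree slide ("How many times did I see 'About us' in this part of the tree?") can be sketched as a counting structure keyed by URL path. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the class names, the `min_seen` threshold, the choice to query only the deepest directory node, and the treatment of a page as a list of text blocks are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


class UrlTreeNode:
    """One node per host or directory; counts text blocks seen beneath it."""

    def __init__(self):
        self.children = {}
        self.block_counts = Counter()


def url_path(url):
    """Return [host, dir1, dir2, ...], dropping the final page segment."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return [parts.netloc.lower()] + segments[:-1]


class UrlTree:
    def __init__(self, min_seen=2):
        self.root = UrlTreeNode()
        self.min_seen = min_seen  # hypothetical threshold for "boilerplate"

    def observe(self, url, blocks):
        """Update counts along the URL path and return the blocks kept as content."""
        unique = set(blocks)  # count each block at most once per page
        node = self.root
        for key in url_path(url):
            node = node.children.setdefault(key, UrlTreeNode())
            node.block_counts.update(unique)
        # 'node' is now the page's directory: blocks seen there often are boilerplate
        return [b for b in blocks if node.block_counts[b] < self.min_seen]
```

With `min_seen=2`, the first page from a directory passes through unchanged, and a block such as "About us" is dropped from the second page onward. A streaming system would need some policy for the early pages of each site, which this sketch does not attempt.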

  11. Evaluation
      • Dataset
        • 569,583 time-stamped documents (stream)
        • 292,053 documents after URL normalization
        • Oct 24 – Dec 19, 2011; 31 Web sites
        • Part of the FIRST dataset of news & blogs
      • Gold standard
        • 56,436 documents annotated with manually designed regexes tailored to specific Web sites

  12. Evaluation
      [Figure label: Reset]

  13. Gold standard dataset: http://first.ijs.si/urltreedataset

  14. Conclusion: Final results of WP3
      • Data acquisition pipeline software (DacqPipe)
        • Since April 2011
        • https://github.com/project-first/dacqpipe
      • FIRST dataset of news & blogs
        • 219 Web sites; ~15 million unique documents
        • http://first.ijs.si/FIRSTDataset
      • FIRST ontology
        • Semantic + lexical part
        • Information extraction + sentiment analysis
        • http://first.ijs.si/FIRSTOntology/y3

  15. Technical Presentations and Demos: Sentiment Analysis. Achim Klein (UHOH), 20 November, Luxembourg

  16. Knowledge-based Sentiment Extraction
      a) Direct sentiment. Example: “I expect the S&P 500 to rise” (positive sentiment). Addressed by rules.
      b) Indirect sentiment, using indicators. Example: “I think U.S. interest rates will rise” (negative sentiment). Addressed by ontology.
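The distinction between direct and indirect sentiment can be illustrated with a toy classifier. This is only a sketch built around the two example sentences on the slide; the word lists, the sign convention for indicators, and the function names are assumptions, and the real system uses rules and an ontology rather than these small lookup tables.

```python
import re

# Hypothetical direction words: +1 for upward movement, -1 for downward.
DIRECTION = {"rise": 1, "fall": -1}

# Hypothetical sentiment objects that carry sentiment directly.
DIRECT_TARGETS = ["s&p 500"]

# Hypothetical indicator table: effect on sentiment when the indicator rises.
INDICATOR_EFFECT = {"interest rates": -1}


def _direction(text):
    """Return +1, -1 or 0 depending on which direction word appears."""
    for word, sign in DIRECTION.items():
        if re.search(r"\b" + re.escape(word) + r"\b", text):
            return sign
    return 0


def _label(score):
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


def classify(sentence):
    """Return (kind, sentiment) where kind is 'direct', 'indirect' or 'none'."""
    text = sentence.lower()
    sign = _direction(text)
    if any(target in text for target in DIRECT_TARGETS):
        return "direct", _label(sign)
    for indicator, effect in INDICATOR_EFFECT.items():
        if indicator in text:
            # An indicator flips or keeps the direction according to its effect.
            return "indirect", _label(sign * effect)
    return "none", "neutral"
```

The point of the second branch is that "rise" alone does not decide the sentiment: the same word yields a negative label once it is attached to an indicator whose increase is considered unfavourable.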

  17. UC Retail Brokerage/Market Surveillance: Economic Indicators
      • Advance/Decline Ratio, Bear Flag, Break Out, Double Bottom, RSI, Support, Resistance, …
      • Debt to Equity, Dividend Yield, Earnings to Price Ratio, New Products, Profit Margin, Sales, …
      • Interest Rate, Inflation, M2 Change Rate, Durable Goods Orders, Unemployment, Private Housing New Building Permits, …

  18. Example Insights: Unemployment Indicator
      • Official US unemployment statistics release dates.
      • Record Greek unemployment numbers released.

  19. UC Reputational Risk: Reputation Indicators (Y3)
      • Positive and negative sample indicators per reputation topic: Charity Donation Education Crime Bullying Slave Professional Talent Manpower Lay off Job cuts Wrongdoers Transparent Responsible Campaign Debt Foreclosure Price-fixing Accountable Tier 1 ratio AML Breach Shady funds Law suit Subsidy Liquidity Customers Subprime Mortgage CDS spread
      • Total number of indicators: 1451

  20. Reputation Sentiment Classification Performance
      [Chart values: 71.2%, 67.7%, 44.9%, 23.7%]
      → Higher recall of (indirect) sentiments by means of indicators

  21. Reputational Insights: JPMorgan
      • Corporate Governance, 19.09.2013: “Scandals cost JPMorgan $1 billion in fines” [REUTERS]
      • Corporate Governance, 11.10.2013: “JPMorgan’s Dimon Posts First Loss on $7.2 Billion Legal Cost” [BLOOMBERG]
      [Chart: volume of Corporate Governance]

  22. Fuzzy Sentiment Classification
      1. Extract sentiment objects: “Apple’s earnings are rising”; “Sales might decrease because of the financial crisis”
      2. Classify sentiment per object in each sentence
      3. Generate machine-learning input: sentiments and words of all sentences that refer to the same object
      4. Two separate document-level machine-learning fuzzy classifiers with 5 degrees of … positive, (2) negative
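Step 3 of the fuzzy classification slide, collecting the sentiments and words of all sentences that refer to the same object, amounts to a grouping operation. The sketch below shows only that step. The input format (object, sentence-level label, sentence text), the function name, and the whitespace tokenization are assumptions; the fuzzy classifiers themselves are not modelled.

```python
from collections import defaultdict


def build_ml_input(sentences):
    """Group sentence-level sentiments and words by the object they refer to.

    sentences: iterable of (object_name, sentiment_label, sentence_text).
    Returns {object_name: {"sentiments": [...], "words": [...]}}.
    """
    grouped = defaultdict(lambda: {"sentiments": [], "words": []})
    for obj, sentiment, text in sentences:
        grouped[obj]["sentiments"].append(sentiment)
        grouped[obj]["words"].extend(text.lower().split())  # naive tokenizer
    return dict(grouped)
```

Each resulting record would then serve as one input instance per object for a document-level classifier.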

  23. Enhanced Gold Standard Corpus (Y3): Retail Brokerage/Market Surveillance
      [Chart values: 69.0%, 62.6%, 62.4%, 54.2%]
      → Improved hybrid sentiment classifier performance

  24. Main Results
      • Deep knowledge-based sentiment analysis
      • Specific to a feature of an object using rules (e.g., reputation of a company)
      • Economic and reputation indicators improve classifier performance and provide valuable insights for users
      • Glass-box approach with drill-down capabilities
      • Best paper award at IEEE
      • Fuzzy classifier with 5 degrees of positivity and negativity for better decision making
      • Fuzzy-level Gold Standard Corpus
      • Analyzed >3 million documents
      • Open source available: git://github.com/project-first/semanticinformationextraction.git

  25. Thank you

  26. WP6 Technical Presentation & Demos. Marko Bohanec, Miha Grčar, Jan Muntermann, Michael Siering. Luxembourg, November 20th, 2013

  27. WP6 Status End of Y2
      [Timeline: months 19 to 36]
      • Mainly presenting basic stand-alone prototypes
      • Presentation of the first models
      • First visualisation components

  28. WP6 Achievements Y3
      [Timeline: months 19 to 36]
      • T 6.2 / T 6.3 Machine Learning & Qualitative Models
        • Refinements of qualitative models based on domain experts’ feedback
        • Highly scalable implementations
        • FIRST pipeline integration
        • Delivery of D6.3 in M33
      • T 6.4 Visualisation Components
        • Development of additional and revised visualisation components based on domain experts’ feedback
        • Highly scalable implementations
        • Delivery of D6.4 in M34

  29. Agenda
      • Qualitative and quantitative models
      • Reputational Risk Management
      • Market Surveillance
      • Visualizations
      • Retail Brokerage

  30. Reputational Risk Problem Formulation (1/2)
      • General Area: Production and distribution of investment products and services by banks and other financial institutions.
      • Specific Use Case: Assessment of reputational risk (RI) based on assessments of MPS counterparties.
      • Reputational Risk: Risk arising from negative perception on the part of customers, counterparties, shareholders, investors, debt-holders, market analysts, other relevant parties or regulators that can adversely affect a bank’s ability to maintain existing, or establish new, business relationships and continued access to sources of funding.

  31. Reputational Risk Problem Formulation (2/2)
      Goal: to develop
      • a multi-criteria model for the assessment of MPS reputational risk (RIM)
      • that serves as the main component of the corresponding DSS
      Approach: expert modeling, qualitative multi-attribute modeling (DEX method), novelties

  32. RIM: Main Components

  33. RIM: Basic Data Processing

  34. RIM: Qualitative Evaluation
      • Aim: qualitative assessment of Reputational Index for one customer and product
      • Model: qualitative hierarchical rule-based DEX model

  35. RIM: Aggregation
      Aim: gradually aggregate qRI1 into the overall Reputation Risk Index (RI):
      • hierarchical aggregation: Customer → Product → Counterpart → Bank
      • taking into account relative product volumes and relative customer numbers
      [Diagram labels: qRI1; C/P → PRODUCT; PRODUCT → COUNTERPART; COUNTERPART → BANK]
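The aggregation slide describes rolling scores up a hierarchy while weighting by relative product volumes and customer numbers, but it does not give the formula. The sketch below assumes a plain weighted mean over numeric scores purely for illustration; the numeric scale, the `weight` and `score` field names, and the recursive tree layout are all hypothetical and may differ from the actual RIM.

```python
def weighted_mean(pairs):
    """Weighted mean of (value, weight) pairs."""
    total = sum(weight for _, weight in pairs)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(value * weight for value, weight in pairs) / total


def aggregate(node):
    """Roll scores up a tree.

    A leaf is {"score": x, "weight": w}; an inner node is
    {"weight": w, "children": [...]}. A child's weight stands for its
    relative customer number or product volume at that level.
    """
    if "children" not in node:
        return node["score"]
    return weighted_mean([(aggregate(child), child["weight"]) for child in node["children"]])


# Hypothetical example: one counterpart with two products.
bank = {
    "weight": 1,
    "children": [{  # counterpart
        "weight": 1,
        "children": [
            {"weight": 3, "children": [  # product A, relative volume 3
                {"score": 1.0, "weight": 1},
                {"score": 3.0, "weight": 1},
            ]},
            {"weight": 1, "children": [  # product B, relative volume 1
                {"score": 4.0, "weight": 1},
            ]},
        ],
    }],
}
bank_index = aggregate(bank)
```

Here product A averages to 2.0 and product B to 4.0, so the counterpart (and, with a single counterpart, the bank) comes to (2.0 × 3 + 4.0 × 1) / 4 = 2.5.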

  36. RIM Reports: Topmost Level (Bank)

  37. RIM Summary
      Developed and implemented a decision support model component for the assessment of bank reputational risk.
      Approach: expert modeling using a variety of modeling methods (qualitative, quantitative, hierarchical, relational)
      Novel aspects:
      • taking into account sentiment assessments of counterparts
      • advancing the present RI assessment model
      Benefits for the users:
      • obtaining a comprehensive RI as time series for different groups (customers, products, counterparts, bank)
      • ability to analyse and explain assessments at different levels by drilling down through the RIM hierarchy

  38. Agenda
      • Qualitative and quantitative models
      • Reputational Risk Management
      • Market Surveillance
      • Visualizations
      • Retail Brokerage

  39. Problem Formulation: Market Surveillance
      Pump & Dump market manipulation: manipulation of the share price by the dissemination of false positive information in order to take profit from an increased price level.

  40. Pump & Dump Example (1/2)
      • “Thursday's pick is a story straight out of Hollywood!”
      • “Shares can multiply dramatically in value over short time periods.”
      • “Could this company be the next blockbuster?”
      • “SAPX - Wake Up, Put It On Your Screen NOW”
      Source: http://newsletter.hotstocked.com/newsletters/view/Could_this_company_be_the_next_blockbuster_-92301

  41. Pump & Dump Example (2/2)
      Seven Arts Entertainment, Inc. (SAPX)
      [Chart labels: Shares Purchased; Shares Sold]
      • Pump & Dump campaign: July 24th – 28th, 2011
      • > 30 different recommendations
      Source: Yahoo! Finance

  42. How to address Pump & Dump Manipulations?
      • Qualitative Modeling
        • Based on expert knowledge
        • Qualitative attributes
        • Decision problem divided into sub-problems
        • Goal: daily assessments
      • Quantitative Modeling
        • Based on machine learning algorithms
        • Quantitative attributes
        • Goal: assessment of single documents

  43. Qualitative Multi-Attribute Model Development
      [Attribute tree labels: Pump & Dump; Comp_FinInst; Company; Black List; Country Black List; Industry Black List; Company Black List; History; Age; Bankrupt; Financial Instrument; Market; Market Segment; Market Capitalization; Trading; Trading Volume; Number of Trades; News; Sentiment; Content]
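A qualitative multi-attribute model of this kind derives each aggregate attribute from its children through a rule table over qualitative values. The sketch below shows that mechanism on two branches that reuse attribute names from the slide (Trading, News and their children). The value scales, every rule, and the `Aggregate` class are invented for illustration and do not reproduce the project's DEXi model.

```python
class Aggregate:
    """An aggregate attribute whose value comes from a rule table over its children."""

    def __init__(self, name, children, rules):
        self.name = name
        self.children = children  # basic attribute names or Aggregate objects
        self.rules = rules        # maps a tuple of child values to this attribute's value

    def evaluate(self, inputs):
        """Evaluate bottom-up: resolve each child, then look up the rule."""
        key = tuple(
            child.evaluate(inputs) if isinstance(child, Aggregate) else inputs[child]
            for child in self.children
        )
        return self.rules[key]


# Hypothetical scales and rules.
trading = Aggregate("Trading", ["Trading Volume", "Number of Trades"], {
    ("normal", "normal"): "normal",
    ("normal", "high"): "suspicious",
    ("high", "normal"): "suspicious",
    ("high", "high"): "suspicious",
})
news = Aggregate("News", ["Sentiment", "Content"], {
    ("neutral", "ordinary"): "normal",
    ("neutral", "promotional"): "suspicious",
    ("positive", "ordinary"): "normal",
    ("positive", "promotional"): "suspicious",
})
pump_and_dump = Aggregate("Pump & Dump", [trading, news], {
    ("normal", "normal"): "low",
    ("normal", "suspicious"): "medium",
    ("suspicious", "normal"): "medium",
    ("suspicious", "suspicious"): "high",
})

assessment = pump_and_dump.evaluate({
    "Trading Volume": "high",
    "Number of Trades": "high",
    "Sentiment": "positive",
    "Content": "promotional",
})
```

Because every value is looked up rather than computed, an analyst can trace any assessment back to the specific rules that produced it, which is the drill-down property such models are valued for.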

  44. From initial DEXi Model (M24) to Processing of Data Stream (M33)
      • The initial model structure was developed and distributed as DEXi files.
      • These models can be applied only within the DEXi environment (M24).
      • To enable the models to process large-scale data streams, a Java-based prototype was implemented (M33).

  45. Definition of Data Sources
      [Diagram label: Regulatory Authorities web pages]

  46. Model Configuration and Evaluation (M24)
      • Model Configuration
        • V-high: 3 configurations
        • High: 9
        • Medium: 7
        • Low: 5
        • V-low: 1
      • Evaluation
        • 1700 OTC-traded companies
        • Dataset: 01.2012 to 06.2013 (370 trading days)
        • On average 157 alerts per day for v-high and high
      [Figure: evaluation based on predefined configuration]

  47. Reconfiguration of the Rules in Y3

  48. Model Configuration and Evaluation (M33)
      • Configuration
        • V-high: 3 configurations
        • High: 7
        • Medium: 8
        • Low: 6
        • V-low: 1
      • Evaluation
        • 1700 OTC-traded companies
        • Dataset: 01.2012 to 09.2013 (435 trading days)
        • On average 53 alerts per day for v-high and high
      [Figure: evaluation results based on reconfigured model]
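The M24 and M33 figures can be compared with a few lines of arithmetic. The sketch below only restates numbers from the two evaluation slides; the variable and function names are hypothetical, and the comparison ignores that the two evaluation periods differ in length (370 versus 435 trading days).

```python
# Number of rule configurations per risk level, as listed on the slides.
m24_configs = {"v-high": 3, "high": 9, "medium": 7, "low": 5, "v-low": 1}
m33_configs = {"v-high": 3, "high": 7, "medium": 8, "low": 6, "v-low": 1}

# Average alerts per day for the v-high and high levels.
m24_alerts_per_day = 157
m33_alerts_per_day = 53


def relative_reduction(before, after):
    """Fraction by which 'after' is lower than 'before'."""
    return (before - after) / before


total_m24 = sum(m24_configs.values())
total_m33 = sum(m33_configs.values())
reduction = relative_reduction(m24_alerts_per_day, m33_alerts_per_day)

print(f"configurations: {total_m24} (M24) vs {total_m33} (M33)")
print(f"alert reduction: {reduction:.1%}")
```

Both configurations cover 25 combinations in total, so the reconfiguration moved combinations between levels (two fewer at High, one more each at Medium and Low) rather than adding or removing any. The average daily alert count for the two highest levels dropped by roughly two thirds.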

  49. Research Impact
      Alic, I.; Siering, M.; Bohanec, M. (2013): Hot Stock or Not? A Qualitative Multi-Attribute Model to Detect Financial Market Manipulation. Proceedings of the 26th Bled eConference, Bled, Slovenia.

  50. How to address Pump & Dump Manipulations?
      • Qualitative Modeling
        • Based on expert knowledge
        • Qualitative attributes
        • Decision problem divided into sub-problems
        • Goal: daily assessments
      • Quantitative Modeling
        • Based on machine learning algorithms
        • Quantitative attributes
        • Goal: assessment of documents
