300 likes | 429 Views
Core Information Processing Technologies Technical Presentation & Demos. Miha Gr čar ( Dep artment of Knowledge Technologies, Jožef Stefan Institute ) Achim Klein (University of Hohenheim). Technical WPs. WP1 & WP8. WP2 & WP7. Architecture, Integration & Scaling Strategy. UC#1
E N D
Core Information Processing TechnologiesTechnical Presentation & Demos Miha Grčar(DepartmentofKnowledge Technologies, Jožef Stefan Institute) Achim Klein (University of Hohenheim) Luxembourg, November 2011
Technical WPs WP1 & WP8 WP2 & WP7 Architecture, Integration & Scaling Strategy UC#1 Market Surveillance UC#2 Reputational Risk management UC#3 Online Retail Brokerage Domain-independent GUI (Open Source) WP3 WP4 WP6 Management Dissemination & Exploitation Data Acquisition Data Acquisition Ontology Infrastructure Information Extraction Sentiment Analysis Decision Support Infrastructure We are here WP5 Information Integration Data, Information & Knowledge Base WP9 WP10 FIRST Y1 Review Meeting
Data acquisition pipeline (Dacq) Boilerplate remover Boilerplate remover Boilerplate remover ZeroMQ emitter Semantic annotator Duplicate detector Natural language preproc. Language detector Language detector Duplicate detector Natural language preproc. Semantic annotator ZeroMQ emitter Language detector Duplicate detector Semantic annotator ZeroMQ emitter Natural language preproc. Loadbalancing RSS reader RSS reader . . . . . . processingpipelines RSS reader One readerper site FIRST Y1 Review Meeting
Data acquisition pipeline (Dacq) Demo video (3:20) FIRST Y1 Review Meeting
Data acquisition pipeline Boilerplate remover Boilerplate remover Boilerplate remover Duplicate detector Language detector Duplicate detector Semantic annotator ZeroMQ emitter Semantic annotator Language detector ZeroMQ emitter Language detector Duplicate detector Natural language preproc. Semantic annotator ZeroMQ emitter Natural language preproc. Natural language preproc. RSS reader RSS reader . . . . . . RSS reader FIRST Y1 Review Meeting
Boilerplate removal Demo video (1:30)
Data acquisition pipeline Boilerplate remover Boilerplate remover Boilerplate remover Duplicate detector Language detector Duplicate detector Semantic annotator ZeroMQ emitter Semantic annotator Language detector ZeroMQ emitter Language detector Duplicate detector Natural language preproc. Semantic annotator ZeroMQ emitter Natural language preproc. Natural language preproc. RSS reader RSS reader . . . . . . RSS reader FIRST Y1 Review Meeting
Language detection • Motivation: language-specific text analysis components • Relatively simple problem • Solutions based on word or character sequences (language models) • Side effects: removes “garbage” and can be used to identify code page • Our implementation based on frequencies of character sequences Demo video (0:45) FIRST Y1 Review Meeting
Data acquisition pipeline Boilerplate remover Boilerplate remover Boilerplate remover Duplicate detector Language detector Duplicate detector Semantic annotator ZeroMQ emitter Semantic annotator Language detector ZeroMQ emitter Language detector Duplicate detector Natural language preproc. Semantic annotator ZeroMQ emitter Natural language preproc. Natural language preproc. RSS reader RSS reader . . . . . . RSS reader FIRST Y1 Review Meeting
Near-duplicate detection • Why is this a difficult problem? • We are dealing with millions of documents – cannot afford to compare every document with every document • We are also looking for near-duplicates, not only exact matches • Overlooked boilerplate “produces” false near-duplicates Demo video (1:00) FIRST Y1 Review Meeting
Near-duplicate detection • Existing approaches like SimHash, shingling and sketching, SpotSigs… • Apart from SpotSigs, they require “clean” documents • Hard to interpret similarity value (how many characters, words, sentences?) • Developing a novel solution to remove boilerplate and detect duplicates [with clear interpretation] in the same framework FIRST Y1 Review Meeting
Technical WPs WP1 & WP8 WP2 & WP7 Architecture, Integration & Scaling Strategy UC#1 Market Surveillance UC#2 Reputational Risk management UC#3 Online Retail Brokerage Domain-independent GUI (Open Source) WP3 WP4 WP6 Management Dissemination & Exploitation Information Extraction Data Acquisition Ontology Infrastructure Ontology Infrastructure Information Extraction Sentiment Analysis Decision Support Infrastructure We are here WP5 Information Integration Data, Information & Knowledge Base WP9 WP10 FIRST Y1 Review Meeting
FIRST ontology SentimentObject FinancialInstrument Index Stock_Index Stock Company Country Countries Companies Constituents (stocks) Seedindices FIRST Y1 Review Meeting
FIRST ontology :NASDAQ_100 a :Stock_Index ; rdfs:label "NASDAQ-100" . :MICROSOFT a :Stock ; rdfs:label "MICROSOFT CORP COM USD0.00000625" ; :memberOf :NASDAQ_100 . :MICROSOFT_CORP a :Company ; rdfs:label "Microsoft Corp." ; :issues :MICROSOFT . :USA a :Country ; rdfs:label "USA" . :MICROSOFT_CORP :locatedIn :USA . :MICROSOFT_CORP :hasGazetteer :MICROSOFT_CORP_Gazetteer . :MICROSOFT_CORP_Gazetteer :hasTerm "Microsoft Corp" ; :hasTerm "Microsoft Corporation" ; :hasStopWord "CORP" ; :hasStopWord "CORPORATION" ; a :Gazetteer . Microsoft Corporation is engaged in developing, licensing and supporting a range of software products and services. Microsoft also designs and sells hardware, and delivers online advertising to the customers. Microsoft Corp FIRST Y1 Review Meeting
FIRST ontology OrientationPhrase Financial Instrument correlationDefinition InfluencesObject Correlation Definition Sentiment Object objectHasCorrelationDefinition Company correlationDefinitionInfluencesIndicator correlationDefinition InfluencesFeature featureHasCorrelationDefinition indicatorHas CorrelationDefinition Technical Indicator MacroIndicator Price Feature Fundamental Volatility MicroIndicator Reputation
Annotation pipeline Boilerplate remover Boilerplate remover Boilerplate remover Duplicate detector Language detector Duplicate detector Semantic annotator Language detector ZeroMQ emitter ZeroMQ emitter Semantic annotator Language detector Duplicate detector Natural language preproc. Semantic annotator ZeroMQ emitter Natural language preproc. Natural language preproc. Ontology-basedsemanticannotation RSS reader Demo video (3:00) RSS reader . . . . . . RSS reader FIRST Y1 Review Meeting
Technical WPs WP1 & WP8 WP2 & WP7 Architecture, Integration & Scaling Strategy UC#1 Market Surveillance UC#2 Reputational Risk management UC#3 Online Retail Brokerage Domain-independent GUI (Open Source) WP3 WP4 WP6 Management Dissemination & Exploitation Data Acquisition Ontology Infrastructure Information Extraction Sentiment Analysis Sentiment Analysis Decision Support Infrastructure We are here WP5 Information Integration Data, Information & Knowledge Base WP9 WP10 FIRST Y1 Review Meeting
Sentiment Analysis • Object: Sentiment in financial web texts • Problem: Classificationofsentimentorientationwithrespecttoexpectedfuture … • pricechangeoffinancialinstruments • volatilitychangeoffinancialinstruments • reputationchangeofcompanies • Approach: Knowledge-basedsentimentclassification • Startingatthesentence-level • Specifictofeaturesofobjects(e.g., reputationof a company)
Example Shortterm: uptrendSupport for the SPX remains at 848 and then 789, with resistance at 912 and then 935. Short term momentum was overbought during the rally early in the week and is now displaying a positive divergence at friday's lows. Should the market fail to hold this pivot (SPX 840) in the days and weeks ahead the uptrend is likely over. Longterm: bear marketThe Cycle wave bear market of October 2007 continues. Thus far, equity markets worldwide have declined on average about 50%. The opportunity still remains for the US and World economies to avoid a devastatingSupercycle bear market like that of 1929-1932. • Ambiguity: ”The low clarity of messages implies that quite often people would be likely to disagree on the classification” [Das and Chen 2007]. • Identificationanddifferentiationofobjects (andfeatures) • Relationshipsofindicators (e.g., earnings) andobjects http://caldaroew.spaces.live.com/Blog/cns!D2CB8C5EBA2ADE86!27847.entry
Manual Sentiment Annotation Topic FIRST Y1 Review Meeting
FIRST Knowledge-basedSentiment Analysis Approach 1. Identify 2. Extract 3. Classifysentimentorientation {positive, negative} forall sentence-levelsentiments 4. Aggregate Rules,Ontology Allsentimentobjectsandfeatures Support for the SPX remains at 848 and then 789, with resistance at 912 … Support for the SPX remains at 848 and then 789, with resistance at 912 … Rules,Ontology All sentence-levelsentiments Scoring All sentiments in onedocument Document-levelsentiment score Averaging All sentimentscoresfor a givenday Sentiment Index [-1,1]
Sentence-levelSentiment Classification a) directly Example: „I expecttheS&P 500 torise“ positivesentiment Addressedbyrules b) indirectly,via an indicator Example: „I thinkU.S. interestrateswill rise“ negativesentiment Addressedbyontology
Example Text: S&P 500 http://business.financialpost.com/2011/10/04/economic-uncertainty-could-fan-volatility/ Oct 4, 2011 – 3:24 PM ET The fourth quarter began on Monday with the broad S&P 500 on the precipice of a bear market and investors lacking confidence in either European or U.S. policymakers being able to stem the disquiet surrounding the debt crisis.Wall Street typically defines a bear market as a drop of 20 percent or more from a recent high. Volatility is at its most persistently elevated since the financial crisis of 2008, as measured by the popular VIX , or CBOE Volatility Index. Barring a knock-out U.S. earnings period in the next month, it could remain high, and investors should brace for wild swings and more down days.
Sentiment Sentences on Price Change of S&P 500 Negative sentiment about the future price change of the S&P 500 FIRST Y1 Review Meeting
Sentiment Sentences on Volatility Change of S&P 500 Positive sentiment about the future volatility change of the S&P 500 FIRST Y1 Review Meeting
Document-level Sentiment with respect to multiple objects/features
Initial Experiment Results • Accuracyofknowledge-basedsentimentclassification vs. standardmachinelearningmethods • Small manuallyclassifiedcorpus • Result: 7% moreaccurate • Portfolio selectionexperiment • Usesentimenttoselect Dow Jones stocks • Result: Excessreturnsseempossible More information in paper: „Extracting Investor Sentiment from Weblog Texts: A Knowledge-basedApproach“, published in IEEE CEC 2011 conferenceproceedings FIRST Y1 Review Meeting
Main Y1 Achievements • Data acquisitionsoftwarerunning • Sentiment analysisfor web texts • Sentence-level, andspecifictofeaturesofobjects • Initial experimentresultsare promising • Ontologyavailable (~4000 instances) • Sentence-levelannotatedcorpusavailable(900 documentsandgrowing) • DeliveredasD3.1andD4.1 • Bookchapteron dataacquisition in preparation • Paper on sentimentextraction (bestpaperawardat CEC 2011 conference) FIRST Y1 Review Meeting
Next Steps • Improveontologyandgazetteers • Usecorpustoimprovesentimentclassification • Increasethroughputofsentimentextraction FIRST Y1 Review Meeting
Thankyou 30