
Ontology-based Information Extraction for Business Intelligence

Horacio Saggion, Adam Funk, Diana Maynard, Kalina Bontcheva. Natural Language Processing Group, University of Sheffield, United Kingdom.


Presentation Transcript


  1. Horacio Saggion, Adam Funk, Diana Maynard, Kalina Bontcheva
     Natural Language Processing Group, University of Sheffield, United Kingdom
     Ontology-based Information Extraction for Business Intelligence

  2. Outline
     • The MUSING Project
     • Ontology-based IE
     • MUSING Natural Language Processing Technology
     • MUSING applications
     • Customisation
     • Results
     • Conclusions & Future Work

  3. MUSING project
     • Business Intelligence (BI) is the process of finding, gathering, aggregating, and analysing information for decision making
     • Many BI systems are portals which give business analysts access to information
     • It is the work of the business analyst to dig into the documents in order to extract useful facts for decision making
     • MUSING is a 7th Framework Programme Project from the European Commission which promotes the adoption of BI tools based on semantic-based knowledge and content systems
     • Analytical techniques traditionally used in BI rely on structured information and hardly ever use the qualitative information (e.g. opinions) which the industry is keen to use
     • One of the goals of MUSING is to use structured as well as unstructured information for decision making

  4. Ontology-based Information Extraction (OBIE)
     • Information extraction (IE) is a technology which extracts key pieces of information from text
       • generic: identify specific name mentions in text (person names, location names, money, etc.)
       • specific: populate a structured representation (e.g. a template) with “strings” from text (e.g., full information on a joint venture)
     • OBIE is the process of finding in text and other sources the concepts, instances, and relations expressed in an ontology

  5. Ontology-based Information Extraction (OBIE)
     • Extracting information about a company requires, for example, identifying the Company Name, Company Address, Parent Organization, Shareholders, etc.
     • These associated pieces of information should be asserted as property values of the company instance
     • Statements for populating the ontology need to be created (“Alcoa Inc” hasAlias “Alcoa”; “Alcoa Inc” hasWebPage “http://www.alcoa.com”; etc.)
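The statements mentioned on this slide can be pictured as subject, predicate, object triples. The following is a minimal sketch, not the MUSING implementation: the `Statement` type and the `company_statements` helper are hypothetical names introduced for illustration, and only the two Alcoa properties quoted on the slide are used as data.

```python
from typing import Dict, List, NamedTuple


class Statement(NamedTuple):
    """One ontology-population statement (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


def company_statements(name: str, properties: Dict[str, List[str]]) -> List[Statement]:
    """Turn extracted property values into one statement per value."""
    statements = []
    for predicate, values in properties.items():
        for value in values:
            statements.append(Statement(name, predicate, value))
    return statements


# Property values as they might come out of the extraction step.
extracted = {
    "hasAlias": ["Alcoa"],
    "hasWebPage": ["http://www.alcoa.com"],
}
statements = company_statements("Alcoa Inc", extracted)

for s in statements:
    print(f'"{s.subject}" {s.predicate} "{s.obj}"')
```

Keeping one statement per property value means a company with several aliases or shareholders simply yields several statements with the same subject and predicate.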

  6. Ontology-based Information Extraction in MUSING
     [Architecture diagram. Labels: data source provider, domain expert, ontology curator, user, document, MUSING ontology, document collector, user input, MUSING application, MUSING data repository, region selection model, ontology-based document annotation, economic indicators, region rank, enterprise intelligence, company information, annotated document, report, ontology population, knowledge base (instances & relations)]

  7. Data Sources and Ontology
     • Data sources are provided by MUSING partners and include balance sheets, company profiles, press data, web data, etc. (some private data)
       • Il Sole 24 ORE, CreditReform data
       • Companies’ web pages (main, “about us”, “contact us”, etc.)
       • Wikipedia, CIA Fact Book, etc.
     • The ontology is manually developed through interaction with domain experts and ontology curators
     • It extends the PROTON ontology and covers the financial, international, and IT operative risk domains

  8. Partial Ontology View

  9. Natural Language Processing Technology
     • The OBIE system for English is being developed using the GATE system (http://gate.ac.uk); the German and Italian systems are based on SProUT tools developed by DFKI
     • GATE components used include: tokeniser; sentence splitter; part-of-speech tagger; morphological analyser; parsers; etc.
     • GATE comes with an extraction system called ANNIE, but it targets only a small fraction of the MUSING application domain

  10. Natural Language Processing Technology
      • The main components adapted for MUSING applications are the gazetteer lists and grammars used for named entity recognition
      • New components include:
        • an ontology mapping component, which maps entities into specific classes in the given ontology
        • a component which creates RDF statements for ontology population based on the application specification, for example creating a company instance with all its properties as found in the text
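The ontology mapping step described above can be sketched as a lookup from a named-entity type (optionally refined by a subtype) to an ontology class. This is an illustrative assumption about how such a component might work: the `Entity` type, the `CLASS_MAP` table, the subtype feature, and the class names are all hypothetical and are not taken from the MUSING ontology or from GATE.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    """A named entity produced by the recogniser (hypothetical structure)."""
    text: str
    ne_type: str
    subtype: Optional[str] = None


# Hypothetical mapping from (entity type, subtype) to ontology class names.
CLASS_MAP = {
    ("Organization", "company"): "Company",
    ("Organization", None): "Organization",
    ("Person", None): "Person",
    ("Location", "region"): "PoliticalRegion",
    ("Location", None): "Location",
}


def map_to_class(entity: Entity) -> str:
    """Pick the most specific class available, falling back to a generic one."""
    specific = CLASS_MAP.get((entity.ne_type, entity.subtype))
    if specific is not None:
        return specific
    # No entry for this subtype: fall back to the type-level class,
    # or to a catch-all class if the type itself is unknown.
    return CLASS_MAP.get((entity.ne_type, None), "Entity")


company_class = map_to_class(Entity("Alcoa Inc", "Organization", "company"))
city_class = map_to_class(Entity("Sheffield", "Location", "city"))
unknown_class = map_to_class(Entity("25%", "Percent"))
print(company_class, city_class, unknown_class)
```

The fallback order (specific subtype, then type, then a catch-all) keeps unmapped entities in the knowledge base under a general class rather than dropping them.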

  11. Cross-source Coreference
      • One important problem to be addressed in extraction from multiple sources is deciding whether a person name (or any other name) occurring in two different sources refers to the same individual
      • Given a set of documents containing a given person name, we apply an agglomerative clustering algorithm; at the end, documents referring to the same individual belong to the same cluster
      • The algorithm uses vector representations of the documents (terms and weights)
      • We experimented with two types of terms, words and entity names. Our results indicate that a representation using one specific type of name (i.e., Organization) achieves state-of-the-art performance; however, performance varies depending on the data set
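The clustering idea on this slide can be sketched in a few lines. The slide does not say which term weights, similarity measure, linkage, or stopping criterion were used, so the choices here (raw term counts, cosine similarity, average linkage, a fixed similarity threshold) are assumptions, and the organisation names in the toy documents are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_similarity(c1, c2, vectors):
    """Average-link similarity between two clusters of document indices."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def agglomerative(vectors, threshold):
    """Merge the most similar pair of clusters until none reaches the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_similarity(clusters[x], clusters[y], vectors)
                if best is None or s > best[0]:
                    best = (s, x, y)
        if best[0] < threshold:
            break
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Each document mentioning the same person name is represented by the
# Organization names found in it (invented names, for illustration only).
docs = [
    ["Acme Bank", "Acme Holdings"],
    ["Acme Bank", "Acme Holdings", "Acme Bank"],
    ["City Hospital"],
]
vectors = [Counter(d) for d in docs]
clusters = agglomerative(vectors, threshold=0.5)
print(clusters)
```

In this toy run the first two documents share their organisations and end up in one cluster, while the third stays on its own, which corresponds to two distinct individuals bearing the same name.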

  12. MUSING Applications
      • A number of applications have been specified to demonstrate the use of semantic-based technology in BI. Some examples include:
        • Collecting company information from multiple multilingual sources (English, German, Italian) to provide up-to-date information on competitors
        • Identifying chances of success in regions of a particular country
        • Identifying appropriate partners to do business with
        • Creating a joint ventures database from multiple sources

  13. Region Selection Application
      • Given information on a company and the desired form of internationalisation (e.g., export, direct investment, alliance), the application provides a ranking of regions which indicates the most suitable places for the type of business
      • A number of social, political, geographical, and economic indicators or variables for each region (such as surface area, labour costs, tax rates, population, literacy rates, etc.) have to be collected to feed a statistical model

  14. Region Selection Application
      • Data sources used for the OBIE application are statistics from governmental sources and available region profiles found on the Web (e.g. Wikipedia)
      • Gazetteer lists contain location names and associated information, together with keywords that help identify the key information
      • Grammars use contextual information and named entities to identify the target variables, e.g. “unemployment rate of 25% (2001)”
      • Extraction performance obtained: F-score > 80%
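To make the keyword-plus-context idea concrete, here is a small sketch that pulls an indicator, a value, and an optional year out of a phrase like the one quoted on the slide. The slide describes gazetteers and grammars, not regular expressions, so the Python regular expression is a simplified stand-in; the keyword list and the indicator codes `UNEMPLOYMENT` and `LITERACY` are hypothetical.

```python
import re

# Hypothetical keyword list mapping trigger phrases to indicator codes.
KEYWORDS = {
    "unemployment rate": "UNEMPLOYMENT",
    "literacy rate": "LITERACY",
}

# keyword + "of" + percentage + optional year in parentheses
PATTERN = re.compile(
    r"(?P<keyword>unemployment rate|literacy rate)\s+of\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*%"
    r"(?:\s*\((?P<year>\d{4})\))?",
    re.IGNORECASE,
)


def extract_indicators(text):
    """Return one record per indicator mention found in the text."""
    results = []
    for m in PATTERN.finditer(text):
        results.append({
            "indicator": KEYWORDS[m.group("keyword").lower()],
            "value": float(m.group("value")),
            "year": int(m.group("year")) if m.group("year") else None,
        })
    return results


records = extract_indicators("The region has an unemployment rate of 25% (2001).")
print(records)
```

A real grammar would also use the recognised location entities to attach each value to the right region; this sketch only covers the indicator phrase itself.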

  15. Region Selection Application: example
      Source text:
        Tamil Nadu ... Population (2001) 62,405,679 (6) Density 478/km² ...
      Extracted RDF:
        <rdf>
          <indicator:Measurement rdf:ID="Measurement_91567">
            <indicator:hasValue>478</indicator:hasValue>
            <indicator:hasPoliticalRegion rdf:resource=".../int/region#TamilNadu" />
            <indicator:hasIndicator rdf:resource=".../int/indicator#DENS" />
            <time:hasTimeSlice xmlns:time=".../general/time#">
              <time:TimeSlice rdf:ID="TimeSlice_40715">
                <time:hasTemporalEntity>
                  <time:ProperInstantYear rdf:ID="ProperInstantYear_57895">
                    <time:year rdf:datatype="#int">2001</time:year>
                  </time:ProperInstantYear>
                </time:hasTemporalEntity>
              </time:TimeSlice>
            </time:hasTimeSlice>
          </indicator:Measurement>
        </rdf>
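A measurement like the one in this example could be serialised with Python's standard `xml.etree.ElementTree` module, as sketched below. This is an assumption about one possible approach and not the component used in MUSING. The slide elides the real namespace URIs, so the `http://example.org/...` URIs are placeholders; the sketch also wraps the result in a standard `rdf:RDF` root and omits the `rdf:ID` and `rdf:datatype` attributes on the nested time elements.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
IND = "http://example.org/indicator#"   # placeholder: real URI is elided on the slide
TIME = "http://example.org/time#"       # placeholder: real URI is elided on the slide

ET.register_namespace("rdf", RDF)
ET.register_namespace("indicator", IND)
ET.register_namespace("time", TIME)


def q(ns, tag):
    """Build a namespace-qualified name in ElementTree's {uri}tag notation."""
    return f"{{{ns}}}{tag}"


def measurement_element(meas_id, value, region_uri, indicator_uri, year):
    """Create an indicator:Measurement element for one extracted value."""
    m = ET.Element(q(IND, "Measurement"), {q(RDF, "ID"): meas_id})
    ET.SubElement(m, q(IND, "hasValue")).text = str(value)
    ET.SubElement(m, q(IND, "hasPoliticalRegion"), {q(RDF, "resource"): region_uri})
    ET.SubElement(m, q(IND, "hasIndicator"), {q(RDF, "resource"): indicator_uri})
    # Nested time description: hasTimeSlice > TimeSlice > hasTemporalEntity > year
    time_slice = ET.SubElement(ET.SubElement(m, q(TIME, "hasTimeSlice")), q(TIME, "TimeSlice"))
    instant = ET.SubElement(
        ET.SubElement(time_slice, q(TIME, "hasTemporalEntity")),
        q(TIME, "ProperInstantYear"),
    )
    ET.SubElement(instant, q(TIME, "year")).text = str(year)
    return m


root = ET.Element(q(RDF, "RDF"))
root.append(measurement_element(
    "Measurement_91567",
    478,
    "http://example.org/region#TamilNadu",   # placeholder URI
    "http://example.org/indicator#DENS",     # placeholder URI
    2001,
))
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```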

  16. Conclusions and Future Work
      • MUSING integrates ontology-based extraction as a useful tool for Business Intelligence
      • The NLP applications analyse documents and populate a knowledge base
      • A number of practical applications have been defined which will use the KB’s stored facts
      • The extraction technology and its performance so far are promising
      • Our future work will concentrate on:
        • the full problem of ontology population, including a cross-source coreference mechanism
        • the identification of qualitative information (such as opinions), e.g. to model company reputation
        • moving from a rule-based system to a machine learning approach
