Ontology-based Information Extraction for Business Intelligence

Horacio Saggion, Adam Funk, Diana Maynard, Kalina Bontcheva
Natural Language Processing Group, University of Sheffield, United Kingdom

Outline The MUSING Project Ontology-based IE MUSING Natural Language Processing Technology MUSING applications Customisation Results Conclusions & Future Work

MUSING project
• Business Intelligence (BI) is the process of finding, gathering, aggregating, and analysing information for decision making.
• Many BI systems are portals that give business analysts access to information; it is the analyst's job to dig into the documents and extract useful facts for decision making.
• MUSING is a 7th Framework Programme project of the European Commission that promotes the adoption of BI tools based on semantic-based knowledge and content systems.
• Analytical techniques traditionally used in BI rely on structured information and hardly ever use the qualitative information (e.g. opinions) that the industry is keen to use.
• One of the goals of MUSING is to use both structured and unstructured information for decision making.

Ontology-based Information Extraction (OBIE)
• Information extraction (IE) is a technology that extracts key pieces of information from text.
• Generic IE: identify specific name mentions in text (person names, location names, money, etc.).
• Specific IE: populate a structured representation (e.g. a template) with “strings” from text (e.g. full information on a joint venture).
• OBIE is the process of finding, in text and other sources, the concepts, instances, and relations expressed in an ontology.

Ontology-based Information Extraction (OBIE)
• Extracting information about a company requires, for example, identifying the company name, company address, parent organization, shareholders, etc.
• These associated pieces of information should be asserted as property values of the company instance.
• Statements for populating the ontology need to be created (“Alcoa Inc” hasAlias “Alcoa”; “Alcoa Inc” hasWebPage “http://www.alcoa.com”; etc.).
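The statements on this slide can be pictured as subject-predicate-object triples. What follows is a minimal sketch, assuming a plain Python representation; the function name `company_statements` and the dictionary layout are illustrative assumptions, not part of the MUSING system.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def company_statements(name: str, properties: Dict[str, List[str]]) -> List[Triple]:
    """Turn extracted property values into (subject, predicate, object) triples.

    Hypothetical helper: the real system's data structures are not described
    in the slides.
    """
    triples: List[Triple] = []
    for predicate, values in properties.items():
        for value in values:
            triples.append((name, predicate, value))
    return triples


# Property values taken from the Alcoa example on the slide.
extracted = {
    "hasAlias": ["Alcoa"],
    "hasWebPage": ["http://www.alcoa.com"],
}
statements = company_statements("Alcoa Inc", extracted)
for subject, predicate, obj in statements:
    print(f'"{subject}" {predicate} "{obj}"')
```

Keeping each property as a list of values allows a company to have several aliases or shareholders without changing the shape of the output.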

Ontology-based Information Extraction in MUSING
[Architecture diagram. Labels shown: data source provider; domain expert; ontology curator; user; document; MUSING ontology; document collector; user input; MUSING application; MUSING data repository; region selection model; ontology-based document annotation; economic indicators; region rank; enterprise intelligence; company information; annotated document; report; ontology population; knowledge base (instances & relations).]

Data Sources and Ontology
• Data sources are provided by MUSING partners and include balance sheets, company profiles, press data, web data, etc. (some of the data is private):
  • Il Sole 24 ORE and CreditReform data;
  • companies' web pages (main, “about us”, “contact us”, etc.);
  • Wikipedia, CIA Fact Book, etc.
• The ontology is developed manually through interaction with domain experts and ontology curators.
• It extends the PROTON ontology and covers the financial, international, and IT operative risk domain.

Partial Ontology View

Natural Language Processing Technology
• The OBIE system for English is being developed with the GATE system (http://gate.ac.uk); the German and Italian systems are based on Sprout tools developed by DFKI.
• GATE components used include a tokeniser, sentence splitter, part-of-speech tagger, morphological analyser, parsers, etc.
• GATE comes with an extraction system called ANNIE, but it targets only a small fraction of the MUSING application domain.

Natural Language Processing Technology
• The main components adapted for MUSING applications are the gazetteer lists and grammars used for named entity recognition.
• New components include:
  • an ontology mapping component, which maps entities into specific classes in the given ontology;
  • a component that creates RDF statements for ontology population based on the application specification, for example creating a company instance with all its properties as found in the text.
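The ontology mapping step can be sketched as a lookup from named-entity types to ontology classes. This is only an illustration under stated assumptions: the `Annotation` class, the `TYPE_TO_CLASS` table, and the `onto:` class names are hypothetical placeholders, not the actual MUSING or PROTON identifiers.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical mapping from named-entity types to ontology classes.
# The class names are placeholders, not real MUSING/PROTON identifiers.
TYPE_TO_CLASS: Dict[str, str] = {
    "Organization": "onto:Company",
    "Person": "onto:Person",
    "Location": "onto:Location",
}


@dataclass
class Annotation:
    """A named-entity mention: the matched text and its entity type."""
    text: str
    type: str


def type_statement(annotation: Annotation) -> Optional[Tuple[str, str, str]]:
    """Return an (instance, rdf:type, class) statement, or None if unmapped."""
    ontology_class = TYPE_TO_CLASS.get(annotation.type)
    if ontology_class is None:
        return None  # entity type has no counterpart in the ontology
    return (annotation.text, "rdf:type", ontology_class)


print(type_statement(Annotation("Alcoa Inc", "Organization")))
```

Entity types without a mapping are skipped rather than forced into a generic class, so that only statements the ontology can represent are produced.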

Cross-source Coreference
• One important problem in extraction from multiple sources is deciding whether a person name (or any other name) occurring in two different sources refers to the same individual.
• Given a set of documents containing a given person name, we apply an agglomerative clustering algorithm; at the end, documents referring to the same individual belong to the same cluster.
• The algorithm uses vector representations of the documents (terms and weights).
• We experimented with two types of terms, words and entity names. Our results indicate that a representation using one specific type of name (i.e. Organization) achieves state-of-the-art performance; however, performance varies with the data set.
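The clustering step can be illustrated with a small sketch. The slide does not state the linkage criterion, the similarity measure, the term weighting, or the stopping threshold, so single-link merging, cosine similarity, raw counts, and a threshold of 0.5 are all assumptions made here for illustration, as is the toy data.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]  # term -> weight


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors: List[Vector], threshold: float) -> List[List[int]]:
    """Single-link agglomerative clustering (assumed linkage).

    Start with one cluster per document and repeatedly merge the most
    similar pair until no pair reaches the threshold.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, pair = -1.0, (0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(cosine(vectors[x], vectors[y])
                          for x in clusters[i] for y in clusters[j])
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data: Organization names found in three documents mentioning one person name.
docs = [
    ["University of Sheffield", "DFKI"],
    ["University of Sheffield"],
    ["Alcoa Inc"],
]
vectors = [{t: float(c) for t, c in Counter(names).items()} for names in docs]
result = cluster(vectors, threshold=0.5)
print(result)
```

In this toy run the first two documents share an organization name and end up in one cluster, while the third stays on its own.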

MUSING Applications
• A number of applications have been specified to demonstrate the use of semantic-based technology in BI. Some examples:
  • collecting company information from multiple multilingual sources (English, German, Italian) to provide up-to-date information on competitors;
  • identifying chances of success in the regions of a particular country;
  • identifying appropriate partners to do business with;
  • creating a joint ventures database from multiple sources.

Region Selection Application
• Given information on a company and the desired form of internationalisation (e.g. export, direct investment, alliance), the application provides a ranking of regions that indicates the most suitable places for the type of business.
• A number of social, political, geographical, and economic indicators or variables for each region (e.g. surface area, labour costs, tax rates, population, literacy rates) have to be collected to feed a statistical model.

Region Selection Application
• Data sources used for the OBIE application are statistics from governmental sources and region profiles available on the Web (e.g. Wikipedia).
• Gazetteer lists contain location names and associated information, together with keywords that help identify the key information.
• Grammars use contextual information and named entities to identify the target variables, e.g. “unemployment rate of 25% (2001)”.
• Extraction performance obtained: F-score > 80%.
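The way a keyword list and a contextual pattern combine to pick out an indicator can be sketched with a plain regular expression. This stands in for the grammar rules and is not the GATE grammar formalism; the keyword list, `PATTERN`, and `extract_indicator` are illustrative assumptions.

```python
import re
from typing import Dict, Optional, Union

# Hypothetical keyword list playing the role of a gazetteer of indicator names.
KEYWORDS = ["unemployment rate", "literacy rate", "tax rate"]

# Pattern: <keyword> of <number>% optionally followed by (<year>).
PATTERN = re.compile(
    r"(?P<indicator>" + "|".join(re.escape(k) for k in KEYWORDS) + r")"
    r"\s+of\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*%"
    r"(?:\s*\((?P<year>\d{4})\))?",
    re.IGNORECASE,
)


def extract_indicator(text: str) -> Optional[Dict[str, Union[str, float, int, None]]]:
    """Return the first indicator mention found in the text, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    year = match.group("year")
    return {
        "indicator": match.group("indicator").lower(),
        "value": float(match.group("value")),
        "year": int(year) if year else None,
    }


print(extract_indicator("unemployment rate of 25% (2001)"))
```

The year is optional in the pattern because region profiles do not always date their figures; a missing year is reported as `None` rather than guessed.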

Region Selection Application: example
Input text:
  Tamil Nadu ... Population (2001) 62,405,679 (6) Density 478/km ...
Output:
  <rdf>
    <indicator:Measurement rdf:ID="Measurement_91567">
      <indicator:hasValue>478</indicator:hasValue>
      <indicator:hasPoliticalRegion rdf:resource=".../int/region#TamilNadu" />
      <indicator:hasIndicator rdf:resource=".../int/indicator#DENS" />
      <time:hasTimeSlice xmlns:time=".../general/time#">
        <time:TimeSlice rdf:ID="TimeSlice_40715">
          <time:hasTemporalEntity>
            <time:ProperInstantYear rdf:ID="ProperInstantYear_57895">
              <time:year rdf:datatype="#int">2001</time:year>
            </time:ProperInstantYear>
          </time:hasTemporalEntity>
        </time:TimeSlice>
      </time:hasTimeSlice>
    </indicator:Measurement>
  </rdf>

Conclusions and Future Work
• MUSING integrates ontology-based extraction as a useful tool for Business Intelligence.
• The NLP applications analyse documents and populate a knowledge base.
• A number of practical applications have been defined which will use the facts stored in the KB.
• Extraction technology and performance are promising so far.
• Our future work will concentrate on:
  • the full problem of ontology population, including a cross-source coreference mechanism;
  • the identification of qualitative information (such as opinions), e.g. to model company reputation;
  • moving from a rule-based system to a machine learning approach.
