The Medical Information System - MedISys
eHealth 2009: Second International ICST Conference on Electronic Healthcare for the 21st century, September 23-25, 2009, Istanbul, Turkey
Erik van der Goot & the OPTIMA team (OPensource Text Information Mining and Analysis)
European Commission – Joint Research Centre (JRC), Institute for the Protection and Security of the Citizen (IPSC)
erik.van-der-goot@jrc.ec.europa.eu
MedISys - Overview
Objective:
• Provide open source data collection and analysis for surveillance and epidemiology
• Replace manual scanning of multiple newspapers and web portals
• Support national and international Public Health (PH) organisations to monitor issues of Public Health concern (e.g. CBRN)
Functionality:
• Gather, filter, classify, extract and aggregate health-related information
• Monitor trends, detect breaking news
• Visualise analysis results
• Alert users
• Allow customised views
• In combination with the RNS tool, allow manual moderation
Background - History Based on JRC’s Europe Media Monitor (EMM) technology (EMM live since 2002; http://emm.newsbrief.eu). On request / initiative of the EC’s Directorate General for Health and Consumer Protection (DG SANCO). Password-protected service for Public Health bodies since 2005. Public service since early 2007 (http://medusa.jrc.it/, restricted functionality).
Background - Media Monitoring
• EU Commission Media Monitoring (until 2001/2002)
  • Traditional cut and paste for printed press only
  • Monitoring of incoming news wires (e.g. Reuters, AFP)
  • Simple keyword based filtering of wires
  • Manual selection of printed press items
  • Human classification of items
• Potential problems
  • Not ‘real-time’ for mainstream media: printed press typically once a day
  • Limited coverage: not all media is printed
  • Inaccurate and incomplete classification: subjective and limited number of categories
  • Labour intensive and expensive: limited number of articles per reviewer per day, requires topical knowledge and language knowledge
EMM History
• New Challenges (as seen in 2002)
  • Enlargement (+10 countries): more media, more languages
  • More use of electronic publishing (media)
  • Electronic distribution of results (web + mobile)
  • Automatic alerting functions
• New approach: EMM - a one stop shop for Media Monitoring
  • Facilitate (not replace) human Media Monitoring activities
  • Extend monitoring beyond the traditional news wires (Internet)
  • Improve coverage, number of languages, analysis
  • Apply automatic categorization and analysis to all sources
  • Provide new services like automatic e-mail, SMS, mobile editions etc.
  • Provide editorial system to manage the information and produce newsletters etc.
• Important: EMM is not Yet Another Internet Search Engine
EMM System Features
• Automatic language recognition
  • Based on continuously updated language specific frequency tables
• Automated information/entity extraction
  • 400,000 persons and organizations, based on a continuously updated list of entities with many language specific synonyms
• Geotagging
  • Based on a homegrown harmonised multilingual geo-data set: about 600,000 place name variants in most languages covered by EMM, mostly national, regional and provincial capitals
• Improved Categorization Engine
  • Boolean combinations, proximity, wildcards
  • Support for Arabic and similar (automatic noun-prefix processing)
  • Support for Chinese and similar (no whitespace)
• Tonality/Sentiment
  • Simple bag of words approach, ranging from very negative to very positive, corrected for long term source bias; interesting for following reporting trends per category
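The tonality bullet above can be illustrated with a minimal sketch. It assumes a tiny hypothetical lexicon and takes "long term source bias" to mean the running mean score of earlier articles from the same source; the slides do not give EMM's actual word lists, scale or correction formula, so every name and number here is illustrative only.

```python
from collections import defaultdict

# Hypothetical lexicon: word -> polarity weight (not EMM's actual word lists).
LEXICON = {"outbreak": -2, "death": -3, "crisis": -2,
           "recovery": 2, "cure": 3, "safe": 1}


def raw_tonality(text):
    """Bag-of-words score: summed word polarity divided by token count."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(LEXICON.get(t, 0) for t in tokens) / len(tokens)


class BiasCorrectedTonality:
    """Subtracts each source's long-term mean score from new scores."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def score(self, source, text):
        raw = raw_tonality(text)
        # Assumed bias model: mean raw score of all earlier articles
        # seen from this source (0.0 for a source with no history).
        n = self._count[source]
        bias = self._sum[source] / n if n else 0.0
        self._sum[source] += raw
        self._count[source] += 1
        return raw - bias
```

With this model, a source that always writes in alarming terms drifts towards a corrected score of zero, so only departures from its usual tone stand out.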
… more features
• Duplicate detection
• Metadata categorization
  • Allows selection of articles based on any previously assigned meta-data.
• Automated information linking
  • Incremental topic based clustering and storytracking, geolocation.
  • 10 minute interval incremental clustering on the last 4 hours' worth of news (Top Stories on front page).
• Automatic detection of breaking news
  • Cluster growth rate
  • Flux of articles per category
• Indexing
  • Index full text and most metadata.
• Statistics/Trend analysis
  • Quantitative analysis of reporting. Maintain simple count statistics.
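A rough sketch of incremental clustering over a 4-hour window, as named on the slide above. The slides do not say how EMM measures similarity, so this uses plain word-count vectors with cosine similarity and an assumed threshold of 0.5; the class and function names are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(hours=4)   # from the slide: last 4 hours of news
SIMILARITY_THRESHOLD = 0.5    # assumed value, for illustration only


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class IncrementalClusterer:
    def __init__(self):
        # Each cluster: {"centroid": Counter, "articles": [(time, text)]}
        self.clusters = []

    def add(self, when, text):
        """Attach the article to the most similar cluster, or start a new one."""
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for cluster in self.clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= SIMILARITY_THRESHOLD:
            best["centroid"] += vec
            best["articles"].append((when, text))
        else:
            self.clusters.append({"centroid": vec, "articles": [(when, text)]})

    def expire(self, now):
        """Drop articles older than the window and rebuild the centroids."""
        kept = []
        for cluster in self.clusters:
            recent = [(t, x) for t, x in cluster["articles"] if now - t <= WINDOW]
            if recent:
                centroid = Counter()
                for _, x in recent:
                    centroid += Counter(x.lower().split())
                kept.append({"centroid": centroid, "articles": recent})
        self.clusters = kept

    def top_stories(self):
        """Clusters ordered by size, largest first."""
        return sorted(self.clusters, key=lambda c: len(c["articles"]), reverse=True)
```

Calling `expire` and then `add` for each new batch every 10 minutes would mimic the update cycle described on the slide; comparing a cluster's size between two cycles gives a simple growth rate for breaking-news detection.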
…and more features
• Event extraction
  • Language independent event grammars are used to parse clusters, with language dependent resources filling the grammar slots.
  • Currently for 5 languages (en, fr, it, pt, ru); violent events and humanitarian events
[Figure: development time line, 2002-2009. Labels: EMM/RNS; domain specific application MedISys; continuous development; new features; NewsExplorer; first version 2005; EMM system redesign; redesign based on EMM; RNS redesign.]
MedISys System Overview
[Diagram. Components shown: MedISys NewsBrief; NewsDesk Service (a.k.a. RNS); Editorial Interface; EMM Open Source Monitoring Engine.]
Problems to solve
• Find relevant information
  • Millions of new articles/blogs/items/tweets published on the Internet each day
• Deliver the information to the right user
  • Allow for many (possibly overlapping) categories to meet specific needs
• Timely
  • Right now if possible
• In short: deliver targeted information to the right user in a timely manner
Approach
• Wide coverage
  • Many sources
  • Local, Regional, National and International coverage
• Many languages
  • Multilinguality & cross-lingual information access
• Fast coverage
  • High frequency monitoring of sites, some sites every 5 minutes
• Overcome the information overflow
  • Categorization, aggregation, duplicate identification, clustering
• Customisability of MedISys NewsBrief
• Search functions
• RNS tool for manual moderation and targeted dissemination
Input data
• ~ 2,200 sources (world-wide, but primary focus on Europe)
• ~ 4,000 HTML web pages + RSS feeds
• ~ 100 specialist medical sites
• ~ 20 commercial newswires
• Specialist pay-for sources (LexisMed)
• 24/7, near continuous monitoring
• ~ 80,000 new articles/items per day
• Converts dirty HTML with adverts, menus, HTML tags, ‘related stories’, etc. into clean and standardised Unicode-encoded RSS format
• Use RSS when available
• Perform full content analysis
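The HTML clean-up step above can be sketched with Python's standard `html.parser`. This is only an illustration of the idea: the set of tags treated as boilerplate is an assumption, and real pages need far more careful handling than the slides (or this sketch) describe.

```python
from html.parser import HTMLParser

# Assumption: these tags hold boilerplate rather than article text.
SKIPPED_TAGS = {"script", "style", "nav", "aside", "footer", "header"}


class ArticleTextExtractor(HTMLParser):
    """Collects visible text, ignoring anything inside skipped tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when outside skipped tags; collapse whitespace.
        if not self._skip_depth and data.strip():
            self._chunks.append(" ".join(data.split()))

    def text(self):
        return " ".join(self._chunks)


def clean_html(html):
    """Return the text of an HTML page with tags and boilerplate removed."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The resulting plain Unicode string could then be wrapped in an RSS item; that packaging step is left out here.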
MedISys – Current subscribers and users include …
• Supranational organisations
  • Directorate General Health and Consumer Protection (SANCO)
  • European Centre for Disease Prevention and Control, Stockholm (ECDC)
  • European Food Safety Authority (EFSA)
  • World Health Organisation (WHO)
• National Public Health organisations
  • Swiss Federal Office of Public Health
  • Icelandic Ministry of Health
  • Spanish Ministry of Sanitation & Ministry of Health and Consumer Protection
  • Institut de Veille Sanitaire (France)
  • Global Public Health Intelligence Network (Canada)
  • Danish Emergency Management Agency
  • Italian Ministry of Health and Ministry of Defence
  • Dutch Institute of Public Health & Food and Consumer Product Safety Authority
• The (general?) public
  • Currently ~ 1,000 visitors, ~ 37,000 hits per day on public system
Locations mentioned in MedISys medical articles across languages
[Figure: maps for English, French, Spanish, Portuguese, Italian and German.]
Importance of multilingual information gathering
Multilingual and cross lingual analysis (1)
• Data processing layer:
  • Detect ‘known entities’ across languages using large multilingual set of name variants (updated daily)
  • Geo-locate the articles using large multilingual geo-database
  • Apply content based categorization using multilingual category definitions
• Example category terms across languages: Influenza-A-Virus, influenzavirus tipo A, swine-origin influenza, sjevernoameričk gripe, pandemia influenzale, mexicaanse griep, мексиканск грипп, североамериканск грипп, pandemija svinjske sjevernoameričke gripe, grippe nouvelle, gripă porcină, svinjski grip, sikainfluenssa, svininfluensa, Schweineinfluenza, Porzine Influenza, Schweinegrippe, influenza porcina, prasečí chřipka
• Example name variants across languages:
  • Barack Obama (eu, yo)
  • Barak Obama (az, wo)
  • Барак Обама (ba, uk)
  • باراك أوباما (ar)
  • باراك اوباما (ar, fa)
  • Барак Хуссейн Обама (ru)
  • Baraque Obama (pt)
  • バラク・オバマ (ja)
  • บารัค โอบามา (th)
  • Բարաք Օբամա (hy)
  • ބަރަކް އޮބާމާ (dv)
  • באראק אבאמא (yi)
  • ברק אובאמה (he)
  • 贝拉克·奥巴马 (zh)
  • ބަރާކް އޮބާމާ (dv)
  • بارک اوبامہ (ur)
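Detecting 'known entities' through a table of name variants, as described on the slide above, might look like the sketch below. The table holds a handful of the variants from the slide; the normalisation (Unicode NFKC plus case folding) and the plain substring search are assumptions made for brevity, not a description of EMM's matcher.

```python
import unicodedata

# Tiny illustrative table; the variants are taken from the slide above.
NAME_VARIANTS = {
    "Barack Obama": [
        "Barack Obama", "Barak Obama", "Baraque Obama",
        "Барак Обама", "バラク・オバマ", "贝拉克·奥巴马",
    ],
}


def normalise(text):
    """Unicode-normalise and case-fold so variants compare consistently."""
    return unicodedata.normalize("NFKC", text).casefold()


def build_index(variants):
    """Map every normalised variant to its canonical entity name."""
    index = {}
    for canonical, names in variants.items():
        for name in names:
            index[normalise(name)] = canonical
    return index


def find_entities(text, index):
    """Return canonical names of all entities whose variants occur in text.

    Substring search is used (rather than token matching) so that it also
    works for scripts written without whitespace.
    """
    norm = normalise(text)
    return {canonical for variant, canonical in index.items() if variant in norm}
```

Because every variant maps to one canonical name, articles in different languages end up tagged with the same entity.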
Multilingual and cross lingual analysis (2)
• Data presentation layer:
  • ‘Convenience’ links to external Machine Translation programs, where available.
  • Display of other MedISys categories, and of persons and organisations found in text.
  • Display of on-line English translation of Chinese and Arabic
Aggregation of multilingual information Documents from all languages get classified according to the same countries and categories. An increase of the number of media reports on any country-category combination is detected, independently of the reporting language. Graphs and alerts may show events not yet reported in your own language.
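As a small illustration of language-independent aggregation, the sketch below counts articles per country-category pair and ignores the article language altogether. The dictionary layout of an article (`lang`, `countries`, `categories`) is a hypothetical stand-in for whatever metadata the system attaches.

```python
from collections import Counter


def aggregate(articles):
    """Count articles per (country, category), whatever their language."""
    counts = Counter()
    for article in articles:
        for country in article["countries"]:
            for category in article["categories"]:
                counts[(country, category)] += 1
    return counts
```

A Spanish, an English and a German report about influenza in Mexico all increment the same counter, which is why a rise can show up before anything is published in the reader's own language.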
Detection using statistics • Detect abnormal flux of reporting for a particular country/category combination
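The slide does not say which statistic is used, so the following is only one plausible way to flag an abnormal flux: compare today's article count for a country/category pair with the mean of recent days, measured in standard deviations (a z-score). The threshold and the floor on the standard deviation are assumed values.

```python
from statistics import mean, pstdev


def is_abnormal(history, today, z_threshold=3.0, min_std=1.0):
    """Flag today's count if it lies far above the recent daily average.

    history: article counts for previous days (same country/category pair).
    min_std: floor on the spread, so quiet categories do not alert on
             a change of one or two articles.
    """
    if not history:
        return False
    baseline = mean(history)
    spread = max(pstdev(history), min_std)
    return (today - baseline) / spread >= z_threshold
```

For example, with a history of about five articles a day, a day with forty articles is flagged while a day with six is not.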
News clusters mostly about category: Influenza A, Sat. 02-05-2009
PULS event detection: results from Helsinki University
Category definitions – Example: haemorrhagic fever
• Terms (single or multi-word)
  • Cumulative weights with threshold
  • Case forcing
    • Upper case characters in pattern only match uppercase in text (useful for acronyms etc.)
  • Wild cards
    • Single letters (_)
    • Zero, one or more letters (%)
    • Adjacent words (+)
• Boolean combinations of term lists
  • And, or, not
  • Using proximity operator (within X words)
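The term syntax above (wildcards `_`, `%`, `+`, case forcing, cumulative weights with a threshold) can be approximated by translating each term into a regular expression. This is a sketch under assumptions: the example terms and weights are invented, each term is counted once per text, and Boolean combinations and the proximity operator are left out.

```python
import re

LETTER = r"[^\W\d_]"  # any Unicode letter


def term_to_regex(term):
    """Translate a category term into a compiled regular expression."""
    parts = []
    for ch in term:
        if ch == "_":                 # exactly one letter
            parts.append(LETTER)
        elif ch == "%":               # zero, one or more letters
            parts.append(LETTER + "*")
        elif ch == "+":               # adjacent words
            parts.append(r"\s+")
        elif ch.isupper():            # case forcing: upper case matches itself only
            parts.append(re.escape(ch))
        elif ch.isalpha():            # lower case matches either case
            parts.append("[" + re.escape(ch) + re.escape(ch.upper()) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b")


def category_score(text, weighted_terms):
    """Sum the weights of all terms found in the text (each counted once)."""
    return sum(weight for term, weight in weighted_terms.items()
               if term_to_regex(term).search(text))


def in_category(text, weighted_terms, threshold):
    """A text belongs to the category when its score reaches the threshold."""
    return category_score(text, weighted_terms) >= threshold


# Hypothetical definition; the terms and weights are illustrative only.
HAEMORRHAGIC_FEVER = {"h%emorrhagic+fever": 50, "ebola": 30, "CCHF": 50}
```

Here `h%emorrhagic+fever` covers both the British and American spellings, while `CCHF` matches the acronym but not a lower-case `cchf`.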
Customisability of MedISys
• Add more news sources or new categories, e.g.
  • Events: Cricket World Cup, Rugby World Cup, UEFA Euro 2008
  • New diseases
  • Other classes, e.g. deliberate release of chemicals (on request of recognised users/partners)
• Output formats: web pages, email alerts, or RSS feed to integrate into your environment
• Email alerts: daily vs. breaking news only
  • For daily notification: specify hour
  • For breaking news: level-dependent
  • User-selected languages only
Rapid News Service - RNS (restricted to subscribed users)
• Allows MedISys users to further customise their view of the news
  • Selection of specific languages and feeds
• Allows human moderation
  • Manual selection of news items
  • Drag and drop compilation of newsletters
• Allows moderators to forward news items to user groups
  • Via SMS alerts, emails or newsletters
• Allows user management
• Shows overview of relative activity of each category over time
RNS moderation: Editing interface for newsletter Manual selection of news items, drag and drop compilation of newsletters.
RNS moderation: Alert overview page Time line shows overview of relative activity of each category over time.
MedISys - Summary
• High coverage: helps monitor a large number of multilingual media reports.
• Includes tools to help beat the information overflow:
  • via clustering; duplicate detection; categorization; information aggregation; visualisation; mapping
  • further means are being implemented, e.g. multilingual medical event extraction
• Special features of MedISys:
  • Fully automatic (moderation possible)
  • Real time (10-minute updates), 24/7
  • High multilinguality (43 languages)
  • Multilingual information aggregation
• Part of EMM family of applications, active team: much new functionality to come.