Introduction to MedIEQ Quality Labelling of Medical Web content using Multilingual Information Extraction http://zeus.iit.demokritos.gr/medieq Martin Labský labsky@vse.cz Knowledge Engineering Group (KEG) University of Economics Prague (UEP) WP6 – Information Extraction
Purpose of MedIEQ • Medical web sites are increasingly popular • Content strongly affects users’ decisions • Therefore, quality labeling is very important • Agencies invest considerable effort into labeling websites manually • We develop tools to minimize their effort • Tools will be multilingual and will support different and evolving labeling criteria
Agenda • Partners • Description of relevant work packages [3] • Web content collection, Information Extraction, Lexical and semantic resources • Goals, tasks, partners • Existing tools (to be extended) • New tools (to be developed) • Existing resources (to be made accessible) • Milestones & deliverables • References • Questions
Partners • Agencies • WMA: Web Mèdica Acreditada (Es) • assigns a quality label that is shown on medical websites • websites apply for the label, receive suggested changes, then obtain it • AQuMed: Agency for Quality Labeling in Medicine (De) • maintains a web directory organized by topics • only good-quality websites are listed • Developers • NCSR Demokritos and I-Sieve (spin-off) (Gr) • UEP: University of Economics Prague (Cz) • UNED: National University of Distance Education (Es) • HUT: Helsinki University of Technology (Fi)
Web Content Collection (WP5)
Website monitoring • Regular visits to labeled websites • Checking pages • for relevant changes • which changes are relevant? • manual rules, machine learning... • alert the agency when significant changes occur • or increase the website’s (web page’s) priority in a list of to-be-checked resources • show what has changed, suggest a solution • Needed by WMA, AQuMed
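The "checking pages for relevant changes" step could be sketched as follows. This is a minimal illustration, not the project's monitoring tool: it diffs the visible text of two snapshots with Python's standard `difflib` and applies one hypothetical manual rule. The term list `RELEVANT_TERMS` and all function names are assumptions made for the example.

```python
import difflib
import re


def visible_text(html):
    """Crudely strip tags and return the non-empty text lines of a page."""
    text = re.sub(r"<[^>]+>", "\n", html)
    return [line.strip() for line in text.splitlines() if line.strip()]


def changed_lines(old_html, new_html):
    """Return lines removed ("- ") or added ("+ ") between two snapshots."""
    diff = difflib.ndiff(visible_text(old_html), visible_text(new_html))
    return [d for d in diff if d.startswith(("+ ", "- "))]


# Hypothetical manual rule: a change is relevant if it mentions one of
# these labeling-related terms. A learned classifier could replace it.
RELEVANT_TERMS = ("sponsor", "privacy", "contact", "funding")


def is_relevant(change):
    return any(term in change.lower() for term in RELEVANT_TERMS)


def relevant_changes(old_html, new_html):
    """Changed lines that the manual rule flags for the agency."""
    return [c for c in changed_lines(old_html, new_html) if is_relevant(c)]
```

A non-empty result from `relevant_changes` could trigger an alert or raise the page's priority in the to-be-checked list, and the returned lines already show what has changed.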
Web focused crawling • Find new medical websites • Use multiple existing search engines • specify lists of keywords / keyphrases • give sample “similar” documents • use the Google/Yahoo APIs and filter their results • NCSR already has a focused crawler • we should contribute to its development • Needed by WMA
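Filtering search engine results by keyphrases might look like the sketch below. It does not call any real search API: the `SearchHit` record, the scoring rule and the threshold are all assumptions made for illustration, and this is not the NCSR crawler's actual method.

```python
from dataclasses import dataclass


@dataclass
class SearchHit:
    """Hypothetical record for one result returned by a search engine."""
    url: str
    title: str
    snippet: str


def keyphrase_score(hit, keyphrases):
    """Fraction of the keyphrases found in the hit's title or snippet."""
    if not keyphrases:
        return 0.0
    text = f"{hit.title} {hit.snippet}".lower()
    matched = sum(1 for phrase in keyphrases if phrase.lower() in text)
    return matched / len(keyphrases)


def filter_hits(hits, keyphrases, threshold=0.5):
    """Merge results from several engines: drop duplicate URLs and
    keep only hits whose keyphrase score reaches the threshold."""
    seen = set()
    kept = []
    for hit in hits:
        if hit.url in seen:
            continue
        seen.add(hit.url)
        if keyphrase_score(hit, keyphrases) >= threshold:
            kept.append(hit)
    return kept
```

Pooling the hits from several engines before calling `filter_hits` gives the "multiple existing search engines" behaviour, since duplicate URLs are dropped on the way.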
Website spidering • Walk pages of a single website • Classify each page • in order to choose relevant docs for quality labeling • e.g. contact page, page containing treatment description, page with sponsors • use machine learning, e.g. based on a bag-of-words (unigram, bigram) document representation • Spidering strategy • which documents belong together (e.g. page 1/7) • which links to follow next • NCSR has a spider • uses classifiers from Weka for doc classification • we should contribute
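As a rough illustration of page classification over a bag-of-words (unigram, bigram) representation, here is a small multinomial Naive Bayes classifier with add-one smoothing in plain Python. The NCSR spider uses Weka classifiers; this sketch is only a stand-in, and the class name, feature naming and training examples are assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def features(text):
    """Unigram and bigram features of a page's text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class NaiveBayesPageClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()                 # documents per class
        self.feature_counts = defaultdict(Counter)  # feature counts per class
        self.vocabulary = set()

    def train(self, labeled_pages):
        for text, label in labeled_pages:
            self.class_docs[label] += 1
            for feature in features(text):
                self.feature_counts[label][feature] += 1
                self.vocabulary.add(feature)

    def classify(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.feature_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)
            for feature in features(text):
                if feature in self.vocabulary:  # ignore unseen features
                    score += math.log((counts[feature] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on a few labeled pages per class (contact, treatment description, sponsors), `classify` returns the most probable page type, which the spider could use to decide which documents to pass on for quality labeling.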
Information Extraction (WP6)
IE introduction • Documents to extract from • pages retrieved & classified by spider • from known websites • from crawler • monitored labeled pages that have changed • Information to be extracted • derived from agencies’ labeling criteria • e.g. contact information of responsible persons, sponsor names, privacy warning texts... • Questions • how much human intervention is needed? • how complex are the label sets to be supported? • what is the methodology for porting to a new language?
Example extracted information I. • Transparency and honesty • site provider (company name, contact) • site purpose, type of target audience • funding (grants, sponsors) • Authority • source citation for information provided, its type and date • names and credentials of all information providers • Privacy and data protection • privacy policy description • Timeliness of information • dates of publication/modification • Accountability • names (and roles) of people responsible for presented information • editorial policy description
Example extracted information II. • Content • medical terms, e.g. disease and drug names • statements recommending a certain product/method • advertisements • disallowed combinations (e.g. advertisement for X adjacent to an article related to X) • Formal • mandatory statements (e.g. importance of physical examination, privacy warnings when posting data into chats)
Sources of extraction knowledge • Training data • scarcity will be a problem for most extracted attributes • different types: labeled documents, sample extracted data, data previously extracted from the same website, domain dictionaries • Extraction patterns • induced (semi)automatically from scarce training data • or even authored manually • Background domain knowledge • relations between extracted attributes, cardinalities ... • e.g. typically just one company is the web site’s provider, but there are often multiple sponsors • Web site structure • exploit common formatting of a group of documents within a website • exploit common formatting used for a particular type of extracted data across different websites
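The cardinality example above (one provider, possibly many sponsors) could be encoded as background knowledge and checked against extraction output roughly as follows. The attribute names, the `Cardinality` type and the constraint table are hypothetical; this is not how any of the project's tools represent such knowledge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cardinality:
    minimum: int
    maximum: Optional[int]  # None means "no upper bound"


# Hypothetical constraints mirroring the slide's example:
# exactly one site provider, any number of sponsors.
CONSTRAINTS = {
    "provider": Cardinality(1, 1),
    "sponsor": Cardinality(0, None),
}


def violations(extracted):
    """List the attributes whose number of extracted values breaks
    a cardinality constraint. `extracted` maps attribute -> values."""
    problems = []
    for attribute, card in CONSTRAINTS.items():
        count = len(extracted.get(attribute, []))
        if count < card.minimum:
            problems.append(f"{attribute}: expected at least {card.minimum}, got {count}")
        elif card.maximum is not None and count > card.maximum:
            problems.append(f"{attribute}: expected at most {card.maximum}, got {count}")
    return problems
```

A violation does not have to reject the output outright; it could instead lower the confidence of the competing candidates or flag the page for manual review.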
IE tools • Ex (UEP) • IE system under development using “extraction ontologies” • extracts instances from semi-structured documents • utilizes training data + manually defined patterns, includes spider • old version based on HMMs – http://eso.vse.cz/~labsky/client/ • Named entity recognizer (UNED) • extracts dates, person/institution names • 3rd party IE tools • wrapper management systems • e.g. LP2-based IE tool or annotation editor from Sheffield
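To give a feel for what a manually defined pattern can do, the sketch below extracts publication and modification dates with one hand-written regular expression. It is not the pattern language of Ex or the output format of the UNED recognizer; the pattern, the label wording it expects and the function name are assumptions.

```python
import re

# Hand-written pattern (an assumption for illustration): a cue phrase such as
# "Last updated" or "Published on", followed by an ISO date or "2 January 2005".
DATE_PATTERN = re.compile(
    r"(?P<label>last (?:updated|modified)|published)(?: on)?[:\s]+"
    r"(?P<date>\d{4}-\d{2}-\d{2}|\d{1,2} [A-Za-z]+ \d{4})",
    re.IGNORECASE,
)


def extract_dates(text):
    """Return (label, date) pairs for every cue phrase found in the text."""
    return [
        (match.group("label").lower(), match.group("date"))
        for match in DATE_PATTERN.finditer(text)
    ]
```

Patterns like this are cheap to write when training data is scarce, but they only cover the phrasings their author anticipated.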
Website assessment • Check website’s technical correctness • SEO (findability in search engines with respect to some keyphrases) • accessibility (possibility of font enlargement, blind access, pages hidden deep in website structure, color schemes perceivable by anybody) • formal correctness (dead links, violations of HTML standards, failure to display well under at least the 3 most popular browsers) • Check non-technical correctness • e.g. typos, “clear, easy-to-understand language” • more: check for black-listed phrases, claims, etc.
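Two of these checks, dead links and black-listed phrases, could be sketched with Python's standard `html.parser`. The reachability test is passed in as a function so the example stays offline; that callable, the blacklist contents and the function names are assumptions, and this is not the Relaxed validator or any existing project tool.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects link targets and text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Note: this also collects text inside <script> and <style>.
        self.text_parts.append(data)


def assess_page(html, is_reachable, blacklist):
    """Report unreachable links and black-listed phrases on a page.

    `is_reachable` is a caller-supplied function (hypothetical here)
    that returns True when a link target can be fetched.
    """
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    text = " ".join(" ".join(scanner.text_parts).split()).lower()
    return {
        "dead_links": [link for link in scanner.links if not is_reachable(link)],
        "blacklisted": [phrase for phrase in blacklist if phrase.lower() in text],
    }
```

In practice `is_reachable` would issue an HTTP request and check the status code; keeping it as a parameter lets the same function run against stored snapshots.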
Website assessment tools • Relaxed (UEP) • HTML validator based on Relax NG and Schematron patterns • can perform formal checks of website content beyond DTDs • http://relaxed.sourceforge.net/ • SEO tool (UEP) • could Honza’s SEO tool be extended?
IE Deliverables • Duration: M1-M28 • Deliverables • D8: Methodology & architecture of IE (M9) • D9.1: First version of IE toolkit (M15) • D9.2: Final version of IE toolkit (M24)
Lexical and semantic resources (WP7)
Lexical and semantic resources • Sp, De, En, Cz, Gr, Fi, Catalan (7!) • We are in charge of Cz, De(!) • Semantic • thesauri, ontologies (MeSH) • lists of cures, vaccine names, lists of medical companies, illnesses, diagnoses • generic ontologies and translation dictionaries (e.g. EuroWordNet) • Lexical • lemmatizers/morphological analyzers, part-of-speech taggers, chunkers, syntactic parsers • medical document collections (for classification)
References • MedIEQ: • http://www.iit.demokritos.gr/~vangelis/MedIEQ/ • http://zeus.iit.demokritos.gr/medieq • Related projects: • WRAPIN http://debussy.hon.ch/cgi-bin/Wrapin/ClientWrapin.pl • Quatro http://www.quatro-project.org/DC2005.htm • CROSSMARC http://www.iit.demokritos.gr/skel/crossmarc/ • Relaxed: • http://badame.vse.cz/validator/ • Ex: • http://eso.vse.cz/~labsky/doc/ex.pdf • Ellogon: • http://www.ellogon.org/
Questions • ?