Information Extraction (WP6)

Martin Labský, MedIEQ meeting, Helsinki, 24th October 2006

Agenda
• Initial criteria set
• Additional criteria
• Information extraction toolkit
• Extraction engines
• IET demo
• Next steps

Initial criteria set – viewed as classes
• Resource (1, 2, 5, 6, 7, 8) { title, URL, last update, language, MESH topic, target audience }
• Author or Responsible (3, 4) { name, address, phone, e-mail }
• 6. MESH keywords
• 9. virtual consultation
• 10. advertisement
• 11. seal
2 extractable classes identified; 4 standalone attributes to be extracted.
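The grouping above can be sketched as a small data model. This is only an illustration, not the project's actual schema: the Python class and field names (`Resource`, `AuthorOrResponsible`, `ExtractionResult`) are assumptions derived from the slide labels.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    """Criteria 1, 2, 5, 6, 7, 8 grouped as one extractable class."""
    title: Optional[str] = None
    url: Optional[str] = None
    last_update: Optional[str] = None
    language: Optional[str] = None
    mesh_topic: Optional[str] = None
    target_audience: Optional[str] = None


@dataclass
class AuthorOrResponsible:
    """Criteria 3, 4 grouped as one extractable class."""
    name: Optional[str] = None
    address: Optional[str] = None
    phone: Optional[str] = None
    email: Optional[str] = None


@dataclass
class ExtractionResult:
    """Hypothetical container: the two classes plus the four standalone attributes."""
    resource: Resource = field(default_factory=Resource)
    contacts: List[AuthorOrResponsible] = field(default_factory=list)
    mesh_keywords: List[str] = field(default_factory=list)
    virtual_consultation: List[str] = field(default_factory=list)
    advertisement: List[str] = field(default_factory=list)
    seal: List[str] = field(default_factory=list)
```

Every field is optional or defaults to an empty list, since a given page may expose only some of the criteria.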

Additional criteria – described
• Information sources
  • references to literature (citations)
  • identified as a whole (no author, title etc. segmentation)
• Links to medical organisations
  • scientific orgs, self-help groups, related websites
  • name, contact info extracted as for Author/Responsible
• Sponsors
  • name, contact info extracted as for Author/Responsible
  • sponsor’s policy (free text) extracted in addition
• Content provider
  • name, contact info
  • provider’s profile (free text) typically from ‘about’ page
• Privacy Policy
  • textual description of what may be done with collected data
• Accessibility
  • identify violation of certain Web Accessibility Initiative criteria

Putting the criteria together
[Class diagram; labels recovered from the slide]
• Contact: name, address, phone, e-mail, www address
• Resource: title, URI, last update, language, MESH topic, target audience
• Further labels: MESH keyword, virt. con. segment, advertisement, seal, information source, language, privacy policy, accessibility warning
• Author, Responsible, Content provider (profile), Sponsor (policy), Medical org.
• Legend: initial criteria, additional criteria

Information extraction toolkit – architecture
[Architecture diagram; labels recovered from the slide]
• Components (marked as user components or admin components): Pre-processor, IE Engines, Integrator, Task Manager, Data Model Manager, Annotation tool, Visualiser, UI
• IE Engines: IE Engine 0 (NER), IE Engine 1 (EXO), IE Engine 2 (ML), IE Engine 3 (STA)
• Data and knowledge: WP4 labeling schemas; WP7 labeled corpora (type B); expert’s domain and extraction knowledge; documents with assigned n-best classes; annotated documents; extracted attributes, instances; WP5 repository of previously extracted items
• Other label: MUA

Information extraction toolkit – document flow
[Flow diagram; labels recovered from the slide]
• Steps: classified document; select extraction model based on document class; Pre-processor; IE engines; extracted attributes and instances
• Engines: IE Engine 0 (NER), IE Engine 1 (EXO), IE Engine 2 (ML), IE Engine 3 (STA)
• Engine captions: “extract attributes, add them to document”; “extract attributes, extract instances based on attributes, add them to document”
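The document flow can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the toolkit's code: the `Document` type, the `MODELS` registry, the model file names and both toy engines are hypothetical, and the pre-processing step is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Document:
    text: str
    doc_class: str  # class assigned by the upstream classifier
    attributes: List[Tuple[str, str]] = field(default_factory=list)  # (name, value)
    instances: List[Dict[str, str]] = field(default_factory=list)


# Hypothetical registry mapping a document class to an extraction model.
MODELS: Dict[str, str] = {"contact_page": "contact.model", "about_page": "about.model"}

# An engine reads the document and adds its findings to it.
EngineFn = Callable[[Document, str], None]


def select_model(doc: Document, default: str = "generic.model") -> str:
    """Select the extraction model based on the document class."""
    return MODELS.get(doc.doc_class, default)


def toy_attribute_engine(doc: Document, model: str) -> None:
    """Stand-in attribute extractor: treats tokens containing '@' as e-mails."""
    for token in doc.text.split():
        if "@" in token:
            doc.attributes.append(("e-mail", token.strip(".,;")))


def toy_instance_engine(doc: Document, model: str) -> None:
    """Stand-in instance extractor: groups all found attributes into one instance."""
    if doc.attributes:
        doc.instances.append(dict(doc.attributes))


def run_flow(doc: Document, engines: List[EngineFn]) -> Document:
    """Pick a model for the document, then let each engine annotate it in turn."""
    model = select_model(doc)
    for engine in engines:
        engine(doc, model)
    return doc
```

For example, `run_flow(Document("Write to info@example.org.", "contact_page"), [toy_attribute_engine, toy_instance_engine])` leaves one e-mail attribute and one instance on the document.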

Extraction engines
• 3rd party (NER): LingPipe, Annie, BiOs, JET ...
  • extract attributes
  • state: tested by UNED
• ML extractor
  • extract attributes
  • state: developed at NCSR
• Statistical text extractor
  • needed to extract free-text paragraphs of a certain kind, e.g. “about company text”, “privacy policy description”
  • state: future work; TKK will be the owner
• Ex (extraction ontology) extractor
  • extract attributes
  • extract instances based on identified attributes
  • state: developed at UEP

Demo
• Information Extraction Toolkit
  • extraction task management
  • task = documents + ex.model + ex.engine
  • definition, load, save, run, monitor progress
  • can use any IE engine which implements the Engine interface
  • showing preliminary UI (to be replaced by AQUA)
• Ex (extraction ontologies)
  • contact information sample
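The idea of a task as documents + extraction model + extraction engine, with any engine pluggable through a common interface, might look like the sketch below. The slides do not show the real Engine interface, so the method name `extract`, the `Task` fields and the toy engine are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


class Engine(ABC):
    """Hypothetical stand-in for the toolkit's Engine interface."""

    @abstractmethod
    def extract(self, document: str, model: str) -> List[str]:
        """Return the items extracted from one document."""


class CapitalisedWordEngine(Engine):
    """Toy engine: returns every word that starts with an upper-case letter."""

    def extract(self, document: str, model: str) -> List[str]:
        return [word for word in document.split() if word[:1].isupper()]


@dataclass
class Task:
    """task = documents + ex.model + ex.engine"""
    documents: List[str]
    model: str
    engine: Engine
    results: List[List[str]] = field(default_factory=list)

    def progress(self) -> float:
        """Fraction of documents processed so far (1.0 for an empty task)."""
        if not self.documents:
            return 1.0
        return len(self.results) / len(self.documents)

    def run(self) -> None:
        """Run the engine over all documents, one result list per document."""
        self.results = []
        for document in self.documents:
            self.results.append(self.engine.extract(document, self.model))
```

Because `Task` depends only on the abstract `Engine`, swapping in another engine requires no change to the task code.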

Next steps
• Integration of more extraction engines into IET
• Integration of IET into AQUA
• Improve
  • precision and recall
  • efficiency
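For reference, precision and recall of an extractor can be computed against a hand-labeled gold standard as sketched below. Exact matching of (attribute, value) pairs is an assumption made here for simplicity; the slides do not say how the project scores matches.

```python
from typing import Set, Tuple

Item = Tuple[str, str]  # (attribute name, extracted value)


def precision_recall(extracted: Set[Item], gold: Set[Item]) -> Tuple[float, float]:
    """Exact-match precision and recall of extracted items against gold items."""
    true_positives = len(extracted & gold)
    # Precision: share of extracted items that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold items that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

With three extracted items of which two are correct, and four gold items, this gives a precision of 2/3 and a recall of 1/2.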
