Information Extraction (WP6)
Martin Labský
MedIEQ meeting, Helsinki, 24th October 2006
Agenda
• Initial criteria set
• Additional criteria
• Information extraction toolkit
• Extraction engines
• IET demo
• Next steps
Initial criteria set – viewed as classes
• Resource (1,2,5,6,7,8) { title, URL, last update, language, MESH topic, target audience }
• Author or Responsible (3,4) { name, address, phone, e-mail }
• 6. MESH keywords
• 9. virtual consultation
• 10. advertisement
• 11. seal
2 extractable classes identified
4 standalone attributes to be extracted
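The grouping above can be sketched as plain data classes. This is only an illustration: the class names, field names and the `ExtractionResult` container are hypothetical and are not taken from the toolkit itself; all values are stored as plain strings for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    """Criteria 1, 2, 5, 6, 7, 8 grouped into one extractable class."""
    title: Optional[str] = None
    url: Optional[str] = None
    last_update: Optional[str] = None
    language: Optional[str] = None
    mesh_topic: Optional[str] = None
    target_audience: Optional[str] = None


@dataclass
class AuthorOrResponsible:
    """Criteria 3, 4 grouped into one extractable class."""
    name: Optional[str] = None
    address: Optional[str] = None
    phone: Optional[str] = None
    email: Optional[str] = None


@dataclass
class ExtractionResult:
    """Hypothetical container: the 2 classes plus the 4 standalone attributes."""
    resource: Resource = field(default_factory=Resource)
    authors: List[AuthorOrResponsible] = field(default_factory=list)
    mesh_keywords: List[str] = field(default_factory=list)
    virtual_consultations: List[str] = field(default_factory=list)
    advertisements: List[str] = field(default_factory=list)
    seals: List[str] = field(default_factory=list)


# Illustrative values only; not real extraction output.
result = ExtractionResult()
result.resource.title = "Example health page"
result.authors.append(AuthorOrResponsible(name="Jane Doe", email="jane@example.org"))
result.mesh_keywords.append("Asthma")
```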
Additional criteria – described
• Information sources
  • references to literature (citations)
  • identified as a whole (no author, title etc. segmentation)
• Links to medical organisations
  • scientific orgs, self-help groups, related websites
  • name, contact info extracted as for Author/Responsible
• Sponsors
  • name, contact info extracted as for Author/Responsible
  • sponsor’s policy (free text) extracted in addition
• Content provider
  • name, contact info
  • provider’s profile (free text) typically from ‘about’ page
• Privacy Policy
  • textual description of what may be done with collected data
• Accessibility
  • identify violation of certain Web Accessibility Initiative criteria
Putting the criteria together
[Diagram of the combined data model; legend: initial criteria, additional criteria. Labels recovered from the figure:]
• Contact: name, address, phone, e-mail, www address
• Resource: title, URI, last update, language, MESH topic, target audience
• Author, Responsible, Content provider (profile), Sponsor (policy), Medical org.
• Other labels: MESH keyword, virt. con. segment, advertisement, seal, information source, language, privacy policy, accessibility warning
Information extraction toolkit - architecture
[Architecture diagram of the Information Extraction Toolkit. Labels recovered from the figure:]
• Component groups: user components, admin components
• Components: Annotation tool, Pre-processor, Data Model Manager, Integrator, Task Manager, Visualiser, UI
• IE Engines: IE Engine 0 (NER), IE Engine 1 (EXO), IE Engine 2 (ML), IE Engine 3 (STA)
• Data and external links: WP4 Labeling schemas; WP5 Repository of previously extracted items; WP7 Labeled corpora (type B); Expert’s domain and extraction knowledge; MUA; Annotated documents; Extracted attributes, instances; Documents with assigned n-best classes
Information extraction toolkit – document flow
[Flow diagram. Labels recovered from the figure:]
• classified document
• select extraction model based on document class
• Pre-processor
• IE Engine 0 (NER), IE Engine 1 (EXO), IE Engine 2 (ML), IE Engine 3 (STA)
• extract attributes, add them to document
• extract attributes, extract instances based on attributes, add them to document
• extracted attributes and instances
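One possible reading of this flow, as a minimal sketch: pick an extraction model from the document class, run attribute extractors, then build instances from the attributes found. Everything here is an assumption made for illustration: the regex-based extractors stand in for real engines, and `MODELS`, `run_flow` and the "first value per attribute" instance rule are hypothetical, not the toolkit's behaviour.

```python
import re
from typing import Callable, Dict, List, Tuple

Attribute = Tuple[str, str]  # (attribute name, extracted text)


def extract_emails(text: str) -> List[Attribute]:
    """Toy attribute extractor: finds e-mail-like strings."""
    return [("email", m) for m in re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)]


def extract_phones(text: str) -> List[Attribute]:
    """Toy attribute extractor: finds digit runs that look like phone numbers."""
    return [("phone", m) for m in re.findall(r"\+?\d[\d ]{6,}\d", text)]


# Hypothetical mapping: document class -> attribute extractors to run.
MODELS: Dict[str, List[Callable[[str], List[Attribute]]]] = {
    "contact_page": [extract_emails, extract_phones],
}


def run_flow(text: str, doc_class: str) -> Dict[str, list]:
    # Step 1: select the extraction model based on the document class.
    extractors = MODELS.get(doc_class, [])
    # Step 2: extract attributes and collect them.
    attributes: List[Attribute] = []
    for extractor in extractors:
        attributes.extend(extractor(text))
    # Step 3: build instances from attributes (naive rule: one instance,
    # keeping the first value seen for each attribute name).
    instance: Dict[str, str] = {}
    for name, value in attributes:
        instance.setdefault(name, value)
    instances = [instance] if instance else []
    return {"attributes": attributes, "instances": instances}


flow_result = run_flow("Contact: info@example.org, tel. +358 9 1234567", "contact_page")
```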
Extraction engines
• 3rd party (NER): LingPipe, Annie, BiOs, JET ...
  • extract attributes
  • state: tested by UNED
• ML extractor
  • extract attributes
  • state: developed at NCSR
• Statistical text extractor
  • needed to extract free text paragraphs of certain kind, e.g. “about company text”, “privacy policy description”
  • state: future work; TKK will be the owner
• Ex (extraction ontology) extractor
  • extract attributes
  • extract instances based on identified attributes
  • state: developed at UEP
Demo
• Information Extraction Toolkit
  • extraction task management
    • task = documents + ex.model + ex.engine
    • definition, load, save, run, monitor progress
  • can use any IE engine which implements the Engine interface
  • showing preliminary UI (to be replaced by AQUA)
• Ex (extraction ontologies)
  • contact information sample
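The idea of a task as documents + model + engine, with engines plugged in through a common interface, might look roughly like the sketch below. The slides do not show the actual Engine interface, so the `extract` method signature, the `Task` fields and the toy `KeywordEngine` are all hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


class Engine(ABC):
    """Hypothetical engine interface: any engine implementing it can run in a task."""

    @abstractmethod
    def extract(self, document: str, model: Dict[str, str]) -> List[dict]:
        """Return the attributes found in one document."""


class KeywordEngine(Engine):
    """Toy engine: the model maps attribute names to literal keywords."""

    def extract(self, document: str, model: Dict[str, str]) -> List[dict]:
        return [
            {"attribute": name, "value": keyword}
            for name, keyword in model.items()
            if keyword in document
        ]


@dataclass
class Task:
    """task = documents + extraction model + extraction engine."""
    documents: List[str]
    model: Dict[str, str]
    engine: Engine
    processed: int = 0
    results: List[List[dict]] = field(default_factory=list)

    def run(self) -> None:
        # Process documents one by one so progress can be monitored.
        for document in self.documents:
            self.results.append(self.engine.extract(document, self.model))
            self.processed += 1

    def progress(self) -> float:
        # Fraction of documents processed so far (1.0 for an empty task).
        return self.processed / len(self.documents) if self.documents else 1.0


task = Task(
    documents=["HON seal shown here", "nothing relevant"],
    model={"seal": "HON"},
    engine=KeywordEngine(),
)
task.run()
```

Because `Task` only depends on the abstract `Engine`, swapping in a different engine would not require changes to task management, which is the point the slide makes.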
Next steps
• Integration of more extraction engines into IET
• Integration of IET into AQUA
• Improve
  • precision and recall
  • efficiency
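For reference, precision and recall can be measured by comparing extracted attributes with a manually labelled gold set. The sketch below uses the standard definitions with exact matching on (attribute, value) pairs; the function name and the sample data are illustrative assumptions, not the project's evaluation setup.

```python
from typing import Set, Tuple

Attribute = Tuple[str, str]  # (attribute name, extracted text)


def precision_recall(extracted: Set[Attribute], gold: Set[Attribute]) -> Tuple[float, float]:
    """Precision = correct / extracted; recall = correct / gold (exact match)."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative data only.
extracted = {("email", "a@example.org"), ("phone", "123")}
gold = {("email", "a@example.org"), ("name", "Jane Doe")}
precision, recall = precision_recall(extracted, gold)
```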