
Information Extraction (WP6)

Martin Labský, MedIEQ meeting, Helsinki, 24th October 2006


Presentation Transcript


  1. Information Extraction (WP6)
     Martin Labský
     MedIEQ meeting, Helsinki, 24th October 2006

  2. Agenda
     • Initial criteria set
     • Additional criteria
     • Information extraction toolkit
     • Extraction engines
     • IET demo
     • Next steps

  3. Initial criteria set – viewed as classes
     • Resource (1,2,5,6,7,8) { title, URL, last update, language, MESH topic, target audience }
     • Author or Responsible (3,4) { name, address, phone, e-mail }
     • 6. MESH keywords
     • 9. virtual consultation
     • 10. advertisement
     • 11. seal
     2 extractable classes identified
     4 standalone attributes to be extracted
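
As a rough illustration of slide 3, the two extractable classes and the four standalone attributes could be held in plain records. This is only a sketch: the Python class names, field names, and field types below are assumptions made for illustration and are not taken from the MedIEQ data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    """Extractable class built from criteria 1, 2, 5, 6, 7, 8 (hypothetical layout)."""
    title: Optional[str] = None
    url: Optional[str] = None
    last_update: Optional[str] = None
    language: Optional[str] = None
    mesh_topic: Optional[str] = None
    target_audience: Optional[str] = None


@dataclass
class Contact:
    """Extractable class for Author or Responsible (criteria 3, 4)."""
    name: Optional[str] = None
    address: Optional[str] = None
    phone: Optional[str] = None
    email: Optional[str] = None


@dataclass
class PageExtraction:
    """Everything extracted from one page: class instances plus standalone attributes."""
    resource: Resource = field(default_factory=Resource)
    contacts: List[Contact] = field(default_factory=list)
    # The four standalone attributes listed on the slide.
    mesh_keywords: List[str] = field(default_factory=list)
    virtual_consultation: List[str] = field(default_factory=list)
    advertisement: List[str] = field(default_factory=list)
    seal: List[str] = field(default_factory=list)


page = PageExtraction(resource=Resource(title="Example health page", language="en"))
page.contacts.append(Contact(name="Jane Doe", email="jane@example.org"))
page.mesh_keywords.append("Diabetes Mellitus")
```

Standalone attributes are kept as lists here because a page may carry several occurrences of each; that choice is also an assumption.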

  4. Additional criteria – described
     • Information sources
         • references to literature (citations)
         • identified as a whole (no author, title etc. segmentation)
     • Links to medical organisations
         • scientific orgs, self-help groups, related websites
         • name, contact info extracted as for Author/Responsible
     • Sponsors
         • name, contact info extracted as for Author/Responsible
         • sponsor's policy (free text) extracted in addition
     • Content provider
         • name, contact info
         • provider's profile (free text), typically from the 'about' page
     • Privacy Policy
         • textual description of what may be done with collected data
     • Accessibility
         • identify violations of certain Web Accessibility Initiative criteria

  5. Putting the criteria together
     [Diagram; the grouping below is inferred from the order of the labels]
     • Contact: name, address, phone, e-mail, www address
     • Resource: title, URI, last update, language, MESH topic, target audience
     • Standalone attributes: MESH keyword, virt. con. segment, advertisement, seal, information source, privacy policy, accessibility warning
     • Contact types: Author, Responsible, Content provider (profile), Sponsor (policy), Medical org.
     • Legend: initial criteria, additional criteria

  6. Information extraction toolkit – architecture
     [Architecture diagram: INFORMATION EXTRACTION TOOLKIT]
     • Components (user components, admin components): Annotation tool, Pre-processor, Data Model Manager, Integrator, Task Manager, Visualiser, UI
     • IE Engines: IE Engine 0 (NER), IE Engine 1 (EXO), IE Engine 2 (ML), IE Engine 3 (STA)
     • Inputs: Labeling schemas (WP4), Labeled corpora (type B) (WP7), Expert's domain and extraction knowledge, Documents with assigned n-best classes
     • Outputs: Annotated documents; Extracted attributes, instances
     • Other labels: Repository of previously extracted items (WP5), MUA

  7. Information extraction toolkit – document flow
     [Flow diagram]
     • Input: classified document
     • Pre-processor
     • select extraction model based on document class
     • IE Engine 0 (NER), IE Engine 1 (EXO), IE Engine 2 (ML), IE Engine 3 (STA)
         • extract attributes, add them to document
         • extract attributes, extract instances based on attributes, add them to document
     • Output: extracted attributes and instances
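
The flow on slide 7 (pick an extraction model from the document class, let engines add attributes, then build instances from those attributes) can be sketched roughly as below. Everything here is hypothetical: the dictionary-based document, the `MODELS` table, the regular expressions, and both toy engines are stand-ins chosen for illustration, not the IET implementation.

```python
import re
from typing import Callable, Dict, List, Tuple


def attribute_engine(doc: dict, model: dict) -> None:
    """Toy attribute extractor: one regular expression per attribute name."""
    for name, pattern in model["patterns"].items():
        for match in re.finditer(pattern, doc["text"]):
            doc["attributes"].append((name, match.group(0)))


def instance_engine(doc: dict, model: dict) -> None:
    """Toy instance extractor: groups already extracted attributes into one instance."""
    wanted = model["instance_attributes"]
    instance = {name: value for name, value in doc["attributes"] if name in wanted}
    if instance:
        doc["instances"].append(instance)


def run_flow(doc: dict, models: Dict[str, dict],
             engines: List[Callable[[dict, dict], None]]) -> Tuple[list, list]:
    """Select the model by document class, then let each engine annotate the document."""
    model = models[doc["doc_class"]]
    for engine in engines:
        engine(doc, model)
    return doc["attributes"], doc["instances"]


# Hypothetical extraction model for one document class.
MODELS = {
    "contact_page": {
        "patterns": {
            "email": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
            "phone": r"\+?\d[\d -]{6,}\d",
        },
        "instance_attributes": {"email", "phone"},
    }
}

doc = {
    "doc_class": "contact_page",
    "text": "Contact: info@example.org, phone +358 9 1234567",
    "attributes": [],
    "instances": [],
}
attributes, instances = run_flow(doc, MODELS, [attribute_engine, instance_engine])
```

The engines run in sequence on the same document, so the instance step only sees attributes that an earlier engine has already added.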

  8. Extraction engines
     • 3rd party (NER): LingPipe, Annie, BiOs, JET ...
         • extract attributes
         • state: tested by UNED
     • ML extractor
         • extract attributes
         • state: developed at NCSR
     • Statistical text extractor
         • needed to extract free-text paragraphs of a certain kind, e.g. "about company text", "privacy policy description"
         • state: future work; TKK will be the owner
     • Ex (extraction ontology) extractor
         • extract attributes
         • extract instances based on identified attributes
         • state: developed at UEP

  9. Demo
     • Information Extraction Toolkit
         • extraction task management
         • task = documents + ex.model + ex.engine
         • definition, load, save, run, monitor progress
         • can use any IE engine which implements the Engine interface
         • showing preliminary UI (to be replaced by AQUA)
     • Ex (extraction ontologies)
         • contact information sample
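
The idea that a task bundles documents, an extraction model, and an engine, and that any engine implementing a common interface can be plugged in, might look roughly like this. The slides do not show the actual Engine interface, so the method name `extract`, the `Task` fields, and the toy `KeywordEngine` are all assumptions for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


class Engine(ABC):
    """Hypothetical engine interface: one method that extracts from one document."""

    @abstractmethod
    def extract(self, document: str, model: Dict[str, str]) -> List[Tuple[str, str]]:
        ...


class KeywordEngine(Engine):
    """Toy engine: reports a label whenever its keyword occurs in the document."""

    def extract(self, document: str, model: Dict[str, str]) -> List[Tuple[str, str]]:
        text = document.lower()
        return [(label, kw) for label, kw in model.items() if kw.lower() in text]


@dataclass
class Task:
    """task = documents + ex.model + ex.engine"""
    documents: List[str]
    model: Dict[str, str]
    engine: Engine
    results: List[List[Tuple[str, str]]] = field(default_factory=list)

    def progress(self) -> float:
        """Fraction of documents processed so far."""
        if not self.documents:
            return 1.0
        return len(self.results) / len(self.documents)

    def run(self) -> None:
        """Process documents one by one so progress() can be polled meanwhile."""
        self.results = []
        for document in self.documents:
            self.results.append(self.engine.extract(document, self.model))


task = Task(
    documents=["Call our sponsor today", "nothing here"],
    model={"sponsor": "Sponsor"},
    engine=KeywordEngine(),
)
task.run()
```

Swapping in another engine only requires a different `Engine` subclass; the task code stays the same. Loading and saving tasks are left out of this sketch.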

  10. Next steps
     • Integration of more extraction engines into IET
     • Integration of IET into AQUA
     • Improve
         • precision and recall
         • efficiency
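
For reference, precision and recall of an extractor are usually computed by comparing the extracted items with a manually labelled gold set. The sketch below uses exact matching of (attribute, value) pairs, which is an assumption; the slides do not say which matching scheme the project uses, and the sample data is made up.

```python
from typing import Set, Tuple

Item = Tuple[str, str]  # (attribute name, extracted value)


def precision_recall(extracted: Set[Item], gold: Set[Item]) -> Tuple[float, float]:
    """Exact-match precision and recall of extracted items against a gold set."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


gold = {("name", "Jane Doe"), ("email", "jane@example.org"), ("phone", "+358 9 1234567")}
extracted = {
    ("name", "Jane Doe"),
    ("email", "jane@example.org"),
    ("name", "Helsinki"),          # spurious extraction
    ("phone", "1234567"),          # partial value, counted as wrong under exact match
}
precision, recall = precision_recall(extracted, gold)
```

Here two of four extracted items are correct and two of three gold items are found.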
