Julius Information Extractor

Julius extracts information from text by building spelling and context rules and learning new rules iteratively. Additional features include fixed-length prefixes and suffixes, POS substitution in place of a full parse, and window bounds selection. It can also use web search snippets, and it includes GUI tools for labeling and for viewing statistics. It works well on small data sets but fails on larger corpora, and web context does not necessarily help because of noise.

Presentation Transcript


  1. Julius Information Extractor • June 14, 2006 • Kyle Woodward and Lee-Ming Zen

  2. The Problem • There is a lot of text and information out there, but not a whole lot of tagging. How can we extract information a user is interested in without knowing anything beforehand?

  3. Approach
     • Based upon AT&T system
       • Build up “spelling” and “context” rules
       • Iteratively learn new rules by labeling, examining the labels, and jumping from one set of rules to the other
     • Additional features
       • We used a fixed-length prefix and suffix to augment the context
       • Substituted POS tags for a full grammar parse of the context
       • Window bounds selection to determine tag size
     • Web
       • Use information from web search snippets
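
The slide does not spell out the learning loop. The following is a minimal sketch, assuming a co-training style alternation between the two rule sets: label with spelling rules, learn context rules from those labels, relabel with the context rules, then learn new spelling rules. Every name here (`learn_rules`, `apply_rules`, `cotrain`, the feature strings, the thresholds, the sample data) is hypothetical and is not taken from Julius or from the AT&T system.

```python
from collections import Counter, defaultdict


def learn_rules(examples, labels, view, min_count=2, min_precision=0.9):
    """Induce feature -> label rules for one view from the current labels."""
    counts = defaultdict(Counter)
    for idx, label in labels.items():
        for feat in examples[idx][view]:
            counts[feat][label] += 1
    rules = {}
    for feat, by_label in counts.items():
        label, n = by_label.most_common(1)[0]
        total = sum(by_label.values())
        # Keep only features that are frequent and nearly unambiguous.
        if n >= min_count and n / total >= min_precision:
            rules[feat] = label
    return rules


def apply_rules(examples, rules, view):
    """Label every example that at least one rule in this view fires on."""
    labels = {}
    for idx, example in enumerate(examples):
        votes = Counter(rules[f] for f in example[view] if f in rules)
        if votes:
            labels[idx] = votes.most_common(1)[0][0]
    return labels


def cotrain(examples, seed_spelling_rules, rounds=5):
    """Alternate between the spelling view and the context view."""
    spelling = dict(seed_spelling_rules)
    context = {}
    for _ in range(rounds):
        labels = apply_rules(examples, spelling, "spelling")
        context = learn_rules(examples, labels, "context")
        labels = apply_rules(examples, context, "context")
        learned = learn_rules(examples, labels, "spelling")
        updated = {**learned, **seed_spelling_rules}  # seeds always win
        if updated == spelling:  # nothing new was learned, so stop
            break
        spelling = updated
    return spelling, context


# Toy data: each candidate string has a spelling view and a context view.
examples = [
    {"spelling": {"full=Acme Inc.", "contains=Inc."}, "context": {"prev=works_at"}},
    {"spelling": {"full=Globex Inc.", "contains=Inc."}, "context": {"prev=works_at"}},
    {"spelling": {"full=Initech"}, "context": {"prev=works_at"}},
    {"spelling": {"full=Initech"}, "context": {"prev=works_at"}},
]
spelling_rules, context_rules = cotrain(examples, {"contains=Inc.": "ORG"})
```

In this toy run the single seed rule labels the two "Inc." strings, the shared context becomes a context rule, and that rule in turn labels "Initech", which yields a new spelling rule. No tagged training data is needed beyond the seed.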

  4. Rules
     • Rules are a set of features for a particular labeling with weights for each feature
       • e.g. allcap, contains, full-string, etc.
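
To make the idea of a rule as a set of weighted features concrete, here is a small sketch. The feature names allcap, contains, and full-string come from the slide. The `Rule` class, the weights, the labels, and scoring by summed weights are assumptions made for illustration only.

```python
from dataclasses import dataclass


def spelling_features(text):
    """Extract the spelling features named on the slide from a candidate string."""
    feats = {f"full-string={text}"}
    if text.isupper():
        feats.add("allcap")
    for token in text.split():
        feats.add(f"contains={token}")
    return feats


@dataclass
class Rule:
    label: str
    weights: dict  # feature name -> weight (hypothetical values)

    def score(self, feats):
        """Sum the weights of this rule's features that are present."""
        return sum(w for f, w in self.weights.items() if f in feats)


def classify(text, rules):
    """Return the label of the best-scoring rule, or None if no feature fires."""
    feats = spelling_features(text)
    best = max(rules, key=lambda rule: rule.score(feats))
    return best.label if best.score(feats) > 0 else None


rules = [
    Rule("ORG", {"allcap": 0.6, "contains=Inc.": 0.9}),
    Rule("PERSON", {"contains=Mr.": 0.95}),
]
```

With these example rules, `classify("IBM", rules)` picks the ORG rule through the allcap feature, and `classify("Mr. Smith", rules)` picks the PERSON rule through a contains feature.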

  5. What’s Cool
     • Generality
       • No restrictions on the type of data it runs against
       • No preconceived notions about the domain
     • GUI tools
       • Labeler
       • Statistics viewer
     • Works
       • Works well on small data sets

  6. What’s Not
     • Fails on larger corpora
     • Generality tradeoff means not being able to exploit certain domain-specific information
     • Web context does not necessarily help due to noise
