Julius Information Extractor

June 14, 2006
Kyle Woodward, Lee-Ming Zen

The Problem
• There is a lot of text and information out there, but not a whole lot of tagging. How can we extract information a user is interested in without knowing anything beforehand?

Approach
• Based upon AT&T system
• Build up “spelling” and “context” rules
• Iteratively learn new rules: label with one set of rules, examine the labels, then jump to the other set of rules
• Additional features
  • We used a fixed-length prefix and suffix to augment the context
  • Substituted POS tags for a full grammar parse as context
  • Window bounds selection to determine tag size
• Web
  • Use information from web search snippets
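The alternation between the two rule sets can be sketched roughly as below. This is a minimal illustration, not the Julius implementation: the function names (`label_with`, `induce`, `cotrain`), the feature strings such as `prev=said`, and the precision and count thresholds are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def label_with(rules, examples, view):
    """Label each example whose features (in the given view) match a rule."""
    labels = {}
    for i, ex in enumerate(examples):
        votes = Counter(rules[f] for f in ex[view] if f in rules)
        if votes:
            labels[i] = votes.most_common(1)[0][0]
    return labels

def induce(labels, examples, view, min_precision=0.95, min_count=2):
    """Turn features that reliably co-occur with one label into new rules.

    The thresholds are illustrative assumptions, not values from the slides.
    """
    counts = defaultdict(Counter)
    for i, label in labels.items():
        for f in examples[i][view]:
            counts[f][label] += 1
    rules = {}
    for f, by_label in counts.items():
        label, n = by_label.most_common(1)[0]
        if n >= min_count and n / sum(by_label.values()) >= min_precision:
            rules[f] = label
    return rules

def cotrain(examples, seed_spelling, rounds=5):
    """Alternate: spelling rules label data, which yields context rules,
    which label data again, which yields more spelling rules."""
    spelling = dict(seed_spelling)
    context = {}
    for _ in range(rounds):
        labels = label_with(spelling, examples, "spelling")
        context.update(induce(labels, examples, "context"))
        labels = label_with(context, examples, "context")
        for f, label in induce(labels, examples, "spelling").items():
            spelling.setdefault(f, label)  # never overwrite a seed rule
    return spelling, context
```

Each example here is a dict with a `"spelling"` set and a `"context"` set of feature strings; starting from a handful of seed spelling rules, each pass lets one view propose labels from which the other view learns.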

Rules
• Rules are a set of features for a particular labeling with weights for each feature
  • e.g. allcap, contains, full-string, etc.
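As a rough illustration of a rule as a weighted feature set, the sketch below extracts the three spelling features named on the slide and scores a candidate string against per-label weights. The function names, the exact feature encoding (`contains=<token>`, `full-string=<text>`), and the "sum of weights, highest label wins" scoring are assumptions for the example; the slides do not say how Julius combines weights.

```python
def spelling_features(span):
    """Extract allcap, contains, and full-string features from a candidate."""
    feats = {"full-string=" + span}
    if span.isupper():
        feats.add("allcap")
    for token in span.split():
        feats.add("contains=" + token)
    return feats

def score(features, weights):
    """Sum the weights of the features that fire (assumed scoring scheme)."""
    return sum(weights.get(f, 0.0) for f in features)

def classify(span, rules):
    """Return the best-scoring label, or None if no feature fires.

    `rules` maps each label to a dict of feature -> weight.
    """
    feats = spelling_features(span)
    best = max(rules, key=lambda label: score(feats, rules[label]))
    return best if score(feats, rules[best]) > 0 else None
```

With hypothetical rules such as `{"PERSON": {"contains=Mr.": 0.9}, "ORG": {"allcap": 0.6}}`, the string `Mr. Smith` would be labeled `PERSON` and `IBM` would be labeled `ORG`.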

What’s Cool
• Generality
  • No restrictions on the type of data it runs against
  • No preassumed notions about the domain
• GUI tools
  • Labeler
  • Statistics viewer
• Works
  • Works well on small data sets

What’s Not
• Fails at larger corpora
• Generality tradeoff means not being able to exploit certain information
• Web context does not necessarily help due to noise
