Information Extraction: Introduction. Sunita Sarawagi
Definition “Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources.” • Enables richer forms of queries • Facilitates source integration and queries spanning sources
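To make the definition concrete, here is a minimal rule-based sketch of extracting structured records (an entity pair and a relationship) from unstructured text. The pattern, field names, and example sentences are illustrative assumptions, not taken from the survey.

```python
import re

# Hypothetical pattern: pulls (person, organization) "works-at" pairs
# from sentences of a fixed lexical form. Real extractors use far
# richer rules or statistical models.
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) works at (?P<org>[A-Z][A-Za-z]+)"
)

def extract(text):
    """Return structured records found in unstructured text."""
    return [m.groupdict() for m in PATTERN.finditer(text)]

records = extract("Alice Smith works at Google. Bob Jones works at IBM.")
# each record is a dict with 'person' and 'org' fields
```

Once text is reduced to records like these, the richer queries mentioned above (e.g., "who works at Google?") become simple lookups.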
IE: Multidisciplinary • Roots in NLP • Now spans many communities • Machine learning • Information retrieval • Databases • Web (web science) • Document analysis • Sarawagi’s categorization of methods • Rule-based • Statistical • Hybrid models
Applications • News Tracking • Customer Care (e.g., unstructured data from insurance claim forms) • Data Cleaning (e.g., converting address strings into structured fields) • Classified Ads • Personal Information Management • Scientific (e.g., bioinformatics) • Citation Databases • Opinion Databases (e.g., enhanced if organized along structured fields) • Community Websites (e.g., conferences, projects, events) • Comparison Shopping • Ad Placement (e.g., product ads next to text mentioning the product) • Structured Web Search • Grand Challenge • Allow structured search queries involving entities and their relationships over the WWW
Types of Structure Extracted • Entities • Relationships • Adjective Descriptors • Structures • Aggregates • Lists • Tables • Hierarchies
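The first two structure types above can be sketched as a simple output schema; the class and field names are illustrative assumptions, not a schema from the survey.

```python
from dataclasses import dataclass

# Illustrative records for two extracted structure types:
# entities and binary relationships between them.
@dataclass(frozen=True)
class Entity:
    text: str     # surface form in the source, e.g. "Sunita Sarawagi"
    etype: str    # entity type, e.g. "PERSON"
    span: tuple   # (start, end) character offsets in the source

@dataclass(frozen=True)
class Relationship:
    rtype: str    # relationship type, e.g. "AUTHOR_OF"
    head: Entity
    tail: Entity

e1 = Entity("Sunita Sarawagi", "PERSON", (0, 15))
e2 = Entity("Information Extraction", "TOPIC", (26, 48))
rel = Relationship("AUTHOR_OF", e1, e2)
```

Lists, tables, and hierarchies would compose such records into larger aggregates.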
Types of Unstructured Sources • Granularity • Record or Sentence • Paragraphs • Documents • Heterogeneity • Machine-Generated Pages • Partially Structured, Domain-Specific • Open-Ended
Input Resources for Extraction • Structured Databases “In many applications unstructured data needs to be integrated with structured databases.” • Labeled Unstructured Text • Labeling for machine learning • Labeling to establish ground truth • Preprocessor Libraries (NLP tools) • Sentence analyzer to identify sentence boundaries • Part-of-speech tagger • Parser to group tagged text into phrases • Dependency analyzer (subject/object) • Formatted text (table and list structures) • Lexical Resources (e.g., WordNet)
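A toy stand-in for the first two preprocessing steps above (sentence analysis and part-of-speech tagging). Real systems use trained NLP tools; these regex rules are deliberately simplistic placeholders of my own.

```python
import re

def split_sentences(text):
    """Naive sentence-boundary detection: split after '.', '!', or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def crude_pos_tag(sentence):
    """Placeholder tagger: capitalized tokens -> NNP, all else -> TOK."""
    tokens = re.findall(r"\w+", sentence)
    return [(t, "NNP" if t[0].isupper() else "TOK") for t in tokens]

sents = split_sentences("IE has many uses. Accuracy is hard!")
tagged = [crude_pos_tag(s) for s in sents]
```

Downstream extraction rules and statistical models would consume this tokenized, tagged representation rather than raw text.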
Output of Extraction • Identify all instances in the unstructured text • Populate a database • For both modes, the core extraction work remains the same
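The "populate a database" output mode can be sketched as follows, assuming records already extracted as (person, organization) pairs; the table schema and data are illustrative, not from the source.

```python
import sqlite3

# In-memory database standing in for the target structured store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE works_for (person TEXT, org TEXT)")

# Records produced by some upstream extractor (made-up examples).
extracted = [("Alice Smith", "Google"), ("Bob Jones", "IBM")]
conn.executemany("INSERT INTO works_for VALUES (?, ?)", extracted)
conn.commit()

rows = conn.execute(
    "SELECT person, org FROM works_for ORDER BY person"
).fetchall()
# rows now holds the structured records, ready for SQL queries
```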
Challenges • Accuracy (the foremost challenge) • Diversity of Clues Required for Success • Inherent complexity demands combining many forms of evidence • Combining them optimally is non-trivial • The problem is far from solved • Difficulty of Detecting Missed Extractions • Recall: percentage of actual entities extracted correctly; without ground truth, the full set of actual entities is unknown • Precision: percentage of extracted entities that are correct; easier to tune, since extractions can usually be judged correct or incorrect • Increased Complexity of Structures Extracted (e.g., parts of a blog that assert an opinion)
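The precision and recall definitions above can be computed directly over sets of extracted versus gold (ground-truth) entities; the entity names here are made-up examples.

```python
def precision_recall(extracted, gold):
    """Precision: fraction of extractions that are correct.
    Recall: fraction of gold entities that were extracted."""
    extracted, gold = set(extracted), set(gold)
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# 2 of 3 extractions correct; 2 of 4 gold entities found.
p, r = precision_recall({"Google", "IBM", "Pairs"},
                        {"Google", "IBM", "Oracle", "SAP"})
```

Note the asymmetry the slide points out: precision needs only the extracted set to be judged, while recall needs the full gold set, which is usually unavailable at web scale.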
Challenges (continued) • Running Time • Lots of documents; just finding the set from which to extract is challenging • Expensive processing steps applied to many documents • Other System Issues • Dynamically changing sources • Data integration (when extracting the same objects from different sites) • Extraction errors • Attaching confidence to each extraction • But computing the confidence is non-trivial
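One simple way to attach a confidence to an extraction, sketched under my own assumptions (the survey does not prescribe this method): take a softmax over raw model scores for the candidate labels and keep only high-confidence extractions. The scores and threshold below are made-up illustrative values.

```python
import math

def softmax_confidence(scores):
    """Probability of the top-scoring label under a softmax."""
    m = max(scores)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    return max(exps) / sum(exps)

# Raw scores for three candidate labels of one extraction (made up).
conf = softmax_confidence([3.2, 0.1, -1.0])
accept = conf >= 0.9   # keep only high-confidence extractions
```

Even this simple scheme illustrates why confidence is non-trivial: a softmax probability is only as calibrated as the underlying model's scores.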