1. Information Extraction: Introduction and Tools
V. G. Vinod Vydiswaran
Roll no. 02329011
M.Tech (1st Year)
KReSIT, IIT Bombay
29th October 2002
Guided by : Prof. S. Sarawagi
2. Introduction
What is Information Extraction (IE)?
To select desired fields from the given data, by extracting common patterns that appear along with the information.
To automate such a process.
To make the process efficient by reducing the training data required, so as to restrict the cost.
3. Motivation
Abundant online data available.
Most IE systems specific to single information resource.
IE models usually hand-coded, and hence error-prone.
Data available either in structured form or in highly verbose content. Proper filters needed.
4. Types of Data
Based on text styles:
Structured data
Semi-Structured text
Plain text
Based on information to the model:
Labeled
Unlabeled
5. Structured Data
Relational Data
Data in databases, in tables
HTML Tags
Query responses translated into Relational form using Wrappers
Usually hand-coded and very specific to information resource
6. Wrapper Induction
Wrapper
Procedure extracting tuples from a particular information source
A function from page to set of tuples
Induction
Task of generalizing from labeled examples to a hypothesis: a function for labeling new instances
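The induction step can be illustrated with a small Python sketch. It assumes, purely for illustration, that a wrapper is described by one (left, right) delimiter pair per attribute and that the learner picks the longest delimiters consistent with the labeled examples. The names `learn_delimiters` and `common_suffix` and the sample fragment are hypothetical and do not come from the slides; shorter delimiters could be equally consistent hypotheses.

```python
import os


def common_suffix(strings):
    """Longest string that every element of `strings` ends with."""
    reversed_strings = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_strings)[::-1]


def learn_delimiters(page, labeled_tuples):
    """Generalize from labeled tuples to one (left, right) pair per attribute.

    Assumption of this sketch: attribute values appear in the page in the
    same order as in `labeled_tuples`, so each can be located by a forward scan.
    """
    n_attrs = len(labeled_tuples[0])
    before = [[] for _ in range(n_attrs)]  # text preceding each instance
    after = [[] for _ in range(n_attrs)]   # text following each instance
    pos = 0
    for tup in labeled_tuples:
        for k, value in enumerate(tup):
            start = page.index(value, pos)
            end = start + len(value)
            before[k].append(page[:start])
            after[k].append(page[end:])
            pos = end
    # Left delimiter: what always precedes attribute k.
    # Right delimiter: what always follows attribute k.
    return [
        (common_suffix(before[k]), os.path.commonprefix(after[k]))
        for k in range(n_attrs)
    ]


# Hypothetical labeled example: a page fragment plus the tuples it contains.
fragment = (
    "<B>Congo</B> <I>242</I><BR> "
    "<B>Egypt</B> <I>20</I><BR> "
    "<B>India</B> <I>91</I><BR>"
)
labels = [("Congo", "242"), ("Egypt", "20"), ("India", "91")]
delimiters = learn_delimiters(fragment, labels)
```

With only three labeled rows the learned delimiters are already specific to this page layout, which reflects the point made earlier that wrappers tend to be tied to a single information resource.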
7. Wrapper Identification
    ExtractCCs (page P) {
        skip past first occurrence of <P> in P
        while next <B> is before next <HR> in P {
            for each (lk, rk) ∈ {(<B>, </B>), (<I>, </I>)} {
                skip past next occurrence of lk in P
                extract attribute from P to next occurrence of rk
            }
        }
        return extracted tuples
    }

    <HTML><HEAD> <TITLE>Country Codes</TITLE> </HEAD>
    <BODY>
    <B>Some Country Codes</B> <P>
    <B>Congo</B> <I>242</I><BR>
    <B>Egypt</B> <I>20</I><BR>
    <B>India</B> <I>91</I><BR>
    <B>Spain</B> <I>34</I><BR>
    <HR> <B>End</B>
    </BODY> </HTML>
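The ExtractCCs procedure above can be written as a runnable Python sketch. The function name `extract_ccs`, its parameters, and the decision to keep scanning when no `<HR>` is present are assumptions made for this example; the page text and delimiters are taken from the slide.

```python
PAGE = """<HTML><HEAD> <TITLE>Country Codes</TITLE> </HEAD>
<BODY>
<B>Some Country Codes</B> <P>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>India</B> <I>91</I><BR>
<B>Spain</B> <I>34</I><BR>
<HR> <B>End</B>
</BODY> </HTML>"""

# One (left, right) delimiter pair per attribute: country name, then code.
DELIMITERS = [("<B>", "</B>"), ("<I>", "</I>")]


def extract_ccs(page, head="<P>", tail="<HR>", delimiters=DELIMITERS):
    """Return the (country, code) tuples found between `head` and `tail`."""
    tuples = []
    start = page.find(head)
    if start == -1:
        return tuples
    pos = start + len(head)  # skip past first occurrence of the head marker
    first_left = delimiters[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no further row starts, or the next row starts after the tail.
        # Assumption: if the tail is missing, rows are read until none remain.
        if next_left == -1 or (next_tail != -1 and next_tail < next_left):
            break
        row = []
        for left, right in delimiters:
            begin = page.find(left, pos)
            if begin == -1:
                return tuples  # malformed row: give up on the rest of the page
            begin += len(left)
            end = page.find(right, begin)
            if end == -1:
                return tuples
            row.append(page[begin:end])
            pos = end + len(right)
        tuples.append(tuple(row))
    return tuples


country_codes = extract_ccs(PAGE)
```

The head marker `<P>` keeps the page title "Some Country Codes" out of the result, and the tail marker `<HR>` does the same for the trailing "End", even though both are wrapped in the same `<B>` tags as the country names.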