Semiautomatic Generation of Resilient Data-Extraction Ontologies
Yihong Ding
Data Extraction Group, Brigham Young University
Sponsored by NSF
Wrapper-Driven Data Extraction
• Web data extraction
  • Obtain user-specified information from Web documents
• Wrapper
  • Converts implicit HTML data into explicit formatted data
  • Specific to a data source; high performance
  • Examples: SoftMealy, STALKER, WIEN, Omini, ROADRUNNER, …
Common Problem of Wrappers
[Figure: SoftMealy finite-state transducer. Transition labels include s<b,U> / “U=” + next_token, s<U,N> / “N=” + next_token, s<b,N> / “N=” + next_token, s<U,U> / ε, s<N,N> / ε, ? / next_token, and ? / ε]
SoftMealy example input:
<LI> <A HREF="…"> Mani Chandy </A>, <I>Professor of Computer Science</I> and <I>Executive Officer for Computer Science</I>
• Resiliency
  • Fixed domain
  • Changeable layout
• Scalability
  • Unchanged existing wrapper
  • Extendable domain and functions
Data-Extraction Ontology
• Structure
  • Object sets
  • Relationship sets
  • Participation constraints
  • Data frames
• Pros: resilient and scalable
• Cons: hard to create
  • Knowledge requirements
  • Tedious and error-prone work
Example:
Car [-> object];
Car [0:1] has Make [1:*];
  Make matches [10] constant { extract "\baudi\b"; };
end;
Car [0:1] has Model [1:*];
  Model matches [25] constant { extract "80"; context "\baudi\S*\s*80\b"; };
end;
Car [0:1] has Mileage [1:*];
  Mileage matches [8] constant { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; };
end;
Car [0:1] has Price [1:*];
  Price matches [8] constant { extract "[1-9]\d{3,6}"; context "\$[1-9]\d{3,6}"; };
end;
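The data frames in the ontology above can be read as regular-expression recognizers. The following is a minimal Python sketch of that reading, not the ontology engine itself. The `DataFrame` class, the `recognize` method, the case-insensitive matching, and the rule that a missing `context` falls back to the `extract` pattern are all assumptions made for illustration; only the four patterns come from the slide.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DataFrame:
    """Hypothetical container for one 'matches ... end;' clause."""
    name: str
    extract: str                                  # pattern for the value itself
    context: Optional[str] = None                 # pattern the value must sit inside
    substitute: Optional[Tuple[str, str]] = None  # (pattern, replacement)

    def recognize(self, text: str) -> List[str]:
        values = []
        # Assumption: with no context rule, the extract pattern is its own context.
        scope = self.context or self.extract
        for ctx in re.finditer(scope, text, re.IGNORECASE):
            hit = re.search(self.extract, ctx.group(0), re.IGNORECASE)
            if hit is None:
                continue
            value = hit.group(0)
            if self.substitute:
                pattern, replacement = self.substitute
                value = re.sub(pattern, replacement, value)
            values.append(value)
        return values


# Patterns copied from the slide's Car ontology.
CAR_FRAMES = [
    DataFrame("Make", r"\baudi\b"),
    DataFrame("Model", r"80", context=r"\baudi\S*\s*80\b"),
    DataFrame("Mileage", r"\b[1-9]\d{0,2}k", substitute=(r"[kK]", "000")),
    DataFrame("Price", r"[1-9]\d{3,6}", context=r"\$[1-9]\d{3,6}"),
]


def extract_record(text: str) -> dict:
    """Apply every data frame to one record and collect the recognized values."""
    return {frame.name: frame.recognize(text) for frame in CAR_FRAMES}


ad = "Audi 80, 95k miles, $4500"  # made-up ad, not from the training data
print(extract_record(ad))
```

The `context` pattern is what keeps the bare `80` from matching arbitrary numbers: the value is only searched for inside text that already matched the context.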
Motif of Ontology Generation
[Figure labels: Sample Documents, Human Brain, Concepts of Interest, Data-Extraction Ontology, Knowledge Base, Concepts with Relations]
Thesis Statement
• Given: knowledge base
• Input: sample Web pages of interest
• Output: a data-extraction ontology for the domain of interest
• Between input and output: this is the work of this thesis
Ontology-Generation Procedure
[Figure: pipeline diagram. Labels: training documents, test documents, pre-processing, clean records, Concept Selection, Relation Retrieval, Constraint Discovery, Data Extraction Ontology, interact if necessary, Extraction Processing, Results Storage, Result Evaluation, Knowledge Sources, pre-processing, Integrated Knowledge Base]
Primary Knowledge Source
• Requirements
  • Available
  • General in coverage
  • Rich in meaningful relationships
  • Encoded in or easily converted to XML
• Mikrokosmos (μK) Ontology
  • Developed by NMSU jointly with U.S. DoD
  • Contains over 5000 concepts
  • Averages 14 links per concept
  • Represented in XML format
Integrated Knowledge Base
[Figure: the knowledge base combines the following components]
• μK Ontology
• Lexicons
• Data-Frame Library
• Synonym Dictionary (WordNet)
Domain Specification
• Training documents
  • Data-rich
  • Narrow in topic breadth
• Preprocessing
Example – Car Advertisement
Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PLGreat Condition, $10,800, Call 798-3446
Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250
Record 3: 02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
Record 4: 00 Buick Century Stk# HU7159 Green $9,319, 714-2200To Apply By Phone, 1-877-228-9486, OREM Utah
Concept Selection
• Selection strategies
  • Compare a string with the name of a concept
  • Compare a string with the values belonging to a concept
  • Apply data-frame recognizers to recognize a string
[Figure: the KB labels a string in the record below as <PHONE-NR>]
00 Buick Century Stk# HU7159 Green $9,319, 714-2200To Apply By Phone, 1-877-228-9486, OREM Utah
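The three selection strategies can be sketched as three lookups against a toy knowledge base. Everything in the tables below is a hypothetical stand-in: the concept names echo labels that appear on these slides, but the name table, the lexicon, and both regular expressions are assumptions, not the contents of the actual knowledge base.

```python
import re
from collections import defaultdict

# Hypothetical miniature knowledge base (assumed for illustration).
CONCEPT_NAMES = {"phone": "PHONE-NR", "century": "CENTURY"}   # word -> concept
LEXICONS = {"AUTOMOBILE": {"buick", "pontiac"}}               # concept -> known values
RECOGNIZERS = {                                               # concept -> data-frame regex
    "PHONE-NR": r"\b(?:1-\d{3}-)?\d{3}-\d{4}",
    "PRICE": r"\$\d{1,3}(?:,\d{3})*",
}


def select_concepts(record: str) -> dict:
    """Return {concept: set of strategies that selected it} for one record."""
    selected = defaultdict(set)
    for token in re.findall(r"[a-z]+", record.lower()):
        # Strategy 1: the string matches the name of a concept.
        if token in CONCEPT_NAMES:
            selected[CONCEPT_NAMES[token]].add("name")
        # Strategy 2: the string is a value belonging to a concept.
        for concept, values in LEXICONS.items():
            if token in values:
                selected[concept].add("value")
    # Strategy 3: a data-frame recognizer matches somewhere in the record.
    for concept, pattern in RECOGNIZERS.items():
        if re.search(pattern, record):
            selected[concept].add("data frame")
    return dict(selected)


record = ("00 Buick Century Stk# HU7159 Green $9,319, 714-2200To Apply By Phone, "
          "1-877-228-9486, OREM Utah")
for concept, strategies in sorted(select_concepts(record).items()):
    print(concept, sorted(strategies))
```

In this sketch the model name "Century" selects a CENTURY concept by name alone, which is the kind of polysemy that a later conflict-resolution step has to deal with.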
Concept Selection
• Sources of conflict
  • Synonymy
  • Polysemy
• Conflict resolution
  • The same string has only one meaning
  • Favor longer matches over shorter ones
  • Context decides meaning
[Figure: in the record below, a string is a candidate for both <PRICE> and <MILEAGE>; keyword identification resolves it as price]
02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250.
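One way to combine the three resolution rules is to rank overlapping candidates by length, break ties with nearby keywords, and keep only non-overlapping winners. The sketch below assumes that ranking order; the `Candidate` type, the keyword table, and the 15-character window are illustrative choices, not details taken from the thesis.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    start: int
    end: int
    concept: str


# Hypothetical context keywords per concept (assumed for illustration).
KEYWORDS = {
    "PRICE": ("retail", "price", "$"),
    "MILEAGE": ("miles", "mileage"),
}


def context_score(text: str, cand: Candidate, window: int = 15) -> int:
    """Count the concept's keywords that appear near the candidate string."""
    nearby = text[max(0, cand.start - window): cand.end + window].lower()
    return sum(kw in nearby for kw in KEYWORDS.get(cand.concept, ()))


def resolve(text: str, candidates: List[Candidate]) -> List[Candidate]:
    # Favor longer over shorter; among equal lengths, context decides.
    ranked = sorted(
        candidates,
        key=lambda c: (c.end - c.start, context_score(text, c)),
        reverse=True,
    )
    chosen: List[Candidate] = []
    for cand in ranked:
        # Same string, only one meaning: drop anything overlapping a winner.
        if all(cand.end <= k.start or cand.start >= k.end for k in chosen):
            chosen.append(cand)
    return sorted(chosen)


text = "02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250"
num = text.index("13,695")
phone = text.index("221-1250")
candidates = [
    Candidate(num, num + 6, "MILEAGE"),
    Candidate(num, num + 6, "PRICE"),
    Candidate(phone, phone + 8, "PHONE-NR"),
    Candidate(phone + 4, phone + 8, "PRICE"),   # "1250" inside the phone number
]
print([(text[c.start:c.end], c.concept) for c in resolve(text, candidates)])
```

Here "13,695" is kept as a price because "Retail" sits next to it, and the shorter "1250" candidate loses to the longer phone-number match that contains it.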
Relationship Retrieval
[Figure: selected concept nodes in the KB: <AUTOMOBILE>, <MILEAGE>, <YEAR>, <PRICE>, <PHONE-NR>, <AUDIO-MEDIA-ARTIFACT>, <CENTURY>]
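The slide does not spell out how relationships are retrieved. One plausible reading, given that generated relationship sets elsewhere in this deck are named after chains such as IsA.ARTIFACT.CostOfProduction, is a bounded path search between selected concepts in the knowledge-base graph. The sketch below assumes that reading. The graph fragment reuses link names that appear on these slides, while the `find_path` function, the breadth-first strategy, and the `max_links` limit are assumptions.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Tiny KB fragment: node -> [(link name, neighbor)]. Edge names are taken from
# relationship sets shown in this deck; the graph structure is assumed.
KB: Dict[str, List[Tuple[str, str]]] = {
    "AUTOMOBILE": [("IsA", "ARTIFACT")],
    "ARTIFACT": [("CostOfProduction", "PRICE")],
    "PRICE": [("IsA", "SCALARATTRIBUTE")],
    "SCALARATTRIBUTE": [("MeasuredIn", "MEASURINGUNIT")],
    "MEASURINGUNIT": [("Subclasses", "YEAR")],
}


def find_path(graph, source: str, target: str, max_links: int = 4) -> Optional[str]:
    """Breadth-first search; returns 'link.NODE.link...' or None if no short path."""
    queue = deque([(source, [])])   # (current node, alternating link/node labels)
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target and path:
            return ".".join(path)
        links_used = (len(path) + 1) // 2
        if links_used >= max_links:
            continue                # path already too long; do not extend it
        for link, nxt in graph.get(node, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            # Intermediate nodes are recorded; the source itself is not.
            new_path = path + [link] if not path else path + [node, link]
            queue.append((nxt, new_path))
    return None


print("AUTOMOBILE", find_path(KB, "AUTOMOBILE", "PRICE"), "PRICE")
print("PRICE", find_path(KB, "PRICE", "YEAR"), "YEAR")
```

A length bound of this kind matters because a densely linked knowledge base connects almost any two concepts if paths are allowed to grow long enough.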
Constraint Discovery
[Figure: the labels <AUTOMOBILE> and <PRICE> annotate the two records below]
02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755
00 Buick Century Stk# HU7159 Green $9,319, 714-2200To Apply By Phone, 1-877-228-9486, OREM Utah
Discovered constraint: AUTOMOBILE [0:1] IsA.ARTIFACT.CostOfProduction PRICE [1:1]
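A simple way to picture constraint discovery is to count how often a concept is recognized in each training record and turn the observed minimum and maximum into a [min:max] range. The sketch below assumes exactly that and nothing more: the two regular expressions are illustrative recognizers, the `participation` and `discover` helpers are hypothetical, and the sketch ignores the question of which side of a relationship set each range belongs to.

```python
import re
from typing import List

# The four car-ad training records from the example slide.
RECORDS = [
    "00 GrandAM SE, Sunfire Red, CD, AC, PW, PLGreat Condition, $10,800, Call 798-3446",
    "02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250",
    "02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755",
    "00 Buick Century Stk# HU7159 Green $9,319, 714-2200To Apply By Phone, "
    "1-877-228-9486, OREM Utah",
]

# Illustrative recognizers (assumed, not from the data-frame library).
RECOGNIZERS = {
    "PRICE": r"\$\d{1,3}(?:,\d{3})*",
    "PHONE-NR": r"\b(?:1-\d{3}-)?\d{3}-\d{4}",
}


def participation(counts: List[int]) -> str:
    """Map per-record occurrence counts to a [min:max] participation range."""
    low = 0 if min(counts) == 0 else 1
    high = "1" if max(counts) <= 1 else "*"
    return f"[{low}:{high}]"


def discover(records: List[str], pattern: str) -> str:
    counts = [len(re.findall(pattern, record)) for record in records]
    return participation(counts)


for concept, pattern in RECOGNIZERS.items():
    print(concept, discover(RECORDS, pattern))
```

With only four records, a single ad that omits the dollar sign is enough to change a minimum from 1 to 0, which is why the amount of training data matters for this step.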
Ontology Generation
• Concept nodes → object sets
• Paths → relationship sets
• Discovered constraints → participation constraints
• Concept recognizers → data frames
Automatically Generated Ontology: Car Advertisement
(01) {Automobile [-> object];}
(02) {Automobile [0:1] has Mileage [1:1];}
(03) {Automobile [0:1] IsA.ARTIFACT.CostOfProduction Price [1:1];}
(12) {Price [1:1] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year [0:*];}
(20) {Automobile [0:1] relatesTo PhoneNr [1:*] relatesTo ArtifactPart [1:*] relatesTo Mileage [1:*] relatesTo Truck [1:*] relatesTo AudioMediaArtifact [1:*] relatesTo CommunicationDevice [1:*] relatesTo ControlEvent [1:*] relatesTo TravelEvent [1:*];}
Updating Strategies
• Remove all bad relationship sets
• Modify remaining incorrect relationship sets
• Substitute incorrect object sets
• Reduce long n-ary relationship sets
• Fix participation constraints
• Adjust names or re-arrange sequences
• Add new relationship sets
Final Ontology
Car [-> object]
Car [0:1] has Year [1:*]
Car [0:1] has Mileage [1:*]
Car [0:1] has Price [1:*]
PhoneNr [1:*] is for Car [0:1]
PhoneNr [0:1] has Extension [1:*]
Car [0:*] has Feature [1:*]
Car [0:1] has Make [1:*]
Car [0:1] has Model [1:*]
Evaluation Criteria
• Basic measures
  • POG (Precision of Ontology Generation)
  • ROG (Recall of Ontology Generation)
• Human constraints
  • PROG (Pseudo-ROG)
  • Compares against an expert-created ontology
• Knowledge base constraints
  • EPROG (Effective-PROG)
• Correctness dependency
  • DEPROG (Dependent-EPROG)
  • For example: relationship sets depend on object sets
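The slide names the two basic measures but gives no formulas. Assuming they follow the usual definitions of precision and recall, applied to ontology components such as object sets, they can be sketched as below. The `pog_rog` function and the two toy component sets are hypothetical and are not results from the thesis; the PROG, EPROG, and DEPROG variants are not defined on the slide, so they are not modeled here.

```python
from typing import Set, Tuple


def pog_rog(generated: Set[str], expected: Set[str]) -> Tuple[float, float]:
    """Precision and recall of generated components against an expected set.

    Assumes the standard definitions:
      POG = |generated ∩ expected| / |generated|
      ROG = |generated ∩ expected| / |expected|
    """
    correct = generated & expected
    pog = len(correct) / len(generated) if generated else 0.0
    rog = len(correct) / len(expected) if expected else 0.0
    return pog, rog


# Toy object sets, invented for illustration only.
generated = {"Car", "Year", "Price", "Truck"}
expected = {"Car", "Year", "Price", "Make", "Model"}

pog, rog = pog_rog(generated, expected)
print(f"POG = {pog:.2f}, ROG = {rog:.2f}")
```

In this toy case one generated object set is wrong (Truck), which lowers precision, and two expected ones are missing (Make, Model), which lowers recall.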
Discussion of Results
• Bottleneck: the system cannot generate what is not in the knowledge base
• Object sets
  • Concept-selection procedure works well
  • Desired concept not shown in training records
    • Missing a rarely occurring concept is not severe, even if the error is left unfixed
    • Example: extension
  • Aggregation and union
    • USAddressCity, USAddressState, USAddressZipCode → Location
    • CropPlant, AnimalProduct, FruitFoodStuff → AgriculturalProduct
  • Close-meaning concepts: FurniturePart, Furnished
Discussion of Results
• Relationship sets
  • Binary relationship sets: over 95%
  • Most errors due to incorrectly generated object sets
  • Semantically incorrect relationship sets
    • Price IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year
  • n-ary relationship sets (usually huge)
• Participation constraints
  • Error due to lack of training examples
  • How much is enough?
Knowledge Base Extensibility
• Add SALT, a new knowledge source
• Successfully integrated into existing KB
• Sample new relationship set (DOE abstract domain)
  • CrudeOil IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation
Conclusion
• Experimented with knowledge-base construction and extension
• Standardized application domain specification
• Generated data-extraction ontologies from a specified domain and an integrated knowledge base
• Showed DEPROG results of more than 70% on average and over 90% for well-defined domains
Future Work
• Build a general-purpose knowledge source for data-extraction usage
• Study more about data frames
  • Can a system correctly identify concepts with data frames?
  • Can a system update a data frame to fit a special situation?
  • Can a system generate a data frame from a collection of information of interest?