Ontology Generation, Information Harvesting and Semantic Annotation for Machine-Generated Web Pages • Cui Tao • PhD Dissertation Defense
Motivation • Birth date of my great-grandpa • Price and mileage of red Nissans, 1990 or newer • Protein and amino-acid information for gene cdk-4 • US states with property crime rates above 1%
Search the Hidden Web: “cdk-4” • The Hidden Web: • Hidden behind forms • Hard to query
Query for Data • The Hidden Web: • Hidden behind forms • Hard to query • Find the protein and amino-acid information for gene “cdk-4”
From a Web of Pages to a Web of Knowledge • Web of Knowledge • Machine-“understandable” • Publicly accessible • Queryable by standard query languages • Semantic annotation • Domain ontologies • Populated conceptual model • Problems to resolve • How do we create ontologies? • How do we annotate pages for ontologies?
Contributions of Dissertation Work • From a Web of Pages to a Web of Knowledge • Knowledge & meta-knowledge extraction • Reformulation as machine-“understandable” knowledge • Automatic & semi-automatic solutions via: • Sibling tables (TISP/TISP++) • User-created forms (FOCIH)
Automatic Annotation with TISP (Table Interpretation with Sibling Pages) • Recognize tables (discard non-tables) • Locate table labels • Locate table values • Find label/value associations
Recognize Tables • Layout tables (discard) • Data table • Nested data tables
Find Label/Value Associations • Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
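The dotted label path in the example above can be illustrated with a minimal sketch. It assumes the nested tables have already been parsed into a nested dictionary (the `page` structure and the `flatten` helper are hypothetical and not part of TISP), and it simplifies the slide's two-coordinate notation to a single label path per value.

```python
def flatten(table, prefix=()):
    """Yield (dotted_label_path, value) pairs from a nested label/value dict."""
    for label, content in table.items():
        path = prefix + (label,)
        if isinstance(content, dict):
            # A label that leads to a nested table: keep descending.
            yield from flatten(content, path)
        else:
            # A label that leads to an actual value: emit the association.
            yield ".".join(path), content


# Hypothetical parsed form of the nested tables on the gene page.
page = {"Identification": {"Gene model(s)": {"Protein": "WP:CE28918"}}}
pairs = dict(flatten(page))
print(pairs)
```

Each key in `pairs` is the chain of labels leading to a value, which is the kind of association the slide's example expresses.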
Interpretation Technique: Sibling Page Comparison • Almost Same
Interpretation Technique: Sibling Page Comparison • Different • Same
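The idea behind sibling page comparison can be sketched roughly as follows: cells whose text is the same on both sibling pages are likely labels, and cells whose text differs are likely values. This is only an illustration under strong assumptions (both tables already aligned cell by cell and of equal shape); the function name and the sample rows are made up for the example.

```python
def classify_cells(table_a, table_b):
    """Mark each aligned cell 'L' (same text on both pages) or 'V' (different)."""
    roles = []
    for row_a, row_b in zip(table_a, table_b):
        roles.append(["L" if a == b else "V" for a, b in zip(row_a, row_b)])
    return roles


# Made-up rows from two sibling pages generated by the same site.
page1 = [["Gene", "cdk-4"], ["Species", "C. elegans"]]
page2 = [["Gene", "abc-1"], ["Species", "C. briggsae"]]
roles = classify_cells(page1, page2)
print(roles)
```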
Technique Details • Unnest tables • Match tables in sibling pages • “Perfect” match (layout table: discard) • “Reasonable” match (sibling table) • Determine & use table-structure pattern • Discover pattern • Pattern usage • Dynamic pattern adjustment
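The distinction between a “perfect” and a “reasonable” match might be sketched as below. The slides do not state the similarity measure or any cutoff, so the cell-overlap ratio and the `threshold` value here are assumptions chosen purely for illustration.

```python
def match_kind(cells_a, cells_b, threshold=0.5):
    """Classify a pair of tables from two sibling pages by cell overlap.

    'layout'   : every cell identical (nothing varies, so no data to extract)
    'sibling'  : enough shared cells to treat the tables as the same template
    'no-match' : too little in common
    The overlap measure and threshold are illustrative assumptions.
    """
    total = max(len(cells_a), len(cells_b))
    if total == 0:
        return "no-match"
    same = sum(a == b for a, b in zip(cells_a, cells_b))
    if same == total:
        return "layout"
    if same / total >= threshold:
        return "sibling"
    return "no-match"


print(match_kind(["Home", "About"], ["Home", "About"]))
print(match_kind(["Gene", "cdk-4", "Species", "x"], ["Gene", "abc-1", "Species", "y"]))
```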
Table Structure Patterns • Regularity Expectations: • (<tr><(td|th)> {L} <(td|th)> {V})^n • <tr>(<(td|th)> {L})^n • (<tr>(<(td|th)> {V})^n)+ • … • Pattern combinations are also possible.
Table Structure Patterns • <tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+
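As a rough sketch of how such regularity expectations could be checked, the table can be encoded as a string of row and cell markers with `L` (label) and `V` (value) placeholders and then tested against regular expressions. The encoding, the pattern names, and the `recognize` helper are assumptions for this example. A plain regular expression cannot require the same count n in every row, so the header pattern adds an explicit row-length check.

```python
import re


def encode(rows):
    """Encode rows of 'L'/'V' roles as a tag string such as '<tr><td>L<td>V'."""
    return "".join("<tr>" + "".join(f"<td>{role}" for role in row) for row in rows)


PAIRS_PER_ROW = re.compile(r"(<tr><td>L<td>V)+")
HEADER_THEN_VALUES = re.compile(r"<tr>(<td>L)+(<tr>(<td>V)+)+")


def recognize(rows):
    """Return a name for the first structure pattern the table fits, else None."""
    encoded = encode(rows)
    if PAIRS_PER_ROW.fullmatch(encoded):
        return "label-value pairs per row"
    same_width = len({len(row) for row in rows}) == 1
    if same_width and HEADER_THEN_VALUES.fullmatch(encoded):
        return "header row, then value rows"
    return None


print(recognize([["L", "V"], ["L", "V"]]))
print(recognize([["L", "L", "L"], ["V", "V", "V"], ["V", "V", "V"]]))
```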
TISP++ • Automatic ontology generation • Automatic information annotation
Ontology Generation – OSM • Object set: table labels • Lexical: labels that associate with actual values • Non-lexical: labels that associate with other tables • Relationship set: table nesting • Constraints: updates based on observation
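The mapping from interpreted tables to OSM components can be sketched as a small traversal. It assumes the tables are available as a nested dictionary in which a label maps either to a value or to a nested table; the `build_osm` function and its output layout are hypothetical, and constraints are left out.

```python
def build_osm(table, parent=None, osm=None):
    """Collect object sets and relationship sets from a nested label/value dict."""
    if osm is None:
        osm = {"lexical": set(), "nonlexical": set(), "relationships": set()}
    for label, content in table.items():
        if isinstance(content, dict):
            # Label associated with another table: non-lexical object set.
            osm["nonlexical"].add(label)
            build_osm(content, label, osm)
        else:
            # Label associated with an actual value: lexical object set.
            osm["lexical"].add(label)
        if parent is not None:
            # Table nesting becomes a relationship set between the two labels.
            osm["relationships"].add((parent, label))
    return osm


page = {"Identification": {"Gene model(s)": {"Protein": "WP:CE28918"}}}
osm = build_osm(page)
print(osm)
```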
Ontology Generation – OWL • Object set: OWL class • Relationship set: OWL object property • Lexical object set: • OWL datatype property • Different annotation properties to keep track of the provenance
Query the Data • Find the protein and amino-acid information for gene “cdk-4”
TISP Evaluation • Applications • Commercial: car ads • Scientific: molecular biology • Geopolitical: US states and countries • Data: > 2,000 tables in 35 sites • Evaluation • Initial two sibling pages • Correct separation of data tables from layout tables? • Correct pattern recognition? • Remaining tables in site • Information properly extracted? • Able to detect and adjust for pattern variations?
Experimental Results • Table recognition: correctly discarded 157 of 158 layout tables • Pattern recognition: correctly found 69 of 72 structure patterns • Extraction and adjustments: 5 path adjustments and 34 label adjustments, all correct
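For readers who prefer percentages, the reported counts convert as shown in this short calculation. Only the counts come from the slide; the dictionary keys are just descriptive names.

```python
# (correct, total) counts as reported on the slide.
results = {
    "layout tables discarded": (157, 158),
    "structure patterns found": (69, 72),
}

rates = {name: round(100 * correct / total, 1) for name, (correct, total) in results.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

This gives roughly 99.4% for table recognition and 95.8% for pattern recognition.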
TISP++ Performance • Performance depends on TISP • TISP test set • Generates all ontologies correctly • Annotates all information in tables correctly
Form-based Ontology Creation and Information Harvesting (FOCIH) • Personalized ontology creation by form • General familiarity • Reasonable conceptual framework • Appropriate correspondence • Transformable to ontological descriptions • Capable of accepting source data • Automated ontology creation • Automated information harvesting
Almost Ready to Harvest • Need reading path: DOM-tree structure • Need to resolve mapping problems • Pattern recognition • Instance recognition
Pattern & Instance Recognition • Regular expression for decimal number • Left context • Right context
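One way to picture recognition by left context, right context, and a value expression is the sketch below. The `build_extractor` helper, the decimal-number expression, and the sample text are all assumptions for illustration; FOCIH's actual recognizers are not shown in the slides.

```python
import re

# Assumed expression for a decimal number such as "3" or "3.45".
DECIMAL = r"\d+(?:\.\d+)?"


def build_extractor(left_context, right_context, value_regex=DECIMAL):
    """Compile a pattern that captures a value between two literal contexts."""
    return re.compile(re.escape(left_context) + "(" + value_regex + ")" + re.escape(right_context))


# Made-up page text; the contexts are the literal strings around the value.
extractor = build_extractor("crime rate: ", "%")
match = extractor.search("Property crime rate: 3.45% (statewide)")
value = match.group(1) if match else None
print(value)
```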
Pattern & Instance Recognition • List pattern; delimiter is “,”
Pattern & Instance Recognition • List pattern; delimiter is a regular expression for percentage numbers and a comma
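The two list patterns above can be illustrated by splitting on a delimiter expression. This is a minimal sketch: the `split_list` helper, the percentage expression, and the sample strings are invented for the example and are not taken from FOCIH.

```python
import re


def split_list(text, delimiter):
    """Split text on a delimiter regex and drop empty or whitespace-only items."""
    return [item.strip() for item in re.split(delimiter, text) if item.strip()]


# Simple list: the delimiter is a comma.
colors = split_list("red, blue, green", r",")

# List whose items are separated by a percentage number and an optional comma.
PERCENT_AND_COMMA = r"\s*\d+(?:\.\d+)?%\s*,?"
kinds = split_list("Forest 30.2%, Farmland 45%, Urban 5.1%", PERCENT_AND_COMMA)

print(colors)
print(kinds)
```

Treating the percentage as part of the delimiter leaves only the item names, which matches the slide's description of a delimiter made of a percentage number and a comma.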