Semi-Automatic Semantic Annotation for Hidden-Web Tables Cui Tao & David W. Embley Data Extraction Research Group Department of Computer Science Brigham Young University Supported by NSF
Semantic Annotation • The Hidden Web: • Hidden behind forms • Hard to query (e.g., to find the protein and amino-acid information for gene "cdk-4") • Semantic annotation • Machine-"understandable" • Publicly accessible www.deg.byu.edu
System Overview • Initial semantic annotation • Manually annotate a sample page • With respect to a selected ontology • Table interpretation • Automatic • Tables from hidden web pages • Final semantic annotation • Automatic • Annotate interpreted tables
Initial Semantic Annotation • SMORE: Semantic Markup, Ontology and RDF Editor [Maryland Information and Network Dynamics Lab]
Table Interpretation • Locate labels and values • Pair each label with its value • Remember the path • TISP – Table Interpretation by Sibling Pages
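The three interpretation steps can be illustrated with a minimal sketch. It assumes a simple two-column table whose rows are already available as lists of cell text, with the label in the first cell and the value in the second; `LabelValue`, `pair_rows`, and the example values are hypothetical names chosen for illustration and are not taken from TISP.

```python
from typing import List, NamedTuple, Sequence


class LabelValue(NamedTuple):
    label: str  # text of the label cell
    value: str  # text of the value cell
    path: str   # relative position of the value cell, remembered for later pages


def pair_rows(rows: Sequence[Sequence[str]]) -> List[LabelValue]:
    """Pair the first cell of each row (label) with the second cell (value).

    Assumption: labels sit to the left of their values. Real tables can
    also place labels above values or nest tables inside cells.
    """
    pairs = []
    for row_number, cells in enumerate(rows, start=1):
        if len(cells) < 2:
            continue  # a row without a value cell has nothing to pair
        label = cells[0].strip().rstrip(":")
        value = cells[1].strip()
        pairs.append(LabelValue(label, value, f"tr[{row_number}]/td[2]"))
    return pairs


# Hypothetical rows; the values are placeholders, not real data.
example_rows = [["Name:", "cdk-4"], ["Molecular Function:", "placeholder function"]]
pairs = pair_rows(example_rows)
```

Remembering the path alongside each pair is what would let the same positions be read from other pages generated by the same form.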
TISP
Interpretation Technique: Sibling Page Comparison [figure labels across three slides: Same; Almost Same; Different and Same]
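The idea behind sibling page comparison can be sketched as follows: text that stays the same across pages generated by the same form is probably a label, and text that differs is probably a value. This is a simplified illustration only. It assumes the two pages have already been reduced to aligned grids of cell text, and it does not handle the "almost same" case; `classify_cells` and the sample data are hypothetical.

```python
from typing import List, Sequence


def classify_cells(
    page_a: Sequence[Sequence[str]],
    page_b: Sequence[Sequence[str]],
) -> List[List[str]]:
    """Mark each aligned cell as 'label' (same text on both pages) or 'value'."""
    kinds = []
    for row_a, row_b in zip(page_a, page_b):
        kinds.append(
            ["label" if a.strip() == b.strip() else "value" for a, b in zip(row_a, row_b)]
        )
    return kinds


# Two hypothetical sibling pages with placeholder values.
page_a = [["Name", "cdk-4"], ["Protein Name", "placeholder A"]]
page_b = [["Name", "other-gene"], ["Protein Name", "placeholder B"]]
kinds = classify_cells(page_a, page_b)
```

A value that happens to be identical on both sample pages would be misread as a label here, which is one reason comparing more than two siblings may help.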
Interpretation Technique: Sibling Page Comparison • Structure pattern of a table • Label path = Identification.Gene model(s).Gene Model • XPath = html[1]/…/table[3]/tr[1]/td[2]/table[1]/tr[6]/td[2]/table[1]/tr[2]/td[1]
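A structure pattern of this kind pairs a dotted label path with a positional XPath. The sketch below shows one way both could be computed using only the Python standard library. It assumes well-formed markup that `xml.etree.ElementTree` can parse; `positional_xpath` and the tiny sample document are hypothetical, and the elided XPath on the slide is not reconstructed.

```python
import xml.etree.ElementTree as ET


def positional_xpath(root: ET.Element, target: ET.Element) -> str:
    """Return a path such as html[1]/body[1]/table[1]/tr[2]/td[2] for target."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parent_of.get(node)
        # Count position among siblings that share the same tag (1-based).
        same_tag = [node] if parent is None else [c for c in parent if c.tag == node.tag]
        index = next(i for i, s in enumerate(same_tag, start=1) if s is node)
        steps.append(f"{node.tag}[{index}]")
        node = parent
    return "/".join(reversed(steps))


# The label path is the chain of labels from the outermost table inward.
label_path = ".".join(["Identification", "Gene model(s)", "Gene Model"])

# Hypothetical sample page with placeholder cell text.
doc = ET.fromstring(
    "<html><body><table>"
    "<tr><td>Name</td><td>cdk-4</td></tr>"
    "<tr><td>Size</td><td>placeholder</td></tr>"
    "</table></body></html>"
)
cell = doc.findall(".//td")[3]  # the value cell in the second row
xpath = positional_xpath(doc, cell)
```

Real hidden-web pages are rarely well-formed XML, so a tolerant HTML parser would likely be needed in practice.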
Annotation • Protein Name
Annotation – Split • Nucleotide Size
Annotation – Merge • Protein Information
Annotation – Union • Name
Annotation – Selection • Molecular Function
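The slides name four mapping operations (split, merge, union, selection) without defining them. One plausible reading, offered only as an assumption, is that they transform table values before attaching them to ontology concepts. The helper functions and sample strings below are hypothetical and are not the authors' definitions.

```python
from typing import Callable, Iterable, List


def split_value(value: str, sep: str = ",") -> List[str]:
    """Split: one table cell yields several ontology values."""
    return [part.strip() for part in value.split(sep) if part.strip()]


def merge_values(values: Iterable[str], sep: str = " ") -> str:
    """Merge: several table cells are combined into one ontology value."""
    return sep.join(v.strip() for v in values if v.strip())


def union_values(*sources: Iterable[str]) -> List[str]:
    """Union: values found in several places all map to one concept."""
    seen = set()
    result = []
    for source in sources:
        for value in source:
            if value not in seen:
                seen.add(value)
                result.append(value)
    return result


def select_values(values: Iterable[str], keep: Callable[[str], bool]) -> List[str]:
    """Selection: only the values that satisfy a condition are annotated."""
    return [v for v in values if keep(v)]


# Placeholder inputs for illustration only.
parts = split_value("a, b, c")
merged = merge_values(["first part", "second part"])
names = union_values(["cdk-4"], ["cdk-4", "alias-1"])
chosen = select_values(["keep me", "drop me"], lambda v: v.startswith("keep"))
```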
Generated RDF Annotation
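As an illustration of what generated RDF might look like, the sketch below serializes label-value pairs as N-Triples with plain string formatting. The namespaces, the `to_ntriples` helper, and the way predicates are derived from labels are all assumptions made for this example; the slides do not show the actual vocabulary or serialization used.

```python
from typing import Iterable, Tuple

BASE = "http://example.org/annotation/"  # hypothetical namespace for annotated items
ONTO = "http://example.org/ontology#"    # hypothetical namespace for ontology terms


def escape_literal(text: str) -> str:
    """Escape backslashes, double quotes, and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(subject_id: str, pairs: Iterable[Tuple[str, str]]) -> str:
    """Emit one triple per (label, value) pair for the given subject."""
    lines = []
    for label, value in pairs:
        predicate = ONTO + label.replace(" ", "")  # "Protein Name" -> "ProteinName"
        literal = escape_literal(value)
        lines.append(f'<{BASE}{subject_id}> <{predicate}> "{literal}" .')
    return "\n".join(lines)


# Placeholder value; not real data.
rdf = to_ntriples("gene1", [("Protein Name", "placeholder protein")])
```

A production system would more likely use an RDF library so that IRIs and literals are validated rather than assembled by hand.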
Querying Annotated Data • Find the protein and amino-acid information for gene "cdk-4"
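Once pages are annotated, a query like this one reduces to matching triple patterns. The sketch below uses an in-memory list of tuples rather than a real RDF store or SPARQL engine; the predicate names, the `match` and `find_gene_facts` helpers, and every value are placeholders invented for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def find_gene_facts(triples: List[Triple], gene_name: str, wanted: Set[str]) -> Dict[str, str]:
    """Look up the subject named gene_name, then collect the wanted properties."""
    facts = {}
    for subject, _, _ in match(triples, p="name", o=gene_name):
        for _, predicate, value in match(triples, s=subject):
            if predicate in wanted:
                facts[predicate] = value
    return facts


# Hypothetical annotated data with placeholder values.
triples = [
    ("gene1", "name", "cdk-4"),
    ("gene1", "proteinName", "placeholder protein"),
    ("gene1", "aminoAcids", "placeholder sequence"),
    ("gene2", "name", "other-gene"),
    ("gene2", "proteinName", "another placeholder"),
]
facts = find_gene_facts(triples, "cdk-4", {"proteinName", "aminoAcids"})
```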
Summary • Semi-automatic semantic annotation for hidden-web tables • Facilitates large-scale annotation of the web