The Third International Semantic Web Conference - ISWC 2004 November 07 – 11, 2004, Hiroshima, Japan. From Tables To Frames. Aleksander Pivk 1,2 , Philipp Cimiano 2 , York Sure 2 1 Jozef Stefan Institute, Ljubljana, Slovenia 2 AIFB Institute, University of Karlsruhe, Karlsruhe. 09.11.2004.
Outline • Motivation • Foundation: Table Model • Methodology • Evaluation • Conclusion • Future Work
Motivation • problem: the well-known annotation bottleneck • solution: automatic metadata generation • goal: describe the semantics of tables in a model-theoretic way (F-Logic) • tables with different structure but the same meaning should have the same representation • benefit: enables e.g. query answering • all conferences where ‘Prof. Studer’ is on the PC • all tours to COUNTRY at DATE where price < AMOUNT
Foundation: Table Model • dimensions of table model [Hurst’00] • graphical (image processing) • physical (inter-cell relative location) • structural (organization of cells indicating their navigational relationship) • functional (purpose of regions in terms of data access) • two functional cell types: A-cell and I-cell • two functional I-cell roles: data and access • semantic (relation between cell content, structure and orientation) • frame makes explicit • the meaning of the cell contents (F-Logic concepts) • the functional dimension of the table (method signature) • the semantic dimension of the table (frame structure) • example:
Table model • legend: A-cell • I-cell (access) • I-cell (data)
Simple Table Classes • 1-Dimensional • 2-Dimensional
Complex Table Classes • 1. Over-expanded labels • 2. Partition labels • 3. Combination (running example)
Methodology • the methodology instantiates the table model stepwise • main differences: • does not consider the graphical component • extends the semantic component
Cleaning & Normalization • construct an initial matrix structure • DOM tree • cleaning: fix syntactic errors (CyberNeko HTML parser) • normalization: align the table by resolving cells that span multiple rows/columns (colspan, rowspan) • example:
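The normalization step can be illustrated with a minimal sketch. It assumes each cell is already available as a (text, rowspan, colspan) triple and that a spanning cell's content is copied into every position it covers; both the input shape and the function name `expand_spans` are illustrative assumptions, not taken from the slides.

```python
def expand_spans(rows):
    """Expand rowspan/colspan cells into a rectangular matrix.

    rows: list of table rows; each cell is a (text, rowspan, colspan) triple.
    Assumption: the text of a spanning cell is duplicated into every
    position it covers.
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already occupied by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions never covered by any cell become empty strings.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Illustrative input loosely based on the running example's labels.
raw = [
    [("Class", 2, 1), ("Adult", 1, 2)],
    [("Single Room", 1, 1), ("Double Room", 1, 1)],
    [("Economic", 1, 1), ("35,450", 1, 1), ("32,500", 1, 1)],
]
matrix = expand_spans(raw)
```

After expansion every row has the same number of cells, so later steps can address the table by (row, column).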
Structure Detection • detecting table orientation: • relies on the similarity of cells (size, content, token types) • intuition: • if rows are similar, then the orientation is vertical (top-to-bottom) • if columns are similar, then the orientation is horizontal (left-to-right) • initialize logical units and regions • split the table into LUs • group same-sized, similar cells into regions within LUs
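The orientation intuition can be sketched roughly as follows. This is a simplified stand-in: it compares only a coarse token type (NUMBER vs. TEXT) of adjacent rows and adjacent columns, whereas the slides also mention cell size and content. The function names and the tie-breaking choice (vertical on a tie) are assumptions.

```python
import re


def token_type(cell):
    """Very coarse token type: NUMBER for digit strings, TEXT otherwise."""
    return "NUMBER" if re.fullmatch(r"[\d.,]+", cell.strip()) else "TEXT"


def line_similarity(a, b):
    """Fraction of positions where two cell lists share a token type."""
    if not a:
        return 0.0
    same = sum(token_type(x) == token_type(y) for x, y in zip(a, b))
    return same / len(a)


def mean_adjacent_similarity(lines):
    """Average similarity over all pairs of neighboring lines."""
    pairs = list(zip(lines, lines[1:]))
    if not pairs:
        return 0.0
    return sum(line_similarity(a, b) for a, b in pairs) / len(pairs)


def detect_orientation(matrix):
    """Return 'vertical' if rows resemble each other more than columns do."""
    rows = [list(r) for r in matrix]
    cols = [list(c) for c in zip(*matrix)]
    row_sim = mean_adjacent_similarity(rows)
    col_sim = mean_adjacent_similarity(cols)
    # Assumption: ties are resolved in favor of vertical orientation.
    return "vertical" if row_sim >= col_sim else "horizontal"


table = [["Tour", "Price"], ["Rome", "120"], ["Paris", "340"]]
transposed = [list(c) for c in zip(*table)]
```

Here `detect_orientation(table)` compares the three rows with each other and the two columns with each other; the transposed table swaps the two scores.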
Discovery of Regions • do while the distribution in the LU is not uniform (uniformity: the logical unit consists of logical sub-units, where each sub-unit includes only regions of the same size and orientation) • choose the best coherent region • it is used to propagate and normalize the neighboring regions • normalize the logical sub-unit • choose neighboring regions (i.e. only within the same rows for vertical orientation) • example:
Building FTM • functional table model • regions as nodes arranged in a tree • properties of leaf nodes: • are only regions consisting exclusively of I-cells • are assigned their functional role (access, data) • are assigned two semantic labels: • a label describing the content of the region (instances) • a label that combines the region label with the labels of its parent A-cell nodes • inner nodes are either regions consisting of A-cells or ‘connection’ nodes (e.g. root) • construction of FTM • bottom-up approach (from the lowest logical unit upwards) • description through an example
Building FTM • type of the (colored) logical unit = I-cells only • regions are turned into leaves • semantic labels and roles are set to a default value • [Figure: leaves of the running example, each with placeholder <label> and <role>: Adult/Child; Single Room, Double Room, Extra Bed, Occupation, No Occupat…, Extra Bed; two regions of prices]
Building FTM • type of the (colored) logical unit = A-cells only • regions are turned into inner nodes and connected to the appropriate sub-nodes (leaves) • [Figure: inner nodes Class/Price, Economic, Extended placed above the leaves of the running example]
Building FTM • type of the (colored) logical unit = special case • close a subtree by inserting a ‘connection’ node, which reflects a logical separation in the table (transition from an LU with only A-cells to an LU with I-cells) • assign functional roles to leaves within a connected sub-tree: • the functional role access is assigned to all consecutive leaves (from the left) that together form a unique identifier (key); all other leaves are assigned the functional role data • (possible) change of reading orientation in the new logical unit • [Figure: a Connection Node closes the sub-tree below Class/Price, Economic, Extended; its leaves now carry the roles access, access, data, data; the next logical unit contains DP9LAX01AB and 01.05.2004 - 30.09.2004]
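The role assignment rule (leading leaves that together form a key get access, the rest get data) can be sketched as below. The representation of a leaf as a (label, values) pair, the function name `assign_roles`, and the fallback when no prefix is unique (all leaves become access) are assumptions made for illustration; the sample values are illustrative as well.

```python
def assign_roles(leaves):
    """Assign 'access' to the shortest left prefix of leaves that forms a key.

    leaves: list of (label, values) pairs in left-to-right order, where
    values holds one entry per row of the logical unit.
    Assumption: if no prefix is unique, every leaf is marked 'access'.
    """
    n_rows = len(leaves[0][1]) if leaves else 0
    key_len = len(leaves)
    for k in range(1, len(leaves) + 1):
        # Build the tuple of the first k values for every row.
        prefixes = [tuple(values[r] for _, values in leaves[:k])
                    for r in range(n_rows)]
        if len(set(prefixes)) == n_rows:
            key_len = k
            break
    return [(label, "access" if i < key_len else "data")
            for i, (label, _) in enumerate(leaves)]


# Illustrative leaves; labels follow the running example, values are samples.
leaves = [
    ("Person", ["Adult", "Adult", "Child", "Child"]),
    ("Room", ["Single Room", "Double Room", "Single Room", "Double Room"]),
    ("Price", ["35,450", "32,500", "2,510", "1,430"]),
]
roles = assign_roles(leaves)
```

Person alone repeats across rows, but Person and Room together identify each row, so the first two leaves form the key.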
Building FTM • type of the (colored) logical unit = A-cells only • regions are turned into inner nodes and connected to the appropriate sub-nodes (leaves) • finally, connect all unconnected nodes to a root node • [Figure: Root node above the inner nodes Tour Code and Valid (leaves DP9LAX01AB and 01.05.2004 - 30.09.2004) and the Connection Node with Class/Price, Economic, Extended]
Building FTM • recapitulation of FTM: • consider multiple-level sub-trees for merging • conditions: same tree structure and at least one level of matching A-cells • merging step: • merge nodes at the same position and level (leaf and inner nodes) • if merged inner nodes (A-cells) are not equal • find a semantic label of a new merged node • create a new leaf node (with A-cells as values) • assign functional role of the new leaf to access • example:
Building FTM • [Figure: merging in the running example. Before: two sub-trees below connection nodes, with the A-cells Class/Price, Economic, Extended above the leaves Adult/Child, Single Room, Double Room, Extra Bed, Occupation, No Occupat…, Extra Bed and the prices. After: nodes Class and Price, with a new access leaf holding Economic, Extended and a data leaf holding the prices]
Semantic Enriching of FTM • find semantic labels for regions by consulting: • WordNet lexical ontology: use synsets to find hypernyms • GoogleSets service: additional way to find synonyms • transformations of a region’s cell labels: • punctuation removal • stopword removal • compute IDF (a document is a cell) for each word, and filter out the words with a value lower than the threshold • select words that appear at the end of the labels (the nominal head of a nominal compound is at the end) • query GoogleSets with the remaining words to filter out the ones that are not mutually similar
Semantic Enriching of FTM • assign each leaf its semantic label that describes the content (instances) of the region • [Figure: FTM of the running example with content labels Person, Room, Date, Type on the leaves; inner nodes Root, Connection Node, Tour Code, Valid, Class, Price]
Final FTM • (final) semantic labels of leaves: • label is a combination of a region label and the labels of its parent A-cell nodes • [Figure: final FTM of the running example with leaf labels Code, DateValid, PersonClass, RoomClass, TypePrice, Price]
Map FTM to a Frame • method is a tuple • frame is a pair • generation of a frame • create a method m for every leaf node whose functional role is data • parameters of m are all leaf nodes with the functional role access that are located on the same level of m’s sub-tree or on m’s parent path towards the root node • set the range of m according to the syntactic token type of its region • names for parameters and methods are obtained from the final FTM • example: Tour [ Code => ALPHANUMERIC; DateValid => DATE; Price (PersonClass, RoomClass, TypePrice) => LARGE_NUMBER].
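A minimal sketch of the generation rule, under stated assumptions: each leaf records the path of node ids from the root to its parent (`parent_path`, a hypothetical field), and an access leaf becomes a parameter of a data leaf when its parent lies on that data leaf's path to the root. The tree layout used here (Code and DateValid directly under the root, the remaining leaves under one connection node) is chosen only so that the output reproduces the example frame; it is not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    name: str           # final semantic label of the leaf
    role: str           # "access" or "data"
    token_type: str     # syntactic token type of the region
    parent_path: tuple  # hypothetical: node ids from root down to the parent


@dataclass
class Method:
    name: str
    range_: str
    params: list


def build_frame(frame_name, leaves):
    """Create one method per data leaf; access leaves in scope become parameters."""
    access = [leaf for leaf in leaves if leaf.role == "access"]
    methods = []
    for leaf in leaves:
        if leaf.role != "data":
            continue
        # In scope: the access leaf's parent is the data leaf's parent
        # or one of its ancestors (its parent path is a prefix).
        params = [a.name for a in access
                  if leaf.parent_path[:len(a.parent_path)] == a.parent_path]
        methods.append(Method(leaf.name, leaf.token_type, params))
    return frame_name, methods


def render(frame):
    """Render the (name, methods) pair in an F-Logic-like notation."""
    name, methods = frame
    parts = []
    for m in methods:
        sig = f"{m.name}({', '.join(m.params)})" if m.params else m.name
        parts.append(f"{sig} => {m.range_}")
    return f"{name}[{'; '.join(parts)}]."


leaves = [
    Leaf("Code", "data", "ALPHANUMERIC", ("root",)),
    Leaf("DateValid", "data", "DATE", ("root",)),
    Leaf("PersonClass", "access", "TOKEN", ("root", "conn")),
    Leaf("RoomClass", "access", "TOKEN", ("root", "conn")),
    Leaf("TypePrice", "access", "TOKEN", ("root", "conn")),
    Leaf("Price", "data", "LARGE_NUMBER", ("root", "conn")),
]
frame_text = render(build_frame("Tour", leaves))
```

Keeping the frame as a plain (name, methods) pair mirrors the "frame is a pair" bullet; the token types given to the access leaves are placeholders and do not appear in the rendered frame.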
Evaluation • task: • for each table, compare the automatically generated frame against two manually created frames • measure in terms of Precision, Recall and F-measure • dataset: • consists of 21 tables: 3 tables for each simple table class (1D, 2D) and 5 tables for each complex table class • tourism domain • annotators: • 14 subjects • each subject annotated 3 tables, each belonging to a different table class • 14 subjects × 3 tables = 42 annotations = 21 tables × 2 annotations each
Evaluation • performed along the following 4 functions: • example: [m1 (X, Y) => INTEGER] vs. [method1 (X, YY, W) => INTEGER] • syntactic correctness: • how well the functional dimension of the table is captured (SynC=2/3) • strict comparison: • calculate how identical the name_M, range_M, and P_M identifiers of methods are (P=2/4, R=2/5) • soft comparison: • for soft matching we used a combination of TFIDF and the Jaro-Winkler string distance scheme [Cohen et al., 2003] • calculate soft matching for identifiers of methods (P=3/4, R=3/5, where ‘Y’≈‘YY’) • conceptual comparison: • conceptually equivalent identifiers have been determined (e.g. ‘RegionType’=‘Region’=‘Location’) • calculate conceptual matching for identifiers of methods (P=4/4, R=4/5, where ‘m1’≈‘method1’)
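The precision and recall figures of the example can be reproduced with a small sketch. Two simplifications are assumed: identifiers are matched greedily one-to-one, and soft matching uses Python's `difflib.SequenceMatcher` with a threshold of 0.6 as a stand-in for the TFIDF and Jaro-Winkler combination named on the slide. The equivalence table `CONCEPTUAL` is hand-written for this example, and syntactic correctness (SynC) is not modeled.

```python
from difflib import SequenceMatcher


def count_matches(generated, reference, same):
    """Greedy one-to-one matching of identifiers under the predicate `same`."""
    unused = list(reference)
    hits = 0
    for g in generated:
        for r in unused:
            if same(g, r):
                unused.remove(r)
                hits += 1
                break
    return hits


def precision_recall(generated, reference, same):
    hits = count_matches(generated, reference, same)
    return hits / len(generated), hits / len(reference)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def strict(a, b):
    return a == b


def soft(a, b, threshold=0.6):
    # Stand-in similarity, not the TFIDF/Jaro-Winkler scheme from the slide.
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Hand-written equivalences for this example only.
CONCEPTUAL = {frozenset(("m1", "method1"))}


def conceptual(a, b):
    return soft(a, b) or frozenset((a, b)) in CONCEPTUAL


# Identifiers of [m1 (X, Y) => INTEGER] and [method1 (X, YY, W) => INTEGER].
generated = ["m1", "X", "Y", "INTEGER"]
reference = ["method1", "X", "YY", "W", "INTEGER"]

for name, same in [("strict", strict), ("soft", soft), ("conceptual", conceptual)]:
    p, r = precision_recall(generated, reference, same)
    print(name, p, r, round(f_measure(p, r), 3))
```

Each comparison level only relaxes the matching predicate, so the three levels share the same counting code.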
Evaluation • performed from 2 aspects: • average: consider all frames • maximum: choose only the best manually created frame for each generated frame • results:
Conclusion • we have shown that our methodology instantiates the underlying table model stepwise • experiments show that: • from a conceptual point of view, the system gets appropriate names for frames in almost 75% of cases • it gets totally identical names in more than 50% of cases • we demonstrated and evaluated the successful automatic generation of frames from HTML tables
Future Work • generate one (most general) frame from multiple tables • reduction of complexity • population of ontologies with instances • show feasibility of approach in practical setting • use given ontology as background knowledge
TNX
Inter-annotator agreement • max(F_X) = F_conceptual ≈ 60% • only 2 totally identical frames (2/21 = 9.52%) • only 5 identical frames from a conceptual view (5/21 = 23.81%) • these 5 tables cover all 1D class tables and 2 (out of 3) 2D class tables • possible reasons for low agreement: • the annotators did not follow the guidelines precisely • the task itself is hard • the annotation guidelines were not clear/detailed enough • actual results:
Example 1
• Generated Frame: Tour [ Name (Code) => TOKEN; Price (Code) => CURRENCY; Hotel (Code) => TOKEN; Meal (Code) => TOKEN ]
• Annotator 1: Tour [ TourCode => ALPHANUMERIC; TourName => TOKEN; Price => CURRENCY; Hotel => TOKEN; Meal => TOKEN ]
• Annotator 2: TourCode [ TourName => TOKEN; Price => CURRENCY; Hotel => ALPHANUMERIC; Meal => ALPHANUMERIC ]
Example 2
• Generated Frame: Trip [ Cost (TimePeriod) => CURRENCY; Insurance (TimePeriod) => CURRENCY ]
• Annotator 1: Trip [ Cost (Duration) => CURRENCY; Insurance (Duration) => CURRENCY ]
• Annotator 2: Trip [ Duration => ALPHANUMERIC; DurationType => ALPHANUMERIC; Cost => CURRENCY; Insurance => CURRENCY ]
Example 3
• Generated Frame: Transportation [ Description (Transportation) => STRING; HalfDay (Transportation) => CURRENCY; FullDay (Transportation) => CURRENCY; HoursHakone (Transportation) => CURRENCY ]
• Annotator 1: Transportation [ Vehicle => ALPHANUMERIC; Seats => NUMBER; WheelChairs => NUMBER; JumpSeats => NUMBER; Baggage => NUMBER; Toilet => NUMBER; Duration (TourType) => NUMBER; Cost (TourType) => CURRENCY ]