From Tables To Frames

The Third International Semantic Web Conference - ISWC 2004 November 07 – 11, 2004, Hiroshima, Japan. From Tables To Frames. Aleksander Pivk 1,2 , Philipp Cimiano 2 , York Sure 2 1 Jozef Stefan Institute, Ljubljana, Slovenia 2 AIFB Institute, University of Karlsruhe, Karlsruhe. 09.11.2004.


  1. The Third International Semantic Web Conference - ISWC 2004, November 07 – 11, 2004, Hiroshima, Japan • From Tables To Frames • Aleksander Pivk (1,2), Philipp Cimiano (2), York Sure (2) • (1) Jozef Stefan Institute, Ljubljana, Slovenia • (2) AIFB Institute, University of Karlsruhe, Karlsruhe • 09.11.2004

  2. Outline • Motivation • Foundation: Table Model • Methodology • Evaluation • Conclusion • Future Work

  3. Motivation • problem: well-known annotation bottleneck • solution: automatic metadata generation • goal: describe the semantics of tables in a model-theoretic way (F-Logic) • tables with different structure but the same meaning (should) have the same representation • benefit: enable e.g. query answering • all conferences where ‘prof. Studer’ is in the PC • all tours to COUNTRY at DATE where price<AMOUNT

  4. Foundation: Table Model • dimensions of table model [Hurst’00] • graphical (image processing) • physical (inter-cell relative location) • structural (organization of cells indicating their navigational relationship) • functional (purpose of regions in terms of data access) • two functional cell types: A-cell and I-cell • two functional I-cell roles: data and access • semantic (relation between cell content, structure and orientation) • frame makes explicit • the meaning of the cell contents (F-Logic concepts) • the functional dimension of the table (method signature) • the semantic dimension of the table (frame structure) • example:

  5. Table model • legend: A-cell, I-cell (access), I-cell (data)

  6. Simple Table Classes • 1-Dimensional • 2-Dimensional

  7. Complex Table Classes • 1. Over-expanded labels • 2. Partition labels • 3. Combination (running example)

  8. Methodology • the methodology instantiates the table model stepwise • main differences: • do not consider the graphical component • extend the semantic component

  9. Cleaning & Norm. • construct an initial matrix structure • DOM tree • cleaning: syntactic errors (CyberNeko HTML parser) • normalization: aligning the table, resorting cells spanning multiple rows/columns (colspan, rowspan) • example:
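The normalization step on the slide above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the table has already been parsed into rows of (text, rowspan, colspan) tuples, the function name `expand_spans` is hypothetical, and the sample table is a simplified layout using a few values from the running example. Each spanning cell is copied into every grid position it covers, which yields a rectangular matrix.

```python
def expand_spans(rows):
    """Expand (text, rowspan, colspan) cells into a rectangular matrix.

    Assumption: `rows` is non-empty and every cell is a 3-tuple.
    """
    grid = {}  # (row, col) -> cell text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already occupied by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell into every position it spans.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions never covered by any cell are filled with an empty string.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Illustrative input, simplified from the running example.
table = [
    [("Class/Price", 2, 1), ("Economic", 1, 2)],
    [("Adult", 1, 1), ("Child", 1, 1)],
    [("Single Room", 1, 1), ("35,450", 1, 1), ("25,800", 1, 1)],
]
matrix = expand_spans(table)
```

After this step every row of `matrix` has the same length, so later steps can address cells by (row, column) without caring about the original spans.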

  10. Structure Detection • detecting table orientation: • rely on similarity of cells (size, content, token types) • intuition: • if rows are similar, then orientation is vertical (top-to-bottom) • if columns are similar, then orientation is horizontal (left-to-right) • initialize logical units and regions • split table into LUs • group same-sized, similar cells into regions within LUs
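The orientation intuition above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the method from the talk: it compares only token types (the slide also mentions size and content), the token categories and the tie-breaking rule are invented for the example, and the input must be a rectangular matrix with at least two rows and two columns.

```python
import re


def token_type(cell):
    """Very rough token typing; the categories are illustrative assumptions."""
    text = cell.strip()
    if re.fullmatch(r"[\d.,]+", text):
        return "NUMBER"
    if re.fullmatch(r"[A-Za-z ]+", text):
        return "WORD"
    return "OTHER"


def line_similarity(a, b):
    """Fraction of positions at which two equally long lines share a token type."""
    matches = sum(token_type(x) == token_type(y) for x, y in zip(a, b))
    return matches / len(a)


def avg_adjacent_similarity(lines):
    """Mean similarity over all pairs of neighbouring lines."""
    pairs = list(zip(lines, lines[1:]))
    return sum(line_similarity(a, b) for a, b in pairs) / len(pairs)


def detect_orientation(matrix):
    """Return 'vertical' if rows resemble each other more than columns do."""
    rows = matrix
    cols = [list(col) for col in zip(*matrix)]
    row_sim = avg_adjacent_similarity(rows)
    col_sim = avg_adjacent_similarity(cols)
    # Ties are resolved in favour of 'vertical'; this choice is an assumption.
    return "vertical" if row_sim >= col_sim else "horizontal"


body = [
    ["Single Room", "35,450", "2,510"],
    ["Double Room", "32,500", "1,430"],
    ["Extra Bed", "30,550", "720"],
]
orientation = detect_orientation(body)
transposed = detect_orientation([list(col) for col in zip(*body)])
```

In `body` every row has the pattern word, number, number, so rows are similar to each other and the table is read top to bottom; transposing the same data flips the result.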

  11. Discovery of Regions

  12. Discovery of Regions • do while (distribution in LU not uniform)(explanation of uniformity: logical unit consists of logical sub-units where each sub-unit includes only regions of same size and orientation) • choose the best coherent region • used to propagate and normalize the neighboring regions • normalize logical sub-unit • choose neighboring regions (i.e. only within same rows for vertical orientation) • example:

  13. Building FTM • functional table model • regions as nodes arranged in a tree • properties of leaf nodes: • are only regions consisting exclusively of I-cells • are assigned their functional role (access, data) • are assigned two semantic labels: • label describing the content of the region (instances) • label as a combination of a region label and parent A-cell nodes labels • inner nodes are either regions consisting of A-cells or ‘connection’ nodes (e.g. root) • construction of FTM • bottom-up approach (from lowest logical unit upwards) • description through an example

  14. Building FTM • type of the (colored) logical unit = I-cells only • regions are turned into leaves • semantic labels and roles are set to a default value • [figure: leaf regions with placeholder <label> and <role>, holding the Adult/Child cells, the room cells (Single Room, Double Room, Extra Bed, Occupation, No Occupat…, Extra Bed) and the price cells]

  15. Building FTM • type of the (colored) logical unit = A-cells only • regions turned into inner nodes and connected to appropriate sub-nodes (leaves) • [figure: inner node Class/Price (Economic, Extended) placed above the leaf regions]

  16. Building FTM • type of the (colored) logical unit = special case • close a subtree by inserting a ‘connection’ node which reflects a logical separation in the table (transition from a LU with only A-cells to a LU with I-cells) • assign functional roles to leaves within a connected sub-tree: • functional role access is assigned to all consecutive leaves (from the left) that together form a unique identifier (key); all other leaves are assigned the functional role data • (possible) change of reading orientation in the new logical unit • [figure: Connection Node closing the Class/Price (Economic, Extended) subtree, leaves marked access or data; further cells DP9LAX01AB and 01.05.2004 - 30.09.2004]
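The role assignment described above (consecutive leaves from the left that together form a key get access, the rest get data) can be sketched as below. This is one possible reading, not the authors' code: leaves are modelled as (label, values) columns with unique labels, the function name `assign_roles` is hypothetical, the sample values are illustrative, and falling back to "all leaves are access" when no shorter key exists is an assumption.

```python
def assign_roles(leaves):
    """Assign 'access' to the shortest left prefix of leaves that forms a key.

    leaves: list of (label, values) columns ordered left to right,
            all with the same number of values.
    """
    n_rows = len(leaves[0][1])
    key_width = len(leaves)  # fallback (assumption): every leaf is part of the key
    for width in range(1, len(leaves) + 1):
        # Build one tuple per row from the first `width` leaves.
        keys = {tuple(values[i] for _, values in leaves[:width]) for i in range(n_rows)}
        if len(keys) == n_rows:  # all rows are distinguishable: prefix is a key
            key_width = width
            break
    return {
        label: ("access" if i < key_width else "data")
        for i, (label, _) in enumerate(leaves)
    }


# Illustrative values only, loosely modelled on the running example.
leaves = [
    ("Person", ["Adult", "Adult", "Child", "Child"]),
    ("Room", ["Single Room", "Double Room", "Single Room", "Double Room"]),
    ("Price", ["35,450", "32,500", "2,510", "1,430"]),
]
roles = assign_roles(leaves)
```

Here `Person` alone does not identify a row, but `Person` together with `Room` does, so both become access leaves and `Price` becomes a data leaf.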

  17. Building FTM • type of the (colored) logical unit = A-cells only • regions turned into inner nodes and connected to appropriate sub-nodes (leaves) • finally, connect all unconnected nodes to a root node • [figure: tree with Root, Connection Node, Tour Code (DP9LAX01AB), Valid (01.05.2004 - 30.09.2004) and Class/Price (Economic, Extended)]

  18. Building FTM • recapitulation of FTM: • consider multiple-level sub-trees for merging • conditions: same tree structure and at least one level of matching A-cells • merging step: • merge nodes at the same position and level (leaf and inner nodes) • if merged inner nodes (A-cells) are not equal • find a semantic label of a new merged node • create a new leaf node (with A-cells as values) • assign functional role of the new leaf to access • example:

  19. Building FTM • [figure: two sub-trees under Connection Nodes with the same structure are merged; the A-cells Economic and Extended become the values of a new leaf Class (access) next to the leaf Price (data)]

  20. Semantic Enriching of FTM • find semantic labels for regions by consulting: • WordNet lexical ontology: use synsets to find hypernyms • GoogleSets service: additional way to find synonyms • transformations of region’s cell labels: • punctuation removal • stopword removal • compute IDF (document is a cell) for each word, and filter out the ones with a value lower than the threshold • select words that appear at the end of the labels (nominal head in the nominal compound is at the end) • query GoogleSets with the remaining words to filter out the ones that are not mutually similar
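The label transformations listed above can be sketched as a small pipeline. This is a hedged illustration rather than the authors' procedure: the stopword list, the natural-log IDF formula and the threshold values are assumptions, the function names are hypothetical, and the slide leaves open whether the IDF filter runs before or after picking the last word (this sketch keeps the last word of each cleaned label only if it passes the filter). The WordNet and GoogleSets lookups are not reproduced.

```python
import math
import string
from collections import Counter

# Illustrative stopword list (assumption, not from the talk).
STOPWORDS = {"a", "an", "the", "of", "in", "per", "for", "and", "or"}

# Map every punctuation character to a space so "Class/Price" stays two words.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenize(label):
    """Punctuation removal, lower-casing and stopword removal."""
    words = label.translate(_PUNCT_TO_SPACE).lower().split()
    return [w for w in words if w not in STOPWORDS]


def candidate_words(cells, idf_threshold):
    """Return the last words of the labels whose IDF reaches the threshold."""
    docs = [d for d in (tokenize(c) for c in cells) if d]  # each cell is a document
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))          # document frequency
    idf = {w: math.log(n / df[w]) for w in df}
    candidates = []
    for d in docs:
        head = d[-1]  # nominal head is assumed to be the last word
        if idf[head] >= idf_threshold and head not in candidates:
            candidates.append(head)
    return candidates


cells = ["Single Room", "Double Room", "Extra Bed."]
all_heads = candidate_words(cells, idf_threshold=0.0)
rare_heads = candidate_words(cells, idf_threshold=0.5)
```

With a threshold of 0.0 both heads survive; with 0.5 the word `room` (IDF of about 0.41, since it occurs in two of the three cells) is filtered out, which shows how strongly the result depends on the chosen threshold.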

  21. Semantic Enriching of FTM • assign each leaf its semantic label that describes the content (instances) of the region • [figure: tree with Root, Connection Node, Tour Code, Valid, Class and Price; leaf regions labelled Person, Room, Date and Type]

  22. Final FTM • (final) semantic labels of leaves: • label is a combination of a region label and parent A-cell nodes labels • [figure: final tree with leaf labels Code, DateValid, PersonClass, RoomClass and TypePrice]

  23. Map FTM to a Frame • method is a tuple • frame is a pair • generation of a frame • create a method m for every leaf node whose functional role is data • parameters of m are all leaf nodes with functional role access, where they must be located on the same level of m’s sub-tree or on m’s parent path towards the root node • set the range for m according to the syntactic token type of its region • names for parameters and methods are obtained from the final FTM • example: Tour [ Code => ALPHANUMERIC; DateValid => DATE; Price (PersonClass, RoomClass, TypePrice) => LARGE_NUMBER].
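The last step, writing a frame out in the notation of the example above, can be sketched as follows. This is only a rendering sketch under assumptions: the class `Method` and the function `render_frame` are hypothetical names, and the parameters of each method are taken as given, i.e. the tree traversal that selects the access leaves is not implemented here.

```python
from dataclasses import dataclass, field


@dataclass
class Method:
    """One method of a frame: name, range (token type) and parameter names."""
    name: str
    range_type: str
    params: list = field(default_factory=list)


def render_frame(frame_name, methods):
    """Render a frame in the F-Logic style notation used on the slide."""
    parts = []
    for m in methods:
        # Methods without parameters are written without parentheses.
        args = f" ({', '.join(m.params)})" if m.params else ""
        parts.append(f"{m.name}{args} => {m.range_type}")
    return f"{frame_name} [ " + "; ".join(parts) + "]."


methods = [
    Method("Code", "ALPHANUMERIC"),
    Method("DateValid", "DATE"),
    Method("Price", "LARGE_NUMBER", ["PersonClass", "RoomClass", "TypePrice"]),
]
frame = render_frame("Tour", methods)
print(frame)
```

For the three methods of the running example this reproduces the frame string shown on the slide.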

  24. Evaluation • task: • for each table compare automatically generated frame against two manually created frames • measure in terms of Precision, Recall and F-measure • dataset: • consists of 21 tables: 3 tables for each simple table class (1D, 2D) and 5 tables for each complex table class • tourism domain • annotators: • 14 subjects • each subject had to annotate 3 tables, each belonging to a different table class • (14x3=21x2=42)

  25. Evaluation • performed along the following 4 functions: • example: [m1 (X, Y) => INTEGER] vs. [method1 (X, YY, W) => INTEGER] • syntactic correctness: • how well the functional dimension of the table is captured (SynC=2/3) • strict comparison: • calculate how identical the nameM, rangeM, and PM identifiers of methods are (P=2/4, R=2/5) • soft comparison: • for soft matching we used a combination of TFIDF and the Jaro-Winkler string distance scheme [Cohen et al., 2003] • calculate soft matching for identifiers of methods (P=3/4, R=3/5, where ‘Y’≈‘YY’) • conceptual comparison: • conceptually equivalent identifiers have been determined (e.g. ‘RegionType’=‘Region’=‘Location’) • calculate conceptual matching for identifiers of methods (P=4/4, R=4/5, where ‘m1’≈‘method1’)
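The precision and recall figures in the example above can be re-derived with a short sketch. It is a simplification under assumptions: the function name `precision_recall` is hypothetical, each frame is flattened into a list of identifiers (method name, parameters, range), and the soft and conceptual equivalences are supplied by hand as pairs instead of being computed with TFIDF and Jaro-Winkler. Hits are counted from the generated side, which is sufficient for this example.

```python
def precision_recall(generated, manual, equivalent=()):
    """Precision and recall of generated identifiers against manual ones.

    equivalent: hand-supplied (generated, manual) pairs treated as matches.
    """
    def matches(g):
        return any(g == m or (g, m) in equivalent for m in manual)

    hits = sum(matches(g) for g in generated)
    return hits / len(generated), hits / len(manual)


# [m1 (X, Y) => INTEGER] vs. [method1 (X, YY, W) => INTEGER]
generated = ["m1", "X", "Y", "INTEGER"]
manual = ["method1", "X", "YY", "W", "INTEGER"]

strict = precision_recall(generated, manual)
soft = precision_recall(generated, manual, {("Y", "YY")})
conceptual = precision_recall(generated, manual, {("Y", "YY"), ("m1", "method1")})

print(strict)      # (0.5, 0.4)  i.e. P=2/4, R=2/5
print(soft)        # (0.75, 0.6) i.e. P=3/4, R=3/5
print(conceptual)  # (1.0, 0.8)  i.e. P=4/4, R=4/5
```

Each additional equivalence adds one hit, which is why precision and recall rise together from the strict to the conceptual comparison.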

  26. Evaluation • performed from 2 aspects: • average: consider all frames • maximum: choose only the best manually created frame for each generated frame • results:

  27. Conclusion • we have shown that our methodology stepwise instantiates the underlying table model • experiments show that: • from a conceptual point of view the system gets appropriate names for frames in almost 75% of cases • it gets totally identical names in more than 50% of cases • we demonstrated and evaluated the successful automatic generation of frames from HTML tables

  28. Future Work • generate one (most general) frame from multiple tables • reduction of complexity • population of ontologies with instances • show feasibility of approach in practical setting • use given ontology as background knowledge

  29. TNX

  30. Inter-annotator agreement • max (FX)=Fconceptual ≈60% • only 2 totally identical frames (2/21=9.52%) • only 5 identical frames from a conceptual view (5/21=23.81%) • these 5 tables cover all 1D class tables and 2 (out of 3) 2D class tables • possible reasons for low agreement: • the annotators did not follow the guidelines precisely • the task itself is hard • the annotation guidelines were not clear/detailed enough • actual results:

  31. Example 1

  32. Example 1
      • Generated Frame:
        Tour [ Name (Code) => TOKEN
               Price (Code) => CURRENCY
               Hotel (Code) => TOKEN
               Meal (Code) => TOKEN ]
      • Annotator 1:
        Tour [ TourCode => ALPHANUMERIC
               TourName => TOKEN
               Price => CURRENCY
               Hotel => TOKEN
               Meal => TOKEN ]
      • Annotator 2:
        TourCode [ TourName => TOKEN
                   Price => CURRENCY
                   Hotel => ALPHANUMERIC
                   Meal => ALPHANUMERIC ]

  33. Example 2

  34. Example 2
      • Generated Frame:
        Trip [ Cost (TimePeriod) => CURRENCY
               Insurance (TimePeriod) => CURRENCY ]
      • Annotator 1:
        Trip [ Cost (Duration) => CURRENCY
               Insurance (Duration) => CURRENCY ]
      • Annotator 2:
        Trip [ Duration => ALPHANUMERIC
               DurationType => ALPHANUMERIC
               Cost => CURRENCY
               Insurance => CURRENCY ]

  35. Example 3

  36. Example 3
      • Generated Frame:
        Transportation [ Description (Transportation) => STRING
                         HalfDay (Transportation) => CURRENCY
                         FullDay (Transportation) => CURRENCY
                         HoursHakone (Transportation) => CURRENCY ]
      • Annotator 1:
        Transportation [ Vehicle => ALPHANUMERIC
                         Seats => NUMBER
                         WheelChairs => NUMBER
                         JumpSeats => NUMBER
                         Baggage => NUMBER
                         Toilet => NUMBER
                         Duration (TourType) => NUMBER
                         Cost (TourType) => CURRENCY ]
