The XML-based Enterprise Information Portal Solutions Company. Extracting Knowledge from XML Documents Using Topic Maps. Eric Freese, Director of Professional Services - Midwest Region, ISOGEN International/DataChannel. Knowledge Technologies 2001 – Austin, TX, 7 March 2001.
Premise • Rules and procedures can be established that allow automated harvesting of information from structured documents (XML) into a knowledge base by using the structure and the relationships between the structural components • Topic maps can be used as the interchange and management model for knowledge bases • New knowledge can be inferred within a knowledge base using defined inference rules
Overview • Late Breaking News • Topic Maps • Knowledge Representation/Semantic Networks • Topic Map Constructs for Semantic Networks • SemanText - Example Application • Conclusions
Late Breaking News • RDF and Topic Maps • XML Topic Maps (XTM)
Topic Maps and Semantic Nets • A Topic Map is a mechanism for describing and representing data about the structure and content of an information set, using topics, associations, and occurrences. • A Semantic Network is a knowledge representation technique consisting of nodes and links.
Topic Maps • ISO/IEC 13250:2000 Document description and processing languages – Topic Maps • TopicMaps.Org – XML Topic Maps (XTM) • Topic maps are optimized for navigation of large amounts of data • They are similar to indexes in the paper publishing world • A topic map can also be compared to a glossary, cross-reference, thesaurus, or catalog
Topics • Topics are the basic building blocks of topic maps • A topic is anything a user wants to describe • A topic can have zero or many links to occurrences within an information set • A topic can be used to aggregate all the information about a subject within the information set • Topics are categorized using topic types • Topics can have multiple types • Types are defined using topics
Family Tree Example: Topics (diagram not reproduced)
Family Tree Example (diagram not reproduced; labels: husband, child)
Associations • Associations relate topics together • They express a semantic relationship between topics • An association can be defined as an instance of a specific topic • Topics are members of and have roles within associations • Association role types are topics
Family Tree Example (diagram not reproduced; association types: is parent of / is child of, is spouse of, is sibling of)
Occurrences • Occurrences provide links from the topic map into the information set • Occurrences also provide an internal means for describing topics in the topic map • An occurrence can have only one type • Occurrence roles are topics
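The three constructs described so far (topics, associations, occurrences) can be pictured with a minimal Python sketch. The class names, field names, the family member "pat", and the example.org locator are assumptions made for illustration; this is not the ISO 13250 or XTM data model, nor any library's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Occurrence:
    type_id: str   # an occurrence has exactly one type
    locator: str   # link into the information set (hypothetical URI)


@dataclass
class Topic:
    id: str
    name: str
    type_ids: List[str] = field(default_factory=list)            # a topic can have several types
    occurrences: List[Occurrence] = field(default_factory=list)  # zero or many


@dataclass
class Association:
    type_id: str
    members: Dict[str, List[str]]  # role type id -> ids of the topics playing that role


topics = {
    "person": Topic("person", "Person"),
    "eric": Topic("eric", "Eric", type_ids=["person"]),
    "pat": Topic("pat", "Pat", type_ids=["person"]),  # hypothetical family member
}
topics["eric"].occurrences.append(
    Occurrence("biography", "http://example.org/eric.xml")
)

# Both spouses play the same role in the association.
marriage = Association("marriage", {"spouse": ["eric", "pat"]})

# Types are themselves topics, so they can be looked up like any other topic.
types_are_topics = all(t in topics for t in topics["eric"].type_ids)
```

Keeping types as plain topic ids mirrors the slide's point that types are defined using topics; a fuller model would do the same for role types and occurrence types.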
Topic Scopes and Themes • Themes can be defined which can be used to group topics on a broader scale than types • Themes can also be viewed as filters for topic information • Scopes can be assigned to topic characteristics, associations and occurrences which call the themes into effect • Themes and scopes are used to disambiguate topics
Semantic Network Architecture • A semantic network is drawn as a series of nodes connected by links • Nodes represent objects, concepts, or situations within a specific domain • Links represent relationships between nodes • Specialized computer languages (such as Prolog) have been developed which can model and process the logic within a semantic network • A semantic network can be used as the basis for developing the facts and rules of an expert system
Associative Properties • The links within a semantic network may have the following properties: • Reflexive: a topic can have the association applied to itself • Symmetric: the association is true no matter the position of the topics (the topics are often of the same or related types) • Transitive: the association can be derived based on other associations
Examples • Reflexive: Spouse is married to spouse • Symmetric: Husband is married to wife AND wife is married to husband • Transitive: Fathers are parents AND Eric is a father, SO Eric is a parent
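The symmetric and transitive examples above can be sketched as closure operations over sets of (from, to) pairs. This is an illustrative simplification, not SemanText's implementation: the function names are hypothetical, and the "is a" relation deliberately treats the class/instance and superclass/subclass links in the transitive example as one kind of link.

```python
def symmetric_closure(pairs):
    """Add (b, a) for every (a, b) in the relation."""
    return set(pairs) | {(b, a) for a, b in pairs}


def transitive_closure(pairs):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are both present."""
    closure = set(pairs)
    while True:
        derived = {(a, d) for a, b in closure for c, d in closure if b == c}
        if derived <= closure:  # nothing new was derived, so we are done
            return closure
        closure |= derived


# Symmetric: husband is married to wife AND wife is married to husband.
married_to = symmetric_closure({("husband", "wife")})

# Transitive: Eric is a father AND fathers are parents, so Eric is a parent.
is_a = transitive_closure({("eric", "father"), ("father", "parent")})
```

Storing the property on the association type, rather than writing out every derived pair by hand, is what lets a processor apply this kind of closure automatically.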
Semantic Network Relationships • Typically binary: one node at each end of a link • N-ary relationships can be broken down into binary relationships • Example: "Austin, Texas is a city in the United States" breaks down into: • Austin, Texas is a city • Geographic regions (cities) are located in geographic regions (countries) • United States is a country
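The decomposition above can be written as binary (subject, link, object) triples. In this sketch the triple layout and the helper names `types_of` and `link_allowed` are assumptions for illustration; the check simply asks whether a type-level link exists that would permit an instance-level one.

```python
# Binary facts taken from the Austin example, stored as (subject, link, object).
facts = {
    ("Austin, Texas", "is a", "city"),
    ("United States", "is a", "country"),
    ("city", "located in", "country"),  # type-level link between the two classes
}


def types_of(node):
    """Return every type the node is declared to be an instance of."""
    return {obj for subj, link, obj in facts if subj == node and link == "is a"}


def link_allowed(subject, link, obj):
    """True if some type of `subject` has `link` to some type of `obj`."""
    return any(
        (subject_type, link, object_type) in facts
        for subject_type in types_of(subject)
        for object_type in types_of(obj)
    )


austin_in_us = link_allowed("Austin, Texas", "located in", "United States")
us_in_austin = link_allowed("United States", "located in", "Austin, Texas")
```

Here `austin_in_us` is true because cities are located in countries, while the reversed statement finds no supporting type-level link.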
Topic Maps vs. Semantic Networks • Commonalities between topic maps and semantic networks: • Both are organized into a network of information nodes or modules. • Both allow the user to model links between the nodes. • Both allow the user to attach semantic information to the nodes and the links. • One basic difference: • Topic maps focus on navigation between topics. • Semantic networks focus on the links/associations between the nodes and the knowledge represented by the linked nodes.
Harvesting Knowledge from Structured Information • XML provides a way of attaching semantics to pieces of information through markup • Markup (element names, attribute values) can be used to define or identify topic types • Associations between different pieces of information can be determined by structural relationships • XPath can be used to denote the structural components
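The harvesting idea above can be sketched with Python's standard `xml.etree.ElementTree`, whose `findall` accepts a limited XPath subset. The sample document, its element names, the name "Alex", and the `harvest` function are assumptions for illustration; they are not taken from SemanText.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; element names double as topic types.
SAMPLE = """
<family name="Freese">
  <parent id="eric"><name>Eric</name></parent>
  <child id="kid1"><name>Alex</name></child>
</family>
"""


def harvest(xml_text, topic_path=".//*[@id]"):
    """Turn identified elements into topics and structure into associations."""
    root = ET.fromstring(xml_text)
    topics, associations = {}, []

    # Every element carrying an id becomes a topic typed by its element name.
    for elem in root.findall(topic_path):
        topics[elem.get("id")] = {"type": elem.tag, "name": elem.findtext("name")}

    # Structural rule: a <parent> and a <child> inside the same <family>
    # are related by an "is parent of" association.
    for family in root.iter("family"):
        for parent in family.findall("parent"):
            for child in family.findall("child"):
                associations.append(("is parent of", parent.get("id"), child.get("id")))
    return topics, associations


topics, associations = harvest(SAMPLE)
```

The XPath expression is passed in as a parameter so that the rule for what counts as a topic can change without touching the harvesting code.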
Topic Map Constructs for Semantic Nets • Published Subject Identifiers • Topic Map Templates/Association Templates • Type Hierarchies/Ontologies • Association Types • Association Properties • Association Occurrences • Inference Rules
Published Subject Identifiers (PSIs) • Allow an identifier to be attached to a subject so that it can be named and referenced unambiguously • XTM identifies a core set of PSIs for the main building blocks of topic maps as well as selected association types • Two topics that refer to the same subject are merged automatically • Examples: http://www.topicmaps.org/xtm/1.0/psi1.xtm#superclass-subclass, http://www.topicmaps.org/xtm/1.0/psi1.xtm#superclass, http://www.topicmaps.org/xtm/1.0/psi1.xtm#subclass
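Merging on a shared subject identifier can be sketched as follows. This is a simplified illustration, assuming each topic carries exactly one subject identifier and only names need combining; the dictionary layout, the topic ids, and the name "Supertype" are hypothetical, and real XTM merging covers more cases.

```python
PSI_SUPERCLASS = "http://www.topicmaps.org/xtm/1.0/psi1.xtm#superclass"
PSI_SUBCLASS = "http://www.topicmaps.org/xtm/1.0/psi1.xtm#subclass"


def merge_by_subject(topics):
    """Merge topics that share a subject identifier; the first topic id wins."""
    merged = {}
    for topic in topics:
        target = merged.setdefault(
            topic["subject"],
            {"id": topic["id"], "subject": topic["subject"], "names": []},
        )
        for name in topic["names"]:
            if name not in target["names"]:  # keep each name once
                target["names"].append(name)
    return list(merged.values())


# Two maps describe the same subject under different ids and names.
result = merge_by_subject([
    {"id": "t1", "subject": PSI_SUPERCLASS, "names": ["Superclass"]},
    {"id": "x9", "subject": PSI_SUPERCLASS, "names": ["Supertype"]},
    {"id": "t2", "subject": PSI_SUBCLASS, "names": ["Subclass"]},
])
```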
Templates/Schemas • Define semantics contained within an association • Define constraints on the creation of semantically valid topic map structures • Provide roadmaps for creation of topic map structures • Defined using regular topic map syntax • Future work may include definition of extents (cardinality, time/date)
Templates/Schemas – cont.
<topic id="marriage.schema">
  <instanceOf><topicRef xlink:href="#association.class"/></instanceOf>
  <instanceOf><topicRef xlink:href="#schema"/></instanceOf>
  <baseName><baseNameString>Marriage</baseNameString></baseName>
  <occurrence>
    <instanceOf><topicRef xlink:href="#association.property"/></instanceOf>
    <resourceRef xlink:href="#reflexive"/>
  </occurrence>
  <occurrence id="minimum.spouses">
    <instanceOf><topicRef xlink:href="#minimum.occurrences"/></instanceOf>
    <resourceData>2</resourceData>
  </occurrence>
  <occurrence id="maximum.spouses">
    <instanceOf><topicRef xlink:href="#maximum.occurrences"/></instanceOf>
    <resourceData>2</resourceData>
  </occurrence>
</topic>
Templates/Schemas – cont.
<association>
  <instanceOf><topicRef xlink:href="#marriage.schema"/></instanceOf>
  <scope><topicRef xlink:href="#schema"/></scope>
  <member>
    <roleSpec><topicRef xlink:href="#spouse"/></roleSpec>
    <resourceRef xlink:href="#minimum.spouses"/>
    <resourceRef xlink:href="#maximum.spouses"/>
  </member>
</association>
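One way to read the marriage template above is as a cardinality constraint: the spouse role must have at least two and at most two players. The sketch below checks an association against such a constraint. The `SCHEMAS` dictionary, the `validate` function, and the player ids are hypothetical stand-ins for whatever a real processor would build from the template topics.

```python
# Hypothetical in-memory form of the template:
# association type -> role -> (minimum players, maximum players)
SCHEMAS = {"marriage": {"spouse": (2, 2)}}


def validate(assoc_type, members):
    """Return a list of constraint violations for one association instance."""
    schema = SCHEMAS[assoc_type]
    errors = []
    for role, (minimum, maximum) in schema.items():
        count = len(members.get(role, []))
        if not minimum <= count <= maximum:
            errors.append(
                f"{role}: expected {minimum}..{maximum} players, found {count}"
            )
    for role in members:
        if role not in schema:  # role the template does not define
            errors.append(f"{role}: role not allowed in {assoc_type}")
    return errors


ok = validate("marriage", {"spouse": ["eric", "pat"]})
too_few = validate("marriage", {"spouse": ["eric"]})
```

Returning a list of violations rather than raising on the first one lets an editing tool show every problem with a structure at once.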
Type Hierarchies/Ontologies • Hierarchies allow ontologies to be developed by which additional knowledge can be inferred simply through hierarchical inheritance • Templates can be used to control or enhance the ontology
Type Hierarchies/Ontologies – cont.
<topic id="person">
  <instanceOf><topicRef xlink:href="#topic.class"/></instanceOf>
  <baseName><baseNameString>Person</baseNameString></baseName>
</topic>
<topic id="male">
  <instanceOf><topicRef xlink:href="#topic.class"/></instanceOf>
  <baseName><baseNameString>Male</baseNameString></baseName>
</topic>
<topic id="eric">
  <instanceOf><topicRef xlink:href="#male"/></instanceOf>
  <instanceOf><topicRef xlink:href="#person"/></instanceOf>
  <baseName><baseNameString>Eric</baseNameString></baseName>
</topic>
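The example above declares both types of "eric" explicitly. With a superclass/subclass link between the two classes, the second type could instead be inferred by inheritance, as in this sketch. The `SUPERCLASS` table (male is a subclass of person) and the `all_types` helper are assumptions for illustration and do not appear in the slides.

```python
# Hypothetical hierarchy: subclass id -> superclass id.
SUPERCLASS = {"male": "person"}


def all_types(direct_types):
    """Expand directly declared types with every inherited superclass."""
    result = []
    for type_id in direct_types:
        # Walk up the hierarchy until the top or an already-seen type is reached.
        while type_id is not None and type_id not in result:
            result.append(type_id)
            type_id = SUPERCLASS.get(type_id)
    return result


# Declaring only "male" is enough to conclude that eric is also a person.
eric_types = all_types(["male"])
```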
Association Types • ISO 13250 implicitly specifies class/instance associations • XTM specifies, through PSIs, class/instance and superclass/subclass • Other examples: component/object, member/collection, portion/mass, feature/activity, place/area, phase/process
Association Properties • Transitivity, reflexivity, symmetry properties can be attached to associations • Allows special processing and understanding to occur when using associations
Association Occurrences • Topic maps center more on topics where other knowledge management schemes concentrate more on associations or relationships between topics • In topic maps, associations can have topics defined which reify them • Reification of associations allows them to have occurrences
Inference Rules • Inference rules allow new topics and associations to be created based on the existence of others • Rules can be stored and managed using topic map syntax
Inference Rules – cont.
<association>
  <instanceOf><topicRef xlink:href="#inference.rule"/></instanceOf>
  <scope><topicRef xlink:href="#inference.rule.schema"/></scope>
  <member>
    <roleSpec><topicRef xlink:href="#inference.rule.condition"/></roleSpec>
    <topicRef xlink:href="#ir.parent.in.family.N345"/>
    <topicRef xlink:href="#ir.parent.in.family.N456"/>
    <topicRef xlink:href="#ir.sibling.in.family.N567"/>
  </member>
  <member>
    <roleSpec><topicRef xlink:href="#inference.rule.statement"/></roleSpec>
    <topicRef xlink:href="#ir.cousin.N678"/>
  </member>
</association>
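The rule above appears to combine two parent conditions and one sibling condition to conclude a cousin relationship. A plausible reading, sketched below, is: if two parents are siblings, their children are cousins. The function name, the pair-based data layout, and every person other than "eric" are assumptions for illustration; this is not SemanText's inference engine.

```python
def infer_cousins(parent_of, sibling_of):
    """Derive cousin pairs from (parent, child) and (sibling, sibling) pairs."""
    # "is sibling of" is symmetric, so accept either ordering.
    siblings = set(sibling_of) | {(b, a) for a, b in sibling_of}
    cousins = set()
    for parent1, child1 in parent_of:
        for parent2, child2 in parent_of:
            if (parent1, parent2) in siblings:
                # frozenset makes the derived cousin association order-free.
                cousins.add(frozenset((child1, child2)))
    return cousins


parent_of = {("eric", "alex"), ("kim", "sam")}  # hypothetical family facts
sibling_of = {("eric", "kim")}

cousins = infer_cousins(parent_of, sibling_of)
```

In a topic map setting, each derived pair would be written back as a new association, which is what lets rules create new topic map structures.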
SemanText: Using Topic Maps for Knowledge Representation • 100% pure Python system developed to demonstrate the joining of topic maps and semantic networks • Uses tmproc, wxPython, PyXML • Enables creation, modification, and querying of topic map structures • Models semantic network structures with entities and relationships • Built-in inference engine to which the user can add rules that create new topic map structures • Development is continuing
Future SemanText Plans • Implement XTM • Implement scopes, themes • Implement merge – hard vs. soft • Integration with grove-based system to allow point-and-click input from multiple data formats • Hooks to natural language tools • Voice input/output using VoiceML • Graphical output such as VRML or SVG • Textual output such as Open E-book, PalmOS, WML
Conclusions • SemanText demonstrates that information can be harvested using the markup from XML documents in order to build a knowledge base • It demonstrates that the topic map architecture can be used to interchange semantic network information • It also demonstrates that topic maps can be used to feed a semantic network • It demonstrates that topic map syntax can be used to extend the topic map paradigm (schemas, templates, inference rules)
Q & A SemanText available from www.semantext.com Questions or comments welcome at: ISOGEN International/DataChannel 1611 W. County Road B, Suite 204 St. Paul, MN 55113 USA Voice: 1.651.636.9100 - Fax: 1.651.636.9191 eric@isogen.com www.isogen.com - www.datachannel.com