XML on Semantic Web
Outline • The Semantic Web • Ontology • XML • Probabilistic DTD • References
The Semantic Web (1/4) • The first-generation Web • The second-generation Web: the current Web • The third-generation Web: the Semantic Web • The conceptual structuring of the Web in an explicit, machine-readable way • Requirements: universal expressive power, support for syntactic interoperability, support for semantic interoperability
The Semantic Web (2/4) • Syntactic interoperability concerns parsing the data; semantic interoperability means defining mappings between unknown terms and known terms in the data • Semantic interoperability: requires standards for both the syntactic form of documents and their semantic content • A further representation and inference layer is needed on top of the currently available layers of the WWW: ontology
Ontology (1/5) • An explicit, machine-readable specification of a shared conceptualization • Crucial role: representation of a shared conceptualization of a particular domain • Reusable • Helps find pages that contain syntactically different but semantically similar words • Constructs: concepts (usually organized in taxonomies), relations, functions, axioms, instances
Ontology (3/5) • Concepts: • Can be anything about which something is said • Also known as classes (XOL, RDF(S), OIL, DAML+OIL), objects (OML), categories (SHOE) • Taxonomies: • Used to organize ontological knowledge through generalization and specialization relationships, over which single and multiple inheritance can be applied
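A minimal sketch of a taxonomy with multiple inheritance, to make the generalization and specialization relationships concrete. The concept names and the `ancestors` and `is_a` helpers are hypothetical illustrations, not taken from any of the ontology languages listed above.

```python
# Hypothetical toy taxonomy: each concept maps to its direct generalizations.
TAXONOMY = {
    "Thing": [],
    "Person": ["Thing"],
    "Employee": ["Person"],
    "Student": ["Person"],
    "TeachingAssistant": ["Employee", "Student"],  # multiple inheritance
}


def ancestors(concept, taxonomy=TAXONOMY):
    """Return every concept that generalizes `concept`, transitively."""
    seen = set()
    stack = list(taxonomy[concept])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(taxonomy[parent])
    return seen


def is_a(concept, general, taxonomy=TAXONOMY):
    """True if `concept` equals `general` or is a specialization of it."""
    return concept == general or general in ancestors(concept, taxonomy)
```

Here `is_a("TeachingAssistant", "Person")` holds through either parent, while `is_a("Person", "Student")` does not, since specialization only runs one way.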
Ontology (4/5) • Relations and functions: • An interaction between concepts of the domain and attributes • Called relations in SHOE and OML, roles in OIL • Functions are a special kind of relation • Axioms: • Constrain information, verify correctness, deduce new information • Also known as assertions (OML), rules, logic
Ontology (5/5) • Instances: • Represent elements in the domain attached to a specific concept • Measurement of expressiveness: • XOL, RDF(S), SHOE, OML, OIL, DAML+OIL
XML (1/7) • As a serialization syntax for other markup languages, e.g. SMIL, XOL, SHOE • As semantic markup of Web pages • As a uniform data-exchange format
XML (2/7) • Universal expressive power: anything can be encoded in XML if a grammar can be defined for it • Syntactic interoperability: an XML parser can parse any XML data and is usually a reusable component • Semantic interoperability: there is no way of recognizing a semantic unit from a particular domain of interest (not yet widely recognized)
XML (4/7) • Data exchange: • Build a model of the domain of interest • From the domain model, a DTD or an XML Schema is constructed • Advantage: reusability of the parsing software components • There are multiple ways to encode a given domain model in a DTD, so the direct connection from the DTD to the domain model is lost and cannot easily be reconstructed
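A minimal sketch of the encoding problem, using Python's standard `xml.etree.ElementTree`. The two documents, their tag names, and the `worksFor` fact are made up for illustration: both encode the same domain fact, yet a generic parser sees unrelated structures, and recovering the fact needs a hand-written reader per encoding.

```python
import xml.etree.ElementTree as ET

# Two hypothetical encodings of one domain fact: Ada works for ACME.
DOC_A = '<person name="Ada"><employer>ACME</employer></person>'
DOC_B = '<employment><company>ACME</company><worker>Ada</worker></employment>'


def shape(xml_text):
    """Tag names in document order: all that a generic parser can report."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


def fact_from_a(xml_text):
    """Reader that knows encoding A: name is an attribute, employer a child."""
    root = ET.fromstring(xml_text)
    return (root.get("name"), "worksFor", root.findtext("employer"))


def fact_from_b(xml_text):
    """Reader that knows encoding B: both values are child elements."""
    root = ET.fromstring(xml_text)
    return (root.findtext("worker"), "worksFor", root.findtext("company"))
```

Both readers return the same triple, but nothing in `shape(DOC_A)` or `shape(DOC_B)` says so; that knowledge lives only in the domain model behind each reader.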
XML (6/7) • A direct mapping based on the different DTDs is not possible • So we have to define the mappings between the different domain models first, then between the different DTDs: • Reengineering the original domain model from the DTD or XML Schema • Establishing mappings between the entities in the domain models • Defining translation procedures for XML documents • Using a more suitable formalism than pure XML can save much of this additional effort
Probabilistic DTD (1/11) • A DTD that describes the most likely orderings of XML tags and contains statistical properties for each tag • Uses association rule discovery algorithms and sequence mining techniques
Probabilistic DTD (2/11) • Objectives: tagging all text documents and deriving an appropriate preliminary flat XML DTD • A knowledge discovery in textual databases (KDT) process builds clusters of semantically similar text units, after which new documents can be converted into XML documents
Probabilistic DTD (3/11) • UML schema: initially conceived by experts, it serves as a reference for the DTD, but there is no guarantee that the final DTD will be contained in, or contain, this schema • KDT process: • Tagging initial text documents • Domain knowledge, such as a thesaurus and the preliminary UML schema, is input to the process • Pre-processing • Iterative clustering • Post-processing • Establishing a probabilistic DTD
Probabilistic DTD (5/11) • Pre-processing: • Setting the level of granularity • NLP processing such as tokenization, normalization, word stemming • Building text unit descriptors: a reduced feature space (currently chosen by the engineer) • Mapping all text units into Boolean vectors of this feature space • Extracting named entities
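A rough sketch of the mapping from text units to Boolean vectors, assuming a hand-picked descriptor list. The descriptors, the sample sentence in the usage note, and the `crude_stem` suffix stripper are placeholders; a real pipeline would use a proper stemmer and the engineer's own descriptors.

```python
import re

# Hypothetical descriptors; the slides say the engineer chooses these.
DESCRIPTORS = ["director", "capital", "found", "share"]


def normalize(text):
    """Tokenization and normalization: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Very rough suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def to_boolean_vector(text_unit, descriptors=DESCRIPTORS):
    """One Boolean per descriptor: does its stem occur in the text unit?"""
    stems = {crude_stem(token) for token in normalize(text_unit)}
    return [descriptor in stems for descriptor in descriptors]
```

For example, `to_boolean_vector("The company was founded by two directors.")` sets only the `director` and `found` positions.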
Probabilistic DTD (6/11) • Clustering: • Performed in multiple iterations; each iteration outputs a set of clusters • All text unit vectors are clustered • Clusters are partitioned into “acceptable” and “unacceptable” according to quality criteria • Members of “unacceptable” clusters are the input data for the next iteration
Probabilistic DTD (7/11) • Post-processing: • “Acceptable” clusters are semi-automatically assigned a label • Ultimately, cluster labels are determined by the engineer • All default cluster labels are derived from text unit descriptors • An XML DTD is automatically derived from the XML tags
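A sketch of the iteration loop only. The slides do not name the clustering algorithm or the quality criteria, so both are assumptions here: each pass groups vectors by a caller-supplied key function, and a cluster counts as acceptable when it has at least `min_size` members.

```python
from collections import defaultdict


def cluster_by_key(vectors, key):
    """Toy clustering step: vectors with the same key share a cluster."""
    groups = defaultdict(list)
    for vector in vectors:
        groups[key(vector)].append(vector)
    return list(groups.values())


def iterative_clustering(vectors, keys, min_size=2):
    """Run one pass per key; only rejected members enter the next pass.

    Returns (accepted_clusters, leftover_vectors).
    """
    accepted = []
    remaining = list(vectors)
    for key in keys:
        if not remaining:
            break
        rejected = []
        for cluster in cluster_by_key(remaining, key):
            if len(cluster) >= min_size:  # assumed quality criterion
                accepted.append(cluster)
            else:
                rejected.extend(cluster)
        remaining = rejected
    return accepted, remaining
```

A caller might pass a strict key first (the full vector) and a coarser one next (a prefix of the vector), so that units rejected in the first pass get a second chance under a looser grouping.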
Probabilistic DTD (9/11) • Establishing a probabilistic DTD: • Deriving the most likely ordering of the tags • Computing the statistical properties of each tag inside the document type definition • Deriving the ordering of the tags: • Backward construction of DTD sequences: builds “maximal” sequences • Forward sequence construction
Probabilistic DTD (10/11) • Backward construction of DTD sequences • Starts with an arbitrary tag Y and then identifies the tag most likely to appear before it • If no such tag exists, construction moves on to the next sequence; if there is exactly one, the next iteration starts; if there are k such tags, the sequence is duplicated into k incomplete sequences • Each tag Xi leads to Y with a confidence Ci • If one Ci is larger than the others, then Xi is the predecessor of Y in the sequence • If C0, the confidence that Y has no predecessor, is the largest, then Y is the first element • Confidence is the tag's TagSupport multiplied by the accuracy
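A simplified sketch of the backward step, assuming the confidences have already been computed. The tag names and numbers in the usage note are invented, the key `None` stands for C0 (no predecessor), and the duplication into k sequences on ties is left out: this version follows only the single most confident predecessor and stops if it would revisit a tag.

```python
def confidence(tag_support, accuracy):
    """Confidence as stated on the slide: TagSupport times accuracy."""
    return tag_support * accuracy


def backward_sequence(start_tag, pred_conf):
    """Greedily extend a sequence to the left, beginning at `start_tag`.

    pred_conf[tag] maps each candidate predecessor to its confidence Ci;
    the key None holds C0, the confidence that `tag` has no predecessor.
    """
    sequence = [start_tag]
    current = start_tag
    while True:
        candidates = pred_conf.get(current, {})
        if not candidates:
            return sequence  # nothing known about predecessors
        best = max(candidates, key=candidates.get)
        if best is None or best in sequence:
            return sequence  # first element reached, or cycle guard
        sequence.insert(0, best)
        current = best
```

With hypothetical confidences such as `{"Purpose": {"Address": 0.5, None: 0.2}, "Address": {"Name": 0.6, None: 0.05}, "Name": {None: 0.7}}`, starting from `"Purpose"` the loop prepends `"Address"`, then `"Name"`, and stops there because C0 is the largest value for `"Name"`.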
References • The Semantic Web - on the respective Roles of XML and RDF (Stefan Decker, Frank van Harmelen, Jeen Broekstra, Michael Erdmann, Dieter Fensel, Ian Horrocks, Michel Klein, Sergey Melnik) • Intelligent Information Agent with Ontology on the Semantic Web (Weihua Li) • Ontology Languages for the Semantic Web (Asuncion Gomez-Perez, Oscar Corcho) • Extraction of Semantic XML DTDs from Texts Using Data Mining Techniques (Karsten Winkler, Myra Spiliopoulou)