A flexible framework that supports multiple languages and integrates with Knowledge.Net. The algorithm performs morphological and semantic analysis of documents, optimizes the resulting graph, and saves the results.
Saint-Petersburg State University. TEMPLATE-DRIVEN KNOWLEDGE MINING. KNOWLEDGEPROSPECTOR.NET. Speaker: Alexey L. Smolyakov. Project team (Knowledge.Net): Anton V. Novikov, Maxim V. Sigalin, Alexey L. Smolyakov, Dmitry G. Cherepanov. Scientific adviser: Prof. Vladimir V. Safonov
Project goals • Flexible framework • Support for different languages • Integration with Knowledge.Net
Algorithm • Getting documents and first-step text analysis • Morphological analysis of text blocks • Semantic analysis of entity sets using templates • Optimizing the resulting graph • Saving results
Getting documents and first-step text analysis • Getting documents from providers • Dividing each document into articles (plain text, list, table, etc.) • Dividing text into blocks. Examples of articles: plain text: «… Text format is a very flexible way to describe various types of information …»; list: «1) One 2) Two 3) Three»; table: «Country. Capital. England. London. Ukraine. Kyiv.»
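The first-step analysis can be illustrated with a minimal sketch. It assumes that articles are separated by blank lines and that a list is a run of lines starting with a number followed by "." or ")". The function name `split_into_articles` and these rules are hypothetical, not taken from the project; table detection is left out.

```python
import re

# Assumption: a list item starts with digits followed by "." or ")".
LIST_ITEM = re.compile(r"^\s*\d+[.)]\s+")


def split_into_articles(document):
    """Split a document into (kind, lines) articles separated by blank lines."""
    articles = []
    for chunk in re.split(r"\n\s*\n", document.strip()):
        lines = [ln for ln in chunk.splitlines() if ln.strip()]
        # An article is a list only if every non-empty line looks like a list item.
        is_list = bool(lines) and all(LIST_ITEM.match(ln) for ln in lines)
        articles.append(("list" if is_list else "text", lines))
    return articles
```

Each returned article could then be divided further into text blocks before morphological analysis.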
Morphological analysis of text blocks • Language recognition (Russian, English, …) • Morphological form recognition using dictionaries (MRD, XML, …) • Creating entities. Example: Word(«Documents») → current morphological form «Documents»: noun, plural → base morphological form «Document»: noun, singular → EntityClass(«Document»)
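A minimal sketch of the lookup step, assuming a tiny in-memory dictionary in place of a real morphological dictionary such as MRD. The names `TOY_DICTIONARY`, `Entity`, and `analyze_word` are hypothetical, and the noun-to-class and adjective-to-property mapping is an assumption based on the examples in these slides.

```python
from dataclasses import dataclass

# Hypothetical toy dictionary: surface form -> (base form, part of speech).
TOY_DICTIONARY = {
    "documents": ("document", "Noun"),
    "document": ("document", "Noun"),
    "useful": ("useful", "Adjective"),
}

# Assumed mapping from part of speech to entity kind.
KIND_BY_POS = {"Noun": "Class", "Adjective": "Property"}


@dataclass(frozen=True)
class Entity:
    kind: str  # "Class", "Property", or "Unknown"
    base: str  # base morphological form


def analyze_word(word):
    """Look the word up and build an entity from its base form."""
    entry = TOY_DICTIONARY.get(word.lower())
    if entry is None:
        return Entity("Unknown", word)
    base, pos = entry
    return Entity(KIND_BY_POS.get(pos, "Unknown"), base)
```

With this toy data, `analyze_word("Documents")` yields a class entity whose base form is `document`, mirroring the slide's example.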
Morphological analysis > Entity types > “Simple” entities • Entity “separator”. Example: «.,;:!?()[]{}…» • Entity “unknown” • Entity “changeable”. Example: «good» • Entity “relationship”. Example: «Planet Earth is LESS than the Sun»
Morphological analysis > Entity types > “True” entities • Entity “class”. Example: «document» • Entity “property”. Example: «useful» • Entity “datatype”: Datetime, Integer
Semantic analysis > Goals • Creating relationships between entities • Creating new entities • Adding true entities into the resulting graph. Example graph (figure): Class(«house»), Class(«building»), Property(«comfortable»), and Property(«brick»), linked by «subclass» and «property-class» relationships.
Semantic analysis > Relationship types • Relationship between property and class • Relationship “subclass” • Relationship “subproperty” • Relationship “equality” • Relationship between two classes • Relationship “conditional rule”
Semantic analysis > Template description • Priority • Pattern • Handlers <Template Priority="10000" Pattern="#E.P #E.C ,? а? значить #E.P"> <Handler Name="PropertyRelationship" Arguments="0, 1" /> <Handler Name="PropertyRelationship" Arguments="5, 1" /> <Handler Name="ConditionalRule" Arguments="1, 0, 5" /> </Template> (The literals «а» and «значить» in the pattern are Russian words, roughly “and” and “to mean”.)
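As an illustration, the template above can be read with Python's standard `xml.etree.ElementTree`. This is only a sketch: the function name `load_template` and the returned dictionary layout are hypothetical. It also assumes that handler arguments are zero-based positions of pattern elements, which fits this template (positions 0, 1, and 5 are the `#E.P`, `#E.C`, and final `#E.P` elements) but is not stated in the slides.

```python
import xml.etree.ElementTree as ET

TEMPLATE_XML = """
<Template Priority="10000" Pattern="#E.P #E.C ,? а? значить #E.P">
  <Handler Name="PropertyRelationship" Arguments="0, 1" />
  <Handler Name="PropertyRelationship" Arguments="5, 1" />
  <Handler Name="ConditionalRule" Arguments="1, 0, 5" />
</Template>
"""


def load_template(xml_text):
    """Parse one <Template> element into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    handlers = [
        (h.get("Name"), [int(a) for a in h.get("Arguments").split(",")])
        for h in root.findall("Handler")
    ]
    return {
        "priority": int(root.get("Priority")),
        # Assumption: pattern elements are separated by whitespace.
        "pattern": root.get("Pattern").split(),
        "handlers": handlers,
    }
```

Templates loaded this way could be sorted by `priority` before matching; which end of the priority scale is applied first is not specified in the slides.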
Semantic analysis > Pattern description • Logical operands: «&» (and), «|» (or), «^» (not) • Occurrence: not set (once), «+», «*», «?» • Entity types: #E.P, #E.C, #E.S, #E.U, #E.Int, #E.DateTime • Morphological types: #M.Noun, #M.Adjective, #M.Verb, … • Word holders: #W.Month, #W.Number, … • Clause holders: #H.Class, … Example: [#E.P #M.Adjective]+ [#E.C #M.Noun]
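The occurrence markers can be illustrated with a small backtracking matcher. This is a sketch under stated assumptions, not the project's implementation: a bracketed group such as `[#E.P #M.Adjective]` is taken to mean that a single token must carry all listed tags, the markers are assumed to behave as in regular expressions, and the operands `&`, `|`, `^` are not handled. The name `match` and the tuple representation are hypothetical.

```python
from typing import List, Optional, Set, Tuple

# One pattern element: (tags a token must carry, occurrence marker).
# The marker is "" (exactly once), "+", "*", or "?".
Element = Tuple[Set[str], str]


def match(pattern: List[Element], tokens: List[Set[str]],
          i: int = 0, j: int = 0) -> Optional[int]:
    """Return the token index just past a match of pattern[i:] at tokens[j:], else None."""
    if i == len(pattern):
        return j
    tags, occ = pattern[i]
    fits = j < len(tokens) and tags <= tokens[j]
    if occ == "":
        return match(pattern, tokens, i + 1, j + 1) if fits else None
    if occ == "+":
        if not fits:
            return None
        # One token consumed; the remainder behaves like "*" on the same element.
        rest = [(tags, "*")] + pattern[i + 1:]
        return match(rest, tokens, 0, j + 1)
    # "*" and "?" are greedy but may also match zero tokens.
    if fits:
        end = match(pattern, tokens, i if occ == "*" else i + 1, j + 1)
        if end is not None:
            return end
    return match(pattern, tokens, i + 1, j)
```

For the example pattern `[#E.P #M.Adjective]+ [#E.C #M.Noun]`, two adjective properties followed by a noun class match and consume all three tokens, while a lone noun class does not match.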
Semantic analysis > Pattern description > Word holders <WordHolder Name="Month"> <Item Word="JANUARY" Value="1" /> <Item Word="FEBRUARY" Value="2" /> <Item Word="MARCH" Value="3" /> ... </WordHolder> Clause holders <ClauseHolder Name="Class"> <Item Pattern="[#E.P #M.Adjective]* #E.C" Index="1" /> <Item Pattern="[#E.P #M.Adjective] , [#E.P #M.Adjective] #E.C" Index="2" /> </ClauseHolder>
Semantic analysis > Handlers • Replace • Create datetime entity • Create «property-class» relationship • Create «subclass» relationship • Create «subproperty» relationship • Create «conditional rule» relationship • Create «class-class» relationship
Semantic analysis > Creating relationships Property(«useful») Class(«document») + <Template Priority="4" Pattern="[#E.P #M.Adjective]+ [#E.C #M.Noun]"> <Handler Name="PropertyRelationship" Arguments="0, 1" /> </Template> = «property-class» relationship between Property(«useful») and Class(«document»)
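How a handler such as `PropertyRelationship` might use its arguments can be sketched as follows. The assumptions are that each pattern element yields a group of matched entities (a «+» element may yield several) and that the two arguments select the property group and the class group. The function name and the tuple form of a relationship are hypothetical.

```python
def property_relationship(groups, args):
    """Create one «property-class» relationship per (property, class) pair."""
    prop_index, class_index = args
    return [
        ("property-class", prop, cls)
        for prop in groups[prop_index]
        for cls in groups[class_index]
    ]
```

For the slide's example, the groups `[["useful"], ["document"]]` with arguments `(0, 1)` produce a single relationship between «useful» and «document».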
Semantic analysis > Creating new entities Integer(«7») Class(«December») Integer(«2006») Class(«Year») + <Template Priority="11000" Pattern="#E.INT #W.Month #E.INT year"> <Handler Name="Replace" From="0" Count="4"> <CreateEntityHandler Name="CreateDateTime" Arguments="day=0, month=1, year=2" /> </Handler> </Template> = Datetime(7.12.2006)
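The effect of the `Replace` handler with a nested `CreateDateTime` handler can be sketched in Python. The assumptions are that `From` and `Count` select a slice of the matched entities, that `day=0, month=1, year=2` are offsets into that slice, and that the month word is resolved through a lookup table like the `Month` word holder. The name `replace_with_date` and the plain-string entity representation are hypothetical.

```python
from datetime import date

# Month lookup in the spirit of the "Month" word holder.
MONTHS = {
    "JANUARY": 1, "FEBRUARY": 2, "MARCH": 3, "APRIL": 4,
    "MAY": 5, "JUNE": 6, "JULY": 7, "AUGUST": 8,
    "SEPTEMBER": 9, "OCTOBER": 10, "NOVEMBER": 11, "DECEMBER": 12,
}


def replace_with_date(entities, start, count=4):
    """Replace `count` entities from `start` with one date built from the first three."""
    day, month_word, year = entities[start:start + 3]
    value = date(int(year), MONTHS[month_word.upper()], int(day))
    return entities[:start] + [value] + entities[start + count:]
```

Applied to `["7", "December", "2006", "year"]` at position 0, the four entities collapse into a single date for 7 December 2006.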
Optimizing the resulting graph • Removing redundant «subclass» relationships • Removing redundant relationships between properties and classes. Example graph (figure): Class(«vehicle»), Class(«transport»), Class(«bus»), and Property(«fast»), linked by «subclass» and «property-class» relationships.
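Removing redundant «subclass» relationships can be viewed as a transitive reduction: a direct edge is redundant when the same superclass is reachable through a longer chain. The sketch below assumes an acyclic hierarchy stored as a set of `(subclass, superclass)` pairs. The function name is hypothetical, and the direction of the example hierarchy (bus under transport under vehicle) is an assumption, since the slide's figure is not recoverable.

```python
def remove_redundant_subclass(edges):
    """Drop every (sub, sup) edge whose superclass is reachable by another path."""

    def reachable(src, dst, skip):
        # Depth-first search from src to dst that ignores the edge `skip`.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            for a, b in edges:
                if a != node or (a, b) == skip or b in seen:
                    continue
                if b == dst:
                    return True
                seen.add(b)
                stack.append(b)
        return False

    return {edge for edge in edges if not reachable(edge[0], edge[1], skip=edge)}
```

The same idea could be applied to «property-class» relationships, dropping a property from a subclass when a superclass already carries it.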
Saving results • Saving acquired knowledge in Knowledge.Net format • Saving in OWL • Saving knowledge to (and loading it from) the project's own binary file format
Current project status • Developed a working prototype • Created test templates • Attached the «Mrd» dictionary (Russian and English)
Plans • Support for creating «compound» entities (composed of several words, e.g. «creation of human hands») • Functionality extension (adding new entities, relationships, templates, handlers, …) • A program for generating templates • Developing good examples
Questions? Contact information: smlkvalex@mail.ru http://www.knowledge-net.ru http://polyhimnie.math.spbu.ru