260 likes | 423 Views
Methods for the Automatic Construction of Topic Maps. Eric Freese, Senior Consultant ISOGEN International. Agenda. Automatic Construction from Structured Documents Automatic Construction from Unstructured Documents. Contextual Harvesting.
E N D
Methods for the Automatic Construction of Topic Maps Eric Freese, Senior Consultant ISOGEN International
Agenda • Automatic Construction from Structured Documents • Automatic Construction from Unstructured Documents
Contextual Harvesting • Markup can provide clues about the information within a document • Largely dependent on semantic markup • Takes advantage of nesting within elements • Rules can be developed for harvesting data to build topic map constructs • Rules could then be applied to similar types of documents
DTD/Schema Development to Support Harvesting • DTDs like HTML are mostly useless to a harvesting system • Flat structures make associations between elements more difficult • New DTD/schema development should take possible knowledge harvesting into account
Content-Based Harvesting • Combination of contextual and natural language harvesting • Text is parsed and clues within the text are used to harvest knowledge. • HTML documents where labels are included in the text could be processed this way
NLP Strategies • Named Entity recognition • A list of entities (people, companies, places, etc.) is defined • Programs parse a corpus of information to identify entities • Limited to the completeness of the entity list
NLP Strategies – cont. • Concept extraction • A list of key words can be defined much like the named entity strategy • Common strings may also be identified and suggested as new concepts • More processing intensive than named entity
NLP Strategies – cont. • Taxonomic classification • Documents are analysed and classified according to a human-defined taxonomy • Specialized programs must be developed that are able to understand the taxonomy • Must also be able to process synonyms and related concepts
NLP Strategies – cont. • Discourse analysis • Programs are developed that attempt to understand the meaning of a text • Analyze the parts of speech using a lexicons and rules to attempt to derive the meanings and usages of words
Steps in Processing Natural Language • Tokenization • Part of Speech Tagging • Bracketing • Identification of useful structures
Tokenization • GOAL: Prepare text for processing by a natural language processing system • Flowing paragraph text is broken into sentence units • Sentence units are broken into word tokens • Word tokens are prepared for part of speech processing • Contractions and other constructs receive special processing • n’t, ’s
Part of Speech (POS) Tagging • GOAL: Identify the part(s) of speech for each word token • Lexicon – list of words with the possible POS tags for each word • EXAMPLE: • sound - NNP JJ NN VB • The Sound of Music, Puget Sound • a sound decision • the sound of silence • sound the alarm
POS Tagging – cont. • POS tagging is difficult • Exception processing often required for phrases and grouped words • EXAMPLE: • Time flies like an arrow. • Fruit flies like an apple.
Bracketing • GOAL: Identify groupings of words into phrases and the hierarchical relationship of phrases to one another • A set of rules is used to identify how different parts of speech and phrases can be combined to form larger phrases. • EXAMPLE: • [NP, DT, JJ, NN] • Noun phrase can consist of a determiner followed by adjective followed by a noun
Benefits of the phased approach • The separation of functions allows this approach to be applied to any language. • A lexicon is developed for the language • Rules for language construction are defined • Generic engine is able to process the data • The separation of the lexicon and the rules base allows the model to be modified/improved as the corpus of text grows.
Putting it all together • EXAMPLE: The red ball rolled down the hill. • Tokenization and POS tagging • The DT • red JJ NNP • ball NN • rolled VBD VBN • down RB IN RBR VBP JJ NN RP • hill NN
Putting it all together – cont. • EXAMPLE: The red ball rolled down the hill. • Bracketing rules • [S, NP, VP] • [NP, DT, JJ, NN] • [VP, VBD, PP] • [PP, IN, NP] • [NP, DT, NN] • RESULT (using XML for bracketing): <S><NP><DT>The</DT><JJ>red</JJ><NN>ball</NN> </NP><VP><VBD>rolled</VBD><PP><IN>down</IN> <NP><DT>the</DT><NN>hill</NN></NP></VP></S>
Harvesting Considerations • GIGO rule in effect • The harvesting process only hastens topic map construction • Only some of the topic map merging rules are applicable • Limited prospect of meaningful subject identities • Humans must still participate in the process of knowledge organization in order to maintain quality • Selective inclusion in the topic map/knowledge base
Q & A Questions or comments welcome at: ISOGEN International 1611 W. County Road B, Suite 204 St. Paul, MN 55113 USA Voice: 1.651.636.9180 - Fax: 1.651.636.9191 eric@isogen.com www.isogen.com
Demonstrations • SemanText – open source • Ontopia Knowledge Suite – commercial
Harvesting Knowledge using SemanText • Contextual Harvesting • Natural Language-Based Harvesting
Contextual Harvesting in SemanText • XML markup is used as in NLP to identify document structures • Users write rules for harvesting information into topic maps structures
Harvesting XML Content • Data can be harvested from any XML document (including RDF) into a topic map without the need to develop specialized programs • Specification language is based on Xpath • In future, will also record the location from which the data was harvested as an occurrence
Harvesting RDF Content • RDF structures can be harvested in order to build topic maps • Demonstrates the possibility of interoperability between the two models • Rules can be established for flavors of RDF (e.g. Dublin Core, DAML/OIL) • Allows any document using the tagging scheme to be harvested • Only binary associations can be generated
NLP in SemanText • The initial lexicon included within SemanText is based on a lexicon derived from the Penn Treebank tagging of the Brown corpus (1 million+ words) and a very large sample from the Wall Street Journal (approx. 3 million words) • SemanText provides the ability to identify and add new words to the lexicon through a GUI.
NLP in SemanText – cont. • SemanText uses a public domain parser to tokenize flowing text and identify the appropriate POS for each word token based on a set of bracketing rules. • XML is used to denote the bracketing • This XML markup allows SemanText to process natural language using its contextual-based harvesting capability