Textual Information Clustering and Visualization for Knowledge Discovery and Management

Textual Information Clustering and Visualization for Knowledge Discovery and Management • Xavier Polanco • URI-INIST-CNRS

Introduction • We are concerned with the design and development of computer-based information analysis tools • Cluster analysis, computational linguistics and artificial intelligence techniques are combined

On the technology side • A computer-based information analysis system is • an integrated environment that assists a user • in carrying out the complex process of converting information from textual data sources into knowledge

Information Analysis System (diagram) • Lexicons or terminological resources • French or English text data • Dataset or corpus • Term extraction and indexation • Bibliometric statistics • Clustering and mapping • DBMS-R • WWW server • Tools: SDOC, HENOCH, NEURODOC, MIRIAD, ILC • Platforms: Mac, PC, WS

Home Pages Intranet Extranet

Plan • Text Mining • Cluster Analysis • Visualization or Mapping • Knowledge Discovery • Knowledge Management

Textual Information • A large amount of information is available in textual form in databases and online sources • At this scale, manual analysis and effective extraction of useful information are not possible • Automatic tools are therefore needed for analyzing large textual collections

Text Mining • Text mining consists of extracting information from hidden patterns in large text-data collections • The results can be important both: • for the analysis of the collection, and • for providing intelligent navigation and browsing methods

Process • The text mining process can be organized roughly into five major steps: • Data Selection • Term Extraction and Filtering • Data Clustering and Classification • Mapping or Visualization • Result Interpretation • The process is iterative and interactive

Natural Language Processing • Experience shows that a linguistic engineering approach ensures a higher performance of the data mining algorithms • Part-of-speech tagging (tagging texts) and lemmatization are generally accepted tasks

The approach • Our approach to text mining is based on extracting meaningful terms from documents • In this presentation, the focus is on the term extraction process, and • on the need to organize the generated terms in a taxonomy

The main tasks • Term extraction or acquisition • Indexation • Human control and screening • Indexing quality control • Index screening → clustering phase

Language Engineering (diagram) • Natural Language Engineering System • Lexicons • Text-DB • Indexed Corpus • Lexicons: management and linguistic processing • Texts: part-of-speech tagging, lemmatization, and indexation

Variation

Taxonomy • A taxonomic structure should improve text mining • Whatever clustering techniques are used in text mining, more taxonomic classifying capabilities should be incorporated into it • A taxonomic classifying capability might also facilitate cluster interpretation by giving the user some kind of rules

Clustering • Clustering is a descriptive task where one seeks to identify a finite set of categories • Clustering is used to segment a database into subsets or clusters • Clustering means finding the clusters themselves from a given set of data

Clustering Process (diagram) • Natural Language Engineering System • Lexicons • Text-DB • Indexed Corpus • D(n,p) → Clustering Algorithm → C(m,p) • Similarity measures: s(x,y) • Dissimilarity measures: d(x,y)
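The slide names similarity measures s(x,y) and dissimilarity measures d(x,y) without saying which ones are used. As a minimal sketch, assuming binary keyword vectors and the Jaccard coefficient (an illustrative choice, not necessarily the measure used in the system), the pair could look like this. The vectors `doc_a`, `doc_b` and `doc_c` are made-up examples.

```python
def jaccard_similarity(x, y):
    """s(x, y): keywords shared by both vectors / keywords present in either."""
    shared = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Two empty vectors are treated as identical (a convention, not a rule).
    return shared / either if either else 1.0


def jaccard_dissimilarity(x, y):
    """d(x, y) = 1 - s(x, y)."""
    return 1.0 - jaccard_similarity(x, y)


# Illustrative binary keyword vectors (1 = the keyword indexes the document).
doc_a = [1, 0, 1, 0, 1, 1]
doc_b = [1, 0, 0, 1, 0, 1]
doc_c = [0, 1, 0, 1, 0, 0]

print(jaccard_similarity(doc_a, doc_b))     # 2 shared keywords out of 5
print(jaccard_dissimilarity(doc_a, doc_c))  # no keyword in common
```

Any other coefficient defined on the same vectors (cosine, Dice, and so on) could be substituted without changing the rest of the clustering process.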

Documents × Keywords
      KW1 KW2 KW3 KW4 KW5 KW6
D1     1   0   1   0   1   1
D2     1   0   1   0   1   1
D3     0   1   0   1   0   0
D4     1   0   0   1   0   1
• Di × KWj = {1,0} • Di × KWj = {1, 2, …, n}
• C1 = ({D1,D2}, {KW1,KW3,KW5,KW6}) • C2 = ({D4}, {KW1,KW4,KW6}) • C3 = ({D3}, {KW2,KW4})

Clustering Algorithms • Major families of clustering methods: • Sequential algorithms • Hierarchical algorithms • Agglomerative algorithms • Divisive algorithms • Fuzzy clustering algorithms
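To make the agglomerative family concrete, here is a minimal sketch of single-link agglomerative clustering over keyword sets. It assumes a Jaccard distance and a stopping threshold, both illustrative choices; the function names and the toy `DOCS` data are hypothetical and do not describe the system's actual algorithm.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two keyword sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def single_link_agglomerative(items, threshold):
    """Repeatedly merge the two closest clusters; stop when the closest
    pair is farther apart than `threshold`."""
    clusters = [[name] for name in items]

    def link(c1, c2):
        # Single link: distance between the closest members of two clusters.
        return min(jaccard_distance(items[x], items[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        pairs = [(link(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        dist, i, j = min(pairs)
        if dist > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy documents described by keyword sets (illustrative data).
DOCS = {
    "D1": {"KW1", "KW3", "KW5", "KW6"},
    "D2": {"KW1", "KW3", "KW5", "KW6"},
    "D3": {"KW2", "KW4"},
    "D4": {"KW1", "KW4", "KW6"},
}

print(single_link_agglomerative(DOCS, threshold=0.5))
print(single_link_agglomerative(DOCS, threshold=0.7))
```

Raising the threshold yields fewer, broader clusters. A divisive algorithm would work in the opposite direction, starting from one cluster and splitting it.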

Information Analysis Process • The text-data information analysis is divided into two phases: • Cluster generation • Map display of clusters • A hypertext user interface enables the analyst to explore and interpret results

Example Antibiotic Resistance 2 DB 4025 documents (1998-1999) Data 30 Medicine Molecular Biology Clusters Map Hypertext

Information Visualization • Definition: The use of computer-supported, interactive, visual representations of abstract data to amplify the acquisition or use of knowledge (Card et al., 1999) • Visual artifacts aid human thought • The progress of civilization can be read in the invention of visual artifacts, from writing to mathematics, to maps, to diagrams, to visual computing

Process • Raw Data → Data Tables • Data Tables → Clustering • Clustering → Visual Structures: Map • Visual Structures → Views

Visual Structures • Data Tables are mapped to Visual Structures, which augment a spatial substrate with marks and graphical properties to encode information • A Graphic Representation is said to be expressive if all and only the data in the Data Table are also represented in the Visual Structure • A Graphic Representation is said to be more effective if it is faster to interpret

Map Display • We are concerned with the map display of clusters • A problem of particular interest is how to visualize a data set with many variables: • Multivariate data are clustered, and • Clusters are mapped

Mapping tools • For mapping, we use the following techniques: • Density and Centrality Diagrams • Principal Component Analysis (PCA) • Multi-Layer Perceptrons (MLP) • Self-Organizing Maps (SOM) • Multi-SOMs

Multi-Layer Perceptron 1 (diagram) • ISE = ||s − x||² • Inputs: x1 … xi … xp • Outputs: s1 … sk … sp • Weights: Wcij, Wc(p,2); Wsjk, Ws(2,p) • Map labels: prion proteins, scrapie, human disease, spongiform encephalopathy, mankind, CJD
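The error on this slide, ISE = ||s − x||², compares the network output s with its own input x. A minimal sketch of that computation follows. It assumes linear units without bias, and it reads the 2-unit layer between Wc(p,2) and Ws(2,p) as map coordinates. Those readings, the function names and the toy weights are assumptions for illustration, not the author's implementation.

```python
def apply_layer(vector, weights):
    """Linear layer: row vector (length n) times an n x m weight matrix."""
    n_out = len(weights[0])
    return [sum(v * row[k] for v, row in zip(vector, weights))
            for k in range(n_out)]


def squared_error(x, wc, ws):
    """Return (||s - x||^2, 2-D coordinates) for one input vector x."""
    coords = apply_layer(x, wc)   # p inputs -> 2 coordinates, weights Wc(p,2)
    s = apply_layer(coords, ws)   # 2 coordinates -> p outputs, weights Ws(2,p)
    error = sum((sk - xk) ** 2 for sk, xk in zip(s, x))
    return error, coords


# Toy example with p = 3 (hypothetical weights, chosen by hand).
x = [1.0, 0.0, 1.0]
wc = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
ws = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

error, coords = squared_error(x, wc, ws)
print(error, coords)
```

Training would adjust the two weight matrices to reduce this error over all input vectors. Only the forward pass is shown here.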

Multi-Layer Perceptron 2 (diagram) • Input Layer: x1 … xp • C(m,p) • First Hidden Layer • Second Hidden Layer (Cartography) • Output Layer: y1 … yp • Polarizer node • Map labels: protein, infection, resistance, Agrobacterium, plasmids

Multi-SOM Platform (diagram) • Raw Data Processing System: DB, Pre-processing, SOMPACK, Post-processing, MAPS • Graphic-Hypertext User Interface: MULTISOM Java Application

Multi-Self-Organizing Map Display • Maps associated with 5 viewpoints: • Map 1: Plants • Map 2: Plant Parts • Map 3: Pathogen Agents • Map 4: Genetic Techniques • Map 5: Patenting Firms • Rice area activated • Use of the inter-map communication mechanism

Knowledge Discovery • KD is informally defined as the extraction of useful knowledge from databases or large amounts of data • One of the most important research topics in KD is rule discovery or extraction • The discovered knowledge is usually expressed in the form of "if-then" rules

Association Rules • The extraction of association rules can be seen as one of the key tasks of KDD • The intuitive meaning of an association rule X → Y, where X and Y are keywords or descriptors, is: "a document set containing keyword X is likely to also contain keyword Y"

Example • In a given food-industry corpus: • "98% of the documents that deal with apple juice relate it to the chromatography analytic technique" • X → Y: "apple juice → chromatography"
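A percentage such as the one in this example is usually called the confidence of the rule X → Y: among the documents indexed by X, the share that are also indexed by Y. The sketch below computes it, together with the support, on a tiny made-up corpus. The documents and keywords in `corpus` are invented for illustration and are not the food-industry data.

```python
def support(docs, keywords):
    """Fraction of documents indexed by all the given keywords."""
    keywords = set(keywords)
    return sum(1 for kws in docs if keywords <= kws) / len(docs)


def confidence(docs, x, y):
    """Among documents containing all of x, the fraction also containing all of y."""
    x, y = set(x), set(y)
    with_x = [kws for kws in docs if x <= kws]
    if not with_x:
        return 0.0
    return sum(1 for kws in with_x if y <= kws) / len(with_x)


# Hypothetical corpus: each document is represented by its set of keywords.
corpus = [
    {"apple juice", "chromatography"},
    {"apple juice", "chromatography"},
    {"apple juice", "spectrometry"},
    {"wine", "chromatography"},
]

print(confidence(corpus, ["apple juice"], ["chromatography"]))  # 2 of 3 documents
print(support(corpus, ["apple juice", "chromatography"]))       # 2 of 4 documents
```

In practice a rule is kept only when both values exceed thresholds chosen by the analyst.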

The Galois Lattice • Our current research includes an approach based on the lattice structure to discover concepts and rules relating the objects (documents) and their properties (keywords) • The Galois lattice approach is also known as conceptual clustering

The concept lattice • Given the context (D1,T1), where D1 = {d1,d2,d3,d4} and T1 = {t1,t2,t3,t4,t5,t6}
Table: The input relation R = documents × keywords
R     t1  t2  t3  t4  t5  t6
d1     1   0   1   0   1   1
d2     1   0   1   0   1   1
d3     0   1   0   1   0   0
d4     1   0   0   1   0   1
Hasse Diagram • C1: (D1, Ø) • C2: ({d1,d2,d4}, {t1,t6}) • C3: ({d3,d4}, {t4}) • C4: ({d1,d2}, {t1,t3,t5,t6}) • C5: ({d4}, {t1,t4,t6}) • C6: ({d3}, {t2,t4}) • C7: (Ø, T1)
• The formal concept C4 has two own terms {t3,t5} and two inherited terms {t1,t6}
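The seven concepts can be recomputed from the input relation R. The sketch below uses a brute-force enumeration: for every subset of documents, take the terms they share (the intent), then take all documents having those terms (the extent). This is only workable for small contexts and is not presented as the algorithm used in the research. The function names are hypothetical.

```python
from itertools import combinations

# The input relation R from the slide: document -> set of terms.
CONTEXT = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}
ALL_TERMS = {"t1", "t2", "t3", "t4", "t5", "t6"}


def common_terms(docs):
    """Intent: terms shared by every document in docs (all terms if docs is empty)."""
    shared = set(ALL_TERMS)
    for d in docs:
        shared &= CONTEXT[d]
    return shared


def docs_having(terms):
    """Extent: documents that contain every term in terms."""
    return {d for d, ts in CONTEXT.items() if terms <= ts}


def formal_concepts():
    """Enumerate all (extent, intent) pairs closed under the two maps above."""
    concepts = set()
    names = sorted(CONTEXT)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_terms(subset)
            extent = docs_having(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


concepts = formal_concepts()
# List concepts from the largest extent (top of the lattice) to the smallest.
for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
print(len(concepts))  # 7, matching C1 to C7 on the slide
```

Ordering the concepts by inclusion of their extents gives the Hasse diagram.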

Association Rules Extraction • The formal concept C4 yields the following rules • R1: t3 → t1 ∧ t6 • R2: t5 → t1 ∧ t6 • R3: t3 ↔ t5 • Interpretation of R1 and R2: the use of term t3 or t5 is always associated with that of terms t1 and t6 • Rule R3 expresses the mutual equivalence of the terms {t3,t5}: all the documents that have term t3 also have term t5, and vice versa
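Each rule can be checked directly against the input relation: a rule holds when every document containing the premise terms also contains the conclusion terms. The following sketch tests R1, R2 and R3 on the four-document context of the concept lattice slide. The helper name `holds` is hypothetical.

```python
# Document -> terms, as in the input relation R of the concept lattice slide.
CONTEXT = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}


def holds(premise, conclusion):
    """True if every document containing all premise terms
    also contains all conclusion terms."""
    return all(conclusion <= terms
               for terms in CONTEXT.values() if premise <= terms)


r1 = holds({"t3"}, {"t1", "t6"})                  # R1: t3 -> t1 and t6
r2 = holds({"t5"}, {"t1", "t6"})                  # R2: t5 -> t1 and t6
r3 = holds({"t3"}, {"t5"}) and holds({"t5"}, {"t3"})  # R3: t3 <-> t5
converse = holds({"t1"}, {"t3"})                  # fails: d4 has t1 but not t3

print(r1, r2, r3, converse)
```

The converse of R1 does not hold, which is why the own terms of C4 appear on the left of the rules and the inherited terms on the right.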

Summary • Text Mining • Clustering • Mapping • Knowledge Discovery

Knowledge Management • A knowledge management system is concerned with the identification, acquisition, development, diffusion, use, and preservation of the enterprise’s knowledge

KM Objectives • Using advanced technology • For facilitating creation, access, and reuse of knowledge • For converting knowledge from the sources accessible to an organization and connecting people with that knowledge

Project • Adding to the information analysis system a formalized operator for processing together: • The knowledge that is extracted from databases • The knowledge that the experts produce when they analyze the clusters, maps, concepts and rules

We have reached our last subject, but not the end!

Merci Gracias Obrigado Thanks Xavier Polanco
