Textual Information Clustering and Visualization for Knowledge Discovery and Management • Xavier Polanco, URI-INIST-CNRS
Introduction • We are concerned with the design and development of computer-based information analysis tools • Cluster analysis, computational linguistics and artificial intelligence techniques are combined
On the technology side • A computer-based information analysis system is • an integrated environment that assists a user • in carrying out the complex process of converting information from textual data sources into knowledge
Information Analysis System • [Architecture diagram] • Inputs: lexicons or terminological resources; French or English text data; dataset or corpus • Functions: term extraction and indexation; bibliometric statistics; clustering and mapping; DBMS-R • Tools: SDOC, HENOCH, NEURODOC, MIRIAD, ILC • Platforms: Mac, PC, WS • Delivery: WWW server; home pages; intranet; extranet
Plan • Text Mining • Cluster Analysis • Visualization or Mapping • Knowledge Discovery • Knowledge Management
Textual Information • A large amount of information is available in textual form in databases and online sources • At this scale, manual analysis and effective extraction of useful information are not feasible • Automatic tools are therefore needed for analyzing large textual collections
Text Mining • Text mining consists of extracting information from hidden patterns in large text-data collections • The results can be important both: • for the analysis of the collection, and • for providing intelligent navigation and browsing methods
Process • The text mining process can be organized roughly into five major steps: • Data Selection • Term Extraction and Filtering • Data Clustering and Classification • Mapping or Visualization • Result Interpretation • The process is iterative and interactive
Natural Language Processing • Experience shows that a linguistic engineering approach ensures higher performance of the data mining algorithms • Part-of-speech tagging (tagging texts) and lemmatization are generally accepted tasks
The approach • Our approach to text mining is based on extracting meaningful terms from documents • In this presentation, the focus is on the term extraction process, and • on the need to organize the generated terms in a taxonomy
The main tasks • Term extraction or acquisition • Indexation • Human control and screening • Indexing quality control • Index screening → clustering phase
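The indexation task can be illustrated with a minimal sketch. It assumes a lexicon that lists the accepted surface variants of each term, which stands in here for real part-of-speech tagging and lemmatization. The lexicon entries, the sample sentence, and the name `index_document` are hypothetical and are not taken from the system described in these slides.

```python
import re

# Hypothetical lexicon: each canonical term maps to its accepted surface variants.
LEXICON = {
    "antibiotic resistance": [
        "antibiotic resistance",
        "antibiotic resistances",
        "resistance to antibiotics",
    ],
    "plasmid": ["plasmid", "plasmids"],
}


def index_document(text, lexicon):
    """Return the sorted canonical terms whose variants occur in the text."""
    # Lowercase and keep only alphanumeric tokens, joined by single spaces.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    padded = f" {normalized} "  # padding makes whole-word matching simple
    found = set()
    for term, variants in lexicon.items():
        if any(f" {variant} " in padded for variant in variants):
            found.add(term)
    return sorted(found)


doc = "Plasmids often carry genes conferring resistance to antibiotics."
index_terms = index_document(doc, LEXICON)
print(index_terms)
```

The index terms produced this way are the candidates that a human would then control and screen before clustering.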
Language Engineering • [Diagram: Lexicons and Text-DB feed the Natural Language Engineering System, which produces the Indexed Corpus] • Lexicons: management and linguistic processing • Texts: part-of-speech tagging, lemmatization, and indexation
Taxonomy • A taxonomic structure should improve text mining • Given the clustering techniques that might be used in text mining, one must keep in mind that more taxonomic classification capabilities should be incorporated into text mining • A taxonomic classification capability might also facilitate cluster interpretation by giving the user some kind of rules
Clustering • Clustering is a descriptive task where one seeks to identify a finite set of categories • Clustering is used to segment a database into subsets or clusters • Clustering means finding the clusters themselves from a given set of data
Clustering Process • [Diagram: Lexicons and Text-DB feed the Natural Language Engineering System, which produces the Indexed Corpus used by the Clustering Process] • D(n,p) → Clustering Algorithm → C(m,p) • Similarity measures: s(x,y) • Dissimilarity measures: d(x,y)
Documents × Keywords
      KW1  KW2  KW3  KW4  KW5  KW6
D1     1    0    1    0    1    1
D2     1    0    1    0    1    1
D3     0    1    0    1    0    0
D4     1    0    0    1    0    1
Cell values: Di KWj = {1,0} or Di KWj = {1, 2, …, n}
C1 = ({D1,D2},{KW1,KW3,KW5,KW6})
C2 = ({D4},{KW1,KW4,KW6})
C3 = ({D3},{KW2,KW4})
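A minimal sketch of how the document-keyword table can be clustered, assuming the Jaccard coefficient as the similarity measure s(x,y) and a single-pass sequential scheme. The slides do not say which measure or threshold was used, so `jaccard`, `sequential_clustering`, and the 0.6 threshold are illustrative assumptions.

```python
# Document-keyword table from the slide, stored as keyword sets.
DOCS = {
    "D1": {"KW1", "KW3", "KW5", "KW6"},
    "D2": {"KW1", "KW3", "KW5", "KW6"},
    "D3": {"KW2", "KW4"},
    "D4": {"KW1", "KW4", "KW6"},
}


def jaccard(x, y):
    """Similarity s(x,y) = |x & y| / |x | y|; a dissimilarity is d = 1 - s."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def sequential_clustering(docs, threshold=0.6):
    """One pass over the documents: join the most similar cluster or open a new one."""
    clusters = []  # each cluster keeps its members and the keywords of its first member
    for name, keywords in docs.items():
        best = max(clusters, key=lambda c: jaccard(c["keywords"], keywords), default=None)
        if best is not None and jaccard(best["keywords"], keywords) >= threshold:
            best["docs"].append(name)
        else:
            clusters.append({"docs": [name], "keywords": set(keywords)})
    return clusters


clusters = sequential_clustering(DOCS)
for cluster in clusters:
    print(cluster["docs"], sorted(cluster["keywords"]))
```

With this threshold the sketch groups D1 with D2 and leaves D3 and D4 on their own, the same partition as C1, C2, and C3 in the table (the cluster numbering may differ).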
Clustering Algorithms • Major families of clustering methods: • Sequential algorithms • Hierarchical algorithms • Agglomerative algorithms • Divisive algorithms • Fuzzy clustering algorithms
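To illustrate the agglomerative family, here is a small sketch that repeatedly merges the two closest clusters. It assumes single linkage and the Jaccard distance on the four example documents D1 to D4; both choices, and the function names, are assumptions made for illustration rather than the method used in the system.

```python
from itertools import combinations

# The four example documents and their keywords.
DOCS = {
    "D1": {"KW1", "KW3", "KW5", "KW6"},
    "D2": {"KW1", "KW3", "KW5", "KW6"},
    "D3": {"KW2", "KW4"},
    "D4": {"KW1", "KW4", "KW6"},
}


def jaccard_distance(x, y):
    """Dissimilarity d(x,y) = 1 - |x & y| / |x | y|."""
    return 1.0 - len(x & y) / len(x | y)


def agglomerative(docs, n_clusters):
    """Merge the two closest clusters (single linkage) until n_clusters remain."""
    clusters = [frozenset([name]) for name in docs]
    merges = []  # history of (cluster_a, cluster_b, distance)

    def linkage(pair):
        a, b = pair
        return min(jaccard_distance(docs[i], docs[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((sorted(a), sorted(b), linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters, merges


three_clusters, _ = agglomerative(DOCS, 3)
two_clusters, merge_history = agglomerative(DOCS, 2)
print([sorted(c) for c in three_clusters])
print(merge_history)
```

Stopping at three clusters keeps D1 and D2 together and D3 and D4 apart; going on to two clusters attaches D4 to the D1-D2 group, which is the kind of hierarchy a dendrogram would show.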
Information Analysis Process • The text-data information analysis is divided into two phases: • Cluster generation • Map display of clusters • A hypertext user interface enables the analyst to explore and interpret results
Example: Antibiotic Resistance • Data: 2 DB, 4025 documents (1998-1999) • [Screenshot labels: 30, Medicine, Molecular Biology, Clusters, Map, Hypertext]
Information Visualization • Definition: The use of computer-supported, interactive, visual representation of abstract data to amplify the acquisition or use of knowledge (Card et al., 1999) • Visual artifacts aid human thought • The progress of civilization can be read in the invention of visual artifacts, from writing to mathematics, to maps, to diagrams, to visual computing
Process • Raw Data → Data Tables • Data Tables → Clustering • Clustering → Visual Structures: Map • Visual Structures → Views
Visual Structures • Data Tables are mapped to Visual Structures, which augment a spatial substrate with marks and graphical properties to encode information • A Graphic Representation is said to be expressive if all and only the data in the Data Table are also represented in the Visual Structure • A Graphic Representation is said to be more effective if it is faster to interpret
Map Display • We are concerned with the map display of the clusters • A problem of particular interest is how to visualize a data set with many variables: • Multivariate data are clustered, and • Clusters are mapped
Mapping tools • For mapping, we use the following techniques: • Density and Centrality Diagrams • Principal Component Analysis (PCA) • Multi-Layer Perceptrons (MLP) • Self-Organizing Maps (SOM) • Multi-SOMs
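As an illustration of one of the listed techniques, the sketch below projects multivariate rows onto two principal components. It is a bare-bones PCA that uses power iteration with deflation, chosen only to keep the example self-contained; in practice a numerical library would normally be used. The rows are the four example documents D1 to D4 over keywords KW1 to KW6 from the clustering slides, standing in for the cluster profiles that the system would map, and all function names are hypothetical.

```python
import math

# Rows: documents D1..D4; columns: keywords KW1..KW6.
DATA = [
    [1, 0, 1, 0, 1, 1],
    [1, 0, 1, 0, 1, 1],
    [0, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
]


def center(rows):
    """Subtract the column mean from every value."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already centered rows."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)] for i in range(p)]


def top_component(matrix, iterations=500):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    p = len(matrix)
    vec = [1.0 / math.sqrt(p + i) for i in range(p)]  # arbitrary starting vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break
        vec = [v / norm for v in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p))
    return vec, value


def pca_2d(rows):
    """Return 2D coordinates of each row on the first two principal axes."""
    centered = center(rows)
    cov = covariance(centered)
    p = len(cov)
    axes = []
    for _ in range(2):
        vec, value = top_component(cov)
        axes.append(vec)
        # Deflation: remove the component just found before extracting the next one.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)] for i in range(p)]
    return [[sum(a * b for a, b in zip(row, axis)) for axis in axes] for row in centered]


coords = pca_2d(DATA)
for name, (x, y) in zip(["D1", "D2", "D3", "D4"], coords):
    print(name, round(x, 3), round(y, 3))
```

Identical rows (D1 and D2) land on the same point of the map, and the sign of each axis is arbitrary, so only relative positions should be interpreted.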
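Multi-Layer Perceptron 1 • ISE = ||s − x||² • [Diagram: inputs x1 … xi … xp connect through weights Wc(p,2) (elements Wcij) to two hidden units, which connect through weights Ws(2,p) (elements Wsjk) to outputs s1 … sk … sp] • Map labels: prion proteins, scrapie, human disease, spongiform encephalopathy, mankind, CJD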
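The error ISE = ||s − x||² can be made concrete with a small sketch of the forward pass. It assumes linear units and reads the two hidden values as the 2D map position of the input; the slide specifies neither the activation function nor the training procedure, so the weights and function names below are purely illustrative.

```python
def forward(x, wc, ws):
    """Encode x (length p) into 2 hidden values, then decode back to length p.

    wc has shape (p, 2) and ws has shape (2, p), matching Wc(p,2) and Ws(2,p).
    Linear units are an assumption of this sketch.
    """
    p = len(x)
    hidden = [sum(x[i] * wc[i][j] for i in range(p)) for j in range(2)]
    s = [sum(hidden[j] * ws[j][k] for j in range(2)) for k in range(p)]
    return hidden, s


def ise(s, x):
    """Squared error ||s - x||^2 between the output s and the input x."""
    return sum((sk - xk) ** 2 for sk, xk in zip(s, x))


# Illustrative weights for p = 3: keep the first two inputs, drop the third.
wc = [[1, 0], [0, 1], [0, 0]]
ws = [[1, 0, 0], [0, 1, 0]]
x = [0.5, 0.25, 1.0]

position, s = forward(x, wc, ws)
error = ise(s, x)
print(position, s, error)
```

Training would adjust Wc and Ws to reduce this error over all inputs, so that the two hidden values retain as much of each input as a 2D position can.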
Multi-Layer Perceptron 2 • [Diagram: C(m,p) feeds the input layer x1 … xp, followed by the first hidden layer, the second hidden layer (cartography), a polarizer node, and the output layer y1 … yp] • Map labels: protein, infection, resistance, Agrobacterium, plasmids
Multi-SOM Platform • [Diagram] • Raw data processing system: DB, pre-processing, SOMPACK, post-processing, MAPS • Graphic-hypertext user interface: MULTISOM Java application
Multi-Self-Organizing Map Display • Maps associated with 5 viewpoints: • Map 1: Plants • Map 2: Plant Parts • Map 3: Pathogen Agents • Map 4: Genetic Techniques • Map 5: Patenting Firms • [Figure: the "Rice" area is activated through the inter-map communication mechanism]
Knowledge Discovery • KD is informally defined as the extraction of useful knowledge from databases or large amounts of data • One of the most important research topics in KD is rule discovery or extraction • The discovered knowledge is usually expressed in the form of "if-then" rules
Association Rules • Association rules can be seen as one of the key tasks of KDD • The intuitive meaning of an association rule X → Y, where X and Y are keywords or descriptors, is: "a document set containing keyword X is likely to also contain keyword Y"
Example • In a given food-industry corpus: • "98% of the documents dealing with apple juice relate it to the chromatography analytical technique" • X → Y: "apple juice → chromatography"
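The percentage in such a rule is its confidence, the share of documents containing X that also contain Y. A minimal sketch follows, using a synthetic corpus constructed only to mirror the 98% figure; the corpus contents and the function names are assumptions, not data from the study.

```python
def support(corpus, keywords):
    """Fraction of documents that contain every keyword in `keywords`."""
    keywords = set(keywords)
    return sum(1 for doc in corpus if keywords <= doc) / len(corpus)


def confidence(corpus, x, y):
    """Share of the documents containing x that also contain y."""
    with_x = [doc for doc in corpus if x in doc]
    if not with_x:
        return 0.0
    return sum(1 for doc in with_x if y in doc) / len(with_x)


# Synthetic corpus (not real data): 50 apple-juice documents, 49 of them with chromatography.
corpus = (
    [{"apple juice", "chromatography"}] * 49
    + [{"apple juice", "spectrometry"}] * 1
    + [{"wine", "chromatography"}] * 50
)

rule_confidence = confidence(corpus, "apple juice", "chromatography")
rule_support = support(corpus, {"apple juice", "chromatography"})
print(rule_confidence, rule_support)
```

Note that the rule is directional: in this synthetic corpus the confidence of "chromatography → apple juice" is much lower, because chromatography also appears in the wine documents.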
The Galois Lattice • Our current research includes an approach based on the lattice structure to discover concepts and rules relating the objects (documents) and their properties (keywords) • The Galois lattice approach is also known as conceptual clustering
The concept lattice
Given the context (D1,T1), where D1 = {d1,d2,d3,d4} and T1 = {t1,t2,t3,t4,t5,t6}
Table: the input relation R = documents × keywords
R     t1  t2  t3  t4  t5  t6
d1     1   0   1   0   1   1
d2     1   0   1   0   1   1
d3     0   1   0   1   0   0
d4     1   0   0   1   0   1
Hasse diagram (formal concepts, top to bottom):
C1: (D1, Ø)
C2: ({d1,d2,d4},{t1,t6})
C3: ({d3,d4},{t4})
C4: ({d1,d2},{t1,t3,t5,t6})
C5: ({d4},{t1,t4,t6})
C6: ({d3},{t2,t4})
C7: (Ø, T1)
The formal concept C4 has two own terms {t3,t5} and two inherited terms {t1,t6}
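The formal concepts of this context can be recomputed with a short sketch. It takes every subset of documents, collects the terms they share (the intent), and then collects all documents having those terms (the extent). This brute-force enumeration is exponential in the number of documents and is only meant for a toy context like this one; the function names are hypothetical.

```python
from itertools import combinations

# The input relation R: each document with the terms it contains.
CONTEXT = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}
ALL_TERMS = frozenset().union(*CONTEXT.values())


def common_terms(docs):
    """Terms shared by every document in `docs` (all terms if `docs` is empty)."""
    result = set(ALL_TERMS)
    for d in docs:
        result &= CONTEXT[d]
    return frozenset(result)


def documents_having(terms):
    """Documents whose term set contains every term in `terms`."""
    return frozenset(d for d, kws in CONTEXT.items() if terms <= kws)


def formal_concepts():
    """Enumerate (extent, intent) pairs by closing every subset of documents."""
    concepts = set()
    names = sorted(CONTEXT)
    for size in range(len(names) + 1):
        for docs in combinations(names, size):
            intent = common_terms(docs)
            extent = documents_having(intent)
            concepts.add((extent, intent))
    # Largest extents first, so the top of the lattice is listed first.
    return sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[0])))


concepts = formal_concepts()
for extent, intent in concepts:
    print(sorted(extent), sorted(intent))
```

The sketch yields seven concepts, the same pairs as C1 to C7, although the listing order within a level may differ from the numbering on the slide.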
Association Rules Extraction • The formal concept C4 makes the following rules possible: • R1: t3 → t1, t6 • R2: t5 → t1, t6 • R3: t3 ↔ t5 • Interpretation of R1 and R2: the use of term t3 or t5 is always associated with that of terms t1 and t6 • The rule R3 expresses the mutual equivalence of the terms {t3,t5}: all the documents that have the term t3 also have the term t5, and vice versa
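These rules can be checked directly against the context (d1 and d2 carry t1, t3, t5, t6; d3 carries t2, t4; d4 carries t1, t4, t6). The sketch below separates the own terms of C4 from those inherited from C2, the concept directly above it, and tests each rule as an exact implication; the helper names are hypothetical.

```python
# The context: each document with the terms it contains.
CONTEXT = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}


def extent(terms):
    """Documents containing every term in `terms`."""
    return {d for d, kws in CONTEXT.items() if set(terms) <= kws}


def holds(premise, conclusion):
    """True if every document containing `premise` also contains `conclusion`."""
    return extent(premise) <= extent(conclusion)


c4_intent = {"t1", "t3", "t5", "t6"}
c2_intent = {"t1", "t6"}  # intent of C2, the concept directly above C4

inherited_terms = c4_intent & c2_intent
own_terms = c4_intent - c2_intent

rules = {
    "R1": holds({"t3"}, {"t1", "t6"}),
    "R2": holds({"t5"}, {"t1", "t6"}),
    "R3": holds({"t3"}, {"t5"}) and holds({"t5"}, {"t3"}),
}
print(sorted(own_terms), sorted(inherited_terms), rules)
```

The converse of R1 does not hold: d4 contains t1 and t6 but not t3, which is why the rules run from the own terms to the inherited ones.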
Summary • Text Mining • Clustering • Mapping • Knowledge Discovery
Knowledge Management • A knowledge management system is concerned with the identification, acquisition, development, diffusion, use, and preservation of the enterprise’s knowledge
KM Objectives • Using advanced technology • For facilitating creation, access, and reuse of knowledge • For converting knowledge from the sources accessible to an organization and connecting people with that knowledge
Project • Adding to the information analysis system a formalized operator for processing together: • The knowledge that is extracted from databases • The knowledge that the experts produce when they analyze the clusters, maps, concepts and rules
We have reached our last subject, but not the end!
Merci Gracias Obrigado Thanks Xavier Polanco