
Textual Information Clustering and Visualization for Knowledge Discovery and Management

Textual Information Clustering and Visualization for Knowledge Discovery and Management. Xavier Polanco URI-INIST-CNRS. Introduction. We are concerned with the design and development of computer-based information analysis tools


Presentation Transcript


  1. Textual Information Clustering and Visualization for Knowledge Discovery and Management Xavier Polanco URI-INIST-CNRS

  2. Introduction • We are concerned with the design and development of computer-based information analysis tools • Cluster analysis, computational linguistics and artificial intelligence techniques are combined

  3. On the technology side • A computer-based information analysis system is • an integrated environment that assists a user • in carrying out the complex process of converting information from textual data sources into knowledge

  4. Information Analysis System (diagram labels): French or English text-data; Lexicons or terminological resources; Dataset or Corpus; Term Extraction and Indexation; Bibliometric statistics; Clustering and Mapping; DBMS-R; WWW Server; SDOC; HENOCH; NEURODOC; MIRIAD; ILC; Mac; PC; WS

  5. Home Pages Intranet Extranet

  6. Plan • Text Mining • Cluster Analysis • Visualization or Mapping • Knowledge Discovery • Knowledge Management

  7. Textual Information • A large amount of information is available in textual form in databases and online sources • In this context, manual analysis and effective extraction of useful information are not possible • It is therefore relevant to provide automatic tools for analyzing large textual collections

  8. Text Mining • Text mining consists of extracting information from hidden patterns in large text-data collections • The results can be important both: • for the analysis of the collection, and • for providing intelligent navigation and browsing methods

  9. Process • The text mining process can be organized roughly into five major steps: • Data Selection • Term Extraction and Filtering • Data Clustering and Classification • Mapping or Visualization • Result Interpretation • It is an iterative and interactive process

  10. Natural Language Processing • Experience shows that a linguistic engineering approach ensures a higher performance of the data mining algorithms • Part-of-speech tagging (tagging texts) and lemmatization are generally accepted tasks

  11. The approach • Our approach to text mining is based on extracting meaningful terms from documents • In this presentation, the focus is on the term extraction process, and • The need to organize the generated terms in a taxonomy

  12. The main tasks • Term extraction or acquisition • Indexation • Human control and screening • Indexing quality control • Index screening → clustering phase

  13. Language Engineering (diagram labels): Natural Language Engineering System; Lexicons; Text-DB; Indexed Corpus • Lexicons: Management and Linguistic Processing • Texts: Part-of-speech tagging, lemmatization, and indexation

  14. Variation

  15. Taxonomy • A taxonomic structure should improve text mining • Considering the clustering techniques that might be used in text mining, one must be mindful that more taxonomic classifying capabilities should be incorporated into text mining • A taxonomic classifying capability might also facilitate cluster interpretation by giving the user some kind of rules

  16. Clustering • Clustering is a descriptive task where one seeks to identify a finite set of categories • Clustering is used to segment a database into subsets or clusters • Clustering means finding the clusters themselves from a given set of data

  17. Clustering Process (diagram labels): Natural Language Engineering System; Lexicons; Text-DB; Indexed Corpus; D(n,p); Similarity Measures: s(x,y); Dissimilarity Measures: d(x,y); Clustering Algorithm; C(m,p)
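
The transcript names similarity measures s(x,y) and dissimilarity measures d(x,y) without saying which ones are used. As one possible illustration only, the sketch below assumes the Jaccard coefficient over keyword sets, with d = 1 - s. The function names and the two sample documents are hypothetical.

```python
def jaccard_similarity(x: set, y: set) -> float:
    """s(x, y): keywords shared by both documents / keywords used by either."""
    if not x and not y:
        return 1.0  # convention chosen here: two empty descriptions are identical
    return len(x & y) / len(x | y)


def jaccard_dissimilarity(x: set, y: set) -> float:
    """d(x, y) = 1 - s(x, y)."""
    return 1.0 - jaccard_similarity(x, y)


# Hypothetical keyword sets for two indexed documents
doc_a = {"antibiotic", "resistance", "plasmid"}
doc_b = {"antibiotic", "resistance", "protein", "infection"}

s = jaccard_similarity(doc_a, doc_b)     # 2 shared keywords out of 5 distinct
d = jaccard_dissimilarity(doc_a, doc_b)
print(s, d)
```

Any other set-based or vector-based measure could be substituted without changing the rest of a clustering pipeline, as long as s and d stay consistent with each other.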

  18. Documents × Keywords
            KW1  KW2  KW3  KW4  KW5  KW6
        D1   1    0    1    0    1    1
        D2   1    0    1    0    1    1
        D3   0    1    0    1    0    0
        D4   1    0    0    1    0    1
      Di × KWj = {1,0}
      Di × KWj = {1, 2, …, n}
      C1 = ({D1,D2}, {KW1,KW3,KW5,KW6})
      C2 = ({D4}, {KW1,KW4,KW6})
      C3 = ({D3}, {KW2,KW4})

  19. Clustering Algorithms • Major families of clustering methods: • Sequential algorithms • Hierarchical algorithms • Agglomerative algorithms • Divisive algorithms • Fuzzy clustering algorithms
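
To make the first family concrete, here is a minimal sketch of a sequential ("leader") algorithm applied to the documents × keywords matrix of slide 18. It is not the authors' implementation: the Jaccard measure, the 0.5 threshold, and the function names are assumptions made for illustration.

```python
def jaccard(x: set, y: set) -> float:
    """Shared keywords divided by all keywords used by either document."""
    return len(x & y) / len(x | y) if (x or y) else 1.0


def leader_clustering(docs: dict, threshold: float) -> list:
    """Single pass over the documents: each one joins the first cluster whose
    leader is similar enough, otherwise it starts a new cluster as its leader."""
    clusters = []  # each entry: {"leader": keyword set, "members": [doc ids]}
    for doc_id, keywords in docs.items():
        for cluster in clusters:
            if jaccard(keywords, cluster["leader"]) >= threshold:
                cluster["members"].append(doc_id)
                break
        else:
            clusters.append({"leader": keywords, "members": [doc_id]})
    return [cluster["members"] for cluster in clusters]


# Slide 18 matrix, written as one keyword set per document
docs = {
    "D1": {"KW1", "KW3", "KW5", "KW6"},
    "D2": {"KW1", "KW3", "KW5", "KW6"},
    "D3": {"KW2", "KW4"},
    "D4": {"KW1", "KW4", "KW6"},
}

clusters = leader_clustering(docs, threshold=0.5)  # threshold is an assumed value
print(clusters)
```

With this threshold the grouping matches the three clusters shown on slide 18 (D1 and D2 together, D3 and D4 alone). Sequential algorithms depend on the order in which documents are presented, which is one reason hierarchical or fuzzy methods may be preferred.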

  20. Information Analysis Process • The text-data information analysis is divided into two phases: • Cluster generation • Map display of clusters • A hypertext user interface enables the analyst to explore and interpret results

  21. Example Antibiotic Resistance 2 DB 4025 documents (1998-1999) Data 30 Medicine Molecular Biology Clusters Map Hypertext

  22. Information Visualization • Definition: The use of computer-supported, interactive, visual representation of abstract data to amplify the acquisition or use of knowledge (Card et al., 1999) • Visual artifacts aid human thought • The progress of civilization can be read in the invention of visual artifacts, from writing to mathematics, to maps, to diagrams, to visual computing

  23. Process • Raw Data → Data Tables • Data Tables → Clustering • Clustering → Visual Structures: Map • Visual Structures → Views
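
The first step of this process, turning raw data into a data table, can be sketched as follows. The sketch assumes the raw data are already indexed as one keyword list per document, which the transcript does not state; `build_data_table` and the sample records are hypothetical.

```python
def build_data_table(indexed_docs: dict) -> tuple:
    """Return (sorted keyword list, {doc id: binary row}) from indexed documents."""
    keywords = sorted({kw for kws in indexed_docs.values() for kw in kws})
    table = {
        doc_id: [1 if kw in kws else 0 for kw in keywords]
        for doc_id, kws in indexed_docs.items()
    }
    return keywords, table


# Hypothetical indexed records (keyword lists per document)
raw = {
    "D1": ["KW1", "KW3", "KW5", "KW6"],
    "D2": ["KW1", "KW3", "KW5", "KW6"],
    "D3": ["KW2", "KW4"],
    "D4": ["KW1", "KW4", "KW6"],
}

keywords, table = build_data_table(raw)
print(keywords)
for doc_id, row in table.items():
    print(doc_id, row)
```

The resulting binary rows are the kind of input that the clustering step then consumes.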

  24. Visual Structures • Data Tables are mapped to Visual Structures, which augment a spatial substrate with marks and graphical properties to encode information • A Graphic Representation is said to be expressive if all and only the data in the Data Table are also represented in the Visual Structure • A Graphic Representation is said to be more effective if it is faster to interpret

  25. Map Display • We are concerned with the map display of clusters • A problem of particular interest is how to visualize a data set with many variables: • Multivariate data are clustered, and • Clusters are mapped

  26. Mapping tools • For mapping, we use the following techniques: • Density and Centrality Diagrams • Principal Component Analysis (PCA) • Multi-Layer Perceptrons (MLP) • Self-Organizing Maps (SOM) • Multi-SOMs
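
Of the techniques listed, the self-organizing map is perhaps the easiest to sketch in a few lines. The following is a deliberately simplified, pure-Python SOM, not SOM_PAK and not the authors' Multi-SOM: the 2×2 grid, the "bubble" neighbourhood, the linear decay schedules, and all function names are assumptions made for illustration.

```python
import random


def best_matching_unit(weights: dict, x: list) -> tuple:
    """Grid position of the node whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - xi) ** 2 for w, xi in zip(weights[node], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    """Train a small SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1 - frac), 0.5)  # neighbourhood shrinks to the BMU only
        for x in data:
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_dist = abs(node[0] - bmu[0]) + abs(node[1] - bmu[1])
                if grid_dist <= radius:
                    # pull the node's weights towards the input vector
                    for i in range(dim):
                        w[i] += lr * (x[i] - w[i])
    return weights


# Binary document vectors over six keywords (hypothetical sample data)
vectors = {
    "D1": [1, 0, 1, 0, 1, 1],
    "D2": [1, 0, 1, 0, 1, 1],
    "D3": [0, 1, 0, 1, 0, 0],
    "D4": [1, 0, 0, 1, 0, 1],
}

weights = train_som(list(vectors.values()), rows=2, cols=2)
positions = {d: best_matching_unit(weights, v) for d, v in vectors.items()}
print(positions)
```

Documents with identical keyword profiles always land on the same map node; how the remaining documents spread over the grid depends on the random initialisation and the training schedule.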

  27. Multi-Layer Perceptron 1 (diagram labels): ISE = ||s - x||²; units x1 … xi … xp and s1 … sk … sp; weights Wcij and Wsjk, with Wc(p,2) and Ws(2,p); terms: prion proteins, scrapie, human disease, spongiform encephalopathy, mankind, CJD
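
The weight shapes Wc(p,2) and Ws(2,p) suggest a network that compresses a p-dimensional vector x into two hidden values and reconstructs it as s, with ISE = ||s - x||² as the reconstruction error. The sketch below works under that reading, which is an assumption; it also assumes linear units and uses made-up weights, since the transcript gives neither activation functions nor training details.

```python
def forward(x, wc, ws):
    """Forward pass of a p -> 2 -> p network with linear units (assumed).

    wc has shape (p, 2) and ws has shape (2, p), as nested lists.
    Returns the two hidden values (usable as map coordinates) and the output s.
    """
    p = len(x)
    hidden = [sum(x[i] * wc[i][j] for i in range(p)) for j in range(2)]
    s = [sum(hidden[j] * ws[j][k] for j in range(2)) for k in range(p)]
    return hidden, s


def ise(s, x):
    """Squared Euclidean distance ||s - x||^2 between output and input."""
    return sum((sk - xk) ** 2 for sk, xk in zip(s, x))


# Made-up weights for p = 3: keep the first two components, drop the third
wc = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
ws = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

x = [1.0, 2.0, 3.0]
hidden, s = forward(x, wc, ws)
error = ise(s, x)
print(hidden, s, error)
```

Training would adjust wc and ws to reduce this error over the whole data set; the two hidden values of each input can then be plotted as a point on a 2D map.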

  28. Multi-Layer Perceptron 2 (diagram labels): Input Layer (x1 … xp); First Hidden Layer; Second Hidden Layer (Cartography); Output Layer (y1 … yp); Polarizer node; C(m,p); terms: protein, infection, resistance, Agrobacterium, plasmids

  29. Multi-SOM Platform (diagram labels): Raw Data; Processing System; Pre-processing; DB; SOMPACK; Post-processing; MAPS; MULTISOM Java Application; Graphic-Hypertext User Interface

  30. Multi-Self-Organizing Map Display • Maps associated with 5 viewpoints: • Map 1: Plants • Map 2: Plant Parts • Map 3: Pathogen Agents • Map 4: Genetic Techniques • Map 5: Patenting Firms • Rice area activated • Use of the inter-map communication mechanism

  31. Knowledge Discovery • KD is informally defined as the extraction of useful knowledge from databases or large amounts of data • One of the most important research topics in KD is rule discovery or extraction • The discovered knowledge is usually expressed in the form of “if-then” rules

  32. Association Rules • Association rules can be seen as one of the key tasks of KDD • The intuitive meaning of an association rule X → Y, where X and Y are keywords or descriptors, is: “a document set containing keyword X is likely to also contain keyword Y”

  33. Example • In a given food-industry corpus: • “98% of the documents concerned with apple juice relate it to the chromatography analytical technique” • X → Y: “apple juice → chromatography”
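
The strength of a rule X → Y is commonly quantified by its support and confidence. The sketch below computes both over a tiny made-up corpus; the four documents are hypothetical and are not the food-industry corpus behind the 98% figure.

```python
def support(docs, items):
    """Fraction of documents that contain every keyword in items."""
    items = set(items)
    return sum(1 for keywords in docs if items <= keywords) / len(docs)


def confidence(docs, x, y):
    """Among the documents containing X, the share that also contain Y."""
    support_x = support(docs, x)
    if support_x == 0:
        return 0.0  # no document contains X, so the rule cannot be evaluated
    return support(docs, set(x) | set(y)) / support_x


# Hypothetical corpus: one keyword set per document
docs = [
    {"apple juice", "chromatography"},
    {"apple juice", "chromatography", "pectin"},
    {"apple juice", "fermentation"},
    {"wine", "chromatography"},
]

rule_support = support(docs, {"apple juice", "chromatography"})          # 2 of 4 documents
rule_confidence = confidence(docs, {"apple juice"}, {"chromatography"})  # 2 of the 3 apple juice documents
print(rule_support, rule_confidence)
```

A statement such as "98% of the documents" corresponds to a confidence of 0.98 for the rule in question.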

  34. The Galois Lattice • Our current research includes an approach based on the lattice structure to discover concepts and rules relating the objects (documents) and their properties (keywords) • The Galois lattice approach is also known as conceptual clustering

  35. The concept lattice
      Given the context (D1,T1) where D1 = {d1,d2,d3,d4} & T1 = {t1,t2,t3,t4,t5,t6}
      Table: The input relation R = documents × keywords
        R    t1  t2  t3  t4  t5  t6
        d1   1   0   1   0   1   1
        d2   1   0   1   0   1   1
        d3   0   1   0   1   0   0
        d4   1   0   0   1   0   1
      Hasse Diagram (concepts):
        C1: (D1, Ø)
        C2: ({d1,d2,d4}, {t1,t6})
        C3: ({d3,d4}, {t4})
        C4: ({d1,d2}, {t1,t3,t5,t6})
        C5: ({d4}, {t1,t4,t6})
        C6: ({d3}, {t2,t4})
        C7: (Ø, T1)
      The formal concept C4 has two own terms {t3,t5} and two inherited terms {t1,t6}
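
The seven concepts of this slide can be recomputed from the input relation. The sketch below uses a brute-force enumeration over all document subsets, which is fine for four documents but is not how a real lattice algorithm would proceed; the function names are hypothetical.

```python
from itertools import combinations

# Input relation R from the slide: terms carried by each document
context = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}
all_terms = {"t1", "t2", "t3", "t4", "t5", "t6"}


def common_terms(docs):
    """Terms shared by every document in docs (all terms if docs is empty)."""
    shared = set(all_terms)
    for d in docs:
        shared &= context[d]
    return shared


def docs_having(terms):
    """Documents that carry every term in terms."""
    return {d for d, ts in context.items() if terms <= ts}


def formal_concepts():
    """All (extent, intent) pairs, found by closing every subset of documents."""
    concepts = set()
    for size in range(len(context) + 1):
        for subset in combinations(sorted(context), size):
            intent = common_terms(subset)
            extent = docs_having(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


concepts = formal_concepts()
for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(extent), sorted(intent))
```

Each pair is closed in both directions: the intent is exactly what the extent's documents share, and the extent is exactly the set of documents carrying the intent.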

  36. Association Rules Extraction • The formal concept C4 makes the following rules possible: • R1: t3 → t1 ∧ t6 • R2: t5 → t1 ∧ t6 • R3: t3 ↔ t5 • Interpretation of R1 and R2: the use of term t3 or t5 is always associated with that of terms t1 and t6 • Rule R3 expresses the mutual equivalence of the terms {t3,t5}: all the documents which have the term t3 also have the term t5, and vice versa.
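
Rules of this kind can be checked directly against the documents × keywords relation of slide 35: a rule holds when every document containing the premise also contains the conclusion. The helper name `rule_holds` below is hypothetical.

```python
# Input relation from slide 35: terms carried by each document
context = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}


def rule_holds(premise: set, conclusion: set) -> bool:
    """True if every document containing the premise also contains the conclusion."""
    return all(
        conclusion <= terms for terms in context.values() if premise <= terms
    )


r1 = rule_holds({"t3"}, {"t1", "t6"})                           # t3 implies t1 and t6
r2 = rule_holds({"t5"}, {"t1", "t6"})                           # t5 implies t1 and t6
r3 = rule_holds({"t3"}, {"t5"}) and rule_holds({"t5"}, {"t3"})  # t3 and t5 are equivalent
converse = rule_holds({"t1"}, {"t3"})                           # fails: d4 has t1 but not t3
print(r1, r2, r3, converse)
```

The failing converse shows why the rules go from the own terms of C4 to its inherited terms and not the other way round.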

  37. Summary • Text Mining • Clustering • Mapping • Knowledge Discovery

  38. Knowledge Management • A knowledge management system is concerned with the identification, acquisition, development, diffusion, use, and preservation of the enterprise’s knowledge

  39. KM Objectives • Using advanced technology • For facilitating creation, access, and reuse of knowledge • For converting knowledge from the sources accessible to an organization and connecting people with that knowledge

  40. Project • Adding to the information analysis system a formalized operator for processing together: • The knowledge that is extracted from databases • The knowledge that the experts produce when they analyze the clusters, maps, concepts and rules

  41. We have reached our last subject, but not the end!

  42. Merci Gracias Obrigado Thanks Xavier Polanco
