Text Mining Overview

1. Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group

2. Topics Natural Language Processing Text Mining vs. Data Mining The toolbox Language processing methods Single document processing Document corpora processing Document categorization: a closer look Applications Classic Profiled document delivery Related areas Web Content Mining & Web Farming

3. Natural Language Processing Natural language: test for Artificial Intelligence Alan Turing NLP and NLU

4. Information explosion

5. Data Mining The information explosion is a problem not only for text but for all other kinds of data as well. Here data mining comes to the rescue.

6. Knowledge pyramid

7. Text Mining: a definition TM can be described as a statistical method, because KDD is mostly based on statistics.

8. Text Mining tools Linguistic analysis Thesauri, dictionaries, grammar analysers etc. Machine translation Automatic feature extraction Automatic summarization Document categorization Document clustering Information retrieval Visualization methods

9. Language analysis

10. Thesaurus construction

11. Machine translation

12. Fully automatic approach

13. Feature extraction

14. Document summarization New area: multimedia document summarization

15. Document categorization & clustering

16. Categorization/clustering system

17. Information retrieval

18. IR: exact match

19. IR: fuzzy search

20. Document visualization

21. Document visualization

22. Document categorization A closer look

23. Measuring quality

24. Metrics Accuracy is the probability that a document chosen at random from the set D is classified correctly. Precision is the probability that a document chosen at random from those judged relevant is actually relevant. Recall is the probability that a document that is actually relevant is judged relevant by the system. Fallout, in turn, is the probability that a document that is not actually relevant is incorrectly judged relevant.
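The four measures on this slide (accuracy, precision, recall, fallout) can all be read off the counts of a two-class confusion matrix. The sketch below is only an illustration under that assumption: the names `Confusion` and `metrics`, and the example counts, are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for a binary relevant / not-relevant classifier (hypothetical helper)."""
    tp: int  # relevant documents judged relevant
    fp: int  # non-relevant documents judged relevant
    fn: int  # relevant documents judged non-relevant
    tn: int  # non-relevant documents judged non-relevant


def metrics(c: Confusion) -> dict:
    """Return accuracy, precision, recall and fallout as relative frequencies."""

    def ratio(num: int, den: int) -> float:
        # Report 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    total = c.tp + c.fp + c.fn + c.tn
    return {
        "accuracy": ratio(c.tp + c.tn, total),    # P(correct decision)
        "precision": ratio(c.tp, c.tp + c.fp),    # P(relevant | judged relevant)
        "recall": ratio(c.tp, c.tp + c.fn),       # P(judged relevant | relevant)
        "fallout": ratio(c.fp, c.fp + c.tn),      # P(judged relevant | not relevant)
    }


# Made-up counts for a set D of 100 documents.
example = metrics(Confusion(tp=40, fp=10, fn=20, tn=30))
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Returning 0.0 for an empty denominator is one possible convention; a system that never judges any document relevant has an undefined precision, and other conventions are equally defensible.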

25. Multiple class scenario

26. Categorization example

27. Document representations

28. Bigram example

29. Probabilistic interpretation

30. Positional representation

31. Creating positional representation

32. Examples

33. Processing representations

34. Expanding and trimming

35. Representation processing

36. Attribute selection

37. Attribute space remapping

38. Applications

39. Thank you Plato wrote in Phaedrus that the art of writing may be lethal to our knowledge and wisdom, as human beings will no longer rely on their memory and will therefore recall everything from potentially misleading external sources.
