1. Text Mining Overview Piotr Gawrysiak
gawrysia@ii.pw.edu.pl
Warsaw University of Technology
Data Mining Group
2. Topics Natural Language Processing
Text Mining vs. Data Mining
The toolbox
Language processing methods
Single document processing
Document corpora processing
Document categorization – a closer look
Applications
Classic
Profiled document delivery
Related areas
Web Content Mining & Web Farming
3. Natural Language Processing Natural language – test for Artificial Intelligence
Alan Turing
NLP and NLU
4. Information explosion
5. Data Mining The information explosion is a problem not only for text but for all other kinds of data as well. Here data mining comes to the rescue.
6. Knowledge pyramid
7. Text Mining – a definition TM can be described as a statistical method, because KDD is mostly based on statistics.
8. Text Mining tools Linguistic analysis
Thesauri, dictionaries, grammar analysers etc.
Machine translation
Automatic feature extraction
Automatic summarization
Document categorization
Document clustering
Information retrieval
Visualization methods
9. Language analysis
10. Thesaurus construction
11. Machine translation
12. Fully automatic approach
13. Feature extraction
14. Document summarization New area – multimedia document summarization
15. Document categorization & clustering
16. Categorization/clustering system
17. Information retrieval
18. IR – exact match
19. IR – fuzzy search
20. Document visualization
21. Document visualization
22. Document categorization A closer look
23. Measuring quality
24. Metrics Accuracy is the probability that a document chosen at random from the set D is classified correctly. Precision is the probability that a document chosen at random from those judged relevant is actually relevant. Recall is the probability that a document that is actually relevant is recognized as such by the system. Fallout, in turn, is the probability that a document that is actually not relevant is incorrectly judged relevant.
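The four measures in the note above (accuracy, precision, recall, fallout) can be sketched in code for the binary relevant / not relevant case. This is a minimal illustration, not taken from the slides: the function names `confusion_counts` and `classification_metrics` and the sample labels are hypothetical. It assumes that each document in D has a true label and a label assigned by the system.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(actual: Sequence[bool], predicted: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Count (tp, fp, fn, tn) for a binary relevant / not relevant decision."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return tp, fp, fn, tn


def _ratio(numerator: int, denominator: int) -> float:
    # Assumption for this sketch: an undefined ratio (empty denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute the four measures from confusion-matrix counts."""
    return {
        # P(correct decision) for a random document from D
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        # P(actually relevant | judged relevant)
        "precision": _ratio(tp, tp + fp),
        # P(judged relevant | actually relevant)
        "recall": _ratio(tp, tp + fn),
        # P(judged relevant | actually not relevant)
        "fallout": _ratio(fp, fp + tn),
    }


# Hypothetical example: six documents, True means "relevant".
actual = [True, True, True, False, False, False]
predicted = [True, True, False, True, False, False]
metrics = classification_metrics(*confusion_counts(actual, predicted))
print(metrics)
```

Precision and recall usually trade off against each other, which is why the measures are reported together rather than relying on accuracy alone.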
25. Multiple class scenario
26. Categorization example
27. Document representations
28. Bigram example
29. Probabilistic interpretation
30. Positional representation
31. Creating positional representation
32. Examples
33. Processing representations
34. Expanding and trimming
35. Representation processing
36. Attribute selection
37. Attribute space remapping
38. Applications
39. Thank you Plato wrote in Phaedrus that the art of writing may be lethal to our knowledge and wisdom, as human beings will no longer rely on their memory and will instead recall everything from potentially misleading external sources.