COMP3410 DB32: Technologies for Knowledge Management
10: Introduction to Knowledge Discovery
By Eric Atwell, School of Computing, University of Leeds
(including re-use of teaching resources from other sources, esp. Knowledge Management by Stuart Roberts, School of Computing, University of Leeds)

What has Machine Learning got to do with Computing / Information Systems?
“Most international organizations produce more information in a week than many people could read in a lifetime” (Adriaans and Zantinge)

Objectives of knowledge discovery or data mining
Data mining is about discovering patterns in data. For this we need:
- KD/DM techniques, algorithms and tools, e.g. BootCat, WEKA
- A methodological framework to guide us in collecting data and applying the best algorithms: CRISP-DM

Data Mining, Knowledge Discovery, Text Mining
- Data Mining was originally about “learning” patterns from databases: data structured as records and fields
- Is Knowledge Discovery just an “exotic term” for DM?
- Increasingly, data is unstructured text (WWW), so Text Mining is a new subfield of DM, focussing on Knowledge Discovery from unstructured text data

define: data mining
Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.
en.wikipedia.org/wiki/Data_mining

define: text mining
Text mining, also known as intelligent text analysis, text data mining or knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics. ...
en.wikipedia.org/wiki/Text_mining

define: knowledge discovery
Knowledge discovery is the process of finding novel, interesting, and useful patterns in data. Data mining is a subset of knowledge discovery. It lets the data suggest new hypotheses to test.
www.purpleinsight.com/downloads/docs/visualizer_tutorial/glossary/go01.html
Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.
en.wikipedia.org/wiki/Knowledge_discovery

Data Mining: Overview
- Concepts, Instances or examples, Attributes
- Data Mining: Concept Descriptions
- Each instance is an example of the concept to be learned or described.
- The instance may be described by the values of its attributes.

Instances
- Input to a data mining algorithm is in the form of a set of examples, or instances.
- Each instance is represented as a set of features or attributes.
- Usually in DB data mining this set takes the form of a flat file: each instance is a record in the file, and each attribute is a field in the record.
- In text mining, an instance is a word/term in a corpus.
- The concepts to be learned are formed from patterns discovered within the set of instances.

Concepts
The types of concepts we try to ‘learn’ include:
- Key “differences”: terms specific to our domain corpus
- Clusters or ‘natural’ partitions, e.g. we might cluster customers according to their shopping habits
- Rules for classifying examples into pre-defined classes, e.g. “Mature students studying information systems with high grade for General Studies A level are likely to get a 1st class degree”
- General associations, e.g. “People who buy nappies are in general likely also to buy beer”
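
To make the "general associations" idea concrete, here is a minimal sketch of how the strength of a rule such as "nappies -> beer" is commonly measured, using support and confidence. The shopping baskets are invented toy data and the function name `rule_stats` is hypothetical. This is not how any particular tool (e.g. WEKA) implements association mining.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = fraction of all baskets containing both sides of the rule
    confidence = fraction of baskets containing the antecedent that also
                 contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_ante = sum(1 for b in baskets if antecedent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / len(baskets)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Invented toy data: each set is one customer's basket.
baskets = [
    {"nappies", "beer", "crisps"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"milk", "bread"},
]

support, confidence = rule_stats(baskets, {"nappies"}, {"beer"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A rule is usually only reported if both figures exceed thresholds chosen by the analyst, which is one way of filtering out associations that rest on very few examples.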

More concepts
The types of concepts we try to ‘learn’ include:
- Numerical prediction, e.g. look for rules to predict what salary a graduate will get, given A level results, age, gender, programme of study and degree result. This may give us an equation:
  Salary = a*A-level + b*Age + c*Gender + d*Prog + e*Degree
  (but are Gender and Programme really numbers???)
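
One common answer to the "are Gender and Programme really numbers?" question is to replace each nominal attribute with 0/1 indicator variables before using it in a linear equation. The sketch below assumes this indicator (dummy) coding. The category lists, the field names and all the weights are made up for illustration and are not estimated from any data.

```python
# Hypothetical category lists; the first entry of each acts as the baseline.
GENDERS = ["F", "M"]
PROGRAMMES = ["CS", "IS", "AI"]


def indicators(value, categories):
    """0/1 indicator per non-baseline category (the first category is dropped)."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories[1:]]


def encode(graduate):
    """Turn one instance into a numeric feature vector, with a leading 1 for the intercept."""
    return (
        [1.0, graduate["a_level"], graduate["age"], graduate["degree"]]
        + indicators(graduate["gender"], GENDERS)
        + indicators(graduate["programme"], PROGRAMMES)
    )


def predict(weights, features):
    """Linear prediction: the sum of weight * feature."""
    return sum(w * x for w, x in zip(weights, features))


# Made-up weights, in the order: intercept, A-level, age, degree,
# gender=M, programme=IS, programme=AI.
weights = [10000.0, 20.0, 100.0, 150.0, 0.0, 1000.0, 2000.0]

graduate = {"a_level": 300, "age": 21, "degree": 65, "gender": "F", "programme": "IS"}
features = encode(graduate)
salary = predict(weights, features)
print(features, salary)
```

Dropping one category per attribute keeps the indicators from duplicating the intercept. A nominal attribute with k values therefore contributes k-1 terms to the equation rather than a single term such as d*Prog.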

/usr/local/weka-3-4-5/data/weather.arff
@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
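
To show how an ARFF file maps onto instances and attributes, here is a minimal reader for the small subset of the format used in the weather file: `@relation`, `@attribute` with either `real` or a `{...}` list of nominal values, and comma-separated `@data` rows. It is an illustrative sketch only. The function name `parse_arff` is hypothetical, and the sketch does not handle quoted values, missing values, other attribute types or sparse data, all of which WEKA's own reader supports.

```python
SAMPLE = """\
@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
"""


def parse_arff(text):
    """Parse a very small subset of ARFF into (relation, attributes, rows).

    attributes is a list of (name, kind), where kind is "real" or a list of
    nominal values; rows is a list of dicts, one per instance.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                row[name] = float(value) if kind == "real" else value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            else:
                kind = kind.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), "attributes,", len(rows), "instances")
```

Each dict in `rows` is one instance (one record of the flat file), and each key is one attribute (one field of the record).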

Text mining example: discovering terms in a domain, using WWW-BootCat
“First catch your rabbit” (Mrs Beeton’s cookbook)
- Other tools are possible, but WWW-BootCat *should* be easier to use …
- First: sign up for a Domain, a SketchEngine account and a Google key; download seeds-en from http://corpus.leeds.ac.uk/internet.html (see coursework spec for URLs)

First collect your corpus
Use the Advanced Search option, with these parameter settings:
- Seed words: Serge Sharoff's seeds-en list of typical medium-frequency English words, http://corpus.leeds.ac.uk/internet/seeds-en
- Google key: set to the key which I set up beforehand at https://www.google.com/accounts/NewAccount
- Language: set to English
- Select URLs: ticked, so I can cut-and-paste the list of URLs to a text file (TO HAND IN WITH CW)
- Corpus name: set to EnglishUK (in my case), or English?? (change ?? to your Domain)
- Email address: set to USERNAME@comp.leeds.ac.uk
- Query Extension: set to site:.uk (in my case), or site:.?? (change ?? to your Domain)
- Other Advanced Options: left at default values (???)
... then click on Build a corpus!, follow the instructions as they appear, and (after some wait) download the corpus in raw and vertical formats (either direct from the URL, or wait for an email to tell you the URL…)

Problems?
- WWW-BootCat: log in; Advanced options: upload seeds-en, tick Select URLs, set site:.??; Build a corpus
- If it crashes (perhaps because of bad HTML in a website?), try again
- Download your corpus, because there is a 500,000-word quota: room for 2 corpora (only), so you can only compare 2 at a time in WWW-BootCat
- Or compare on your Linux account: /home/www/db32/cw/EnglishUS , EnglishUK

Comparing text corpora
- Aim: to find terms in C1 not in C2, and terms in C2 not in C1
- Sort C1 and C2 in vertical format (1 word per line), keeping one copy of each word, to give C1termlist and C2termlist:
  sort -u C1 > C1termlist; sort -u C2 > C2termlist
  diff C1termlist C2termlist
- BUT this shows LOTS of differences, many of them “not significant”: words with only 1 example (hapax legomena)
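
The same comparison can be sketched in Python with sets, which makes the "in C1 not in C2" aim explicit. This is an illustrative equivalent of the sort-and-diff approach, not part of BootCat. The two tiny corpora are invented, and the function names are hypothetical.

```python
def term_set(vertical_text):
    """Distinct terms in a corpus given in vertical format (one word per line)."""
    return {line.strip() for line in vertical_text.splitlines() if line.strip()}


def only_in_first(c1_text, c2_text):
    """Terms that occur in the first corpus but never in the second, sorted."""
    return sorted(term_set(c1_text) - term_set(c2_text))


# Invented toy corpora in vertical format.
c1 = "the\ncolour\nthe\nlorry\n"
c2 = "the\ncolor\ntruck\nthe\n"

print("C1 only:", only_in_first(c1, c2))
print("C2 only:", only_in_first(c2, c1))
```

As with the diff output, every word that happens to occur once in one corpus and never in the other is reported, so the result is dominated by rare words.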

Comparing “significant” terms
- Better: to find “significant” terms in C1 not in C2:
  sort C1 | uniq -c | sort -n -r > C1termlist
- This gives terms with their frequencies, most common first
- The lists can be compared “OLAP-style”: you can spot high-frequency words in one list but not the other
- Perhaps there is no need for further processing?
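
A rough Python counterpart of the `sort | uniq -c | sort -n -r` pipeline uses `collections.Counter`. The sketch also adds a hypothetical `min_count` threshold to pick out frequent words that are missing from the other corpus. The threshold and the toy corpora are assumptions for illustration, and ties are broken alphabetically here, which may differ from the order `sort -n -r` produces.

```python
from collections import Counter


def term_frequencies(vertical_text):
    """(term, count) pairs for a one-word-per-line corpus, most common first."""
    counts = Counter(
        line.strip() for line in vertical_text.splitlines() if line.strip()
    )
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def frequent_only_in_first(c1_text, c2_text, min_count=2):
    """Terms with at least min_count occurrences in C1 that never occur in C2."""
    seen_in_c2 = {term for term, _ in term_frequencies(c2_text)}
    return [
        (term, count)
        for term, count in term_frequencies(c1_text)
        if count >= min_count and term not in seen_in_c2
    ]


# Invented toy corpora in vertical format.
c1 = "the\ncolour\nthe\nlorry\ncolour\nqueue\n"
c2 = "the\ncolor\ntruck\nthe\ncolor\n"

print(term_frequencies(c1))
print(frequent_only_in_first(c1, c2))
```

Raising `min_count` removes the hapax legomena that swamp a plain diff of the two term lists.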

Comparing word-frequencies
- BootCat (and others, e.g. Paul Rayson) offer tools to compare frequencies of words, to find words used MUCH MORE in one corpus than another
- Several different metrics are available, e.g. “mutual information”, “normalised frequency difference”, …
- Not necessary for DB32 coursework (probably) … BUT I will be impressed if you do use these advanced metrics!
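
As a rough illustration of comparing frequencies across corpora of different sizes, the sketch below normalises each word count to occurrences per million words and ranks words by the difference. This is only one plausible reading of "normalised frequency difference". It is not claimed to be the formula used by BootCat or by Paul Rayson's tools, and the function names and toy token lists are invented.

```python
from collections import Counter


def per_million(count, corpus_size):
    """Frequency normalised to occurrences per million words."""
    return 1_000_000 * count / corpus_size


def frequency_differences(tokens1, tokens2):
    """Rank words by (normalised frequency in corpus 1) minus (that in corpus 2).

    Returns (word, difference) pairs, largest difference first; words used
    much more in corpus 1 come at the top, words used much more in corpus 2
    at the bottom.
    """
    if not tokens1 or not tokens2:
        raise ValueError("both corpora must be non-empty")
    f1, f2 = Counter(tokens1), Counter(tokens2)
    n1, n2 = len(tokens1), len(tokens2)
    diffs = {
        word: per_million(f1[word], n1) - per_million(f2[word], n2)
        for word in set(f1) | set(f2)
    }
    return sorted(diffs.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented toy corpora, already split into words.
tokens_c1 = ["the", "colour", "the", "lorry"]
tokens_c2 = ["the", "color", "truck", "the", "the"]

ranked = frequency_differences(tokens_c1, tokens_c2)
for word, diff in ranked:
    print(f"{word:8s} {diff:12.1f}")
```

Normalising matters because raw counts from a larger corpus would otherwise look "more frequent" for almost every word.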

Knowledge Discovery: Key points
- Knowledge Discovery (Data Mining) tools semi-automate the process of discovering patterns in data.
- Tools differ in terms of what concepts they discover (differences, key-terms, clusters, decision-trees, rules) …
- … and in terms of the output they provide (e.g. clustering algorithms provide a set of subclasses).
- Selecting the right tools for the job is based on business objectives: what is the USE for the knowledge discovered?

Self-test
You should be able to:
- Decide which is the appropriate data mining technique for a given problem defined in terms of business objectives.
- Decide which is the most appropriate form of output.