COMP3410 DB32: Technologies for Knowledge Management

10: Introduction to Knowledge Discovery
By Eric Atwell, School of Computing, University of Leeds (including re-use of teaching resources from other sources, esp. Knowledge Management by Stuart Roberts, School of Computing, University of Leeds)

What has Machine Learning got to do with Computing / Information Systems?
“Most international organizations produce more information in a week than many people could read in a lifetime” (Adriaans and Zantinge)

Objectives of knowledge discovery or data mining
Data mining is about discovering patterns in data. For this we need:
KD/DM techniques, algorithms, tools, e.g. BootCat, WEKA
A methodological framework to guide us in collecting data and applying the best algorithms: CRISP-DM

Data Mining, Knowledge Discovery, Text Mining
Data Mining was originally about “learning” patterns from databases: data structured as records and fields.
Knowledge Discovery is an “exotic term” for DM???
Increasingly, data is unstructured text (WWW), so Text Mining is a new subfield of DM, focussing on Knowledge Discovery from unstructured text data.

define: data mining
Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.
Source: en.wikipedia.org/wiki/Data_mining

define: text mining
Text mining, also known as intelligent text analysis, text data mining or knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics. ...
Source: en.wikipedia.org/wiki/Text_mining

define: knowledge discovery
Knowledge discovery is the process of finding novel, interesting, and useful patterns in data. Data mining is a subset of knowledge discovery. It lets the data suggest new hypotheses to test.
Source: www.purpleinsight.com/downloads/docs/visualizer_tutorial/glossary/go01.html
Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.
Source: en.wikipedia.org/wiki/Knowledge_discovery

Data Mining: Overview
Concepts, Instances or examples, Attributes
Data Mining
Concept Descriptions
Each instance is an example of the concept to be learned or described. The instance may be described by the values of its attributes.

Instances
Input to a data mining algorithm is in the form of a set of examples, or instances.
Each instance is represented as a set of features or attributes.
Usually in DB data mining this set takes the form of a flat file; each instance is a record in the file, each attribute is a field in the record.
In text mining, an instance is a word or term in a corpus.
The concepts to be learned are formed from patterns discovered within the set of instances.

Concepts
The types of concepts we try to ‘learn’ include:
Key “differences”: terms specific to our domain corpus
Clusters or ‘natural’ partitions, e.g. we might cluster customers according to their shopping habits
Rules for classifying examples into pre-defined classes, e.g. “Mature students studying information systems with a high grade for General Studies A level are likely to get a 1st class degree”
General associations, e.g. “People who buy nappies are in general likely also to buy beer”

More concepts
The types of concepts we try to ‘learn’ include:
Numerical prediction, e.g. look for rules to predict what salary a graduate will get, given A level results, age, gender, programme of study and degree result. This may give us an equation:
Salary = a*A-level + b*Age + c*Gender + d*Prog + e*Degree
(but are Gender and Programme really numbers???)
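One common way to handle the question of whether Gender and Programme are really numbers is to replace each nominal attribute with a set of 0/1 indicator values. The sketch below is only an illustration of that idea; the category lists and the function names `one_hot` and `encode_graduate` are invented for this example and do not come from the lecture.

```python
def one_hot(value, categories):
    """Return a list of 0/1 indicators, one per category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Hypothetical category lists, invented for illustration only.
GENDERS = ["female", "male"]
PROGRAMMES = ["computing", "information systems", "mathematics"]


def encode_graduate(a_level, age, gender, programme, degree):
    """Turn one graduate record into a purely numeric feature vector."""
    return (
        [a_level, age]
        + one_hot(gender, GENDERS)
        + one_hot(programme, PROGRAMMES)
        + [degree]
    )


print(encode_graduate(28, 21, "female", "information systems", 65))
# [28, 21, 1, 0, 0, 1, 0, 65]
```

Each indicator then gets its own coefficient in the equation. When the linear model also has a constant term, one indicator per attribute is usually dropped to avoid redundancy.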

DB Example: weather to play?

/usr/local/weka-3-4-5/data/weather.arff
@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
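To make the link between the ARFF file and the earlier vocabulary concrete (each data row is an instance, each @attribute is a field), here is a minimal sketch of reading such a file in Python. It is not WEKA's own loader: `parse_arff` is a hypothetical helper written for this example, and it assumes a simple file with no quoted values, sparse rows or missing-value markers. Only the first three data rows are included to keep it short.

```python
ARFF_TEXT = """\
@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
"""


def parse_arff(text):
    """Parse a simple ARFF file: returns (attribute names, list of instance dicts)."""
    names, types, instances = [], [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and ARFF comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            names.append(name)
            types.append(kind.strip())
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            values = [v.strip() for v in line.split(",")]
            record = {}
            for name, kind, value in zip(names, types, values):
                # numeric attributes become floats; nominal ones stay as strings
                numeric = kind.lower() in ("real", "numeric")
                record[name] = float(value) if numeric else value
            instances.append(record)
    return names, instances


names, instances = parse_arff(ARFF_TEXT)
print(names)         # ['outlook', 'temperature', 'humidity', 'windy', 'play']
print(instances[0])
```

Each dictionary in `instances` is one record, keyed by attribute name, which is the flat-file view of the data that a DB data mining algorithm takes as input.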

Text mining example: discovering terms in a domain, using WWW-BootCat
“First catch your rabbit” (Mrs Beeton’s cookbook):
Other tools are possible, but WWW-BootCat *should* be easier to use …
First: sign up for Domain, SketchEngine account, Google key; download seeds-en from http://corpus.leeds.ac.uk/internet.html (see coursework spec for URLs)

First collect your corpus
Advanced Search option with parameter settings:
using Serge Sharoff's seeds-en (http://corpus.leeds.ac.uk/internet/seeds-en), a list of typical medium-frequency English words, as seed-words
Google key set to the key which I set up beforehand at https://www.google.com/accounts/NewAccount
Language set to English
Select URLs ticked, so I can cut-and-paste the list of URLs to a text file (TO HAND IN WITH CW)
Corpus name set to EnglishUK (in my case), or English?? (change ?? to your Domain)
email address set to USERNAME@comp.leeds.ac.uk
Query Extension set to site:.uk (in my case), or site:.?? (change ?? to your Domain)
other Advanced Options left at default values...???
... then click on Build a corpus!, follow instructions as they appear, and (after some wait) download the corpus in raw and vertical formats (either direct from URL or wait for email to tell you URL…)

Problems?
WWW-BootCat: log in; Advanced options: upload seeds-en, check URLs, site:.??; Build Corpus
If it crashes (bad HTML in a website?), try again
Download your corpus, because of the 500,000-word quota: there is room for 2 corpora (only), so you can only compare 2 at a time in WWW-BootCat
Or compare on your Linux account: /home/www/db32/cw/EnglishUS , EnglishUK

Comparing text corpora
Aim: to find terms in C1 not in C2, and terms in C2 not in C1
Sort C1, C2 in vertical format (1 word per line) to give C1termlist, C2termlist:
sort C1 > C1termlist; sort C2 > C2termlist
diff C1termlist C2termlist
BUT this shows LOTS of differences, many of them “not significant”: terms with only 1 example (hapax legomena)
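The same aim (terms in C1 not in C2, and the reverse) can be sketched in Python with set difference instead of sort and diff. This is an illustrative alternative, not part of the coursework tools; the helper names and the two tiny corpora are invented for the example.

```python
def vocabulary(vertical_text):
    """Set of distinct terms in a corpus in vertical format (one word per line)."""
    return {line.strip() for line in vertical_text.splitlines() if line.strip()}


def only_in_first(c1, c2):
    """Terms that occur in c1 but never in c2, in alphabetical order."""
    return sorted(vocabulary(c1) - vocabulary(c2))


# Tiny invented corpora, one word per line.
c1 = "the\ncolour\nof\nthe\nlorry\n"
c2 = "the\ncolor\nof\nthe\ntruck\n"

print(only_in_first(c1, c2))  # ['colour', 'lorry']
print(only_in_first(c2, c1))  # ['color', 'truck']
```

Like the diff approach, this treats a term seen once the same as a term seen a thousand times, so hapax legomena still dominate the output on a real corpus.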

Comparing “significant” terms
Better: to find “significant” terms in C1 not in C2
sort C1 | uniq -c | sort -n -r > C1termlist
This gives terms with frequencies, most common first
The lists can be compared “OLAP-style”: you can spot high-frequency words in one list but not the other
No need for further processing?
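For comparison, the sort | uniq -c | sort -n -r pipeline has a close Python equivalent in `collections.Counter`. The sketch below is one way to write it; the function names and the `min_count` threshold (a hypothetical way to hide hapax legomena) are additions for this example.

```python
from collections import Counter


def term_frequencies(vertical_text):
    """Count how often each term occurs in a vertical-format corpus."""
    return Counter(line.strip() for line in vertical_text.splitlines() if line.strip())


def frequency_list(vertical_text, min_count=1):
    """List of (term, count) pairs, most common first, dropping rare terms."""
    counts = term_frequencies(vertical_text)
    return [(term, n) for term, n in counts.most_common() if n >= min_count]


corpus = "the\ncat\nsat\non\nthe\nmat\nthe\ncat\n"
print(frequency_list(corpus, min_count=2))  # [('the', 3), ('cat', 2)]
```

Terms with equal counts may come out in a different order from the shell pipeline, since `most_common` keeps ties in first-seen order.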

Comparing word-frequencies
BootCat (and others, e.g. Paul Rayson) offer tools to compare frequencies of words, to find words used MUCH MORE in one corpus than another
Several different metrics are available, e.g. “mutual information”, “normalised frequency difference”, …
Not necessary for DB32 coursework (probably) … BUT I will be impressed if you do use these advanced metrics!
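As a rough illustration of what a frequency-comparison metric does, the sketch below normalises each term's count to occurrences per million words and ranks terms by the difference between the two corpora. This is a simple measure chosen for the example and is not claimed to be the exact formula used by BootCat or any other tool; the function names and sample data are invented.

```python
from collections import Counter


def per_million(counts):
    """Convert raw counts to frequencies per million words."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n * 1_000_000 / total for term, n in counts.items()}


def frequency_differences(c1_words, c2_words):
    """Rank terms by (normalised frequency in C1) minus (normalised frequency in C2)."""
    f1 = per_million(Counter(c1_words))
    f2 = per_million(Counter(c2_words))
    diffs = {t: f1.get(t, 0.0) - f2.get(t, 0.0) for t in set(f1) | set(f2)}
    # largest positive difference first; ties broken alphabetically
    return sorted(diffs.items(), key=lambda item: (-item[1], item[0]))


# Tiny invented corpora as word lists.
c1 = "the colour of the lorry".split()
c2 = "the color of the truck".split()

ranked = frequency_differences(c1, c2)
print(ranked[:2])   # [('colour', 200000.0), ('lorry', 200000.0)]
print(ranked[-2:])  # [('color', -200000.0), ('truck', -200000.0)]
```

Normalising by corpus size matters because two downloaded corpora will rarely contain the same number of words, so raw counts are not directly comparable.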

Knowledge Discovery: Key points
Knowledge Discovery (Data Mining) tools semi-automate the process of discovering patterns in data.
Tools differ in terms of what concepts they discover (differences, key-terms, clusters, decision-trees, rules) …
… and in terms of the output they provide (e.g. clustering algorithms provide a set of subclasses)
Selecting the right tools for the job is based on business objectives: what is the USE for the knowledge discovered?

Self-test
You should be able to:
Decide which is the appropriate data mining technique for a given problem defined in terms of business objectives.
Decide which is the most appropriate form of output.
