Introduction to Probability Theory in Machine Learning: A Bird View Mohammed Nasser Professor, Dept. of Statistics, RU,Bangladesh Email: mnasser.ru@gmail.com
Content of Our Present Lecture • Introduction • Problem of Induction and Role of Probability • Techniques of Machine Learning • Density Estimation • Data Reduction • Classification and Regression Problems • Probability in Classification and Regression • Introduction to Kernel Methods
Introduction The problem of searching for patterns in data is the basic problem of science. • The extensive astronomical observations of Tycho Brahe (1546–1601) in the 16th century allowed Johannes Kepler (1571–1630) to discover the empirical laws of planetary motion, which in turn provided a springboard for the development of classical mechanics.
Introduction • Darwin's (1809–1882) study of nature on the five-year voyage of HMS Beagle revolutionized biology. • The discovery of regularities in atomic spectra played a key role in the development and verification of quantum physics in the early twentieth century. Of late, the field of pattern recognition has become concerned with the automatic discovery of regularities in data through the use of computer algorithms, and with the use of these regularities to take actions such as classifying the data into different categories.
Problem of Induction • The inductive inference process: • Observe a phenomenon • Construct a model of the phenomenon • Make predictions →This is more or less the definition of natural sciences ! →The goal of Machine Learning is to automate this process →The goal of Learning Theory is to formalize it.
Problem of Induction Let us suppose somehow we have measurements x1, x2, …, xn, where n is very large, e.g. n = 10^10000000. Each of x1, x2, …, xn satisfies a proposition P. Can we say that the (n+1)-th, i.e. the (10^10000000 + 1)-th, observation satisfies P? Certainly? … No.
Problem of Induction Let us consider P(n) = 10^10000000 − n. The question: is P(n) > 0? It is positive up to a very, very large number, but after that it becomes negative. What can we do now? A probabilistic framework to the rescue!
Problem of Induction • What is the probability p that the sun will rise tomorrow? • p is undefined, because there has never been an experiment that tested the existence of the sun tomorrow. • p = 1, because the sun rose in all past experiments. • p = 1 − ε, where ε is the proportion of stars that explode per day. • p = (d+1)/(d+2), which is Laplace's rule derived from Bayes' rule (d = number of past days the sun rose). Conclusion: We predict that the sun will rise tomorrow with high probability, independent of the justification.
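The last candidate answer, Laplace's rule of succession, is easy to compute directly. A minimal sketch (the function name is ours, chosen for illustration):

```python
def laplace_rule(d):
    """Laplace's rule of succession: probability the sun rises tomorrow,
    given it rose on each of the past d days, is (d + 1) / (d + 2)."""
    return (d + 1) / (d + 2)

# with no past observations the rule gives the uniform prior guess 1/2,
# and as d grows the probability approaches (but never reaches) 1
p0 = laplace_rule(0)
p_big = laplace_rule(998)
```

Note that the rule never outputs exactly 1: certainty is never reached, which is the point of the slide.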
The Sub-Fields of ML • Supervised Learning: Classification, Regression • Unsupervised Learning: Clustering, Density estimation, Data reduction • Reinforcement Learning
Unsupervised Learning: Density Estimation What is the weight of an elephant? What is the weight/distance of the sun? What is the weight of a DNA molecule? What is the weight/size of a baby in the womb?
Solution of the Classical Problem Let us suppose somehow we have measurements x1, x2, …, xn. One-million-dollar question: How can we choose the optimum one among infinitely many possible alternatives for combining these n observations to estimate the target, μ? What is the optimum n?
We need the concepts: probability measures, probability distributions, … (μ: the target we want to estimate; xi: the i-th observation.)
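The classical way of combining the n observations is the sample mean, x̄ = (1/n)·Σ xi, which converges to the target μ under mild conditions. A minimal simulation sketch (the true μ = 5, the noise level, and the sample size are made-up illustrative values):

```python
import random

random.seed(0)
mu = 5.0                                        # the (normally unknown) target
xs = [random.gauss(mu, 2.0) for _ in range(100_000)]  # n noisy measurements

# combine the n observations with the sample mean
xbar = sum(xs) / len(xs)
```

With n = 100,000 the standard error of x̄ is about 2/√n ≈ 0.006, so the estimate lands very close to μ.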
Meaning of Measure A probability measure P is defined on a sample space Ω, with P(Ω) = 1, and is countably additive: whenever An ↑ A, P(An) → P(A). Probability measures are either discrete (P(A) = 1 for some finite or countable set A) or continuous (P{x} = 0 for every point x); continuous measures are in turn absolutely continuous or non-absolutely continuous.
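In the discrete case a probability measure on a finite sample space is just an assignment of point masses summing to 1, and additivity over disjoint events holds by construction. A sketch on a coin-toss space (the space and masses here are ours, purely for illustration):

```python
# a discrete probability measure on the finite sample space {"H", "T"}
P = {"H": 0.5, "T": 0.5}

def prob(event):
    """P(event) for an event given as a subset of the sample space."""
    return sum(P[w] for w in event)
```

The defining properties can be checked directly: the whole space gets measure 1, the empty event gets 0, and disjoint events add.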
Discrete Distributions on R^k We have special concepts.
Different Shapes of the Models Is the sample mean appropriate for all the models?
"I know the population" means I know Pr(a < X < b) for every a and b.
Approaches of Model Estimation • Bayesian vs Non-Bayesian • Parametric • Nonparametric: cdf estimation, density estimation • Semiparametric
Infinite-dimensional Ignorance Generally any function space is infinite-dimensional Parametric modeling assumes our ignorance is finite-dimensional Semi-parametric modeling assumes our ignorance has two parts: one finite-dimensional and the other, infinite-dimensional Non-parametric modeling assumes our ignorance is infinite-dimensional
Semiparametric/Robust Density Estimation (Figures: two parametric models and a nonparametric model)
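The standard nonparametric density estimator is the kernel density estimate, f̂(x) = (1/nh)·Σ K((x − xi)/h), here with a Gaussian kernel K. A self-contained sketch (the bandwidth h = 0.3 and the simulated N(0,1) sample are illustrative choices, not a recommendation):

```python
import math
import random

def gaussian_kde(data, h):
    """Return the kernel density estimate f_hat with Gaussian kernel
    and bandwidth h: f_hat(x) = (1/(n*h)) * sum(K((x - xi)/h))."""
    n = len(data)
    c = 1.0 / (n * h * math.sqrt(2 * math.pi))
    def f_hat(x):
        return c * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return f_hat

random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]
f_hat = gaussian_kde(sample, h=0.3)
```

Near the mode of the true N(0,1) density the estimate should be close to 1/√(2π) ≈ 0.40, and far in the tails it should be nearly zero.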
Application of Density Estimation Picture of Three Objects
Curse of Dimension Courtesy: Bishop(2006)
Unsupervised Learning: Data Reduction If the population model is multivariate normal with high correlation, it works well.
Unsupervised Learning: Data Reduction (Figure: a one-dimensional manifold)
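The classical data-reduction method alluded to here is principal component analysis: project the data onto the leading eigenvector of the sample covariance matrix. For two highly correlated variables the 2×2 eigenproblem has a closed form; the simulated data below are ours, purely for illustration:

```python
import math
import random

def leading_eigvec_2x2(a, b, c):
    """Unit leading eigenvector of the symmetric matrix [[a, b], [b, c]]."""
    tr, det = a + c, a * c - b * b
    lam = tr / 2 + math.sqrt(tr * tr / 4 - det)   # largest eigenvalue
    if abs(b) > 1e-12:
        v = (b, lam - a)                           # solves (A - lam*I)v = 0
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(*v)
    return (v[0] / norm, v[1] / norm)

random.seed(2)
# highly correlated 2-D cloud: the second coordinate nearly equals the first
pts = [(z, z + random.gauss(0.0, 0.1))
       for z in (random.gauss(0.0, 1.0) for _ in range(5000))]

n = len(pts)
mx = sum(p[0] for p in pts) / n
my = sum(p[1] for p in pts) / n
a = sum((p[0] - mx) ** 2 for p in pts) / n          # var(x)
c = sum((p[1] - my) ** 2 for p in pts) / n          # var(y)
b = sum((p[0] - mx) * (p[1] - my) for p in pts) / n  # cov(x, y)
v = leading_eigvec_2x2(a, b, c)                      # first principal direction
```

With correlation near 1, the first principal direction comes out close to (1/√2, 1/√2): the cloud is essentially one-dimensional, so one coordinate along v captures almost all the variance.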
Problem-2 Fisher's Iris Data (1936): This data set gives the measurements in cm (or mm) of the variables sepal length, sepal width, petal length, petal width, and species (setosa, versicolor, and virginica). There are 150 observations, 50 from each species. We want to predict the class of a new observation. What method is available to do the job? LOOK! THE DEPENDENT VARIABLE IS CATEGORICAL*** THE INDEPENDENT VARIABLES ARE CONTINUOUS***
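One simple method for this job is the nearest-centroid classifier: represent each species by the mean of its training vectors and assign a new flower to the class whose centroid is closest. A sketch on a few made-up iris-like measurements (NOT the real Fisher data):

```python
def nearest_centroid_fit(X, y):
    """Compute one centroid (coordinate-wise mean) per class label."""
    centroids = {}
    for c in sorted(set(y)):
        pts = [x for x, lab in zip(X, y) if lab == c]
        centroids[c] = tuple(sum(col) / len(pts) for col in zip(*pts))
    return centroids

def nearest_centroid_predict(centroids, x):
    """Assign x to the class with the nearest centroid (squared distance)."""
    def d2(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return min(centroids, key=lambda c: d2(centroids[c], x))

# tiny illustrative 4-D measurements, in the spirit of the iris variables
X = [(5.0, 3.5, 1.4, 0.2), (4.9, 3.0, 1.3, 0.2),   # "setosa"-like
     (6.4, 3.2, 4.5, 1.5), (6.9, 3.1, 4.9, 1.5)]   # "versicolor"-like
y = ["setosa", "setosa", "versicolor", "versicolor"]
model = nearest_centroid_fit(X, y)
```

This is close in spirit to Fisher's own linear discriminant analysis on these data, but with the covariance structure ignored for simplicity.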
Problem-3 BDHS (2004): The dependent variable is childbearing risk, with two values (High Risk and Low Risk). The target is to predict the childbearing risk based on some socioeconomic and demographic variables. The complete list of the variables is given in the next slide. Again we are in a situation where the dependent variable is categorical and the independent variables are mixed.
Problem-4 Face Authentication (/ Identification) • Face Authentication/Verification (1:1 matching) • Face Identification/Recognition (1:N matching)
Applications Access Control www.viisage.com www.visionics.com
Applications Video Surveillance (On-line or off-line) Face Scan at Airports www.facesnap.de
Why is Face Recognition Hard? Inter-class Similarity Twins Father and son Intra-class Variability
Handwritten digit recognition We want to recognize the postal codes automatically.
Problem 6: Credit Risk Analysis • Typical customer: a bank. • Database: current clients' data, including: • basic profile (income, house ownership, delinquent accounts, etc.) • basic classification. • Goal: predict/decide whether to grant credit.
Problem 7: Spam Email Detection, Search Engines etc. traction.tractionsoftware.com www.robmillard.com
Problem 9: Genome-wide data mRNA expression data, hydrophobicity data, protein-protein interaction data, sequence data (gene, protein)
Problem 10: Robot control • Goal: Control a robot in an unknown environment. • Needs both • to explore (new places and actions) • to use acquired knowledge to gain benefits. • The learning task "controls" what is observed!
Problem-11 Wisconsin Breast Cancer Database (1992): This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. You can also get it from (http://www.potschi.de/svmtut/breast-cancer-wisconsin.data). The variables (each scored 1–10) are: Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses; and Status (benign or malignant). There are 699 observations available. Now we want to predict the status of a patient: whether it is benign or malignant. THE DEPENDENT VARIABLE IS CATEGORICAL. Independent variables???
Problem 12: Data Description Després et al. pointed out that the topography of adipose tissue (AT) is considered a risk factor for cardiovascular disease. Cardiovascular diseases affect the heart and blood vessels and include shock, heart failure, heart valve disease, congenital heart disease, etc. It is important to measure the amount of intra-abdominal AT as part of the evaluation of an individual's cardiovascular-disease risk. (Figure: adipose tissue)
Data Description Problem: Computed tomography of AT is ---- very costly ----- requires irradiation of the subject ----- not available to many physicians. • Materials: simple anthropometric measurements, such as waist circumference, which can be obtained cheaply and easily. • Variables: Y = deep abdominal AT, X = waist circumference (in cm). • Total observations: 109 (men). • Data source: W. W. Daniel (2003). How well can we predict and estimate deep abdominal AT from knowledge of waist circumference? (Figure: waist circumference)
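Predicting Y from X here is ordinary least-squares regression, Y ≈ β0 + β1·X with β1 = Sxy/Sxx and β0 = ȳ − β1·x̄. A sketch on hypothetical waist/AT pairs (NOT the Daniel (2003) data; the numbers are made up to lie exactly on a line so the fit is easy to check):

```python
def ols_fit(xs, ys):
    """Simple linear regression: return (intercept b0, slope b1)
    minimizing the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

# hypothetical waist circumference (cm) and deep abdominal AT values
waist = [70, 80, 90, 100, 110]
at    = [40, 70, 100, 130, 160]
b0, b1 = ols_fit(waist, at)
```

Because these illustrative points are exactly collinear, the fit recovers the line y = 3x − 170; on real data the residuals would be nonzero and one would also report R² and a standard error.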
Complex Problem 13 Hypothesis: The infant's size at birth is associated with the maternal characteristics and SES. Variables (X, maternal & SES): 1. Age (x1) 2. Parity (x2) 3. Gestational age (x3) 4. Mid-upper arm circumference, MUAC (x4) 5. Supplementation group (x5) 6. SES index (x6). CCA, KCCA, MR, PLS etc. give us some solutions to this complex problem.
Data • Vectors: collections of features, e.g. height, weight, blood pressure, age, …; can map categorical variables into vectors • Matrices: images, movies, remote sensing and satellite data (multispectral) • Strings: documents, gene sequences • Structured objects: XML documents, graphs
Let Us Summarize!! Classification (reminder) Y = g(X), X → Y • X can be anything: continuous (ℝ, ℝ^d, …), discrete ({0,1}, {1,…,k}, …), structured (tree, string, …), … • Y is discrete: {0,1} binary, {1,…,k} multi-class, tree etc. structured
Classification (reminder) Perceptron, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, Kernel trick • X can be anything: continuous (ℝ, ℝ^d, …), discrete ({0,1}, {1,…,k}, …), structured (tree, string, …), …
Regression Y = g(X), X → Y • X can be anything: continuous (ℝ, ℝ^d, …), discrete ({0,1}, {1,…,k}, …), structured (tree, string, …), … • Y is continuous: ℝ, ℝ^d (not always)
Regression Perceptron, Normal Regression, Support Vector Regression, GLM, Kernel trick • X can be anything: continuous (ℝ, ℝ^d, …), discrete ({0,1}, {1,…,k}, …), structured (tree, string, …), …