This course provides an overview of probability theory, introduces the R programming language, and covers exploratory data analysis, statistical inference, maximum likelihood estimation, regression, multivariate analysis, cluster analysis, nonparametric statistics, model selection, goodness of fit, Bayesian inference, and spatial statistics.
Overview G. Jogesh Babu
Probability theory • Probability is all about the flip of a coin • Conditional probability & Bayes theorem (Bayesian analysis) • Expectation, variance, standard deviation (unit-free estimates) • Density of a continuous random variable (as opposed to density as defined in physics) • Normal (Gaussian) distribution, Chi-square distribution (not the Chi-square statistic) • Probability inequalities and the CLT
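Bayes theorem from the slide above can be sketched in a few lines. This is a minimal illustration in Python (standard library only), not material from the course labs, which use R. The coin hypotheses, the prior, and the function name `posterior` are assumptions made up for the example.

```python
from fractions import Fraction


def posterior(prior, likelihood):
    """Bayes theorem over a finite set of hypotheses.

    prior:      dict mapping hypothesis -> P(H)
    likelihood: dict mapping hypothesis -> P(data | H)
    Returns a dict mapping hypothesis -> P(H | data).
    """
    # Total probability of the data, summed over all hypotheses.
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}


# Hypothetical setup: a coin is either fair or two-headed, equally likely a priori.
prior = {"fair": Fraction(1, 2), "two_headed": Fraction(1, 2)}
# Probability of seeing three heads in a row under each hypothesis.
likelihood = {"fair": Fraction(1, 2) ** 3, "two_headed": Fraction(1)}

post = posterior(prior, likelihood)
print(post)
```

Exact fractions are used so that the posterior probabilities sum to exactly one; floats would work equally well for real data.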
R Programming environment • Introduction to the R programming language • R is an integrated suite of software facilities for data manipulation, calculation and graphical display. • Commonly used techniques, such as graphical description, tabular description, and summary statistics, are illustrated in R.
Exploratory Data Analysis An approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to: • maximize insight into a data set • uncover underlying structure • extract important variables • detect outliers and anomalies • formulate hypotheses worth testing • develop parsimonious models • provide a basis for further data collection through surveys or experiments
Statistical Inference • While Exploratory Data Analysis provides tools to understand what the data show, statistical inference helps in reaching conclusions that extend beyond the immediate data alone. • Statistical inference helps in judging whether an observed difference between groups is dependable or one that might have happened by chance in a study. • Topics include: • Point estimation • Confidence intervals for unknown parameters • Principles of testing of hypotheses
Maximum Likelihood Estimation • Likelihood differs from probability • Probability refers to the occurrence of future events • Likelihood refers to past events with known outcomes • MLE is used for fitting a mathematical model to data. • Modeling real-world data by maximum likelihood offers a way of tuning the free parameters of the model to provide a good fit.
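The idea of tuning a free parameter to maximize the likelihood can be sketched as follows. This is an illustrative Python example under assumed conditions, not the course's own code: the exponential model, the simulated data, and the names `exp_log_likelihood` and `exp_mle` are all chosen here for the illustration.

```python
import math
import random


def exp_log_likelihood(rate, data):
    """Log-likelihood of an exponential model: n*log(rate) - rate*sum(x)."""
    return len(data) * math.log(rate) - rate * sum(data)


def exp_mle(data):
    """Closed-form maximizer of the log-likelihood above: 1 / sample mean."""
    return len(data) / sum(data)


random.seed(0)
true_rate = 2.0  # assumed value, used only to simulate data
data = [random.expovariate(true_rate) for _ in range(1000)]

rate_hat = exp_mle(data)

# Crude numerical cross-check: scan a grid of candidate rates and keep the
# one with the highest log-likelihood. It should land next to rate_hat.
grid = [0.5 + 0.01 * k for k in range(400)]
best_on_grid = max(grid, key=lambda r: exp_log_likelihood(r, data))

print(rate_hat, best_on_grid)
```

For models without a closed-form maximizer, the grid scan would be replaced by a numerical optimizer; the exponential case is used here because the analytic answer makes the check easy.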
Regression • Basic Concepts in Regression • Bias-Variance Tradeoff • Linear Regression • Nonparametric Regression • Local Polynomial Regression • Confidence Bands • Splines
Linear regression issues in astronomy • Compares different regression lines used in astronomy • Illustrates them with the Faber-Jackson relation • Measurement error models are also discussed
Multivariate analysis • Analysis of data on two or more attributes (variables) that may depend on each other • Principal components analysis, to reduce the number of variables • Canonical correlation • Tests of hypotheses • Confidence regions • Multivariate regression • Discriminant analysis (supervised learning) • Computational aspects are covered in the lab
Cluster Analysis • Data mining techniques • Classifying data into clusters • k-means • Model-based clustering • Single linkage (friends of friends) • Complete linkage clustering algorithm
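Single linkage cut at a fixed distance is equivalent to the friends-of-friends rule: two points belong to the same cluster if a chain of neighbours, each within the linking length of the next, connects them. The sketch below illustrates this in Python with a small union-find structure. The function name, the points, and the linking length are hypothetical values chosen for the example.

```python
import math


def friends_of_friends(points, linking_length):
    """Group points into clusters by the friends-of-friends rule.

    Returns a sorted list of clusters, each a list of point indices.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of points closer than the linking length.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= linking_length:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up 2D points: a chain of three, a pair, and an isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.0), (9.0, 0.0)]
clusters = friends_of_friends(points, linking_length=0.6)
print(clusters)
```

Points 0 and 2 are farther apart than the linking length but still end up together because point 1 links them, which is the chaining behaviour single linkage is known for. The pairwise loop is quadratic in the number of points, so it suits only small data sets.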
Nonparametric Statistics • These statistical procedures make few or no assumptions about the form of the population's probability distribution. • The model structure is not specified a priori but is instead determined from data. • As nonparametric methods make fewer assumptions, their applicability is much wider • Procedures described include: • Sign test • Mann-Whitney two-sample test • Kruskal-Wallis test for comparing several samples • Density Estimation
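As an illustration of the first procedure listed, here is a minimal Python sketch of the exact two-sided sign test for a hypothesized median. The sample values are made up, and dropping observations equal to the hypothesized median is one common convention, assumed here.

```python
from math import comb


def sign_test(data, median0):
    """Exact two-sided sign test of H0: population median == median0.

    Returns (number above, number below, p-value).
    Observations equal to median0 are dropped.
    """
    pos = sum(1 for x in data if x > median0)
    neg = sum(1 for x in data if x < median0)
    n = pos + neg
    k = min(pos, neg)
    # Under H0 the number of positive signs is Binomial(n, 1/2).
    # P(X <= k), doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return pos, neg, min(1.0, 2 * tail)


sample = [1.2, 3.4, 2.2, 5.1, 4.8, 3.9, 6.0, 2.9, 4.1, 5.5]  # made-up values
pos, neg, p_value = sign_test(sample, median0=2.0)
print(pos, neg, p_value)
```

Only the signs of the differences enter the test, so no assumption about the shape of the distribution is needed beyond continuity.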
Bootstrap • How to get the most out of repeated use of the data. • The bootstrap is similar to the Monte Carlo method, but the 'simulation' is carried out from the data itself. • A very general, mostly nonparametric procedure that is widely applicable. • Applications to regression, cases where the procedure fails, and cases where it outperforms traditional procedures will also be discussed
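The phrase "simulation carried out from the data itself" can be made concrete with a percentile bootstrap confidence interval. The sketch below is an illustrative Python example, not the course's lab code. The simulated sample, the choice of the median as the statistic, and the name `bootstrap_ci` are assumptions made here.

```python
import random
import statistics


def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the data with replacement and recompute the statistic each time.
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Simulated sample standing in for real observations (assumed for illustration).
data_rng = random.Random(1)
data = [data_rng.gauss(10.0, 2.0) for _ in range(50)]

lo, hi = bootstrap_ci(data, statistics.median)
print(statistics.median(data), (lo, hi))
```

The same function works for any statistic that takes a list of values, which reflects how general the procedure is. The percentile interval is the simplest variant and is not the most accurate in every setting.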
Model selection • Chi-square test • Wald Test • Rao's score test • Likelihood ratio test • AIC, BIC
Goodness of Fit • Curve (model) fitting or goodness of fit using the bootstrap procedure. • Procedures like the Kolmogorov-Smirnov test do not work in the multidimensional case, or when the parameters of the curve are estimated. • The bootstrap comes to the rescue • Some of these procedures are illustrated using R in a lab session on hypothesis testing and bootstrapping
Bayesian Inference • As evidence accumulates, the degree of belief in a hypothesis ought to change • Bayesian inference takes prior knowledge into account • The quality of a Bayesian analysis depends on how well one can convert the prior information into a mathematical prior probability • Methods for parameter estimation, model assessment, etc. • Illustrations with examples from astronomy
Spatial Statistics • Spatial Point Processes • Gaussian Processes (Inference and computational aspects) • Modeling Lattice Data • Homogeneous and inhomogeneous Poisson processes • Estimation of Ripley's K function (useful for point pattern analysis) • Cox Process (doubly stochastic Poisson Process) • Markov Point Processes
Time Series • Time domain procedures • State space models • Kernel smoothing • Poisson processes • Spectral methods for inference • A brief discussion of the Kalman filter • Illustrations with examples from astronomy
Markov Chain Monte Carlo • MCMC methods are a collection of techniques that use pseudo-random (computer simulated) values to estimate solutions to mathematical problems • MCMC for Bayesian inference • Illustration of MCMC for the evaluation of expectations with respect to a distribution • MCMC for estimation of maxima or minima of functions • MCMC procedures are successfully used in the search for extra-solar planets
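Evaluating an expectation with respect to a distribution can be sketched with a random-walk Metropolis sampler, one of the simplest MCMC methods. The slides do not say which MCMC variant the course uses, so the choice of algorithm is an assumption, as are the standard normal target, the step size, the burn-in length, and the function names.

```python
import math
import random


def metropolis(log_density, x0, n_steps, step=2.4, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_density needs to be known only up to an additive constant.
    """
    rng = random.Random(seed)
    x = x0
    logp = log_density(x)
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)
        logp_prop = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if rng.random() < math.exp(min(0.0, logp_prop - logp)):
            x, logp = proposal, logp_prop
        chain.append(x)  # a rejected proposal repeats the current state
    return chain


def log_std_normal(x):
    """Log-density of a standard normal, up to a constant (assumed target)."""
    return -0.5 * x * x


chain = metropolis(log_std_normal, x0=5.0, n_steps=50000)
samples = chain[1000:]  # discard an assumed burn-in period

# Monte Carlo estimates of E[X] and E[X^2]; the exact values are 0 and 1.
mean_est = sum(samples) / len(samples)
second_moment_est = sum(s * s for s in samples) / len(samples)
print(mean_est, second_moment_est)
```

In a Bayesian setting, `log_density` would be the log-prior plus the log-likelihood, and the same averages over the chain would estimate posterior expectations.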