Zhang Yanxia China-VO Group 2006.11.30 in Guilin

Presentation Transcript


  1. Chinese Virtual Observatory. Data Mining in Astronomy. Zhang Yanxia, China-VO Group, 2006.11.30 in Guilin

  2. Outline • Why • What • How • Example • Challenge • Summary China-VO 2006, Guilin

  3. ROSAT ~keV • DSS Optical • IRAS 25 μm • 2MASS 2 μm • GB 6 cm • WENSS 92 cm • NVSS 20 cm • IRAS 100 μm. Astronomy facing a “data avalanche”. Necessity is the mother of invention: DM & KDD

  4. Issues in Astronomy. Ofer Lahav, 2006, astro-ph/0610703: summary of the 4th meeting on “Statistical Challenges in Modern Astronomy”, held at Penn State University in June 2006 • Compression (e.g. galaxy images and spectra) • Classification (e.g. stars, galaxies, or gamma-ray bursts) • Reconstruction (e.g. of blurred galaxy images, or of the mass distribution from weak gravitational lensing) • Feature extraction (e.g. signature features of stars, galaxies and quasars) • Parameter estimation (e.g. stellar parameter measurement, photometric redshift prediction, orbital parameters of extra-solar planets, or cosmological parameters) • Model selection (e.g. are there 0, 1, 2, ... planets around a star, or is a cosmological model with non-zero neutrino mass more favorable?)

  5. Science Requirements for DM (Borne K D, 2001, Proc. of the MPA/ESO/MPE Workshop, 671) • Cross-Identification - refers to the classical problem of associating the source list in one database with the source list in another. • Cross-Correlation - refers to the search for correlations, tendencies, and trends between physical parameters in multi-dimensional data, usually across databases. • Nearest-Neighbor Identification - refers to the general application of clustering algorithms in multi-dimensional parameter space, usually within a database. • Systematic Data Exploration - refers to the application of the broad range of event-based and relationship-based queries to a database in the hope of making a serendipitous discovery of new objects or a new class of objects.
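As a rough illustration of cross-identification, the sketch below matches each source in one catalogue to its nearest neighbour in another, provided it lies within a search radius. The function names, the 2-arcsecond radius, and the toy coordinates are assumptions made for illustration and do not come from the talk. A brute-force loop stands in for the spatial indexing that real catalogues would need.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def cross_identify(cat_a, cat_b, radius_arcsec=2.0):
    """For each (ra, dec) in cat_a, return the index of the nearest cat_b
    source within the radius, or None when nothing is close enough."""
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for ra, dec in cat_a:
        best_idx, best_sep = None, radius_deg
        for j, (rb, db) in enumerate(cat_b):
            sep = angular_separation_deg(ra, dec, rb, db)
            if sep <= best_sep:
                best_idx, best_sep = j, sep
        matches.append(best_idx)
    return matches

# Toy catalogues with hypothetical positions in degrees (not real sources).
optical = [(10.0000, 20.0000), (150.5000, -5.2500)]
infrared = [(150.5002, -5.2501), (10.0001, 20.0001), (200.0, 45.0)]
print(cross_identify(optical, infrared))
```

The quadratic loop is only workable for small lists; for survey-sized catalogues a sky-partitioning index or tree would replace it.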

  6. KDD: Opportunity and Challenges. Diagram labels around KDD: • Competitive pressure • Data rich, knowledge poor (the resource) • Data mining technology mature • Enabling technology (interactive MIS, OLAP, parallel computing, Web, etc.)

  7. KDD: A Definition. KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data. • 10^6 to 10^12 bytes: we never see the whole data set or put it in the memory of computers • What knowledge? How to represent and use it? • Data mining algorithms?

  8. Benefits of Knowledge Discovery [Figure: axes Value and Volume; stages EDP, MIS, DSS; labels Generate, Disseminate, Rapid Response] EDP: Electronic Data Processing; MIS: Management Information Systems; DSS: Decision Support Systems

  9. DM: A KDD Process • Data mining: the core of the knowledge discovery process. [Diagram, bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]

  10. Work at each stage of DM [Bar chart, axis 0 to 60: DM object, Data preparation, Data processing, Analysis and Evaluation]

  11. Primary Tasks of Data Mining • Classification: finding the description of several predefined classes and classifying a data item into one of them. • Clustering: identifying a finite set of categories or clusters to describe the data. • Dependency modeling: finding a model which describes significant dependencies between variables. • Regression: mapping a data item to a real-valued prediction variable. • Deviation and change detection: discovering the most significant changes in the data. • Summarization: finding a compact description for a subset of data.

  12. Feature selection • Filter method • Wrapper method • Embedded method • Feature weighted method
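To make the filter method concrete, here is a minimal sketch that scores each feature independently of any classifier and keeps the top k. Ranking by the absolute Pearson correlation with the target is just one possible filter criterion, chosen here for brevity. The feature names and values are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant column carries no information, so score it as 0.
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def filter_select(columns, target, k):
    """columns maps feature name -> list of values.
    Returns the k feature names with the largest |correlation| to target."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy data: 'colour' tracks the class label, the others do not.
target = [0, 0, 1, 1, 1, 0]
columns = {
    "colour": [0.1, 0.2, 0.9, 0.8, 1.0, 0.15],
    "noise": [5, 1, 4, 2, 3, 6],
    "constant": [1, 1, 1, 1, 1, 1],
}
print(filter_select(columns, target, k=2))
```

A wrapper method would instead re-train a classifier for every candidate subset, which is costlier but takes feature interactions into account.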

  13. Feature extraction • PCA • Factor analysis (Principal FA/Maximum Likelihood FA) • Projection pursuit • ICA • Non-linear PCA/ICA • Random projection • Principal curves • MDS • LLE • ISOMAP • Topological continuous map • Neural network • Vector quantization • Kernel PCA/ICA • LDA (linear discriminant analysis) • QDA (quadratic discriminant analysis) • FDA (Fisher discriminant analysis) • GDA (generalized discriminant analysis) • KDDA (kernel direct discriminant analysis)
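PCA, the first entry in the list above, can be sketched in a few lines. The version below finds only the first principal component, using power iteration on the sample covariance matrix. This is a teaching shortcut under stated assumptions (a fixed starting vector and a fixed iteration count), not a production method, and the toy data are invented.

```python
def first_principal_component(rows, n_iter=100):
    """Return (unit direction, projected scores) for the leading principal
    component of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration; the start vector is an arbitrary choice and would
    # fail only if it were exactly orthogonal to the leading eigenvector.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            raise ValueError("no variance along the current direction")
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Toy data lying exactly on the line y = 2x, so one component explains it all.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(data)
print(direction, scores)
```

The returned scores are the one-dimensional representation of the data; further components would require deflating the covariance matrix or a full eigendecomposition.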

  14. Classification Methods • Based on statistical theory: SVMs, ML, LDA, FDA, QDA, KNN • Based on NN: LVQ, RBF, PNN, KSOM, BBN, SLP, MLP • Based on decision trees: REPTree, RandomTree, CART, C5.0, J48, DecisionStump, RandomForest, NBTree, AC2, Cal5, ADTree, KDTree • Based on decision rules: Decision Table, CN2, ITrule, AQ • Based on Bayesian theory: Naive Bayes classifier, NBTree • Based on meta learning: AdaBoost, boosting, bagging • Based on evolution theory: genetic algorithm • Based on fuzzy theory: fuzzy set, rough set • Ensembles of classifiers [Diagram labels: Data, Mining algorithm, patterns]

  15. Regression Methods • (Penalized) logistic regression • Bayesian regression analysis • Additive regression • Locally weighted regression • Voted perceptron network • Projection pursuit regression • Recursive partitioning regression • Alternating conditional expectation • Stepwise regression • Recursive least squares • Fourier transform regression • Rule-based regression • Principal component regression • Instance-based regression • Multivariate adaptive regression splines • Regression trees (CART, RETIS, M5, random forest, KDTree) • Simple windowed regression • SVM • NN

  16. Methods to estimate errors • Train-test • Cross-validation • Bootstrap • Leave-one-out
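Of the error-estimation schemes listed, k-fold cross-validation is easy to sketch, and leave-one-out is the special case where k equals the number of records. The helper names and the tiny nearest-mean classifier below are assumptions introduced only to make the example runnable, and the data are invented. The sketch assumes k is no larger than the number of records.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(train_fn, predict_fn, X, y, k=10, seed=0):
    """Mean held-out accuracy over k folds (assumes k <= len(X))."""
    accuracies = []
    for held_out in k_fold_indices(len(X), k, seed):
        test = set(held_out)
        train_X = [X[i] for i in range(len(X)) if i not in test]
        train_y = [y[i] for i in range(len(X)) if i not in test]
        model = train_fn(train_X, train_y)
        correct = sum(predict_fn(model, X[i]) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k

# Hypothetical stand-in classifier: assign the class with the nearest mean.
def train_nearest_mean(X, y):
    model = {}
    for label in set(y):
        vals = [x for x, t in zip(X, y) if t == label]
        model[label] = sum(vals) / len(vals)
    return model

def predict_nearest_mean(model, x):
    return min(model, key=lambda label: abs(x - model[label]))

# Invented, well-separated one-dimensional data.
X = [0.1 * i for i in range(10)] + [5 + 0.1 * i for i in range(10)]
y = ["star"] * 10 + ["quasar"] * 10
print(cross_validate(train_nearest_mean, predict_nearest_mean, X, y, k=10))
```

A plain train-test split corresponds to evaluating a single fold, while the bootstrap resamples with replacement and is not shown here.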

  17. Evaluation of methods • Accuracy • Speed • Comprehensibility • Time to learn • Generalization

  18. Model Selection for Classification • Accuracy • G-mean • F-measure • ROC (Receiver Operating Characteristic curve)
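The first three measures can be computed from a binary confusion matrix, as in the sketch below. It assumes the common definitions: G-mean as the geometric mean of sensitivity and specificity, and F-measure as the harmonic mean of precision and recall. The slides do not spell these out, so treat the formulas as an assumption, and the labels are made up.

```python
import math

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, G-mean and F-measure for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def ratio(num, den):
        # Report 0 rather than dividing by zero when a class is absent.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)        # sensitivity, true-positive rate
    specificity = ratio(tn, tn + fp)   # true-negative rate
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "g_mean": math.sqrt(recall * specificity),
        "f_measure": ratio(2 * precision * recall, precision + recall),
    }

# Invented labels: 4 positives, 6 negatives, 3 errors in total.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(binary_metrics(y_true, y_pred))
```

G-mean and F-measure are mainly useful when the classes are imbalanced, where accuracy alone can look good for a classifier that ignores the rare class. The ROC curve needs classifier scores rather than hard labels and is left out.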

  19. Model Selection for Regression • AIC (Akaike information criterion) • BIC (Bayesian information criterion) • SRM (Structural Risk Minimization)
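For reference, the standard forms of the two information criteria are given below. The symbols are not defined in the slides and are introduced here as assumptions: k is the number of fitted parameters, n the number of data points, and \hat{L} the maximized likelihood of the model.

```latex
\begin{align}
\mathrm{AIC} &= 2k - 2\ln\hat{L}, \\
\mathrm{BIC} &= k\ln n - 2\ln\hat{L}.
\end{align}
```

In both cases the model with the lower value is preferred. BIC penalizes extra parameters more heavily than AIC whenever \ln n > 2, that is for n of 8 or more.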

  20. Example 1. Lim Tjen-Sien et al., Machine Learning, 40, 203-229 (2000): 33 algorithms on 16 different samples • 22 decision trees (CART, S-Plus tree, C4.5, FACT, QUEST, IND, OC1, LMDT, CAL5, T1) • 9 statistical methods (LDA, QDA, NN, LOG, FDA, PDA, MDA, POL) • 2 neural networks (LVQ, RBF)

  21. Example 1. Lim Tjen-Sien et al., Machine Learning, 40, 203-229 (2000)

  22. Example 2

  23. Example 3. Zhao, Y., Zhang, Y., 2006, submitted to COSPAR

  24. Example 3. Zhang, Y., Zhao, Y., 2006, submitted to ChJAA. For NB, ADTree and MLP, the overall accuracy is 97.5%, 98.5% and 98.1%, respectively.

  25. Example 4. Zhang, Y., Luo, A., Zhao, Y., 2006, submitted to COSPAR. Using best-forward search, j-h, b-v and j+2.5lgFpeak are selected as the optimal features from the 10 features. A Decision Table is applied, with 10-fold cross-validation for training and testing. 98.03%
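The slide does not describe the search procedure in detail. As a loosely related illustration, the sketch below implements plain greedy forward selection, a simpler relative of best-first search: it repeatedly adds the feature that most improves a score and stops when no addition helps. It is not claimed to be the procedure used in the example. The scoring function and the feature names "a", "b", "c" are invented; in practice the score would be something like cross-validated accuracy.

```python
def forward_select(features, score_fn):
    """Greedy forward selection.
    features: candidate feature names.
    score_fn: callable taking a list of names and returning a number to maximize.
    Returns (selected names in order of addition, best score reached)."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        candidates = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda t: t[0])
        if top_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Made-up scoring: each feature has a fixed gain, minus a small cost per
# feature so that useless features are rejected.
GAINS = {"a": 0.5, "b": 0.3, "c": 0.0}

def toy_score(subset):
    return sum(GAINS[f] for f in subset) - 0.01 * len(subset)

chosen, score = forward_select(["a", "b", "c"], toy_score)
print(chosen, round(score, 2))
```

Unlike this greedy loop, a best-first search keeps a list of previously evaluated subsets and can back up to them when the current path stops improving.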

  26. Example 5. Li, Y., Zhang, Y., Zhao, Y., 2006, submitted to Chinese Science. k-nearest neighbor classifier
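For readers unfamiliar with the k-nearest neighbor classifier named above, a minimal sketch follows. It uses Euclidean distance and a majority vote, and the training points and class labels are invented for illustration; they are not the data from the example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label of the majority class among the k training points nearest to query."""
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set (for example, two colour indices).
train_X = [(0.20, 0.30), (0.30, 0.25), (0.25, 0.35),
           (1.10, 0.90), (1.00, 1.20), (1.20, 1.00)]
train_y = ["star", "star", "star", "quasar", "quasar", "quasar"]

print(knn_predict(train_X, train_y, (0.28, 0.30)))
print(knn_predict(train_X, train_y, (1.05, 1.00)))
```

An odd k avoids tied votes in the two-class case. Features on very different scales should be normalized first, since the distance would otherwise be dominated by the largest one.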

  27. Example 6. Zhang, Y., Zhao, Y., 2006, ADASS XV, 351, 173

  28. Challenges and Influential Aspects • Handling of different types of data with different degrees of supervision • Massive data sets, high dimensionality (efficiency, scalability) • Different sources of data (distributed, heterogeneous databases; noisy, missing and irrelevant data, etc.) • Interactive, visual knowledge discovery • Understandability of patterns, various kinds of requests and results (decision lists, inference networks, concept hierarchies, etc.) • Changing data and knowledge

  29. Summary • Linear or non-linear • Gaussian or non-Gaussian • Continuous or discrete • Missing or not • Comparison of the number of attributes with that of records • Choose the appropriate method or ensemble of algorithms according to the task and data characteristics

  30. Prospect. With the wings of DM, find more, better or best knowledge! Thank you for your attention!

  31. Thank you !!!
