Graphical Data Mining for Computational Estimation in Materials Science Applications Aparna Varde Ph.D. Dissertation August 15, 2006 Committee Members Prof. Elke Rundensteiner (Advisor) Prof. Carolina Ruiz Prof. David Brown Prof. Neil Heffernan Prof. Richard Sisson Jr. (Head of Materials Science, WPI) This work is supported by the Center for Heat Treating Excellence and by Department of Energy Award DE-FC-07-01ID14197
Introduction • Scientific domains: Experiments are conducted with given input conditions • Results plotted as graphs: Good visual depictions • Experimental results support analysis: Assist decision-making • Performing an experiment: Consumes time and resources
Motivating Example • Heat Treating of Materials • Controlled heating & cooling of materials to achieve desired mechanical & thermal properties • Performing experiments involves • One-time cost: $1000s • Recurrent costs: $100s • Time: 5 to 6 hours • Human labor • Desirable to estimate • Graphs given input conditions • Conditions to achieve a given graph [Figure: CHTE Experimental Setup]
Problem Definition • To develop an estimation technique with the following goals: 1. Given the input conditions of an experiment, estimate the resulting graph 2. Given a desired graph in an experiment, estimate the conditions needed to obtain it
Main Tasks Task 1 (AutoDomainMine): Learning Strategy of Integrating Clustering and Classification [AAAI-06 Poster, ACM SIGART's ICICIS-05] Task 2 (LearnMet): Learning Domain-Specific Distance Metrics for Graphs [ACM KDD's MDM-05, MTAP-06 Journal] Task 3 (DesRept): Designing Semantics-Preserving Representatives for Clusters [ACM SIGMOD's IQIS-06, ACM CIKM-06]
Task 2: Learning Domain-Specific Distance Metrics for Graphs
Motivation • Various distance metrics • Absolute position of points • Statistical observations • Critical features • Issues • It is not known which metrics apply • Multiple metrics may be relevant • Need for distance metric learning over graphs [Figure: Example of a domain-specific problem]
Proposed Distance Metric Learning Approach: LearnMet • Given • Training set with actual clusters of graphs • Additional Input • Components: distance metrics applicable to graphs • LearnMet Metric • D = ∑i wi·Di, a weighted sum of the component metrics Di with learned weights wi
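The weighted-sum metric can be pictured with a short Python sketch. The two component metrics below (a position-based distance and a "steepest drop" critical-feature distance) are illustrative assumptions, not the exact components used in LearnMet:

```python
import numpy as np

def position_distance(ga: np.ndarray, gb: np.ndarray) -> float:
    """Position-based component: mean absolute difference of y-values
    for graphs sampled on the same x-grid."""
    return float(np.mean(np.abs(ga - gb)))

def critical_feature_distance(ga: np.ndarray, gb: np.ndarray) -> float:
    """Illustrative critical-feature component: difference between the
    y-values at each graph's steepest drop (a stand-in for a
    domain-specific critical point)."""
    return float(abs(ga[np.argmin(np.diff(ga))] - gb[np.argmin(np.diff(gb))]))

# Hypothetical component list; LearnMet uses domain-supplied metrics.
COMPONENTS = [position_distance, critical_feature_distance]

def learnmet_distance(ga: np.ndarray, gb: np.ndarray, weights) -> float:
    """D = sum_i w_i * D_i over the component metrics."""
    return sum(w * d(ga, gb) for w, d in zip(weights, COMPONENTS))
```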
Evaluate Accuracy • Use pairs of graphs • A pair (ga, gb) is • TP: same predicted, same actual cluster, e.g., (g1, g2) • TN: different predicted, different actual clusters, e.g., (g2, g3) • FP: same predicted cluster, different actual clusters, e.g., (g3, g4) • FN: different predicted, same actual cluster, e.g., (g4, g5)
Evaluate Accuracy (Contd.) • How do we compute the error for the whole set of graphs? • For all pairs • Error Measure • Failure Rate FR • FR = (FP+FN) / (TP+TN+FP+FN) • Error Threshold (t) • Extent of FR allowed • If FR < t, then the clustering is accurate
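A minimal sketch of this pair-based evaluation, assuming predicted and actual cluster labels are supplied as dictionaries keyed by graph id:

```python
from itertools import combinations

def failure_rate(predicted: dict, actual: dict) -> float:
    """Classify every pair of graphs as TP/TN/FP/FN by comparing
    predicted vs. actual cluster labels, then return
    FR = (FP + FN) / (TP + TN + FP + FN)."""
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(predicted), 2):
        same_pred = predicted[a] == predicted[b]
        same_act = actual[a] == actual[b]
        if same_pred and same_act:
            tp += 1
        elif not same_pred and not same_act:
            tn += 1
        elif same_pred and not same_act:
            fp += 1
        else:
            fn += 1
    return (fp + fn) / (tp + tn + fp + fn)

# Clustering with the learned metric is deemed accurate when
# failure_rate(predicted, actual) < t for the chosen threshold t.
```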
Adjust the Metric • Weight Adjustment Heuristic: for each Di • New wi = wi − sfi·(DFNi/DFN + DFPi/DFP), where DFNi (DFPi) is the contribution of Di to the total distance over false-negative (false-positive) pairs, DFN (DFP) is the corresponding total, and sfi is a scaling factor [KDD's MDM-05]
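A hedged sketch of one such adjustment epoch, implementing the slide's formula literally; clamping weights at zero and using a single scaling factor are assumptions of this sketch, and the full treatment is in the MDM-05 paper:

```python
def adjust_weights(weights, components, fn_pairs, fp_pairs, sf=0.1):
    """One epoch of the heuristic:
    new w_i = w_i - sf*(DFN_i/DFN + DFP_i/DFP)."""
    def contributions(pairs):
        # Weighted contribution of each component over the given pairs.
        per_comp = [sum(w * d(a, b) for a, b in pairs)
                    for w, d in zip(weights, components)]
        total = sum(per_comp) or 1.0  # guard against empty pair sets
        return per_comp, total

    dfn_i, dfn = contributions(fn_pairs)
    dfp_i, dfp = contributions(fp_pairs)
    # Clamping at zero is an assumption of this sketch.
    return [max(0.0, w - sf * (fni / dfn + fpi / dfp))
            for w, fni, fpi in zip(weights, dfn_i, dfp_i)]
```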
Evaluation of LearnMet • Details: MTAP-06 • Effect of pairs per epoch (ppe) • G = number of graphs, e.g., G = 25 • C(G,2) = G(G−1)/2 = total number of pairs, e.g., 300 • Select a subset of the C(G,2) pairs per epoch • Observations • Highest accuracy with a middle range of ppe • Learning efficiency best with low ppe • Average accuracy with LearnMet: 86% [Figures: Accuracy of Learned Metrics over Test Set; Learning Efficiency over Training Set]
Task 3: Designing Semantics-Preserving Representatives for Clusters
Motivation • Different combinations of conditions could lead to a single cluster • Graphs in a cluster could have variations • Need for designing representatives that • Incorporate semantics • Avoid visual clutter • Cater to various users
Candidates for Conditions 1. Nearest Representative • Return the set of conditions closest to all others in the cluster • Notion of distance: domain-specific distance metric derived from decision tree paths [CIKM-06] [Figures: Set of conditions in Cluster A; Nearest Representative for Cluster A]
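As a sketch, once the decision-tree-path metric has been evaluated into a pairwise distance matrix over the cluster's condition sets, the nearest representative is simply the item with the smallest total distance to the rest:

```python
import numpy as np

def nearest_representative(dist: np.ndarray) -> int:
    """Given an (n x n) pairwise distance matrix over the condition
    sets in a cluster, return the index of the set with the smallest
    total distance to all the others."""
    return int(np.argmin(dist.sum(axis=1)))
```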
Candidates for Conditions (Contd.) 2. Summarized Representative • Build sub-clusters of conditions using domain knowledge • Return the nearest representatives of the sub-clusters • Sort them [Figures: Cluster A; Sub-clusters within Cluster A; Summarized Representative for Cluster A]
Candidates for Conditions (Contd.) 3. Combined Representative • Return all sets of conditions • Sort them in ascending order [Figures: Cluster A; Combined Representative for Cluster A]
Candidates for Graphs 1. Nearest Representative • Select the graph that is the nearest neighbor of all others • Notion of distance: domain-specific metric learned by LearnMet
Candidates for Graphs (Contd.) 2. Medoid Representative • Select the graph closest to the average of all graphs • Average over y-coordinate values, since the x-coordinates are the same
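A short sketch of this medoid selection, under the slide's assumption that all graphs in a cluster share the same x-coordinates so only y-values need averaging:

```python
import numpy as np

def medoid_representative(graphs: np.ndarray) -> int:
    """graphs: (G x N) array of y-values on a shared x-grid. Returns
    the index of the graph closest (in mean absolute difference) to
    the pointwise average curve of the cluster."""
    mean_curve = graphs.mean(axis=0)
    return int(np.argmin(np.abs(graphs - mean_curve).mean(axis=1)))
```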
Candidates for Graphs (Contd.) 3. Summarized Representative • Construct average graph with prediction limits • Average: centroid, Prediction limits: domain-specific thresholds
Candidates for Graphs (Contd.) 4. Combined Representative • Construct a superimposed graph of all graphs in the cluster • Same x-values, so plot all y-values on a common x-axis
Effectiveness Measure for Candidates • Minimum Description Length Principle • Theory: the representative; Examples: all items in the cluster • Representative: measure Complexity (ease of interpretation) • Complexity = log2(N) for graphs, log2(AV) for conditions • N = number of points needed to store the representative graph • A = number of attributes for conditions • V = number of values in the representative set of conditions • Examples: measure the distance of items from the representative (information loss) • Distance for graphs = log2[(1/G) ∑{i=1 to G} D(r, gi)] • D: distance using the domain-specific metric • G: total number of graphs in the cluster • gi: each graph • r: representative graph • Encoding [SIGMOD IQIS-06]: Effectiveness = UBC·Complexity + UBD·Distance • UBC, UBD: user-bias % weights for complexity and distance
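A hedged sketch of this effectiveness score for a graph representative; the epsilon guard on the log and the default user-bias weights are assumptions of the sketch, with the exact encoding given in SIGMOD IQIS-06:

```python
import math

def effectiveness(rep, graphs, metric, n_points, ubc=0.5, ubd=0.5):
    """Effectiveness = UBC*Complexity + UBD*Distance for a graph
    representative: Complexity = log2(N) for the N stored points;
    Distance = log2 of the average domain-specific distance from the
    representative to the cluster's graphs. Lower is better."""
    complexity = math.log2(n_points)
    avg_dist = sum(metric(rep, g) for g in graphs) / len(graphs)
    distance = math.log2(max(avg_dist, 1e-12))  # guard: average may be 0
    return ubc * complexity + ubd * distance
```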
Evaluation of DesRept: Conditions • Details • Data Set Size = 400, Number of Clusters = 20 • Observations • Overall winner is Summarized • As weight for complexity increases, Nearest wins • Designed better than Random
Evaluation of DesRept: Graphs • Details • Data Set Size = 400, Number of Clusters = 20 • Observations • Overall winner is Summarized • As weight for complexity increases, Nearest / Medoid wins • Designed better than Random
User Evaluation of AutoDomainMine System • Formal user surveys in different applications • Evaluation Process • Compare the estimation with real data in the test set • If they match, the estimation is accurate • Observations • Estimation accuracy around 90 to 95% [Figures: Accuracy: Estimating Conditions; Accuracy: Estimating Graphs]
Related Work • Similarity Search [HK-01, WF-00] • Non-matching conditions could be significant • Mathematical Modeling [M-95, S-60] • Existing models are not applicable under certain situations • Case-based Reasoning [K-93, AP-03] • Adaptation of cases is not feasible with graphs • Learning nearest neighbors in high-dimensional spaces [HAK-00] • Focus is dimensionality reduction; does not deal with graphs • Distance metric learning given a basic formula [XNJR-03] • Deals with position-based distances for points; no graphs involved • Similarity search in multimedia databases [KB-04] • Uses various metrics in different applications; does not learn a single metric • Image Rating [HH-01] • User intervention involved in manual rating • Semantic Fish Eye Views [JP-04] • Display multiple objects in a small space; no representatives • PDA Displays with Levels of Detail [BGMP-01] • Does not evaluate different types of representatives
Summary • Dissertation Contributions • AutoDomainMine: Integrating Clustering and Classification for Estimation [AAAI-06 Poster, ACM SIGART’s ICICIS-05] • LearnMet: Learning Domain-Specific Distance Metrics for Graphs [ACM KDD’s MDM-05, MTAP-06 Journal] • DesRept: Designing Semantics-Preserving Representatives for Clusters [ACM SIGMOD’s IQIS-06, ACM CIKM-06] • Trademarked Tool for Computational Estimation in Materials Science [ASM HTS-05, ASM HTS-03] • Future Work • Image Mining, e.g., Comparing Nanostructures • Data Stream Matching, e.g., Stock Market Analysis • Visual Displays, e.g., Summarizing Web Information
Publications Dissertation-Related Papers 1. Designing Semantics-Preserving Representatives for Scientific Input Conditions. A. Varde, E. Rundensteiner, C. Ruiz, D. Brown, M. Maniruzzaman and R. Sisson Jr. In CIKM, Arlington, VA, Nov 2006. 2. Integrating Clustering and Classification for Estimating Process Variables in Materials Science. A. Varde, E. Rundensteiner, C. Ruiz, D. Brown, M. Maniruzzaman and R. Sisson Jr. In AAAI, Poster Track, Boston, MA, Jul 2006. 3. Effectiveness of Domain-Specific Cluster Representatives for Graphical Plots. A. Varde, E. Rundensteiner, C. Ruiz, D. Brown, M. Maniruzzaman and R. Sisson Jr. In ACM SIGMOD IQIS, Chicago, IL, Jun 2006. 4. LearnMet: Learning Domain-Specific Distance Metrics for Plots of Scientific Functions. A. Varde, E. Rundensteiner, C. Ruiz, M. Maniruzzaman and R. Sisson Jr. Accepted in the International MTAP Journal, Springer, Special Issue on Multimedia Data Mining, 2006. 5. Learning Semantics-Preserving Distance Metrics for Clustering Graphical Data. A. Varde, E. Rundensteiner, C. Ruiz, M. Maniruzzaman and R. Sisson Jr. In ACM KDD MDM, Chicago, IL, Aug 2005, pp. 107-112. 6. Apriori Algorithm and Game-of-Life for Predictive Analysis in Materials Science. A. Varde, M. Takahashi, E. Rundensteiner, M. Ward, M. Maniruzzaman and R. Sisson Jr. In KES Journal, IOS Press, Netherlands, Vol. 8, No. 4, 2004, pp. 213-228. 7. Data Mining over Graphical Results of Experiments with Domain Semantics. A. Varde, E. Rundensteiner, C. Ruiz, M. Maniruzzaman and R. Sisson Jr. In ACM SIGART ICICIS, Cairo, Egypt, Mar 2005, pp. 603-611.
Publications (Contd.) 8. QuenchMiner: Decision Support for Optimization of Heat Treating Processes. A. Varde, M. Takahashi, E. Rundensteiner, M. Ward, M. Maniruzzaman and R. Sisson Jr. In IEEE IICAI, Hyderabad, India, Dec 2003, pp. 993-1003. 9. Estimating Heat Transfer Coefficients as a Function of Temperature by Data Mining. A. Varde, E. Rundensteiner, M. Maniruzzaman and R. Sisson Jr. In ASM HTS, Pittsburgh, PA, Sep 2005. 10. The QuenchMiner Expert System for Quenching and Distortion Control. A. Varde, E. Rundensteiner, M. Maniruzzaman and R. Sisson Jr. In ASM HTS, Indianapolis, IN, Sep 2003, pp. 174-183. Other Papers 11. MEDWRAP: Consistent View Maintenance over Distributed Multi-Relation Sources. A. Varde and E. Rundensteiner. In DEXA, Aix-en-Provence, France, Sep 2002, pp. 341-350. 12. SWECCA for Data Warehouse Maintenance. A. Varde and E. Rundensteiner. In SCI, Orlando, FL, Jul 2002, Vol. 5, pp. 352-357. 13. MatML: XML for Information Exchange with Materials Property Data. A. Varde, E. Begley, S. Fahrenholz-Mann. In ACM KDD DM-SPP, Philadelphia, PA, Aug 2006. 14. Semantic Extensions to Domain-Specific Markup Languages. A. Varde, E. Rundensteiner, M. Mani, M. Maniruzzaman and R. Sisson Jr. In IEEE CCCT, Austin, TX, Aug 2004, Vol. 2, pp. 55-60.
Acknowledgments • First of all, my Advisor: Prof. Elke Rundensteiner • Committee: Prof. Carolina Ruiz, Prof. David Brown, Prof. Neil Heffernan • External Member: Prof. Richard D. Sisson Jr., Head of Materials Program • Director of Metal Processing Institute: Prof. Diran Apelian • Domain Expert: Dr. Mohammed Maniruzzaman • Members of the Center for Heat Treating Excellence • CS Department Head: Prof. Michael Gennert • Former CS Department Head: Prof. Micha Hofri • WPI Administration (CS, Materials): in particular Mrs. Rita Shilansky • Reviewers of the conferences and journals where my papers were accepted • Members of DSRG, AIRG, KDDRG and the Quenching Research Group • Colleagues and Friends: Shuhui, Sujoy, Viren, Olly, Mariana, Rimma, Maged, Bin, Lydia, Shimin and others… • Great thanks to my Family: Parents Dr. Sharad Varde and Dr. (Mrs.) Varsha Varde, Grandparents Mr. D.A. Varde and Mrs. Vimal Varde, Brother Ameya Varde and Sister-in-law Deepa Varde • All the attendees of my Ph.D. Defense • Finally, God for guiding me throughout my doctoral journey