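Find the topic sentence: Cicadas shrill in the cold; I face the roadside pavilion at dusk, just after a sudden shower. Drinking in a tent outside the capital gate, I am in no mood for it; just as we linger, the magnolia boat urges departure. Hand in hand we look into each other's tearful eyes, speechless and choked with sobs. I think of the journey ahead: a thousand li of misty waves, evening haze heavy under the vast southern sky. Lovers since ancient times have grieved at parting; how much harder to bear in this cold, desolate autumn. Where shall I be when I sober up tonight? On a willow bank, with the dawn breeze and a waning moon. I shall be gone for years; fine days and lovely scenes will be set out in vain. Even if I had a thousand tender feelings, to whom could I tell them?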
Studying software evolution using topic models • Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, Dorothea Blostein • Presenter: Liu Hailin • Intelligence Service and Software Engineering Center, College of Software Engineering, Chongqing University
About the authors • Stephen W. Thomas
About the authors • Research areas • Empirical software engineering, data and text mining, temporal databases. • Achievements
Citation analysis • Publication • Science of Computer Programming • homepage: www.elsevier.com/locate/scico • References • M. D'Ambros, M. Lanza, R. Robbes, Evaluating defect prediction approaches: a benchmark and an extensive comparison, Empirical Software Engineering 17 (4–5) (2012), pp. 531–577. • N. Nagappan, A. Zeller, T. Zimmermann, K. Herzig, B. Murphy, Change bursts as defect predictors, in: Proceedings of the 21st International Symposium on Software Reliability Engineering, 2010, pp. 309–318. • H.U. Asuncion, A.U. Asuncion, R.N. Taylor, Software traceability with topic modeling, in: Proceedings of the 32nd International Conference on Software Engineering, 2010, pp. 95–104. • A. Kuhn, S. Ducasse, T. Girba, Semantic clustering: identifying topics in source code, Information and Software Technology 49 (3) (2007), pp. 230–243. • ……
Abstract …. Recently, researchers found topics to be effective tools for structuring various software artifacts, such as source code, requirements documents, and bug reports. This research also hypothesized that using topics to describe the evolution of software repositories could be useful for maintenance and understanding tasks. However, research has yet to determine whether these automatically discovered topic evolutions describe the evolution of source code in a way that is relevant or meaningful to project stakeholders, and thus it is not clear whether topic models are a suitable tool for this task. In this paper, we take a first step towards evaluating topic models in the analysis of software evolution by performing a detailed manual analysis on the source code histories of two well-known and well-documented systems, JHotDraw and jEdit. We define and compute various metrics on the discovered topic evolutions and manually investigate how and why the metrics evolve over time.
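Abstract [Diagram: topics are used on source code, requirements documents, and bug reports, where they are effective; topic evolutions are used on software repository evolution, where they are hypothesized to be useful for managing and understanding project development tasks.]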
Paper structure • Abstract • Introduction • Background • Motivation and research questions • Study methodology • Examining the discovered topics • Manual analysis of validity (RQ1) • Limitations, threats to validity, and future work • Related work • Conclusion
Outline 1. Motivation & RQ 2. Related knowledge 3. Case Study 4. Conclusion 5. My work
1. Motivation & Research Question • Motivation 1) Keep the project in good health by discovering and monitoring topic drifts 2) Monitor the day-to-day development activities 3) Help to understand the history of certain aspects of the system
1. Motivation & Research Question • Research Questions • RQ1: How do topic evolutions correspond to source code changes? To determine the accuracy of topic evolution models. Given a set of change events in a discovered topic evolution, as well as a set of change activities to the source code, what is the correspondence? • RQ2: What is the relationship between code change categories and topic evolution? To perform a descriptive study to determine the common relationships between evolutions and code change categories (i.e., bug fixes, feature additions, and refactorings). • RQ3: What are the patterns of topic evolution? To gain insight into both the abstract notion of topic evolution and the development processes of the studied systems.
Outline 1. Motivation & RQ 2. Related knowledge 3. Case Study 4. Conclusion 5. Discussion
2. Related knowledge • Topic models 1. Latent Dirichlet allocation (LDA) • Topic evolution models 1. Dynamic Topic model 2. Topics Over Time model 3. The Link model 4. The Hall model • Kullback–Leibler (KL) distance • Change event
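2. Related knowledge • Topic models • [Diagram: topic 1, topic 2, …, topic n, each a distribution over words W1, W2, …] • From the generative-model point of view, every word in a document is obtained by a two-step process: choose a topic with some probability, then choose a word from that topic with some probability. The probability of each word appearing in a generated document therefore follows from these two probabilities [formula image not preserved].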
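The generative description on this slide matches the standard topic-mixture identity. The following is a reconstruction under that assumption, not a copy of the slide's formula; the symbols $w$ (word), $d$ (document), $z_k$ (topic $k$), and $K$ (number of topics) are chosen here for illustration:

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z_k)\, p(z_k \mid d)
```

Read left to right: the chance of seeing word $w$ in document $d$ is the chance of picking each topic for that document, times the chance that the topic emits $w$, summed over all topics.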
2. Related knowledge • Topic models • LDA • JGibbLDA (a Java implementation of LDA)
2. Related knowledge • Topic models • LDA • JGibbLDA (a Java implementation of LDA) • Output files: model-XXXXX.others, model-XXXXX.phi, model-XXXXX.theta, model-XXXXX.tassign, model-XXXXX.twords
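As a rough illustration of how these output files might be consumed, the sketch below parses the contents of a `.theta` file. It assumes the layout described in the JGibbLDA documentation (one document per line, one whitespace-separated topic probability per column); the helper names `parse_theta` and `dominant_topic` and the sample values are hypothetical.

```python
from typing import List


def parse_theta(text: str) -> List[List[float]]:
    """Parse .theta contents, assuming one document per line
    and one topic membership value per column."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if parts:  # skip blank lines
            rows.append([float(p) for p in parts])
    return rows


def dominant_topic(row: List[float]) -> int:
    """Return the index of the topic with the highest membership."""
    return max(range(len(row)), key=row.__getitem__)


# Hypothetical file contents: 2 documents, 3 topics.
sample = "0.7 0.2 0.1\n0.1 0.1 0.8\n"
theta = parse_theta(sample)
print([dominant_topic(row) for row in theta])
```

In practice the text would come from reading a real `model-XXXXX.theta` file; the string literal only stands in for it here.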
2. Related knowledge • Topic evolution models • [Figures: Link model, Hall model]
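2. Related knowledge • KL-Distance • Relative entropy is also known as Kullback–Leibler divergence (KLD) or KL distance. • KL divergence is an asymmetric measure of the difference between two probability distributions P and Q. It measures the average number of extra bits needed to encode samples from P using a code based on Q. Typically, P represents the true distribution of the data, and Q represents a theoretical distribution, a model distribution, or an approximation of P.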
2. Related knowledge • KL-Distance
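For discrete distributions the standard definition is D_KL(P‖Q) = Σ_i P(i) · log2(P(i) / Q(i)), measured in bits. The sketch below is a minimal implementation of that textbook formula; the function name and the two example distributions are illustrative and not taken from the slides.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) in bits for two discrete
    distributions given as equal-length sequences of probabilities."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with P(i) = 0 contribute nothing
        if qi == 0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log2(pi / qi)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]
# The two directions differ, which is why KL is not a true distance.
print(kl_divergence(p, q), kl_divergence(q, p))
```

Because the measure is asymmetric, comparing topics with it requires either fixing a direction or symmetrizing (for example by averaging both directions).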
2. Related knowledge • Change event • An event in which a metric used in the experiment (assignment, weight) changes between versions • Metrics: Assignment, Weight, Scattering, Tangle, α-support, Turnover, Similarity
Outline 1. Motivation & RQ 2. Related knowledge 3. Case Study 4. Conclusion 5. Discussion
3. Case Study JHotDraw and jEdit
3. Case Study • JHotDraw and jEdit
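3. Case Study • Study setup • 1. Preprocessing the code: 1) separate comments and identifiers; 2) handle syntax and remove Java keywords; 3) split identifiers into words; 4) remove stop words (a, the, an); 5) stem; 6) prune high-frequency and low-frequency words. • 2. Choosing the number of topics: the paper uses 45 topics. No single value of K is optimal for a study; in general, experiments can identify a relatively good range of values, where 0 << K. • 3. Choosing the change threshold: to account for the small probabilistic fluctuations that appear in topic evolutions, a threshold δ is introduced to decide whether a metric has changed significantly from one version to the next. The threshold filters out the changes we are not interested in while keeping the ones we are. • 4. Deriving and analyzing the results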
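The preprocessing steps above can be sketched roughly as follows. This is a simplified illustration, not the authors' pipeline: the keyword list is deliberately partial, comments and identifiers are not separated, and stemming and frequency pruning are omitted. All names (`JAVA_KEYWORDS`, `split_identifier`, `preprocess`) are hypothetical.

```python
import re

# Partial keyword list for illustration only; a real run would use all Java keywords.
JAVA_KEYWORDS = {
    "public", "private", "protected", "class", "void", "int", "return",
    "new", "static", "if", "else", "for", "while", "import", "package",
}
STOP_WORDS = {"a", "an", "the"}


def split_identifier(identifier):
    """Split an identifier on underscores and camelCase boundaries."""
    parts = []
    for chunk in identifier.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts]


def preprocess(source):
    """Turn source text into a list of lower-case words for the topic model."""
    tokens = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        if ident in JAVA_KEYWORDS:
            continue  # step 2: drop language keywords
        for word in split_identifier(ident):  # step 3: split identifiers
            if word in STOP_WORDS or word in JAVA_KEYWORDS:
                continue  # step 4: drop stop words
            if len(word) > 1:
                tokens.append(word)
    return tokens


print(preprocess("public void drawFigure(Figure theFigure) { return; }"))
```

Stemming (step 5) and frequency pruning (step 6) would follow on the returned token list, typically with a stemmer library and corpus-wide word counts.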
3. Case Study • RQ1: How topic evolutions correspond to actual change activities • 1. Selecting and computing the metrics • Assignment • Weight
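The slide's formulas for these two metrics were images and are not preserved. The sketch below is therefore an assumption: it treats the assignment of a topic as the sum of its memberships over all documents in a version, and the weight as the same sum with each membership scaled by the document's length in tokens. The function names and sample values are hypothetical.

```python
def assignment(theta, k):
    """Assumed definition: sum of topic k's membership over all documents.
    theta[j][k] is the membership of topic k in document j."""
    return sum(row[k] for row in theta)


def weight(theta, doc_lengths, k):
    """Assumed definition: membership of topic k scaled by each
    document's length (in tokens), summed over all documents."""
    return sum(row[k] * n for row, n in zip(theta, doc_lengths))


# Hypothetical version with 2 documents and 2 topics.
theta = [[0.5, 0.5], [0.25, 0.75]]
doc_lengths = [100, 200]
print(assignment(theta, 0), weight(theta, doc_lengths, 0))
```

Computing either metric once per version yields the per-topic time series whose changes are examined in the rest of the case study.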
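3. Case Study • RQ1: How topic evolutions correspond to actual change activities • 2. Selecting suitable topics to study • Suppose there are 13 versions: 12 × 45 = 540 possible change events, one per topic for each pair of consecutive versions • V1 → V2 → V3 → … → Vn-1 → Vn, with a possible change event between each pair of adjacent versions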
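The counting above, together with the change threshold δ from the study setup, can be sketched as follows. The relative-change test and the value δ = 0.2 are assumptions for illustration; the labels "spike" and "drop" follow the terms used in the results slide, and "stay" is a hypothetical label for insignificant changes.

```python
def classify_changes(values, delta):
    """Label each step between consecutive versions of one topic's metric.
    Assumes a relative change of at least delta counts as significant."""
    events = []
    for prev, curr in zip(values, values[1:]):
        if prev == 0:
            change = float("inf") if curr > 0 else 0.0
        else:
            change = (curr - prev) / prev
        if change >= delta:
            events.append("spike")
        elif change <= -delta:
            events.append("drop")
        else:
            events.append("stay")
    return events


def candidate_events(num_versions, num_topics):
    """One candidate event per topic per pair of consecutive versions."""
    return (num_versions - 1) * num_topics


history = [10.0, 10.5, 20.0, 8.0]  # hypothetical metric values over 4 versions
print(classify_changes(history, delta=0.2))
print(candidate_events(13, 45))
```

With 45 topics, 13 versions give 540 candidate events, which is the JHotDraw total quoted in this part of the deck.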
3. Case Study • RQ1: How topic evolutions correspond to actual change activities • 2. Selecting suitable topics to study • 113 out of 495 events for jEdit • 132 out of 540 events for JHotDraw • σ² = p(1 − p)
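3. Case Study • RQ1: How topic evolutions correspond to actual change activities • 3. Results • Definition: a change event is considered valid if it can be associated with a change in the source code (for example, spikes and drops that correspond to additions, deletions, or major modifications in the code).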
3. Case Study • RQ1: How topic evolutions correspond to actual change activities • 3. Results 1. Noisy membership changes 2. Confounded topics
3. Case Study • RQ2: What is the relationship between code change categories and topic evolution? • 1. Defining the change categories • C1. Corrective evolution (i.e., bug fixes) • C2. Refactoring (i.e., code improvement and adaptation) C2.1. Adoption of coding conventions and style C2.2. Adoption of new frameworks or libraries C2.3. Improvement of the internal structure of the code • C3. New functionalities and features
3. Case Study • RQ2: What is the relationship between code change categories and topic evolution? • 2. Results
3. Case Study • RQ3: What are the patterns of topic evolution? • Overall growth • Major events • Births and deaths • Constant topics • Unstable topics • Spike-only topics
Outline 1. Motivation & RQ 2. Related knowledge 3. Case Study 4. Conclusion 5. My work
Conclusion • C1: The majority of topics evolve due to actual change activities in the source code, with only a small minority of change events caused by noise or confounded topics in the probabilistic LDA model. • C2: Topics evolve due to a variety of underlying change activities, including bug fixes, refactoring efforts, and the addition of new functionalities. • C3: Topic evolutions in JHotDraw are very active, while those in jEdit are calmer. Topics in both systems tend to grow rather than shrink, as does the size of the source code.
Outline 1. Motivation & RQ 2. Related knowledge 3. Case Study 4. Conclusion 5. My work
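My Work 1. Some topics are more defect-prone than others. 2. Defect-prone topics are inherited across versions, and can be used to predict defects in the source code (not yet proven). 3. Some topic metrics can explain well why some source code has more defects.
My Work • Difficulties and shortcomings: 1. Collecting the data set is difficult. 2. The number of defects per file within a version often has to be collected by hand, which may introduce errors.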
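My Work • Use defect-prone topics to predict highly defective modules • The following formula is also defined: [formula image not preserved]
My Work • Eclipse platform subproject: 11 versions, 11 modules, initial number of topics 100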
My Work • How can the similarity of topics across versions be judged? 1. Relative entropy 2. Similarity of the highest-probability words
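The second option, comparing the highest-probability words, could be sketched as below. The use of Jaccard overlap on the top-n words is an assumption chosen for illustration; the slides do not preserve the presenter's actual definition, and the topic contents shown are hypothetical.

```python
def top_words(word_probs, n):
    """Return the set of the n highest-probability words of a topic."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return {word for word, _ in ranked[:n]}


def jaccard(a, b):
    """Jaccard similarity of two sets: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical word distributions of one topic in two versions.
topic_v1 = {"figure": 0.30, "draw": 0.25, "bounds": 0.20, "undo": 0.05}
topic_v2 = {"figure": 0.35, "draw": 0.20, "handle": 0.15, "undo": 0.10}

similarity = jaccard(top_words(topic_v1, 3), top_words(topic_v2, 3))
print(similarity)
```

Evaluating this for every pair of topics from two versions fills a two-dimensional similarity table, and a cut-off on the score then decides which topics count as the same topic across versions.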
My Work • Definition: [formula image not preserved]
My Work • Adjusting the value of μ • Two-dimensional topic similarity table