Jeong-Wook Seo, Heaseon Whang, HyunSook Shin, Naye Choi, Hyoung Won Park

Context based assessment scheme for suspected plagiarized scholarly articles: ‘Context based comparable data sets’
Jeong-Wook Seo, Heaseon Whang, HyunSook Shin, Naye Choi, HyoungWon Park
Seoul National University Medical Library and Creative Commons Korea, Seoul, Korea

Steps to judge an article
• SUSPECT: Target article
• COMPARE to DETECT: Similar phrases and sentences from the comparison database
• ASSESS: Quantitative and qualitative
• COMPARE to JUDGE: With what?
• JUDGE the article as plagiarized or not

Glossary of terms: Similarity Index; Paper Information; the largest matching No.; the 2nd largest matching No.; Target article; Paper Text; Matching Sources

Publishers’ site Abstracts services

Exclusion steps for published articles Similarity Index

High Similarity Index and high largest matching number

High Similarity Index and low largest matching number

Steps to judge an article
• SUSPECT: Target article
• COMPARE to DETECT: Similar phrases and sentences from the comparison database
• ASSESS: Quantitative and qualitative
• COMPARE to JUDGE: With what?
• JUDGE the article as plagiarized or not

The judgment of plagiarism is not straightforward
• It depends on the social and academic context.
• The assessment should be based on statistical analysis of similarity reports for a group of articles from a comparable academic and social context.

Working Definition of Misconduct types
A. Redundant/duplicate publication: Most parts are the same. The largest matching No. is over 50%.
B. Self (or team) plagiarism: One paragraph or more is copied from the authors’ earlier publication without any form of citation.
C. Direct copying of methods section: One paragraph or more is copied from a methods section without any form of citation.
D. Uncited Extracts: One paragraph or more is copied without any form of citation.
E. Excessive Extracts: Text is copied with citation, but in excessive quantity.
(Modified from Zhang, 2010)
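The working definitions above can be read as a small set of rules. The following is only a sketch of that reading, not the authors' procedure: the `MatchFinding` fields, the `excessive` flag (the slides give no threshold for "too many") and the order in which the rules are checked are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MatchFinding:
    """Hypothetical reviewer-annotated summary of one similarity report."""
    largest_match_pct: float   # largest single-source match, in percent
    copied_paragraphs: int     # paragraphs copied from the matching source
    cited: bool                # whether the source is cited in any form
    same_authors: bool         # source written by the same authors or team
    in_methods_section: bool   # copied text sits in the methods section
    excessive: bool = False    # reviewer's judgment; no threshold is defined


def classify(f: MatchFinding) -> str:
    """Map a finding to one of the working misconduct types A to E.

    The precedence (A before B before C ...) is an assumption.
    """
    if f.largest_match_pct > 50:
        return "A. Redundant/duplicate publication"
    if f.copied_paragraphs >= 1 and not f.cited:
        if f.same_authors:
            return "B. Self (or team) plagiarism"
        if f.in_methods_section:
            return "C. Direct copying of methods section"
        return "D. Uncited extracts"
    if f.cited and f.excessive:
        return "E. Excessive extracts"
    return "No misconduct type assigned"


if __name__ == "__main__":
    example = MatchFinding(12.0, 2, cited=False, same_authors=True,
                           in_methods_section=False)
    print(classify(example))
```

Fields such as `cited` and `same_authors` still require a human reading of the similarity report; the sketch only makes the decision order explicit.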

Similarity index and numbers of the largest, 2ND & 3RD matching sources

Context based Misconduct probability by 3 groups of the largest matching %
Eng, Engineering; SS, Social sciences; Medical, medicine and medical science; Natural Sci, Natural science; IT, Information technology

Context based Misconduct cases by 3 groups of the largest matching %

Context based Comparison of Misconduct Probability (Context Based Comparable Data Set)

Context based presentation of similarity index and the largest, 2nd and 3rd matching numbers, in the order of the largest matching number. A statistical comparison.
Average ± SD: 11.7 ± 16.5; 7.4 ± 6.3; 6.6 ± 7.3; 8.0 ± 15.6; 9.0 ± 9.5

‘The comparable academic and social context’
• This is to define
  • the character of the article to be assessed
  • the purpose of the assessment
• This context may be
  • a specific context for the current assessment
  • predefined contexts to be used for future plagiarism assessments
• The context includes
  • Academic contexts
    • the domain
    • the type of articles
    • years published
    • the language
    • reputation level of journals
  • Social contexts
    • country
    • authors’ (current) profession (academic or non-academic positions)
    • the purpose of assessment of plagiarism
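One way to make such a context concrete is as a small record whose fields mirror the items listed above. This is a minimal sketch under stated assumptions: the class name `Context`, its field names, and the choice of which fields must match for two articles to count as comparable are hypothetical, not part of the presented protocol.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Context:
    """Hypothetical record of an article's academic and social context."""
    # Academic context
    domain: str
    article_type: str
    year: int
    language: str
    journal_reputation: Optional[str] = None
    # Social context
    country: Optional[str] = None
    author_profession: Optional[str] = None   # academic or non-academic
    purpose: Optional[str] = None             # purpose of the assessment


def is_comparable(a: Context, b: Context,
                  keys: Sequence[str] = ("domain", "article_type",
                                         "year", "language")) -> bool:
    """Treat two contexts as comparable when the chosen fields are equal.

    Which fields to compare is an assumption; it would depend on the
    purpose of the assessment.
    """
    return all(getattr(a, k) == getattr(b, k) for k in keys)


if __name__ == "__main__":
    target = Context("medicine", "original article", 2010, "English",
                     country="Korea")
    candidate = Context("medicine", "original article", 2010, "English",
                        country="Japan")
    print(is_comparable(target, candidate))
    print(is_comparable(target, candidate, keys=("domain", "country")))
```

Keeping the compared fields as a parameter reflects the point that a context can be specific to the current assessment or predefined for later reuse.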

Take Home Messages
• Copying is more common than expected.
• Copying patterns differ among contexts.
• The largest matching number (%) is a more reliable indicator of plagiarism than the Similarity Index.
• A context-based judgment scheme is proposed for plagiarism assessment.

‘An automated tool for similarity detection’
• We used iThenticate for this purpose.
• The automated tool detects similar phrases and sentences between the target article and other articles in the comparison database.
• Comprehensive coverage of the comparison database (the database sources to compare submissions against) is a crucial feature of the tool, and the coverage should be adjustable.
• Technical features of the tool are also important and should be adjustable.

Context based assessment scheme for suspected plagiarized scholarly articles: ‘Context based comparable data sets’
Jeong-Wook Seo, Heaseon Whang, Hyun Sook Shin, Naye Choi, Hyoung Won Park
Seoul National University Medical Library and Creative Commons Korea, Seoul, Korea

When we assess a target article for suspected plagiarism, we detect parts that are similar to other articles in a comparison database. We then face a dilemma: are the detected similarities within an acceptable range or not? There is no standardized process for this critical assessment, so we judge based only on personal opinion or on a general principle adopted by the group. We think the judgment of plagiarism is not straightforward but depends on the social and academic context. The best solution to this dilemma, therefore, would be an assessment based on statistical analysis of similarity reports for a group of articles from a comparable academic and social context. We designed a protocol to develop ‘Context based comparable data sets.’ The key features of this design are as follows.

‘The comparable academic and social context’: This defines the character of the article to be assessed and the purpose of the assessment. The context may be specific to the current assessment, but we can also develop predefined contexts to be used in future plagiarism assessments. The context includes the academic domain, the type of articles, years published, the language and the reputation level of journals, as well as social contexts such as country, authors’ (current) profession (academic or non-academic positions) and the purpose of the plagiarism assessment.

‘An automated tool for similarity detection’: We used iThenticate for this purpose. The automated tool detects similar phrases and sentences between the target article and other articles in the comparison database. Comprehensive coverage of the comparison database (the database sources to compare submissions against) is a crucial feature of the tool, and the coverage should be adjustable. Technical features of the tool are also important and should be adjustable.

‘A group of articles for a data set’: Fifty to one hundred fifty published articles are selected as a control group. They come from a comparable context (the same journal, the same year and the same type of articles).

‘Statistical analysis of similarity reports from articles of the data set’: The mean, median and standard deviation of similarity indices, as well as outliers, are calculated and presented, and they are compared across different contexts.
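The statistical analysis of a data set can be sketched with Python's standard `statistics` module. This is an illustration only: the outlier rule (values above the mean plus `k` standard deviations, with `k = 2`) is an assumption, since the abstract does not state how outliers are defined, and the control-group values below are made up rather than taken from the study.

```python
import statistics
from typing import Dict, List, Union


def summarize(values: List[float],
              k: float = 2.0) -> Dict[str, Union[float, List[float]]]:
    """Summarize similarity indices (in percent) for one control group.

    Outliers are taken to be values above mean + k * SD, which is an
    assumed rule for illustration.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)   # sample SD; needs at least two values
    cutoff = mean + k * sd
    outliers = [v for v in values if v > cutoff]
    return {"mean": mean, "median": median, "stdev": sd,
            "cutoff": cutoff, "outliers": outliers}


if __name__ == "__main__":
    # Made-up similarity indices for a hypothetical control group.
    control_group = [3, 5, 8, 4, 6, 7, 5, 9, 4, 45]
    summary = summarize(control_group)
    for name, value in summary.items():
        print(name, value)
```

Running the same summary on control groups from different contexts gives the per-context figures that a target article's similarity report can then be compared against.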
