1 / 41

CATAR-C ontent Analysis Toolkit for Academic Research

CATAR is a robust toolkit developed by Yuen-Hsien Tseng for content analysis in academic research. Explore its installation, usage, interpretation, and a case study. Learn about automatic content analysis tools, motivation, and functions provided. Prepare your data for analysis and stay updated with this comprehensive tool.

hcornish
Download Presentation

CATAR-C ontent Analysis Toolkit for Academic Research

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CATAR-ContentAnalysisToolkitforAcademicResearch Introduction Installation Usage Interpretation CaseStudy Yuen-HsienTseng NationalTaiwanNormalUniversity 2018/04/25

  2. Content Analysis: Introduction • Related Subjects: • Bibliometrics, Scientometrics, Infometrics • Content analysis in social science • Related Journals • JASIST, Scientometrics, Journal of Infometrics • Related Conferences • ISSI: International Society for Scientometrics and Infometrics • STI: Science and Technology Indicators 2

  3. Content Analysis: Motivation • Prior art search and analysis is expected to be done in a half day • From a vice president of a second largest analog IC design company • To know the past, avoid reinventing the wheels, and improve innovation • For strategic planning in S&T • To know highly impact authors, institutions • For speech, cooperation, advise • For evaluation, budget distribution 3

  4. Automatic Content Analysis • Long term goal: • Automatic literature analysis, organization, and presentation • To form hypothesis for exploration, verification, and decision making • Related studies: • Structured Abstract in library science (1987) • Automated structured abstract in biology (2007) • Automatic patent analysis(2004, NTCIR) • Sentimental analysis in literature(2010, STI) 4

  5. Automatic Content Analysis -Tools(1/2) • CiteSpace • Chao-Mei Chen,Drexel University (2003) • http://cluster.cis.drexel.edu/~cchen/citespace/ • To know the paradigm shift in research • VOSviewer • Nees Jan van Eck and Ludo Waltman (2007) • CWTS of Leiden University • http://www.vosviewer.com/ 5

  6. Automatic Content Analysis -Tools(1/2) • Science Mapping Software Tools: Review, Analysis, and Cooperative Study Among Tools • Cobo, et al, JASIST 2011 paper • Compare nine tools (free, commercial) • Bibexcel, VantagePoint, Sci2 Tool, … • None of them cover all the functions of the others • Scientometrics analysis has a standard procedure (Börneret al 2003) • CATAR released in 2010 (developed since 2004) 6

  7. CATAR-Introduction • Content Analysis Toolkit for Academic Research • Yuen-Hsien Tseng, 2004-2018 • http://web.ntnu.edu.tw/~samtseng/CATAR/ • CATAR technical details: • Yuen-Hsien Tseng, Chi-Jen Lin, and Yu-I Lin, "Text Mining Techniques for Patent Analysis", Information Processing and Management, Vol. 43, No. 5, 2007, pp. 1216-1247. (indexed in ESI as top 1% cited paper in 2017) • Journal clustering of Library and Information Science for subfield delineation using the bibliometric analysis toolkit: CATAR", Scientometrics, Vol. 95, No. 2, pp. 503-528, May 2013. 7

  8. CATAR Analysis Functions • Overview analysis • Topic Clustering based on: • bibliographic coupling • co-word analysis 8

  9. CATAR Installation • Please check below link for latest update: • http://web.ntnu.edu.tw/~samtseng/CATAR/ • C:\CATAR folder: • C:\CATAR\src\ : programs for analysis • C:\CATAR\Source\ : data to be analyzed • C:\CATAR\Result\ : result after analysis • C:\CATAR\doc\ : intermediate result during analysis (just for debugging, no need for end user) 9

  10. Preparing Data for Analysis • Delineate the data (the most important step, the second valuable part) • Data from keyword search • Publication records in core journals • Combined search results • Journals + keywords + time limit • Data record verified by domain experts • Data source ready for analysis by CATAR • Records from Web of Science • Patents from USPTO 10

  11. FN ISI Export Format VR 1.0 PT J AUTseng, SC Tsai, CC AF Tseng, Sheng-Chau Tsai, Chin-Chung TIOn-line peer assessment and the role of the peer feedback: A study of high school computer course SOCOMPUTERS & EDUCATION LA English DT Article DEinteractive learning environments; secondary education; learning communities; improving classroom teaching; peer assessment IDWORLD-WIDE-WEB; ASSESSMENT SYSTEM; HIGHER-EDUCATION; STUDENTS; THINKING; SCIENCE; SELF ABThe purposes of this study were to explore the effects and the validity of on-line peer assessment in high schools and … C1Natl Chiao Tung Univ, Inst Educ, Hsinchu 300, Taiwan. Natl Chiao Tung Univ, Ctr Teacher Educ, Hsinchu 300, Taiwan. RP Tsai, CC, Natl Chiao Tung Univ, Inst Educ, 1001 Ta Hsueh Rd, Hsinchu 300, Taiwan. EM cctsai@mail.nctu.edu.tw CRROTH WM, 1997, SCI EDUC, V6, P373 DOCHY F, 1999, STUD HIGH EDUC, V24, P331 … NR23 TC2 PU PERGAMON-ELSEVIER SCIENCE LTD PI OXFORD PA THE BOULEVARD, LANGFORD LANE, KIDLINGTON, OXFORD OX5 1GB, ENGLAND SN 0360-1315 J9COMPUT EDUC JI Comput. Educ. PD DEC PY2007 VL49 IS 4 BP1161 EP 1174 DI 10.1016/j.compedu.2006.01.007 PG 14 SCComputer Science, Interdisciplinary Applications; Education & Educational Research GA 218OF UTISI:000250024100013 ER ISI WoS Publication Record Only the fields in red color are used. Cited References are used in the bibliographic coupling for topic clustering and citation tracking 11

  12. Import Fields of WoS AU: authors' names, e.g., Kainz, H; Hofstetter, H TI: publication title, e.g., Adaption of the main waste water treatment … SO: journal title, e.g., WATER SCIENCE AND TECHNOLOGY。 DE: keywords given by the authors: large wastewater treatment plant; ID: identifiers given by WoK to describe the topics of the article AB: publication's abstract C1: authors‘ countries, e.g., USA, UK, … CR: cited references: BALDI F, 1988, WATER AIR SOIL POLL, V38, P111 NR: Number of references, e.g., 3 TC: Times Cited, e.g., 1 PY: publication year, e.g., 1996 SC: source categories e.g.,Environmental Sciences; Water Resources UT:indexing key given by WoS, e.g., ISI:A1996VF74600009 12

  13. Overview Analysis • Parse data and save into DBMS for management, cross tabulation, and verification • Trend Analysis • Trend indicator in terms of the slope of the linear regression based on the publication volume per year • Yuen-Hsien Tseng, Yu-I Lin, Yi-Yang Lee, Wen-Chi Hung, and Chun-Hsiang Lee, " A Comparison of Methods for Detecting Hot Topics", Scientometrics, Vol. 81, No. 1, Oct. 2009, pp. 73-90. • Command to be executed: • C:\CATAR\src>perl –s automc.pl -OOASE ..\Source_Data\SE\data Path to the data for analysis Command option Folder name for result 13

  14. DOS Command • Find and run cmd.exe in MS Windows • Change drive to C: C: • Change folder to CATAR: cd \CATAR • Change working directory: cd src • Absolute path: C:\CATAR\Source_Data\SE\data • Relative path: if working directory is under \CATAR\src then the path to data is..\Source_Data\SE\data 14

  15. Overview Analysis Example Results in : C:\CATAR\Result\SE\_SE_by_field.xls Document Type=(Article)Databases=SCI-EXPANDED, SSCI, A&HCI Timespan=2005-2009 15

  16. Year Production: Top 8 Countries 16

  17. Most Productive Authors: Top 10 AUTseng, SC Tsai, CC Tseng, SC : 0.5 Tsai, CC : 0.5 AUTseng, SC Tsai, CC Tseng, SC : 1 Tsai, CC : 1 NC=Normal Count: each co-author is counted as a single author FC=Fractional Count: all the co-authors are counted as a single author IF=TC/NC,FIF=FTC/FC 17

  18. Most Productive Institutes: Top 15 Data are from the C1 field of each record: C1Natl Chiao Tung Univ, Inst Educ, Hsinchu 300, Taiwan 18

  19. Most Cited References Data are from the CR field of each record: CRROTH WM, 1997, SCI EDUC, V6, P373 19

  20. Most Cited Authors Data are from the CR field of each record: CRROTH WM, 1997, SCI EDUC, V6, P373 20

  21. Most Cited Journals Data are from the CR field of each record: CRROTH WM, 1997, SCI EDUC, V6, P373 21

  22. Topic Clustering Procedure • Indexing Construction (for fast analysis) • Similarity Computation • Document Clustering • Cluster Labels Generation • Multi-Stage Clustering for Topic Trees • Multi-Dementional Scaling (MDS)for Topic Map • Cross tabulations for topic and other data 22

  23. Indexing Building • Bibliographic Coupling (BC) : • Constructing BC matrix • Normalize citation counts • Co-word Analysis (CW) • Remove stop words (the, of, for, on, and, at, …) • Normalize terms (stemming, lemmatization, vocabulary control) • Keyterm extraction ( patented [Tseng, 2002, JASIST]) • Building inverted files for later fast computation 23

  24. D1 D1 D2 D2 Dn Dn D1 D1 詞彙 T 文獻 M D2 D2 Dn Dn 詞彙 2 文獻 2 詞彙 1 文獻 1 Doc A Doc B Doc A Doc B Co-word Bibliographic Coupling Similarity Computation T=2529 for 318 EEPA papers M=9957 for 318 EEPA papers Sim(A, B) = 2x|S(A)∩S(B)| -------------------- |S(A)|+|S(B)| 24

  25. Topic Tree • Agglomerative Hierarchical Clustering (AHC) • Complete link criterion • Dendrogram 0.0 Threshold: 0.075 Result: 6 clusters 0.1 0.2 0.3 D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15 D16 D17 25

  26. 主題樹範例 (電影新聞資料) • 1(7):161 : 7 Docs. : 0.3478 (美國: 9.4) • 2 : 4 Docs. : 1.0000 (美國: 4.1) • 13 : 101765 : 2006-01-01:納尼亞傳奇 美國片 • 55 : 113371 : 2006-03-19:V怪客 美國片 • 48 : 109839 : 2006-03-12:北國性騷擾 美國片 • 1 : 98663 : 2006-01-08:惡狼ID 美國片 • 32 : 3 Docs. : 0.7245 (影迷: 7.0, 美國: 2.4) • 14 : 2 Docs. : 0.9340 (影迷: 4.0, 絕命終結站: 3.5, 絕命: 3.5, 飛車: 2.8, 雲霄飛車: 2.8) • 11 : 101543 : 2006-01-15:奪魂鋸2美國片 • 27 : 104778 : 2006-02-26:絕命終結站3雲霄飛車驚魂 • 16 : 102575 : 2006-01-08:偷穿高跟鞋 美國片 • 9(3):28 : 3 Docs. : 0.7614 (傑克: 10.0, 李安: 8.9, 傑克基倫霍: 7.0, 基倫霍: 7.0, 希斯萊傑: 3.2) • 17 : 2 Docs. : 0.9141 (李安: 11.0, 傑克: 5.7, 斷背山: 4.9, 希斯萊傑: 4.0, 傑克基倫霍: 3.2) • 3 : 98770 : 2006-01-22:李安靠 斷背山重拾熱情 • 7 : 100886 : 2006-01-22:斷背山 美國片 • 21 : 104156 : 2006-02-26:鍋蓋頭 美國片 • 12(3):74 : 3 Docs. : 0.5263 (奶油: 7.3, 絕配: 6.0, 料理: 5.1, 凱特: 4.9, 尼克: 3.2) • 58 : 2 Docs. : 0.6041 (番紅花: 6.3, 凱特: 6.0, 番紅花醬汁: 4.9, 尼克: 4.0, 鮮奶: 4.0) • 68 : 397612 : 2007-08-25:料理絕配 跟著男主角做義國菜 • 71 : 403973 : 2007-08-25:料理絕配 跟著女主角做法國菜 • 69 : 398615 : 2007-08-25:料理絕配 看電影學用餐禮儀 相似度 類別序號與篇數 類別標題詞 類別編號 (下一階使用)與篇數 26

  27. Generation of Cluster Labels • Automatic extracting cluster terms as cluster labels • Yuen-Hsien Tseng, " Generic Title Labeling for Clustered Documents", Expert Systems With Applications, Vol. 37, No. 3, 15 March 2010, pp. 2247-2254 . 27

  28. Multi-Stage Clustering Each stage is an AHC Topics Stage 2 Concepts Stage 1 Docs. Outliers: below threshold, unable to be clustered 28

  29. 2. Electronics and Semi-conductors 5. Material 1.Chemistry 4. Communication and computers 3. Generality 6. Biomedicine Topic Map • MDS(Multi-Dimensional Scaling) • Projecting high dimensional similarities into 2 dimension similarity map for ease of visualization Topic map of US patents of MOST 29

  30. Topic Map and Topic Tree Carbon Nanotube patents 30

  31. Analysis by Bibliographic Coupling • Command example: • C:\CATAR\src>perl -s automc.pl -OBCSE ..\Source_Data\SE\SE.mdb • Results: • C:\CATAR\Result\SE_BC • *.html: topic tree • *all*.html: cross tabulation of topics • *.xls: cross tabulation of topics • *titles*.html: titles of each cluster 31

  32. Analysis based Co-Word • Command example: • C:\CATAR\src>perl -s automc.pl-OCWSE ..\Source_Data\SE\SE.mdb • Results: • C:\CATAR\Result\SE_CW • *.html: topic tree • *all*.html: cross tabulation of topics • *.xls: cross tabulation of topics • *titles*.html: titles of each cluster 32

  33. BC Analysis Example Threshold=0.0 Reasonable: 100% • 1(6): 34 : 6 Docs. : 0.020000 (cluster: 5.1, map: 3.0, min: 3.0, text: 2.1) • 12 : 4 Docs. : 0.142857 (cluster: 7.0, patent: 5.2, text: 3.7, generic: 2.6, title: 2.6) • 5 : 3 Docs. : 0.224490 (cluster: 5.0, generic: 3.1, title: 3.1, text: 2.4, document: 2.3) • 1 : 2 Docs. : 0.692308 (generic: 4.0, title: 4.0, cluster: 3.2, document: 3.1, correlation coefficient: 2.0) • 2 : ISI:000241690200012 : 2006:Toward generic title generation for clustered documents6 : ISI:000272846500049 : 2010:Generic title labeling for clustered documents • 3 : ISI:000246869800006 : 2007:Text mining techniques for patent analysis • 4 : ISI:000251991600006 : 2007:Patent surrogate extraction and evaluation in the context of patent mapping • 18 : 2 Docs. : 0.052632 (education: 4.0, content analysi: 2.0, content: 2.0, media: 2.0) • 7 : ISI:000277110400017 : 2010:Mining concept maps from news stories for measuring civic scientific literacy in media • 8 : ISI:000279714800001 : 2010:Trends of Science Education Research: An Automatic Content Analysis • 2(3): 15 : 3 Docs. : 0.095238 (neural network: 3.1, quadratic: 2.3, sort: 2.3, perceptron: 1.7) • 2 : 2 Docs. : 0.333333 (quadratic: 3.0, sort: 3.0, perceptron: 2.3, winner-take-all: 1.4, constant-time: 1.4) • 13 : ISI:A1995QT09700011 : 1995:ON A CONSTANT-TIME, LOW-COMPLEXITY WINNER-TAKE-ALL NEURAL-NETWORK • 9 : ISI:A1992HU15600007 : 1992:SOLVING SORTING AND RELATED PROBLEMS BY QUADRATIC PERCEPTRONS • 10 : ISI:A1992HY58100028 : 1992:CONSTRUCTING ASSOCIATIVE MEMORIES USING HIGH-ORDER NEURAL NETWORKS • 3(2): 14 : 2 Docs. : 0.113208 (automatic: 3.1, chinese: 1.4, text: 1.4, thesauru: 1.4) • 0 : ISI:000167255500002 : 2001:Automatic cataloguing and searching for retrospective data by use of OCR text • 1 : ISI:000178776600007 : 2002:Automatic thesaurus generation for Chinese documents • 4(2): 3 : 2 Docs. : 0.285714 (code: 4.0, decoder: 1.4, fast: 1.4, reed-muller: 1.4) • 11 : ISI:A1993MA58300001 : 1993:DECODING REED-MULLER CODES BY MULTILAYER PERCEPTRONS • 12 : ISI:A1993MA58300002 : 1993:FAST NEURAL DECODERS FOR SOME CYCLIC CODES • 5(1): 36 : 1 Docs. : 0 (hot: 2.0, detect: 2.0, comparison: 2.0, topic: 1.1, scientometric: 0.7) • 5 : ISI:000270841800006 : 2009:A comparison of methods for detecting hot topics 33

  34. BC Analysis Example: 2nd Stage Threshold=0.0 Reasonable: 100% • 1(2):1 : 5 Docs. : 0.100000 (neural: 4.0, perceptron: 3.0, code: 2.4, decoder: 1.8, network: 1.8) • 1 : 15 : 3 Docs. : 0.095238(neural network: 3.1, quadratic: 2.3, sort: 2.3, perceptron: 1.7) • 3 : 3 : 2 Docs. : 0.285714(code: 4.0, decoder: 1.4, fast: 1.4, reed-muller: 1.4) • 2(2):2 : 8 Docs. : 0.022556 (automatic: 5.0, document: 4.0, text: 4.0, generation: 3.0, cluster: 1.8) • 0 : 34 : 6 Docs. : 0.020000(cluster: 5.1, map: 3.0, min: 3.0, text: 2.1) • 2 : 14 : 2 Docs. : 0.113208(automatic: 3.1, chinese: 1.4, text: 1.4, thesauru: 1.4) • 3(1):4 : 1 Docs. : 0 (hot: 2.0, detect: 2.0, comparison: 2.0, topic: 2.0, scientometric: 1.0) • 4 : 36 : 1 Docs. : 0(hot: 2.0, detect: 2.0, comparison: 2.0, topic: 1.1, scientometric: 0.7) Labe IDfromstage 1 34

  35. BC Analysis Example: 2nd Stage 35

  36. Co-Word Analysis Example Threshold=0.0, 1, 1, 1 Reasonable: 60%-80% • 1(5):29 : 5 Docs. : 0.0940 (term: 19.0, document: 6.7, algorithm: 4.0) • 7 : 3 Docs. : 0.5403 (document: 12.2, generic: 7.7, cluster: 7.6, term: 7.4, algorithm: 6.0) • 2 : 2 Docs. : 0.9610 (cluster: 10.8, generic: 10.0, label: 7.0, title: 7.0, document: 5.6) • 2 : ISI:000272846500049 : 2010:Generic title labeling for clustered documents • 6 : ISI:000241690200012 : 2006:Toward generic title generation for clustered documents • 7 : ISI:000178776600007 : 2002:Automatic thesaurus generation for Chinese documents • 3 : 2 Docs. : 0.7090 (map: 7.7, patent: 5.4, term: 4.1, scientific: 4.0, new: 4.0) • 1 : ISI:000277110400017 : 2010:Mining concept maps from news stories for measuring civic scientific literacy in media • 4 : ISI:000251991600006 : 2007:Patent surrogate extraction and evaluation in the context of patent mapping • 2(3):19 : 3 Docs. : 0.2776 (automatic: 7.3, text: 6.9, analysi: 4.9, approach: 4.6, topic: 1.9) • 4 : 2 Docs. : 0.6881 (science: 7.4, analysi: 6.9, education: 5.4, science education: 5.4, research: 5.4) • 0 : ISI:000279714800001 : 2010:Trends of Science Education Research: An Automatic Content Analysis • 5 : ISI:000246869800006 : 2007:Text mining techniques for patent analysis • 8 : ISI:000167255500002 : 2001:Automatic cataloguing and searching for retrospective data by use of OCR text • 3(2):1 : 2 Docs. : 1.00 (network: 7.7, memory: 4.0, associative memory: 2.7, winner-take-all: 2.0) • 12 : ISI:A1992HY58100028 : 1992:CONSTRUCTING ASSOCIATIVE MEMORIES USING HIGH-ORDER NEURAL NETWORKS • 9 : ISI:A1995QT09700011 : 1995:ON A CONSTANT-TIME, LOW-COMPLEXITY WINNER-TAKE-ALL NEURAL-NETWORK • 4(1):30 : 1 Docs. : 0 (trend: 6.7, different: 5.0, better: 3.0, trend observation: 3.0, choice: 3.0) • 3 : ISI:000270841800006 : 2009:A comparison of methods for detecting hot topics Have common term: Map, Mapping, but different interpretation 36

  37. Breakdown Trends of ICTin Edu. Main stream topic Dying out topics Hot topics during that period Topic with periodic attraction Promising topics (not yet mature)

  38. Interpretation (1/2) • The most valuable part of your analysis • Access file: data after parsing or loading • Updatable for BC/CW analysis • Excelfile: various cross tabulation for analysis • HTML file: topic tree • Results are under C:\CATAR\Result\ : • The topic trees in stage n are under the folder name ended with Sn • The topic maps of stage n are in the folder name ended with S(n+1) 38

  39. Interpretation (1/2) • Accept default parameters and try different parameters • Interpretable is more important than reasonable • Meaningful information may scatter in the results of different stages • Need domain experts to help interpretation and verification • Reference: • Chao-Mei Chen (2010) How to choose parameter in CiteSpace: http://www.sciencenet.cn/m/user_content.aspx?id=378974 39

  40. Case Studies • Yulan Yuan, Ulrike Gretzel, and Yuen-Hsien Tseng*, "Revealing the Nature of Contemporary Tourism Research: Extracting Common Subject Areas through Bibliographic Coupling ", International Journal of Tourism Research, Vol. 17, No. 5, pp. 417–431, Sep./Oct. 2015, DOI: 10.1002/jtr.2004. • Yuen-Hsien Tseng*, Chun-Yen Chang, M. Shane Tutwiler, Ming-Chao Lin, and James Barufaldi, " A Scientometric Analysis of the Effectiveness of Taiwan's Educational Research Projects", Scientometrics, Vol. 95, No. 3, pp 1141-1166, June 2013. • Yuen-Hsien Tseng and Ming-Yueh Tsay, " Journal clustering of Library and Information Science for subfield delineation using the bibliometric analysis toolkit: CATAR", Scientometrics, Vol. 95, No. 2, pp. 503-528, May 2013. • Yueh-Hsia Chang, Chun-Yen Chang, Yuen-Hsien Tseng,"Trends of Science Education Research: An Automatic Content Analysis", Journal of Science Education and Technology, Vol. 19, No. 4, 2010, pp. 315-331. 40

  41. Remarks • Start from the overview analysis • So that the Wos can be parse into the database for later use • Followed by BC and CW analysis • For non-WoS data • Refer to: • C:\CATAR\Source_Data\movie\movie.mdb • C:\CATAR\Source_Data\eport\eport.mdb • Put your own data into table TPaper in the database based on the meaning of the field names • Separate each value by “; “ if multiple values in a field: • Chang, YH; Chang, CY; Tseng, YH 41

More Related