Introduction to R

ปราณี นิลกรณ์

What is R?
• R is a language and software package for statistical computing and graphics. It is free and open source, and was developed from the S language (Bell Labs) by Robert Gentleman and Ross Ihaka of the University of Auckland, New Zealand, in 1995 (B.E. 2538).
• It suits both writing your own programs and use as a ready-made package.
• It offers a large number of statistical functions, and contributors keep adding more.

What is R?
• All information about R is available at http://www.R-project.org
• The R system consists of two main parts:
  • Base system: the R language software plus other frequently needed components
  • User-contributed add-on packages

Using R
• Where can you get R?
• Download it from www.r-project.org or http://CRAN.R-project.org
• Choose the base installation (Base Package).
• If you want a menu-driven interface, install the Rcmdr package as well.
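The installation step for Rcmdr can be sketched as follows. This is a minimal sketch, not an official procedure: the helper name `is_installed` is hypothetical, and the actual install call is left as a comment because it needs a network connection and a CRAN mirror.

```r
# Hypothetical helper: TRUE if a package is already installed.
# It only looks the name up in the installed-package table and does not load anything.
is_installed <- function(pkg) {
  pkg %in% rownames(installed.packages())
}

if (!is_installed("Rcmdr")) {
  message('Rcmdr is not installed. To get it from CRAN, run: install.packages("Rcmdr")')
}

# Once the package is installed, attaching it opens the menu-driven window:
# library(Rcmdr)
```

The check is kept separate from the install so the script can be re-run safely on a machine that already has the package.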

Capabilities of the Base Package
• Data and memory management
• Array and matrix computation
• Statistical data analysis
• Graphics
• Programming
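As an illustration of the array and matrix capability listed above, here is a small sketch that uses only base R functions. The variable names `m`, `p`, and `arr` are chosen for this example.

```r
# A 2 x 3 matrix, filled column by column.
m <- matrix(1:6, nrow = 2)
dim(m)                 # number of rows and columns

# Matrix product of m with its transpose (a 2 x 2 matrix).
p <- m %*% t(m)
p

# Row sums, computed by applying sum() over the first dimension.
apply(m, 1, sum)

# A three-dimensional array; indexing takes one subscript per dimension.
arr <- array(1:24, dim = c(2, 3, 4))
arr[2, 3, 4]           # the last element
```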

The R program
RGui (GUI: graphical user interface) consists of
• the R Console window, for entering commands and viewing output
• the R Graph window, for displaying plots
• the Script window, for writing and editing commands and programs

The R GUI

Open script

R Graph Windows

R and statistics
• R has user-contributed packages for statistical computation and data analysis, which can be downloaded quickly and easily.
• Packages are also being developed for newer analysis techniques beyond traditional statistics, such as data/text mining.
• Statisticians who research and develop new statistical methods commonly release them as R packages.

Working with R
• The typical ways of working with R are to
  • type R commands at the command line interface, or
  • load a file that already contains R commands (a script file) and run it.
• Writing commands is slower than clicking through menus, but it has advantages:
  • It records the steps of the analysis, which can be saved as a file for each job.
  • When something goes wrong, you can tell which step caused the error.
  • If the analysis takes many steps, the commands can be re-run without clicking through everything again.
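The script-file workflow described above can be sketched with `source()`. This example writes a two-line script to a temporary file so that it is self-contained; in practice the script would be a file you saved from the Script window. The names `script` and `env` are chosen for this example.

```r
# Write a tiny script to a temporary file (stands in for a saved .R file).
script <- tempfile(fileext = ".R")
writeLines(c(
  "x <- c(1, 3, 2, 10, 5)",
  "total <- sum(x)"
), script)

# Run every command in the file. Evaluating in a separate environment
# keeps the script's variables apart from the rest of the session.
env <- new.env()
source(script, local = env)

env$total   # the result computed by the script
```

Calling `source(script)` without the `local` argument runs the commands in the workspace itself, which is what happens when a script is run from RGui.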

Getting help
> ?t.test
or
> help(t.test)
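A few other base R functions help when the exact function name is not known. The sketch below is one way to use them. The calls that open help pages are wrapped in `interactive()` so the block also runs as a script.

```r
# Find objects whose names match a pattern (a regular expression).
hits <- apropos("t.test")
hits

# Show the argument list of a function without opening its help page.
args(sd)

if (interactive()) {
  help(t.test)              # same as ?t.test
  help.search("normality")  # search help pages by topic; same as ??normality
  example(mean)             # run the examples from a help page
}
```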

R: Advantages
• Fast and free.
• State of the art: statistical researchers provide their methods as R packages. SPSS and SAS are years behind R!
• Second only to MATLAB for graphics.
• Mx, WinBUGS, and other programs use or will use R.
• Active user community.
• Excellent for simulation, programming, computer-intensive analyses, etc.
• Forces you to think about your analysis.
• Interfaces with database storage software (SQL).
R: Disadvantages
• Not user friendly at the start: steep learning curve, minimal GUI.
• No commercial support; figuring out correct methods or how to use a function on your own can be frustrating.
• Easy to make mistakes and not know it.
• Working with large datasets is limited by RAM.
• Data preparation and cleaning can be messier and more mistake-prone in R than in SPSS or SAS.
• Some users complain about hostility on the R mailing list.

R vs. commercial packages
R
• Many different datasets (and other "objects") are available at the same time.
• Datasets can be of any dimension.
• Functions can be modified.
• The experience is interactive: you program until you get exactly what you want.
• One-stop shopping: almost every analytical tool you can think of is available.
• R is free and will continue to exist. Nothing can make it go away, and its price will never increase.
Commercial packages
• One dataset is available at a given time.
• Datasets are rectangular.
• Functions are proprietary.
• The experience is passive: you choose an analysis and they give you everything they think you need.
• They tend to have limited scope, forcing you to learn additional programs; extra options cost more and/or require you to learn a different language (e.g., SPSS Macros).
• They cost money. There is no guarantee they will continue to exist, but if they do, you can bet that their prices will always increase.

Examples of data types and commands in R
Variables
> a = 49                            # numeric
> sqrt(a)
[1] 7
> a = "The dog ate my homework"     # character string
> sub("dog","cat",a)
[1] "The cat ate my homework"
> a = (1+1==3)                      # logical
> a
[1] FALSE
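The three types labelled above can be checked directly with `class()`. This is a small sketch; the names `values` and `types` are chosen for this example.

```r
# One value of each basic type from the slide.
values <- list(49, "The dog ate my homework", 1 + 1 == 3)

# class() reports the type of each value.
types <- sapply(values, class)
types

# The is.*() functions test for a specific type.
is.numeric(values[[1]])
is.character(values[[2]])
is.logical(values[[3]])
```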

Vectors
> a = c(7,5,1)
> a[2]
[1] 5
Lists: a list is an ordered collection of data of arbitrary types.
> doe = list(name="john",age=28,married=FALSE)
> doe$name
[1] "john"
> doe$age
[1] 28

Data frames
A data frame represents the typical data table that researchers come up with, like a spreadsheet. It is a rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types.
Example:
        localisation  tumorsize  progress
XX348   proximal            6.3  FALSE
XX234   distal              8.0  TRUE
XX987   proximal           10.0  FALSE
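The example table above could be built in R roughly as follows. This is a sketch: the variable name `tumors` is chosen for this example, and the values are copied from the table.

```r
# Build the example table; each column has its own type.
tumors <- data.frame(
  localisation = c("proximal", "distal", "proximal"),  # text
  tumorsize    = c(6.3, 8.0, 10.0),                    # number
  progress     = c(FALSE, TRUE, FALSE),                # logical
  row.names    = c("XX348", "XX234", "XX987")
)

str(tumors)                       # column names and types
tumors["XX234", "tumorsize"]      # one cell, by row name and column name
tumors$tumorsize                  # one whole column
tumors[tumors$progress, ]         # rows where progress is TRUE
```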

R as a simple calculator
> x<-c(1,3,2,10,5); y<-1:5   # creation of 2 vectors
> x
[1]  1  3  2 10  5
> x+y
[1]  2  5  5 14 10
> x*y
[1]  1  6  6 40 25
> x/y
[1] 1.0000000 1.5000000 0.6666667 2.5000000 1.0000000
> x^y
[1]     1     9     8 10000  3125
> sum(x)      # sum of elements in x
[1] 21
> cumsum(x)   # cumulative sum vector
[1]  1  4  6 16 21

Basic math/stat tools
> # 20 random numbers between 0 and 20
> x = round(runif(20,0,20), digits=1)
> x
 [1] 10.0  1.6  2.5 15.2  3.1 12.6 19.4  6.1
 [9]  9.2 10.9  9.5 14.1 14.3 14.3 12.8
[16] 15.9  0.1 13.1  8.5  8.7
> min(x)
[1] 0.1
> max(x)
[1] 19.4
> median(x)   # median
[1] 10.45
> mean(x)     # mean
[1] 10.095
> var(x)      # variance
[1] 27.43734
> sd(x)       # standard deviation
[1] 5.238067
> sqrt(var(x))
[1] 5.238067
> length(x)
[1] 20
> round(x)
 [1] 10  2  2 15  3 13 19  6  9 11 10 14 14 14 13 16  0 13  8  9
> cor(x,sin(x/20))   # correlation
[1] 0.997286
> quantile(x)        # quantiles
   0%   25%   50%   75%  100%
 0.10  7.90 10.45 14.15 19.40
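Because `runif()` draws random numbers, running the commands above gives different values each time. Fixing the seed with `set.seed()` makes the session reproducible. The sketch below shows one way to do this; the seed value 42 and the name `summary_stats` are arbitrary choices for this example.

```r
# Fix the random number generator so the same 20 values come out on every run.
set.seed(42)
x <- round(runif(20, 0, 20), digits = 1)

# Collect several summaries in one named vector.
summary_stats <- c(
  min    = min(x),
  median = median(x),
  mean   = mean(x),
  max    = max(x),
  sd     = sd(x)
)
summary_stats

# The standard deviation is the square root of the variance.
all.equal(sd(x), sqrt(var(x)))
```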

Statistical tools
• Sample tests
• Checking normality
• Kolmogorov-Smirnov test
> # generate 500 observations from a uniform (0,1) distribution
> F500<-runif(500); a<-c(mean(F500),sd(F500))
> qqnorm(F500)   # normal probability plot
> qqline(F500)   # an ideal (normal) sample will fall near the straight line
> ks.test(F500, "pnorm", mean=a[1], sd=a[2])
        One-sample Kolmogorov-Smirnov test
data:  F500
D = 0.0655, p-value = 0.02742
alternative hypothesis: two.sided
The p-value is below 0.05, so the hypothesis that the sample is normal is rejected at the 5% level, as expected for uniform data.
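To see how the test behaves on data that really is normal, the same check can be run on a uniform and a normal sample side by side. This is a sketch under stated assumptions: the helper name `check_normal` and the seed are hypothetical choices for this example. As on the slide, the mean and standard deviation are estimated from the sample itself, so the reported p-values are only approximate.

```r
# Hypothetical helper: one-sample Kolmogorov-Smirnov test against a normal
# distribution whose mean and sd are estimated from the sample itself.
check_normal <- function(x) {
  ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
}

set.seed(1)
u <- runif(500)   # uniform (0,1) sample, not normal
n <- rnorm(500)   # standard normal sample

res_u <- check_normal(u)
res_n <- check_normal(n)

# Compare the two p-values; a small value is evidence against normality.
c(uniform = res_u$p.value, normal = res_n$p.value)
```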
