STAT115 Lab 3 PART I Homework Q8 The Dot Matrix Method
A general way to see similarities in pair-wise comparisons: • The Dot Matrix Method. • Gets you started thinking about sequence alignment in general. • Provides a ‘Gestalt’ of all possible alignments between two sequences. • To begin, I will use a very simple 1, 0 (match, no-match) identity scoring function without any windowing. As you will see later today, more complex scoring functions are normally used in sequence analysis (especially with amino acid sequences).
Since this is a comparison of a sequence with itself (an intra-sequence comparison), the most obvious feature is the main identity diagonal. Two short perfect palindromes can also be seen as crosses directly off the main diagonal; they are “ANA” and “SIS.”
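As a rough illustration of the 1/0 identity scoring described above, the dot matrix can be sketched in R (the language used in the tutorial part of this lab). The helper name `dot.matrix` and the example string are assumptions made for illustration; the slides only tell us that the sequence contains the palindromes “ANA” and “SIS”.

```r
# Minimal sketch: 1/0 identity dot matrix, no windowing.
# dot.matrix is a hypothetical helper name, not a base R function.
dot.matrix <- function(seq1, seq2) {
  a <- strsplit(seq1, "")[[1]]   # split into single characters
  b <- strsplit(seq2, "")[[1]]
  m <- outer(a, b, "==") * 1     # 1 where the letters match, 0 otherwise
  dimnames(m) <- list(a, b)
  m
}

s <- "SEQUENCEANALYSISPRIMER"    # ASSUMED example sequence
m <- dot.matrix(s, s)

# Draw the dots only when running interactively.
if (interactive()) {
  hits <- which(m == 1, arr.ind = TRUE)
  plot(hits[, "col"], hits[, "row"], pch = 16,
       ylim = c(nrow(m), 1), xlab = "sequence 2", ylab = "sequence 1")
}
```

In this sketch the main diagonal is all ones, and a palindrome such as “ANA” adds dots at `m[9, 11]` and `m[11, 9]`, which together with `m[10, 10]` form the short line perpendicular to the diagonal.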
The biggest asset of dot matrix analysis is that it allows you to visualize the entire comparison at once, not concentrating on any one ‘optimal’ region, but rather giving you the ‘Gestalt’ of the whole thing.
Check out the ‘mutated’ inter-sequence comparison below: Here you can easily see the effect of a sequence ‘insertion’ or ‘deletion.’ It is impossible to tell whether the evolutionary event that caused the discrepancy between the two sequences was an insertion or a deletion, and hence this phenomenon is called an ‘indel.’ A jump or shift in the register of the main diagonal on a dotplot clearly points out the existence of an indel. (Again, a zero:one match score function is used.)
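The shift in register can also be read off numerically. The following is a small sketch in R using two made-up toy sequences (not the ones from the slide), where the second is the first with two letters removed; the row-minus-column offset of each dot tells you which diagonal it lies on.

```r
# Toy sequences (made up for illustration): seq2 is seq1 without "EF".
seq1 <- strsplit("ABCDEFGHIJ", "")[[1]]
seq2 <- strsplit("ABCDGHIJ", "")[[1]]

m <- outer(seq1, seq2, "==") * 1          # 1/0 identity dot matrix
hits <- which(m == 1, arr.ind = TRUE)     # row/col positions of the dots
offset <- hits[, "row"] - hits[, "col"]   # which diagonal each dot is on

table(offset)   # dots on offset 0 before the indel, offset 2 after it
```

The jump from one offset to another is the indel; its size is the difference between the two offsets.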
Another phenomenon that is very easy to visualize with dot matrix analysis is duplication, that is, direct repeats. These are shown in the following example: The ‘duplication’ here is seen as a distinct column of diagonals; whenever you see either a row or column of diagonals in a dotplot, you are looking at direct repeats.
Now consider the more complicated ‘mutation’ in the following comparison: Again, notice the diagonals. However, they have now been displaced off the center diagonal of the plot and, in this example, show the occurrence of a ‘transposition.’ Dot matrix analysis is one of the only sensible ways to locate such transpositions in sequences. Inverted repeats still show up as lines perpendicular to the diagonals; they are just no longer at the center of the plot. The ‘deletion’ of ‘PRIMER’ is shown by the lack of a corresponding diagonal.
Filtered Windowing: Reconsider the same plot. Notice the extraneous dots that indicate neither runs of identity between the two sequences nor inverted repeats. These merely contribute ‘noise’ to the plot and are due to the ‘random’ occurrence of the letters in the sequences, that is, the composition of the sequences themselves. How can we ‘clean up’ the plots so that this noise does not detract from our interpretations? Consider a filtered windowing approach, in which a dot is only placed if some ‘stringency’ is met. That is, a window of some defined size is examined, and only if the score within it meets a defined criterion is a dot placed at the middle of that window. Then the window is shifted one position and the entire process is repeated. This very successfully rids the plot of unwanted noise.
In this plot a window of size three and a stringency of two are used to considerably improve the signal-to-noise ratio (remember, I am using a 1:0 identity scoring function).
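The filtered windowing idea can be sketched in R as follows. This is only one possible reading of the description: the window runs along the diagonal through each cell, the window size is assumed to be odd, and windows that hang over the edge of the matrix are simply truncated. The function name `windowed.dot.matrix` is hypothetical.

```r
# Sketch of filtered windowing with 1/0 identity scoring.
# ASSUMPTIONS: odd window size; windows are truncated at the matrix edges.
windowed.dot.matrix <- function(seq1, seq2, window = 3, stringency = 2) {
  a <- strsplit(seq1, "")[[1]]
  b <- strsplit(seq2, "")[[1]]
  id <- outer(a, b, "==") * 1            # unfiltered identity matrix
  half <- (window - 1) %/% 2
  out <- matrix(0, length(a), length(b))
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      ii <- i + (-half:half)             # cells along the diagonal
      jj <- j + (-half:half)
      ok <- ii >= 1 & ii <= length(a) & jj >= 1 & jj <= length(b)
      score <- sum(id[cbind(ii[ok], jj[ok])])
      if (score >= stringency) out[i, j] <- 1   # dot at the window middle
    }
  }
  out
}

noisy <- windowed.dot.matrix("ABC", "XBY")   # one isolated match
clean <- windowed.dot.matrix("ABC", "ABC")   # a run of three matches
```

With a window of three and a stringency of two, the isolated match in `noisy` is filtered out, while the run of identities in `clean` survives as a diagonal.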
TUTORIAL I LAB 3 Alejandro Quiroz-Zárate Daniel Fernandez
A little history • R is a dialect of the S language
Essentially we work with a 40-year-old technology! • R is divided into 2 parts • The BASE system • What comes with the download from CRAN (Comprehensive R Archive Network) • The packages that you download • Based on your needs!!! • Over 1000 packages on CRAN • http://www.r-project.org/ • Last but NOT least • R is FREE!!!!!!
Outline • The Console and the Script • Workspace management • Objects • Classes and Mode • Some Classes: • Vectors, Matrices and data.frames • Some Modes: • Lists, strings • Loops and conditional statements • Functions • R functions • My own functions • Handling data • Reading and writing! • Plotting! • Libraries • Exercises
Getting started • The Console: essentially where the commands are executed • The Script: where the code is written
An R session • Type code here • Output appears • Adjust/Extend code
Workspace Management • Before jumping into R, it is important to ask ourselves • Where am I? • getwd() • I want to be there… • setwd("C:/") • Who am I with? • dir() # lists all the files in the working directory • Who can I count on? • ls() # lists all the variables in the current session
Workspace Management (2) • Saving • save(x,file="name.RData") • Saves specific objects • save.image("name.RData") • Saves the whole workspace • Loading • load("name.RData") • ‘?function’ and ‘??function’ • ? gets the documentation of the function • ?? finds functions related to the query
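A short round trip may make the save/load pair more concrete. This sketch writes to R's temporary directory rather than the working directory, so the file path is an assumption chosen to avoid cluttering your folders.

```r
# Save one object, remove it from the workspace, then load it back.
x <- c(10, 5, 3, 6)
f <- file.path(tempdir(), "name.RData")   # temporary location (assumption)

save(x, file = f)   # saves only the object x
rm(x)               # x is gone from the workspace: ls() no longer lists it
load(f)             # x is restored under its original name
```

Note that `load()` restores objects under the names they had when saved, silently overwriting any object of the same name in the current session.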
R Objects • Almost all things in R are OBJECTS! • Functions, datasets, results, etc… (but not graphs) • OBJECTS are classified by two criteria • MODE: How objects are stored in R • Character, numeric, logical, list, function… • To obtain the mode of an object • mode(object) • CLASS: How objects are treated by functions • Vector, matrix, array, data.frame, factor,… • To obtain the class of an object • class(object)
R Objects (2) (Figure: a data table with columns x1 x2 x3 x4 x5 x6 and rows 1 to 8.) • MODE: Is determined by the type of things stored (numbers, characters, Booleans, …) • If only numbers: numeric • If it is a mixture: list • CLASS: Is determined by how functions deal with the object • If only numbers: matrix • If it is a mixture: data.frame
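The difference between mode and class is easiest to see by asking R directly. The objects below are small made-up examples, not data from the lab.

```r
# Same questions, three different objects.
m  <- matrix(1:8, 2, 4)                            # only numbers
df <- data.frame(n = 1:3, s = c("a", "b", "c"))    # a mixture
f  <- factor(c("low", "high", "low"))

mode(m);  class(m)    # stored as numeric; treated as a matrix
mode(df); class(df)   # stored as a list;  treated as a data.frame
mode(f);  class(f)    # stored as numeric; treated as a factor
```

In recent versions of R, `class(m)` returns both "matrix" and "array", so it is safer to test with `"matrix" %in% class(m)` than with `==`.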
Some classes • Vectors!!! • x=c(10,5,3,6) • Calculations on vectors are performed on each entry • y=c(log(x),x,x^2) • Vectors in an operation do not need to have the same length (the shorter one is recycled)! • w=sqrt(x)+2 • z=c(pi,exp(1),sqrt(2)) • x+z # works, but with a warning: 4 is not a multiple of 3 • Logical vectors • aux=x<7
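A small sketch of what happens when the lengths differ, and of what a logical vector is good for. The two-element vector and the name `small` are additions for illustration only.

```r
x <- c(10, 5, 3, 6)

w <- sqrt(x) + 2     # the single value 2 is recycled to every entry
r <- x + c(1, 2)     # c(1, 2) is recycled as 1, 2, 1, 2

aux <- x < 7         # logical vector: one TRUE/FALSE per entry
small <- x[aux]      # keep only the entries where aux is TRUE
sum(aux)             # TRUE counts as 1, so this counts entries below 7
```

Recycling is silent when the longer length is a multiple of the shorter one; otherwise R still computes a result but issues a warning.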
Some classes (2) • Matrices !!! • x=1:8 • dim(x)=c(2,4) • y=matrix(1:8,2,4,byrow=F) • Operations are applied on each element • x*x, max(x) • x=matrix(1:28,ncol=4), y=7:10 so then x*y is…? • y=matrix(1:8,ncol=2) • y%*%t(y)
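For the question about x*y above, the following sketch may help. It relies on two facts about R: matrices are stored column by column, and `*` recycles the shorter vector along that storage order (it is not matrix multiplication, which is `%*%`). The names `p`, `y2` and `g` are introduced here for illustration.

```r
x <- matrix(1:28, ncol = 4)   # 7 rows, 4 columns, filled column by column
y <- 7:10

p <- x * y    # y is recycled down the columns: 7, 8, 9, 10, 7, 8, ...
p[1:5, 1]     # entries 1..5 of column 1 times 7, 8, 9, 10, 7

y2 <- matrix(1:8, ncol = 2)   # 4 x 2
g <- y2 %*% t(y2)             # true matrix product: (4 x 2) times (2 x 4)
dim(g)
```

So x*y does not multiply column k of x by y[k]; because x has 7 rows and y has 4 entries, the recycled values drift across rows and columns.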
Some classes (3) • Extracting info • y[1,] or y[,1] • Extending matrices • cbind(y,seq(101,104)) • rbind(y,c(102,109)) • apply is a useful function! • apply(y,2,mean) • apply(y,1,log)
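A quick sketch of these calls on the 4 x 2 matrix from the previous slide, with attention to the shape of what comes back. The result names (`col.means`, `row.logs`, `wider`, `taller`) are made up for this example.

```r
y <- matrix(1:8, ncol = 2)          # 4 rows, 2 columns

y[1, ]                              # first row
y[, 1]                              # first column

wider  <- cbind(y, seq(101, 104))   # adds a column: needs 4 values
taller <- rbind(y, c(102, 109))     # adds a row: needs 2 values

col.means <- apply(y, 2, mean)      # 2 = over columns
row.logs  <- apply(y, 1, log)       # 1 = over rows
dim(row.logs)                       # note: the result comes back transposed
```

When the applied function returns a vector, `apply` stores each result as a column, which is why `apply(y, 1, log)` is 2 x 4 rather than 4 x 2.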
Some classes (4) • data.frame!!! • Creation • Several ways to create a data frame • 1) • logical=sample(c(T,F),size=20,replace=T) • numeric=rnorm(20) • my.df=data.frame(logical, numeric) • 2) • test=matrix(rnorm(21),7,3) • test=data.frame(test) • class(my.df[1,])
A mode • Lists!!! • Is like a vector • An element of a list can be an object of any type and structure • x1=1:5 • x2=c(T,T,F,T,F) • y=list(numbers=x1,questions=x2)
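Once a list is built, its elements can be pulled out by name or by position. This sketch reuses the list from the slide; the only point it adds is the difference between single and double brackets.

```r
x1 <- 1:5
x2 <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
y <- list(numbers = x1, questions = x2)

y$numbers          # the element itself, by name
y[["questions"]]   # the element itself, by name as a string
y[[1]]             # the element itself, by position
y[1]               # a list of length 1 that contains the element
names(y)
```

Single brackets always return a list, while `$` and double brackets return what is stored inside it.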
Functions! • My own functions • function.name=function(arg1,arg2,…,argN) { Body of the function } • fun.plot=function(y,z){ y=log(y)*z-z^3+z^2; plot(z,y) } • z=seq(-11,10) • y=seq(11,32) • fun.plot(y,z)
Functions! (2) • The ‘...’ argument • Can be used to pass arguments from one function to another • Without the need to specify arguments in the header • fun.plot=function(y,z,...) { y=log(y)*z-z^3+z^2; plot(z,y,...) } • fun.plot(y,z,type="l",col="red") • fun.plot(y,z,type="l",col="red",lwd=4)
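The same mechanism works with any inner function, not only plot. As a non-graphical sketch, the hypothetical function `center` below forwards whatever extra arguments it receives to `mean`.

```r
# center is a made-up example function: subtract the mean from a vector.
# Anything passed through ... goes straight to mean().
center <- function(v, ...) {
  v - mean(v, ...)
}

v <- c(1, 2, NA, 3)
plain   <- center(v)                 # mean is NA, so everything is NA
cleaned <- center(v, na.rm = TRUE)   # na.rm is handed on to mean()
```

The header of `center` never mentions `na.rm`, yet the caller can still control it; that is the whole point of ‘...’.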
Handling data I/O • Reading files • read.csv("filename.csv") # reads csv files into a data.frame • read.table("filename.txt") # reads txt files in a table format into a data.frame • scan(filename) # not friendly for matrices or tables!!! • Writing to files • write(x,file="filename") # writes the object x to filename • write.table(x,filename) # writes the object x to filename in a table format
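A write-then-read round trip is a convenient way to check that a file format does what you expect. The data frame and the temporary file below are made up for this sketch.

```r
# Small made-up table, written to a temporary text file and read back.
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
f <- tempfile(fileext = ".txt")

write.table(df, f, row.names = FALSE)   # header line plus three rows
back <- read.table(f, header = TRUE)    # header = TRUE: first line is names

str(back)
```

Without `header = TRUE`, `read.table` would treat the column names as a row of data, and without `row.names = FALSE` the file would gain an extra column of row labels.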
Plotting!
x.data=rnorm(1000)
y.data=x.data^3-10*x.data^2
z.data=-0.5*y.data-90
plot(x.data,y.data,main="Title of the graph",xlab="x label",ylab="y label")
points(x.data,z.data,col="red")
legend(-2,2,legend=c("Black points","Red points"),col=c("black","red"),pch=1,text.col=c("black","red"))
Plotting! (2) • You can export graphs in many formats • To check the formats that are available in your R installation • capabilities() • png • png("Lab2_plot.png",width=520,height=440) • plot(x.data,y.data,main="Title of the graph",xlab="x label",ylab="y label") • points(x.data,z.data,col="red") • legend(-2,2,legend=c("Black points","Red points"),col=c("black","red"),pch=1,text.col=c("black","red")) • dev.off() • eps • postscript("Lab2_plot.eps",width=7,height=6,horizontal=FALSE,onefile=FALSE,paper="special") # width and height are in inches here, not pixels • dev.off()
Libraries!! • A collection of R functions that together perform a specialized analysis or task. • Install packages from CRAN • install.packages("PackageName") • Loading libraries • library(LibraryName) • Getting the documentation of a library • library(help=LibraryName) • Listing all the installed packages • library()
Exercise 1 – Probability Transform • We know that […], and we want to know the probability associated with […] • Plot the theoretical pdf and cdf of X. • Generate 10,000,000 observations of the random variable X • Compute Y=3X^5+4X^2-7 • Estimate the probability that […] • Plot the histogram and empirical CDF of Y
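The general simulation recipe for this kind of exercise can be sketched as below. The distribution of X and the event of interest did not survive in these slides, so the sketch uses a standard normal X and the event Y > 0 purely as placeholder assumptions; substitute the ones given in class. The sample size is also reduced to keep the sketch fast.

```r
set.seed(115)                 # for reproducibility
n <- 1e5                      # the exercise asks for 10,000,000

x <- rnorm(n)                 # ASSUMPTION: X ~ N(0, 1), placeholder only
y <- 3 * x^5 + 4 * x^2 - 7    # the transform from the exercise

p.hat <- mean(y > 0)          # ASSUMPTION: the event is Y > 0
Fy <- ecdf(y)                 # empirical CDF of Y, usable as a function

# Draw the plots only when running interactively.
if (interactive()) {
  hist(y, breaks = 100, main = "Histogram of Y")
  plot(Fy, main = "Empirical CDF of Y")
}
```

The estimate works because the mean of a TRUE/FALSE vector is the fraction of TRUE values; the same number can be read from the empirical CDF as `1 - Fy(0)`.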
Exercise 2 – The empire strikes back: GOOG versus BAIDU • Plot historical stock price time series using prices from Yahoo Finance. • Download and install the tseries package. • Load the tseries package as a library in your code. • Use get.hist.quote to download GOOG and BAIDU historical data. • Plot both time series in the same panel and add a legend to the plot.
Exercise 3 – Challenging Challenger On January 28, 1986, the Space Shuttle Challenger exploded in the early stages of its flight. Feynman, along with a committee, determined that the explosion was due to low temperatures and the failure of the O-ring seals on the booster rockets. The ambient temperature was 36 degrees Fahrenheit on the morning of the launch. The scientists had data (temperature, number of failures) from previous flights.
Question 3 – Challenging Challenger • Plot the number of failures versus the temperature for flights with one or more O-ring failures. Is there any evidence that temperature affects O-ring performance? • Plot the number of failures versus temperature for all the flights. Is there any evidence that temperature affects O-ring performance? • What’s your conclusion? What do you think the scientists plotted before making the decision to fly that day? Just for historical curiosity: who played a central role in discovering the causes of the failure, and how did he announce it?