Continuous Representations of Time Gene Expression Data. Ziv Bar-Joseph, Georg Gerber, David K. Gifford. MIT Laboratory for Computer Science. J. Comput. Biol., 10, 341-356, 2003.
Continuous Representations of Time Gene Expression Data Ziv Bar-Joseph, Georg Gerber, David K. Gifford MIT Laboratory for Computer Science J. Comput. Biol.,10,341-356, 2003
Outline • Splines • Estimating Unobserved Expression Values and Time Points • Model Based Clustering Algorithm for Temporal Data • Aligning Temporal Data • Results
Splines • The word “spline” comes from the shipbuilding industry
Splines • Splines are piecewise polynomials with boundary continuity and smoothness constraints. • The typical way to represent a piecewise cubic curve :
Splines • We have cubic polynomials: • equations are required: • Interpolating splines
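The formulas on these two slides did not survive extraction. As a hedged sketch of the standard counting argument for an interpolating piecewise cubic, assume break-points x_1 < ... < x_q, data values y_j at the break-points, and cubic pieces p_j. This notation is an assumption and may differ from the original slides.

```latex
% Assumed notation: q break-points x_1 < ... < x_q, data values y_j, cubic pieces p_j.
\[
  y(t) = p_j(t), \qquad x_j \le t < x_{j+1}, \qquad j = 1, \dots, q-1 .
\]
% Each of the q-1 cubic pieces has 4 coefficients, so 4(q-1) equations are needed.
\begin{align*}
  \text{interpolation at both ends of each piece:} \quad & 2(q-1) \\
  \text{continuity of } y' \text{ and } y'' \text{ at the } q-2 \text{ interior break-points:} \quad & 2(q-2) \\
  \text{two boundary conditions, e.g. } y''(x_1) = y''(x_q) = 0 \text{:} \quad & 2 \\
  \text{total:} \quad & 2(q-1) + 2(q-2) + 2 = 4(q-1)
\end{align*}
```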
Splines • B-spline • In terms of a set of normalized basis functions • The application: fitting curves to gene expression time-series data • The B-spline basis makes it convenient to obtain approximating or smoothing splines • Fewer basis coefficients than there are observed data points • Avoids overfitting
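A minimal sketch of an approximating spline with fewer coefficients than observations, fitted by ordinary least squares. The function names (`basis`, `solve`, `fit_bspline`, `evaluate`), the uniform knot vector, and the synthetic data are all illustrative assumptions, and plain least squares stands in here for whatever fitting procedure the authors used.

```python
def basis(i, k, t, knots):
    """Order-k (degree k-1) B-spline basis b_{i,k}(t) via the Cox-de Boor recursion."""
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    value = 0.0
    left_span = knots[i + k - 1] - knots[i]
    right_span = knots[i + k] - knots[i + 1]
    if left_span > 0:
        value += (t - knots[i]) / left_span * basis(i, k - 1, t, knots)
    if right_span > 0:
        value += (knots[i + k] - t) / right_span * basis(i + 1, k - 1, t, knots)
    return value


def solve(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y[r]] for r, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def fit_bspline(times, values, knots, k=4):
    """Least-squares basis coefficients: solve (S^T S) c = S^T y."""
    n_coef = len(knots) - k
    S = [[basis(i, k, t, knots) for i in range(n_coef)] for t in times]
    StS = [[sum(row[a] * row[b] for row in S) for b in range(n_coef)]
           for a in range(n_coef)]
    Sty = [sum(row[a] * v for row, v in zip(S, values)) for a in range(n_coef)]
    return solve(StS, Sty)


def evaluate(coef, knots, t, k=4):
    """Value of the spline sum_i c_i b_{i,k}(t)."""
    return sum(c * basis(i, k, t, knots) for i, c in enumerate(coef))


# Synthetic example (not real expression data): 12 observations, 6 coefficients.
knots = list(range(-3, 7))               # 10 uniform knots -> 6 cubic basis functions
times = [0.25 * j for j in range(12)]    # all inside the valid region [0, 3)
true_coef = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0]
values = [evaluate(true_coef, knots, t) for t in times]
coef = fit_bspline(times, values, knots)
resampled = evaluate(coef, knots, 1.1)   # estimate at an unobserved time point
```

Because the synthetic values are generated from a spline in the same basis, the fit recovers the generating coefficients; with noisy data the same code returns a smoothed curve instead of an interpolant.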
Splines • The basis coefficients : • Interpreted geometrically as control points • The vertices of a polygon that control the shape of the spline but are not interpolated by the curve • The curve lies entirely within the convex hull of this controlling polygon. • Each vertex exerts only a local influence on the curve.
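Splines • [Figure: basis functions b_{i,1}, b_{i,2}, b_{i,3} plotted against t, with knots x_i, x_{i+1}, x_{i+2}, x_{i+3} on the horizontal axis] • Within any interval between adjacent knots x_i, S(t) is a polynomial of degree k-1 • S(t) has continuous derivatives of order 1, 2, …, k-2 • For a given value of k: • Over the valid range of t, b_{i,k} ≥ 0, and each b_{i,k} has exactly one maximum; except for k = 1, 2, every b_{i,k} is a smooth continuous curve.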
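The basis-function properties on this slide can be checked numerically. The sketch below uses the standard Cox-de Boor recursion; the function name `bspline_basis` and the integer knot vector are illustrative assumptions, not taken from the slides.

```python
def bspline_basis(i, k, t, knots):
    """Order-k (degree k-1) basis function b_{i,k}(t), Cox-de Boor recursion."""
    if k == 1:
        # Order 1: indicator of the half-open knot interval [x_i, x_{i+1}).
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    value = 0.0
    left_span = knots[i + k - 1] - knots[i]
    right_span = knots[i + k] - knots[i + 1]
    if left_span > 0:
        value += (t - knots[i]) / left_span * bspline_basis(i, k - 1, t, knots)
    if right_span > 0:
        value += (knots[i + k] - t) / right_span * bspline_basis(i + 1, k - 1, t, knots)
    return value


knots = list(range(8))                       # assumed uniform knots 0, 1, ..., 7
grid = [0.1 * j for j in range(41)]          # sample points covering [0, 4]
cubic = [bspline_basis(0, 4, t, knots) for t in grid]

non_negative = all(v >= 0.0 for v in cubic)  # b_{i,k} >= 0 everywhere
peak = cubic.index(max(cubic))               # index of the single maximum
sum_at_3_5 = sum(bspline_basis(i, 4, 3.5, knots) for i in range(4))
```

On this grid the cubic basis function rises to one peak and then falls, and the four cubic basis functions that overlap at t = 3.5 sum to one.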
Splines • A uniform knot vector is one in which the entries are evenly spaced • i.e. • The basis functions will be translates of each other, i.e. • For a periodic cubic B-spline (k=4), the equation specifying the curve:
B-splines • The B-spline will only be defined in the shaded region between t = 3 and t = 4
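To illustrate why the curve is only defined between t = 3 and t = 4, here is a small sketch that assumes knots at the integers 0 through 7 and four cubic basis functions, which all overlap only on that interval. The closed-form weights are the standard uniform cubic B-spline pieces; the names `cubic_weights` and `curve` and the control values are hypothetical.

```python
def cubic_weights(u):
    """Values of the four overlapping uniform cubic basis functions at offset u in [0, 1]."""
    return (
        (1 - u) ** 3 / 6,
        (3 * u**3 - 6 * u**2 + 4) / 6,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6,
        u**3 / 6,
    )


def curve(control, t):
    """Evaluate S(t) = sum_i c_i b_i(t) for four control values, valid for 3 <= t <= 4."""
    if not 3 <= t <= 4:
        raise ValueError("spline is only defined where all four basis functions overlap")
    return sum(w * c for w, c in zip(cubic_weights(t - 3), control))


control = [0.0, 6.0, 0.0, 0.0]      # hypothetical control points
start = curve(control, 3.0)         # only the second basis function contributes here
middle = curve(control, 3.5)
```

Since the weights are non-negative and sum to one, every value of the curve is a convex combination of the control points, which is the convex hull property stated earlier.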
Estimating Unobserved Expression Values and Time Points • To obtain a continuous-time formulation, use cubic B-splines • Get the value of the splines at a set of control points in the time series • Re-sample the curve to estimate expression values at any time point • Spline functions are not fit to each gene individually • Because of noise and missing values, this would lead to over-fitting • Instead, constrain the spline coefficients of co-expressed genes to have the same covariance matrix • Use other genes in the same class to estimate the missing values of a specific gene
Estimating Unobserved Expression Values and Time Points • A probabilistic model of time-series expression data • Assume a set of genes are grouped together • Using prior biological knowledge • Or a clustering algorithm
Estimating Unobserved Expression Values and Time Points • To learn the parameters of this model (μ, Γ, and σ²) • Use the observed values and maximize the likelihood of the input data
Estimating Unobserved Expression Values and Time Points • Decompose the probability : • If the values were observed, decompose the probability:
Estimating Unobserved Expression Values and Time Points • Use EM • E step: find the best estimate for the unobserved γ values using the values we have for σ², Γ, and μ. • M step: maximize the expected likelihood over the parameters, given those estimates.
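The model equations on these slides were lost in extraction. As a hedged sketch, assume the linear mixed-effects form in which each gene's observations are a class-average spline plus a gene-specific deviation plus noise. The symbols Y_i (observed values of gene i), S_i (basis functions evaluated at that gene's observed time points), and the class index j are assumptions about notation.

```latex
% Assumed model for gene i in class j (notation is a reconstruction, not from the slides):
\[
  Y_i = S_i\,(\mu_j + \gamma_i) + \epsilon_i, \qquad
  \gamma_i \sim \mathcal{N}(0, \Gamma_j), \qquad
  \epsilon_i \sim \mathcal{N}(0, \sigma^2 I).
\]
% E step under this model: posterior mean of the gene-specific coefficients,
% computed with the current values of mu_j, Gamma_j and sigma^2.
\[
  \hat{\gamma}_i = \left( \sigma^2\,\Gamma_j^{-1} + S_i^{\top} S_i \right)^{-1}
                   S_i^{\top} \left( Y_i - S_i\,\mu_j \right).
\]
```

Under this form, a gene with few observations is pulled toward its class average, which is how other genes in the same class contribute to estimating its missing values.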
Model Based Clustering Algorithm for Temporal Data • A new clustering algorithm that simultaneously solves the parameter estimation and class assignment problems • EM algorithm • E step • M step
Aligning Temporal Data • Assume we have two sets of time-series gene expression profiles • Splines for reference • Splines in the set to be warped • A mapping • Linear transformation
Aligning Temporal Data • The error of the alignment: • The averaged squared distance between the two curves • Takes into account the degree of overlap between the curves • Find parameters a and b that minimize this error • The error for a set of genes S of size n
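The alignment error can be sketched as follows. The form of the linear mapping, T(s) = (s - b) / a, the plain average over the gene set, the midpoint-rule integration, and the grid search are all assumptions made for illustration; the names `alignment_error` and `best_alignment` are hypothetical.

```python
import math


def alignment_error(ref, warped, ref_range, warped_range, a, b, steps=200):
    """Averaged squared distance between ref(s) and warped(T(s)) over their overlap.

    Assumes T(s) = (s - b) / a with a != 0, so warped time t maps back to s = a*t + b.
    """
    s_min, s_max = ref_range
    t_min, t_max = warped_range
    lo, hi = sorted((a * t_min + b, a * t_max + b))
    alpha, beta = max(s_min, lo), min(s_max, hi)   # start and end of the overlap
    if beta <= alpha:
        return math.inf                            # curves do not overlap at all
    h = (beta - alpha) / steps
    total = 0.0
    for k in range(steps):                         # midpoint rule
        s = alpha + (k + 0.5) * h
        diff = warped((s - b) / a) - ref(s)
        total += diff * diff
    return total * h / (beta - alpha)              # divide by the overlap length


def best_alignment(pairs, ref_range, warped_range, a_grid, b_grid):
    """Grid search for (a, b) minimizing the mean error over a set of gene pairs."""
    best = None
    for a in a_grid:
        for b in b_grid:
            err = sum(alignment_error(r, w, ref_range, warped_range, a, b)
                      for r, w in pairs) / len(pairs)
            if best is None or err < best[0]:
                best = (err, a, b)
    return best


# Synthetic curves standing in for fitted splines: warped(t) = ref(2t + 1).
pairs = [
    (math.sin, lambda t: math.sin(2 * t + 1)),
    (math.cos, lambda t: math.cos(2 * t + 1)),
]
err, a_hat, b_hat = best_alignment(
    pairs, (0.0, 10.0), (0.0, 4.5),
    a_grid=[1.0, 1.5, 2.0, 2.5], b_grid=[0.0, 0.5, 1.0, 1.5],
)
```

Dividing by the overlap length keeps errors comparable across candidate mappings, but a practical implementation would also need to guard against mappings that leave only a very short overlap.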
Results • 800 genes in Saccharomyces cerevisiae with five groups • Unobserved data estimation
Results • Clustering • Explore the effect of non-uniform sampling • Two synthetic curves: