An Introduction to Bioinformatics. Department of Medical Informatics, Peking University Health Science Center. Qinghua Cui. 11-16, 2008. Introduction to basic concepts. Bioinformatics -- a definition -- by NIH (1995).
An Introduction to Bioinformatics Department of Medical Informatics, Peking University Health Science Center Qinghua Cui 11-16, 2008
Bioinformatics -- a definition -- by NIH (1995) Bioinformatics is defined as a scientific discipline that encompasses all aspects of biological information acquisition, processing, storage, distribution, analysis and interpretation, and that combines the tools and techniques of mathematics, computer science and biology with the aim of understanding the biological significance of a variety of data.
Bio-informatics– the term • Bio-informatics • Computational biology • Biological computing
Data ★ Large-scale and high-throughput ★ High-dimensional ★ Non-linear ★ Noisy ★ Unequally distributed
Bioinformatics– what is most important? • Algorithms? • Data? • Questions!
Bioinformatics– misconceptions • Can it do everything? • Biology / informatics Biology: Experimental, Theoretical, Computational
Alignment E < 10^-20 • blastall • blastp • blastn • blastx • tblastn • tblastx • ClustalX
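A minimal sketch of applying an E-value cutoff to BLAST hits. It reads the slide's threshold as E < 10^-20 (an assumption) and assumes the hits were saved in the standard 12-column tabular format (`-m 8` in legacy blastall, `-outfmt 6` in BLAST+). The function name `filter_hits` and the sample hits are made up for illustration.

```python
def filter_hits(lines, max_evalue=1e-20):
    """Keep tabular BLAST hits whose E-value is below max_evalue."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        # Standard 12 columns: qseqid sseqid pident length mismatch gapopen
        # qstart qend sstart send evalue bitscore
        evalue = float(fields[10])
        if evalue < max_evalue:
            kept.append((fields[0], fields[1], evalue, float(fields[11])))
    return kept


# Made-up example hits (illustrative values only)
sample = [
    "q1\ts1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380",
    "q1\ts2\t45.0\t80\t44\t2\t5\t84\t10\t89\t0.003\t35.2",
    "q2\ts3\t75.0\t150\t37\t1\t1\t150\t1\t150\t2e-25\t110",
]
hits = filter_hits(sample)
for query, subject, evalue, bitscore in hits:
    print(query, subject, evalue, bitscore)
```

Only the hits against s1 and s3 pass the cutoff; the weak hit against s2 (E = 0.003) is dropped.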
Evolution Selection • Coding region: Ka, Ks (dN, dS), Ka/Ks (dN/dS) • PAML • KaKs_Calculator • K-Estimator • MEGA • Database: UCSC or ENSEMBL • Non-coding region • Haygood et al. (Nature Genetics 2007) • Recent populations • LRH test (Sabeti et al., Nature 2002) • iHS test (Voight et al., PLoS Biology 2006) • XP-EHH (Sabeti et al., Nature 2007) Constructing phylogenetic trees • PHYLIP • ClustalW • PAML • MEGA (Kumar et al., Briefings in Bioinformatics 2004)
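A rough sketch of the last step of a Ka/Ks estimate. It assumes the numbers of nonsynonymous and synonymous differences and sites have already been counted (the hard part, which tools such as PAML or KaKs_Calculator handle), and it applies the Jukes-Cantor correction as in Nei-Gojobori-style estimates. The function names and input counts are illustrative only.

```python
import math


def jc_distance(p):
    """Jukes-Cantor correction of a proportion of differing sites."""
    if p >= 0.75:
        raise ValueError("proportion too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (Ka, Ks, Ka/Ks) from pre-counted differences and sites."""
    ka = jc_distance(nonsyn_diffs / nonsyn_sites)
    ks = jc_distance(syn_diffs / syn_sites)
    return ka, ks, ka / ks  # raises ZeroDivisionError if Ks is 0


# Made-up counts for one pair of aligned coding sequences
ka, ks, omega = ka_ks(5, 700, 20, 300)
print(round(ka, 4), round(ks, 4), round(omega, 3))
```

As a rule of thumb, Ka/Ks well below 1 suggests purifying selection, a value near 1 suggests neutral evolution, and a value above 1 suggests positive selection.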
Evolution: an application • Recent positive selection • SLC24A5, SLC45A2: skin pigmentation, European population • LARGE, DMD: Lassa fever virus, African population • EDAR, EDA2R: development of hair, teeth and exocrine glands, Asian population (Sabeti et al., Nature 2007).
Alternative Splicing (AS) • Predicted from ESTs • Predicted from cDNA clones • Prediction of tissue-specific AS • Splicing graphs and EST assembly problem
Functional Domain • TF binding sites • TRANSFAC: a TF binding site database • TESS: a web-based program • Exons, introns, 5’UTR, 3’UTR • UCSC • Promoter • CorePromoter • Motif • Weeder • RNA family • Rfam • Protein domain • Pfam: database • InterPro: database • HMMER: a program based on HMM
Sequence mutations • Example: PIK3CA (Gymnopoulos et al., PNAS 2007) • Tools: SIFT & SAPRED • Conservation score? • Near functional sites? • Similarity score? • Surface? • ... (Huang et al., Science 2007)
Modeling structures • RNAfold • RNAstructure
Modeling structures • Homology modeling • ESyPred3D • SWISS-MODEL • Ab initio prediction • Rosetta • Single mutation modeling • Modeller • Visualization • PyMOL
Optimization algorithms: objective, constraints, solution • Objective: max (or min) Y = f(x) • Constraint: x >= 0 • Solution: find x = ?
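A minimal sketch of the objective/constraint/solution setup above, using projected gradient ascent. This technique is chosen here only for illustration and is not named on the slide. The two toy objectives and the function name `projected_gradient_ascent` are assumptions.

```python
def projected_gradient_ascent(grad, x0, step=0.1, iters=200):
    """Maximize a one-variable function subject to x >= 0."""
    x = max(0.0, x0)
    for _ in range(iters):
        # Take an ascent step, then project back onto the feasible set x >= 0
        x = max(0.0, x + step * grad(x))
    return x


# Toy objective f(x) = -(x - 2)^2: maximum at x = 2, constraint inactive
x_a = projected_gradient_ascent(lambda x: -2.0 * (x - 2.0), x0=5.0)

# Toy objective f(x) = -(x + 1)^2: unconstrained maximum at x = -1,
# so the constrained maximum sits on the boundary x = 0
x_b = projected_gradient_ascent(lambda x: -2.0 * (x + 1.0), x0=5.0)

print(x_a, x_b)
```

The second case shows why the constraint matters: the answer is the boundary point, not the unconstrained optimum.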
Deterministic optimization algorithms - intelligent optimization • Genetic algorithms • Simulated annealing
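A bare-bones simulated annealing loop, as a sketch of one of the stochastic methods named above. The cooling schedule, step size, bounds, and the multimodal toy objective are all assumptions picked for illustration.

```python
import math
import random


def simulated_annealing(f, x0, lower=0.0, upper=10.0, t0=1.0, cooling=0.995,
                        steps=5000, step_size=0.5, seed=0):
    """Minimize f on [lower, upper] with a basic simulated annealing loop."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    t = t0
    for _ in range(steps):
        # Propose a nearby point, clipped to the feasible interval
        cand = min(upper, max(lower, x + rng.uniform(-step_size, step_size)))
        fc = f(cand)
        # Always accept improvements; accept worse moves with
        # probability exp(-(increase in f) / temperature)
        if fc <= fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling  # geometric cooling
    return best_x, best_f


def objective(x):
    """Toy multimodal function with several local minima."""
    return (x - 3.0) ** 2 + 2.0 * math.sin(5.0 * x)


start = 9.0
best_x, best_f = simulated_annealing(objective, start)
print(best_x, best_f)
```

Accepting some worse moves while the temperature is high is what lets the search escape local minima that would trap a purely greedy method.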
Overall microarray workflow: Biological Question → Sample Preparation → Microarray Reaction → Microarray Detection → Data Analysis & Modelling → back to Biological Question (taken from Schena & Davis)
Microarray data matrix: rows are genes g1 ... gN, columns are samples (arrays) s1 ... sM; entry M(i,j) is the value for gene gi on array sj. Row i is the gene profile Gi; column j is the array profile Aj.
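Data preprocessing • Missing data • Causes • Contaminated image • Insufficient image resolution • Dust or scratches on the chip • Handling missing data • Discard the data (which also throws away useful information!) • Repeat the experiment (too expensive!) • Replace with a value, e.g. the sample mean • K-nearest neighbors estimation • Singular value decomposition (SVD) estimation • Normalization • Log transformation • Linear regression • Scaling + shifting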
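A simplified sketch of two of the preprocessing steps above: K-nearest-neighbors estimation of missing values and log transformation. Here missing entries are `None`, neighbors are rows (genes), and the estimate is a plain unweighted mean of the k nearest rows. Published KNN imputation methods differ in details such as weighting, so treat this as an assumption-laden illustration with made-up data.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries using the k most similar rows (genes)."""
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(matrix):
                if other_i == i or other[j] is None:
                    continue
                # Distance over columns observed in both rows
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:  # leave the gap if no usable neighbor exists
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


def log2_transform(matrix):
    """Log2-transform a matrix of positive intensities."""
    return [[math.log2(v) for v in row] for row in matrix]


# Made-up intensities: rows are genes, columns are arrays
data = [
    [1.0, 2.0, None],
    [1.0, 2.0, 4.0],
    [1.2, 2.1, 8.0],
    [8.0, 9.0, 16.0],
]
filled = knn_impute(data, k=2)
logged = log2_transform([[1.0, 2.0, 4.0]])
print(filled[0], logged)
```

The missing value in the first gene is estimated from the two genes whose observed values are closest to it, rather than from the whole-sample mean.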
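Pattern classification of microarray data • X → F(X) → Y • Training samples → Preprocessing → Feature extraction → Machine learning → Decision • New sample → Classifier → Decision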
Linear classifier (figure): in the (x1, x2) plane, the line L: c1x1 + c2x2 - c = 0 separates group G1 from group G2.
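A small sketch of learning such a separating line c1x1 + c2x2 - c = 0 with the classic perceptron rule. The slide does not say how the line is obtained, so the choice of the perceptron is an assumption. The toy points and function names are made up.

```python
def train_perceptron(points, labels, epochs=100, lr=1.0):
    """Learn (c1, c2, c) so that c1*x1 + c2*x2 - c > 0 for label +1."""
    c1 = c2 = c = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), y in zip(points, labels):  # y is +1 (G1) or -1 (G2)
            score = c1 * x1 + c2 * x2 - c
            if y * score <= 0:  # misclassified (or on the line): update
                c1 += lr * y * x1
                c2 += lr * y * x2
                c -= lr * y
                errors += 1
        if errors == 0:
            break  # every training point is on the correct side
    return c1, c2, c


def classify(params, point):
    c1, c2, c = params
    return "G1" if c1 * point[0] + c2 * point[1] - c > 0 else "G2"


# Toy, linearly separable data
points = [(2, 3), (3, 3), (3, 4), (0, 0), (1, 0), (0, 1)]
labels = [1, 1, 1, -1, -1, -1]
params = train_perceptron(points, labels)
print(params, classify(params, (4, 4)), classify(params, (0.5, 0.5)))
```

The perceptron only converges when the two groups are linearly separable, which is rarely guaranteed for real expression data.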
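Pattern classification algorithms • Linear classifier • Neural network • Nearest neighbor • Bayesian classifier • Hidden Markov model classifier • Decision tree • Support vector machine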
Pattern clustering of microarray data • Hierarchical clustering • K-means clustering • Fuzzy C-means clustering • Self-organizing map • Replicator dynamics (Cui, 2004)
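A compact sketch of K-means clustering on expression profiles. The initialization (the first k profiles) is deliberately naive and is an assumption; practical implementations use random restarts or smarter seeding. The profiles are made up.

```python
def kmeans(points, k, iters=100):
    """Cluster points into k groups; returns (assignment, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    assignment = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Made-up expression profiles (one row per gene, three arrays each)
profiles = [
    [1.0, 1.1, 0.9],
    [5.0, 5.2, 4.9],
    [1.2, 0.9, 1.0],
    [4.8, 5.1, 5.0],
]
cluster_ids, centers = kmeans(profiles, k=2)
print(cluster_ids)
```

K-means needs k chosen in advance and can land in different solutions from different starting points, which is one reason the other methods on the list exist.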
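Gene expression feature extraction • Differentially expressed genes • Gene set or pathway • PCA • SVD • ISOMAP • MDS • Which features distinguish men from women? • Hair length? • Skin smoothness? • Voice? • Height? • Strength? • Clothing? • Posture? • XX/XY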
Characterizing gene relationships • Static relationship • Pearson's correlation • Spearman's correlation • Mutual information • Other similarity metric • Dynamic relationship • Dynamic regression (Cui, 2005) • Window-based correlation
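A sketch of three of the measures listed above: Pearson's correlation, Spearman's correlation (Pearson on ranks), and a sliding-window correlation as one simple reading of "window-based correlation". The dynamic regression method cited on the slide is not reproduced here. The window definition and data are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            result[idx] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def windowed_correlation(x, y, window):
    """Pearson correlation in each sliding window of the given length."""
    return [pearson(x[s:s + window], y[s:s + window])
            for s in range(len(x) - window + 1)]


# Made-up profiles: monotonic but non-linear relationship
gene_x = [1, 2, 3, 4, 5]
gene_y = [1, 4, 9, 16, 25]
r = pearson(gene_x, gene_y)
rho = spearman(gene_x, gene_y)

# A relationship that flips sign over time
w = windowed_correlation([1, 2, 3, 4, 5, 6], [1, 2, 3, 3, 2, 1], window=3)
print(r, rho, w)
```

Spearman reports a perfect monotonic relationship where Pearson does not, and the windowed values move from +1 to -1, a change that a single global correlation would hide.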
Gene expression networks • Pearson's correlation • Hard threshold • Weighted • Mutual information • Bayesian network
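A minimal sketch of a hard-threshold co-expression network: two genes are linked when the absolute Pearson correlation of their profiles reaches a cutoff. The cutoff of 0.9, the gene names, and the profiles are illustrative assumptions. Keeping the correlation as the edge value gives a simple weighted variant.

```python
import math
from itertools import combinations


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_edges(expr, threshold=0.9):
    """expr maps gene name -> profile; returns {(gene1, gene2): r}."""
    edges = {}
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:  # hard threshold on |r|
            edges[(g1, g2)] = r  # keep r as an edge weight
    return edges


# Made-up expression profiles over four arrays
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 3.0],
}
edges = coexpression_edges(expr)
for pair, r in sorted(edges.items()):
    print(pair, round(r, 3))
```

geneD correlates only moderately with the others, so it ends up isolated at this cutoff; lowering the threshold would connect it.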
What is Systems Biology? • Not a new concept! • Systems biology is an emergent field that aims at system-level understanding of biological systems (Kitano 2002). • To understand biology at the system level, we must examine the structure and dynamics of cellular and organismal function, rather than the characteristics of isolated parts of a cell or organism.
Why Systems Biology? (Figure: a small signed network with nodes A, B, C, D, E and edges labeled +, - or 0; image: http://www.newvisions.ucsb.edu/background/images/elephant.gif)
Why Computational Systems Biology? • Golden opportunity, now! ★ Large-scale, high-throughput data ★ More than 16 international meetings in 2006 ★ More than 10 books in the past two years ★ Journals: Molecular Systems Biology (Nature & EMBO), BMC Systems Biology, IET Systems Biology, EURASIP Journal on Bioinformatics and Systems Biology, etc.
Fields of Computational Systems Biology? • Biological networks construction, such as gene regulatory networks, cellular signaling networks, metabolic networks, protein-protein interaction networks, genetic interaction networks, gene co-expression networks, literature networks.
Fields of Computational Systems Biology? • Properties of systems, such as topology, robustness, tolerance. Albert et al., Nature 2000
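A toy sketch of two of the properties above: topology, as node degree, and tolerance, as the size of the largest connected component after a node is removed. This is in the spirit of the error-and-attack-tolerance analysis of Albert et al., not a reproduction of it. The six-node network is made up.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Undirected graph as {node: set(neighbors)}."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def largest_component_size(graph, removed=frozenset()):
    """Size of the largest connected component, ignoring removed nodes."""
    seen, best = set(removed), 0
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search from start
            node = queue.popleft()
            size += 1
            for nb in graph[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best


# Made-up hub-dominated network
graph = build_graph([("hub", "a"), ("hub", "b"), ("hub", "c"),
                     ("hub", "d"), ("d", "e")])
degree = {node: len(nbs) for node, nbs in graph.items()}

intact = largest_component_size(graph)
after_leaf_loss = largest_component_size(graph, removed={"a"})
after_hub_loss = largest_component_size(graph, removed={"hub"})
print(degree, intact, after_leaf_loss, after_hub_loss)
```

Losing a peripheral node barely changes connectivity, while losing the hub fragments the network, which is the qualitative contrast between random failure and targeted attack.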
Fields of Computational Systems Biology? • Biological questions at the systems level, such as diseases, evolution, medicine, etc. (Figure labels: Ras region, TGFβ region, p53 region) Goh et al., PNAS 2007; Cui et al., MSB 2007
An application: microRNA-disease systems biology (Figure: a network with nodes labeled M1-M4 and D1-D3)
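My Suggestions • First, read through the relevant references once and record the relevant data. • Second, browse these slides, or consult a bioinformatics specialist, to see whether any of your questions can be solved with bioinformatics alone. • Third, check whether the data in the papers you read lend themselves to bioinformatics analysis, e.g. meta-analysis or systems biology. • Fourth, new knowledge, bioinformatics included, is not difficult; you will see this for yourself once you complete a project on your own!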
Work that needs experimental validation • The functions of miR-423 and miR-608, which are under recent positive selection • SLC24A5, SLC45A2: skin pigmentation, European population • LARGE, DMD: Lassa fever virus, African population • EDAR, EDA2R: development of hair, teeth and exocrine glands, Asian population (Sabeti et al., Nature 2007). • Experimental validation of a potential liver-disease-related microRNA: miR-149 • SNP: rs2292832; CEU and YRI: 80% C, 20% U; CHB and JPT: 20% C, 80% U. • The host gene is GPC1 (Glypican 1, a heparan sulfate proteoglycan), which is overexpressed in pancreatic cancer; another member of this gene family (GPC3) is a liver cancer marker. • GPC1 is a receptor for heparin-binding growth factors • Not expressed in liver / expressed in liver • Targets HEV and HGV • Free energy: C: -54.9; U: -52.7
Work that needs experimental validation • Cardiovascular • miR-1 • miR-133 • miR-199a • miR-21 • miR-23a • miR-23b • miR-208 • Liver (miR-122) • Kidney • Brain • Lung • ...
Qinghua Cui: 15801250611, 82801585 Email: cuiqinghua@bjmu.edu.cn The best tailor by your side. Thank you all; comments are welcome.