Building Protein Families Based on Clustering

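Building Protein Families Based on Clustering (Final Project Proposal 2010) Team leader: Xu Kun (许坤) Team members: Gao Chenxi (高晨曦), Cao Tianjiao (曹天骄), Han Rui (韩蕊) Presenters: Cao Tianjiao, Han Rui Contact: xukun@icst.pku.edu.cn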

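Candidate proposals • Multilingual dictionary • Multimedia recommendation system • Protein alignment analysis and processing system • Earthquake/weather forecasting system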

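Feasibility analysis • Evaluation criteria • How relevant the main techniques required by the proposal are to the course • Whether the size and source of the required dataset meet the course project requirements • Problems likely to arise when implementing the proposal • Existing ...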

Selected proposal • Protein alignment analysis and processing system

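Introduction • The total number of protein types in the world is hard to estimate; a single cell alone contains thousands of proteins that differ in structure, function, and molecular mass.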

What is a protein? Components of organisms: • Enzymes (metabolism) • Transport (O2, membranes ...) • Movement (muscles) • Antibodies (immunity) • Brain ... • Protection (horns, skin ...)

Amino acids • 20 standard amino acids (primary structure) • Protein structure and function (tertiary structure) • a carboxyl group • an amino group • side chains, or R groups

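Homologous proteins • Protein sequences can elucidate the history of life on earth. • The study of molecular evolution generally focuses on families of closely related proteins. • The members of protein families are called homologous proteins, or homologs. • Homologous proteins can occur within a single species or across species. • How closely two proteins are related reflects how closely the corresponding species are related in evolution. • A protein's amino acid sequence contains all the information needed to judge this relationship, so aligning amino acid sequences can yield a phylogenetic tree of the species.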

Protein amino acid sequence database (about 80 GB) (downloaded from UniProt) ftp://ftp.ncbi.nlm.nih.gov (National Center for Biotechnology Information, USA)

Expectation • Align protein amino acid sequences to obtain pairwise similarity, and from that identify highly homologous proteins • Ultimately, build protein families
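To make the "alignment gives a similarity" step concrete, here is a minimal sketch. It assumes a unit-cost edit distance (Levenshtein) as a simplified stand-in for a real alignment score; the slides do not say which scoring scheme is intended, and a substitution matrix with gap penalties would normally replace the unit costs. The function names `edit_distance` and `similarity` are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost global alignment distance (Levenshtein) between two sequences."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # residue of a aligned to a gap
                            curr[j - 1] + 1,      # residue of b aligned to a gap
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to [0, 1]; 1.0 means identical sequences."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # Toy sequences, not real database entries.
    print(similarity("MKTAYIAK", "MKTAHIAK"))
```

Pairs whose similarity exceeds a chosen threshold would then be the candidates for membership in the same family.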

Preliminary approach 1. Input and output: Input: protein amino acid sequences Key/value: protein name / amino acid sequence Output: highly homologous protein sequences 2. Method: clustering
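The key/value input described above can be sketched as follows, assuming the sequences arrive as FASTA text (the format UniProt distributes) and that the first whitespace-delimited token of each header line serves as the protein name. The function name `parse_fasta` and the toy records are hypothetical.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Turn FASTA lines into {protein name: amino acid sequence}."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].split()[0]  # assumption: first token is the key
            chunks = []
        elif name is not None:
            chunks.append(line)         # sequences may span several lines
    if name is not None:
        records[name] = "".join(chunks)
    return records


if __name__ == "__main__":
    toy = [">protA toy record", "MKTAY", "IAK", ">protB toy record", "MKTAHIAK"]
    print(parse_fasta(toy))
```

In a MapReduce setting each entry of the returned dictionary would correspond to one key/value pair handed to a mapper.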

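3. Abstract model: (1) Building the coordinate system: · Dimensions: the number of amino acids in the longest protein sequence is taken as the number of dimensions, spanning a space; each axis has 20 discrete tick values (the numeric values assigned to the individual amino acids); · Coordinates: a formula based on the parameters of each amino acid determines the numeric value assigned to it;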

(2) Locating the points in space: each protein is mapped to a point in the space according to its amino acid sequence. (3) Distance formula between two points (alignment):
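The abstract model in steps (1) to (3) could be sketched as below. The slides name neither the per-residue formula nor the distance formula, so this sketch makes three labeled assumptions: the Kyte-Doolittle hydropathy index stands in for the per-residue value, shorter sequences are padded with 0.0 up to the length of the longest one, and the distance is Euclidean. The names `embed` and `distance` are illustrative.

```python
import math
from typing import List

# Assumption: Kyte-Doolittle hydropathy index as the value of each of the
# 20 standard amino acids. The slides do not specify the actual formula.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def embed(seq: str, dim: int) -> List[float]:
    """Map a sequence to a point with one coordinate per residue position."""
    if len(seq) > dim:
        raise ValueError("dim must be at least the sequence length")
    # Non-standard residue codes (for example X) raise KeyError here.
    point = [HYDROPATHY[aa] for aa in seq]
    # Assumption: pad missing positions with 0.0.
    return point + [0.0] * (dim - len(seq))


def distance(u: List[float], v: List[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


if __name__ == "__main__":
    seqs = ["MKTAYIAK", "MKTAHIAK", "GGSG"]  # toy sequences
    dim = max(len(s) for s in seqs)
    points = [embed(s, dim) for s in seqs]
    print(distance(points[0], points[1]), distance(points[0], points[2]))
```

Because coordinates are compared position by position, a single insertion or deletion shifts every later coordinate, which is one reason an alignment-based distance may be preferable in step (3).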

Reference model

Set-Similarity Join • partition the data across nodes • balance the workload • minimize the need for replication • self-join and R-S join cases • control the amount of data kept in main memory on each node. even if we use the most fine-grained partitioning, the data ... • experiments on UniProt datasets, synthetically increased in size, to evaluate the speedup and scale-up properties of the proposed algorithms using Hadoop.
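As a single-machine illustration of what a set-similarity self-join computes, the sketch below tokenizes each sequence into overlapping k-mers and keeps pairs whose Jaccard similarity reaches a threshold. The k-mer tokenization, the Jaccard measure, the threshold value, and all names here are assumptions for illustration; this is not the partitioned Hadoop algorithm the slide refers to, only the result it would be expected to produce on small data.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Set of overlapping substrings of length k (the tokens of a sequence)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def self_join(records: Dict[str, str], threshold: float,
              k: int = 3) -> List[Tuple[str, str, float]]:
    """Return (name1, name2, similarity) for pairs at or above the threshold."""
    token_sets = {name: kmers(seq, k) for name, seq in records.items()}
    # Inverted index: only sequences sharing a token can have similarity > 0,
    # so candidate pairs are generated per token instead of for all pairs.
    index = defaultdict(set)
    for name, tokens in token_sets.items():
        for token in tokens:
            index[token].add(name)
    candidates = set()
    for names in index.values():
        candidates.update(combinations(sorted(names), 2))
    result = []
    for a, b in sorted(candidates):
        score = jaccard(token_sets[a], token_sets[b])
        if score >= threshold:
            result.append((a, b, score))
    return result


if __name__ == "__main__":
    toy = {"p1": "MKTAYIAK", "p2": "MKTAYIAR", "p3": "GGGGSGGGG"}
    print(self_join(toy, threshold=0.7))
```

In a distributed version, the inverted-index step is roughly where the data would be partitioned across nodes by token, which is where the workload-balancing and replication concerns listed above arise.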

Clustering
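The slide gives no detail on the clustering step. One minimal possibility, assuming that pairs of similar proteins have already been found and that a family is any group connected through such pairs (single-linkage grouping), can be sketched with a union-find structure. The function name `build_families` and the toy identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def build_families(names: List[str],
                   similar_pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group proteins into families: connected components of the pair graph."""
    parent: Dict[str, str] = {n: n for n in names}

    def find(x: str) -> str:
        # Follow parent links to the representative, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    groups: Dict[str, List[str]] = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    names = ["p1", "p2", "p3", "p4"]
    pairs = [("p1", "p2"), ("p2", "p3")]
    print(build_families(names, pairs))
```

Single linkage can chain distantly related proteins into one family through intermediates, so a stricter criterion may be needed depending on how families are to be evaluated.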

Evaluation of results

References • Nelson, D. L., and Cox, M. M. (2005) Lehninger Principles of Biochemistry, fourth edition, Worth Publishers.
