600 likes | 827 Views
生物信息学基础讲座. 第 1 讲 生物信息学研究 http://cbb.sjtu.edu.cn/~mywu/bi217. 什么是生物信息学 ?. Concepts of Bioinformatics. Biology + computer?. (Molecular) Bio - informatics
E N D
生物信息学基础讲座 第1讲 生物信息学研究 http://cbb.sjtu.edu.cn/~mywu/bi217
什么是生物信息学? Concepts of Bioinformatics
Biology + computer? • (Molecular)Bio - informatics • Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (applied math, CS, and statistics) to understand and organize the informationassociated with these molecules, on alarge-scale. • Bioinformatics is a practical discipline with many applications.
生物分子信息 生物信息的载体 分子 存贮、复制、传递和表达 遗传信息的系统 细胞
两种信息载体 • 核酸分子 • 蛋白质分子
三种信息 与功能相关的结构信息 遗传信息 进化信息
实验数据信息知识 收集表示分析建模 刻画特征比较推理 应 用 基因工程 蛋白质设计 疾病诊断 疾病治疗 开发新药
交叉学科 • 以生物医学数据和过程为研究对象 • 采用数学、统计学方法和手段 • 借助现代计算机技术和方法 • 解读生物学数据 • 建立数学模型 • 对生物过程进行预测和控制
研究性学科中的应用学科 • 该学科目前还处于初步阶段 • 大部分还停留在实验室的水平 • 工业应用仍有许多不足
研究思路与方法 General Approaches to Bioinformatics
一般研究策略 • 提出生物学问题 • 数学描述与建模:前提、假设、问题的核心分别是什么 • 计算机分析:有时需要采取数值方法进行求解,有时需要采取蒙特卡洛的方法 • 分析结果的生物学意义
生物信息学的研究内容 Working Around Central Dogma
Genomes • Integrative Data 1995, HI (bacteria): 1.6 Mb & 1600 genes done 1997, yeast: 13 Mb & ~6000 genes for yeast 1998, worm: ~100Mb with 19 K genes 1999: >30 completed genomes! 2003, human: 3 Gb & 100 K genes...
1995 • Bacteria, 1.6 Mb, ~1600 genes [Science269: 496] 1997 • Eukaryote, 13 Mb, ~6K genes [Nature 387: 1] 1998 • Animal, ~100 Mb, ~20K genes [Science282: 1945] 2000? • Human, ~3 Gb, ~100K genes [???]
DNA • Raw DNA Sequence • Coding or Not? • Parse into genes? • 4 bases: AGCT • ~1 K in a gene, ~2 M in genome • ~3 Gb Human atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca gcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatac atggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtg aaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatcca gcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattc ttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaact ggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgca ggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgt gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact gcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgca tcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacct gcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgtt gttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatc aaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacact gaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgca gacgctggtatcgcattaactgattctttcgttaaattggtatc . . . . . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaa caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtgg cgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgctt gctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgg gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact acaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaacc aatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtc ggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaa aaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
Protein Sequence • 20 letter alphabet • ACDEFGHIKLMNPQRSTVWY but not BJOUXZ • Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain • >1M known protein sequences (uniprot) d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF d1dhfa_LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESI d3dfr__ TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF d1dhfa_ VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP d8dfr__ VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP d4dfra_ ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKA d3dfr__ ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV d1dhfa_ -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP d8dfr__ -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP d4dfra_ -G---RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-----.IMVIGGGRVYEQFLPKA d3dfr__ -P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV
研究对象 • Basic building blocks • Genome/DNA • Genes, proteins • Cis-regulatory elements • Basic mechanisms • Transcription • Splicing • Translation (step 1~3 = gene expression) • Post-translational modification / protein folding • Transcriptional regulatory mechanisms and other regulatory mechanisms • Alternative splicing • microRNAs • DNA/protein modification
研究内容 • Sequencing • similarity search: find similar sequences in databases • global search • local search • Multiple alignment • Motif discovery • Gene finding:determining gene and gene structure • Phylogenetic analysis • Comparative studies
生物大分子结构分析研究 Extending Studies to 3-Dimensions
Hierarchy of Protein Structure Tertiary
Protein 3D Structure 3D Structure of Protein Turn or coil Alpha-helix Beta-sheet Loop and Turn
结构分析Structural Analysis • 二级结构预测secondary structure prediction • 三级结构预测 tertiary structure prediction • 分子对接molecular docking • 分子动力学模拟 molecular dynamics (MD) simulation
表达数据分析 Expression studies
Microarray data analysis Low-expression gene High-expression gene Highly expressed in control, but not treatment cells Highly expressed in treatment, but not control cells
Finding Genes in Genomic DNA Introns/exons/promotors Characterizing Repeats in Genomic DNA Statistics Patterns Duplications in the Genome Large scale genomic alignment Whole-Genome Comparisons Finding Structural RNAs Genome Sequence
Sequence Alignment non-exact string matching, gaps Dynamic Programming Local vs Global Alignment Suboptimal Alignment Hashing to increase speed (BLAST, FASTA) Amino acid substitution scoring matrices Multiple Alignment and Consensus Multiple alignment Transitive Comparisons HMMs, Profiles Motifs Scoring schemes and Matching statistics How to tell if a given alignment or match is statistically significant A P-value (or an e-value)? Score Distributions(extreme val. dist.) Low Complexity Sequences Evolutionary Issues Rates of mutation and change Protein Sequence
Secondary Structure “Prediction” via Propensities Neural Networks, Genetic Alg. Simple Statistics TM-helix finding Assessing Secondary Structure Prediction Structure Prediction: Protein v RNA Tertiary Structure Prediction Fold Recognition Threading Ab initio (Quaternary structure prediction) Direct Function Prediction Active site identification Relation of Sequence Similarity to Structural Similarity Sequence Structure
Structure Comparison Basic Protein Geometry and Least-Squares Fitting Distances, Angles, Axes, Rotations Calculating a helix axis in 3D via fitting a line LSQ fit of 2 structures Molecular Graphics Calculation of Volume and Surface How to represent a plane How to represent a solid How to calculate an area Hinge prediction Packing Measurement Structural Alignment Aligning sequences on the basis of 3D structure. DP does not converge, unlike sequences, what to do? Other Approaches: Distance Matrices, Hashing Fold Library Docking and Drug Design as Surface Matching Structures
Relational Database Concepts Keys, Foreign Keys SQL, OODBMS, views, forms, transactions, reports, indexes Joining Tables, Normalization Natural Join as "where" selection on cross product Array Referencing (perl/dbm) Forms and Reports Cross-tabulation DB interoperation What are the Units ? What are the units of biological information for organization? sequence, structure motifs, modules, domains How classified: folds, motions, pathways, functions? Clustering and Trees Basic clustering UPGMA single-linkage multiple linkage Other Methods Parsimony, Maximum likelihood Evolutionary implications Visualization of Large Amounts of Information The Bias Problem sequence weighting sampling DBs/Surveys
Mining • Information integration and fusion • Dealing with heterogeneous data • Dimensionality Reduction (PCA etc)
Expression Analysis Time Courses clustering Measuring differences Identifying Regulatory Regions Large scale cross referencing of information Function Classification and Orthologs The Genomic vs. Single-molecule Perspective Genome Comparisons Ortholog Families, pathways Large-scale censuses Frequent Words Analysis Genome Annotation Identification of interacting proteins Networks Global structure and local motifs Structural Genomics Folds in Genomes, shared & common folds Bulk Structure Prediction Genome Trees (Func) Genomics
Molecular Simulation Geometry -> Energy -> Forces Basic interactions, potential energy functions Electrostatics VDW Forces Bonds as Springs How structure changes over time? How to measure the change in a vector (gradient) Molecular Dynamics & MC Energy Minimization Parameter Sets Number Density Simplifications Poisson-Boltzman Equation Lattice Models and Simplification Simulation
Application 1Novel Drug design • Understanding How Structures Bind to Small Molecules (Function) • Designing Inhibitors • Docking, Structure Modeling
常用计算机技术 • 数据库技术:建库与搜索 Database • 统计 statistics • 数据挖掘技术data mining • 人工智能Artificial Intelligence • 模式识别 pattern recognition • 机器学习machine learning