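Integrated Genome and Proteome Databases. Chih-Chin Liu (劉志俊), Department of Computer Science and Information Engineering, Chung Hua University, July 2008. Outline: bioinformatics from a database perspective; the four major data types in bioinformatics; biological database design and UML; an integrated biological database, UniBio; pig / native chicken genome databases; proteome databases. When biology meets informatics: biology (molecular genetics, molecular biology, biochemistry, cell biology, protein science, immunology) and informatics (programming languages, data structures, algorithms, databases, parallel processing, data mining) combine into bioinformatics.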
Integrated Genome and Proteome Databases. Chih-Chin Liu (劉志俊), Department of Computer Science and Information Engineering, Chung Hua University. July 2008
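Outline
• Bioinformatics: a database perspective
• The four major data types in bioinformatics
• Biological database design and UML
• An integrated biological database: UniBio
• Pig / native chicken genome databases
• Proteome databases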
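When Biology Meets Informatics
• Biology: molecular genetics, molecular biology, biochemistry, cell biology, protein science, immunology
• Informatics: programming languages, data structures, algorithms, databases, parallel processing, data mining
• Biology + Informatics → Bioinformatics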
Genome, Transcriptome, Proteome, Metabolome
• Genome: The complete set of genetic material (DNA) of an organism.
• Transcriptome: The complement of expressed genes found in a particular cell or tissue.
• Proteome: The complement of proteins found in a particular cell or tissue.
• Metabolome: The assembly of substrates, metabolites, and other small molecules present in a population of cells.
More "-omes"
• Structurome (∑ structures)
• SNPome (∑ SNPs)
• Literaturome (∑ literature)
• Transductome (∑ signal transductions)
• Pathwayome (∑ reaction pathways)
• Diseasome (∑ genetic diseases)
-ome → database
Research Issues in Biological Databases
• Data Modeling
  • How to store and represent biological data
• Data Retrieval
  • How to retrieve similar biological objects
• Data Mining
  • How to find the rules behind biological data
• Simulation
  • Pathway Simulation, Virtual Cell, Virtual Life
New Data Types in Bio-Databases
• Large Strings
  • DNA Sequences, Protein Sequences
• Biological Images
  • 2D Gels, Microarray Images
• 3D Structures
  • Proteins, Compounds
• Network
  • Pathways
New Data Types in Bio-Databases • Large Strings: DNA Sequences. The complete sequence of human chromosome 1, 245,564,334 bp long, is the longest sequence record in GenBank.
New Data Types in Bio-Databases • Large Strings: Protein Sequences. PIR: I38344 (titin, cardiac muscle [validated] - human) is the longest protein sequence in the PIR database, at 26,926 amino acids.
New Data Types in Bio-Databases • Images: Microarray (Stanford Microarray Database)
New Data Types in Bio-Databases • Images: 1D-Gel, 2D-Gel
New Data Types in Bio-Databases • 3D Structures: Chemical Compound
New Data Types in Bio-Databases • 3D Structures
New Data Types in Bio-Databases • 3D Structures
ATOM      1  N   VAL     1      -4.004  15.224  13.636  1.00 32.64   N
ANISOU    1  N   VAL     1     4512   3449   4441   -335  -2675    320   N
ATOM      2  CA  VAL     1      -3.526  15.758  14.900  1.00 18.42   C
ANISOU    2  CA  VAL     1     1478   2233   3289   -286   -467    555   C
ATOM      3  C   VAL     1      -2.662  14.733  15.628  1.00 17.06   C
ANISOU    3  C   VAL     1     1603   1981   2899   -152   -466    234   C
ATOM      4  O   VAL     1      -3.053  13.569  15.714  1.00 18.61   O
ANISOU    4  O   VAL     1     1758   2150   3163   -489   -394    501   O
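The ATOM and ANISOU records above can be loaded into a simple in-memory structure before being stored in a database. The following is only a minimal sketch. It assumes the whitespace-separated layout printed on the slide (no chain identifier). Real PDB files are fixed-column, so production code should slice by column position or use a dedicated library. The names `Atom` and `parse_atoms` are hypothetical and do not come from the UniBio project.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    res_seq: int
    xyz: Tuple[float, float, float]
    occupancy: float
    b_factor: float
    element: str
    # Anisotropic temperature factors (U11, U22, U33, U12, U13, U23), if present.
    anisou: Optional[Tuple[int, ...]] = None


def parse_atoms(text: str) -> List[Atom]:
    """Parse whitespace-separated ATOM/ANISOU lines (assumed layout, no chain ID)."""
    atoms: Dict[int, Atom] = {}
    for line in text.splitlines():
        f = line.split()
        if not f:
            continue
        if f[0] == "ATOM" and len(f) == 11:
            atoms[int(f[1])] = Atom(
                serial=int(f[1]), name=f[2], res_name=f[3], res_seq=int(f[4]),
                xyz=(float(f[5]), float(f[6]), float(f[7])),
                occupancy=float(f[8]), b_factor=float(f[9]), element=f[10],
            )
        elif f[0] == "ANISOU" and len(f) == 12 and int(f[1]) in atoms:
            # Attach the six anisotropic values to the atom with the same serial.
            atoms[int(f[1])].anisou = tuple(int(v) for v in f[5:11])
    return [atoms[k] for k in sorted(atoms)]


SAMPLE = """\
ATOM 1 N VAL 1 -4.004 15.224 13.636 1.00 32.64 N
ANISOU 1 N VAL 1 4512 3449 4441 -335 -2675 320 N
ATOM 2 CA VAL 1 -3.526 15.758 14.900 1.00 18.42 C
ANISOU 2 CA VAL 1 1478 2233 3289 -286 -467 555 C
"""

atoms = parse_atoms(SAMPLE)
for atom in atoms:
    print(atom.serial, atom.name, atom.xyz, atom.anisou)
```

Keeping the ANISOU values on the same object as the coordinates mirrors how the two record types describe one atom, which would map naturally onto a single table row.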
New Data Types in Bio-Databases • Network: Pathways
Database Design
• Conceptual Database Design
  • Class Diagram (ER Model, UML Class Diagram)
  • Entities (Classes), Relationships, Attributes
• Logical Database Design
  • Relational Schema
  • Normalization, ER to Relational Data Model Mapping
• Physical Database Design
  • Implementation (e.g. Oracle, MySQL, SQL Server)
  • Indexes and Storage Methods
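The three design stages can be illustrated on a toy example. The sketch below is not the UniBio schema. The `gene` and `protein` tables, their columns, and the index name are hypothetical, and Python's built-in `sqlite3` module stands in for MySQL so that the example is self-contained. Conceptually there are two entities with a one-to-many relationship ("a gene encodes proteins"). The logical step maps them to two tables joined by a foreign key, and the physical step adds an index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Logical design: each entity becomes a table; the one-to-many
# relationship "gene encodes protein" becomes a foreign key.
conn.executescript("""
CREATE TABLE gene (
    gene_id   INTEGER PRIMARY KEY,
    symbol    TEXT NOT NULL UNIQUE,
    organism  TEXT NOT NULL
);
CREATE TABLE protein (
    protein_id INTEGER PRIMARY KEY,
    gene_id    INTEGER NOT NULL REFERENCES gene(gene_id),
    name       TEXT NOT NULL,
    sequence   TEXT NOT NULL
);
-- Physical design: an index to speed up lookups of proteins by gene.
CREATE INDEX idx_protein_gene ON protein(gene_id);
""")

conn.execute("INSERT INTO gene VALUES (1, 'TTN', 'Homo sapiens')")
# The sequence value is a short placeholder, not the real titin sequence.
conn.execute("INSERT INTO protein VALUES (1, 1, 'titin', 'ACDEFGHIK')")

rows = conn.execute(
    "SELECT g.symbol, p.name FROM gene g JOIN protein p ON p.gene_id = g.gene_id"
).fetchall()
print(rows)
```

Storing the sequence in its own column keeps the large-string data type separate from the descriptive attributes, which is one way to address the "Large Strings" concern raised earlier.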
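The UniBio Project
• Completeness
  • Collect all downloadable biology-related databases
• Integration
  • All data are cross-referenced, so that logically they form a single database
• Chinese localization
  • Provide corresponding Chinese-language data wherever possible to lower the learning barrier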
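The UniBio Project [workflow diagram]
• Steps: download bioinformatics data in its original format from bioinformatics websites; convert the data format; design the biological database; build the biological database
• Tools shown: UML, Perl, MySQL, phpMyAdmin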
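The UniBio Project: Development Environment
• RedHat Linux 9.0 (free, stable, high performance)
• MySQL (free, the fastest-running database)
• Apache (free, stable, powerful, high performance)
• Perl (free, the main programming language in bioinformatics, concise code, cross-platform)
• PHP (free, many built-in functions, easy to write, cross-platform)
• C/C++ (free, long-established, powerful)
• Java (free, can be displayed on the Web, cross-platform)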
Genome Data Management [diagram]
• External databases: GenBank, EMBL, DDBJ, RefSeq, TIGR TGI, UniGene
• Workflow: Sampling → Cloning → Sequencing → BLASTing → Submitting
• Data stores: Sample Database, Clone Database, cDNA Database, BLAST Report Database, GenBank Submission Files
Functional Genome Data Management [diagram]
• Inputs: cDNA Database, KEGG, Enzyme
• Workflow: Gene Expression → Gene Expression Profile → in silico Simulation → in situ Verification → in vivo Testing → ? ? ? → New Drug ($$$)
• Data stores: MicroArray Database, Profile Database, Simulation Result Database, Verification Report Database
Pig / Native Chicken Genome Database: dbEST Submission
TYPE: EST
STATUS: New
CONT_NAME: Wen-Chuan Lee
CITATION: Porcine testis EST project
LIBRARY: Porcine testis cDNA library I
EST#: PDUts1001A02
CLONE: PDUts1001A02
SOURCE: Division of Biotechnology, Animal Technology Institute Taiwan
...
SEQ_PRIMER: T7 promoter primer
HIQUAL_START: 1
HIQUAL_STOP: 306
DNA_TYPE: cDNA
PUBLIC: 12/31/2005
SEQUENCE:
CTCAACCATTGATGGAGCATATTTCTCTATTTTTAGTAGATCTAGAAAAAAATAGTATGA
AGTTAGATATCCTAAGAAGAGCAATTACCGCTATTTCATTATATTTTGCTTAAAAAAAAA
CAAGATTATTTTAATGGATATATCAAATCCTCGTGCACGATGTACAAAAATTAAAGCACG
TCTGGGGCCACAAAGCACATCTCGATGAACTCTGAATAGATAGTACCAAGCAATTAGGTT
ATAAATTAATACTTTACAAGAGAATTTAGAAAATTTCATAGTTGCCCAGTGTAAGCTACC
TTTCTA
||
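A submission record like the one above is a list of `TAG: value` fields, followed by a multi-line `SEQUENCE:` block and a closing `||` marker. The sketch below reads one such record into a dictionary, for example before loading it into a cDNA table. It is an illustration under assumptions. The function name `parse_dbest_record` is hypothetical, multi-line values other than the sequence are not handled, and the sample is an abbreviated excerpt of the slide's record with the sequence shortened to its first and last lines.

```python
from typing import Dict, List


def parse_dbest_record(text: str) -> Dict[str, str]:
    """Parse one 'TAG: value' record; SEQUENCE spans lines until '||'."""
    record: Dict[str, str] = {}
    seq_parts: List[str] = []
    in_sequence = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "...":
            continue
        if line == "||":  # end-of-record marker
            break
        if in_sequence:
            seq_parts.append(line)
        elif line.startswith("SEQUENCE:"):
            in_sequence = True
            rest = line[len("SEQUENCE:"):].strip()
            if rest:
                seq_parts.append(rest)
        elif ":" in line:
            tag, value = line.split(":", 1)
            record[tag.strip()] = value.strip()
    record["SEQUENCE"] = "".join(seq_parts)
    return record


# Abbreviated excerpt: only some fields, and only the first and last sequence lines.
SAMPLE = """\
TYPE: EST
STATUS: New
EST#: PDUts1001A02
CLONE: PDUts1001A02
DNA_TYPE: cDNA
PUBLIC: 12/31/2005
SEQUENCE:
CTCAACCATTGATGGAGCATATTTCTCTATTTTTAGTAGATCTAGAAAAAAATAGTATGA
TTTCTA
||
"""

record = parse_dbest_record(SAMPLE)
print(record["EST#"], record["DNA_TYPE"], len(record["SEQUENCE"]))
```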
Integrated Proteomic Database [diagram]
• Source databases: MassSpec, SWISS-2DPAGE, Siena-2DPAGE, ATIT-2DPAGE, PMMA-2DPAGE, Plasma-2DPAGE, UniProt, RESID, Dali/FSSP, PIR, MIPS/JIPID, Pfam, SWISS-PROT, PDB, CATH, PROSITE, SCOP, PRINTS, BioCyc, KEGG, ENZYME, BRENDA, BLOCKS, WIT, EMOTIF, LIGAND
2D Gel Electrophoresis
• Separation by charge (pI)
• Separation by molecular weight (MW)
• Molecular weight markers
Exploring Diseases
• Detect the spots that changed.
• Identify which proteins they are by PMF (Peptide Mass Fingerprinting).
• These proteins could be candidates for drug screening.
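Peptide mass fingerprinting compares the peptide masses measured for a spot with the masses predicted by digesting candidate protein sequences in silico. The sketch below shows the idea in its simplest form and is not Mascot's scoring method. It assumes trypsin digestion (cleavage after K or R unless the next residue is P), no missed cleavages, no modifications, and observed values already converted to neutral monoisotopic masses. The function names, the toy sequences, and the 0.2 Da tolerance are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(sequence: str) -> List[str]:
    """Cleave after K or R, except before P (no missed cleavages).

    Splitting on a zero-width pattern requires Python 3.7 or later.
    """
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed: List[float], sequence: str, tol: float = 0.2) -> int:
    """Count observed masses that match a theoretical peptide within tol Da."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed)


def rank_candidates(observed: List[float], database: Dict[str, str],
                    tol: float = 0.2) -> List[Tuple[str, int]]:
    """Rank candidate proteins by the number of matched peptide masses."""
    scores = {name: count_matches(observed, seq, tol) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: two made-up sequences and the peptide masses of the first one.
database = {"protA": "GASKPVTRCLK", "protB": "MHFRYWK"}
observed = [peptide_mass(p) for p in tryptic_peptides(database["protA"])]
ranking = rank_candidates(observed, database)
print(ranking)
```

Counting matches is the crudest possible score. It favours long proteins, which is one reason real search engines use probability-based scoring instead.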
2D-PAGE Spot Examples: 2D123456_1.out
"SSP"  "MR"        "PI"       "TA20040301PH4~7"
""     ""          ""         "quantity"
0105   14.000000   0.940249   17718.58
0304   20.000000   0.100000   3015.93
0409   27.025288   2.881626   4703.69
0410   28.200542   3.015601   7963.92
0411   26.410089   3.035875   5168.19
0510   30.000000   0.100000   568.17
0610   45.000000   -1.000000  256.19
0708   70.379211   4.008969   12372.92
0709   60.177605   4.017597   60490.97
0710   71.341202   4.018401   20098.13
0711   68.146568   4.018714   25632.64
0712   57.148594   4.023514   73912.91
0713   66.000000   -1.000000  940.28
0902   116.400002  4.000000   160499.94
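The "detect the spots that changed" step from the Exploring Diseases slide can be run on spot tables like this one. The sketch below parses two such tables and flags spots whose quantity differs by at least a chosen fold change. It assumes quoted header lines followed by four whitespace-separated columns. The control rows are taken from the example above, while the treated-gel values, the 2-fold threshold, and the names `Spot`, `parse_spots` and `changed_spots` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Spot(NamedTuple):
    ssp: str         # spot number, kept as text to preserve leading zeros
    mr: float        # "MR" column
    pi: float        # "PI" column
    quantity: float  # spot quantity on this gel


def parse_spots(text: str) -> Dict[str, Spot]:
    """Parse a spot table, skipping blank lines and quoted header lines."""
    spots: Dict[str, Spot] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or fields[0].startswith('"'):
            continue
        spots[fields[0]] = Spot(fields[0], float(fields[1]),
                                float(fields[2]), float(fields[3]))
    return spots


def changed_spots(control: Dict[str, Spot], treated: Dict[str, Spot],
                  min_fold: float = 2.0) -> List[str]:
    """Return spot numbers whose quantity changed by at least min_fold."""
    changed = []
    for ssp in sorted(control.keys() & treated.keys()):
        a, b = control[ssp].quantity, treated[ssp].quantity
        if a > 0 and b > 0 and max(a / b, b / a) >= min_fold:
            changed.append(ssp)
    return changed


CONTROL = '''\
"SSP" "MR" "PI" "TA20040301PH4~7"
"" "" "" "quantity"
0105 14.000000 0.940249 17718.58
0304 20.000000 0.100000 3015.93
0409 27.025288 2.881626 4703.69
'''

# Hypothetical second gel, invented for illustration only.
TREATED = '''\
0105 14.000000 0.940249 17000.00
0304 20.000000 0.100000 9100.00
0409 27.025288 2.881626 4600.00
'''

control = parse_spots(CONTROL)
changed = changed_spots(control, parse_spots(TREATED))
print(changed)
```

Matching spots by SSP number assumes both gels were aligned to the same reference map. Spots present on only one gel are ignored here and would need separate handling.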
MassSpec Database
• Samples
• MassSpec Analysis Results (.pkl)
• Mascot Configuration
• Mascot Query
• Mascot Result (.dat)
• Mascot Protein Reports
• Mascot Peptide Reports
Flowchart: *.txt, *.pkl → Mascot Search (PMF) → *.dat → Mascot Parser → MassSpec Database
Proteome Data Management [diagram]
• Entities: Sample, 2D-PAGE, Spot, Mass Spectrum, Protein/Peptide Report
• Files: *.tiff, *.out, *.pkl, *.dat
• Operations: key-in, upload, upload/parsing
• Data stores: Gel Database, MassSpec Database