Chapter 1 Introduction
Introduction – Gene History
• 1865, Mendel: the basic unit of inheritance is the gene.
• Mendel's work was largely forgotten until the early 1900s.
• 1944: the gene was shown to be made of DNA (deoxyribonucleic acid).
• 1953, James Watson and Francis Crick: the double-helix structure of DNA.
Introduction – Gene History (Cont.)
• 1990: The Human Genome Project started.
• 1995: The first free-living organism was sequenced: Haemophilus influenzae.
• 1998: Celera joined the genome research effort.
• 2000: The draft of the human DNA sequence was completed (published in 2001).
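Bioinformatics - Related Domestic Projects (Taiwan)
• 2000: National Science Council (NSC) interdisciplinary research on "Bioinformatics"
• 2001: NSC national research programs
  - National Research Program for Genomic Medicine
• 2001: NSC interdisciplinary project research
  - Engineering division: information technology
  - Life sciences division: bioinformatics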
Animal Cell (nucleus, cytoplasm, cell membrane)
• DNA is located in the cell nucleus, which also contains the nucleolus.
Nucleotide
• A nucleotide is the building block of a nucleic acid molecule.
• A nucleotide contains:
  - a five-carbon sugar (deoxyribose in DNA)
  - a phosphate group
  - one nitrogenous base (A, G, C, T, or U)
Figure label: cytosine (C)
DNA and RNA
• Nucleotide bases: adenine (A), guanine (G), cytosine (C), thymine (T), uracil (U)
• DNA (deoxyribonucleic acid): {A, G, C, T} (base pairs: G≡C, A=T)
• RNA (ribonucleic acid): {A, G, C, U} (base pairs: G≡C, A=U, G-U)
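The DNA pairing rule on this slide (G with C, A with T) can be sketched in a few lines of Python. The function names `complement`, `reverse_complement`, and `transcribe` are illustrative choices, not part of the slides, and the sketch assumes an uppercase string over A, C, G, T.

```python
# Watson-Crick pairing for DNA: G pairs with C, A pairs with T.
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(DNA_PAIR[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the complement written in reverse order (the opposite strand)."""
    return complement(strand)[::-1]


def transcribe(coding_strand: str) -> str:
    """Write the RNA that has the same sequence as a DNA strand (T becomes U)."""
    return coding_strand.replace("T", "U")


if __name__ == "__main__":
    print(complement("TGCACT"))
    print(reverse_complement("TGCACT"))
    print(transcribe("ATGGCC"))
```

A dictionary lookup keeps the pairing rule in one place; a character outside A, C, G, T raises `KeyError` rather than being silently passed through.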
DNA Length
• The total length of human DNA is about 3 × 10^9 (3 billion) base pairs.
• Only 1% to 1.5% of the DNA sequence is useful.
• Number of human genes: 30,000 to 40,000
  - This is a conclusion of the Human Genome Project.
  - The number originally expected was 100,000.
DNA Sequencing
• Given DNA sequence: TGCACTTGAACGCATGCT
• Cut the sequence at a randomly chosen A; each fragment below starts at that A:
  ATGCT             length = 5
  ACGCATGCT         length = 9
  AACGCATGCT        length = 10
  ACTTGAACGCATGCT   length = 15
DNA Sequencing
• Electrophoresis
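The fragment list on this slide can be reproduced with a short sketch. It uses the 18-base sequence TGCACTTGAACGCATGCT, which is the one consistent with the four listed fragments, and simply collects the suffix that starts at each A. The function name `fragments_from_base` is hypothetical, and this models only the bookkeeping on the slide, not the laboratory procedure.

```python
def fragments_from_base(sequence: str, base: str = "A") -> list[str]:
    """Return every suffix of `sequence` that starts at an occurrence of `base`,
    shortest first, mirroring the fragment list on the slide."""
    suffixes = [sequence[i:] for i, ch in enumerate(sequence) if ch == base]
    return sorted(suffixes, key=len)


if __name__ == "__main__":
    for fragment in fragments_from_base("TGCACTTGAACGCATGCT"):
        print(fragment, "length =", len(fragment))
```

Sorting the fragments by length is what electrophoresis does physically: the lengths 5, 9, 10, and 15 reveal the positions of A in the sequence.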
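Amino Acids
• Amino acids are the basic units of proteins; there are 20 kinds.
General Structure of an Amino Acid
• 3 groups:
  - amino group
  - carboxyl group
  - R group
Amino Acids and RNA
• Every three nucleotides (a codon, the unit of the genetic code) correspond to one amino acid.
• AUG, which codes for methionine, is also the "start" codon.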
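The codon-to-amino-acid mapping can be sketched as a lookup table. The example below builds the standard genetic code from a 64-letter string in U, C, A, G order (one-letter amino acid codes, `*` for a stop codon) and translates an mRNA from the first AUG. The function name `translate` and the choice to stop at the first stop codon are assumptions made for illustration.

```python
BASES = "UCAG"
# Standard genetic code, listed in UCAG order for the first, second, and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string, starting at the first AUG and stopping
    at the first stop codon (or when fewer than three bases remain)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("GGAUGGCCUAAGG"))
```

In the usage line, translation starts at AUG (M, methionine), reads GCC (A, alanine), and stops at UAA.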
DNA, Genes and Proteins
• DNA: program for cell processes
• Proteins: execute cell processes
Figure labels: DNA (TCCAACGGTGCTGAGGTGCAC), gene, protein
Regulation of Genes
Figure (by Blanchette): transcription factor (protein), RNA polymerase (protein), DNA, regulatory element, gene
Regulation of Genes
Figure (by Blanchette): transcription factor (protein), RNA polymerase, DNA, regulatory element, gene
Regulation of Genes
Figure (by Blanchette): new protein, RNA polymerase, transcription factor, DNA, regulatory element, gene
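Primary Structure of Protein
Figure: the amino acid sequence of bovine insulin (a protein)
Tertiary Structure of Protein
Figure: tertiary structure of the hemoglobin molecule
Quaternary Structure of Protein
Figure: quaternary structure of the hemoglobin molecule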
Some Problems in Bioinformatics
• Sequence comparison
  - Longest common subsequence
  - Edit distance
  - Similarity
  - Multiple sequence alignment
• Fragment assembly of DNA sequences
  - Shortest common superstring
• Physical mapping
  - Double digest problem
  - Consecutive ones problem
• Evolutionary trees
• Molecular structure prediction
  - Protein folding
Sequence Comparison
• Goals:
  - Database search: given a sequence S and a set of sequences G, find all the sequences in G that are similar to S.
  - Similarity: find which parts of the sequences are alike and which parts differ.
    - Sequence alignment (global alignment)
    - Local alignment
Sequence Alignment
• Global alignment
• Local alignment
Longest Common Subsequence (1)
• Find a longest common subsequence of two strings.
  string1: TAGTCACG
  string2: AGACTGTC
  LCS:     AGACG
• Method: dynamic programming
Longest Common Subsequence (2)
Figure: dynamic programming table for S1 = TAGTCACG and S2 = AGACTGTC; LCS: AGACG
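The dynamic programming table for this example can be sketched with the textbook LCS recurrence: a cell extends the diagonal neighbor by 1 when the two characters match, and otherwise takes the larger of the cell above and the cell to the left. This is a minimal sketch, not necessarily the exact formulation on the original slide, and the function name `lcs` is an illustrative choice. When several longest common subsequences exist, the traceback returns one of them, which need not be the string shown on the slide.

```python
def lcs(s1: str, s2: str) -> str:
    """Return one longest common subsequence of s1 and s2."""
    n, m = len(s1), len(s2)
    # length[i][j] = length of an LCS of s1[:i] and s2[:j]
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])

    # Trace back from the bottom-right corner to recover the subsequence.
    chars = []
    i, j = n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            chars.append(s1[i - 1])
            i -= 1
            j -= 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))


if __name__ == "__main__":
    print(lcs("TAGTCACG", "AGACTGTC"))
```

The table has (n + 1) × (m + 1) cells, so time and space are both proportional to the product of the two string lengths.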
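Edit Distance (1)
• Find a shortest sequence of edit operations that transforms one string into the other.
  S1: TAGTCACG
  S2: AGACTGTC
  Operations: DMMDDMMIMII (D = delete, M = match, I = insert)
Edit Distance (2)
Figure: dynamic programming table for S1 = TAGTCACG and S2 = AGACTGTC; operations: DMMDDMMIMII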
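The operation string in this example uses only three letters, which suggests an edit distance where the allowed edits are deletion (D) and insertion (I), each of cost 1, and a match (M) is free. The sketch below assumes exactly that cost model (no substitution operation); the function name `edit_script` is hypothetical. It may return a different script from the one on the slide, but one with the same number of edits.

```python
def edit_script(s1: str, s2: str) -> str:
    """Return a shortest D/M/I script that turns s1 into s2.

    D deletes a character of s1, I inserts a character of s2,
    M keeps a character that is equal in both strings.
    """
    n, m = len(s1), len(s2)
    # cost[i][j] = fewest insertions/deletions turning s1[:i] into s2[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j], cost[i][j - 1]) + 1
            if s1[i - 1] == s2[j - 1]:
                best = min(best, cost[i - 1][j - 1])
            cost[i][j] = best

    # Trace back from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] and cost[i][j] == cost[i - 1][j - 1]:
            ops.append("M")
            i -= 1
            j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return "".join(reversed(ops))


if __name__ == "__main__":
    script = edit_script("TAGTCACG", "AGACTGTC")
    print(script, "edits =", script.count("D") + script.count("I"))
```

Under this cost model the number of edits equals len(S1) + len(S2) minus twice the LCS length, which ties this slide to the previous one.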
Similarity
• Two sequences s1 and s2, where ai is the i-th character of s1 and bj is the j-th character of s2.
• p is the match value if ai = bj; otherwise it is the mismatch value.
• g is the gap penalty.
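One common way to turn p and g into a similarity score is the global alignment recurrence: each cell takes the best of a diagonal step (adding p) and a step from above or from the left (adding g). The sketch below assumes that recurrence and assumes illustrative values (match +1, mismatch -1, gap -1, with the gap penalty expressed as a negative score); none of these numbers come from the slides, and the function name `similarity` is hypothetical.

```python
def similarity(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Return the best global alignment score of a and b.

    `match` and `mismatch` play the role of p, `gap` the role of g.
    The default values are assumptions chosen for illustration only.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + p,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # a[i-1] against a gap
                score[i][j - 1] + gap,    # b[j-1] against a gap
            )
    return score[n][m]


if __name__ == "__main__":
    print(similarity("TAGTCACG", "AGACTGTC"))
```

Changing the three parameters changes which alignment is considered best, so any reported score is meaningful only together with the values of p and g used.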
Sequence Alignment
a = TAGTCACG
b = AGACTGTC
Alignment 1:
  ----TAGTCACG
  AGACT-GTC---
Alignment 2:
  TAGTCAC-G--
  -AG--ACTGTC
• Which one is better?
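One way to compare the two candidate alignments is to count, column by column, how many columns are matches, mismatches, or gaps. The helper below is a minimal sketch; the name `column_counts` is hypothetical, and which alignment is "better" still depends on the scoring scheme applied to these counts.

```python
def column_counts(top: str, bottom: str) -> tuple[int, int, int]:
    """Count (matches, mismatches, gap columns) in a pairwise alignment.

    Both rows must have the same length; '-' marks a gap.
    """
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


if __name__ == "__main__":
    print(column_counts("----TAGTCACG", "AGACT-GTC---"))
    print(column_counts("TAGTCAC-G--", "-AG--ACTGTC"))
```

The first alignment has 4 matches and 8 gap columns, the second 5 matches and 6 gap columns, so the second is preferable under any scheme that rewards matches and penalizes gaps.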
Sequence Alignment Formula
$c_{0,0} = 0, \quad c_{i,0} = i, \quad c_{0,j} = j$
$c_{i,j}$ is defined by two cases: $a_i \neq b_j$ and $a_i = b_j$.
Sequence Alignment Example
  TAGTCAC-G--
  -AG--ACTGTC
Multiple Sequence Alignment
s1 = ATTCGAT
s2 = TTGAG
s3 = ATGCT
Alignment:
  s1 = ATTCGAT
  s2 = -TT-GAG
  s3 = AT--GCT
• If the number of sequences k is large, how can the problem be solved?
• It is an NP-complete problem.
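Multiple Sequence Alignment - SP
• Sum-of-pairs score = the sum, over every pair of sequences in the alignment, of the score of the pairwise alignment that the multiple alignment induces: $\sum_{1 \le i < j \le k} \mathrm{score}(s_i, s_j)$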
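A sum-of-pairs score adds up the pairwise alignment scores of every pair of rows in a multiple alignment. The sketch below assumes a simple column scoring (match +1, mismatch -1, a character against a gap -1, two gaps 0); these values and the names `pair_score` and `sum_of_pairs` are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations


def pair_score(x: str, y: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Score one column of a pairwise alignment (assumed scoring values)."""
    if x == "-" and y == "-":
        return 0  # a column of two gaps contributes nothing
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(rows: list[str]) -> int:
    """Sum the pairwise column scores over every pair of aligned rows."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows of the alignment must have equal length")
    return sum(
        pair_score(x, y)
        for first, second in combinations(rows, 2)
        for x, y in zip(first, second)
    )


if __name__ == "__main__":
    print(sum_of_pairs(["ATTCGAT", "-TT-GAG", "AT--GCT"]))
```

With these assumed values, the three-sequence alignment from the previous slide scores 1 for (s1, s2), 1 for (s1, s3), and -2 for (s2, s3), for a total of 0.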