Chapter 1

Chapter 1 Introduction

Introduction – Gene History
• 1865, Mendel: the basic unit of inheritance is the gene.
• Mendel's work was forgotten until 1900.
• 1944: genes were shown to be made of DNA (deoxyribonucleic acid).
• 1953, James Watson and Francis Crick: the double-helix structure of DNA.

Introduction – Gene History (Cont.)
• 1990: The Human Genome Project started.
• 1995: The first free-living organism was sequenced: Haemophilus influenzae.
• 1998: Celera joined the genome sequencing effort.
• 2000: The draft of the human DNA sequence was completed (published in 2001).

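Bioinformatics - Related Programs in Taiwan
• 2000: National Science Council (NSC) interdisciplinary research on "Bioinformatics"
• 2001: NSC National Research Programs
  - National Research Program for Genomic Medicine
• 2001: NSC interdisciplinary research projects
  - Engineering division: information technology
  - Life sciences division: bioinformatics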

Animal Cell (nucleus, cytoplasm, cell membrane)
• DNA is located inside the cell nucleus.

DNA Double Helix

Bonds Between Nucleotides in DNA

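Nucleotides
• A nucleotide is the building block of a nucleic acid molecule.
• A nucleotide contains:
  - a five-carbon sugar (deoxyribose in DNA)
  - a phosphate group
  - one nitrogenous base (A, G, C, T, or U)
• Figure: cytosine (C)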

The Four Nitrogenous Bases of DNA

DNA Double Helix

DNA Sequence

DNA and RNA
• Bases of the nucleotides: adenine (A), guanine (G), cytosine (C), thymine (T), uracil (U)
• DNA (deoxyribonucleic acid): {A, G, C, T} (base pairs: G≡C, A=T)
• RNA (ribonucleic acid): {A, G, C, U} (base pairs: G≡C, A=U, G-U)
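The DNA pairing rules (A with T, G with C) and the change of alphabet from DNA to RNA (T becomes U) can be sketched in a few lines of Python. This is a minimal illustration; the names `DNA_COMPLEMENT`, `reverse_complement`, and `transcribe` are assumptions made for this example and do not come from the slides.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna))


def transcribe(dna: str) -> str:
    """Rewrite a DNA coding strand in the RNA alphabet (T becomes U)."""
    return dna.replace("T", "U")


if __name__ == "__main__":
    print(reverse_complement("ATGC"))  # GCAT
    print(transcribe("ATGC"))          # AUGC
```

Applying `reverse_complement` twice returns the original strand, which is a quick sanity check on the pairing table.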

DNA Length
• The total length of human DNA is about 3 × 10^9 (3 billion) base pairs.
• Only about 1% to 1.5% of the DNA sequence codes for proteins.
• Number of human genes: 30,000 to 40,000
  - This is a conclusion of the Human Genome Project.
  - The number originally expected was about 100,000.

DNA Sequencing
• Given DNA sequence: TGCACTTGAACGCATGCT
• Cut the sequence in front of a randomly chosen A; the possible fragments are:
  ATGCT (length = 5)
  ACGCATGCT (length = 9)
  AACGCATGCT (length = 10)
  ACTTGAACGCATGCT (length = 15)
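The fragment list can be reproduced with a short Python sketch. It assumes the 18-base sequence TGCACTTGAACGCATGCT, the one from which the four listed fragments (lengths 5, 9, 10, 15) arise; the function name `fragments_starting_with` is hypothetical and chosen only for this example.

```python
def fragments_starting_with(sequence: str, base: str = "A") -> list[str]:
    """Return every suffix of `sequence` that begins with `base`, shortest first."""
    suffixes = [sequence[i:] for i, ch in enumerate(sequence) if ch == base]
    return sorted(suffixes, key=len)


if __name__ == "__main__":
    for fragment in fragments_starting_with("TGCACTTGAACGCATGCT"):
        print(fragment, "length =", len(fragment))
```

Sorting by length mirrors what separation by fragment size provides: in a sequence of n bases, a fragment of length L that starts with A places an A at position n - L + 1.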

DNA Sequencing
• Electrophoresis

DNA Sequencing

Amino Acids
• Amino acids are the basic units of proteins; there are 20 kinds.

General Structure of an Amino Acid
• Three groups:
  - Amino group
  - Carboxyl group
  - R group

Amino Acid Molecules

Protein Molecules

Amino Acids and RNA
• Every three nucleotides (a codon, the unit of the genetic code) correspond to one amino acid.
• AUG, which codes for methionine, is also the "start" codon.
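The codon-to-amino-acid mapping can be illustrated with a small Python sketch. It builds the standard genetic code from a compact 64-letter string (one-letter amino acid codes in UCAG order, with `*` marking stop codons) and translates from the first AUG to the first stop codon. The names `CODON_TABLE` and `translate`, and the rule of scanning for the first AUG, are simplifying assumptions for this example; the slides do not specify them.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order for each codon position; "*" is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to, but not including, the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGGCCUAA"))  # MA
```

In the usage line, AUG gives methionine (M), GCC gives alanine (A), and UAA is a stop codon.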

From DNA via RNA to Protein

DNA, Genes and Proteins
• DNA: the program for cell processes
• Proteins: execute cell processes
• Figure labels: DNA (TCCAACGGTGCTGAGGTGCAC), gene, protein

Promoter and Gene

Regulation of Genes
• Figure labels: transcription factor (protein), RNA polymerase (protein), DNA, regulatory element, gene
• Figure by Blanchette

Regulation of Genes
• Figure labels: transcription factor (protein), RNA polymerase, DNA, regulatory element, gene
• Figure by Blanchette

Regulation of Genes
• Figure labels: new protein, RNA polymerase, transcription factor, DNA, regulatory element, gene
• Figure by Blanchette

From DNA via RNA to Protein

From RNA to Protein

Primary Structure of Protein
• Amino acid sequence of bovine insulin (a protein)

Secondary Structure of Protein

Tertiary Structure of Protein
• Tertiary structure of the hemoglobin molecule

Quaternary Structure of Protein
• Quaternary structure of the hemoglobin molecule

Problems on Different Levels

Some Problems in Bioinformatics
• Sequence comparison
  - Longest common subsequence
  - Edit distance
  - Similarity
  - Multiple sequence alignment
• Fragment assembly of DNA sequences
  - Shortest common superstring
• Physical mapping
  - Double digest problem
  - Consecutive ones problem
• Evolutionary trees
• Molecular structure prediction
  - Protein folding

Sequence Comparison
• Goals:
  - Database search: given a sequence S and a set of sequences G, find all the sequences in G that are similar to S.
  - Similarity: find which parts of the sequences are alike and which parts differ.
    - Sequence alignment (global alignment)
    - Local alignment

Sequence Alignment
• Global alignment
• Local alignment

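Longest Common Subsequence (1)
• Find a longest common subsequence (LCS) of two strings.
  string1: TAGTCACG
  string2: AGACTGTC
  LCS: AGACG
• Dynamic programming: let $L_{i,j}$ be the LCS length of the first $i$ characters of string1 ($a$) and the first $j$ characters of string2 ($b$).
$$L_{i,0} = L_{0,j} = 0$$
$$L_{i,j} = \begin{cases} L_{i-1,j-1} + 1 & \text{if } a_i = b_j \\ \max\{L_{i-1,j},\ L_{i,j-1}\} & \text{if } a_i \neq b_j \end{cases}$$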

Longest Common Subsequence (2)
• Figure: dynamic-programming table for S1 = TAGTCACG and S2 = AGACTGTC
• LCS: AGACG
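The dynamic-programming approach to the LCS problem can be sketched in Python as follows. This is a minimal textbook-style version, not necessarily the exact procedure drawn in the slide's table; the function name `lcs` and the tie-breaking rule in the traceback are assumptions for this example, so the string it returns may be a different LCS of the same length than the one shown on the slide.

```python
def lcs(s1: str, s2: str) -> str:
    """Return one longest common subsequence of s1 and s2."""
    m, n = len(s1), len(s2)
    # length[i][j] = LCS length of s1[:i] and s2[:j]
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])

    # Trace back from the bottom-right corner to recover one LCS.
    result = []
    i, j = m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            result.append(s1[i - 1])
            i -= 1
            j -= 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(result))


if __name__ == "__main__":
    answer = lcs("TAGTCACG", "AGACTGTC")
    print(answer, len(answer))  # an LCS of length 5
```

Filling the table takes time and space proportional to the product of the two string lengths.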

Edit Distance (1)
• Find a shortest sequence of edit operations that transforms one string into the other.
  S1: TAGTCACG
  S2: AGACTGTC
  Operations: DMMDDMMIMII (D = delete, M = match, I = insert)

Edit Distance (2)
• Figure: dynamic-programming table for S1 = TAGTCACG and S2 = AGACTGTC
• Operations: DMMDDMMIMII
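An edit script in the D/M/I notation can be computed with a short Python sketch. It assumes that only deletions and insertions have a cost, as the notation suggests; substitutions, which the more common Levenshtein distance also allows, are not used. The function name `edit_script` is hypothetical, and when several shortest scripts exist the one returned may differ from the script shown on the slide.

```python
def edit_script(s1: str, s2: str) -> str:
    """Return a shortest script of M (match), D (delete from s1), I (insert from s2)."""
    m, n = len(s1), len(s2)
    # dist[i][j] = fewest insertions and deletions that turn s1[i:] into s2[j:]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m:
                dist[i][j] = n - j      # insert the rest of s2
            elif j == n:
                dist[i][j] = m - i      # delete the rest of s1
            elif s1[i] == s2[j]:
                dist[i][j] = dist[i + 1][j + 1]
            else:
                dist[i][j] = 1 + min(dist[i + 1][j], dist[i][j + 1])

    # Walk forward through the table, always taking a cheapest step.
    ops = []
    i = j = 0
    while i < m or j < n:
        if i < m and j < n and s1[i] == s2[j]:
            ops.append("M")
            i += 1
            j += 1
        elif j == n or (i < m and dist[i + 1][j] <= dist[i][j + 1]):
            ops.append("D")
            i += 1
        else:
            ops.append("I")
            j += 1
    return "".join(ops)


if __name__ == "__main__":
    script = edit_script("TAGTCACG", "AGACTGTC")
    print(script, script.count("D") + script.count("I"))  # 6 insertions and deletions
```

For the two example strings, any shortest script has 5 matches, 3 deletions, and 3 insertions, which agrees with the LCS length of 5: the number of insertions and deletions is 8 + 8 - 2 × 5 = 6.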

Similarity
• Two sequences s1 = a_1 a_2 ... a_m and s2 = b_1 b_2 ... b_n.
• p(a_i, b_j) is the match value if a_i = b_j; otherwise it is the mismatch value.
• g is the gap penalty.

Sequence Alignment
  a = TAGTCACG
  b = AGACTGTC
• Alignment 1:
  ----TAGTCACG
  AGACT-GTC---
• Alignment 2:
  TAGTCAC-G--
  -AG--ACTGTC
• Which one is better?

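Sequence Alignment Formula
$$c_{0,0} = 0, \qquad c_{i,0} = -i\,g, \qquad c_{0,j} = -j\,g$$
$$c_{i,j} = \max \left\{ \begin{array}{l} c_{i-1,j} - g \\ c_{i,j-1} - g \\ c_{i-1,j-1} + p(a_i, b_j) \end{array} \right.$$
where $p(a_i, b_j)$ is the mismatch value if $a_i \neq b_j$ and the match value if $a_i = b_j$, and $g$ is the gap penalty.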

Sequence Alignment Example
  TAGTCAC-G--
  -AG--ACTGTC
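Scoring a given alignment and computing the best global alignment score can both be sketched in Python. The numeric values below (match +2, mismatch -1, gap penalty 1) are assumptions chosen for illustration and are not taken from the slides; the function names are also hypothetical. The table `c` follows the usual global-alignment recurrence: each cell takes the maximum of a gap in either sequence and a match or mismatch on the diagonal.

```python
MATCH, MISMATCH, GAP = 2, -1, 1  # assumed values, not taken from the slides


def p(x: str, y: str) -> int:
    """Match value if the two symbols are equal, mismatch value otherwise."""
    return MATCH if x == y else MISMATCH


def alignment_score(row_a: str, row_b: str) -> int:
    """Score an already gapped alignment column by column ('-' marks a gap)."""
    score = 0
    for x, y in zip(row_a, row_b):
        score += -GAP if "-" in (x, y) else p(x, y)
    return score


def best_global_score(a: str, b: str) -> int:
    """Fill the table c[i][j] and return the optimal global alignment score."""
    m, n = len(a), len(b)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        c[i][0] = -i * GAP
    for j in range(1, n + 1):
        c[0][j] = -j * GAP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c[i][j] = max(
                c[i - 1][j] - GAP,                        # a[i-1] aligned with a gap
                c[i][j - 1] - GAP,                        # b[j-1] aligned with a gap
                c[i - 1][j - 1] + p(a[i - 1], b[j - 1]),  # match or mismatch
            )
    return c[m][n]


if __name__ == "__main__":
    print(alignment_score("----TAGTCACG", "AGACT-GTC---"))  # 0
    print(alignment_score("TAGTCAC-G--", "-AG--ACTGTC"))    # 4
    print(best_global_score("TAGTCACG", "AGACTGTC"))
```

Under these assumed values, the alignment with four leading gaps scores 0 (4 matches, 8 gaps) and the alignment TAGTCAC-G-- / -AG--ACTGTC scores 4 (5 matches, 6 gaps), so the latter is the better of the two. Different scoring values can change which alignment is preferred.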

Multiple Sequence Alignment
  s1 = ATTCGAT
  s2 = TTGAG
  s3 = ATGCT
• Alignment:
  s1 = ATTCGAT
  s2 = -TT-GAG
  s3 = AT--GCT
• If the number of sequences, k, is large, how can the problem be solved?
• It is an NP-complete problem.

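Multiple Sequence Alignment - SP
• Sum-of-pairs score = the sum of the pairwise alignment scores over all pairs of rows in the multiple alignment:
$$\mathrm{SP} = \sum_{1 \le i < j \le k} \mathrm{score}(s_i, s_j)$$
where $\mathrm{score}(s_i, s_j)$ is the score of the pairwise alignment formed by rows $i$ and $j$.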
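The sum-of-pairs score of a multiple alignment can be computed with a short Python sketch. The column scores (match +2, mismatch -1, gap penalty 1, and 0 for a gap aligned with a gap) are assumptions for illustration, since the slides do not give numeric values; the names `column_score` and `sum_of_pairs` are hypothetical.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 2, -1, 1  # assumed values, not taken from the slides


def column_score(x: str, y: str) -> int:
    """Score one column of a pairwise alignment taken from the multiple alignment."""
    if x == "-" and y == "-":
        return 0  # assumption: a gap aligned with a gap costs nothing
    if x == "-" or y == "-":
        return -GAP
    return MATCH if x == y else MISMATCH


def sum_of_pairs(rows: list[str]) -> int:
    """Add the pairwise alignment scores over every pair of rows."""
    return sum(
        column_score(x, y)
        for row_1, row_2 in combinations(rows, 2)
        for x, y in zip(row_1, row_2)
    )


if __name__ == "__main__":
    print(sum_of_pairs(["ATTCGAT", "-TT-GAG", "AT--GCT"]))  # 10
```

With these assumed values, the three-row alignment of s1, s2, and s3 scores 5 for (s1, s2), 5 for (s1, s3), and 0 for (s2, s3), giving 10 in total.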
