
Language Theory and Bioinformatics



Presentation Transcript


  1. Language Theory and Bioinformatics. Bailin Hao, T-Life Research Center, Fudan University; Institute of Theoretical Physics, Academia Sinica. http://www.itp.ac.cn/~hao/

  2. Statistical Analysis of DNA Sequences. A necessary first step in any analysis: • Frequency of appearance of strings • Correlations of letters and strings • 1D and 2D DNA walks vs. random walk. Summary in two lines, according to Luo Liao-fu: 1. DNA sequences are not random. 2. Their characteristics are close to randomness.
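The first and third bullets can be illustrated with a minimal Python sketch. The walk rule used here (+1 for a purine, -1 for a pyrimidine) is one common convention and is an assumption, not something the slide specifies; the function names and the toy sequence are made up for illustration.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping string of length k in seq."""
    seq = seq.lower()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def dna_walk(seq):
    """1D DNA walk. Assumed rule: +1 for a purine (a, g), -1 for a pyrimidine (c, t).

    The input is assumed to contain only the letters a, c, g, t.
    """
    position, path = 0, [0]
    for base in seq.lower():
        position += 1 if base in "ag" else -1
        path.append(position)
    return path

if __name__ == "__main__":
    toy = "atggcgtataatgcc"  # made-up toy sequence, not real data
    print(kmer_counts(toy, 2).most_common(3))
    print(dna_walk(toy))
```

Comparing such counts and walks against those of a shuffled copy of the same sequence is the usual way to see how far a sequence departs from randomness.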

  3. Hint: Statistical methods alone are not powerful enough to amplify the difference between DNA and random sequences, or the differences among DNA sequences themselves. New “deterministic” approaches are needed.

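  4. Beyond probabilistic and statistical methods. Probability and statistics are basic skills: frequencies and correlations; Markov chains and hidden Markov chains; neural network models; Bayesian statistics and “prior” distributions. Is a random sequence a good frame of reference? Sufficiently long symbolic sequences have unavoidable “regularities”. Are genome sequences long enough?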

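  5. Random motion with definite outcomes. Causality and teleology. Stochastic differential equations determined by the final-value distribution. Beyond Langevin: other formulations of stochastic differential equations. Molecular motors, motion along the cytoskeleton. Linguistic methods: syntax and semantics. The semantic problem and the genetic “dictionary”. Gnomics: A DNA Dictionary (1986). At present: >5000 transcription factor binding sites, >300 restriction endonuclease recognition sites, various repeated sequences, satellites, microsatellites.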

  6. Language Metaphor in Biology: Transcription, Translation, Editing, Modification.

  7. Words. As landmarks, e.g., recognition sites for: restriction endonucleases (REBASE), methylases (REBASE), transcription factors (TRANSFAC). As components of “sentences”: promoters (EPD), enhancers, silencers, insulators, terminators, splicing sites.

  8. Sentences: enhancer - silencer - enhancer - …; promoter - (exon - intron)^k - exon - terminator. Essays/Articles: genes, “junk”, … Encyclopedia: the complete genome of a species. Reference Library: kingdom Monera, …, kingdom Animalia.

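  9. Natural language and genetic language. Differences: punctuation and spacing are different; interaction between the two languages; two- and three-dimensional interactions; the number and role of repeated sequences. Similarities: ambiguity; redundancy; fault tolerance and error correction; long-range correlations; both are based on discrete combinatorial systems; both have some grammar but cannot be completely generated by it; dialects and individual differences; evolution, mutation, extinction; historical “junk”, archaisms, “fossils”; loanwords, horizontal exchange.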

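  10. Linguistic (language, not philology) methods. Statistical linguistics: frequencies and correlations of “words”, Zipf's law. Algebraic linguistics: generative grammar and grammatical complexity. Sequential generation: the Chomsky hierarchy. Parallel generation: Lindenmayer systems (originating in developmental biology). Factorizable languages.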

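  11. Fuzzy linguistics. The formal generalization is not difficult: Z. G. Yu (2001). How to use biological knowledge quantitatively: consensus sequences and weight matrices. Stochastic grammars: a hidden Markov chain = a stochastic regular grammar. Higher-order stochastic grammars?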

  12. Consensus Sequences (subscripts: percent occurrence) • TATAAT (Pribnow or -10 box): T80 A95 T45 A60 A50 T96 • TTGACA (-35 box): T82 T84 G78 A65 C54 A45 • CAAT (CAAT or -75 box): GGYCAATCT • TATA (TATA or Goldberg-Hogness box): TATAWAW • CATG (transcription startpoint). However, in Aful: ATG 76%, GTG 22%, TTG 2%.
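A consensus such as GGYCAATCT or TATAWAW uses IUPAC ambiguity codes (Y = C or T, W = A or T), so it can be turned into a regular expression and scanned against a sequence. The following Python sketch shows one way to do that; the function names and test sequences are hypothetical, and only part of the IUPAC table is included.

```python
import re

# Subset of the IUPAC nucleotide codes, mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[symbol] for symbol in consensus.upper())

def find_sites(consensus, seq):
    """Return (position, matched text) for every match, overlaps included."""
    # A lookahead with a capture group lets re.finditer report overlapping hits.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]

if __name__ == "__main__":
    print(find_sites("TATAWAW", "ccTATAAATgg"))  # made-up sequence
```

An exact-match scan of this kind ignores the position-specific percentages; a weight matrix would score each position instead of requiring a match.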

  13. An Observation. u d c s b t: charge, mass, flavor, charm, … p n e: charge, mass, spin, magnetic moment, … H C N O P …: atomic number, ionic radius, valence, affinity, … H2O NO CO2 …: molecular weight, polarity, … a c g t; A D E F G H … W Y V; BRCA1, PDGF.

  14. A PROGRAMME: Coarse-Grained Description of Nature Use of Symbols and Symbolic Strings Language Grammar and Complexity (Chomsky, Lindenmayer, etc.) So far this programme has been best realized in the study of dynamics by using Symbolic Dynamics. There have been preliminary attempts in analyzing biological sequences.

  15. It may not be a coincidence that the two systems in the universe that most impress us with their open-ended complex design — life and mind — are based on discrete combinatorial systems. Many biologists believe that if inheritance were not discrete, evolution as we know it could not have taken place. S. Pinker, The Language Instinct (1995)

  16. Simple Examples At the level of words: DOG GOD At sentence level: Dog bites Man Man bites Dog

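  17. Domain combinations of several serine proteases (B. Alberts et al., Molecular Biology of the Cell, 3rd ed., 1994, p. 123): EGF (epidermal growth factor), chymotrypsin, urokinase (UK), factor IX (coagulation factor IX, the Christmas antihemophilic factor), plasminogen. Each chain is drawn from N-terminus to C-terminus. Figure labels: Ca-binding protein; contains 3 -S-S- bonds.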

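  18. GC: Grammatical Complexity. Alphabet Σ. Example 1: Σ = {a, c, g, t}. Example 2: Σ = {A, C, D, …, W, Y}. Example 3: Σ = {a, …, z, A, …, Z, +, –, …}. The set of all strings formed from letters of the alphabet (including the empty string): Σ*. Any subset of Σ* is a language over Σ. Grammar = {alphabet, initial letter, production rules}. The language based on that grammar.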

  19. Classification of Formal Languages Chomsky Hierarchy Sequential production rules Lindenmayer Systems Parallel production rules

  20. Generative Grammar. S → NP VP; VP → V NP; NP → (Art) Adj* N; S → if S then S; S → either S or S; N → boy | girl | scientist | …; V → sees | believes | loves | eats | …; Adj → young | good | beautiful | …; Art → a | one | the. Legend: S = Sentence, NP = Noun Phrase, VP = Verb Phrase, Adj = Adjective, Art = Article. Non-terminal and terminal symbols.
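The production rules of this toy grammar (S → NP VP, VP → V NP, NP → (Art) Adj* N, plus the two recursive S rules) can be run as a random sentence generator. This is a minimal Python sketch; the branching probabilities and the recursion limit `max_depth` are assumptions added so that generation always terminates, and only the words listed on the slide are used.

```python
import random

NOUNS = ["boy", "girl", "scientist"]
VERBS = ["sees", "believes", "loves", "eats"]
ADJECTIVES = ["young", "good", "beautiful"]
ARTICLES = ["a", "one", "the"]

def noun_phrase(rng):
    """NP -> (Art) Adj* N: optional article, any number of adjectives, one noun."""
    words = []
    if rng.random() < 0.7:          # assumed probability of taking the optional article
        words.append(rng.choice(ARTICLES))
    while rng.random() < 0.3:       # assumed probability of adding another adjective
        words.append(rng.choice(ADJECTIVES))
    words.append(rng.choice(NOUNS))
    return words

def verb_phrase(rng):
    """VP -> V NP."""
    return [rng.choice(VERBS)] + noun_phrase(rng)

def sentence(rng, depth=0, max_depth=2):
    """S -> NP VP | if S then S | either S or S (recursion capped at max_depth)."""
    roll = rng.random()
    if depth < max_depth and roll < 0.15:
        return (["if"] + sentence(rng, depth + 1, max_depth)
                + ["then"] + sentence(rng, depth + 1, max_depth))
    if depth < max_depth and roll < 0.30:
        return (["either"] + sentence(rng, depth + 1, max_depth)
                + ["or"] + sentence(rng, depth + 1, max_depth))
    return noun_phrase(rng) + verb_phrase(rng)

if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(3):
        print(" ".join(sentence(rng)))
```

Every output is grammatical by construction, yet most outputs are semantically odd, which is the same syntax-versus-semantics gap that appears later for gene-like sequences.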

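  21. The Chomsky hierarchy of grammars. N: the set of non-terminal letters (working symbols). T: the set of terminal letters. S ∈ N: the start letter. P = {set of production rules x → y}, where x and y are strings of letters. Different restrictions on x and y lead to different grammars. Grammar G = (N, T, P, S). Type 0 grammar: x ∈ (N ∪ T)* N (N ∪ T)*, y ∈ (N ∪ T)*; x contains at least one non-terminal letter.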

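  22. Type 1 grammar (context-sensitive grammar): x = t1 a t2, with t1, t2 ∈ T* and a ∈ N. Type 2 grammar (context-free grammar): x = a ∈ N. Type 3 grammar (regular grammar): x = a, y = b or bc, with a, c ∈ N and b empty or b ∈ T.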

  23. A, B, … ∈ Non-terminals (NT); α, β, … ∈ Terminals (T). Regular Grammar: A → αA, A → α. One symbol on the LHS; one or no NT at the right end of the RHS.

  24. Context-Free Grammar: A A B  B |  (one symbol on the LHS; NT anywhere on the RHS). Context-Sensitive Grammar: A AB A  A  A  (one or more symbols on the LHS, but its length ≤ that of the RHS; one or more NT on the RHS). Recursively Enumerable Grammar: no restriction on the production rules.

  25. The Chomsky hierarchy of formal languages

  26. A Finite State Automaton (FSA). A transfer function: δ(a, R) = b. (Figure panels (i) and (ii): states a and b.)
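A finite state automaton is fully described by its transfer function, which maps a (state, input symbol) pair to the next state. The Python sketch below stores that function as a dictionary and runs it over a word. The example automaton, which accepts DNA strings containing "atg", is an illustration chosen here and does not come from the slide.

```python
def run_fsa(delta, start, accepting, word):
    """Feed word through the transfer function delta; True if it ends in an accepting state."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in accepting

def build_atg_fsa():
    """Hypothetical example: states 0..3 track how much of 'atg' has been seen."""
    delta = {}
    for symbol in "acgt":
        delta[(0, symbol)] = 1 if symbol == "a" else 0
        delta[(1, symbol)] = {"a": 1, "t": 2}.get(symbol, 0)
        delta[(2, symbol)] = {"a": 1, "g": 3}.get(symbol, 0)
        delta[(3, symbol)] = 3  # once 'atg' has been seen, stay in the accepting state
    return delta

if __name__ == "__main__":
    delta = build_atg_fsa()
    for word in ["ccatgaa", "atatg", "atcg"]:
        print(word, run_fsa(delta, 0, {3}, word))
```

The automaton has no memory beyond its current state, which is why it corresponds to the regular (type 3) languages.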

  27. A Pushdown Automaton Pushdown list Stack First In Last Out (FILO)

  28. A Turing Machine. Alan M. Turing (1912-1954). FSA + ∞ R/W tape. Church-Turing Thesis (1936): any effective (mechanical) computation can be carried out by a Turing machine.

  29. The Chomsky hierarchy of formal languages

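  30. Example: {a^i b^i c^i | i > 0} (CSL). Terminals = {a, b, c}. Non-terminals = {A, B}, with B as the start symbol. Sequential rules: B → aBAc | abc; bA → bb; cA → Ac. Derivations: B → abc. B → aBAc → aabcAc → aabAcc → aabbcc. B → aBAc → aaBAcAc → aaBAAcc → aaabcAAcc → aaabAcAcc → aaabAAccc → aaabbAccc → aaabbbccc.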
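The derivations can be checked mechanically. The Python sketch below reads the rules as B → aBAc | abc, bA → bb, cA → Ac, applies them sequentially (one occurrence at a time) in a breadth-first search, and collects every all-terminal string up to a length bound. The bound is a practical assumption; pruning by length is safe here because no rule shortens the string.

```python
from collections import deque

# Rules as (left-hand side, right-hand side); upper case = non-terminal.
RULES = [("B", "aBAc"), ("B", "abc"), ("bA", "bb"), ("cA", "Ac")]

def generate(start="B", max_len=9):
    """Return all terminal strings derivable from start with length <= max_len."""
    seen = {start}
    queue = deque([start])
    words = set()
    while queue:
        form = queue.popleft()
        if form.islower():              # only terminals left: a word of the language
            words.add(form)
            continue
        for lhs, rhs in RULES:
            index = form.find(lhs)
            while index != -1:          # rewrite each occurrence separately
                new_form = form[:index] + rhs + form[index + len(lhs):]
                if len(new_form) <= max_len and new_form not in seen:
                    seen.add(new_form)
                    queue.append(new_form)
                index = form.find(lhs, index + 1)
    return sorted(words, key=len)

if __name__ == "__main__":
    print(generate())
```

The rule cA → Ac only moves A leftward past c, and bA → bb consumes it, so every surviving derivation ends with equal numbers of a, b and c.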

  31. Rules to Generate Gene-Like Sequences (by David Searls). gene → upstream transcript downstream; transcript → 5’-untranslated-region start-codon coding-region 3’-untranslated-region; coding-region → codon coding-region | stop-codon | splice | coding-region; codon → lys | asn | thr | met | glu | his | pro | asp | ala | gly | tyr | trp | phe | leu | ile | ser | arg | gln | val | cys; start-codon → met; stop-codon → taa | tag | tga

  32. leu → tt purine | ct base; ser → ag pyrimidine | tc base; arg → ag purine | cg base (6 codons each). val → gt base; pro → cc base; ala → gc base; gly → gg base; thr → ac base (4 each). ile → at pyrimidine | ata (3). lys → aa purine; asn → aa pyrimidine; gln → ca purine; his → ca pyrimidine; glu → ga purine; cys → tg pyrimidine; phe → tt pyrimidine; tyr → ta pyrimidine; asp → ga pyrimidine (2 each). met → atg; trp → tgg. base → a | c | g | t; purine → a | g; pyrimidine → c | t
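These codon rules are a compact grammar for the genetic code, and expanding them recovers the full codon table. The Python sketch below reads each rule as "amino acid → alternatives", where each alternative is a literal prefix optionally followed by one of the classes base, purine or pyrimidine; the dictionary layout and function names are choices made for this illustration.

```python
from itertools import product

CLASSES = {"base": "acgt", "purine": "ag", "pyrimidine": "ct"}

CODON_RULES = {
    "leu": ["tt purine", "ct base"], "ser": ["ag pyrimidine", "tc base"],
    "arg": ["ag purine", "cg base"], "val": ["gt base"], "pro": ["cc base"],
    "ala": ["gc base"], "gly": ["gg base"], "thr": ["ac base"],
    "ile": ["at pyrimidine", "ata"], "lys": ["aa purine"], "asn": ["aa pyrimidine"],
    "gln": ["ca purine"], "his": ["ca pyrimidine"], "glu": ["ga purine"],
    "cys": ["tg pyrimidine"], "phe": ["tt pyrimidine"], "tyr": ["ta pyrimidine"],
    "asp": ["ga pyrimidine"], "met": ["atg"], "trp": ["tgg"],
}
STOP_CODONS = {"taa", "tag", "tga"}

def expand(alternative):
    """Expand one right-hand side such as 'tt purine' into concrete codons."""
    options = [list(CLASSES[token]) if token in CLASSES else [token]
               for token in alternative.split()]
    return {"".join(choice) for choice in product(*options)}

def codons(amino_acid):
    """All codons generated for one amino acid, in sorted order."""
    result = set()
    for alternative in CODON_RULES[amino_acid]:
        result |= expand(alternative)
    return sorted(result)

if __name__ == "__main__":
    for name in CODON_RULES:
        print(name, codons(name))
```

Together with the three stop codons, the expansions cover all 64 triplets exactly once.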

  33. splice → intron; intron → gt intron-body ag; splice a → a intron; splice c → c intron; splice t → t intron; splice g → g intron; a splice → intron a; c splice → intron c; t splice → intron t; g splice → intron g; upstream → enhancer promoter enhancer; enhancer → …; promoter → …; silencer → …; insulator → …

  34. These rules can generate an unlimited set of gene-like sequences, most of them biological nonsense. They may be used to recognize gene-like segments in long DNA sequences. Syntax versus semantics: texts vs. grammar. Physics behind this coarse-grained description: stereochemistry, interactions between proteins and DNA chains, metal ions, etc.

  35. Development of Anabaena catenula (a filamentous cyanobacterium). Alphabet: Σ = {ar, al, br, bl}. Production rules: P = {ar → al br, al → bl ar, br → ar, bl → al}. Initial symbol (axiom): ω = ar. Grammar: G = (Σ, P, ω). Language: L(G) ⊂ Σ*.
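Parallel rewriting is easy to simulate: at every step each cell in the filament is replaced simultaneously. The Python sketch below assumes the rules read ar → al br, al → bl ar, br → ar, bl → al with axiom ar, which is Lindenmayer's classic model of this organism; the function names are made up for illustration.

```python
# D0L system: one deterministic rule per symbol, applied to all cells at once.
RULES = {
    "ar": ["al", "br"],
    "al": ["bl", "ar"],
    "br": ["ar"],
    "bl": ["al"],
}

def step(cells):
    """Rewrite every cell in parallel and return the new filament."""
    return [new_cell for cell in cells for new_cell in RULES[cell]]

def develop(axiom, generations):
    """Return the list of filaments from the axiom through the given number of steps."""
    stages = [list(axiom)]
    for _ in range(generations):
        stages.append(step(stages[-1]))
    return stages

if __name__ == "__main__":
    for stage in develop(["ar"], 4):
        print(" ".join(stage))
```

Under these rules the filament lengths follow the Fibonacci numbers (1, 2, 3, 5, 8, …), because every a-cell divides and every b-cell first has to mature into an a-cell.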

  36. Lindenmayer Systems: parallel production rules. Finer classification: D0L: deterministic, no interaction (i.e., context-free). 0L: non-deterministic, no interaction. IL: non-deterministic, with interaction (i.e., context-sensitive). T0L: with a table of production rules. TIL. E0L: extended to non-terminal symbols. ET0L. EIL: equivalent to REL of Chomsky.

  37. (Figure: the language classes FIN, RGL, CFL, CSL and REL drawn together with D0L.) Legend: RGL = Regular; CFL = Context-Free; CSL = Context-Sensitive; REL = Recursively Enumerable.

  38. (Figure: the Chomsky classes 0: REL, 1: CSL, 2: CFL, 3: RGL set against the Lindenmayer classes EIL, ET0L, IL, E0L, T0L, 0L, D0L, together with the indexed languages, IND.)

  39. Example à la Lindenmayer: L = {a^i b^i c^i | i > 0} (CSL). G = (Σ, T, ω), with ω = abc, Σ = {a, b, c}, T = {t1, t2}, t1 = {a → aa, b → bb, c → cc}, t2 = {a → …, b → …, c → …}. T0L.

  40. Gene-Finding Gene-structure model

  41. Genomic DNA (5’ … start … stop … 3’) → transcribe (RNA Pol II + …) → pre-mRNA → splice (spliceosome: U1, U2, U4, U5, U6 snRNP) → mRNA (5’-UTR, 3’-UTR) → translate (ribosome; initiation + elongation factors, termination) → AA seq (protein primary sequence) → fold (chaperonin) → protein fold.

  42. GT-AG Rule for Introns (subscripts: percent occurrence). 5’ splicing donor site: exon … A64 G73 G100 T100 A62 A68 G84 T63 … 3’ splicing acceptor site: … 12 Py N C65 A100 G100 N … exon.

  43. Gene structure written with parentheses: {()【(.)(.)(.)】()}. Delimiters: { transcription start, 【 translation start, 】 translation end, } transcription end. • 【( First exon • )( Internal exon • )】 Last exon • {( Non-coding 5’ exon • )【 Non-coding 5’ exon • (.) Intron • 】( Non-coding 3’ exon (rare) • )} Non-coding 3’ exon (rare) • }{ Intergenic region

  44. Dyck language: A language of nested parentheses • Many types of parentheses • Finite depth of nesting • Context-free language Our case: • Only 3 types of parentheses • Shallow nesting • Conjecture (Xie): may be regular language
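A Dyck-style string is normally checked with a stack, as a pushdown automaton would do. The Python sketch below checks a string written with the three bracket types of the gene-structure notation and optionally enforces a maximum nesting depth. The function name and the `max_depth` parameter are hypothetical; the code illustrates the idea behind the conjecture and does not prove it.

```python
# Closing bracket -> matching opening bracket, for the three types used in the notation.
PAIRS = {")": "(", "】": "【", "}": "{"}
OPENERS = set(PAIRS.values())

def check_nesting(text, max_depth=None):
    """True if brackets in text are properly nested (and no deeper than max_depth)."""
    stack = []
    for char in text:
        if char in OPENERS:
            stack.append(char)
            if max_depth is not None and len(stack) > max_depth:
                return False
        elif char in PAIRS:
            if not stack or stack.pop() != PAIRS[char]:
                return False
        # any other character (e.g. '.') is ignored
    return not stack

if __name__ == "__main__":
    gene = "{()【(.)(.)(.)】()}"
    print(check_nesting(gene), check_nesting(gene, max_depth=2))
```

When the depth is bounded, the stack can hold only finitely many different contents, so each content can be treated as one state of a finite automaton; this is why shallow nesting makes a regular language plausible.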

  45. Huimin Xie, Grammatical Complexity and 1D Dynamical Systems, Vol. 6 in Directions in Chaos, WSPC, 1996. Huimin Xie, Complexity and Dynamical Systems (in Chinese), Shanghai Scientific and Technological Education Publishing House, 1994. J. Hopcroft, J. Ullman, Introduction to Automata Theory, Languages and Computation, Addison-Wesley, 1979.
