Challenges for computer science as a part of Systems Biology. Benno Schwikowski, Institute for Systems Biology, Seattle, WA
Towards integrative models
Species · Conditions/time · Genes
• DNA: Sequence; Genomic locus; Domain content; Intron/exon structure; Regulatory motifs; Chemical modifications; SNPs; Splice variants; Accessibility; Variation
• mRNA: Abundance; Regulatory information; Initiation/termination signals
• Protein interaction: Interaction partner; Direct/indirect; Affinity; Effect
• Protein: Abundance; State; Localization; 3D structure; Functional characterization; Half-life; Active sites; Biochemical function; Cellular role
Challenge: Integrative models
• …Across genes and proteins: Many genes involved (e.g., multifactorial diseases)
• …Across model systems: Lack of experimental platforms in target system
• …Across levels of biological organization (e.g., gene regulatory processes involving phosphorylation)
• …Across experiments: Robustness against errors in mass spectrometry, mRNA measurements
• …Across timescales
Challenge: Capturing evolutionary constraints
DNA, RNA, Proteins, Modules, Organelles, Cells, Organs, Individuals, Populations, Ecologies
"Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky
Challenge: Choosing experiments
• Machine Learning: Determine the most likely classification/parameterization on the basis of a randomly sampled dataset
• Active Learning: Allow an algorithm to query selected data points, using the results of previous queries
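The contrast between the two settings can be illustrated with a toy problem that is not from the slides: learning an unknown threshold on the interval [0, 1], where each query stands in for one experiment. The sketch below is a minimal illustration under that assumption; the names `oracle`, `passive_threshold`, and `active_threshold` are hypothetical.

```python
import random


def passive_threshold(oracle, n_queries, rng):
    """Estimate the threshold from randomly sampled, labelled points."""
    lo, hi = 0.0, 1.0
    for _ in range(n_queries):
        x = rng.random()          # the learner has no say in which point it sees
        if oracle(x):
            hi = min(hi, x)       # x is at or above the threshold
        else:
            lo = max(lo, x)       # x is below the threshold
    return (lo + hi) / 2


def active_threshold(oracle, n_queries):
    """Estimate the threshold by choosing each query from earlier answers."""
    lo, hi = 0.0, 1.0
    for _ in range(n_queries):
        mid = (lo + hi) / 2       # query the most informative point: the midpoint
        if oracle(mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    theta = 0.3                                   # hidden parameter (assumed)
    oracle = lambda x: x >= theta                 # one "experiment" per call
    rng = random.Random(0)
    print("passive error:", abs(passive_threshold(oracle, 20, rng) - theta))
    print("active error: ", abs(active_threshold(oracle, 20) - theta))
```

In this toy setting the active learner halves its uncertainty with every query, so 20 queries leave an interval of width 2^-20, whereas the random sampler's uncertainty shrinks only roughly in proportion to 1/n.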
Challenge: Relations between system variables can be quite complex. Yuh, Bolouri, Davidson, Science, 1998
Challenge: Develop models that allow extremely efficient algorithms
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
CLUSTALW (1.74) multiple sequence alignment
[Figure: four alignment blocks of DNA sequences from Cotton, Pea, Tobacco, Ice-plant, Turnip, Wheat, Duckweed, and Larch]
Challenge: Developing models that allow extremely efficient algorithms
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
ACGT ACGT ACGT ACGG
Parsimony score: 1
J. Comp Biol. 2002
An Exact Algorithm (generalizing Sankoff and Rousseau 1975)
$W_u[s]$ = best parsimony score for the subtree rooted at node $u$, if $u$ is labeled with string $s$ ($4^k$ entries per node).
$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \bigl( W_v[t] + d(s, t) \bigr)$
[Figure: phylogenetic tree over the leaf sequences AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG, with a table $W_u$ at each node. Leaf tables hold entries such as "ACGG: +∞, ACGT: 0" and "ACGG: 0, ACGT: +∞"; internal tables hold entries such as "ACGG: 1, ACGT: 0", "ACGG: 2, ACGT: 1", "ACGG: 1, ACGT: 1", and "ACGG: 0, ACGT: 2".]
J. Comp Biol. 2002
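The recurrence can be sketched in a few lines of Python. This is a minimal illustration, not the published implementation: it assumes that d(s, t) is the Hamming distance, that a leaf table holds 0 for every k-mer occurring in the leaf sequence and infinity otherwise, and that the children of each node are summed. The tree shape in `TREE` is hypothetical, and the sequences are only the visible prefixes shown on the slides.

```python
from itertools import product

INF = float("inf")


def hamming(a, b):
    """Number of mismatching positions (assumed choice for d(s, t))."""
    return sum(x != y for x, y in zip(a, b))


def table(node, k, kmers):
    """Return W_u as a dict: k-mer s -> best score of the subtree if u is labeled s."""
    if isinstance(node, str):                      # leaf: node is a DNA sequence
        present = {node[i:i + k] for i in range(len(node) - k + 1)}
        return {s: (0 if s in present else INF) for s in kmers}
    # Keep only finite entries of each child table; infinite ones never win the min.
    children = [
        [(t, w) for t, w in table(child, k, kmers).items() if w < INF]
        for child in node
    ]
    return {
        s: sum(min(w + hamming(s, t) for t, w in child) for child in children)
        for s in kmers
    }


def best_parsimony(tree, k):
    """Best parsimony score over all k-mer labels of the root."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return min(table(tree, k, kmers).values())


# Visible sequence prefixes from the slides; the tree topology is an assumption.
S1, S2, S3, S4, S5 = (
    "AGTCGTACGTGAC", "AGTAGACGTGCCG", "ACGTGAGATACGT",
    "GAACGGAGTACGT", "TCGTGACGGTGAT",
)
TREE = ((S1, S2), ((S3, S4), S5))

if __name__ == "__main__":
    print(best_parsimony(TREE, 4))
```

Each internal node compares every candidate label s with every candidate child label t, which is where the 4^{2k} factor in the running time comes from.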
What are good challenges to tackle?
• Biological/medical questions asked
• Experimental technologies to acquire a lot of relevant data
• Available datasets with a formalized notion of "data quality"
Complexity ($n$ = number of species, $k$ = motif length, $l$ = average sequence length)
Memory complexity: $O(k \cdot 4^{2k})$ per node
Time complexity: total time $O(nk(4^{2k} + l))$
J. Comp Biol. 2002
Technology-based challenges: Universal DNA Tag Systems
• Existing applications in high-throughput technologies
• Universal DNA arrays
• Padlock probes
• LYNX mRNA technology
Formalization
Define: weight(A/T) = 1, weight(C/G) = 2
weight(AACTTG) = 1+1+2+1+1+2 = 8
melting temperature(AACTTG) = 2 · weight(AACTTG)
l-u code problem: Given two integers l < u, find the largest set of tags such that
• each tag has weight ≥ u
• each string of weight ≥ l occurs at most once
J. Comp Biol. 2000 & 2003
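The definitions above can be checked with a short Python sketch. It reads the two conditions as "each tag has weight at least u" and "each string of weight at least l occurs at most once as a substring across all tags"; that reading is an assumption, since the comparison symbols are missing from the slide text. The function names are hypothetical, and the checker only validates a given tag set; it does not search for the largest one.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def weight(s):
    """Sum of per-base weights: A/T count 1, C/G count 2."""
    return sum(WEIGHT[base] for base in s)


def melting_temperature(s):
    """Melting temperature as defined on the slide: twice the weight."""
    return 2 * weight(s)


def is_lu_code(tags, l, u):
    """Check the two conditions under the assumed 'at least' reading."""
    if any(weight(tag) < u for tag in tags):
        return False                      # some tag is too light
    seen = set()
    for tag in tags:
        for i in range(len(tag)):
            for j in range(i + 1, len(tag) + 1):
                sub = tag[i:j]
                if weight(sub) >= l:
                    if sub in seen:       # heavy substring occurs a second time
                        return False
                    seen.add(sub)
    return True


if __name__ == "__main__":
    print(weight("AACTTG"), melting_temperature("AACTTG"))
    print(is_lu_code(["AACTTG"], 4, 8))
    print(is_lu_code(["AACTTG", "GGAACT"], 4, 8))
```

In the last call the two tags share the substring AAC, which has weight 4, so the pair is rejected for l = 4.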
Challenge: Visualization. Andrea Weston et al. @ ISB & Cytoscape
Challenge: Visualization. Cytoscape, pre-release 2.0
A computer scientist's perspective
"Biology is so digital, and incredibly complicated […] I can't be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on, it's at that level."
Donald Knuth, 7 Dec 1993