Proteus, a Grid based Problem Solving Environment (PSE) for Bioinformatics : Architecture and Experiments. Presenter: Michael Robinson Agnostic: Javier Munoz Advanced Topics in Software Engineering CIS 6612 Florida International University
Proteus, a Grid based Problem Solving Environment (PSE) for Bioinformatics: Architecture and Experiments
Presenter: Michael Robinson
Agnostic: Javier Munoz
Advanced Topics in Software Engineering, CIS 6612, Florida International University, July 31, 2006
Authors: Mario Cannataro (1), Carmela Comito (2), Filippo Lo Schiavo (1), and Pierangelo Veltri (1) (February 2004)
(1) University of Magna Graecia of Catanzaro, Italy
(2) University of Calabria, Italy
Organization
• Abstract
• ~60% is about Bioinformatics
• Proteus Architecture
• First Test Implementation
• Results of First Test
• Conclusion and Future Work
Abstract
• Life sciences, Bioinformatics, Computer Science
• Data file sizes
• Computer power
The Partners
• What are the life sciences
• What is Bioinformatics
• Other sciences used in Bioinformatics
• What is Computer Science
Human Genome
• The sum total of DNA in an organism is its genome.
• The Human Genome Project (HGP), an international effort, began in October 1990; its completion is variously dated to 1999, 2003, or 2004. (http://www.pbs.org/wgbh/nova/genome/program.html)
• Project goals were to:
• Determine the complete sequence of the 3 billion DNA bases
• Identify all human genes
• And make them accessible for further biological study
Human Genome
• The bacterium E. coli and others were used to help develop the technology and interpret human gene function.
• The Human Genome Project was sponsored by the U.S. Department of Energy and the U.S. National Institutes of Health.
http://www.preventiongenetics.com/edu/genetics_nutshell.htm
DNA (ACGT)
• Humans have from 10 to 100 trillion cells.
• Each human cell has about 3 billion nucleotides.
• We have approximately 30,000 genes.
• Of the three billion letters of DNA that we have, only 1 to 1.5 percent is genes; the rest is "stuff".
• The functions of over 50% of known genes are unknown.
DNA (ACGT): Human Genome
• ~3,000,000,000 DNA bases
• ~30,000,000 bases in genes
• ~2,970,000,000 bases of "stuff"
• Adenine (A) forms a base pair with thymine (T); guanine (G) forms a base pair with cytosine (C).
The gene sizes
• The largest known human gene is dystrophin, at 2.4 million bases.
• Chromosome 21 is the smallest human chromosome. Three copies of this autosome cause Down syndrome, the most frequent genetic disorder associated with significant mental retardation. Academic groups from Germany and Japan mapped and sequenced it; it has 33,546,361 bp of DNA. Analysis of the chromosome revealed:
• 127 known genes,
• 98 predicted genes,
• and 59 pseudogenes.
• Smallest bacterial genome: Mycoplasma genitalium, at 580 kbp.
Bioinformatics • DNA RNA PROTEINS MUTATIONS, ILLNESSES MEDICATIONS CLONING
DNA (ACGT)
• Pseudomonas aeruginosa PAO1: 6,264,403 bases, 5,565 genes
• complement(6264226..6264360)
6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
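The numbered sequence listing on this slide (a 1-based coordinate followed by groups of ten bases) can be turned into a plain string and sliced by a location such as `complement(6264226..6264360)`. The following is a minimal sketch under that reading of the layout; the function names `parse_numbered_sequence`, `parse_location`, and `extract_region` are illustrative and not taken from the paper or from any library.

```python
import re

# Sequence lines as shown on the slide: start coordinate, then bases in groups of 10.
SLIDE_LINES = """\
6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
"""


def parse_numbered_sequence(text):
    """Return (first_coordinate, contiguous_bases) from numbered sequence lines."""
    first = None
    chunks = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if first is None:
            first = int(fields[0])  # coordinate of the first base in the listing
        chunks.append("".join(fields[1:]))
    return first, "".join(chunks)


def parse_location(location):
    """Parse 'start..end' or 'complement(start..end)' into (start, end, is_complement)."""
    m = re.fullmatch(r"complement\((\d+)\.\.(\d+)\)", location)
    if m:
        return int(m.group(1)), int(m.group(2)), True
    m = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if m:
        return int(m.group(1)), int(m.group(2)), False
    raise ValueError(f"unsupported location: {location!r}")


def extract_region(first, bases, start, end):
    """Slice the forward-strand bases for 1-based, inclusive coordinates."""
    return bases[start - first:end - first + 1]


first, bases = parse_numbered_sequence(SLIDE_LINES)
start, end, is_complement = parse_location("complement(6264226..6264360)")
region = extract_region(first, bases, start, end)
print(len(region), is_complement)
```

The sketch only slices the forward strand; the `is_complement` flag records that the annotated feature lies on the opposite strand and still has to be reverse-complemented before it is read.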
RNA
• In RNA, thymine is replaced by uracil (U).
DNA
6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
RNA
6264181 gcuugucccg gucgaagucc cgacucacca cccguaccgg auaaaucaga cggucagacg
6264241 cuuacggccu uuggcgcgac gacgcgacag aaccugacgg ccguucuugg uggccauacg
6264301 ggcgcggaaa ccguggacgc gagcgcgcuu gagggugcug gguuggaaag uacguuucau
6264361 gauucgguac cuggguugac gacuugaggu cgcagugacc ccg
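The DNA-to-RNA rewrite shown on this slide is a letter-for-letter substitution of thymine by uracil. A minimal sketch of that substitution follows; the function name `transcribe` is illustrative, and the sketch mirrors only what the slide shows (it does not model strand selection or any other biology).

```python
def transcribe(dna):
    """Replace thymine (t/T) with uracil (u/U), leaving all other letters unchanged."""
    return dna.replace("t", "u").replace("T", "U")


# First group of ten bases from the slide's DNA listing.
print(transcribe("gcttgtcccg"))
```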
Proteins (sequences)
DNA
6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
RNA
6264181 gcuugucccg gucgaagucc cgacucacca cccguaccgg auaaaucaga cggucagacg
6264241 cuuacggccu uuggcgcgac gacgcgacag aaccugacgg ccguucuugg uggccauacg
6264301 ggcgcggaaa ccguggacgc gagcgcgcuu gagggugcug gguuggaaag uacguuucau
6264361 gauucgguac cuggguugac gacuugaggu cgcagugacc ccg
PROTEIN
MKRTFQPSTLKRARVHGFRARMATKNGRQVLSRRRAKGRKRLTV
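The step from DNA to the protein string can be reproduced in a few lines. This is a minimal sketch, assuming the protein is encoded on the reverse strand (as the `complement(6264226..6264360)` annotation on the earlier DNA slide suggests) and assuming the standard genetic code. The names `FORWARD_REGION`, `reverse_complement`, and `translate` are illustrative. Codons are read directly from DNA letters, which is equivalent to reading the RNA with U in place of T.

```python
from itertools import product

# Forward-strand bases at coordinates 6264226..6264360, copied from the slide.
FORWARD_REGION = (
    "tcaga" "cggtcagacg"
    "cttacggcct" "ttggcgcgac" "gacgcgacag" "aacctgacgg" "ccgttcttgg" "tggccatacg"
    "ggcgcggaaa" "ccgtggacgc" "gagcgcgctt" "gagggtgctg" "ggttggaaag" "tacgtttcat"
)

# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Pair A with T and G with C, then reverse to read the opposite strand."""
    return dna.translate(str.maketrans("acgtACGT", "tgcaTGCA"))[::-1]


def translate(dna):
    """Translate codon by codon from the first base, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


protein = translate(reverse_complement(FORWARD_REGION))
print(protein)
```

Under these assumptions the 135 bases give 44 codons followed by a stop codon, and the translated string matches the PROTEIN line on the slide.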
Proteins: Pattern Matching G-H-E-X(2)-G-X(4,5)-[GA]
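A pattern written in this notation can be searched for with an ordinary regular expression. The sketch below converts a subset of the syntax (single residues, `X` or `x` as a wildcard, `[..]` for allowed residues, `{..}` for excluded residues, and `(n)` or `(n,m)` repeat counts) into a Python regex. The function name `prosite_to_regex` and both test sequences are hypothetical and chosen only to illustrate the matching.

```python
import re

# One pattern element: a residue, wildcard, or residue set, with an optional repeat count.
ELEMENT = re.compile(r"([A-Z]|x|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a dash-separated pattern such as 'G-H-E-X(2)-G' to a regex string."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        if residue in ("x", "X"):
            parts.append(".")                          # any residue
        elif residue.startswith("{"):
            parts.append("[^" + residue[1:-1] + "]")   # any residue except these
        else:
            parts.append(residue)                      # single letter or [..] set
        if low is not None:
            parts.append("{" + low + ("," + high if high else "") + "}")
    return "".join(parts)


regex = prosite_to_regex("G-H-E-X(2)-G-X(4,5)-[GA]")
hit = re.search(regex, "MKGHEAAGLLLLGTT")   # hypothetical sequence containing the motif
miss = re.search(regex, "MKGHEAGLLLLG")     # hypothetical sequence without it
print(regex, hit.group() if hit else None, miss)
```

The pattern on the slide therefore reads: G, H, E, any two residues, G, any four or five residues, then G or A.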
Proteins: Structures • Chemical properties that distinguish the 20 different amino acids cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell
Reality
• Somewhere in this dense chemical forest are genes involved in deafness, Alzheimer's disease, cancer, cataracts, etc. But where? It is such a maze that scientists need a map.
• Out of three billion base pairs in our DNA, a single letter can make a difference.
Data Locations
• GenBank in the US, 1974 (http://www.ncbi.nlm.nih.gov/): 1997 = 1.26 gigabases, 2004 = 39 gigabases, 2005 = 100 gigabases
• EMBL in England, 1980 (http://www.ebi.ac.uk/embl/)
• DDBJ in Japan, 1984 (http://www.ddbj.nig.ac.jp/)
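The GenBank figures on this slide imply a steep growth rate, which is the data-size pressure raised in the abstract. As a rough sketch, taking the three figures at face value and assuming steady exponential growth between them (an assumption made here, not a claim from the paper), the average yearly growth factor and doubling time can be computed as follows; the names `SIZES_GIGABASES`, `annual_growth_factor`, and `doubling_time_years` are illustrative.

```python
import math

# Database sizes in gigabases, as quoted on the slide.
SIZES_GIGABASES = {1997: 1.26, 2004: 39.0, 2005: 100.0}


def annual_growth_factor(first_year, last_year):
    """Average yearly multiplier between two years, assuming exponential growth."""
    ratio = SIZES_GIGABASES[last_year] / SIZES_GIGABASES[first_year]
    return ratio ** (1.0 / (last_year - first_year))


def doubling_time_years(growth_factor):
    """Years needed to double at a constant yearly growth factor."""
    return math.log(2) / math.log(growth_factor)


factor = annual_growth_factor(1997, 2005)
print(round(factor, 2), round(doubling_time_years(factor), 2))
```

On these numbers the data grows by about 1.7x per year between 1997 and 2005, which corresponds to a doubling time of a little over a year.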
Some Databases
• The Swiss Institute of Bioinformatics maintains the following databases: Ashbya Genome Database; Cancer Immunome Database; Eukaryotic Promoter Database (EPD); GermOnline; MyHits; PROSITE; Swiss-Prot and TrEMBL; SWISS-2DPAGE; SWISS-MODEL Repository
Specialization
• PlasmoDB (http://www.plasmodb.org/plasmo/home.jsp): a database for the parasitic eukaryote Plasmodium, the causative agent of malaria. Contact: apibugz@delphi.pcbi.upenn.edu
Conclusions and Future Work
Execution Times of the Application
References
• In the paper, the authors cited 27 references.
Questions Thank you