Searching for Sense in the Genome. NKx 2.6. NKx 2.9. Jim Kent Genome Bioinformatics Group University of California Santa Cruz. Mouse mRNA in-situ’s of various transcription factors courtesy of Paul Gray. Hmx1. Mllt7. The Paradox of the Genome.
Searching for Sense in the Genome. Jim Kent, Genome Bioinformatics Group, University of California Santa Cruz. NKx 2.6, NKx 2.9, Hmx1, Mllt7: mouse mRNA in situs of various transcription factors, courtesy of Paul Gray.
The Paradox of the Genome How does a long, static, one-dimensional string of DNA turn into the remarkably complex, dynamic, three-dimensional human body? GTTTGCCATCTTTTGCTGCTCTAGGGAATCCAGCAGCTGTCACCATGTAAACAAGCCCAGGCTAGACCAGTTACCCTCATCATCTTAGCTGATAGCCAGCCAGCCACCACAGGCATGAGT
More Complex In Real Life Image from UCSC Genome Browser with “Known Genes” and RepeatMasker tracks. The two genes are TPARL on the forward strand, and CLOCK on the reverse strand. CLOCK regulates sleeping. The function of TPARL is unknown. Note bulk of genome is repeats & introns.
Transposons • Similar to retroviruses like HIV. • “Selfish DNA” or “molecular parasite”. • The Alu transposon is a parasite on the LINE transposon. Things grow on things. • ~50% of the human genome is relics of transposons. • Transposons and other duplications, along with the sheer size of the genome, made sequencing and assembling the human genome a challenge.
Introns • Human genes are interrupted by introns. • Introns are typically much larger than the rest of the gene, often 100,000 bases or more. • It’s possible that the first introns, perhaps a billion years ago, were a particular type of transposon. We do see transposons creating introns today. • Introns make finding genes in the human genome an ongoing challenge.
Lines of Evidence for Genes • Browser shot with computational gene finders, ESTs, comparative genomics, full length mRNA. UCSC Genome Browser presenting tracks of evidence for 2 genes
Hints of a Gene Very suggestive evidence for unknown gene.
Genome Progress • Genome DNA sequence: • 85% complete 2000 • 99% complete 2004 • Human gene set: • 85% complete 2004 • 99% complete 2008?
The Paradox of Genes How do 25,000 genes, each in the end just a one-dimensional string of DNA, turn into the remarkably complex, dynamic, three-dimensional human body? CLOCK TPARL FLJ13352 PDCL2 NMU KDR SEC3L1 KIAA0635 AK126014 KIT NRPS998 PPAT CR749824 SRP72 ARL9 PDGFRA HOP GSH-2 AF090902 SPINK2 CHIC2 BC057822 REST C4orf14 POLR2B IGFBP7 LNX SCFD2 RASL11B AY189288 USP46 AK021912 LOC132671 SGCB …
How to Understand Incredibly Complex Systems? • DNA is popularly considered the code of life. • Computer programs are complex systems that are ultimately built up of 0s and 1s. Perhaps they are a model for a genome built of A, C, G and T? BUT… • The human genome lacks documentation, has accumulated 3 billion years of cruft, and does not believe in local variables. • Therefore we must look to less-than-straightforward software programs as guides.
Bioperl CORBA module

    sub new {
        my ( $class, @args) = @_;
        my $self = $class->SUPER::new(@args);
        my ( $idl, $ior, $orbname ) =
            $self->_rearrange( [ qw(IDL IOR ORBNAME)], @args);
        $self->{'_ior'} = $ior || 'biocorba.ior';
        $self->{'_idl'} = $idl || $ENV{BIOCORBAIDL} || 'biocorba.idl';
        $self->{'_orbname'} = $orbname || 'orbit-local-orb';
        $CORBA::ORBit::IDL_PATH = $self->{'_idl'};
        my $orb = CORBA::ORB_init($orbname);
        my $root_poa = $orb->resolve_initial_references("RootPOA");
        $self->{'_orb'} = $orb;
        $self->{'_rootpoa'} = $root_poa;
        return $self;
    }
Obfuscated C #define c(n,s)case n:s;continue char x[]="((((((((((((((((((((((",w[]= "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b";char r[]={92,124,47},l[]={2,3,1 ,0};char*T[]={" |"," |","%\\|/%"," %%%",""};char d=1,p=40,o=40,k=0,*a,y,z,g= -1,G,X,**P=&T[4],f=0;unsigned int s=0;void u(int i){int n;printf( "\233;%uH\233L%c\233;%uH%c\233;%uH%s\23322;%uH@\23323;%uH \n",*x-*w,r[d],*x+*w ,r[d],X,*P,p+=k,o);if(abs(p-x[21])>=w[21])exit(0);if(g!=G){struct itimerval t= {0,0,0,0};g+=((g<G)<<1)-1;t.it_interval.tv_usec=t.it_value.tv_usec=72000/((g>> 3)+1);setitimer(0,&t,0);f&&printf("\e[10;%u]",g+24);}f&&putchar(7);s+=(9-w[21] )*((g>>3)+1);o=p;m(x);m(w);(n=rand())&255||--*w||++*w;if(!(**P&&P++||n&7936)){ while(abs((X=rand()%76)-*x+2)-*w<6);++X;P=T;}(n=rand()&31)<3&&(d=n);!d&&--*x<= *w&&(++*x,++d)||d==2&&++*x+*w>79&&(--*x,--d);signal(i,u);}void e(){signal(14, SIG_IGN);printf("\e[0q\ecScore: %u\n",s);system("stty echo -cbreak");}int main (int C,char**V){atexit(e);(C<2||*V[1]!=113)&&(f=(C=*(int*)getenv("TERM"))==( int)0x756E696C||C==(int)0x6C696E75);srand(getpid());system("stty -echo cbreak" );h(0);u(14);for(;;)switch(getchar()){case 113:return 0;case 91:case 98:c(44,k =-1);case 32:case 110:c(46,k=0);case 93:case 109:c(47,k=1);c(49,h(0));c(50,h(1 ));c(51,h(2));c(52,h(3));}}
Reverse Engineering Microsoft [Diagram labels: mouse, keyboard, network, Windows XP, elaborate proprietary process, blue screen of death.]
Textbook Gene Regulation: Promoter Tells Where to Begin Different promoters activate different genes in different parts of the body.
Computing Baldness Idealized promoter for a gene involved in making hair. Proteins that bind to specific DNA sequences in the promoter region together turn a gene on or off. These proteins are themselves regulated by their own promoters leading to a gene regulatory network with many of the same properties as a neural network.
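The neural-network analogy can be made concrete with a toy model. The following is a minimal sketch, assuming each gene is simply on (1) or off (0) and that a promoter sums weighted inputs from its regulators against a threshold. The gene names, weights and thresholds are hypothetical and chosen only for illustration; real promoters are far less tidy.

```python
# Toy threshold model of a gene regulatory network.
# All gene names, weights, and thresholds below are hypothetical.

def step(state, weights, thresholds):
    """Advance every gene one time step using a threshold rule."""
    new_state = {}
    for gene, inputs in weights.items():
        # Weighted sum of the regulators currently bound at this promoter.
        total = sum(w * state[regulator] for regulator, w in inputs.items())
        new_state[gene] = 1 if total >= thresholds[gene] else 0
    return new_state

# Positive weight = activator, negative weight = repressor.
weights = {
    "activator_a": {"activator_a": 1.0},   # keeps itself on once on
    "repressor_r": {"repressor_r": 1.0},   # keeps itself on once on
    "hair_gene": {"activator_a": 1.0, "repressor_r": -2.0},
}
thresholds = {"activator_a": 0.5, "repressor_r": 0.5, "hair_gene": 0.5}

# Same activator in both cells; only the repressor differs.
hairy = step({"activator_a": 1, "repressor_r": 0, "hair_gene": 0}, weights, thresholds)
bald = step({"activator_a": 1, "repressor_r": 1, "hair_gene": 0}, weights, thresholds)
print(hairy["hair_gene"], bald["hair_gene"])
```

Calling `step` repeatedly on its own output follows the network over time, which is where feedback loops between transcription factors start to matter.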
Genes can be transcription factors that activate or repress other genes, leading to regulatory networks such as this one from the development of the central nervous system. (Image from D’Haeseleer Somogyi 1999)
The Decisions of a Cell • When to reproduce? • When to migrate and where? • What to differentiate into? • When to secrete something? • When to make an electrical signal? The more rapid decisions are usually made via the cell membrane and second messengers. The longer-acting decisions are usually made in the nucleus.
Nucleus Used to Appear Simple • Cheek cells stained with basic dyes. Nuclei are readily visible.
Mammalian Nuclei Stained in Various Ways Image from Tom Misteli lab
Artist’s rendition of nucleus Image from nuclear protein database
Turning on a gene: • Get DNA into right part of nucleus. • Unpack chromatin. • Attract RNA polymerase enzyme to transcribe gene from DNA into RNA.
Methods for Studying Gene Regulation • Genetics in model organisms. • Microarrays. • In situs. • Promoters hooked to reporter genes. • Phylogenetic footprinting.
Drosophila Genetics: Antennapedia mutant (left) and normal (right).
UCSC Gene Sorter showing GNF Gene Atlas 2 microarray (gene chip) data on several transcription factors.
In Situs NKx 2.6, NKx 2.9, Hmx1, Mllt7. Mouse mRNA in situs of various transcription factors, courtesy of Paul Gray.
Reporter Gene Constructs [Diagram: promoter to study, followed by an easily seen gene.] Drosophila embryo transfected with the ftz promoter hooked up to a lacZ reporter gene, creating stripes where the ftz promoter is active.
Comparative Genomics Scott Schwartz
Conservation of Gene Features Conservation pattern across 3165 mappings of human RefSeq mRNAs to the genome. A program sampled 200 evenly spaced bases across 500 bases upstream of transcription, the 5’ UTR, the first coding exon, introns, middle coding exons, introns, the 3’ UTR and 500 bases after polyadenylation. There are peaks of conservation at the transition from one region to another.
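Sampling a fixed number of evenly spaced bases is what lets regions of very different lengths be averaged together. The sketch below shows one way this could be done, assuming each region arrives as a list of per-base conservation scores. The function names and the nearest-index sampling rule are assumptions for illustration, not the program actually used for the slide.

```python
def sample_evenly(scores, n_samples=200):
    """Return n_samples values taken at evenly spaced positions along scores."""
    if not scores:
        raise ValueError("region has no bases")
    if n_samples == 1:
        return [scores[0]]
    last = len(scores) - 1
    # Map sample i onto the nearest base index between 0 and last.
    return [scores[round(i * last / (n_samples - 1))] for i in range(n_samples)]


def average_profiles(regions, n_samples=200):
    """Average the resampled profiles of many regions, position by position."""
    profiles = [sample_evenly(region, n_samples) for region in regions]
    return [sum(column) / len(profiles) for column in zip(*profiles)]


# Two toy "introns" of different lengths reduced to the same 3-point profile.
print(average_profiles([[0, 0, 0, 0, 0], [2, 2, 2, 2, 2, 2, 2]], n_samples=3))
```

With this scheme a 500-base upstream region and a 100,000-base intron both contribute exactly 200 points, so every mapping carries equal weight in the averaged curve.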
Conservation in Multiple Alignments • As you add more species the phylogenetic footprint gets sharper. • Currently genome.ucsc.edu shows multiple alignments between 8 species. • Alignment and conservation scoring algorithms are interesting and involve dynamic programming.
PhyloHMM on Drosophila • Drosophila proteasome alpha 7-1. In many genes like this one, the phylogenetic footprint suggests the promoter is actually downstream of the transcription start site.
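For readers who have not met dynamic programming in alignment, here is a minimal sketch of the classic Needleman-Wunsch global alignment score for two sequences. It is a textbook illustration of the technique, not the algorithm behind the alignments shown on genome.ucsc.edu, and the scoring values (match, mismatch, gap) are arbitrary assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score of the best global alignment of a and b."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # against the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against nothing costs i gaps
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            ))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

Each cell depends only on three neighbors, so the table fills in time proportional to the product of the two lengths. Genome-scale aligners add seeding and other heuristics on top of this idea to stay tractable.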
Other tools to cybernetically enhance your mind at genome.ucsc.edu
UCSC Gene Sorter • Swiss army knife for dealing with gene sets. • Presents functional data on genes including microarray expression information. • Highlights relationships and connections between genes. • Powerful data mining tool.
A Big Bioinformatics Web Site • genome.ucsc.edu gets > 100,000 hits by > 5000 scientists each day. • Involves 600,000 lines of C code, plus bits of awk, Perl, bash, tcsh, Java, R and Tcl. • 1200 CPUs and 12 terabytes of disk. • 12 full-time staff; 18 part-time staff, grad students and post-docs.
Site Architecture • 8 web servers running Apache and MySQL. • CGIs written in C access genome data and user interface settings in MySQL. • The genome database is the bottleneck, and is replicated on each server. • A cluster of 1000 CPUs and smaller clusters of faster CPUs create annotation files, which are loaded into the database.
Site Sociology • 1/3 of the group telecommutes. • Thursdays are devoted to reading and testing each other’s code and, if necessary, a one- or two-hour meeting. • We develop very incrementally, and do a new release once a week. • 1/4 of the group is dedicated to quality assurance; I want to increase this to 1/3. • User support is shared by everyone.
Parasol and Kilo Cluster • UCSC cluster has 1000 CPUs running Linux • 1,000,000 BLASTZ jobs in 25 hours for mouse/human alignment • We wrote Parasol job scheduler to keep up. • Very fast and free. • Jobs are organized into batches. • Error checking at job and at batch level.
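The throughput figure above is easier to appreciate per CPU. This back-of-the-envelope calculation uses only the numbers on the slide, and assumes (an idealization not stated in the source) that all 1000 CPUs were busy for the full 25 hours.

```python
# Figures from the slide.
jobs = 1_000_000
hours = 25
cpus = 1000

# Assumption: every CPU is fully occupied for the whole run.
jobs_per_hour = jobs / hours                 # cluster-wide rate
jobs_per_cpu_hour = jobs_per_hour / cpus     # rate on a single CPU
seconds_per_job = 3600 / jobs_per_cpu_hour   # average job length

print(jobs_per_hour, jobs_per_cpu_hour, seconds_per_job)
```

At roughly a minute and a half per job, the scheduler has to hand out tens of thousands of jobs per hour, so scheduling overhead measured in seconds per job would consume a noticeable share of the run.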
Conclusions • Spaghetti code is not so helpful in understanding the genome. • The human genome suggests that trial-and-error development is likely to yield a robust version of Windows within 3 billion years. • Understanding the flow of control in the genome is a problem that fascinates biologists and computer scientists alike.
Acknowledgements
Individuals: Chuck Sugnet, Angie Hinrichs, Fan Hsu, Terry Furey, Heather Trumbower, Kate Rosenbloom, Hiram Clawson, Brian Raney, Rachel Harte, Bob Kuhn, Andy Pohl, Mathieu Blanchette, Donna Karolchik, David Haussler, Bob Waterston, John Sulston, Eric Lander, Richard Gibbs, Francis Collins, Michael Brent, Olivier Jaillon, David Kulp, Ewan Birney, Greg Schuler, Deanna Church, Scott Schwartz, Ross Hardison, Webb Miller and everyone else!
Institutions: NHGRI, NCI, HHMI, The Wellcome Trust, taxpayers in the US and worldwide. Baylor, Sanger, Wash U, Whitehead, Stanford, JGI/DOE, Vancouver GSC, UW and the international sequencing centers. UCSC, NCBI, EBI, Ensembl, Genoscope, MGC, Intel, TIGR, Jackson Labs, Affymetrix, SwissProt.