1. BINF 733 Lecture 4: Feb 6, 2003
Gene Expression Databases
Homework Examples
2. Homework Assignment #1 Pick a dataset (I have several I can make available if you don’t have one you are working with)
Perform either
a biophysical analysis of the expected probe-target interactions for 10 genes – explain what assumptions you make if insufficient information is available, for example about probe concentration
OR
a replicate analysis to see how many replicates you would have recommended, using either Medvedovec, Baldi and Long or another method properly referenced.
3. MGED and MAGE MAGE-OM:
Microarray Gene Expression – Object Model
MAGE-ML:
Microarray Gene Expression – Markup Language
The proposal submitted to the OMG calls for a standard that addresses:
the representation of gene expression data
relevant annotations
mechanisms for exchanging these data.
4. MGED MGED (Microarrays and Gene Expression Databases) is a working group started by the EBI. Its first project was the submission of the MIAME standard (Minimum Information About a Microarray Experiment), followed by the MAGE-OM and MAGE-ML recommendations (Microarray Gene Expression Object Model and Markup Language) to the OMG (Object Management Group).
5. Relating DBs and tools to MIAME and MAGE - semantically correct references MIAME-supportive. MIAME describes data standards so that you can interpret and exchange data. A complete MIAME data set will be made available as a validation check.
MAGE-OM compliant and MAGE-ML compatible DB
A MIAME-supportive and MAGE-ML compatible application
It was suggested that the group provide recommendations to reviewers and authors, and to the public repositories (GEO and ArrayExpress) so that they coordinate the provision of common, or at least compatible, accession numbers for cross-references. The journals Nature and Science now require MIAME support for microarray submissions.
6. Lacunae in MIAME and MAGE Data Processing and Transformation
The representation of data transformation should be formalized
The process and parameters of transformation should be tracked
A “replicate” should be more carefully defined
A standard reference should be more carefully defined
7. MAGE version 2 Mesh terminology with the Ontology working group and include the data transformation improvements. The Web site where on-going tasks are tracked is
www.mged.org/working_groups and www.sourceforge.net/mged
Tools for working with the MAGE-OM and MAGE-ML are in the MAGE-stk toolkit
8. Ontology Working Group The ontology document allows you to enter instances, using the editor OIL. There was a working group meeting at the larger BioOntology meeting in Hinxton on November 16-21, 2002 (focusing on biomaterials).
The document currently has ~80 classes, ~80 attributes and ~80 instances but does not yet contain identifiers for the concepts.
9. GO:: Gene Ontology Consortium The Gene Ontology project is developing three independent, structured networks of terms to describe three key aspects of biology.
Molecular function describes the activities/tasks performed by individual gene products at the molecular level.
Biological process describes broad biological goals that are accomplished by ordered assemblies of molecular functions
Cellular component encompasses subcellular structures, locations and macromolecular complexes (eg ribosomes)
10. GO terms The GO consortium develops vocabularies and provides terms for annotating gene products (proteins and functional RNAs). The annotations include highly curated ones from model organisms in addition to those resulting from automated methods.
The GO project also develops software for querying, displaying and manipulating ontologies and annotations.
Additional bio-ontologies that other groups are developing include those for nucleotide and amino acid sequence features, organism anatomies and phenotypes.
http://www.geneontology.org
11. GO Tools: GO Browsers, DAG-Edit, GO Database and Other Tools AmiGO: you can search for a GO term and view all gene products annotated to it. You can browse ontologies to see relationships between terms. AmiGO accesses the GO MySQL database. The browser and documentation are available from http://www.godatabase.org/dev/
MGI GO Browser is specific for the mouse database at the Jackson Labs. http://www.informatics.jax.org/searches/GO_form.shtml
12. GO Tools cont Quick GO is a browser integrated into InterPro at the EBI. This has very complete mappings to SWISS-PROT, InterPro, Transport and EC classifications. http://www.ebi.ac.uk/ego/
EP:GO browser is built into the EBI’s Expression Profiler (a set of tools for analyzing microarray data). http://ep.ebi.ac.uk/EP/GO/
GoFish is available as a Java applet. It lets you construct arbitrary Boolean queries using GO attributes and orders the gene products by how well they satisfy the query. http://llama.med.harvard.edu/~berriz/GoFishWelcome.html
13. GO Database GO Database information, including the API documentation, schema diagrams and full descriptions of all tables (MySQL), is available at http://www.godatabase.org/dev/database/
DAG-Edit is a Java application that provides an interface to browse, query and edit GO or any other vocabulary that has a DAG (directed acyclic graph) data structure. The CVS repository for the most current versions is maintained at the SourceForge project site. Documentation is available at http://www.geneontology.org/doc/dagedit_userguide/dagedit.html
Manatee (Manual Annotation Tool Etc Etc) is a Web-based gene evaluation and genome annotation tool developed at TIGR. It is a useful tool for curators. http://manatee.sourceforge.net
14. GO and GE MAPPfinder is an accessory program for GenMAPP (http://www.genmapp.org/MAPPFinder.html ), a program that allows users to query any existing GenMAPP Expression Dataset Criterion against GO gene associations and GenMAPP MAPPS (microarray pathway profiles).
FatiGO (http://bioinfo.cnio.es/cgi-bin/tools/FatiGO/FatiGO.cgi ) is a web interface for clustering microarray data and simple datamining based on GO. The GO terms are related to Unigene Human and Mouse cluster ids and SGD.
15. GO and GE Onto-Express (http://vortex.cs.wayne.edu/Projects.html) searches the public databases and returns tables that correlate gene expression profiles with the cytogenetic gene locations and the three GO categories of terms.
Gene2Diseases (http://www.bork.embl-heidelberg.de/g2d ) is a database of candidate genes for mapped inherited diseases at EMBL. The data are the output of a comparison of the relations between phenotypic features and chemical objects, and between chemical objects and protein function (GO) terms, based on all of Medline and RefSeq.
16. MIAME Landmarks Array design: the layout or conceptual description of an array that can be implemented as one or more physical arrays. The specification consists of the description of the common features of the array as a whole, and the description of each design element (eg each spot). MIAME distinguishes between three levels of array design element: the feature (the x,y location), the reporter (the nt sequence actually present at a location) and the composite sequence (a set of reporters collectively used to measure the expression of a particular gene).
Measurements: MIAME distinguishes between three levels of data processing: image (raw data), image analysis and quantitation (processed or standardized data) and the ‘final’ gene expression measurement data matrix (derived data, normalized, summarized and otherwise massaged).
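The three levels of array design element described above can be pictured as nested records. This is only an illustrative sketch: the class and field names below are chosen for this example and are not the MAGE-OM schema, and the gene name, reporter name and sequence in the usage lines are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """A physical location on the array (x, y coordinates)."""
    x: int
    y: int


@dataclass
class Reporter:
    """The nucleotide sequence present at one or more features."""
    name: str
    sequence: str
    features: List[Feature] = field(default_factory=list)


@dataclass
class CompositeSequence:
    """A set of reporters used together to measure expression of one gene."""
    gene: str
    reporters: List[Reporter] = field(default_factory=list)


# Placeholder values, for illustration only.
probe = Reporter("probe_1", "ACGTACGT", [Feature(2275, 13725)])
gene = CompositeSequence("example_gene", [probe])
print(len(gene.reporters), gene.reporters[0].features[0].x)
```

Keeping the feature separate from the reporter lets the same sequence be spotted at several locations without duplicating it.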
17. MIAME to MAGE-OM
18. MGED Guide to authors, editors and reviewers of gene expression papers How do you describe a microarray experiment in sufficient detail that another scientist can reproduce your results? Most such experiments to date have not been replicable – actually even the analyses have not been replicable. (deRisi dataset has transposed columns, which does not help).
19. MAGE-OM specifically defines the objects of Microarray Gene Expression data independent of any implementation.
The model is developed and described using the Unified Modeling Language (UML) – a standard language for describing object models.
The result is a graphical representation of the relationships between different entities; the diagrams are primarily meaningful for humans.
20. Example from the MAGE-OM for BioSequence -- A representation of a DNA, RNA, or protein sequence
21. Packages of the MAGE-OM The MAGE-OM is too large to be represented on a single diagram.
Related classes are grouped together into packages and presented on the same diagrams.
(http://www.mged.org/Workgroups/MAGE/mage-om.html )
For example:
ArrayDesign DesignElement
AuditAndSecurity Experiment
BioAssay Measurement
BioMaterial Protocol
BioSequence QuantitationType
22. MAGE-ML - BioSequence
23. Data File - BioSequence
24. Mapping from MAGE-OM to a Relational Database
25. Reference implementation of the MAGE-OM : OpenGeneX
26. Query Results
27. From Image to Intensity Data
28. Here is a list of MIAME compliant software: MIAMExpress - a MIAME annotation tool under development at the EBI.
Gene Traffic - a microarray data management and analysis software developed at iobion informatics.
Ipsogen Cancer Profiler - "A bioinformatics system composed of discovery software tools and of the ELOGE database, utilized to identify transcriptional signatures belonging to each cancer type. This system ensures the management of gene expression measurement, and traceability in a GLP environment".
BASE (BioArray Software Environment) - "a web-based open source MIAME-supportive microarray DB and analysis platform."
29. Homework, part 1
30. Concentration determinations We will use as a starting point a paper from the deRisi lab concerning microarrays to assay viral infections.
31. In the deRisi paper the concentration of the oligos is given already in molar amounts, but often the concentration is given in ug/ul, so you have to convert:
Mol. wt =
[(251 x nA) + (245 x nT) + (267 x nG) + (230 x nC) + (61 x (n - 1)) + (54 x n) + (23 x (n - 1)) + 2]
Where:
nA, nT, nG and nC = number of adenine, thymine, guanine and cytosine bases in the DNA sequence, and n = total number of bases.
(61 x (n - 1)) accounts for the molecular weight of the phosphate groups.
(54 x n) accounts for the hydration of the DNA (approximately three water molecules per nucleotide as a rule of thumb).
(23 x (n - 1)) accounts for the sodium cations associated with the phosphate groups (note: if the DNA is an ammonium salt this is (17 x (n - 1))).
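The formula and the ug/ul conversion can be sketched in a few lines of Python. The function names are made up for this example, and the code assumes that the phosphate and counterion terms apply to n - 1 groups (one per internucleotide linkage).

```python
from collections import Counter

# Per-base weights taken from the formula above.
BASE_WEIGHT = {"A": 251, "T": 245, "G": 267, "C": 230}


def oligo_mol_wt(seq, counterion=23):
    """Approximate molecular weight (g/mol) of a single-stranded DNA oligo.

    counterion is 23 for the sodium salt or 17 for the ammonium salt.
    Assumption: phosphate and counterion terms are counted n - 1 times.
    """
    seq = seq.upper()
    unknown = set(seq) - set(BASE_WEIGHT)
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    n = len(seq)
    counts = Counter(seq)
    bases = sum(BASE_WEIGHT[b] * counts[b] for b in BASE_WEIGHT)
    return bases + 61 * (n - 1) + 54 * n + counterion * (n - 1) + 2


def ug_per_ul_to_pmol_per_ul(conc_ug_per_ul, mol_wt):
    """Convert a mass concentration in ug/ul to pmol/ul."""
    # ug/ul divided by g/mol is umol/ul; 1 umol = 1e6 pmol.
    return conc_ug_per_ul / mol_wt * 1e6


mw = oligo_mol_wt("ACGT")
print(mw, ug_per_ul_to_pmol_per_ul(1.0, mw))
```

Because this formula includes water of hydration and counterions, it gives a larger value than calculators that report the anhydrous free-acid weight.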
32. If an A260 is given then you have to work out the extinction coefficient for the oligomer. The nucleotide millimolar extinction coefficients at 260 nm are:
dA = 15.4
dT = 8.7
dG = 11.5
dC = 7.5
and to do the calculation you would use the following:
E260 = [(8.7 x nT) + (7.5 x nC) + (11.5 x nG) + (15.4 x nA)] x 0.9
Where the 0.9 accounts for base stacking. Use 0.8 if you have double-stranded areas.
Here is an extinction coefficient calculator:
www.scripps.edu/mb/gottesfeld/ExtCoeff.html
and a set of conversion tables for both RNA and DNA:
http://www.ambion.com/techlib/append/na_mw_tables.html
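As a rough sketch, the E260 sum above can be written as a short function. The function name is invented for this example; the per-base values and the stacking factor are the ones listed above, and the result is in the same units as those per-base values.

```python
# Per-base extinction coefficients at 260 nm, copied from the list above.
EXT_260 = {"A": 15.4, "T": 8.7, "G": 11.5, "C": 7.5}


def e260(seq, stacking=0.9):
    """Sum the per-base coefficients and apply the base-stacking factor.

    Use stacking=0.8 if the oligo has double-stranded regions.
    Raises KeyError for any character other than A, C, G or T.
    """
    return stacking * sum(EXT_260[base] for base in seq.upper())


print(e260("ACGT"))
```

This simple per-base sum ignores nearest-neighbour effects, so online calculators that model them may return a somewhat different value for the same sequence.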
33. 70 mer #1:
9632407_261 (deRisi designation) TGCGTGAGGGCGGAGGCGTACATGCCGCAAATGTCGTAAACATAGATGGGCTCCGAGAAGATGCCGATGT
13 x T
25 x G
14 x C
18 x A
The Scripps calculator gives a mass of 21927 daltons and a millimolar extinction coefficient of 800.6. Since I said that I was measuring an A260 of 20 from a 1:1000 dilution, the concentration of the stock is 24981.26 µM (the same as pmol/µl). The Tm is 94.7C, based on a calculation of Tm = 81.5° + 0.41°(%GC) - 675°/length of oligo.
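The stock concentration and the Tm quoted for this 70-mer can be reproduced with the sketch below. The function names are invented for this example, and a 1 cm path length is assumed for the A260 reading.

```python
OLIGO = "TGCGTGAGGGCGGAGGCGTACATGCCGCAAATGTCGTAAACATAGATGGGCTCCGAGAAGATGCCGATGT"


def stock_conc_uM(a260, dilution, ext_mM, path_cm=1.0):
    """Stock concentration in uM from a diluted A260 reading (Beer-Lambert).

    ext_mM is the millimolar extinction coefficient; path_cm is assumed
    to be 1 cm unless stated otherwise.
    """
    conc_mM = a260 * dilution / (ext_mM * path_cm)
    return conc_mM * 1000.0


def tm_gc(seq):
    """Tm = 81.5 + 0.41 * (%GC) - 675 / length, in degrees C."""
    seq = seq.upper()
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
    return 81.5 + 0.41 * gc_percent - 675.0 / len(seq)


print(round(stock_conc_uM(20, 1000, 800.6), 2), round(tm_gc(OLIGO), 1))
```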
34. The paper itself says that oligonucleotides were resuspended at 50pmol/ul, so we would not have to do the calculation in this case, but I wanted to give an example for those papers that use ug/ul or some such units, common when PCR products are used as starting material.
35. If you go to the .gpr output of the results you will find that there is a header on the file describing the features:
ATF 1.0
27 43
"Type=GenePix Results 1.4"
"DateTime=2002/06/08 12:02:21"
"Settings=C:\Documents and Settings\Dave\Desktop\Good single virus hybs\VirochipP3-183.gps"
"GalFile=C:\DaveWang\GAL files\VirochipP3.gal"
"Scanner=GenePix 4000B [83872]"
"Comment="
"PixelSize=5"
"ImageName=635 nm 532 nm"
"FileName=C:\Documents and Settings\Dave\Desktop\Good single virus hybs\VirochipP3-183_635_nm.tif C:\Documents and Settings\Dave\Desktop\Good single virus hybs\VirochipP3-183_532_nm.tif"
"PMTVolts=600 530"
"ScanPower=100 100"
"FocusPosition=0"
"NormalizationFactor:RatioOfMedians=1.21422"
"NormalizationFactor:RatioOfMeans=1.16413"
"NormalizationFactor:MedianOfRatios=1.14034"
"NormalizationFactor:MeanOfRatios=1.16571"
"NormalizationFactor:RegressionRatio=0.992873"
"JpegImage=C:\Documents and Settings\Dave\Desktop\Good single virus hybs\ForWebsite\Adeno_183.jpg"
"RatioFormulation=W1/W2 (635 nm/532 nm)"
"Barcode="
"ImageOrigin=1800, 13240"
"JpegOrigin=1955, 13330"
"Creator=GenePix Pro 3.0.6.89"
"Temperature=29.07"
"LaserPower=3.43 1.6"
"LaserOnTime=49016 49164
36. "Block" "Column" "Row" "Name" "X" "Y" "Dia."
1 1 1 "Human Pooled" 2275 13725 135
1 2 1 "Human Pooled" 2490 13720 130
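A header like this can be read programmatically. The sketch below assumes the usual ATF layout: the second line holds the number of header records and the number of data columns (27 and 43 in the file above), each header record is a quoted key=value pair, and the column names are tab-delimited. The function name and the miniature sample are made up for illustration.

```python
import io

# A made-up, tab-delimited miniature of a GenePix results file.
SAMPLE = (
    'ATF\t1.0\n'
    '2\t3\n'
    '"Type=GenePix Results 1.4"\n'
    '"PixelSize=5"\n'
    '"Block"\t"Column"\t"Row"\n'
    '1\t1\t1\n'
)


def read_atf_header(handle):
    """Return (settings, column_names) from an open ATF/GPR text stream."""
    if not handle.readline().startswith("ATF"):
        raise ValueError("not an ATF file")
    n_header, n_columns = (int(x) for x in handle.readline().split())
    settings = {}
    for _ in range(n_header):
        record = handle.readline().strip().strip('"')
        key, _sep, value = record.partition("=")
        settings[key] = value
    columns = [name.strip('"') for name in handle.readline().strip().split("\t")]
    if len(columns) != n_columns:
        raise ValueError("column count does not match the ATF header")
    return settings, columns


settings, columns = read_atf_header(io.StringIO(SAMPLE))
print(settings["PixelSize"], columns)
```

After this call the stream is positioned at the first data row, so the remaining lines can be handed to a tab-delimited reader.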
37. The diameter is in microns. Together with the length of the oligo, this can be used to calculate the concentration, if 50 pmol has been attached (but has it?) to an area of A = πr².
A = 3.14159 x (67 µm x 67 µm) = 14102.6 µm². This is 1.41 x 10^-4 cm² (1 cm² = 10^8 µm²).
The length of the 70-mer must be estimated: the allyl linker is ~1 nm, the rise of A-form DNA is ~0.255 nm/base and the rise of B-form DNA is ~0.34 nm/base, so the 70-mer itself is 17.85 nm to 23.80 nm long and the total length with the linker is around 190 to 250 angstroms.
The ERV (the effective reaction volume from the Benight paper, needed because the oligo is tethered) is then a volume calculation of area x height: we have
(1.41 x 10^-4 cm² x 0.0000025 cm) = 3.525 x 10^-10 ml or 3.525 x 10^-7 µl for the volume
38. Concentration:
How much oligo was actually spotted? It turns out you have to delve into yet another protocol from the deRisi lab to figure this out: the printer puts down 1-2 nl per spot. So we are depositing 0.05 - 0.10 pmol of oligonucleotide.
1 x 10^-13 mol / 3.525 x 10^-13 l = 0.28 M
This is only true if all of the oligo is available for binding. No efficiency data are given. This is very high compared to what we saw for the Benight paper.
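The area, volume and concentration arithmetic is easy to get wrong by a power of ten, so here is a minimal sketch with every unit conversion written out (1 µm = 1e-4 cm, 1 nm = 1e-7 cm, 1 cm³ = 1 ml = 1e-3 l). The function names are invented for this example; the 134 µm diameter (radius 67 µm), the 25 nm height and the 0.10 pmol deposit are the values used above.

```python
import math


def spot_area_cm2(diameter_um):
    """Area of a circular spot in cm^2 from its diameter in microns."""
    radius_cm = (diameter_um / 2.0) * 1e-4  # 1 um = 1e-4 cm
    return math.pi * radius_cm ** 2


def effective_volume_l(area_cm2, height_nm):
    """Effective reaction volume (area x height) in litres."""
    height_cm = height_nm * 1e-7  # 1 nm = 1e-7 cm
    return area_cm2 * height_cm * 1e-3  # cm^3 (ml) to litres


def tethered_conc_molar(amount_pmol, volume_l):
    """Molar concentration if all of the deposited oligo is available."""
    return amount_pmol * 1e-12 / volume_l


area = spot_area_cm2(134)
volume = effective_volume_l(area, 25)
print(area, volume, tethered_conc_molar(0.10, volume))
```

The concentration scales inversely with the assumed layer height, so using the shorter A-form estimate gives a proportionally higher value.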
39. We want to be sure that the structure of this oligo is fairly close to a random coil. To check this we can use MFOLD (http://www.bioinfo.rpi.edu/applications/mfold/old/cgi-bin/nph-mfold-3.1-new.cgi).
You can also use the DNA hybridization prediction algorithm from SantaLucia (http://ozone.chem.wayne.edu/); the server itself is at http://ozone2.chem.wayne.edu/Hyther/hytherinst.html.
40. Structure 1 Ad17-1_63 dG = -0.42 dH = -41.6 dS = -122.6 Tm = 66.3
41. Structure 1 Ad17_1_63_rc dG = -0.01 dH = -39.2 dS = -116.6 Tm= 63.0
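The folding numbers for the two probes can be sanity-checked with dG = dH - T dS and, for a unimolecular fold, Tm = dH/dS. This sketch assumes that dH and dG are in kcal/mol, that dS is in cal/(mol K), and that the folding temperature was 63C; none of these are stated on the output lines themselves, and the function names are invented for this example.

```python
def fold_dg(dh_kcal, ds_cal, temp_c):
    """Free energy of folding (kcal/mol) at temp_c: dG = dH - T * dS."""
    return dh_kcal - (temp_c + 273.15) * ds_cal / 1000.0


def fold_tm_c(dh_kcal, ds_cal):
    """Melting temperature (C) of a unimolecular fold, where dG = 0."""
    return dh_kcal * 1000.0 / ds_cal - 273.15


for name, dh, ds in [("Ad17-1_63", -41.6, -122.6), ("Ad17_1_63_rc", -39.2, -116.6)]:
    print(name, round(fold_dg(dh, ds, 63.0), 2), round(fold_tm_c(dh, ds), 1))
```

Under these assumptions the computed values agree with the reported dG and Tm to within rounding, and a dG this close to zero at 63C means the folded structure is only marginally stable at that temperature.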
42. It is not straightforward to determine the hybridization conditions from the paper, but if you go to the DeRisi lab Web page a series of protocols is listed (http://derisilab.ucsf.edu/) from which you get standard hybridization conditions of 2.5X SSC and 63C. The sodium concentration is ~0.4M.
Note: 20X SSC is a solution formulated for use in nucleic acid hybridizations and transfer applications. SSC is used in concentrations ranging from 0.2X to 20X depending on the individual application. 20X SSC contains 3.0 M NaCl and 0.3 M sodium citrate.
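The working concentrations follow from a simple dilution of the 20X stock. A minimal sketch, with a function name invented for this example:

```python
# Composition of 20X SSC in mol/l, from the note above.
SSC_20X = {"NaCl": 3.0, "sodium citrate": 0.3}


def ssc_molarity(strength_x):
    """Molar concentrations of each component at a given SSC strength."""
    factor = strength_x / 20.0
    return {solute: conc * factor for solute, conc in SSC_20X.items()}


print(ssc_molarity(2.5))
```

At 2.5X this gives 0.375 M NaCl, which is where the figure of roughly 0.4 M sodium comes from; the sodium contributed by the citrate is not counted in this sketch.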
Hybridizations proceed from 6-12 hours “depending on the complexity and concentration of the samples” but specifics are not given.
43. There are a variety of strategies that can be used to amplify genomic DNA. The deRisi paper uses degenerate primers and attempts to amplify all genomic DNA. It is unlikely that this works equally well for all sequences, and the quality of the amplification is very hard to assess. Another strategy is to amplify a specific target region. One such region is described by the sequence below:
6481 gcggacccca cggcatggga tgcgtgaggg cggaggcgta catgccgcaa atgtcgtaaa
6541 catagatggg ctccgagaag atgccgatgt tggtgggata acagcgcccc ccgcggatgc
6601 tggcgcgcac gtattcatac aactcgtgcg
44. The oligo covers positions 6501-6570: TGCGTGAGGGCGGAGGCGTACATGCCGCAAATGTCGTAAACATAGATGGGCTCCGAGAAGATGCCGATGT
And the reverse complement is
ACATCGGCATCTTCTCGGAGCCCATCTATGTTTACGACATTTGCGGCATGTACGCCTCCGCCCTCACGCA
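Both the reverse complement and the position of the oligo within the target region can be checked with a short script. The function names are invented for this example; the sequences are copied from the slides above.

```python
OLIGO = "TGCGTGAGGGCGGAGGCGTACATGCCGCAAATGTCGTAAACATAGATGGGCTCCGAGAAGATGCCGATGT"

# Target region starting at position 6481, with the spacing removed.
TARGET = (
    "gcggacccca cggcatggga tgcgtgaggg cggaggcgta catgccgcaa atgtcgtaaa "
    "catagatggg ctccgagaag atgccgatgt tggtgggata acagcgcccc ccgcggatgc "
    "tggcgcgcac gtattcatac aactcgtgcg"
).replace(" ", "")

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def locate(target, oligo, first_position=1):
    """Return the (start, end) coordinates of oligo in target, or None."""
    index = target.upper().find(oligo.upper())
    if index < 0:
        return None
    start = first_position + index
    return start, start + len(oligo) - 1


print(reverse_complement(OLIGO))
print(locate(TARGET, OLIGO, first_position=6481))
```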
45. Structure 1 Ad17_1_target dG = -2.90 dH = -164.7 dS = -481.3 Tm = 69.0
46. It is going to be PCR amplified, so you will get both strands (be sure to take this into account when calculating concentrations – for PCR products this is usually done by weight, ie mg/ml). In this case the strategy of having both oligo and reverse complement on the array means that both PCR products have a target on the array. There will still be some solution-level competition between the product strands and fixed oligos, of course.
If you use the hybridization server, you will need the sequence
gcggacccca cggcatggga tgcgtgaggg cggaggcgta catgccgcaa atgtcgtaaa catagatggg ctccgagaag atgccgatgt tggtgggata acagcgcccc ccgcggatgc tggcgcgcac gtattcatac aactcgtgcg
dddddddddd dddddddddd acgcactccc gcctccgcat gtacggcgtt tacagcattt gtatctaccc gaggctcttc tacggctaca dddddddddd dddddddddd dddddddddd dddddddddd dddddddddd dddddddddd
note: the d is used to indicate a dangling end
47. There is a bug in HyTher so that long dangling ends are not accepted. If you just look at what happens in the complementary region you get:
If the target is at 1 x 10^-7 M, in 0.4000 M NaCl and 0.0000 M MgCl2: ΔH° = -578.90 kcal/mol, ΔS° = -1563.01 eu, ΔG°(63.0) = -53.49 kcal/mol, Tm = 94.5 °C
You can apply a correction for loss of free diffusion, and include single dangling ends:
In 0.4000 M NaCl and 0.0000 M MgCl2, if the target is at 1 x 10^-7 M: ΔH° = -555.80 kcal/mol, ΔS° = -1523.98 eu, ΔG°(63.0) = -43.51 kcal/mol, Tm = 88.8 °C
In 0.4000 M NaCl and 0.0000 M MgCl2, if the target is at 1 x 10^-6 M: ΔH° = -555.80 kcal/mol, ΔS° = -1523.98 eu, ΔG°(63.0) = -43.51 kcal/mol, Tm = 88.8 °C
So, the position of equilibrium does not change, but the rate at which you reach it will.
48. Rates are figured empirically – you can make some assumptions by extrapolating from the Benight or Southern paper. The take-home message should be a definition of the conditions under which measurable hybridization could occur.
To actually calculate this you would need to plot the concentration of the complex vs time over a range of concentrations and times; the equilibrium constant could then be obtained by extrapolating back through the binding curve to find the concentration of the complex at time zero. You can estimate it from the free energy equation, ΔG(binding) = -RT ln Ka.
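As a minimal sketch of that last estimate, the free energy relation can be inverted to give Ka directly. The function name is invented for this example, and the free energy is assumed to be in kcal/mol at the hybridization temperature of 63C.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def association_constant(dg_kcal, temp_c):
    """Equilibrium association constant from dG = -RT ln Ka."""
    temp_k = temp_c + 273.15
    return math.exp(-dg_kcal / (R_KCAL * temp_k))


# Free energy with the diffusion and dangling-end corrections, at 63C.
ka = association_constant(-43.51, 63.0)
print(f"{ka:.2e}")
```

Because Ka depends exponentially on the free energy, an error of a few kcal/mol changes the estimate by orders of magnitude, so it is best treated as a rough guide rather than a measured constant.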
I am still looking for a paper that does binding kinetic studies of 70-mers.