Automated Discovery in Biological Sciences. Erika Timar. Central Dogma of Molecular Biology. Definition of Gene. The fundamental physical and functional unit of heredity, responsible for specific traits such as eye color
Automated Discovery in Biological Sciences Erika Timar
Definition of Gene • The fundamental physical and functional unit of heredity, responsible for specific traits such as eye color • A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule) • A segment of DNA that is involved in producing a polypeptide chain; it can include regions preceding and following the coding DNA as well as introns between the exons • The functional unit of DNA (deoxyribonucleic acid). Genes are segments of chromosomes found in the nucleus of cells. This hereditary information usually directs the formation of a protein • A natural unit of the hereditary material, which is the physical basis for the transmission of the characteristics of living organisms from one generation to another
Causes of the rising need for Computational Discovery in Biology • Recent technologies have caused an exponential increase in the information available • PCR • Microarrays
Microarray • GREEN represents Control DNA, from normal tissue, which is hybridized to the target DNA. • RED represents Sample DNA, from diseased tissue, which is hybridized to the target DNA. • YELLOW represents a combination of Control and Sample DNA, where both hybridized equally to the target DNA. • BLACK represents areas where neither the Control nor Sample DNA hybridized to the target DNA.
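The color reading above can be sketched as a small classifier over the two channel intensities. This is only an illustration: the function name `classify_spot`, the `background` cutoff, and the two-fold `fold` threshold are assumptions chosen for the example, not values from the slides.

```python
import math

def classify_spot(control, sample, background=100.0, fold=2.0):
    """Return 'black', 'green', 'red', or 'yellow' for one microarray spot.

    control and sample are hypothetical fluorescence intensities for the
    green (control) and red (sample) channels. The thresholds are
    illustrative assumptions.
    """
    if control < background and sample < background:
        return "black"   # neither control nor sample DNA hybridized
    # A pseudocount of 1 avoids log(0) when one channel is empty.
    log_ratio = math.log2((sample + 1.0) / (control + 1.0))
    limit = math.log2(fold)
    if log_ratio >= limit:
        return "red"     # sample DNA dominates
    if log_ratio <= -limit:
        return "green"   # control DNA dominates
    return "yellow"      # both hybridized about equally

print(classify_spot(200.0, 2000.0))
```

Working on the log ratio keeps the test symmetric: a two-fold excess of sample and a two-fold excess of control sit at the same distance from zero.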
Current Computational Problems in Biology • Gene detection • Gene function • Protein structure • Protein function • Evolutionary relationships • Biomolecular pathways
Probabilistic Methods • Identifying gene modules and gathering information from them • Module Networks of regulation • Conditional expression modules in cancer • Evolutionarily conserved networks
Identifying regulatory modules • Goal- predict the functions of regulators, their targets, and the conditions under which this regulation occurs • Regulatory module- a set of genes that are regulated in concert as a function of the expression level of a small set of regulators • Regulation in biology is diverse
Pathways can be highly complex • http://www.biocarta.com/genes/PathwayGeneSearch.asp?geneValue=g
Module Networks: identifying regulatory modules • Input- a gene expression data set of 2355 genes in 137 arrays and a large precompiled set of candidate regulatory genes for Saccharomyces cerevisiae (yeast), as used by Segal et al.
Process • The algorithm searches for a partition of the genes into modules and for a regulation program for each module • Iterative procedure with 2 steps • Search for a regulation program for each module • Reassign each gene to the module whose regulation program best fits the gene's data • A Bayesian score evaluates the fit, and an Expectation Maximization algorithm searches for the model with the highest score
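The two-step iteration can be sketched as follows. This is a deliberately simplified stand-in, not the method of Segal et al.: the regulation program is replaced by a module's mean expression profile, the Bayesian score by squared error, and genes are assigned to exactly one module. All function names are hypothetical.

```python
def fit_programs(expr, assign, k):
    """Step 1 stand-in: one mean expression profile per module."""
    n_arrays = len(next(iter(expr.values())))
    programs = []
    for m in range(k):
        members = [expr[g] for g in expr if assign[g] == m]
        if members:
            programs.append([sum(col) / len(members) for col in zip(*members)])
        else:
            programs.append([0.0] * n_arrays)  # empty module: flat profile
    return programs

def sq_error(profile, program):
    """Fit stand-in: squared distance between a gene and a module profile."""
    return sum((x - y) ** 2 for x, y in zip(profile, program))

def learn_modules(expr, assign, k, max_iter=50):
    """Alternate the two steps until no gene changes module."""
    assign = dict(assign)
    for _ in range(max_iter):
        programs = fit_programs(expr, assign, k)
        # Step 2: move each gene to the module that fits it best.
        new_assign = {
            g: min(range(k), key=lambda m: sq_error(expr[g], programs[m]))
            for g in expr
        }
        if new_assign == assign:
            break
        assign = new_assign
    return assign

expr = {
    "g1": [1.0, 1.0, 5.0, 5.0],
    "g2": [1.1, 0.9, 5.2, 4.8],
    "g3": [5.0, 5.0, 1.0, 1.0],
    "g4": [4.9, 5.1, 0.8, 1.2],
}
start = {"g1": 0, "g2": 0, "g3": 0, "g4": 1}
print(learn_modules(expr, start, k=2))
```

The result depends on the starting partition, as with any procedure of this alternating kind, so in practice several initializations would be compared.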
Results • From the stress data set, the program inferred 50 modules, which were then evaluated using external data sources to ensure that the gene products and regulation products were correct • In addition, three hypotheses about uncharacterized regulators were examined and validated using experiments followed by statistical analysis
Cancer • Typical cause- malfunction in the cell’s regulatory ability • Prevention of cell death • Over-proliferation • Finding similarities gives targets for medications
Conditional activity of expression modules in cancer • Input- cancer compendium of 1975 microarrays containing 14145 genes and spanning 22 tumor types • Preprocessing- division into gene sets • Process- statistical analysis of gene-set pairs followed by hierarchical clustering; the clusters are tested for consistency and then used to infer modules
Results • Identification of 456 modules spanning different processes and functions, including • Similarities in hematologic tumors and hepatocellular carcinoma • Acute leukemia • Osteoblastic tumors- tumor proliferation and metastasis
Conserved Genetic Modules • Input- 3182 microarrays from humans, flies, worms and yeast • Process- orthologs identified with BLAST to define metagenes • Co-expression of metagene pairs computed statistically • All paired metagenes combined into a network • Results- the network contained 3416 metagenes connected by 22163 expression interactions, which were confirmed through other statistical and laboratory means
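The pairwise co-expression step can be illustrated with a small sketch. It assumes Pearson correlation and a fixed cutoff as the co-expression statistic; the slides do not say which statistic the study used, so `pearson`, `coexpression_network`, the `threshold` value, and the metagene names are illustrative only.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sx * sy)

def coexpression_network(profiles, threshold=0.9):
    """Link every metagene pair whose correlation reaches the threshold."""
    edges = set()
    for a, b in combinations(sorted(profiles), 2):
        if pearson(profiles[a], profiles[b]) >= threshold:
            edges.add((a, b))
    return edges

profiles = {
    "MEG1": [1.0, 2.0, 3.0, 4.0],
    "MEG2": [2.0, 4.0, 6.0, 8.5],
    "MEG3": [4.0, 3.0, 2.0, 1.0],
}
print(coexpression_network(profiles))
```

Each edge is stored with its two names in sorted order, so a pair appears once regardless of the order in which the profiles were supplied.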
Protein Function • Approaches • Sequence Classification • Nearest neighbor • Motif (amino acid sequences) • Groups of motifs called fingerprints • Profiles- position scoring based on HMM or MSA • Structural Classification • Tools • Local multiple sequence alignment – MEME • Combinatorial approach
Discovery of Motif-based Protein Function Classifiers • A data-driven approach using machine learning to discover rules for assigning protein sequences to functional families on the basis of the presence or absence of specific motifs or combinations of motifs.
Method • Input- Prosite and MEME protein data, split into training and test sets (80% used to train) • Process- a family of decision tree induction algorithms creates a decision tree that is then translated into rules • Uses a greedy procedure discussed in class • Post-pruning compensates for any overfitting that may have occurred
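One greedy step of decision tree induction can be sketched as choosing the motif whose presence or absence best separates the families. The sketch assumes information gain as the splitting criterion, which is common in this family of algorithms but is not confirmed by the slides; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of family labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, motif):
    """Entropy reduction from splitting on presence/absence of a motif."""
    present = [l for r, l in zip(rows, labels) if motif in r]
    absent = [l for r, l in zip(rows, labels) if motif not in r]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remainder

def best_motif(rows, labels, motifs):
    """Greedy choice: the motif with the highest information gain."""
    return max(motifs, key=lambda m: information_gain(rows, labels, m))

# Each row is the set of motifs found in one protein sequence.
rows = [{"m1", "m2"}, {"m1"}, {"m2"}, set()]
labels = ["kinase", "kinase", "other", "other"]
print(best_motif(rows, labels, ["m1", "m2"]))
```

A full tree would apply the same choice recursively to the "present" and "absent" subsets, then prune; each root-to-leaf path reads directly as a rule over motif presence and absence.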
Results • Results measured in terms of accuracy, precision and recall • MEME- single-best better in precision and comparable in accuracy but worse in recall • Prosite- followed the same pattern as MEME but did not fit as well • MEME-based decision trees outperform Prosite • Clans outperform single best motifs • The program could group functionally important structures based on combinations of motifs
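For reference, the three measures can be computed per family as in the sketch below, using their standard definitions. The function name `evaluate` and the toy labels are illustrative and are not data from the study.

```python
def evaluate(predicted, actual, family):
    """Return (accuracy, precision, recall) for one functional family."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == family and a == family for p, a in pairs)  # true positives
    fp = sum(p == family and a != family for p, a in pairs)  # false positives
    fn = sum(p != family and a == family for p, a in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                           # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

predicted = ["kinase", "kinase", "other", "other", "kinase"]
actual = ["kinase", "other", "kinase", "other", "kinase"]
print(evaluate(predicted, actual, "kinase"))
```

Precision asks how many sequences assigned to the family really belong to it; recall asks how many true family members were found. A classifier can therefore be better in precision yet worse in recall, as reported for the single-best motif.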
General Automated Discovery • Goal- develop an autonomous discovery system that peruses large collections of data to find hypotheses that are interesting enough to warrant the expenditure of laboratory resources and subsequent publication. • HAMB- prototype discovery program with domain-independent heuristics that guide the program’s choice of relationships in data that are potentially interesting
HAMB • an agenda- and justification-based framework • consists of an agenda of tasks prioritized by their plausibility • RL- an inductive generalization program that generates plausible hypotheses • Each task has justifications called reasons, and each reason must have a strength • Tasks are performed using heuristics
Algorithm • Discovery cycle- loop (top-level control): (1) calculate the plausibilities of the tasks (2) select the task with the greatest plausibility (3) perform the task • At the end of each iteration of this loop (called a discovery cycle), a stopping condition is checked; the loop stops when • the plausibility of all tasks on the agenda falls below a user-specified threshold • or the number of completed discovery cycles exceeds a user-defined threshold • In addition, deadlocks are looked for and, if found, broken by proceeding to the next most interesting task
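The discovery cycle can be sketched as a loop over an agenda. This is a minimal illustration under stated assumptions, not HAMB's implementation: plausibility is assumed to be the sum of a task's reason strengths, a performed task may propose new tasks, deadlock handling is omitted, and the names `Task`, `plausibility`, and `run_discovery` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Task:
    name: str
    reasons: List[float]  # strengths of the reasons justifying the task
    action: Optional[Callable[[], List["Task"]]] = None  # may propose tasks

def plausibility(task):
    # Assumption: plausibility is the sum of the reason strengths.
    return sum(task.reasons)

def run_discovery(agenda, min_plausibility=0.5, max_cycles=100):
    """Run discovery cycles and return the names of the performed tasks."""
    agenda = list(agenda)
    performed = []
    cycles = 0
    while agenda:
        best = max(agenda, key=plausibility)   # steps (1) and (2)
        agenda.remove(best)
        if best.action is not None:            # step (3): perform the task
            agenda.extend(best.action())
        performed.append(best.name)
        cycles += 1
        # Stopping condition, checked at the end of each cycle.
        if cycles >= max_cycles:
            break
        if all(plausibility(t) < min_plausibility for t in agenda):
            break
    return performed

agenda = [
    Task("examine-attribute", [0.4, 0.3]),
    Task("induce-rules", [0.9],
         action=lambda: [Task("report-rule", [0.6])]),
    Task("low-priority", [0.1]),
]
print(run_discovery(agenda))
```

Because the stopping condition is checked at the end of a cycle, the first task always runs. Tasks whose plausibility stays under the threshold, such as `low-priority` here, are never performed.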
X-ray Crystallography • a technique in which the pattern produced by the diffraction of x-rays through the closely spaced lattice of atoms in a crystal is recorded and then analyzed to reveal the nature of that lattice • Image- crystallized DNA micrograph (Davidson/FSU)
Attributes of Macromolecules • The attributes in our augmented dataset include: • macromolecular properties- macromolecule name, macromolecule-class name, and molecular weight; • experimental conditions- pH, temperature, crystallization method, macromolecular concentration, and concentrations of chemical additives in the growth medium; • characteristics of the grown crystal (if any)- descriptors of the crystal’s shape (for example, crystal-form and space-groups-description) and its diffraction-limit (which measures how well the crystal diffracts x-rays).
Verification • Some information in categories II and III is not novel. • It is still interesting, because some of the discoveries are known techniques in X-ray crystallography, which verifies the discoveries made by HAMB
Heuristics • The general heuristics in HAMB can be divided into three classes: (1) heuristics that select rule-induction targets and other goals worth pursuing, (2) heuristics that keep an item’s properties and relationships sufficiently up-to-date, (3) heuristics that reference domain-specific properties to improve the quality of reported discoveries.
Other applications and evaluations • Another study, carried out on 930 cases of patients in rehabilitation after a medical disability such as stroke or amputation, also showed promising results • Extensive evaluation of the features of HAMB was carried out by Livingston et al. • Domain-independent heuristics and user-modified parameters allow the flexibility needed for biological discovery