Bioinformatics Search Engines
Definition of Search Engines • A program that searches documents for specified keywords and returns a list of the documents where the keywords were found. • It works by sending out a spider to fetch as many documents as possible. Another program, called an indexer, then reads these documents and creates an index based on the words contained in each document.
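The indexer step described above can be sketched as an inverted index: a map from each word to the documents that contain it. This is a minimal illustration, not how any particular search engine is implemented; the function names (`build_index`, `search`) and the sample documents are made up for the example, and the spider step is replaced by a hard-coded dictionary.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Indexer: map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every keyword in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical documents standing in for pages fetched by a spider.
docs = {
    "doc1": "BLAST finds regions of local similarity",
    "doc2": "Ensembl provides genome annotation",
    "doc3": "Local alignment of protein sequences",
}
index = build_index(docs)
print(search(index, "local"))
print(search(index, "local similarity"))
```

Looking a keyword up in the index avoids rescanning every document for each query, which is the reason the index is built ahead of time.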
Bioinformatics Tools • The Bioinformatics Tools may be categorized into the following categories: • Homology and Similarity Tools. • Structural Analysis. • Protein Function Analysis. • Sequence Analysis.
Homology and Similarity Tools • The term homology implies a common evolutionary relationship between two traits. • Homologous sequences are sequences that are related by divergence from a common ancestor.
Homology and Similarity Tools • Thus the degree of similarity between two sequences can be measured while their homology is a case of being either true or false. • This set of tools can be used to identify similarities between novel query sequences of unknown structure and function and database sequences whose structure and function have been elucidated.
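To make the point that similarity is a measurable quantity concrete, here is a small sketch that computes percent identity between two sequences that have already been aligned. The function name and the convention of skipping gapped columns are assumptions made for this example; alignment tools differ in how they define the denominator.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the aligned columns that contain no gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gapped columns are skipped (an assumption of this sketch)
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("ACGTACGT", "ACGTTCGT"))  # 7 of 8 columns match
```

A high percent identity is evidence for homology, but homology itself remains a yes/no statement about common ancestry.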
Structural Analysis • This set of tools allows you to compare structures with the known structure databases. The function of a protein is more directly a consequence of its structure than of its sequence, with structural homologs tending to share functions. • The determination of a protein's 2D/3D structure is crucial in the study of its function.
Protein Function Analysis • Function analysis is the identification and mapping of all functional elements (both coding and non-coding) in a genome. • This group of programs allows you to compare your protein sequence to the secondary (or derived) protein databases that contain information on motifs, signatures and protein domains.
Protein Function Analysis • Highly significant hits against these different pattern databases allow you to approximate the biochemical function of your query protein.
Sequence Analysis • This set of tools allows you to carry out further, more detailed analysis on your query sequence, including evolutionary analysis and the identification of mutations, hydropathy regions, CpG islands and compositional biases. These and other biological properties are all clues that help elucidate the specific function of your sequence.
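Two of the properties listed above, compositional bias and CpG islands, are commonly summarized by GC content and by the ratio of observed to expected CpG dinucleotides. The sketch below computes both for a single sequence. The function names are illustrative, and real CpG island finders apply these measures over sliding windows with length and threshold criteria that are not shown here.

```python
def gc_content(seq):
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


example = "CGCGCGTATA"  # made-up sequence for illustration
print(round(gc_content(example), 2))
print(round(cpg_obs_exp(example), 2))
```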
Biological Search Engines • NCBI: National Center for Biotechnology Information. • The importance of NCBI is that it helps us understand the elegant language of living cells (DNA sequence).
NCBI • Established in 1988. • NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information - all for the better understanding of molecular processes affecting human health and disease.
NCBI • The NCBI has been charged with creating automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics. • Facilitating the use of such databases and software by the research and medical community.
NCBI • Coordinating efforts to gather biotechnology information both nationally and internationally; and performing research into advanced methods of computer-based information processing for analyzing the structure and function of biologically important molecules.
NCBI Tools (http://www.ncbi.nlm.nih.gov/) • Amino Acid Explorer: This tool allows users to explore the characteristics of amino acids by comparing their structural and chemical properties, predicting protein sequence changes caused by mutations, viewing common substitutions, and browsing the functions of given residues in conserved domains.
NCBI Tools • Map Viewer: NCBI's Map Viewer can be used to visualize an organism’s genome and annotation. • The organisms represented in the Map Viewer include human, mouse, rat, zebrafish, mosquito, fruit fly, yeast and others.
NCBI Tools • Coffee Break: is a resource at NCBI that combines reports on recent biomedical discoveries with use of NCBI tools. The result is an interactive tutorial that tells a biological story. Each report is based on a discovery reported in one or more articles from the recently published peer-reviewed literature. The report focuses on how a molecular understanding can provide explanations of observed biology and lead to therapies for diseases.
NCBI Tools • BLAST: Basic Local Alignment Search Tool: • Finds regions of local similarity between biological sequences. • Compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.
NCBI Tools • BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families. • For Proteins. • For Microbial Genomes. • RefSeq Gene.
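To illustrate what "local similarity" means, the sketch below scores the best local alignment between two sequences with Smith-Waterman dynamic programming. This is not BLAST's algorithm: BLAST uses a faster seed-and-extend heuristic and also reports statistical significance, neither of which is shown. The scoring values (match 2, mismatch -1, gap -2) are arbitrary choices for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman, linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared region ACGT scores 4 matches x 2 = 8 despite unrelated flanks.
print(local_alignment_score("GGGACGTCCC", "TTTACGTAAA"))
```

Because scores are clipped at zero, unrelated flanking sequence does not penalize a well-matching internal region, which is what distinguishes local from global alignment.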
NCBI Tools • CD Tree is a stand-alone application for classifying protein sequences and investigating their evolutionary relationships. • CD Tree can import, analyze and update existing Conserved Domain (CDD) records and hierarchies.
NCBI Tools • Electronic PCR (e-PCR) is a computational procedure that is used to identify sequence tagged sites (STSs) within DNA sequences. • An STS is a short (200 to 500 base pair) DNA sequence that has a single occurrence in the genome and whose location and base sequence are known.
NCBI Tools • STSs can easily be detected by PCR using specific primers. • For this reason, they are useful for constructing genetic and physical maps from sequence data reported from many different laboratories.
NCBI Tools • When STS loci contain genetic polymorphisms (e.g. single nucleotide polymorphisms), they become valuable genetic markers, i.e. loci which can be used to distinguish individuals.
NCBI Tools • STSs are very helpful for detecting micro-deletions in some genes; for example, some STSs can be screened by PCR to detect micro-deletions in AZF genes in infertile men. • AZF proteins and genes: the term refers to one of several proteins, or their genes, that are encoded in the AZF region of the human male Y chromosome. • Deletions in this region are associated with an inability to produce sperm.
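The idea behind electronic PCR can be sketched as a string search: find the forward primer on the template, find the reverse complement of the reverse primer downstream of it, and report the product if its size is in the expected range. This is a simplified illustration with exact matching on one strand only; it is not NCBI's e-PCR program, which is more tolerant and more efficient. The function names, primers and template are invented for the example.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence made of A, C, G, T or N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_sts(template, fwd_primer, rev_primer, min_size=0, max_size=1000):
    """Return (start, product_size) for each exact primer-pair hit on the template."""
    template = template.upper()
    fwd = fwd_primer.upper()
    rev_site = reverse_complement(rev_primer)
    hits = []
    start = template.find(fwd)
    while start != -1:
        end = template.find(rev_site, start + len(fwd))
        while end != -1:
            size = end + len(rev_site) - start
            if min_size <= size <= max_size:
                hits.append((start, size))
            end = template.find(rev_site, end + 1)
        start = template.find(fwd, start + 1)
    return hits


# Made-up template: forward primer site, 8 bases of filler, reverse primer site.
template = "TTT" + "ACGTAC" + "GGGGGGGG" + "CTGGAA" + "TT"
print(find_sts(template, "ACGTAC", "TTCCAG"))
```

An empty result for a template that should contain the STS is the computational analogue of a failed PCR, which is how micro-deletions such as those in the AZF region show up in a screen.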
EMBL (European Molecular Biology Laboratory) • The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. • Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications.
ENA • The European Nucleotide Archive captures and presents information relating to experimental workflows that are based around nucleotide sequencing. • A typical workflow includes the isolation and preparation of material for sequencing, a run of a sequencing machine in which sequencing data are produced and a subsequent bioinformatic analysis pipeline.
ENA • ENA records this information in a data model that covers input information (sample, experimental setup, machine configuration), output machine data (sequence traces, reads and quality scores) and interpreted information (assembly, mapping, functional annotation).
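The three layers of the data model described above (input information, machine output, interpreted information) can be pictured as nested records. The sketch below is only a way to visualize that grouping. All class and field names are hypothetical and do not reproduce ENA's actual schema or submission format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InputInfo:
    """Input information: what was sequenced and how (hypothetical fields)."""
    sample: str
    experimental_setup: str
    machine_configuration: str


@dataclass
class MachineOutput:
    """Output machine data: reads and their per-base quality scores."""
    reads: List[str] = field(default_factory=list)
    quality_scores: List[List[int]] = field(default_factory=list)


@dataclass
class Interpretation:
    """Interpreted information derived by downstream analysis."""
    assembly: str = ""
    functional_annotation: List[str] = field(default_factory=list)


@dataclass
class SequencingRecord:
    """One workflow: inputs, raw output and interpretation kept together."""
    inputs: InputInfo
    output: MachineOutput
    interpretation: Interpretation


record = SequencingRecord(
    inputs=InputInfo("sample-1", "paired-end library", "default settings"),
    output=MachineOutput(reads=["ACGT"], quality_scores=[[30, 31, 32, 33]]),
    interpretation=Interpretation(assembly="contig-1"),
)
print(record.inputs.sample, len(record.output.reads))
```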
Ensembl Genome Browsers: • Ensembl Genome browser: http://www.ensembl.org. • NCBI Map Viewer: http://www.ncbi.nlm.nih.gov/mapview/ • UCSC Genome Browser: http://genome.ucsc.edu.
Ensembl's Aim • To provide annotation for the biological community that is freely available and of high quality. • Started in 2000. • Joint project between EBI and Sanger. • Funded primarily by the Wellcome Trust.
What Distinguishes Ensembl from the UCSC and NCBI Browsers? • The gene set: Automatic annotation based on mRNA and protein information. • Programmatic access via the Perl API (open source). • BioMart. • Integration with other databases (DAS). • Comparative analysis (gene trees).
Ensembl Genes: Biological Basis • All Ensembl transcripts are based on proteins and mRNAs in: • UniProt/Swiss-Prot (manually curated) and UniProt/TrEMBL: www.uniprot.org • NCBI RefSeq (manually curated): www.ncbi.nlm.nih.gov/RefSeq
Ensembl • Each gene has information attached to it describing whether it is a known gene or corresponds to a Swiss-Prot or TrEMBL protein. • Some genes are novel genes which have been built by inference from similarities to other sequences. • These novel genes won't have a corresponding Swiss-Prot or TrEMBL protein.
Ensembl • Each translation has had a variety of protein analyses conducted on it and you can access information about the results of these.
Genes and Transcripts in Ensembl • Ensembl known transcripts. • Ensembl novel transcripts. • Ensembl merged transcripts (Havana).
What annotation is available? • Gene/transcript/peptide models (coding and noncoding). • IDs in other databases. • Mapped cDNAs.
What annotation is available? • Comparative data: orthologues and paralogues, protein families, whole genome alignments. • Variation data: SNPs. • Data from external sources (DAS).
Paralogues and Orthologues • Paralogues are genes related by duplication within a genome. • Orthologues are genes in different species that evolved from a common ancestral gene by speciation. • Orthologues typically retain the same function in the course of evolution, whereas paralogues tend to evolve new functions, even if these are related to the original one.
Names in Ensembl • ENSG### Ensembl Gene ID. • ENST### Ensembl Transcript ID. • ENSP### Ensembl Peptide ID. • ENSE### Ensembl Exon ID.
Names in Ensembl • For species other than human, a species prefix is inserted after ENS: • MUS (Mus musculus) for mouse: ENSMUSG###. • DAR (Danio rerio) for zebrafish: ENSDARG###, etc.
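The naming scheme above is regular enough to parse with a short pattern. The sketch below splits an identifier into species, feature type and number. It assumes a three-letter species code and covers only the two non-human codes named above; other codes are returned unchanged, and version suffixes are not handled. The function name and the example identifiers are illustrative.

```python
import re

# Feature letters and species codes taken from the naming scheme above.
FEATURES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}
SPECIES = {"MUS": "mouse", "DAR": "zebrafish"}

# ENS, optional three-letter species code, feature letter, then digits.
ID_PATTERN = re.compile(r"^ENS([A-Z]{3})?([GTPE])(\d+)$")


def parse_ensembl_id(identifier):
    """Split an Ensembl-style ID into species, feature type and number, or return None."""
    m = ID_PATTERN.match(identifier)
    if m is None:
        return None
    code, feature, number = m.groups()
    species = "human" if code is None else SPECIES.get(code, code)
    return {"species": species, "feature": FEATURES[feature], "number": number}


print(parse_ensembl_id("ENSG00000139618"))
print(parse_ensembl_id("ENSMUSG00000017167"))
```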