Bioinformatics for Genomic and Proteomic data analysis. -- Gene Prediction. Sequence Analysis. -- Alignment techniques (BLAST, PSI-BLAST). -- Major databases and retrieval techniques. -- Predicting Function, domains etc.
Bioinformatics for Genomic and Proteomic data analysis
-- Gene Prediction
• Sequence Analysis
-- Alignment techniques (BLAST, PSI-BLAST)
-- Major databases and retrieval techniques
-- Predicting function, domains etc.
-- Finding homology between sequences, identifying repeats etc. (DOTPLOT)
-- Predicting physico-chemical properties of proteins (ProtParam)
-- Predicting signal peptides and transmembrane proteins (SignalP)
-- Phylogenetic analysis
• Structure analysis
-- Analysis of protein structure and conformation (Rasmol, SwissPDBViewer, VMD etc.)
-- Protein structure prediction: homology modeling (SwissModel, Modeller)
• Some practical applications
-- Sequence analysis
-- Structure analysis
Major Bioinformatics databases, Search engines and data formats. By: Sachin Pundhir Bioinformatics sub-centre DAVV, Indore
Database • Collection of records and files • Organized for a particular purpose • Tables • Tuples (records) • Attributes • Values
BIO520 Student Database 1998
Name | ID  | Grade
Amy  | 123 | A
Joe  | 456 | B
Sue  | 789 | C
Figure labels: Table (the whole grid) • Tuple (one row) • Attribute (one column) • Value (one cell)
Database Operations • Tables: create, delete • Tuples (records): read, write, delete • Search, sort, modify, print…
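The operations listed above can be sketched on the student table from the previous slide. This is only an illustration in plain Python, assuming an in-memory list stands in for the table; the `Student` type and variable names are made up for the example and are not part of the original slides.

```python
from typing import NamedTuple


class Student(NamedTuple):
    """One tuple (record); its attributes are name, id and grade."""
    name: str
    id: int
    grade: str


# Create: the table is a collection of tuples.
students = [
    Student("Amy", 123, "A"),
    Student("Joe", 456, "B"),
    Student("Sue", 789, "C"),
]

# Read / search: find the record whose ID attribute has the value 456.
joe = next(s for s in students if s.id == 456)

# Modify: NamedTuples are immutable, so write back an updated copy.
students[students.index(joe)] = joe._replace(grade="A")

# Sort: order the records by grade, then by name.
by_grade = sorted(students, key=lambda s: (s.grade, s.name))

# Delete: drop one tuple from the table.
students = [s for s in students if s.name != "Sue"]

# Print: show the remaining records.
for s in students:
    print(s.name, s.id, s.grade)
```

A real database system adds persistence, indexing and concurrent access on top of these basic operations.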
International Nucleotide Sequence Database Collaboration (INSDC) • Consists of DDBJ (Japan), GenBank (USA) and the EMBL Nucleotide Sequence Database (Europe). • The three databases exchange new and updated data daily to stay synchronised.
Bioinformatics databases (primary and secondary)
• Protein sequence database:
-- GenPept: Protein sequence database.
-- UniProtKB/Swiss-Prot: curated protein sequence database, minimal level of redundancy and high level of integration with other databases.
-- UniProtKB/TrEMBL: computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot.
-- RefSeq: Well curated, non-redundant database.
• Structure database:
-- PDB: Protein Data Bank
-- MMDB: Molecular Modeling Database
• Nucleotide sequence database:
-- GenBank: Nucleotide sequence database. Highly redundant.
-- DDBJ: DNA Data Bank of Japan.
-- EMBL: Nucleotide sequence database.
-- RefSeq: integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
GenBank Record • Header: information that applies to the whole record • Features: annotations on the record • Sequence
GenBank Record: Header • Locus Name • Sequence Length • Molecule Type • GenBank Division • Modification Date • Accession Number • Version Number
GenBank Record: Features • Link to sequence
GenBank Record: Sequence
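The three parts of a record (header, features, sequence) can be pulled apart with a few lines of code. The sketch below is a simplified illustration, not a full parser: the record is hand-written and hypothetical (`EXAMPLE01` is not a real accession), the function names are invented for the example, and the LOCUS line is assumed to be whitespace-separated with the division and date as its last two fields.

```python
# Hypothetical, hand-written record in GenBank flat-file layout (not a real entry).
RECORD = """\
LOCUS       EXAMPLE01                 24 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   EXAMPLE01
VERSION     EXAMPLE01.1
FEATURES             Location/Qualifiers
     source          1..24
     CDS             1..24
ORIGIN
        1 atggctagct agctagctag ctaa
//
"""


def split_record(text):
    """Split a flat-file record into header, feature and sequence lines."""
    header, features, sequence = [], [], []
    section = header
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            section = features      # feature table starts after this line
            continue
        if line.startswith("ORIGIN"):
            section = sequence      # sequence data starts after this line
            continue
        if line.startswith("//"):
            break                   # end-of-record marker
        section.append(line)
    return header, features, sequence


def parse_locus(line):
    """Read the header fields named on the slide from the LOCUS line."""
    fields = line.split()
    return {
        "locus_name": fields[1],
        "length": int(fields[2]),
        "molecule_type": fields[4],
        "division": fields[-2],
        "modification_date": fields[-1],
    }


def join_sequence(lines):
    """Drop position numbers and spaces, keeping only the residues."""
    return "".join(ch for line in lines for ch in line if ch.isalpha())


header, features, sequence_lines = split_record(RECORD)
locus = parse_locus(header[0])
sequence = join_sequence(sequence_lines)
print(locus)
print(len(features), "feature lines;", len(sequence), "bases")
```

For real records, an established parser such as the one in Biopython is a safer choice, since the flat-file format has many details this sketch ignores.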
Using Entrez An integrated database search and retrieval system
WWW Access: Entrez & BLAST
Entrez: Database Integration [Figure: linked Entrez databases: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure, Genomes, Taxonomy; links labelled Word weight, BLAST, VAST, Phylogeny]
Database Searching with Entrez • Using limits and field restriction to find the human MutL homolog • Linking and neighboring with MutL
Entrez Nucleotides: Limits (search term: MutL) • Field Restriction: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume • Exclude bulk sequences
Entrez Nucleotides: Limits (search term: MutL) • Title field = Definition line of the record • Exclude bulk sequences
Adding Terms: Preview/Index • Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume
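Field restriction in the web form corresponds to writing `value[Field]` terms joined with Boolean operators. The sketch below builds such a query for the human MutL search and turns it into an NCBI E-utilities `esearch` URL. It is an illustration only: the helper names are invented, the choice of the `Title` and `Organism` fields is an assumption about how the slides' search was set up, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(restrictions):
    """Join (field, value) pairs into a field-restricted Entrez query."""
    return " AND ".join(f"{value}[{field}]" for field, value in restrictions)


def esearch_url(db, term):
    """Return the search URL for a database and query (nothing is fetched)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


# Restrict "MutL" to the Title field and limit the organism to human.
term = build_term([("Title", "MutL"), ("Organism", "human")])
url = esearch_url("nucleotide", term)

print(term)
print(url)
```

Adding further `(field, value)` pairs narrows the search in the same way that adding terms through Preview/Index does.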
GenBank Records Human MutL RefSeq
Literature Links PubMed OMIM
NM_000249: PubMed Books
Conserved Domain OMIM: Human Disease Genes
Sequence Links Nucleotide Protein
NM_000249: Related Sequences (by similarity) • Genome Project BAC • Original GenBank mRNAs • Original GenBank genomic
Taxonomy Link The Tax Browser NCBI’s Taxonomy
NCBI Protein Databases • GenPept: GenBank, EMBL, DDBJ CDS translations • RefSeq: mRNA based (NP_) and genome based (XP_) • Swiss-Prot: curated, high-quality protein reviews • PIR: Protein Information Resource, Georgetown University • PRF: Protein Research Foundation • PDB: Protein Data Bank, sequences from structures
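The RefSeq accession prefix indicates what kind of record an accession points to. A minimal sketch of that lookup follows, covering only the two protein prefixes named above plus `NM_`, the mRNA prefix of the NM_000249 record used in this walkthrough; the function name and the wording of the labels are made up for the example, and RefSeq defines more prefixes than are listed here.

```python
# Prefixes taken from the slides; RefSeq defines others that are omitted here.
REFSEQ_PREFIXES = {
    "NM_": "mRNA record",
    "NP_": "protein record, mRNA based",
    "XP_": "protein record, genome based",
}


def refseq_kind(accession):
    """Classify an accession by its three-character RefSeq prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "prefix not covered in this sketch")


print(refseq_kind("NM_000249"))
```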
Protein Link BLAST Link Conserved Domains
Related Proteins: Redundancy Redundant Sequences
Sequence from MutL structure Related Proteins: Links
BLink: non-redundant relatives Arabidopsis homolog Conserved Domain
MLH1 Domain Structure: CDD • ATPase Domain • Mismatch Repair Domain
ATPase structural alignment ATP Binding site helix
[Figure: an early globin gene undergoes gene duplication into an A-chain gene and a B-chain gene; frog A, chick A and mouse A are orthologs, as are frog B, chick B and mouse B; A-chain and B-chain genes are paralogs.]
The New HomoloGene • No longer UniGene based • Protein similarities first • Guided by taxonomic tree • Includes orthologs and paralogs
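The ortholog/paralog distinction in the globin figure can be written as a small rule. The sketch below assumes the simple scenario shown there, with a single gene duplication (A-chain versus B-chain) that predates the split of frog, chick and mouse; the function and variable names are invented for the example, and real homology assignment, as in HomoloGene, relies on sequence similarity and the taxonomic tree rather than on labels.

```python
from itertools import combinations

# (species, chain) pairs taken from the globin figure.
genes = [
    ("frog", "A"), ("chick", "A"), ("mouse", "A"),
    ("mouse", "B"), ("chick", "B"), ("frog", "B"),
]


def relationship(gene1, gene2):
    """Classify two genes, assuming one duplication before all speciations."""
    (species1, chain1), (species2, chain2) = gene1, gene2
    if chain1 != chain2:
        return "paralogs"    # separated by the gene duplication
    if species1 != species2:
        return "orthologs"   # same chain, separated by speciation
    return "same gene"


pairs = {(g1, g2): relationship(g1, g2) for g1, g2 in combinations(genes, 2)}
for (g1, g2), kind in pairs.items():
    print(" ".join(g1), "/", " ".join(g2), "->", kind)
```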
Entrez Gene: integrated gene-based access • LocusLink • Complete Genomes: eukaryotic, microbial, organelle