Spring 2013 Bioinformatics Ayesha Masrur Khan
Protein Family and Domains
Once a protein sequence is obtained, there are many questions that can be asked, such as:
- What is the protein’s overall identity?
- What putative functions does it have?
- What biological motifs are present?
Different computational tools are needed to determine possible functional domains based on primary sequence data.
Lec-4
Protein Family and Domains (contd.)
• Family and domain databases are therefore used to address questions such as ‘What domains are contained within this sequence?’ or ‘What family does this protein belong to?’
But first: what are families and domains?
Protein Family and Domains (contd.)
Family ---> A family of proteins was originally defined by Dayhoff et al. (1978) as a group of sequences with similar functions that share more than 50% identity when aligned. Families are often also characterized by the presence of one or more domains with high sequence similarity.
Domains ---> Traditionally known as structurally independent folding units, domains are conserved functional units that may contain one or more motifs. A domain is characterized by the following:
1- It is a spatially separated unit of the protein 3D structure.
2- It may have sequence and/or structural resemblance to another protein structure or domain.
3- It may have a specific function associated with it.
Protein Family and Domains (contd.)
Motifs ---> These include both short stretches of fixed residue length that act as sites for post-translational modifications and longer sequences that form secondary structures for protein-DNA, protein-ion or protein-lipid interactions.
Domain Example: Pyruvate kinase
• Quaternary structure: 4 subunits
• 3 domains
Zinc finger motif: A sequence motif
• Sequence motif: A particular amino-acid sequence that is characteristic of a specific biochemical function.
• Figure: Three zinc fingers bound spirally in the major groove of a DNA molecule.
• Figure: The coordination of a zinc atom by characteristically spaced cysteine and histidine residues in a single zinc finger motif.
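The idea of "characteristically spaced cysteine and histidine residues" can be sketched as a pattern search. The pattern below, C-x(2,4)-C-x(12)-H-x(3,5)-H, is a simplified spacing rule assumed here for illustration; it is not a curated database signature, and the test sequence is made up.

```python
import re

# Simplified C2H2-style spacing rule (an assumption for illustration):
# Cys, 2-4 residues, Cys, 12 residues, His, 3-5 residues, His.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")

def find_motifs(sequence):
    """Return (start, matched_text) for each non-overlapping pattern hit."""
    return [(m.start(), m.group()) for m in C2H2_PATTERN.finditer(sequence)]

# Made-up sequence fragment used only to exercise the pattern.
toy_sequence = "MKTCPVESCDRRFSRSDELTRHIRIHTGQ"
hits = find_motifs(toy_sequence)
print(hits)
```

A hit only says that the spacing rule is satisfied; whether the region really folds as a zinc finger still has to be checked against family and domain databases.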
Other examples: structural motifs & functional motifs
• Another type is the functional motif, which is a sequence or structural motif that is always associated with a particular biochemical function.
Protein families
• Protein families are related to one another by sequence similarity, domain composition, or structure.
• These include proteins found across species (orthologs) or within the same species (paralogs).
• Family descriptors are derived from MSAs (multiple sequence alignments) that enable us to define traits that encompass all member sequences.
• Family descriptors have been based on sequence identity (>50% identical), common domains (e.g. catalytic binding domains, calcium binding motifs etc.), structure, or a combination of these characteristics.
Protein Domains
• Domains represent discrete stretches within the protein, unlike protein families, which are commonly defined over the length of the sequence.
• These units are conserved at the level of sequence and structure.
• They can be described by:
- combinations of short regions of highly conserved amino acids within a domain
- all amino acids
- structural features
• Domain descriptions are developed in the same way as family descriptors.
Family-Domain Databases
• Because motifs and domains are reused, similarities can be found between sequences that are otherwise evolutionarily unrelated.
• Therefore, methods are needed to distinguish between similarities due to random variation and those of common origin or function.
Family-domain databases provide the following benefits:
- Increased sensitivity, i.e. true matches are detected through MSA
- Increased specificity, i.e. only related proteins are detected
- Classification of protein sequences into appropriate families
Family-Domain Databases
Some database references
Searching sequence databases
• Search methods engage in a series of sequence alignments to determine degrees of similarity between sequences and then return a list of matched sequences to the user.
Alignment Algorithms
• Manually, we examine two or more sequences for similar residue patterns, match up identical residues, decide qualitatively whether they are aligned well, and determine statistically how identical or similar the sequences are.
• Automating this process requires a computer-based method to line sequences up against one another and a scoring method for evaluating the success of the alignment in terms of similarity or identity.
DNA → RNA → Protein
• Sequence comparison and alignment is a central problem in computational biology. The most basic task is: given two known sequences (DNA, RNA or amino acids) and a scoring model, determine whether they are related or not.
Sequence alignment
• When we align sequences, we assume that they share a common ancestor
- They are then homologous
• Protein fold is much more conserved than protein sequence
• DNA sequences tend to be less informative than protein sequences
• Figure (example alignments): ATTGCGC / ATTGCGC; ATTGCGC / AT-CCGC; ATTGCGC / ATCCGC C
• An alignment is a hypothesis of positional homology between bases/amino acids.
Sequence alignment
• The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
• There are lots of possible alignments.
• Two sequences can always be aligned.
• Sequence alignments have to be scored.
• Often there is more than one solution with the same score.
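To make "alignments have to be scored" concrete, here is a minimal sketch of a column-by-column scoring function. The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, not values from the lecture.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned rows column by column ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap        # a residue paired with a gap
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total

# Two different placements of the gap for the same pair of sequences.
first = score_alignment("ATTGCGC", "AT-CCGC")
second = score_alignment("ATTGCGC", "ATC-CGC")
print(first, second)
```

Under these assumed values both placements receive the same score, which illustrates how more than one solution can tie.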
Identity vs. Similarity
• Identity refers to an exact match between two nucleotides or amino acids.
• Similarity refers to a resemblance between two residues that is greater than one would expect at random.
Percent Sequence Identity
• The extent to which two nucleotide or amino acid sequences are invariant.
• Example (70% identical):
A C C T G A G - A G
A C G T G - G C A G
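The 70% figure can be checked with a short sketch. It assumes that percent identity is computed over all alignment columns, including gap columns; other conventions (for example, dividing by the shorter sequence length) exist and give different numbers.

```python
def percent_identity(row_a, row_b):
    """Percent of alignment columns holding the same residue in both rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    if not row_a:
        raise ValueError("alignment must not be empty")
    # A column counts as identical only if both rows hold the same residue.
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * identical / len(row_a)

# The two aligned rows from the slide, with '-' marking gaps.
identity = percent_identity("ACCTGAG-AG", "ACGTG-GCAG")
print(identity)  # 7 identical columns out of 10
```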
Alignment methods
• By hand: slide sequences on two lines of a word processor
• Dot plot
- with windows
• Rigorous mathematical approach
- Dynamic programming (slow, optimal)
• Heuristic methods (fast, approximate)
- BLAST and FASTA
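A dot plot with windows can be sketched in a few lines. The function names and the window and threshold parameters below are hypothetical choices for illustration: a dot is drawn wherever two windows share at least `min_matches` identical positions.

```python
def dot_plot(seq_a, seq_b, window=3, min_matches=3):
    """Return a grid of booleans; cell [i][j] is True when the windows
    starting at seq_a[i] and seq_b[j] share at least min_matches identities."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= min_matches)
        grid.append(row)
    return grid

def render(grid):
    """Draw the grid as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)

# A sequence compared with itself shows up as an unbroken diagonal.
plot = dot_plot("ATTGCGC", "ATTGCGC", window=3, min_matches=3)
print(render(plot))
```

Lowering `min_matches` below `window` tolerates mismatches inside a window, at the cost of more background dots.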
Global and Local Alignment
• Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.
- Used when an objective and optimal measure is needed to compare two sequences and it is valid to assume that the sequences are of similar length
• Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
Global alignment
• The Needleman-Wunsch algorithm (1970) creates a global alignment over the length of both sequences.
• Global algorithms are often not effective for highly diverged sequences: they do not reflect the biological reality that two sequences may share only limited regions of conserved sequence.
• Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.
• Global methods are useful when you want to force two sequences to align over their entire length.
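A minimal sketch of Needleman-Wunsch-style global alignment by dynamic programming follows. It assumes a simple linear gap penalty and illustrative scores (match +1, mismatch -1, gap -2); when several alignments tie, the traceback returns just one of them.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in seq_b
                              score[i][j - 1] + gap)      # gap in seq_a
    # Trace back from the bottom-right corner to the top-left corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(seq_a[i - 1]); row_b.append(seq_b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))

best_score, aligned_a, aligned_b = needleman_wunsch("ATTGCGC", "ATCCGC")
print(best_score)
print(aligned_a)
print(aligned_b)
```

Because the traceback always runs corner to corner, every residue of both sequences appears in the result, which is what forces the alignment over the entire length.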
Local Alignment
• This method identifies the most similar sub-region shared between two sequences.
• Smith-Waterman algorithm (1981)
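A minimal sketch of Smith-Waterman-style local alignment is given below, again assuming a linear gap penalty and illustrative scores (match +1, mismatch -1, gap -2). Cell scores are floored at zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The input sequences are made up so that only a short region is shared.

```python
def smith_waterman(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # Flooring at 0 lets an alignment start anywhere.
            cell = max(0,
                       score[i - 1][j - 1] + sub,
                       score[i - 1][j] + gap,
                       score[i][j - 1] + gap)
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)
    # Trace back from the best cell until a zero is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(seq_a[i - 1]); row_b.append(seq_b[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))

# Made-up sequences: unrelated flanks around one shared region.
local = smith_waterman("CCCCGATTACACCCC", "TTTTGATTACATTTT")
print(local)
```

Unlike the global case, the unrelated flanking residues are simply left out of the result.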