Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins • Shaobing Su • Supervisor: Dr. Lawrence B. Holder • Committee: Dr. Diane J. Cook, Dr. Edward Bellion
Outline • Motivation and goal of the research • SUBDUE knowledge discovery system • Proteins and PDB • Methods and results • Discussion and conclusion • Future research
Motivation and Goal • The rapidly growing volume of molecular biology data needs to be analyzed to help understand the underlying structure-function relationships in proteins and other macromolecules • Apply SUBDUE to the Brookhaven Protein Data Bank (PDB) to identify biologically meaningful patterns
SUBDUE knowledge discovery system • SUBDUE discovers patterns (substructures) in structural data sets • SUBDUE represents data as a labeled graph • Inputs: vertices and edges • Outputs: discovered patterns and their instances
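Example • Vertices: objects or attributes • Edges: relationships • [Figure: example graph in which "object" vertices have "shape" edges to "triangle" and "square" and are linked by an "on" edge; 4 instances of the substructure]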
SUBDUE’s search algorithm • Minimum Description Length (MDL) principle: the best theory to describe a set of data is the one that minimizes the description length (DL) of the entire data set • DL of the graph: the number of bits necessary to completely describe the graph • Search for the substructure that results in the maximum compression
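The compression criterion above can be illustrated with a rough sketch. The bit counts below are a deliberately crude stand-in, not SUBDUE's actual graph encoding, and the function names, the tuple layout, and the example sizes are all assumptions made for illustration.

```python
import math

def description_length(n_vertices, n_edges, n_vertex_labels, n_edge_labels):
    """Crude bit estimate for a labeled graph (NOT SUBDUE's exact encoding)."""
    label_bits = math.log2(max(n_vertex_labels, 1))      # bits per vertex label
    endpoint_bits = math.log2(max(n_vertices, 1))        # bits per edge endpoint
    edge_label_bits = math.log2(max(n_edge_labels, 1))   # bits per edge label
    return n_vertices * label_bits + n_edges * (2 * endpoint_bits + edge_label_bits)

def compression(graph, sub, n_instances):
    """DL(G) / (DL(S) + DL(G|S)) for non-overlapping instances of S in G.

    graph and sub are (n_vertices, n_edges, n_vertex_labels, n_edge_labels).
    """
    gv, ge, gvl, gel = graph
    sv, se, _, _ = sub
    # Each instance collapses to a single vertex carrying one new label.
    cv = gv - n_instances * (sv - 1)
    ce = ge - n_instances * se
    dl_compressed = description_length(cv, ce, gvl + 1, gel)
    return description_length(*graph) / (description_length(*sub) + dl_compressed)

# Hypothetical chain of 100 residues over 20 labels with one edge label.
graph = (100, 99, 20, 1)
# Hypothetical 4-vertex substructure that occurs 10 times.
sub = (4, 3, 20, 1)
value = compression(graph, sub, 10)
```

A value above 1 means that describing the substructure once and then the compressed graph takes fewer bits than describing the original graph; under this sketch, the search would prefer the substructure with the largest value.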
Inexact graph match approach • Find instances that differ from the substructure by a slight distortion: insertion, deletion, or substitution of edges/vertices • Threshold parameter: specifies the amount of distortion allowed
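As a simplified illustration of the distortion idea, the sketch below swaps the general graph match for a plain edit distance on linear chains of vertex labels, which is only adequate when the graph is a chain. Unit costs for every distortion, the rule that the cost must not exceed the threshold times the size of the larger chain, and the example labels are assumptions, not details taken from SUBDUE.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute x with y (free if equal)
            ))
        prev = curr
    return prev[-1]

def inexact_match(a, b, threshold):
    """True if the distortion is within threshold * size of the larger chain."""
    return edit_distance(a, b) <= threshold * max(len(a), len(b))

# Hypothetical chains of secondary structure labels.
pattern = ["h_1_10", "s_0_7", "h_1_10"]
candidate = ["h_1_10", "h_1_10"]
cost = edit_distance(pattern, candidate)
```

With a threshold of 0 only exact matches are accepted; raising the threshold admits instances with progressively more inserted, deleted, or substituted vertices.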
Overview of proteins • Among the most important biomolecules • Composed of 20 amino acids • Structural hierarchy • Very diverse structure and function
Structural hierarchy in proteins • Primary structure (sequence of protein) • Secondary structure (helix, sheet, random coil) • Tertiary structure (3-D)
Primary Structure of proteins • Average 100-150 residues (a.a.) linked head to tail • N-terminus and C-terminus • Peptide bond, alpha-carbon • [Figure: dipeptide drawn from N-terminus to C-terminus, +H3N - C1(R1) - C(=O) - N(H) - C2(R2) - C(=O) - O-, with the peptide bond between the first a.a. and the second a.a.]
Secondary structure elements • Ordered backbone arrangement: helix and sheet • Helix (0 % to 90 %; average 11 a.a; several types) • Sheet (2 to 15 strands per sheet; parallel and anti-parallel; average 6 a.a. per strand)
Tertiary Structure of protein • Highly complicated 3-D arrangement • Folding of its secondary structure elements
Brookhaven Protein Data Bank (PDB) • Brookhaven National Laboratory • Over 6000 experimentally determined 3-D structures of biomolecules • Majority: protein structures
Contents of PDB • SEQRES: sequence of a.a. (three-letter code) • HELIX: start, end, and type • SHEET: start, end, and sense • ATOM: (x, y, z) coordinates for each atom in the protein
Applications of SUBDUE to PDB - Methods and Results • July 1997 PDBTM release (6000 PDB) • Global data set (4000 PDB) • Category data sets: hemoglobin, myoglobin, ribonuclease A
Flowchart of Research • [Figure: flowchart with Preprocessing and Application stages; boxes: Brookhaven PDB, Graphic representation, Inputs to SUBDUE, Patterns in Category, Patterns in Global, Instance mapping, others]
Preprocessing • Compile PDB list for each category • model.c: extract first model • seq.c: extract sequence info and convert to graphic format • secondary.c: extract secondary structure info and convert to graphic format • coor.c: extract 3D coordinates and convert to graphic format
Primary structure and its representation
• Sample PDB lines:
    SEQRES 1 150 ALA ASN LYS THR 1ASH 139
    SEQRES 2 150 LYS SER LEU GLU 1ASH 140
• Sequence (N-terminus to C-terminus): ALA ASN LYS THR LYS SER LEU GLU
• SUBDUE graphic input (ALA ASN):
    v 1 ALA       (ALA residue)
    v 2 ASN       (ASN residue)
    e 1 2 bond    (a peptide bond between ALA and ASN)
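The mapping from a residue sequence to SUBDUE input lines can be sketched as follows. This is a minimal illustration, not the seq.c program from the preprocessing step; the function name is hypothetical, and the residue list is assumed to have already been extracted from the SEQRES records.

```python
def sequence_to_graph(residues):
    """Return SUBDUE-style lines: one vertex per residue, one 'bond' edge per peptide bond."""
    lines = [f"v {i} {name}" for i, name in enumerate(residues, start=1)]
    # Consecutive residues i and i + 1 are joined by a peptide bond.
    lines += [f"e {i} {i + 1} bond" for i in range(1, len(residues))]
    return lines

# The first two residues from the sample SEQRES lines.
graph_lines = sequence_to_graph(["ALA", "ASN"])
```

For the two-residue example this yields the same three lines shown on the slide: two vertices and one bond edge.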
Secondary structure and its representation - HELIX
• Sample PDB lines (start, end, type):
    HELIX 1 ASN 1 HIS 13 1
    HELIX 2 ASN 20 ASN 36 1
• Vertex: h_type_length
• Helix length: Hlength = SeqNum(last a.a.) - SeqNum(first a.a.)
• SUBDUE graphic input:
    v 1 h_1_12    (helix 1, type 1, length 12)
    v 2 h_1_16    (helix 2, type 1, length 16)
Secondary structure and its representation - SHEET
• Sample PDB lines (start, end, sense):
    SHEET 1 TYR 284 ILE 286 0
    SHEET 2 HIS 292 THR 294 -1
• Vertex: s_sense_length
• SUBDUE graphic input:
    v 1 s_0_2     (strand 1, sense 0, length 2)
    v 2 s_-1_2    (strand 2, sense -1, length 2)
Overall secondary structure representation
• PDB lines:
    HELIX 1 THR 3 MET 13 1
    HELIX 2 ASN 24 ASN 34 1
    HELIX 3 SER 50 GLN 60 1
    SHEET 1 LYS 41 HIS 48 0
    SHEET 2 MET 79 THR 87 -1
• SUBDUE graphic input (vertices ordered by sequence position, so SHEET 1 at residues 41-48 precedes HELIX 3 at residues 50-60):
    v 1 h_1_10
    v 2 h_1_10
    e 1 2 sh
    v 3 s_0_7
    e 2 3 sh
    v 4 h_1_10
    e 3 4 sh
    v 5 s_-1_8
    e 4 5 sh
• The sequential relationship is represented as edge “sh”
• [Figure: visualization of the chain of elements from N-terminus to C-terminus]
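The helix, sheet, and overall representations can be sketched together in a few lines. This is an illustration under assumptions rather than the secondary.c program: the function names and the tuple layout are hypothetical, and each record is assumed to be already parsed into (kind, start sequence number, end sequence number, type or sense).

```python
def element_label(kind, start, end, code):
    """Build h_type_length or s_sense_length, with length = end - start."""
    prefix = "h" if kind == "HELIX" else "s"
    return f"{prefix}_{code}_{end - start}"

def secondary_to_graph(records):
    """Order elements by start position and chain them with 'sh' edges."""
    ordered = sorted(records, key=lambda r: r[1])
    lines = []
    for i, (kind, start, end, code) in enumerate(ordered, start=1):
        lines.append(f"v {i} {element_label(kind, start, end, code)}")
        if i > 1:
            lines.append(f"e {i - 1} {i} sh")  # sequential neighbor along the chain
    return lines

# Records taken from the sample PDB lines on the slide.
records = [
    ("HELIX", 3, 13, 1),
    ("HELIX", 24, 34, 1),
    ("HELIX", 50, 60, 1),
    ("SHEET", 41, 48, 0),
    ("SHEET", 79, 87, -1),
]
graph_lines = secondary_to_graph(records)
```

Sorting by start position is what places the strand at residues 41-48 ahead of the third helix, reproducing the vertex order shown on the slide.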
Tertiary structure and its representation
• Sample PDB lines (last three columns are X, Y, Z):
    ATOM CA ALA 1 10.369 0.997 10.519
    ATOM CA ASN 2 6.691 0.239 9.830
• Vertex: backbone carbon; edge: distance (vs, s)
• Distance (Å): $\text{distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$
• SUBDUE graphic input:
    v 1 CA_ALA
    v 2 CA_ASN
    e 1 2 vs      (very short distance)
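The distance calculation and the edge labeling can be sketched as below. The 4 Å and 6 Å cutoffs come from the tertiary structure rationale slide; treating both bounds as inclusive and emitting no edge beyond 6 Å are assumptions, as are the function names.

```python
import math

def ca_distance(p, q):
    """Euclidean distance in angstroms between two (x, y, z) points."""
    return math.dist(p, q)

def distance_label(d):
    """Map a distance to an edge label: 'vs' (very short), 's' (short), or None."""
    if d <= 4.0:   # assumed inclusive cutoff
        return "vs"
    if d <= 6.0:   # assumed inclusive cutoff
        return "s"
    return None    # assumed: no edge for longer distances

# Alpha-carbon coordinates from the two sample ATOM lines.
ca_ala = (10.369, 0.997, 10.519)
ca_asn = (6.691, 0.239, 9.830)
d = ca_distance(ca_ala, ca_asn)
label = distance_label(d)
```

For the two sample atoms the distance is about 3.8 Å, which falls in the "vs" range and agrees with the edge shown on the slide. Because only distances are used, the labels do not depend on the choice of coordinate origin.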
Rationale for representation choice - Criteria • Patterns identified by SUBDUE must be representative of each category • Patterns discovered by SUBDUE should discriminate one category from the others
Primary sequence • Vertex: a.a. residue name • Edge: peptide bond • Example: v 1 ARG, v 2 GLU, v 3 ALA; e 1 2 bond, e 2 3 bond • [Figure: ARG - bond - GLU - bond - ALA]
Secondary structure elements • Type of the helix • Starting and ending points (a.a. name and seq number) • [Figure: vertex "Helix 1" with a "type" edge to 1, a "length" edge to 12, a "starts" edge to ASN and an "ends" edge to HIS (ASN … HIS), drawn from N-terminus to C-terminus]
Other ways of representing helix • Separate type and length: vertex "Helix 1" with a "type" edge to 1 and a "length" edge to 12 • Combine type and length: single vertex Helix_1_12
Tertiary structure • (x, y, z) coordinates vary with the choice of origin • Avoid numeric values; use vs (dist ≤ 4 Å) and s (4 Å < dist ≤ 6 Å) • [Figure: C1 with x = 10.4, y = 1.0, z = 10.5 and C2 with x = 6.7, y = 0.2, z = 9.8, replaced by a single "vs" edge between C1 and C2]
Results: Primary structure patterns
• Hemo_seq (63/65), Hemo_sequence: THR LYS THR TYR PHE PRO HIS PHE ASP LEU SER HIS GLY SER ALA GLN VAL LYS GLY HIS GLY LYS LYS VAL ALA ASP ALA LEU THR ASN ALA VAL ALA HIS VAL ASP ASP MET PRO ASN ALA LEU SER ALA LEU SER THR LEU ALA ALA HIS LEU PRO LAL GLU PHE THR PRO ALA VAL HIS ALA SER LEU ASP LYS PHE LEU ALA SER VAL SER THR VAL LEU THR SER LYS TYR
• Myo_seq (67/103), Myoglo_sequence: VAL LEU SER GLU GLY GLU TRP GLN LEU VAL LEU HIS VAL TRP ALA LYS VAL GLU ALA ASP VAL ALA GLY HIS GLY GLN ASP ILE LEU ILE ARG LEU PHE LYS SER HIS PRO GLU THR LEU GLU LYS PHE ASP ARG
• Ribo_A (59/68), Ribonuclease_A_sequence: GLY GLN THR ASN CYS TYR GLN SER TYR SER THR MET SER ILE THR ASP CYS ARG GLU THR GLY SER SER LYS TYR PRO ASN CYS ALA TYR LYS THR THR GLN ALA ASN LYS HIS ILE ILE VAL ALA CYS GLU GLY ASN PRO TYR VAL PRO VAL HIS PHE ASP ALA SER VAL
Primary structure patterns • Unique to each sample category • hemoglobin and myoglobin proteins share little sequence similarity
Results: Hemo secondary structure patterns
    1: h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
    7: h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
Results: Myo secondary structure patterns
    1: h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25
Results: Ribo_A secondary structure patterns
    1: h_1_10 -> h_1_10 -> s_0_7 -> s_0_7 -> h_1_10 -> s_0_3 -> s_0_3 -> s_-1_4 -> s_-1_4 -> s_-1_8 -> s_-1_1 -> s_-1_10 -> s_-1_10 -> s_-1_8 -> s_-1_8 -> s_-1_5 -> s_-1_3
    10: h_1_10 -> h_1_10 -> s_0_7 -> h_1_10 -> s_0_3 -> s_-1_4 -> s_-1_8 -> s_-1_8 -> s_-1_6
Results: Tertiary structural patterns • SUBDUE finds only small patterns (2 or 3 a.a.) • Not unique to each category of proteins • Not biologically meaningful
Visualization of secondary structure patterns - hemoglobin • [Figure: complete hemoglobin structure and 2 instances of the pattern, with N-terminus and C-terminus marked]
Visualization of secondary structure patterns - myoglobin • [Figure: complete myoglobin structure and 1 instance of the pattern, with N-terminus and C-terminus marked]
Visualization of secondary structure patterns - ribonuclease_A • [Figure: complete ribonuclease_A structure and 1 instance of the pattern, with N-terminus and C-terminus marked]
Discussion - Hemoglobin • Hemoglobin: A, B, C, D chains • Two types of patterns identified by SUBDUE: one for the A, C chains, the other for the B, D chains • The patterns exist in a majority of hemoglobin proteins • No instances of the best hemoglobin pattern were found in other proteins in the global data set
Discussion - Myoglobin • Myoglobin: one chain • One dominant pattern identified by SUBDUE • The pattern exists in most myoglobin proteins • No instances of the best myoglobin pattern were found in other proteins in the global data set
Discussion - Hemoglobin and Myoglobin
• Similar secondary structure patterns (from N- to C-terminus):
    Hemoglobin B, D chains: h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
    Myoglobin chain:        h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25
    Hemoglobin A, C chains: h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
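How close the three patterns are can be checked with a position-by-position comparison. The sketch below is an illustrative aid, not part of SUBDUE; the function and variable names are hypothetical, and the patterns are copied from the slide.

```python
def differing_positions(p, q):
    """Return 1-based positions at which two equal-length patterns differ."""
    if len(p) != len(q):
        raise ValueError("patterns must have the same number of elements")
    return [i for i, (a, b) in enumerate(zip(p, q), start=1) if a != b]

hemo_bd = ["h_1_14", "h_1_15", "h_1_6", "h_1_6", "h_1_19", "h_1_8", "h_1_18", "h_1_20"]
myo     = ["h_1_15", "h_1_15", "h_1_6", "h_1_6", "h_1_19", "h_1_9", "h_1_18", "h_1_25"]
hemo_ac = ["h_1_15", "h_1_15", "h_1_6", "h_1_1", "h_1_19", "h_1_8", "h_1_18", "h_1_20"]

bd_vs_myo = differing_positions(hemo_bd, myo)
ac_vs_myo = differing_positions(hemo_ac, myo)
bd_vs_ac = differing_positions(hemo_bd, hemo_ac)
```

Each hemoglobin pattern differs from the myoglobin pattern at three of the eight helices, always including the sixth and the last, and the two hemoglobin patterns differ from each other at the first and fourth helices only.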
Discussion - Hemoglobin and Myoglobin • Consistent with genetic studies • Hemoglobin and myoglobin share one ancestral gene • Divergence occurred in the course of evolution: one copy of the gene for myoglobin, four copies for hemoglobin • The last helix of hemoglobin is shorter than that of myoglobin; one of the helices in the hemoglobin A, C chains almost disappears, which allows conformational change
Discussion - Ribonuclease A proteins • All patterns have three helices of the same size • Several strands appear twice, indicating participation in the formation of two sheets • Ribonuclease S protein (S-protein fragment) also has the pattern
Conclusion of the results • Secondary structure patterns discovered by SUBDUE are representative of each category • Secondary structure patterns discovered by SUBDUE are distinct for each category • SUBDUE has the ability to discover biologically interesting patterns from the PDB and other similar molecular biology (MB) databases
Comparison with other related studies • Different graphic representations • Predefined patterns with exact or inexact graph match • Not applied systematically to the PDB or other databases • SUBDUE could perform a similar task if the inexact graph match routine were incorporated
Conclusions of the study • Abstraction from the 3D structure to its secondary structural elements is suitable for discovery • The secondary structure patterns that SUBDUE discovered for each category can be used as a signature for its class • Inexact graph match is useful for finding similar patterns • SUBDUE is suitable for knowledge discovery in MB structural databases
Future Research • More consistent and detailed description of secondary structure • Add relative positions of the secondary structural elements to represent spatial relationships • Investigate alternative representations: a more suitable representation of 3D coordinates; weighting of different edges • Inexact graph match for predefined substructures • More collaboration with domain scientists