Biological Networks CSci 732: Introduction to Bioinformatics

Biological NetworksCSci 732: Introduction to Bioinformatics Anne Denton Assistant Professor Department of Computer Science North Dakota State University, Fargo, ND

The Promise of Bioinformatics • Rich supply of data from high-throughput experiments • Sequencing • Microarray experiments • Scores of specialized high-throughput experiments • Data made available • Most biological data disseminated in large biological databases • Dissemination is a condition of funding (in US) • Makes biology different from most other disciplines!

Challenge and Opportunity • Traditional approach • Studies of one or a few gene at a time • Conclusions based on thorough domain knowledge • Challenge • Standard data analysis techniques inadequate for hundreds or thousands of genes • Opportunity • Massive data valuable to quantify evidence • New aspects can be studied, such as network structure

From Sequencing to Functional Genomics • Sequencing genomes is a mature discipline experimentally and computationally • Whole-genome sequencing centrally involved computations • But what do the proteins do? • Sequence comparison (BLAST) among species • Great computational and experimental challenges • Networks have a fundamental role in specifying relationships

Computer Scientist’s Introduction to Genomics • Encoding • Smallest unit of information • Computer: 1 bit (0 or 1) • Cell: 1 nucleotide of DNA (A,C,T, or G) • Most practical unit of information • Computer: 8 bit are 1 byte (can represent 256 values) • Cell: 3 nucleotide determine 1 amino acid of the protein (20 different amino acids)

Further Analogy between Cells and Computers • Encoded values can serve very different purposes and are stored in the same location • Computer: Instructions and data are stored in the same memory (von Neumann architecture) • Cell: Proteins in the cell do very different things • Catalyze chemical reactions (proteins are part of the process) • Regulate other proteins (proteins change the process)

Networks in Bioinformatics • Many network definitions: • Protein-protein interactions • Biochemical pathways • Annotation Networks • Different definitions in each category Scientific American 05/03

Why Study Networks? • Biochemical pathways tell us about functioning of cell • Chemical processes (Metabolic pathways) • Control of other proteins (Regulatory pathways) • Neighbors in networks often have similar function • Structure of networks can tell us about evolution • Combined study of networks and data can uncover yet more information about cells

Outline • Part 1: Properties of the Networks • Scale-free networks • Part 2: Networks and data • Relational data mining • Problem: Similarity between network neighbors • Solution: Focus on differences between neighbors • Comparison of different networks

Example of a Biological Network:Physical Protein-Protein Interactions • Proteins interact (attach to each other) • Tests if proteins are stable in a close position • Proteins may perform function together (not tested) • Mathematically: Undirected graph • Only one definition of interactions between proteins? No! • Definition based on function: Genetic • Definition based on evolution: Domain fusion

Physical Protein-Protein in Yeast Scientific American 05/03

Scale-free Networks • Properties • Barabasi, Bonabeau 1998 • Some nodes have large number of links, most have only a few • Number of nodes that have a particular number of links decreases as a power law • Robust against accidental failures • Hubs with high connectivity

Power-law Behavior • Probability that any node is connected to k other nodes is 1/k n with n between 2 and 3 • For k = 2: Probability of having twice as many links is a quarter as likely Scientific American 05/03

Robust against accidental failures • As many as 80% of nodes can fail in a scale-free network without breaking up the entire cluster • I.e., even with large number of random mutations in genes, unaffected proteins continue to work together • Note that random removal of nodes is unlikely to remove hubs • Removal of only a few hubs does break up network significantly

Examples Outside Biology • Hyperlink structure of the Internet • Physical structure (routers and communication lines) of the Internet • Social networks • Airline system • Scientific papers connected by citations

Scale-free Network (Airline System) Scientific American 05/03

Random Networks in Contrast • Links placed randomly • Mathematical model (Erdos 1959) • Example: Highway system • Bell-shaped (Poisson, similar to Gauss) distribution of number of nodes around typical value • For large number of links k, probability of k links decreases exponentially • Very unlikely to have nodes with a very large number of links

Random Network (Highway System) Scientific American 05/03

Reasons for the development of scale-free networks • Networks grow over time • Older nodes have had longer to accumulate links • The most connected nodes in E.coli metabolic network have an early evolutionary history • Preferential attachment • "The rich get richer“ cf. many people hyper-link to Google

"Small World" Property • Game "Six Degrees of Kevin Bacon ": Acting in the same movie as links connects most actors in 6 steps • Internet pages are typically 19 clicks apart • Any two chemicals in a cell are only 3 reactions apart! • Small-world property is not limited to scale-free networks

Are Protein Networks Pure Scale-free Networks? • Clusters of tightly connected nodes • Example • Proteins that perform function together • Recovering scale free network • Clustering of nodes leads to groups that interact as scale-free networks

Summary of Scale-free Networks • Characterized by • Few hubs with a large number of edges • Many nodes with few edges • Show small-world property • Any node can be reached from any other in only a few steps • Ubiquitous in biology and outside

Has Everything Interesting Related to Networks Been Done? • Graph theory is an old topic: Euler 1736 • Work on scale- free network has added to it • Surely now everything is done!

Name: YAL040C Function: Cell Cycle & DNA Process Localization: nucleus Class: Cyclins Complex: Cyclin-dependent kinases Motif: PDOC00013 Phenotype: Cell cycle defects Protein-Protein Interaction Networks Name: YAL003W Function: Protein Synthesis Localization: Cytoplasm Class: GTP/GDP-exchange factors Complex: Translation complexes MOTIF: PDOC00648

Outline (2): Data and Networks • Relational rather than graph-theoretic approach to mining of data on a graph • Problem of similarity between neighbors • New Algorithm • Focuses on differences • Comparison of Networks • Different definitions of Networks • Generalization of difference-based algorithm

Questions of Interest • How do data between nodes relate? • Are there typical patterns among interacting proteins? • Can we find relationships that are not yet known to biologists? • It is expected that proteins in the nucleus interact with other proteins in the nucleus • It is more surprising if proteins in the nucleus interact with proteins in the mitochondria • How do different networks compare?

Frequent Patterns in Tables:Association Rule Mining • 1st step: Finding sets of items that are frequent • Originally: Items in shopping carts • Here: Properties of proteins • Support: Fraction of transactions, in which the set of items occurs • 2nd step: Finding associations A -> B • If we find set A, we are likely to find set B • Similar to correlation, but goes in one direction only • Confidence: Fraction of transactions that have A, which also have B

Relational Approach 0. 1.

Generalization • Works for any number of nodes • Even 2-node structure allows complex rules • Results become harder to interpret for many nodes • “Small world property”: Any protein can be reached in 3-4 hops

Effect of Joining on Item Sets

ARM on Joined Tables • Protein names (key) used for joining but not in ARM • One node table participates multiple times • Items labeled to keep track of instance • Typical transactions • Simplest approach had been done overemphasizes similarities

Problem with Naive Implementation • Many rules that are not interesting • Rules that reflect similarity between neighbors • 0.nucleus → 1.nucleus • Rules involving only one protein • 0.transcription → 0.nucleus • Note: Different support and confidence compared with ARM on node relation (protein data ignoring network) • Rules that are a consequence of both above • 0.transcription → 1.nucleus

Algorithm

Properties of Algorithm • Significant pruning at transaction level • Fewer items in transactions • Fewer transactions • Note: Pruning at rule or item set level would not be consistent • Example: 0.transcription → 1.nucleus • Differs from conventional setting: pruning at itemset level is gold standard • Modular approach • Unique operation can be combined with different ARM implementations

Results for {0.transcription}  {1.nucleus} • Standard ARM: • Support = 0.29%, Confidence = 28.38% Rule: {0.transcription}  {0.nucleus} (0.70%, 69.59%) Rule: {0.nucleus}  {1.nucleus} (5.74%, 29.06%) • Differential ARM: • Support 0.02%, Confidence 2.08% • Typical range for differential ARM • Support 0.2-2%, Confidence 6-20%

Focus on Differences • Similarity can be tested through calculation of correlation of items with themselves • Computationally easy • Association meaningless • Typical rules that are interesting to biologists • Which kinds of proteins show compartmental cross-talk? E.g., proteins in the nucleus interacting with proteins in the mitochondria • How do interaction definitions differ? E.g., which protein families show physical interactions but no domain fusion interactions

Other Results of Differential ARM • Expected Cross-Talk past analysis papers • {1.mitochondria}  {0.cytoplasm} (1.2%, 27.3%) • Interesting Related Rules not found before • {1.mitochondria}  {0.nucleus} (0.72%, 16%) • {1.ER}  {0.mitochondira} (0.21%, 6%)

Performance

Comparison of Different Protein-Protein Interaction Networks • Different Definitions • Physical interactions • Genetic interactions • Domain fusion • Are the resulting networks biologically equivalent? • Common assumption: All networks signify similarity in neighbors

Physical Interactions • Tests whether proteins physically interact when brought close together • Yeast-2-hybrid method • A gene is cut in half, and each half attached to one of the proteins in question • The gene can only perform its job if its parts come close through the protein-protein interactions • Can be done in any cell, i.e. does not test functions in cell

Genetic Interactions • In vivo analysis, i.e., in the living organism • Typical scenario • Assume gene A and B can individually be deleted and the organism survives • Assume deleting A and B together means the organism does not survive • Other combinations are possible • Organism does not survive deletion of A and B individually but does survive combined deletion • Or: Organism survives but is noticeably changed

Domain Fusion Interactions • Comparative Genome Analysis • Purely computational analysis • Based on evolutionary relationships • Assume species A has one gene with two domains 1 and 2 • Assume species B has two genes that have the same evolutionary origin (orthologs), 1' and 2' • Likely that proteins from genes 1' and 2' interact to generate the same function • 24,000 protein-protein interactions in yeast • No experimental verification!

Can ARM Results be Compared Between Networks? • Not all proteins studied for all interactions • Networks have very different properties

Construction of Network Comparison Basis Set

Algorithm • Only nodes are considered that are involved in both networks • Items are eliminated if they occur in either of the two interaction types 1.A 1.C 1.D 2.A 2.B 0.C 0.E

Results • Rules based on physical interactions have higher confidence than genetic or domain fusion • Example: Physical compared with Domain Fusion {0.ABC trans family signature(PDOC00185)}  {1.ATP/GTP binding site motif A(PDOC00017)} (0.45%, 90%) • ABC family is known to function together with ATP binding site • Nevertheless ABC family signature don’t occur in one gene together with ATP binding site in other species (no domain fusion) • Possible reason: many proteins with ATP binding site

Conclusions of Part 2 • Differential algorithm to ARM in relations that describe networks contrasts • Neighbors within neighbors • Multiple networks • Solves other problems of network setting • Overwhelming number of rules due to neighbor similarity • Problem of rules that don’t involve all nodes • Rules that follow from combinations of both above problems • Some results confirmed by biologists • Some results interesting but plausible to biologists

Overall Conclusions • Computer science techniques can uncover patterns in data that could not be identified by simple inspection • Large amount of data • Complex data • Exciting times are only starting • Functional genomics only at its beginning • Mutual understanding of language takes time

Overall Conclusions (Data Miner’s perspective) • Bioinformatics excellent “playing field” for data miners • Data more easily available than in most other disciplines (possible exception: astronomy) • Results can directly benefit researchers in biology • Algorithms can/have to pass test of reality • Real data motivate fundamentally new algorithms

Acknowledgements (Part 2) Computational side • Christopher Besemann Biological interpretation (Collaborators from NDSU Dept. of Biology) • Ajay Yekkirala • Ron Hutchison • Marc Anderson The work was funded by the Dept. of CS, EPSCoR and the NDSU Research Foundation

Biological Networks CSci 732: Introduction to Bioinformatics