Basic Graph Terminology and Advanced Tree Concepts

CS 336 March 19, 2012 Tandy Warnow

Basic Graph Terminology • Nodes, vertices, edges, degrees, paths, cycles, connected components, adjacency, isolated vertices, trees, forests • Directed graphs: indegree, outdegree, trees

Advanced terminology • Cliques • Independent sets • Chromatic number and vertex colorings • Eulerian cycles and Eulerian paths • Hamiltonian paths • Matchings • Dominating Set • Vertex Cover

Paths, Connected Components, etc. • A path is a sequence of vertices v1, v2, …, vn so that vi is adjacent to vi+1 for i=1,2,…,n-1. A simple path is one that does not have repeated vertices. • A graph is connected if every pair of vertices in the graph is connected by some path. • A connected component is a maximal subset of the vertices that is connected.
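The definitions above can be made concrete with a small sketch. It assumes the graph is undirected and stored as a dictionary mapping each vertex to a list of its neighbors; the names `connected_components`, `is_connected`, and `example` are illustrative and not part of the lecture.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components of an undirected graph.

    adj maps each vertex to an iterable of its neighbors (assumed symmetric).
    """
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects everything joined to `start` by a path.
        seen.add(start)
        component = []
        queue = deque([start])
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(sorted(component))
    return components


def is_connected(adj):
    # By convention here, the empty graph counts as connected.
    return len(connected_components(adj)) <= 1


example = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
print(connected_components(example))  # [[1, 2, 3], [4, 5], [6]]
```

Vertex 6 is an isolated vertex, so it forms a component by itself.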

Cycles • A cycle in a graph is a path that starts and ends at the same vertex. • A simple cycle is a cycle that does not have any repeated vertices (other than the start and end vertex). • A graph is acyclic if it has no simple cycles.

Trees • Two types: rooted and unrooted • Unrooted (simplest): acyclic connected graph • Rooted: take an unrooted tree, pick one node to be the root, and direct all edges away from the root. Voila!

Theorems about trees Let T be a connected acyclic graph (i.e., a tree) with n vertices (n>0). Then: • T has at least one leaf (node with degree 0 or 1). • T has n-1 edges. • Every edge in T is a cut-edge. • Every tree can be 2-colored.

Theorem: Every tree has at least one leaf Theorem: For any tree T with at least one vertex, T has at least one leaf (node with degree 0 or 1). Proof: • If n=1, then T is a single vertex of degree 0, which is a leaf. • Else, n>1. Let P be a longest simple path in T, so P=v1,v2,…,vk. Since T is connected and n>1, k≥2 and vk has at least one neighbor, namely vk-1. • If vk has degree 1, we are done. Otherwise, vk has at least two neighbors, and so some neighbor w other than vk-1. If w is in P, then we have a simple cycle in T, contradicting that T is a tree. If w is not in P, then we can extend P and get a longer path, contradicting that P is a longest simple path in T. • Hence, vk has degree 1, and we are done.

Theorem: Any tree with n>0 nodes has n-1 edges • Proof: by induction on n. • Base case: n=1 (trivial) • Inductive hypothesis: for some positive n, any tree on n nodes has exactly n-1 edges. • Let T be a tree on n+1 nodes. We want to show T has exactly n edges.

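Proof (cont’d) • Let v be a node in T with degree 1. (T has a leaf by the previous theorem, and since T is connected with n+1>1 nodes, that leaf has degree 1 rather than 0.) • Remove v and its incident edge from T. The result is a tree T’ with n nodes (removing a leaf cannot create a cycle or disconnect the remaining nodes), and hence n-1 edges (by the inductive hypothesis). • T’ contains one fewer edge and one fewer vertex (node) than T, and so T has n edges.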
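The edge count gives a quick test for whether a graph is a tree. The sketch below relies on a standard fact that the slides do not state: a connected graph on n vertices with exactly n-1 edges is acyclic, hence a tree. It assumes a simple graph whose vertices are numbered 0..n-1; `is_tree` is a hypothetical helper name.

```python
def is_tree(n, edges):
    """Return True if the undirected graph on vertices 0..n-1 is a tree.

    Checks the edge count first (a tree on n nodes has n-1 edges),
    then checks connectivity with a depth-first search.
    """
    if n == 0 or len(edges) != n - 1:
        return False
    adj = {v: [] for v in range(n)}
    for u, w in edges:
        adj[u].append(w)
        adj[w].append(u)
    seen = {0}
    stack = [0]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n


print(is_tree(4, [(0, 1), (1, 2), (1, 3)]))  # True
print(is_tree(4, [(0, 1), (1, 2), (2, 0)]))  # False: 3 edges, but vertex 3 is cut off
```

The second example shows why the edge count alone is not enough: the graph has n-1 edges but contains a cycle and is disconnected.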

Theorem: every edge in a tree is a cut-edge Proof (by contradiction). • Suppose T is a tree, e=(v,w) is an edge in T that is not a cut-edge. • Then G=T-{e} (but keeping v and w) is connected. Hence there is a simple path P from v to w in G. Since e is not in G, P does not include edge e. • Therefore, we can form a simple cycle C by adding edge e to P. • Since every edge in C is in T, this means that T is not acyclic, contradicting the assumption that T is a tree (connected acyclic graph).
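The theorem can be checked by brute force on small graphs. This is a sketch under the assumption that an edge is a cut-edge exactly when deleting it leaves its two endpoints with no connecting path; the helper names `reachable` and `is_cut_edge` and the example graphs are made up for illustration.

```python
def reachable(vertices, edges, start):
    """Return the set of vertices reachable from start along undirected edges."""
    adj = {v: [] for v in vertices}
    for u, w in edges:
        adj[u].append(w)
        adj[w].append(u)
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def is_cut_edge(vertices, edges, e):
    """True if deleting edge e (but keeping its endpoints) separates them."""
    remaining = [f for f in edges if f != e]
    u, w = e
    return w not in reachable(vertices, remaining, u)


tree_vertices = ["a", "b", "c", "d"]
tree_edges = [("a", "b"), ("b", "c"), ("b", "d")]
triangle_vertices = ["a", "b", "c"]
triangle_edges = [("a", "b"), ("b", "c"), ("c", "a")]

# Every edge of the tree is a cut-edge; no edge of the triangle is.
print(all(is_cut_edge(tree_vertices, tree_edges, e) for e in tree_edges))          # True
print(any(is_cut_edge(triangle_vertices, triangle_edges, e) for e in triangle_edges))  # False
```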

Vertex Coloring • A (proper) vertex coloring of a graph is a function c: V -> {1,2,…,k}, s.t. no two adjacent vertices are mapped to the same color. • The chromatic number of a graph is the minimum number of colors needed to properly color the graph. • How many colors does a tree need?

2-coloring a tree • Theorem: every connected acyclic graph (i.e., tree) can be 2-colored. • Proof: by induction on the number of vertices.

Proof that every tree can be 2-colored • Let G be a tree on n vertices. The base case is n=1. Clearly every tree on 1 vertex can be 2-colored. • The Inductive Hypothesis is that for some positive integer n, any tree on n vertices can be 2-colored. • Let G be a tree with n+1 vertices. We want to show that G can be 2-colored.

Proof (cont’d) • Let v be a node in G that has degree 1, and let w be its unique neighbor in G. • Consider the graph G’ formed by deleting v (and its incident edge but not w) from G. • G’ is also connected and acyclic (why?) and has n vertices, so G’ is a tree on n vertices. • Therefore, by the inductive hypothesis, G’ can be 2-colored. • We extend the coloring from G’ to G, by letting c(v) be 1 if c(w)=2, and c(v)=2 if c(w)=1. • Note that this coloring is proper for G, since the only neighbor of v is w. • Hence G can be 2-colored.
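The inductive proof translates almost directly into a recursive procedure: remove a leaf, color the smaller tree, then give the leaf the color its neighbor does not have. The sketch below assumes the input really is a tree, given as a dictionary from vertex to neighbor list; the function names are illustrative. It recurses once per vertex, so it is only meant for small examples.

```python
def two_color_by_leaf_removal(adj):
    """2-color a tree by following the induction: peel off a leaf, recurse, extend."""
    adj = {v: set(ns) for v, ns in adj.items()}
    if len(adj) == 1:
        # Base case: a single vertex gets color 1.
        (v,) = adj
        return {v: 1}
    # A tree with more than one vertex has a vertex of degree 1.
    v = next(u for u in adj if len(adj[u]) == 1)
    (w,) = adj[v]  # the unique neighbor of v
    smaller = {u: ns - {v} for u, ns in adj.items() if u != v}
    color = two_color_by_leaf_removal(smaller)
    color[v] = 2 if color[w] == 1 else 1
    return color


def is_proper(adj, color):
    """True if no two adjacent vertices share a color."""
    return all(color[v] != color[w] for v in adj for w in adj[v])


tree = {1: [2, 3], 2: [1, 4, 5], 3: [1], 4: [2], 5: [2]}
coloring = two_color_by_leaf_removal(tree)
print(is_proper(tree, coloring))  # True
```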

Structural Induction • This was a proof by structural induction. • Proofs by structural induction can be applied more generally!

Theorem about rooted trees • A rooted tree in which every node has 0 or 2 children is called a “binary tree” • Theorem: every binary tree with n nodes has (n-1)/2 internal nodes (defined to be nodes with more than 0 children). • Proof: by strong induction on n. • Base case: n=1. Such a tree has no internal nodes, so it is true.

Proof, cont’d. • Strong Inductive hypothesis: for some n>0, and for all positive integers k up to n, all rooted binary trees with k nodes have (k-1)/2 internal nodes. • Let T have n+1 nodes, and let the children of the root be A and B. (We know the root has two children, since if it had no children, T would have 1 node, contradicting our hypothesis.) We want to show Int(T) = n/2

We want to show Int(T) = n/2 • TA, the subtree of T rooted at A, is a binary tree; let nA be the number of nodes in TA • TB, the subtree of T rooted at B, is a binary tree; let nB be the number of nodes in TB • Let Int(T) be the number of internal nodes of T, and Int(TA) and Int(TB) be similarly defined.

We want to show Int(T) = n/2 • Then nA and nB are both at most n, and by the inductive hypothesis Int(TA) = (nA-1)/2 and Int(TB) = (nB-1)/2. • The internal nodes of T are the internal nodes of TA, the internal nodes of TB, and the root. Therefore Int(T) = (nA-1)/2 + (nB-1)/2 + 1

We want to show Int(T) = n/2 • We have established that Int(T) = (nA-1)/2 + (nB-1)/2 + 1. • Simplifying this, we get Int(T) = (nA-1 + nB-1 + 2)/2 = (nA + nB)/2. • Let nT be the number of nodes in T. Note nT = nA + nB + 1, so nA + nB = nT - 1. • Therefore, Int(T) = (nT - 1)/2. • Recall that nT = n+1. Therefore, Int(T) = n/2. Q.E.D.
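The identity Int(T) = (nT - 1)/2 can be spot-checked on a small example. The sketch assumes a binary tree (every node has 0 or 2 children) encoded as nested tuples, with `()` for a leaf and `(left, right)` for an internal node; this encoding and the function names are assumptions made for illustration.

```python
def count_nodes(tree):
    """Number of nodes in a tree encoded as () or (left, right)."""
    if tree == ():
        return 1
    left, right = tree
    return 1 + count_nodes(left) + count_nodes(right)


def count_internal(tree):
    """Number of nodes with two children; mirrors Int(T) = Int(TA) + Int(TB) + 1."""
    if tree == ():
        return 0
    left, right = tree
    return 1 + count_internal(left) + count_internal(right)


leaf = ()
t = ((leaf, leaf), ((leaf, leaf), leaf))
print(count_nodes(t), count_internal(t))  # 9 4
```

Here nT = 9 and Int(T) = 4 = (9 - 1)/2, as the theorem predicts.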

Genome Assembly • Given a DNA sequence, technology can allow you to get a collection of k-mers (substrings of length k) that come from analyses of the sequence. • From these k-mers, your objective is to come up with the sequence.

Genome Assembly • Let X be a very long DNA sequence • Consider all k-mers in X, with k big enough so that no k-mer appears two or more times • Goal: reconstruct X from its set of k-mers

Genome Assembly, attempt #1 Approach 1: • Make a node for each k-mer, and put a directed edge from v to w if the k-1 suffix of v is the k-1 prefix of w. • Create the graph for the following string, using k=5 • ACATAGGATTCAC

Genome Assembly, attempt #1 Approach 1: • Make a node for each k-mer, and put a directed edge from v to w if the k-1 suffix of v is the k-1 prefix of w. • Every such graph has a Hamiltonian Path, as long as no k-mer appears more than once!

Hamiltonian Path • A Hamiltonian Path in a graph visits every node exactly once

Genome Assembly, Attempt #1 • Create the graph for the following string, using k=5 • ACATAGGATTCAC • Does the graph have a Hamiltonian Path? • Is it unique? • Can you reconstruct the sequence from the path?
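One way to explore these questions is to build the graph and search it. The sketch below is not the lecture's own code: it builds the overlap graph as described in Approach 1 and enumerates Hamiltonian paths by plain backtracking, which can take exponential time in general and is only sensible for tiny inputs like this one. The names `kmers`, `overlap_graph`, `hamiltonian_paths`, and `spell` are hypothetical.

```python
def kmers(seq, k):
    """All length-k substrings of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def overlap_graph(kmer_set):
    """Edge v -> w when the (k-1)-suffix of v equals the (k-1)-prefix of w."""
    nodes = sorted(kmer_set)
    return {v: [w for w in nodes if v[1:] == w[:-1]] for v in nodes}


def hamiltonian_paths(graph):
    """Yield every path that visits each node exactly once (backtracking)."""
    n = len(graph)

    def extend(path, used):
        if len(path) == n:
            yield list(path)
            return
        for w in graph[path[-1]]:
            if w not in used:
                used.add(w)
                path.append(w)
                yield from extend(path, used)
                path.pop()
                used.remove(w)

    for start in graph:
        yield from extend([start], {start})


def spell(path):
    """Read the sequence off a path: first k-mer, then one new letter per step."""
    return path[0] + "".join(v[-1] for v in path[1:])


graph = overlap_graph(set(kmers("ACATAGGATTCAC", 5)))
paths = list(hamiltonian_paths(graph))
print(len(paths))       # 1
print(spell(paths[0]))  # ACATAGGATTCAC
```

For this string every 4-mer is distinct, so the only edges join consecutive 5-mers and the Hamiltonian path is unique.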

Hamiltonian Path • A Hamiltonian Path in a graph visits every node exactly once • Determining if a graph has a Hamiltonian Path is NP-Complete • So no polynomial-time algorithm is known for general graphs, and this approach to Genome Assembly is computationally intensive (infeasible for genome-scale inputs)

Eulerian Cycles • An Eulerian cycle is one that goes through every edge exactly once • It is easy to see that if a graph has an Eulerian cycle, then every node has even degree. The converse is also true for connected graphs, but a bit harder to prove. • For directed graphs, the cycle will need to follow the direction of the edges (also called “arcs”). In this case, a connected graph has an Eulerian cycle if and only if the indegree is equal to the outdegree for every node.
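The undirected criterion is easy to test mechanically. The sketch below assumes an undirected graph given as a dictionary of neighbor lists, and it adds the requirement that all vertices with at least one edge lie in a single connected component, since the even-degree condition alone does not rule out a graph made of two separate cycles. `has_eulerian_cycle` is an illustrative name.

```python
def has_eulerian_cycle(adj):
    """True if every degree is even and all non-isolated vertices are connected."""
    if any(len(ns) % 2 for ns in adj.values()):
        return False
    non_isolated = [v for v in adj if adj[v]]
    if not non_isolated:
        return True  # no edges at all: the empty cycle uses every edge
    seen = {non_isolated[0]}
    stack = [non_isolated[0]]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return all(v in seen for v in non_isolated)


triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
path = {1: [2], 2: [1, 3], 3: [2]}
two_triangles = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: [5, 6], 5: [4, 6], 6: [4, 5]}
print(has_eulerian_cycle(triangle))       # True
print(has_eulerian_cycle(path))           # False: vertices 1 and 3 have odd degree
print(has_eulerian_cycle(two_triangles))  # False: even degrees, but not connected
```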

Eulerian Paths • An Eulerian path is one that goes through every edge exactly once • It is easy to see that if a graph has an Eulerian path, then all but at most 2 nodes have even degree (the exceptions are the two endpoints of the path). The converse is also true for connected graphs, but a bit harder to prove. • For directed graphs, the path will need to follow the direction of the edges (also called “arcs”). In this case, a connected graph has an Eulerian path that is not a cycle if and only if indegree(v)=outdegree(v) for all but 2 nodes (x and y), where indegree(x)=outdegree(x)+1, and indegree(y)=outdegree(y)-1. Such a path starts at y and ends at x.
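For directed graphs the degree condition also identifies where a path must start and end. The sketch below checks only the indegree/outdegree condition on an edge list and does not check connectivity, so a positive answer is necessary but not sufficient. It returns `None` when all nodes are balanced, because that is the Eulerian cycle case. The function name is hypothetical.

```python
from collections import Counter


def eulerian_path_endpoints(edges):
    """Return (start, end) if the degree condition for a non-closed
    directed Eulerian path holds, otherwise None.

    start has outdegree = indegree + 1; end has indegree = outdegree + 1.
    Connectivity is NOT checked here.
    """
    balance = Counter()  # outdegree minus indegree for each node
    for u, w in edges:
        balance[u] += 1
        balance[w] -= 1
    starts = [v for v, b in balance.items() if b == 1]
    ends = [v for v, b in balance.items() if b == -1]
    small = all(b in (-1, 0, 1) for b in balance.values())
    if small and len(starts) == 1 and len(ends) == 1:
        return starts[0], ends[0]
    return None


print(eulerian_path_endpoints([("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]))  # ('a', 'd')
print(eulerian_path_endpoints([("a", "b"), ("b", "a")]))  # None (balanced: cycle case)
print(eulerian_path_endpoints([("a", "b"), ("a", "c")]))  # None (a has two extra out-edges)
```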

de Bruijn Graph Input: the set of k-mers for the DNA sequence Output: the de Bruijn Graph • Vertices: the (k-1)-mers • Directed edges: from v->w if the (k-2)-suffix of v is the (k-2)-prefix of w, and the k-mer formed by starting with v and ending with w is one of the k-mers in the input
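The construction can be sketched in a few lines. Each input k-mer contributes one directed edge from its (k-1)-prefix to its (k-1)-suffix, which matches the edge rule above. The representation (a dictionary from node to successor list) and the names `kmers` and `de_bruijn_graph` are assumptions for illustration.

```python
def kmers(seq, k):
    """All length-k substrings of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_graph(kmer_set):
    """Nodes are (k-1)-mers; each k-mer adds one edge prefix -> suffix."""
    graph = {}
    for kmer in sorted(kmer_set):
        prefix, suffix = kmer[:-1], kmer[1:]
        graph.setdefault(prefix, []).append(suffix)
        graph.setdefault(suffix, [])  # make sure sink nodes appear as keys
    return graph


graph = de_bruijn_graph(set(kmers("ACATAGGATTCAC", 5)))
print(len(graph))                               # 10 nodes (the distinct 4-mers)
print(sum(len(ws) for ws in graph.values()))    # 9 edges (one per 5-mer)
```

Note the contrast with the overlap graph: k-mers are edges here rather than nodes, which is what turns the assembly question into an Eulerian path question.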

de Bruijn Graph • If the k-mer set comes from a sequence and no k-mer appears more than once in the sequence, then the de Bruijn graph has an Eulerian path!

Using de Bruijn Graphs Given: set of k-mers from a DNA sequence Algorithm: • Construct the de Bruijn graph • Find an Eulerian path in the graph • The path defines a sequence with the same set of k-mers as the original

de Bruijn Graph • Create the de Bruijn graph for the following string, using k=5 • ACATAGGATTCAC • Find the Eulerian path • Is the Eulerian path unique? • Reconstruct the sequence from this path
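A possible end-to-end sketch of this exercise follows. The slides do not say how to find the Eulerian path; the code uses Hierholzer's algorithm, a standard linear-time method, as a swapped-in choice. It assumes no k-mer repeats in the sequence and that an Eulerian path exists, and it raises an error otherwise. All function names are hypothetical.

```python
def kmers(seq, k):
    """All length-k substrings of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_graph(kmer_set):
    """Nodes are (k-1)-mers; each k-mer adds one edge prefix -> suffix."""
    graph = {}
    for kmer in sorted(kmer_set):
        graph.setdefault(kmer[:-1], []).append(kmer[1:])
        graph.setdefault(kmer[1:], [])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm on a directed graph (node -> successor list)."""
    unused = {v: list(ws) for v, ws in graph.items()}
    indegree = {v: 0 for v in graph}
    for ws in graph.values():
        for w in ws:
            indegree[w] += 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next(iter(graph))
    for v in graph:
        if len(graph[v]) - indegree[v] == 1:
            start = v
            break
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if unused[v]:
            stack.append(unused[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())       # dead end: node is finished
    path.reverse()
    edge_count = sum(len(ws) for ws in graph.values())
    if len(path) != edge_count + 1:
        raise ValueError("graph has no Eulerian path")
    return path


def reassemble(seq, k):
    """Rebuild a sequence from its k-mer set via the de Bruijn graph."""
    path = eulerian_path(de_bruijn_graph(set(kmers(seq, k))))
    return path[0] + "".join(v[-1] for v in path[1:])


print(reassemble("ACATAGGATTCAC", 5))  # ACATAGGATTCAC
```

In this example every node has at most one outgoing edge, so the Eulerian path is forced. When (k-1)-mers repeat, several Eulerian paths may exist, and the algorithm only guarantees a sequence with the same k-mer set as the original.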
