Computational Biology Dr. Isabel Darcy EC 3.914 972-882-4435 darcy@utdallas.edu www.utdallas.edu/~darcy Math 6390
Human Genome:
• 3 billion base pairs
• 46 chromosomes
• ~5% of the genome codes for protein. 95% junk DNA???
• Thyroglobulin gene (extreme example): introns: more than 100,000 bp; exons: only 8500 bp
• Gene Expression: some proteins are 1000x more common than other proteins.
Areas used/needed in computational biology:
• Biology
• Computer science
• Statistics
• Graph Theory
• Linear Algebra
• Topology
• Algebraic Geometry
• etc.
Mathematics:
• 1 + 1 = 2 always.
• Topology means …
Biology:
• G always pairs with C (i.e., usually)
• Topology means …
My definition of computational biology: Translating biological concepts/questions into rigorous mathematical and/or computational problems, solving these problems, and translating the answers back into useful biological information.
Example: Determining the packing of the mitochondrial DNA of the parasite trypanosome, which infects the tsetse fly, which in turn infects humans and animals with sleeping sickness.
The trypanosome mitochondrial DNA consists of 5000 DNA mini-circles and about 25 DNA maxi-circles.
• Question: How are the mini-circles of DNA linked together?
• Assumptions: The linking is uniform throughout the network.
• Tool: The network can be randomly broken into much smaller networks. The number of DNA mini-circles in the resulting small networks can be determined (via gel electrophoresis).
We will cover:
• Ch 1: Intro to Molecular Biology
• Ch 2: Some basics (strings, graphs, algorithms)
• Ch 3: Sequence comparison and database search
• Ch 6: Phylogenetic Trees
• Ch 7: Genome Rearrangements
• Microarrays
• Protein Folding (Ch 8?)
Possible Projects:
• Microarrays
• Protein Folding
• DNA computing (Ch 9)
• Human Brain Project
• Chemical Chirality
• Gel Electrophoresis (DNA Topology)
• Or any other approved subject.
Web page creation
For instructions specific to setting up a UTD web page see http://www.utdallas.edu/ir/tcs/labs/unixdocs/provider.html
For instructions on how to create an HTML document see http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html or http://www.hypernews.org/HyperNews/get/www/html/lang.html
OR
• Use another format and convert to HTML. For example, Microsoft Word documents and PowerPoint slides can be converted to HTML using the Save As command. LaTeX files can be converted to HTML files using the command latex2html file.tex. This will create the directory file, which includes file.html, figures, etc.
• You can also learn from other people’s web pages by looking at/copying their HTML code. From Netscape or Internet Explorer, click on Source located under View.
Web page creation (cont.)
• Use Netscape Composer:
  • First create file.html.
  • Under Communicator, click Composer. Netscape Composer will pop up.
  • Click the Open button.
  • Enter the URL (http://www.utdallas.edu/~your_name/file.html) or click Choose file and click file.html (after changing into the appropriate directory).
• Use a PC product such as FrontPage.
2.1 Strings
• Alphabet: a finite set, e.g. {A, T, C, G}, {amino acids}, {a, …, z}.
• Character or symbol: an element of the alphabet.
• Sequence or string: an ordered succession of characters, e.g. ATATCAGTTGCC.
• Length of a string s = |s| = number of characters in s.
• s[i] = the ith character in string s.
• Empty string = e = the string of length zero.
• A subsequence of s is a sequence that can be obtained from s by removing some characters.
• t is a supersequence of s if s is a subsequence of t.
• A substring of s is a subsequence whose characters are consecutive in s.
• t is a superstring of s if s is a substring of t.
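The difference between a subsequence and a substring can be checked directly. The following is a minimal Python sketch; the function names `is_subsequence` and `is_substring` are illustrative and not from the slides, and Python indexes from 0 while the slides index from 1.

```python
def is_subsequence(t: str, s: str) -> bool:
    """Return True if t can be obtained from s by removing some characters."""
    remaining = iter(s)
    # Each character of t must be found in s, scanning s left to right
    # and never going back (the iterator is consumed as we search).
    return all(ch in remaining for ch in t)


def is_substring(t: str, s: str) -> bool:
    """Return True if the characters of t occur consecutively in s."""
    return t in s


s = "ATATCAGTTGCC"                  # example string from the slide
print(len(s))                       # |s|
print(s[0])                         # s[1] in the slide's 1-based notation
print(is_subsequence("AAGC", s))    # True: A, A, G, C appear in this order
print(is_substring("AAGC", s))      # False: they are not consecutive in s
print(is_substring("TCAG", s))      # True: s[4..7] in 1-based notation
```

Every substring is a subsequence, but not the other way around, which is what the second and third calls illustrate.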
An interval, [i..j], is the set of consecutive indices i, i+1, …, j.
• s[i..j] = s[i]s[i+1]…s[j] if i ≤ j; s[i..j] = e (the empty string) if i = j+1.
• st = s[1..n]t[1..m] is the concatenation of s = s[1..n] and t = t[1..m].
• prefix(s,j) = s[1..j] is a prefix of s = s[1..n].
• suffix(s,j) = s[n-j+1..n] is a suffix of s = s[1..n].
• k is the killer agent that destroys the characters it operates on.
Note: Concatenation is not associative when k is involved: (ATk)CT = ACT ≠ ATT = AT(kCT)
|k| = -1
prefix(s,j) = s k^(|s|-j)
suffix(s,j) = k^(|s|-j) s
s[i..j] = k^(i-1) s k^(|s|-j)
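The interval, prefix, suffix, and killer-agent identities above can be checked numerically. This is a minimal Python sketch under two assumptions that are not in the slides: k applied m times on the right is modelled by a helper `kill_right` that deletes the last m characters, and k applied m times on the left by a helper `kill_left` that deletes the first m characters (both names are hypothetical, and both assume 0 ≤ m ≤ |s|).

```python
def interval(s: str, i: int, j: int) -> str:
    """s[i..j] with 1-based, inclusive indices; empty when i == j + 1."""
    return s[i - 1:j]


def prefix(s: str, j: int) -> str:
    """The first j characters of s."""
    return s[:j]


def suffix(s: str, j: int) -> str:
    """The last j characters of s."""
    return s[len(s) - j:]


def kill_right(s: str, m: int) -> str:
    """Model of s k^m: m killer agents on the right delete the last m characters."""
    return s[:len(s) - m]


def kill_left(s: str, m: int) -> str:
    """Model of k^m s: m killer agents on the left delete the first m characters."""
    return s[m:]


s = "ATATCAGTTGCC"
n = len(s)

print(interval(s, 3, 6))                         # characters 3 through 6
print(prefix(s, 4) == kill_right(s, n - 4))      # prefix(s,j) = s k^(|s|-j)
print(suffix(s, 4) == kill_left(s, n - 4))       # suffix(s,j) = k^(|s|-j) s
print(interval(s, 3, 6) == kill_left(kill_right(s, n - 6), 3 - 1))

# Non-associativity example from the slide.
left = kill_right("AT", 1) + "CT"    # (ATk)CT
right = "AT" + kill_left("CT", 1)    # AT(kCT)
print(left, right)
```

The last line shows the two groupings producing different strings, which is the sense in which concatenation with k fails to be associative.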
[Figure: two drawings of the same graph on the vertices v1, v2, v3, v4]
A graph, G = (V, E), is a collection of vertices, V = {va}, and edges, E = {(va, vb)}, where each edge is a pair of vertices in V.
If G is an undirected graph, (va, vb) = (vb, va) for all edges (va, vb) in E.
Example: G = (V, E) where V = {v1, v2, v3, v4} and E = {(v1, v2), (v1, v3), (v1, v4), (v2, v3), (v2, v4), (v3, v4)}
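The example graph can be written down almost verbatim in code. The sketch below is one possible Python encoding, not the course's own: each undirected edge is stored as a `frozenset` so that (va, vb) and (vb, va) are the same object, which is an implementation choice made here for illustration.

```python
from itertools import combinations

# Vertex set and edge list copied from the slide's example.
V = {"v1", "v2", "v3", "v4"}
edge_list = [("v1", "v2"), ("v1", "v3"), ("v1", "v4"),
             ("v2", "v3"), ("v2", "v4"), ("v3", "v4")]

# An undirected edge is an unordered pair, so store it as a frozenset.
E = {frozenset(pair) for pair in edge_list}

# (va, vb) = (vb, va) holds automatically with this encoding.
print(frozenset(("v1", "v2")) == frozenset(("v2", "v1")))

# The example contains every possible pair of distinct vertices.
all_pairs = {frozenset(pair) for pair in combinations(V, 2)}
print(len(V), len(E), E == all_pairs)
```

For a directed graph, plain tuples would be used instead, since the order of the two endpoints then matters.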
• G directed: (u,v) ≠ (v,u).
• G undirected: (u,v) = (v,u).
• Simple graph: no loops, (u,u) ∉ E, and no 2 copies of (u,v) ∈ E.
• |V| = # of vertices, |E| = # of edges.
• u and v are the endpoints of the edge (u,v). u and v are incident to (u,v).
• If (u,v) is directed, u is the tail of this edge and v is the head.
• u and v are adjacent if (u,v) is in E.
• The degree of v is the number of edges incident to it. If G is directed: the outdegree of v is the number of edges in E of the form (v,x); the indegree of v is the number of edges in E of the form (x,v).
• A graph is weighted if there is a real number associated to each edge (u,v). This real number is called the weight or the cost or the distance between u and v, depending on the application.
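Degree, indegree, and outdegree can be computed by counting edges. The following Python sketch uses small made-up graphs (the vertex names and weights are illustrative, not from the slides); directed edges are (tail, head) tuples and undirected edges are frozensets.

```python
from collections import Counter

# A small directed graph: each edge is (tail, head).
directed_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

# outdegree(v) counts edges of the form (v, x); indegree(v) counts (x, v).
outdegree = Counter(tail for tail, head in directed_edges)
indegree = Counter(head for tail, head in directed_edges)
print(outdegree["a"], indegree["a"])
print(outdegree["c"], indegree["c"])

# A small undirected graph: each edge is an unordered pair.
undirected_edges = [frozenset(e) for e in
                    [("v1", "v2"), ("v1", "v3"), ("v2", "v3"), ("v3", "v4")]]


def degree(v, edges):
    """Number of edges that have v as an endpoint."""
    return sum(1 for e in edges if v in e)


print(degree("v3", undirected_edges), degree("v4", undirected_edges))

# A weighted graph associates a real number with each edge.
weight = {("a", "b"): 2.5, ("a", "c"): 1.0, ("b", "c"): 4.0, ("c", "a"): 0.5}
print(weight[("a", "b")])
```

In a directed graph the indegrees and the outdegrees each sum to |E|, since every edge has exactly one head and one tail.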
G’ = (V’,E’) is a subgraph of G = (V,E) if V’ ⊆ V and E’ ⊆ E. If G’ is a subgraph of G:
• If G’ ≠ G, then G’ is a proper subgraph of G.
• If V = V’, G’ is a spanning subgraph of G.
• If V’ = {v | v is an endpoint of an edge in E’}, then G’ is the graph induced by E’.
• If E’ = {(v,w) ∈ E | v,w ∈ V’}, then G’ is the graph induced by V’.
• A path is an ordered list of distinct vertices (v1, v2, …, vk) such that (vi, vi+1) is an edge in G for 1 ≤ i < k.
• A cycle in an undirected graph is a list of vertices (v1, v2, …, vk) with vk = v1 such that each (vi, vi+1) is an edge in G and no edge is repeated. A simple cycle is a cycle where all vertices except the first and the last are distinct.
• A vertex v is reachable from vertex u if there is a path from u to v.
• The weight of a path is the sum of the weights of its edges.
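Induced subgraphs and path weights follow directly from these definitions. The Python sketch below works on a small weighted undirected graph invented for illustration; the helper names `induced_by_vertices`, `induced_by_edges`, `is_path`, and `path_weight` are hypothetical.

```python
# Weighted undirected graph: each edge (an unordered pair) maps to its weight.
weight = {
    frozenset(("a", "b")): 2.0,
    frozenset(("b", "c")): 1.5,
    frozenset(("c", "d")): 4.0,
    frozenset(("a", "c")): 3.0,
}
E = set(weight)


def induced_by_vertices(E, V_sub):
    """Edges of the subgraph induced by V_sub: both endpoints lie in V_sub."""
    return {e for e in E if e <= V_sub}


def induced_by_edges(E_sub):
    """Vertices of the subgraph induced by E_sub: all endpoints of its edges."""
    return set().union(*E_sub) if E_sub else set()


def is_path(E, vertices):
    """Distinct vertices whose consecutive pairs are all edges."""
    distinct = len(set(vertices)) == len(vertices)
    steps = zip(vertices, vertices[1:])
    return distinct and all(frozenset(step) in E for step in steps)


def path_weight(weight, vertices):
    """Sum of the weights of the edges along the path."""
    return sum(weight[frozenset(step)] for step in zip(vertices, vertices[1:]))


print(len(induced_by_vertices(E, {"a", "b", "c"})))
print(sorted(induced_by_edges({frozenset(("c", "d"))})))
print(is_path(E, ["a", "b", "c", "d"]), is_path(E, ["a", "b", "d"]))
print(path_weight(weight, ["a", "b", "c", "d"]))
```

Here `e <= V_sub` is a subset test on sets, which matches the condition that both endpoints of the edge belong to V’.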
• An undirected graph is connected if every vertex is reachable from every other vertex.
• The connected components of G are the maximal connected subgraphs of G, i.e. the connected subgraphs that are not proper subgraphs of any other connected subgraph of G.
• A directed graph is strongly connected if every vertex is reachable from every other vertex. A directed graph is weakly connected if the underlying undirected graph is connected (every vertex is reachable if we ignore edge direction). A directed graph is not connected if it is neither strongly nor weakly connected.
• An acyclic graph is a graph without cycles.
• A complete graph is a graph such that v, w ∈ V with v ≠ w implies (v,w) ∈ E.
• A bipartite graph G = (V,E) is a graph such that V = V1 ∪ V2, where V1 ∩ V2 = ∅ and every edge has one endpoint in V1 and the other endpoint in V2.
• A tree is a graph which is acyclic and connected. A forest is a graph whose connected components are trees.
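Connectedness, bipartiteness, and the forest property can all be tested with breadth-first search. The following Python sketch is one standard way to do this, not necessarily the approach used in the course; it assumes a simple undirected graph whose edges are given once each as (u, v) tuples, and the function names are illustrative.

```python
from collections import deque


def adjacency(V, E):
    """Map each vertex to the set of its neighbours."""
    adj = {v: set() for v in V}
    for u, v in E:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def connected_components(V, E):
    """Vertex sets of the connected components, found by BFS."""
    adj, seen, components = adjacency(V, E), set(), []
    for start in V:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u] - seen:
                seen.add(w)
                component.add(w)
                queue.append(w)
        components.append(component)
    return components


def is_bipartite(V, E):
    """Try to 2-colour the vertices so that every edge joins both colours."""
    adj, colour = adjacency(V, E), {}
    for start in V:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in colour:
                    colour[w] = 1 - colour[u]
                    queue.append(w)
                elif colour[w] == colour[u]:
                    return False
    return True


def is_forest(V, E):
    """A simple graph is acyclic exactly when |E| = |V| - (number of components)."""
    return len(E) == len(V) - len(connected_components(V, E))


V = {1, 2, 3, 4, 5, 6}
E = {(1, 2), (2, 3), (4, 5)}
print(sorted(sorted(c) for c in connected_components(V, E)))
print(is_bipartite(V, E), is_forest(V, E))

triangle_V, triangle_E = {1, 2, 3}, {(1, 2), (2, 3), (1, 3)}
print(is_bipartite(triangle_V, triangle_E), is_forest(triangle_V, triangle_E))
```

A graph is a tree when `is_forest` holds and there is exactly one connected component.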
Trees:
• A node is a vertex.
• A leaf is a node with degree 1. All other nodes are interior nodes.
• A tree is rooted if one of its nodes is distinguished. This distinguished node is called the root (denoted by r).
• If v is a node in the path from r to u, then v is an ancestor of u and u is a descendant of v.
• If u and v are adjacent and v is an ancestor of u, then v is the parent of u and u is the child of v. Note: leaves are nodes without children, interior nodes have children, and the root has no parent.
• The depth of a node v is the number of edges on the path from v to r.
• The lowest common ancestor of u and v is the deepest node that is an ancestor of both u and v (i.e., the closest node to u and v which is an ancestor of both u and v).
Interval graphs:
• An interval graph G = (V,E) is an undirected graph obtained from a collection C of intervals on the real line. To each interval in C there corresponds a vertex in G. The edge (u,v) is in E if and only if the intervals corresponding to u and v intersect.
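Both the rooted-tree definitions and the interval-graph construction can be sketched in a few lines of Python. The tree, the intervals, and all names below are made up for illustration. Two assumptions are made here: the rooted tree is stored as a child-to-parent map, and the intervals are closed, so intervals that share only an endpoint count as intersecting.

```python
from itertools import combinations

# Rooted tree stored as child -> parent; the root "r" has no parent.
parent = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a", "e": "d"}


def depth(v):
    """Number of edges on the path from v up to the root."""
    d = 0
    while parent[v] is not None:
        v = parent[v]
        d += 1
    return d


def lowest_common_ancestor(u, v):
    """Deepest node lying on both root paths (a node is on its own root path)."""
    ancestors_of_u = set()
    while u is not None:
        ancestors_of_u.add(u)
        u = parent[u]
    # Walk up from v; the first node already seen is the deepest common one.
    while v not in ancestors_of_u:
        v = parent[v]
    return v


def interval_graph(intervals):
    """Build (V, E) from named closed intervals {name: (low, high)}."""
    E = set()
    for (u, (a, b)), (v, (c, d)) in combinations(intervals.items(), 2):
        if max(a, c) <= min(b, d):      # closed intervals [a,b], [c,d] intersect
            E.add(frozenset((u, v)))
    return set(intervals), E


print(depth("e"), lowest_common_ancestor("c", "e"), lowest_common_ancestor("e", "b"))

V, E = interval_graph({"x": (0, 3), "y": (2, 5), "z": (6, 8)})
print(sorted(V), [sorted(e) for e in E])
```

With open intervals the test would use `<` instead of `<=`, so the choice between the two is part of the modelling rather than of the definition on the slide.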