Genome Assembly Charles Yan 2008
Fragment Assembly • Given a large number of fragments, such as ACC AC AT AC AT GG …, the goal is to reconstruct the original sequence, which contains each and every fragment.
Overlaps • The overlap between strings T and S, written overlap(T, S), is the longest suffix of S that is also a prefix of T. • Example: S = ATCGATCCG, T = CGATCCGATTAT, overlap(T, S) = CGATCCG
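The overlap definition above can be sketched in a few lines of Python. This is a minimal, quadratic-time illustration, not an optimized routine; the function name `overlap` and its argument order simply mirror the slide's notation overlap(T, S).

```python
def overlap(t: str, s: str) -> str:
    """Return the longest suffix of s that is also a prefix of t."""
    # Try the longest possible candidate first and shrink until one matches.
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return t[:k]
    return ""  # no non-empty overlap


# The example from the slide.
S = "ATCGATCCG"
T = "CGATCCGATTAT"
print(overlap(T, S))  # CGATCCG
```

Note that the definition is not symmetric: overlap(S, T) asks about suffixes of T and prefixes of S, and generally gives a different answer.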
A Simplified Problem • Shortest common superstring problem: Given a set of strings, find a minimum-length string S such that every input string appears as a substring of S.
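To make the problem statement concrete, here is a brute-force sketch that tries every ordering of the input strings and glues each ordering together with maximal overlaps. It runs in factorial time, so it is only meant for a handful of short strings. The helper names (`overlap_len`, `merge_in_order`, `shortest_superstring_bruteforce`) and the three toy fragments are assumptions made for illustration, not part of the slides.

```python
from itertools import permutations


def overlap_len(t, s):
    """Length of the longest suffix of s that is also a prefix of t."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def merge_in_order(strings):
    """Glue strings left to right, sharing the maximal overlap at each joint."""
    result = strings[0]
    for nxt in strings[1:]:
        result += nxt[overlap_len(nxt, result):]
    return result


def shortest_superstring_bruteforce(strings):
    # Drop duplicates and strings already contained in another input string;
    # they never change the answer.
    unique = set(strings)
    frags = sorted(s for s in unique
                   if not any(s != o and s in o for o in unique))
    if not frags:
        return ""
    best = None
    for order in permutations(frags):
        candidate = merge_in_order(order)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best


# Hypothetical toy input.
fragments = ["ATCG", "TCGA", "GATT"]
print(shortest_superstring_bruteforce(fragments))  # ATCGATT
```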
Directed Graph Model • Nodes: Each input fragment is a node (each node is labeled with an input fragment). • Edge (v, w) is labeled with overlap(W, V), where W and V are the node labels of w and v, respectively. The edge weight is |overlap(W, V)|. • Finding a superstring corresponds to finding a directed path that traverses each and every node exactly once (the Hamiltonian path problem). • Shortest superstring: a Hamiltonian path with the maximum total edge weight.
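The graph model above can be sketched as a dictionary of weighted edges plus an exhaustive search over Hamiltonian paths. This is an illustration under simplifying assumptions: fragments are distinct, none is contained in another, and the input is tiny (the search enumerates all orderings). The function names and the toy fragments are hypothetical.

```python
from itertools import permutations


def overlap_len(t, s):
    """Length of the longest suffix of s that is also a prefix of t."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def build_overlap_graph(fragments):
    """Edge (v, w) gets weight |overlap(W, V)|: suffix of V matching prefix of W."""
    return {
        (v, w): overlap_len(w, v)
        for v in fragments
        for w in fragments
        if v != w
    }


def best_hamiltonian_path(fragments):
    """Exhaustively find the node ordering with the largest total edge weight."""
    graph = build_overlap_graph(fragments)
    best_path, best_weight = None, -1
    for path in permutations(fragments):
        weight = sum(graph[(a, b)] for a, b in zip(path, path[1:]))
        if weight > best_weight:
            best_path, best_weight = path, weight
    return best_path, best_weight


fragments = ["ATCG", "TCGA", "GATT"]  # hypothetical toy input
path, weight = best_hamiltonian_path(fragments)
print(path, weight)  # ('ATCG', 'TCGA', 'GATT') 5
# Superstring length = total fragment length - path weight.
print(sum(len(f) for f in fragments) - weight)  # 7
```

Each unit of edge weight is one character shared between consecutive fragments, which is why maximizing the path weight minimizes the superstring length.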
Directed Graph Model • The problem is NP-complete (NPC). • No known efficient algorithm gives exact results for all cases. • Heuristics are used instead.
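The slides do not say which heuristic is meant. One common choice, shown here purely as an assumed example, is the greedy merge: repeatedly join the two fragments with the largest overlap until one string remains. It always produces a superstring, but not necessarily a shortest one. The function names and toy fragments are hypothetical.

```python
def overlap_len(t, s):
    """Length of the longest suffix of s that is also a prefix of t."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    # Drop exact duplicates (keeping order) and fragments contained in others.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        # Find the ordered pair (i, j) with the largest overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, s in enumerate(frags):
            for j, t in enumerate(frags):
                if i != j:
                    k = overlap_len(t, s)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the pair and put the merged string back into the pool.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


print(greedy_superstring(["ATCG", "TCGA", "GATT"]))  # ATCGATT
```

Each round costs a quadratic number of overlap computations, so the whole procedure is polynomial, in contrast to the exhaustive search over all orderings.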
Genome Assembly Difficulties • Repeats • Bidirectional nature of DNA • Errors
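As a small illustration of the bidirectional difficulty: a fragment may have been read from either strand of the double helix, so an assembler generally has to consider each fragment's reverse complement as well. The sketch below assumes fragments contain only the letters A, C, G and T; the function name is hypothetical.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATCG"))  # CGAT
```

Applying the function twice returns the original fragment, which is a quick sanity check for any implementation.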