Scaffolding Problems Gao Song 2010/04/27
Outline • Concepts • Problem definition • Non-error Case • Edge-error Case • Disconnected Components • Simulated Data • Future Work
Concepts • Contig: • Edge (PET): library size • Scaffolding: a sequence of contigs • Happy Edge: • Real distance <= expected distance • Orientations of both contigs are correct
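A minimal sketch of the happy-edge test, assuming each contig has a single scaffold coordinate and a forward/reverse flag. The names `Edge`, `is_happy`, `position`, and `forward` are hypothetical and not from the slides, and measuring the real distance as the difference of coordinates is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    a: str            # first contig
    b: str            # second contig
    a_forward: bool   # orientation the PET implies for contig a
    b_forward: bool   # orientation the PET implies for contig b
    max_dist: int     # expected distance (bounded by the library size)


def is_happy(edge, position, forward):
    """position: contig -> scaffold coordinate; forward: contig -> orientation flag."""
    # Condition 1: real distance <= expected distance.
    real_dist = abs(position[edge.b] - position[edge.a])
    # Condition 2: both contigs have the orientation the edge expects.
    orient_ok = (forward[edge.a] == edge.a_forward
                 and forward[edge.b] == edge.b_forward)
    return real_dist <= edge.max_dist and orient_ok
```

An edge that fails either condition counts as unhappy in the problem definition that follows.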
Problem Definition • Version 1: Given a set of contigs and a set of edges, find a scaffold which has at most p unhappy edges • Version 2: Given a set of contigs and a set of edges, find a scaffold which has at most p unhappy edges and is also the optimal solution
Non-error Case • Connected graph • Partial Layout: • Dangling Edge: only one end in partial layout • Active region: the sequence from the first contig having dangling edges to the end of partial layout; less than library size • Domain of a partial layout: all nodes in partial layout
Non-error Case • Theorem: if two partial layouts l1 and l2 have the same active region and dangling set, then • (1) they have the same domain • (2) either both or neither of them can be extended to a solution • Proof:
Procedure • Find the unassigned node • Select the nearest node as next assigned node • Update current partial layout • Remove all dangling edges incident to new node • Add new dangling edges of new node • Remove contigs from active region
Main Procedure • Find all nodes which have no ancestors and select one to start • From an active region, get all unassigned nodes, and update the partial layout • Remember all visited partial layouts • If the dangling edge set is empty, output the result
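The procedure above can be sketched as a depth-first search that remembers each visited (active region, dangling set) pair, so equivalent partial layouts are explored only once. This is a simplified illustration, not the presenter's implementation: it ignores orientation, tries every contig as a start instead of only nodes without ancestors, assumes a connected graph, and treats an edge as happy when the bases between its two contigs do not exceed `max_gap`. The function name `find_scaffold` and the input shapes are hypothetical.

```python
def find_scaffold(lengths, edges):
    """Return one contig order in which every edge is happy, or None.

    lengths: dict contig -> length
    edges:   list of (u, v, max_gap)
    Assumes a connected graph and ignores orientation (simplifications).
    """
    incident = {c: set() for c in lengths}
    for i, (u, v, _) in enumerate(edges):
        incident[u].add(i)
        incident[v].add(i)

    visited = set()  # (active region, dangling set) pairs already explored

    def extend(layout, active, dangling):
        if len(layout) == len(lengths):
            return list(layout)
        if (active, dangling) in visited:
            return None  # an equivalent partial layout was already tried
        visited.add((active, dangling))
        placed = set(layout)
        for c in lengths:
            if c in placed:
                continue
            closing = dangling & incident[c]   # dangling edges that c completes
            new_active = active + (c,)
            # Bases lying after each active contig, up to the layout's end.
            after, total = {}, 0
            for x in reversed(new_active):
                after[x] = total
                total += lengths[x]
            ok = True
            for i in dangling:
                u, v, max_gap = edges[i]
                anchor = u if u in placed else v
                # Closed edge: gap excludes c itself. Still-dangling edge:
                # its partner must come after c, so c already counts.
                need = after[anchor] - (lengths[c] if i in closing else 0)
                if need > max_gap:
                    ok = False
                    break
            if not ok:
                continue
            new_dangling = frozenset((dangling - closing) | (incident[c] - closing))
            # Trim the active region to start at the first contig with a dangling edge.
            anchored = {x for i in new_dangling for x in edges[i][:2]}
            start = 0
            while start < len(new_active) and new_active[start] not in anchored:
                start += 1
            found = extend(layout + (c,), new_active[start:], new_dangling)
            if found is not None:
                return found
        return None

    return extend((), (), frozenset())
```

The pruning on still-dangling edges plays the role of the "less than library size" bound on the active region: once a contig with a dangling edge falls too far behind the end of the layout, the branch is abandoned.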
Time and space complexity • Two possibilities • k vertices in active region – one possible next node • Fewer than k vertices in active region – n possible next nodes • Complexity • O(n^k) * O(1) • O(n^(k-1)) * O(n) • Total time complexity: O(n^k) • Total space complexity: store all visited partial layouts
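The two cases combine as follows, reading n as the number of contigs and k as the maximum number of contigs in an active region (this reading of the symbols is an assumption; the slide does not define them):

```latex
\underbrace{O(n^{k})}_{\text{full active regions}} \cdot O(1)
\;+\;
\underbrace{O(n^{k-1})}_{\text{shorter active regions}} \cdot O(n)
\;=\; O(n^{k}) + O(n^{k})
\;=\; O(n^{k})
```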
Introduce Edge Error • Types of edge error • Chimeric PETs: • Mapping error • Misassembled contigs • Solution • Filtering – filter chimeric PETs • Select x% of PETs • Shuffle them to get chimeric PETs • Cluster them to find threshold • Local threshold ...
Introduce Edge Error • There are p unhappy edges in final scaffolding • Partial layout • Dangling edges: real dangling edges; wrong edges
Equivalence Class • Active region, dangling edge set, count of current wrong edges • Same domain • Assumption: the partial order is a connected graph
Get Unassigned Nodes • Sort the unassigned nodes • Properties of nodes: • Steps to reach this node • Distance to the end of active region • Unhappy edges introduced due to this node
Sort Unassigned Nodes • Breadth-first search • Select the smallest possible distance: > threshold • Sort nodes: • Fewer than 5 steps: compare by distance; same distance: compare by unhappy edges
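One way to express this ordering is a sort key, sketched below. The slide only states the rule for nodes fewer than 5 steps away; placing farther nodes afterwards, ordered by step count first, is an assumption made here so that the key is total. `Candidate`, `MAX_STEPS`, and `sort_key` are hypothetical names.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    steps: int      # breadth-first steps needed to reach this node
    distance: int   # distance to the end of the active region
    unhappy: int    # unhappy edges introduced by placing this node


MAX_STEPS = 5  # cutoff taken from the slide


def sort_key(c):
    if c.steps < MAX_STEPS:
        # Near nodes: compare by distance, then by unhappy edges.
        return (0, c.distance, c.unhappy)
    # Far nodes come last; ordering them by steps first is an assumption.
    return (1, c.steps, c.distance, c.unhappy)


def sort_unassigned(candidates):
    return sorted(candidates, key=sort_key)
```

The leading 0 or 1 in the key keeps the two groups apart, so tuples of different lengths are never compared past their first element.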
Update Partial Layout • Check if all incident non-wrong dangling edges are happy • If yes, just remove all those edges and add the new node • If no, check if marking all unhappy edges as omitted would result in a disconnected graph • If no, just add the new node and remove the dangling edges • If yes, discard the current partial layout – to avoid inserting a disconnected component into the sequence • Add new dangling edges • Remove all dangling edges which are not happy – check connectedness
Main Procedure • If the active region is empty • Current connected component is finished • Check if the dangling edge set is empty • If yes, output the result • If no, use the dangling edges to find a new node and start another scaffolding
Disconnected Components • First find all the connected components and sort them according to the number of nodes • From the first component, find a solution, which omits p_1 edges • For the i-th component, if there is no solution that omits p - sum(p_1, ..., p_{i-1}) edges, remember all the stop points, return to the (i-1)-th component, and see if it can find a solution which omits fewer than p_{i-1} edges. If yes, continue from the stop point of the i-th component.
If the i-th component finishes the whole search and finds more than one solution, remember only the solution with minimum p_i. In the future, when the search comes to this component, just use this solution as part of the partial results.
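The first step, finding the connected components and ordering them by node count, can be sketched with a breadth-first traversal. The slide does not say whether the sort is ascending or descending; largest-first is an assumption here, and `components_by_size` is a hypothetical name. The budget backtracking across components is not shown.

```python
from collections import deque


def components_by_size(nodes, edges):
    """Return connected components as lists of nodes, largest first (assumed order)."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first traversal collects everything reachable from start.
        seen.add(start)
        comp = []
        queue = deque([start])
        while queue:
            x = queue.popleft()
            comp.append(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        components.append(comp)

    components.sort(key=len, reverse=True)
    return components
```

Each component can then be scaffolded in turn, with the i-th one given the remaining budget p - sum(p_1, ..., p_{i-1}) described above.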
Optimal Solution • Branch and Bound P’ edges
Simulated Data Result • Node Num: 1522 nodes • Contig length: 600 - 10,000
Future Work • Find the optimal solution • Wrong contigs • Repeats • How to deal with large p • Find a good way to sort the unassigned nodes