
Scaffolding Problems

Gao Song, 2010/04/27


Presentation Transcript


  1. Scaffolding Problems Gao Song 2010/04/27

  2. Outline • Concepts • Problem definition • Non-error Case • Edge-error Case • Disconnected Components • Simulated Data • Future Work

  3. Concepts • Contig: • Edge (PET): library size • Scaffolding: a sequence of contigs • Happy Edge: • Real distance <= expected distance • Orientations of both contigs are correct
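
The happy-edge test from slide 3 can be sketched as follows. This is a minimal illustration, not the author's implementation: the `Placement` and `Edge` types, their field names, and the choice to measure distance as the gap between the two contigs are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical record: where a contig sits in the scaffold."""
    start: int      # left coordinate in the scaffold
    length: int     # contig length
    forward: bool   # orientation chosen in the scaffold


@dataclass(frozen=True)
class Edge:
    """Hypothetical PET edge between contigs a and b."""
    a: str
    b: str
    a_forward: bool    # orientation of a implied by the PET
    b_forward: bool    # orientation of b implied by the PET
    max_distance: int  # expected distance bound (from the library size)


def is_happy(edge, layout):
    """An edge is happy if both orientations match and the real
    distance does not exceed the expected distance."""
    pa, pb = layout[edge.a], layout[edge.b]
    if (pa.forward, pb.forward) != (edge.a_forward, edge.b_forward):
        return False
    # Assumption: "real distance" is the gap between the two contigs.
    left, right = (pa, pb) if pa.start <= pb.start else (pb, pa)
    gap = right.start - (left.start + left.length)
    return gap <= edge.max_distance
```

With this helper, the number of unhappy edges of a scaffold is `sum(not is_happy(e, layout) for e in edges)`, which is the quantity bounded by p in the problem definition.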

  4. Problem Definition • Version 1: Given a set of contigs and a set of edges, find a scaffold which has at most p unhappy edges • Version 2: Given a set of contigs and a set of edges, find a scaffold which has at most p unhappy edges and is also the optimal solution

  5. Non-error Case • Connected graph • Partial Layout: • Dangling Edge: an edge with only one end in the partial layout • Active region: the sequence from the first contig that has dangling edges to the end of the partial layout; its length is less than the library size • Domain of a partial layout: all nodes in the partial layout

  6. Non-error Case • Theorem: if two partial layouts l1 and l2 have the same active region and the same dangling set, then • (1) they have the same domain • (2) either both or neither of them can be extended to a solution • Proof:

  7. Procedure • Find the unassigned node • Select the nearest node as next assigned node • Update current partial layout • Remove all dangling edges incident to new node • Add new dangling edges of new node • Remove contigs from active region

  8. Main Procedure • Find all nodes that have no ancestors and select one to start • From an active region, get all unassigned nodes, and update the partial layout • Remember all visited partial layouts • If the dangling edge set is empty, output the result

  9. Time and space complexity • Two possibilities • k vertices in active region – one possible next node • Fewer than k vertices in active region – n possible next nodes • Complexity • O(n^k) * O(1) • O(n^(k-1)) * O(n) • Total time complexity: O(n^k) • Total space complexity: store all visited partial orders
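
The non-error search of slides 5 to 8 can be sketched roughly as below. It is a simplified model under stated assumptions, not the presented algorithm itself: orientations are ignored, contigs are packed without gaps, each edge `(u, v, max_gap)` is assumed to mean "u precedes v with at most `max_gap` bases between them", the graph is assumed connected, and every candidate node is tried in name order rather than "nearest first". The names `find_layout`, `state_key`, and `feasible` are hypothetical.

```python
def find_layout(lengths, edges):
    """Return an ordering of contigs with every edge happy, or None.

    lengths: dict contig -> length
    edges:   list of (u, v, max_gap) tuples
    """
    nodes = set(lengths)
    seen = set()  # visited (active region, dangling set) states

    def state_key(order):
        placed = set(order)
        # Dangling edges have exactly one end in the partial layout.
        dangling = frozenset(
            (u, v) for u, v, _ in edges if (u in placed) != (v in placed)
        )
        anchored = {u if u in placed else v for u, v in dangling}
        # Active region: from the first contig with a dangling edge to the end.
        first = next(
            (i for i, c in enumerate(order) if c in anchored), len(order)
        )
        return tuple(order[first:]), dangling

    def feasible(order, new):
        pos = {c: i for i, c in enumerate(order)}
        for u, v, max_gap in edges:
            if v == new and u in pos:
                gap = sum(lengths[c] for c in order[pos[u] + 1:])
                if gap > max_gap:
                    return False
            if u == new and v in pos:
                return False  # v was placed before u: ordering violated
        return True

    def extend(order):
        if len(order) == len(nodes):
            return order
        key = state_key(order)
        if key in seen:  # equivalent layout already explored (slide 6)
            return None
        seen.add(key)
        for n in sorted(nodes - set(order)):
            if feasible(order, n):
                result = extend(order + [n])
                if result is not None:
                    return result
        return None

    return extend([])
```

Remembering states by active region and dangling set is what the theorem on slide 6 justifies: two partial layouts with the same key either both extend to a solution or neither does, so the second one can be skipped.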

  10. Introduce Edge Error • Types of edge error • Chimeric PETs: • Mapping error • Misassembled contigs • Solution • Filtering – filter chimeric PETs • Select x% of PETs • Shuffle them to get chimeric PETs • Cluster them to find threshold • Local threshold

  11. Introduce Edge Error • There are p unhappy edges in final scaffolding • Partial layout • Dangling edges: real dangling edges; wrong edges

  12. Equivalence Class • Active region, dangling edge set, count of current wrong edges • Same domain • Assumption: the partial order is a connected graph

  13. Get Unassigned Nodes • Sort the unassigned nodes • Properties of nodes: • Steps to reach this node • Distance to the end of active region • Unhappy edges introduced due to this node

  14. Sort Unassigned Nodes • Breadth-first search • Select the smallest possible distance: > threshold • Sort nodes: • Fewer than 5 steps: compare by distance; same distance: compare by unhappy edges
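
One possible reading of the ordering on slide 14 is sketched below. The `Candidate` record and `MAX_STEPS` constant are hypothetical names; the treatment of nodes that are 5 or more steps away (placed last, ordered by step count) is an assumption, since the slide does not say how they are ranked.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record for an unassigned node (properties from slide 13)."""
    name: str
    steps: int     # BFS steps needed to reach this node
    distance: int  # distance to the end of the active region
    unhappy: int   # unhappy edges introduced by placing this node


MAX_STEPS = 5  # threshold taken from the slide


def sort_candidates(candidates):
    def key(c):
        far = c.steps >= MAX_STEPS
        # Near nodes first, by distance, then by unhappy edges.
        # Assumption: far nodes follow, ordered by steps before distance.
        return (far, c.steps if far else 0, c.distance, c.unhappy)

    return sorted(candidates, key=key)
```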

  15. Update Partial Layout • Check whether all incident dangling edges not already marked wrong are happy • If yes, remove all those edges and add the new node • If no, check whether marking all unhappy edges as omitted would disconnect the graph • If no, add the new node and remove the dangling edges • If yes, discard the current partial layout, to avoid inserting a disconnected component into the sequence • Add the new dangling edges • Remove all dangling edges that are not happy, checking connectedness
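
The connectivity check used when unhappy edges are omitted could look like the sketch below. It assumes an undirected view of the contig graph with edges given as `(u, v)` pairs; the function name `connected_without` is hypothetical.

```python
def connected_without(nodes, edges, omitted):
    """Return True if the graph stays connected after dropping `omitted`.

    nodes:   non-empty list of contig names
    edges:   list of (u, v) pairs
    omitted: set of (u, v) pairs treated as wrong and ignored
    """
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        if (u, v) in omitted:
            continue
        adjacency[u].add(v)
        adjacency[v].add(u)

    # Depth-first traversal from an arbitrary start node.
    start = nodes[0]
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(nodes)
```

A partial layout would be discarded when this returns False for the set of edges it would have to omit.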

  16. Main Procedure • If the active region is empty • The current connected component is finished • Check whether the dangling edge set is empty • If yes, output the result • If no, use the dangling edges to find a new node and start another scaffolding

  17. Disconnected Components • First find all the connected components and sort them by number of nodes • From the first component, find a solution, which omits p_1 edges • For the i-th component, if there is no solution that omits at most p - sum(p_1, ..., p_(i-1)) edges, remember all the stop points, return to the (i-1)-th component, and see whether it has a solution that omits fewer than p_(i-1) edges. If yes, continue from the stop point of the i-th component.

  18. If the i-th component finishes the whole search and finds more than one solution, remember only the solution with minimum p_i. Later, whenever the search comes to this component, use this solution as part of the partial result.
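
The first step of slide 17, together with the budget bookkeeping, can be sketched as follows. Two assumptions are made: components are sorted largest first (the slide does not give the direction), and each component's minimum number of omitted edges p_i is already known, as slide 18 suggests once a component has been fully searched. The backtracking between components is not shown. `connected_components` and `fits_budget` are hypothetical names.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Return the components as sorted lists, largest first (assumed order)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        component = []
        queue = deque([start])
        while queue:
            current = queue.popleft()
            component.append(current)
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return sorted(components, key=len, reverse=True)


def fits_budget(min_omitted, p):
    """Check that the per-component minima p_1, p_2, ... sum to at most p."""
    spent = 0
    for p_i in min_omitted:
        spent += p_i
        if spent > p:
            return False
    return True
```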

  19. Optimal Solution • Branch and Bound P’ edges

  20. Simulated Data Result • Node Num: 1522 nodes • Contig length: 600 - 10,000

  21. Future Work • Find the optimal solution • Wrong contigs • Repeats • How to deal with large p • Find a good way to sort the unassigned nodes

  22. Thank you
