
An Algorithm For Exploring Patterns In Clinical Genomic Data

Richard Mushlin and Aaron Kershenbaum, IBM T.J. Watson Research Center


Presentation Transcript


  1. An Algorithm For Exploring Patterns In Clinical Genomic Data • Richard Mushlin and Aaron Kershenbaum, IBM T.J. Watson Research Center

  2. Some Questions • What do people with a disease (cases) have in common that people without the disease (controls) don’t? • How “exact” is the answer? • What does the answer tell us about the disease etiology?

  3. Our Approach • Find patterns in cases • Compare frequencies with same patterns in controls • Score patterns • Find relationships between patterns • Evaluate effect of patterns on biological pathways

  4. Find Patterns • Combination of: brute force, thresholds, directed search

  5. Raw Data

  6. Bipartite Graph Representation • Feature values: “F1=a”, “F2=a”, “F2=b”, “F3=a”, “F3=c” • People: P1, P2, P3 • [edges between people and feature values are shown in the original figure]
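
The bipartite representation can be sketched in a few lines of Python. This is only an illustration: the `records` table is invented (the transcript does not capture the raw data or the figure's edges), and `build_bipartite` is a hypothetical helper name. Each node on the feature side is a "feature=value" label, and each edge links a person to a value that person has.

```python
from collections import defaultdict

# Invented raw data: one record per person, mapping feature -> value.
# The actual table from the "Raw Data" slide is not in the transcript.
records = {
    "P1": {"F1": "a", "F2": "a", "F3": "a"},
    "P2": {"F1": "a", "F2": "b", "F3": "c"},
    "P3": {"F1": "a", "F2": "a", "F3": "c"},
}

def build_bipartite(records):
    """Return both sides of the bipartite graph as adjacency maps."""
    people_of = defaultdict(set)   # feature value -> people who have it
    values_of = {}                 # person -> feature values they have
    for person, features in records.items():
        labels = {f"{name}={value}" for name, value in features.items()}
        values_of[person] = labels
        for label in labels:
            people_of[label].add(person)
    return dict(people_of), values_of

people_of, values_of = build_bipartite(records)
for label in sorted(people_of):
    print(label, sorted(people_of[label]))
```

Keeping both directions of the adjacency makes the later set operations (intersecting people, collecting shared feature values) straightforward.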

  7. Find Maximal Bicliques • Biclique: a subgraph of the bipartite graph in which all people are connected to all feature values • Maximal biclique: cannot add a person to the biclique without losing a feature value, and cannot add a feature value without losing people
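
A minimal sketch of the maximality test described above, assuming the graph is stored as a map from feature value to the set of people who have it. The function names are hypothetical, and the `people_of` data is inferred from the node list on the Lattice Example slide (feature D is assumed to belong to people 1 and 4 only), so treat it as an illustration rather than the authors' data.

```python
# Feature value -> people, inferred from the Lattice Example slide (assumption).
people_of = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {1, 3, 4},
    "D": {1, 4},
}

def is_biclique(values, people, people_of):
    """True if every person is connected to every feature value."""
    return all(people <= people_of[v] for v in values)

def is_maximal_biclique(values, people, people_of):
    """True if no person and no feature value can be added without loss."""
    if not values or not people or not is_biclique(values, people, people_of):
        return False
    # People shared by all chosen feature values: adding anyone else
    # would force us to drop a feature value.
    shared_people = set.intersection(*(set(people_of[v]) for v in values))
    # Feature values shared by all chosen people: adding any other value
    # would force us to drop a person.
    shared_values = {v for v, ppl in people_of.items() if people <= ppl}
    return shared_people == people and shared_values == values

print(is_maximal_biclique({"A", "B"}, {2, 3}, people_of))  # True
print(is_maximal_biclique({"D"}, {1, 4}, people_of))       # False: C can be added
print(is_maximal_biclique({"A"}, {1, 2}, people_of))       # False: person 3 can be added
```

This sketch rejects bicliques with an empty side, so the two "Null" nodes of the lattice are outside its scope.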

  8. Adjacency Representation

  9. Start With Singletons

  10. Save Acceptable Candidates • A candidate biclique [{S},{D}] ({S} = feature values, {D} = people) is acceptable (so far) if: • It is maximal • {S} has “enough” elements • {D} has “enough” elements • Its score (figure of merit) is “good enough” • Thresholds for {S}, {D}, and score are set as parameters • Candidates are saved in a priority queue by score

  11. Compute Neighbor Set • For candidate C = [{Sc},{Dc}], the neighbor set {Nc} is the set of feature values in the original data that are connected to at least one person in {Dc} • For singleton candidate [{S1},{D1,D3}], the neighbor set is {S4, S5, S7, S9}
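
The neighbor-set rule can be sketched as a single set comprehension. The adjacency table below is invented so that the result reproduces the slide's example; the real table from the Adjacency Representation slide is not in the transcript. Excluding values already in the candidate is an assumption, chosen because the slide's example result does not list S1.

```python
# Invented adjacency (feature value -> people), chosen to match the slide's example.
people_of = {
    "S1": {"D1", "D3"},
    "S2": {"D2"},
    "S4": {"D1"},
    "S5": {"D2", "D3"},
    "S7": {"D1", "D4"},
    "S9": {"D3"},
}

def neighbor_set(values, people, people_of):
    """Feature values outside the candidate that touch at least one of its people."""
    return {
        v for v, ppl in people_of.items()
        if v not in values and ppl & people
    }

print(sorted(neighbor_set({"S1"}, {"D1", "D3"}, people_of)))
# ['S4', 'S5', 'S7', 'S9']
```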

  12. Expand Candidates • Pop “current best” candidate Co off the priority queue • Create new candidates from neighbors {NCi} by taking unions and intersections: • SCi = {SCo} ∪ {SNi} • DCi = {DCo} ∩ {DNi} • Save acceptable candidates in the queue
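
The steps so far (start with singletons, keep acceptable candidates in a priority queue, pop the best, expand through neighbors) can be put together as a rough sketch. Everything named here is hypothetical: `find_bicliques`, the default score (number of feature values times number of people), and the `people_of` data, which is inferred from the Lattice Example slide. The working-queue size limit and trigger from the later slides are not modeled, and no claim is made that this loop finds every maximal biclique on arbitrary data.

```python
import heapq
from itertools import count

def is_maximal(values, people, people_of):
    shared_people = set.intersection(*(set(people_of[v]) for v in values))
    shared_values = {v for v, ppl in people_of.items() if people <= ppl}
    return shared_people == people and shared_values == values

def find_bicliques(people_of, min_values=1, min_people=1,
                   score=lambda values, people: len(values) * len(people)):
    heap, seen, tie = [], set(), count()

    def save_if_acceptable(values, people):
        if (values, people) in seen:
            return
        seen.add((values, people))
        if (len(values) >= min_values and len(people) >= min_people
                and is_maximal(values, people, people_of)):
            # heapq is a min-heap, so push the negated score; the counter
            # breaks ties without comparing the sets themselves.
            heapq.heappush(heap, (-score(values, people), next(tie), values, people))

    # Start with singletons: one feature value and everyone who has it.
    for v, ppl in people_of.items():
        save_if_acceptable(frozenset([v]), frozenset(ppl))

    found = []
    while heap:
        _, _, values, people = heapq.heappop(heap)   # current best candidate
        found.append((values, people))
        for n, ppl in people_of.items():             # walk the neighbor set
            if n not in values and people & ppl:
                # Union on the feature side, intersection on the people side.
                save_if_acceptable(values | {n}, people & ppl)
    return found

# Data inferred from the Lattice Example slide (assumption).
people_of = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {1, 3, 4}, "D": {1, 4}}
for values, people in find_bicliques(people_of):
    print(",".join(sorted(values)), ";", sorted(people))
```

On this small data set the sketch returns the ten lattice nodes that have both a non-empty feature set and a non-empty people set.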

  13. Expansion Example

  14. F.O.M. And The Priority Queue • Various criteria can be used to calculate a figure of merit (F.O.M., or score) • Working queue size is set as a parameter • Working queue is allowed to fill up • Buffer is emptied when the trigger is reached

  15. F.O.M. And Search Strategy • The search strategy is embodied in the evaluation of the “<” operator for bicliques • The candidate queue is prioritized in the same order as the scores • The scoring function can be externalized • The search strategy can be changed without changing the search machinery
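
One way to picture this design, as a sketch only: a candidate type whose "<" operator delegates to a pluggable scoring function, so the same priority queue yields a different search order when the scorer is swapped. The class name `Biclique` and the two scorers `by_area` and `by_people` are hypothetical and do not come from the slides.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable

@dataclass(frozen=True)
class Biclique:
    values: frozenset
    people: frozenset
    score_fn: Callable = field(compare=False, repr=False)

    def __lt__(self, other):
        # heapq pops the "smallest" item, so "smaller" here means "better score".
        return self.score_fn(self) > other.score_fn(other)

def by_area(b):
    return len(b.values) * len(b.people)

def by_people(b):
    return len(b.people)

def pop_order(score_fn):
    """Queue the same two candidates and report the order they come out."""
    heap = [
        Biclique(frozenset("AB"), frozenset({2, 3}), score_fn),
        Biclique(frozenset("A"), frozenset({1, 2, 3}), score_fn),
    ]
    heapq.heapify(heap)
    return ["".join(sorted(heapq.heappop(heap).values)) for _ in range(2)]

print(pop_order(by_area))    # ['AB', 'A']
print(pop_order(by_people))  # ['A', 'AB']
```

The queue code never changes; only the function passed in as `score_fn` does.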

  16. Scoring Case/Control Problems • [table of quantities grouped as Given, Measured, and Derived; contents not captured in the transcript]

  17. Scoring Example • Odds ratio: $\mathrm{OR} = \dfrac{a \cdot d}{b \cdot c}$ • $\mathrm{FOM} = \lvert \log(\mathrm{OR}) \rvert$
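
A small sketch of this score in Python. The cell layout is an assumption: a and b are taken to be the numbers of cases and controls that match the pattern, and c and d the cases and controls that do not, which is consistent with the margins used on the Statistical Significance slide. The function name is hypothetical, and raising an error on a zero cell is a choice made here, since the slides do not say how such tables are handled.

```python
import math

def figure_of_merit(a, b, n_cases, n_controls):
    """abs(log(odds ratio)) for a pattern seen in `a` cases and `b` controls."""
    c = n_cases - a        # cases without the pattern (derived)
    d = n_controls - b     # controls without the pattern (derived)
    if 0 in (a, b, c, d):
        raise ValueError("odds ratio is zero, infinite, or undefined for this table")
    odds_ratio = (a * d) / (b * c)
    return abs(math.log(odds_ratio))

# Pattern found in 30 of 50 cases and 10 of 50 controls: OR = (30*40)/(10*20) = 6.
print(figure_of_merit(30, 10, 50, 50))
# Swapping cases and controls gives OR = 1/6 and the same figure of merit.
print(figure_of_merit(10, 30, 50, 50))
```

Taking the absolute value of the log means that a pattern enriched in controls scores as highly as one equally enriched in cases.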

  18. Statistical Significance • $p = \dfrac{(a+b)!\,(c+d)!\,N_{\text{cases}}!\,N_{\text{controls}}!}{a!\,b!\,c!\,d!\,N_{\text{total}}!}$ • $q = \sum p$, for all tables with the same margins and a better FOM • $\mathrm{FOM}_{\text{adj}} = \mathrm{FOM} \cdot (1 - q)$
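
The adjustment can be sketched by enumerating every 2x2 table with the same margins. Several choices here are assumptions rather than anything stated on the slide: the cell layout (a, b = cases and controls with the pattern; c, d = cases and controls without it), reading "better" as strictly greater, treating a table with one zero diagonal product as having infinite FOM, and the function names. The probability is written with binomial coefficients, which is algebraically the same as the factorial formula above.

```python
import math

def fom(a, b, c, d):
    num, den = a * d, b * c
    if num == 0 and den == 0:
        return 0.0            # assumption: undefined odds ratio carries no signal
    if num == 0 or den == 0:
        return math.inf       # assumption: extreme table, best possible FOM
    return abs(math.log(num / den))

def table_probability(a, b, c, d):
    """Probability of this table given its margins (hypergeometric)."""
    return (math.comb(a + c, a) * math.comb(b + d, b)
            / math.comb(a + b + c + d, a + b))

def adjusted_fom(a, b, c, d, tol=1e-12):
    n_cases, n_controls, with_pattern = a + c, b + d, a + b
    observed = fom(a, b, c, d)
    q = 0.0
    lowest = max(0, with_pattern - n_controls)
    highest = min(with_pattern, n_cases)
    for a2 in range(lowest, highest + 1):      # all tables with the same margins
        b2 = with_pattern - a2
        c2, d2 = n_cases - a2, n_controls - b2
        if fom(a2, b2, c2, d2) > observed + tol:   # strictly better FOM
            q += table_probability(a2, b2, c2, d2)
    return observed * (1.0 - q), q

adj, q = adjusted_fom(2, 1, 1, 2)
print(q, adj)
```

With three cases and three controls, the table (2, 1, 1, 2) has two more extreme neighbors, each with probability 1/20, so q comes out at 0.1 and the score is scaled by 0.9.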

  19. Structure Of The Output • Algorithm yields a collection of related patterns (bicliques) • Question of when to “lump” and when to “split” related patterns • Lattice structure helps us decide

  20. Lattice (simplified) • A lattice can be represented as a graph with special properties (Hasse diagram) • In the context of bicliques, each node B is characterized by 2 sets: S and D • A directed edge exists from node B1 to B2 if and only if • S1 is a subset of S2 and • D2 is a subset of D1

  21. Lattice Example (each node written as S; D) • Null; 1,2,3,4 • A; 1,2,3 • B; 2,3,4 • C; 1,3,4 • A,B; 2,3 • A,C; 1,3 • B,C; 3,4 • C,D; 1,4 • A,B,C; 3 • A,C,D; 1 • B,C,D; 4 • A,B,C,D; Null
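
As an illustration, the edge rule from the previous slide can be applied to the twelve nodes of this example. The sketch below keeps only covering edges (no third node strictly in between), which is what a Hasse diagram draws; that filtering step and the name `hasse_edges` are assumptions, since the slide states only the subset condition.

```python
# Nodes of the Lattice Example slide: feature set S -> people set D.
nodes = {
    frozenset(): frozenset({1, 2, 3, 4}),
    frozenset("A"): frozenset({1, 2, 3}),
    frozenset("B"): frozenset({2, 3, 4}),
    frozenset("C"): frozenset({1, 3, 4}),
    frozenset("AB"): frozenset({2, 3}),
    frozenset("AC"): frozenset({1, 3}),
    frozenset("BC"): frozenset({3, 4}),
    frozenset("CD"): frozenset({1, 4}),
    frozenset("ABC"): frozenset({3}),
    frozenset("ACD"): frozenset({1}),
    frozenset("BCD"): frozenset({4}),
    frozenset("ABCD"): frozenset(),
}

def hasse_edges(nodes):
    """Edges B1 -> B2 where S1 < S2, D2 < D1, and nothing sits in between."""
    edges = []
    for s1, d1 in nodes.items():
        for s2, d2 in nodes.items():
            if s1 < s2 and d2 < d1:                        # proper subsets
                if not any(s1 < s3 < s2 for s3 in nodes):  # covering edges only
                    edges.append((s1, s2))
    return edges

def label(s):
    return ",".join(sorted(s)) or "Null"

for s1, s2 in hasse_edges(nodes):
    print(label(s1), "->", label(s2))
```

Each edge adds feature values and removes people, which is the "+S, -D" step referred to when reading the lattice.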

  22. Lattice Score jump

  23. Real SNP Data (a few rows)

  24. SNP Lattice

  25. “Reading” The Lattice • [figure labels] Background context: small Ns, large Nd, low FOM • Score jump (+S, -D): effect of adding S in the context of [node shown in figure]; high FOM • Children of [node shown in figure] (+S’ -D’ with lower FOM, +S” -D” with higher FOM) may have better or worse scores, but are “similar” to [node shown in figure]; scores similar
