An Algorithm For Exploring Patterns In Clinical Genomic Data

An Algorithm For Exploring Patterns In Clinical Genomic Data. Richard Mushlin and Aaron Kershenbaum, IBM T.J. Watson Research Center

Some Questions • What do people with a disease (cases) have in common that people without the disease (controls) don’t? • How “exact” is the answer? • What does the answer tell us about the disease etiology?

Our Approach • Find patterns in cases • Compare frequencies with same patterns in controls • Score patterns • Find relationships between patterns • Evaluate effect of patterns on biological pathways

Find Patterns • Combination of brute force, thresholds, and directed search

Raw Data

Bipartite Graph Representation [Figure: bipartite graph linking feature values (“F1=a”, “F2=a”, “F2=b”, “F3=a”, “F3=c”) to people (P1, P2, P3)]
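A minimal sketch of how such a bipartite graph might be built from tabular records. The records below are hypothetical (the edges in the slide's figure are not reproduced here), and `build_bipartite`, `people_of`, and `features_of` are illustrative names rather than the authors' own.

```python
from collections import defaultdict

# Hypothetical records: person -> {feature: value}. Not the slide's data.
records = {
    "P1": {"F1": "a", "F2": "a", "F3": "a"},
    "P2": {"F1": "a", "F2": "b", "F3": "a"},
    "P3": {"F1": "a", "F2": "b", "F3": "c"},
}

def build_bipartite(records):
    """Return the two adjacency maps of the bipartite graph."""
    people_of = defaultdict(set)    # feature value -> people who have it
    features_of = defaultdict(set)  # person -> feature values they have
    for person, features in records.items():
        for name, value in features.items():
            node = f"{name}={value}"  # one node per (feature, value) pair
            people_of[node].add(person)
            features_of[person].add(node)
    return dict(people_of), dict(features_of)

people_of, features_of = build_bipartite(records)
```

Each distinct feature value becomes its own node, so a person is linked to exactly one node per feature.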

Find Maximal Bicliques • Biclique: • Subgraph of bipartite graph • All people are connected to all feature values • Maximal biclique: • Cannot add person to biclique without losing feature value • Cannot add feature value to biclique without losing people
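The two maximality conditions can be sketched as a direct check. This is an illustration under assumptions, not the authors' code: `people_of` is a hypothetical adjacency map (feature value to the set of people who have it), chosen to be consistent with the deck's lattice example.

```python
# Hypothetical adjacency: feature value -> people connected to it.
people_of = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {1, 3, 4},
    "D": {1, 4},
}

def is_biclique(S, D, people_of):
    """Every feature value in S is connected to every person in D."""
    return all(D <= people_of[s] for s in S)

def is_maximal(S, D, people_of):
    """True if neither side can grow without shrinking the other side."""
    if not S or not D or not is_biclique(S, D, people_of):
        return False
    # People shared by all feature values in S: adding anyone else loses a feature value.
    shared_people = set.intersection(*(people_of[s] for s in S))
    # Feature values shared by all people in D: adding any other loses a person.
    shared_features = {s for s, people in people_of.items() if D <= people}
    return shared_people == set(D) and shared_features == set(S)
```

For instance, `is_maximal({"A", "B"}, {2, 3}, people_of)` holds, while `({"D"}, {1, 4})` is a biclique but not maximal, because feature value C can be added without losing either person.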

Adjacency Representation

Start With Singletons

Save Acceptable Candidates • Candidate is acceptable (so far) if: • Biclique [{S},{D}] is maximal • {S} has “enough” elements • {D} has “enough” elements • Score (figure of merit) is “good enough” • Thresholds for the sizes of {S} and {D} and for the score are set as parameters • Candidates are saved in a priority queue ordered by score
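A minimal sketch of the acceptance test and the score-ordered queue, using Python's `heapq`. The threshold names and default values (`min_s`, `min_d`, `min_score`) are assumptions for illustration; the slides do not give concrete values, and the maximality flag is assumed to be computed elsewhere.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Candidate:
    neg_score: float                      # heapq is a min-heap, so store -score
    S: frozenset = field(compare=False)   # feature values
    D: frozenset = field(compare=False)   # people

def acceptable(S, D, score, maximal, min_s=1, min_d=2, min_score=0.0):
    """Apply the maximality flag and the three thresholds (hypothetical defaults)."""
    return maximal and len(S) >= min_s and len(D) >= min_d and score >= min_score

queue = []

def save(S, D, score, maximal):
    """Push the candidate onto the priority queue if it is acceptable."""
    if acceptable(S, D, score, maximal):
        heapq.heappush(queue, Candidate(-score, frozenset(S), frozenset(D)))
        return True
    return False
```

Popping with `heapq.heappop(queue)` then returns the highest-scoring saved candidate.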

Compute Neighbor Set • For candidate C = [{Sc},{Dc}], the neighbor set {Nc} is the set of feature values in the original data that are connected to at least one person in {Dc} • For singleton candidate [{S1},{D1,D3}], the neighbor set is {S4, S5, S7, S9}
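The neighbor-set definition can be sketched in a few lines. Two assumptions are made here: `people_of` is a hypothetical adjacency map (feature value to people), and feature values already in {Sc} are excluded, which is inferred from the slide's example, where S1 does not appear in its own neighbor set.

```python
# Hypothetical adjacency: feature value -> people connected to it.
people_of = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {1, 3, 4},
    "D": {1, 4},
}

def neighbor_set(S_c, D_c, people_of):
    """Feature values outside S_c that share at least one person with D_c."""
    return {
        s for s, people in people_of.items()
        if s not in S_c and people & D_c
    }
```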

Expand Candidates • Pop the “current best” candidate Co off the priority queue • Create new candidates Ci from neighbors {NCi} by taking unions and intersections: • SCi = {SCo} ∪ {SNi} • DCi = {DCo} ∩ {DNi} • Save acceptable candidates in the queue
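The union and intersection step might look like the following sketch. It treats each neighbor as a singleton feature value with its own set of people; `people_of` and `expand` are illustrative names, and the data is hypothetical.

```python
# Hypothetical adjacency: feature value -> people connected to it.
people_of = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {1, 3, 4},
    "D": {1, 4},
}

def expand(S_o, D_o, neighbors, people_of):
    """Build one new candidate per neighbor: S grows by union, D shrinks by intersection."""
    new_candidates = []
    for n in sorted(neighbors):                  # sorted only for reproducible order
        S_i = frozenset(S_o) | {n}               # S_Ci = S_Co U S_Ni
        D_i = frozenset(D_o) & people_of[n]      # D_Ci = D_Co ∩ D_Ni
        new_candidates.append((S_i, D_i))
    return new_candidates

children = expand({"A"}, {1, 2, 3}, {"B", "C", "D"}, people_of)
```

Not every child is a maximal biclique: here the child with feature values A and D keeps only person 1, who also has C, so the acceptance test would be expected to filter it out.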

Expansion Example

F.O.M. And The Priority Queue • Various criteria can be used to calculate a figure of merit (score) • Working queue size is set as a parameter • Working queue is allowed to fill up • Buffer is emptied when the trigger is reached

F.O.M. And Search Strategy • The search strategy is embodied in the evaluation of the “<” operator for bicliques • The candidate queue is prioritized in the same order as the scores • The scoring function can be externalized • The search strategy can be changed without changing the search machinery
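One way to picture this design is a biclique class whose `<` operator delegates to a pluggable scoring function, so the queue code never changes. The sketch below is an assumption about how this could be arranged in Python; the two scoring functions (`by_area`, `by_feature_count`) are hypothetical and are not the scores used in the talk.

```python
import heapq

class Biclique:
    scorer = None  # scoring function, assigned from outside the class

    def __init__(self, S, D):
        self.S, self.D = frozenset(S), frozenset(D)

    def __lt__(self, other):
        # heapq pops the "smallest" item first, so a higher score sorts as "less than".
        return Biclique.scorer(self) > Biclique.scorer(other)

def by_area(b):
    """Hypothetical score: number of feature values times number of people."""
    return len(b.S) * len(b.D)

def by_feature_count(b):
    """Hypothetical score: favor patterns with more feature values."""
    return len(b.S)

def best(candidates, scorer):
    """Same queue machinery, different strategy depending on the scorer."""
    Biclique.scorer = scorer
    heap = list(candidates)
    heapq.heapify(heap)
    return heapq.heappop(heap)
```

Swapping `by_area` for `by_feature_count` changes which candidate is popped first without touching `best` or the heap handling.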

Scoring Case/Control Problems [Table: Given, Measured, Derived]

Scoring Example • Odds Ratio: $\mathrm{OR} = \dfrac{a \cdot d}{b \cdot c}$ • $\mathrm{FOM} = \left|\log(\mathrm{OR})\right|$
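A small worked sketch of this score. The meaning of the four counts is an assumption inferred from the margins used elsewhere in the deck: a = cases with the pattern, b = controls with the pattern, c = cases without it, d = controls without it. The sketch also assumes all four counts are nonzero.

```python
import math

def odds_ratio(a, b, c, d):
    """OR = (a * d) / (b * c); assumes b and c are nonzero."""
    return (a * d) / (b * c)

def fom(a, b, c, d):
    """Figure of merit: absolute value of the natural log of the odds ratio."""
    return abs(math.log(odds_ratio(a, b, c, d)))
```

Taking the absolute value makes the score symmetric: a pattern that is four times as common in cases (OR = 4) and one that is four times as common in controls (OR = 0.25) receive the same FOM.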

Statistical Significance • $p = \dfrac{(a+b)!\,(c+d)!\,(N_{\text{cases}})!\,(N_{\text{controls}})!}{a!\,b!\,c!\,d!\,(N_{\text{total}})!}$ • $q = \sum p$, for all tables with the same margins and a better FOM • $\mathrm{FOM}_{\text{adj}} = \mathrm{FOM} \cdot (1 - q)$
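The adjustment can be sketched by enumerating every 2x2 table that shares the observed margins. Several choices below are assumptions, not statements from the slides: a and c are taken to be the case counts (so Ncases = a + c and Ncontrols = b + d), “better” is read as strictly greater, a table with a zero cell gets an infinite FOM, and a table whose odds ratio is 0/0 gets a FOM of 0.

```python
import math
from math import factorial

def table_prob(a, b, c, d):
    """Probability of one 2x2 table given its margins (the slide's formula for p)."""
    n = a + b + c + d
    num = factorial(a + b) * factorial(c + d) * factorial(a + c) * factorial(b + d)
    den = factorial(a) * factorial(b) * factorial(c) * factorial(d) * factorial(n)
    return num / den

def fom(a, b, c, d):
    """|log(OR)| with assumed conventions for zero cells."""
    num, den = a * d, b * c
    if num == 0 and den == 0:
        return 0.0            # assumption: undefined odds ratio carries no signal
    if num == 0 or den == 0:
        return math.inf       # assumption: a zero cell gives an unbounded score
    return abs(math.log(num / den))

def adjusted_fom(a, b, c, d):
    """FOM * (1 - q), where q sums p over same-margin tables with a better FOM."""
    observed = fom(a, b, c, d)
    row1, n_cases, n = a + b, a + c, a + b + c + d
    q = 0.0
    # x plays the role of cell a; the other three cells follow from the margins.
    for x in range(max(0, row1 + n_cases - n), min(row1, n_cases) + 1):
        table = (x, row1 - x, n_cases - x, n - row1 - n_cases + x)
        if fom(*table) > observed + 1e-12:   # small tolerance for float ties
            q += table_prob(*table)
    return observed * (1 - q)
```

For the table (3, 1, 1, 3), only the two tables with a zero cell score better, each with probability 1/70, so the FOM of log 9 is scaled by 68/70.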

Structure Of The Output • Algorithm yields a collection of related patterns (bicliques) • Question of when to “lump” and when to “split” related patterns • Lattice structure helps us decide

Lattice (simplified) • A lattice can be represented as a graph with special properties (Hasse diagram) • In the context of bicliques, each node B is characterized by 2 sets: S and D • A directed edge exists from node B1 to B2 if and only if • S1 is a subset of S2 and • D2 is a subset of D1

Lattice Example • Nodes, each written as feature values; people: • Null; 1,2,3,4 • A; 1,2,3 • B; 2,3,4 • C; 1,3,4 • A,B; 2,3 • A,C; 1,3 • B,C; 3,4 • C,D; 1,4 • A,B,C; 3 • A,C,D; 1 • B,C,D; 4 • A,B,C,D; Null
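The edge rule can be tried on the example nodes above. In this sketch, `order_pairs` applies the subset conditions literally, and `hasse_edges` additionally drops pairs that are implied by an intermediate node, which is how a Hasse diagram is normally drawn. The function names and short node labels are illustrative.

```python
# Example nodes: label -> (feature values S, people D).
nodes = {
    "0":    (frozenset(),       frozenset({1, 2, 3, 4})),
    "A":    (frozenset("A"),    frozenset({1, 2, 3})),
    "B":    (frozenset("B"),    frozenset({2, 3, 4})),
    "C":    (frozenset("C"),    frozenset({1, 3, 4})),
    "AB":   (frozenset("AB"),   frozenset({2, 3})),
    "AC":   (frozenset("AC"),   frozenset({1, 3})),
    "BC":   (frozenset("BC"),   frozenset({3, 4})),
    "CD":   (frozenset("CD"),   frozenset({1, 4})),
    "ABC":  (frozenset("ABC"),  frozenset({3})),
    "ACD":  (frozenset("ACD"),  frozenset({1})),
    "BCD":  (frozenset("BCD"),  frozenset({4})),
    "ABCD": (frozenset("ABCD"), frozenset()),
}

def order_pairs(nodes):
    """All (B1, B2) with S1 a subset of S2 and D2 a subset of D1."""
    return {
        (i, j)
        for i, (s1, d1) in nodes.items()
        for j, (s2, d2) in nodes.items()
        if i != j and s1 <= s2 and d2 <= d1
    }

def hasse_edges(nodes):
    """Keep only pairs with no intermediate node between them."""
    pairs = order_pairs(nodes)
    return {
        (i, j) for (i, j) in pairs
        if not any((i, k) in pairs and (k, j) in pairs for k in nodes)
    }
```

With these nodes, A is linked directly to A,B and A,C, while the pair (A; A,B,C) satisfies the subset rule but is reached through an intermediate node, so it is not drawn as its own edge.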

Lattice [Figure: lattice with a score jump marked]

Real SNP Data (a few rows)

SNP Lattice

“Reading” The Lattice [Figure: annotated lattice. Background context: small Ns, large Nd, low FOM. Score jump (+S, -D): the effect of adding S in that context, giving a high FOM. Children (+S’, -D’ and +S”, -D”) may have a lower or higher FOM, but their scores are similar.]
