Easier than Excel: Social Network Analysis of DocGraph with Gephi • Janos G. Hajagos • Stony Brook School of Medicine • Fred Trotter • fredtrotter.com
DocGraph • Based on FOIA request to CMS by Fred Trotter • Pre-released at Strata RX 2012 • Medicare providers (more than doctors) • CY 2011 dates of service • Share 11 or more patients in a 30 day forward window • Initial access restricted to MedStartr funders
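The edge rule above (11 or more shared patients in a 30 day forward window) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a hypothetical list of `(patient, provider, date)` visits, an edge A→B counts distinct patients seen by B within 30 days after a visit to A, and same-day visits are ordered arbitrarily. The exact procedure CMS used is not described in these slides.

```python
from collections import defaultdict
from datetime import timedelta

def shared_patient_edges(visits, window_days=30, min_patients=11):
    """Return {(provider_a, provider_b): distinct_patient_count} for pairs
    where B saw the patient within `window_days` after A did."""
    by_patient = defaultdict(list)
    for patient, provider, day in visits:
        by_patient[patient].append((day, provider))

    window = timedelta(days=window_days)
    patients_for_pair = defaultdict(set)
    for patient, events in by_patient.items():
        events.sort()  # chronological order (ties broken by provider id)
        for i, (day_a, prov_a) in enumerate(events):
            for day_b, prov_b in events[i + 1:]:
                if day_b - day_a > window:
                    break  # later visits are outside the forward window
                if prov_a != prov_b:
                    patients_for_pair[(prov_a, prov_b)].add(patient)

    # Keep only pairs that meet the minimum shared-patient threshold.
    return {pair: len(p) for pair, p in patients_for_pair.items()
            if len(p) >= min_patients}
```

Pairs below the threshold are dropped entirely, which is why every edge weight in the released data is at least 11.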
DocGraph by the numbers • Directed graph • Average total degree 52.8 • 940,492 providers (graph nodes/vertices) • 49,685,810 shared edges
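The 52.8 figure can be reproduced from the two counts on this slide as edges divided by nodes. In a directed graph that ratio is the mean in-degree (and equally the mean out-degree); counting both ends of every edge would give twice that value. A quick check, assuming the slide's counts are exact:

```python
nodes = 940_492
edges = 49_685_810

# Mean in-degree equals mean out-degree in a directed graph: edges / nodes.
edges_per_node = edges / nodes

# If every edge is counted once at its source and once at its target,
# the mean in+out degree is twice that ratio.
in_plus_out_per_node = 2 * edges / nodes

print(round(edges_per_node, 1))        # 52.8
print(round(in_plus_out_per_node, 1))  # 105.7
```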
Geographic visualization http://isurfsoftware.com/blog/2012/12/13/visualizing-geographic-connections-between-us-doctors/
NPPES • National Plan and Provider Enumeration System • Source of NPI (National Provider Identifier) • No cost download • Information is entered and updated by provider • Data quality is good to poor • CSV file with 314 columns • A custom MySQL load script is used to normalize the database • Bloom.api open source project to make data easier to access • http://www.bloomapi.com/
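With 314 columns, it usually helps to pull out only the few fields needed before loading anything into a database. The sketch below is not the MySQL load script mentioned above; it is a minimal illustration using Python's `csv` module. The column names in `WANTED` are assumptions about the NPPES header row and should be checked against the downloaded file.

```python
import csv
import io

# Assumed NPPES header names; verify against the actual file.
WANTED = ["NPI", "Entity Type Code", "Provider Last Name (Legal Name)"]

def select_columns(csv_file, wanted):
    """Yield one dict per row containing only the wanted columns."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        yield {column: row[column] for column in wanted}

# Tiny made-up sample standing in for the real download.
sample = io.StringIO(
    "NPI,Entity Type Code,Provider Last Name (Legal Name),Unused Column\n"
    "0000000001,1,EXAMPLE,ignored\n"
)
for record in select_columns(sample, WANTED):
    print(record["NPI"], record["Provider Last Name (Legal Name)"])
```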
Graph data • Relation between authors and MeSH terms from PubMed • http://dx.doi.org/10.6084/m9.figshare.94595
Graph types • Undirected graph • Facebook friendships • Directed graph • Twitter: follow and be followed • Bipartite graph • Multipartite • RDF graph model • Property graph model • Allow parallel edges
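To make the distinctions above concrete, here is one way the graph types could be represented with plain Python containers. All names and values are invented for illustration; graph libraries such as NetworkX provide dedicated classes for these cases.

```python
# Undirected graph: an edge has no direction, so store each pair as a frozenset.
friendships = {frozenset({"ann", "bob"})}

# Directed graph: (source, target) order matters.
follows = {("ann", "bob")}

# Bipartite graph: two node sets, and edges only run between them.
prescribers = {"dr_x"}
patients = {"patient_1", "patient_2"}
prescribed_to = {("dr_x", "patient_1"), ("dr_x", "patient_2")}

# Property graph: parallel edges are allowed, each with its own properties.
property_edges = [
    ("dr_x", "dr_y", {"type": "referral", "year": 2011}),
    ("dr_x", "dr_y", {"type": "coauthor"}),
]

def is_bipartite_edge_set(edges, left, right):
    """True if every edge goes from the left node set to the right node set."""
    return all(s in left and t in right for s, t in edges)
```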
Graphs in healthcare • Prescriber and patient (bipartite) • NCPDP data with NPI • Referral data sets • Shared patients • DocGraph • Social networks • Tweeting about a disease • Limited by imagination
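A bipartite prescriber and patient graph can be collapsed into a provider-to-provider graph of shared patients. The following sketch assumes a hypothetical list of `(prescriber, patient)` pairs and counts, for each pair of prescribers, how many patients they have in common. It illustrates the idea only and is not how DocGraph itself was built.

```python
from collections import defaultdict
from itertools import combinations

def project_prescribers(prescriptions):
    """Project a prescriber-patient bipartite graph onto prescribers.

    Returns {(prescriber_a, prescriber_b): shared_patient_count} with each
    pair in sorted order, so the result is an undirected weighted graph.
    """
    prescribers_of = defaultdict(set)
    for prescriber, patient in prescriptions:
        prescribers_of[patient].add(prescriber)

    shared = defaultdict(int)
    for prescribers in prescribers_of.values():
        for a, b in combinations(sorted(prescribers), 2):
            shared[(a, b)] += 1
    return dict(shared)
```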
Generating GraphML • XML based file format for graphs • Readable by a large number of tools • Gephi • Mathematica • igraph (R) • NetworkX, a Python library for graphs that can export to GraphML • GraphML is not a file format for really large graphs • GraphML is not readable by d3.js
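To show what a GraphML file contains, here is a minimal sketch that builds one with Python's standard `xml.etree.ElementTree`. The node ids, labels, and the `label`/`weight` attribute names are illustrative assumptions. In practice NetworkX's `write_graphml` produces an equivalent file from a graph object, and the tutorial's own script may structure its output differently.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def build_graphml(nodes, edges):
    """nodes: {node_id: label}; edges: iterable of (source, target, weight)."""
    ET.register_namespace("", GRAPHML_NS)

    def tag(name):
        return f"{{{GRAPHML_NS}}}{name}"

    root = ET.Element(tag("graphml"))
    # Declare the attributes attached to nodes and edges.
    ET.SubElement(root, tag("key"), {"id": "label", "for": "node",
                                     "attr.name": "label", "attr.type": "string"})
    ET.SubElement(root, tag("key"), {"id": "weight", "for": "edge",
                                     "attr.name": "weight", "attr.type": "double"})
    graph = ET.SubElement(root, tag("graph"), {"id": "G", "edgedefault": "directed"})

    for node_id, label in nodes.items():
        node = ET.SubElement(graph, tag("node"), {"id": node_id})
        ET.SubElement(node, tag("data"), {"key": "label"}).text = label
    for source, target, weight in edges:
        edge = ET.SubElement(graph, tag("edge"), {"source": source, "target": target})
        ET.SubElement(edge, tag("data"), {"key": "weight"}).text = str(weight)

    return ET.tostring(root, encoding="unicode")

# Made-up identifiers, not real NPIs.
print(build_graphml({"n1": "Provider A", "n2": "Lab B"}, [("n1", "n2", 25)]))
```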
Gephi
Gephi • Java based open source tool • Focused on interactivity • Fast graphics • Multi-threaded • Visual updates • Strong graph analytics • Graphs stored in memory • Upper limit is about 100,000 nodes • Netbeans plugin architecture • Integration with Neo4J • Additional layout algorithms
Downloading Gephi http://gephi.org/users/download/
Downloading sample files https://dl.dropboxusercontent.com/u/21690634/DocGraph/docgraph_tutorial_examples.zip
Subsets are generated using a Python script
python extract_providers_to_graphml.py "npi='1750499653'" sterrence Leaf-edges
Opening connection referral
Configuration
Selection criteria for subset graph: npi='1750499653'
Referral table name: referral.referral2011
NPI detail table name: referral.npi_summary_primary_taxonomy
Nodes will be labeled by: provider_name
Leaf-to-leaf edges will be exported? False
…
Imported 1 nodes
…
Imported 986 nodes
…
Imported 1724 edges
Edge types imported {'core-to-leaf': 866, 'leaf-to-core': 856, None: 2}
Leaf-to-leaf edges were not selected for export
Writing GraphML file
Generating a subset: some concepts • Core nodes • Connecting core nodes • Adding leaf nodes • Connecting to leaf nodes • Connecting leaf nodes
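The core and leaf concepts can be sketched as follows. This is not the code of `extract_providers_to_graphml.py`; it is an in-memory illustration that assumes the full edge list fits in a Python list. The labels "core-to-leaf" and "leaf-to-core" match the script output shown earlier, while "core-to-core" and "leaf-to-leaf" are assumed names.

```python
from collections import Counter

def extract_subset(edges, core, include_leaf_to_leaf=False):
    """Return (nodes, kept_edges, edge_type_counts) for a core node set."""
    core = set(core)

    # Leaf nodes: non-core nodes with at least one edge to or from a core node.
    leaf = set()
    for source, target in edges:
        if source in core and target not in core:
            leaf.add(target)
        elif target in core and source not in core:
            leaf.add(source)

    kept, kinds = [], Counter()
    for source, target in edges:
        if source in core and target in core:
            kind = "core-to-core"
        elif source in core and target in leaf:
            kind = "core-to-leaf"
        elif source in leaf and target in core:
            kind = "leaf-to-core"
        elif include_leaf_to_leaf and source in leaf and target in leaf:
            kind = "leaf-to-leaf"
        else:
            continue  # edge lies outside the subset
        kept.append((source, target))
        kinds[kind] += 1

    return core | leaf, kept, kinds
```

Leaving leaf-to-leaf edges out, as in the run shown earlier, keeps the exported graph centered on the selected providers and much smaller.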
Sample files • jamestown_core_provider_graph.graphml • Providers selected with practice addresses in Jamestown, NY • Small city in far western New York (approximately 30,000 residents) • 179 nodes with 5,560 edges • jamestown_core_and_leaf_provider_graph.graphml • Includes providers above and those who are linked to them • 1,322 nodes with 12,457 edges • albany_core_provider_graph.graphml • Providers selected with practice addresses in Albany, NY • A small city in New York (approximately 100,000 residents) • 1,368 nodes with 44,711 edges
Sample files (continued) • bronx_core_provider_graph.graphml • Providers selected with practice addresses in Bronx, NY • Urban community (1.4 million residents) • 3,268 nodes and 53,828 edges
Navigating the graph • Best experience with a three button mouse with a scroll wheel • Right click and hold to pan • Scroll wheel to zoom in and out • Left click to select • Right click for context menus • MacBook users • command key and click and hold down on trackpad to pan • Two fingers to zoom on trackpad • Click on trackpad to select • Control click for context menus
Varying node size based on importance • Step 1: Need to select a measure for node importance • Degree • PageRank • Eigenvector centrality • Step 2: Run the measure against the graph • Step 3: Ranking tab and “Size/Weight” • Step 4: Set size range
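Outside of Gephi, the same four steps can be approximated in code. The sketch below uses total degree as the importance measure and maps it linearly onto a size range. The linear mapping and the default range of 10 to 50 are assumptions for illustration, not a description of Gephi's internals.

```python
from collections import Counter

def node_sizes(edges, min_size=10.0, max_size=50.0):
    """Map each node's total degree linearly onto [min_size, max_size]."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1  # counts as out-degree for the source
        degree[target] += 1  # counts as in-degree for the target

    lowest, highest = min(degree.values()), max(degree.values())
    span = (highest - lowest) or 1  # avoid dividing by zero when all degrees match
    return {node: min_size + (d - lowest) / span * (max_size - min_size)
            for node, d in degree.items()}
```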
Graph measures • Degree • In-degree • Out-degree • Graph structure measures • Clustering (global and local) • Network diameter • Centrality Measures • Eigenvector centrality • PageRank (Google search) • Community measures • And more . . . . .
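As an illustration of one centrality measure from the list, here is a small power-iteration version of PageRank written in plain Python. It is a teaching sketch with an assumed damping factor of 0.85 and a fixed number of iterations. Gephi and NetworkX ship their own implementations, which should be preferred for real work.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Return {node: score} for a directed edge list; scores sum to 1."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n
                    for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank
```

A node scores highly when it receives edges from nodes that themselves score highly, which is why a referral hub tends to outrank the providers that point to it.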
Interactively viewing node attributes • Click the “T” icon at the bottom to turn on node labeling
Saving your graph • Save your graph in .gephi format • XML based format • Preserves layout, size, and color • Save in GraphML format for use with outside programs
Hints for filtering nodes • Drag field filter “is_physician” from the top pane to the lower pane • Set the value to filter on • Value should equal 1 • 1 is equivalent to true • Click “Filter” to apply
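The same filter can be applied to a GraphML file in code. The sketch below parses a tiny made-up GraphML document with the standard library and keeps only nodes whose `is_physician` value is 1, along with the edges between them. It assumes `is_physician` is stored as a node attribute declared by a `key` element, which should be verified against the sample files.

```python
import xml.etree.ElementTree as ET

NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Made-up example document; node ids are not real NPIs.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="is_physician" attr.type="int"/>
  <graph edgedefault="directed">
    <node id="n1"><data key="d0">1</data></node>
    <node id="n2"><data key="d0">0</data></node>
    <node id="n3"><data key="d0">1</data></node>
    <edge source="n1" target="n2"/>
    <edge source="n1" target="n3"/>
  </graph>
</graphml>"""

def filter_nodes(graphml_text, attr_name, value):
    """Return (kept_node_ids, kept_edges) for nodes where attr_name == value."""
    root = ET.fromstring(graphml_text)
    # Look up the key id that GraphML uses for this node attribute.
    key_id = next(key.get("id") for key in root.findall("g:key", NS)
                  if key.get("attr.name") == attr_name and key.get("for") == "node")
    kept = {node.get("id")
            for node in root.iterfind(".//g:node", NS)
            for data in node.findall("g:data", NS)
            if data.get("key") == key_id and data.text == value}
    edges = [(edge.get("source"), edge.get("target"))
             for edge in root.iterfind(".//g:edge", NS)
             if edge.get("source") in kept and edge.get("target") in kept]
    return kept, edges

physicians, physician_edges = filter_nodes(SAMPLE, "is_physician", "1")
```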
Producing a final graph • We need to rescale the edge weights in the graph
Challenge questions • Which institution is the most “important” provider for the Bronx? • Hint: try a centrality measure • Can you determine if geography plays a role in patient sharing in the Bronx? • Which parameter could be used to partition the graph? • Can you filter the graph to show only radiologists? • Which radiologist has the highest “authority” in the graph?
Other tools for graph analysis • NetworkX • Python • Lots of algorithms • igraph • R and Python • Gremlin – graph traversal and manipulation • Groovy shell • Gremlin interface is implemented for Neo4J • And more . . .
Scaling the analysis to the entire DocGraph • Most healthcare graphs will be big (millions of nodes) • What we learn at the local level can be applied at the global level • Importance of geography • Supernodes (radiologist, ER docs, pathologist, transportation, …) • Many graph measures don’t scale well • Maximal cliques • Currently exploring how to use Faunus to scale the analysis with Hadoop
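To see why maximal cliques are a scaling problem, consider the basic Bron–Kerbosch enumeration sketched below. It is a textbook version without pivoting, works on an undirected adjacency map (DocGraph's directed edges would first have to be made symmetric, which is an assumption of this example), and its worst-case running time grows exponentially with the number of nodes.

```python
def maximal_cliques(adjacency):
    """adjacency: {node: set of neighbours} for an undirected graph."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)  # nothing can extend it: maximal clique
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques
```

This is fine for a subset the size of the Jamestown sample, but supernodes with very large neighbourhoods make the recursion impractical on the full graph.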
Links • http://strata.oreilly.com/2012/11/docgraph-open-social-doctor-data.html (information) • https://github.com/jhajagos/DocGraph (code) • http://notonlydev.com/docgraph-data/ (open source $1 covers bandwidth fees) • https://groups.google.com/forum/#!forum/docgraph (mailing list)
Questions • Try to publish your own healthcare dataset as a graph!