380 likes | 618 Views
Scalable Methods for the Analysis of Network-Based Data MURI Project: University of California, Irvine Principal Investigator: Padhraic Smyth Kick-off Meeting November 18 th 2008. Goals for Today’s Meeting. Review overall goals and research of MURI project University research groups
E N D
Scalable Methods for the Analysis of Network-Based DataMURI Project: University of California, Irvine Principal Investigator: Padhraic Smyth Kick-off MeetingNovember 18th 2008
Goals for Today’s Meeting • Review overall goals and research of MURI project • University research groups • learn about each other’s research • See opportunities for collaboration • MURI team and ONR/Navy • MURI team: learn about ONR interests • ONR: learn about expertise and plans of MURI team • Action items • Future meetings and collaborative activities • Review Year 1 research goals Butts
Outline • Introductions • Review today’s agenda • Schedule of talks • Logistics • Overview of our MURI research project • Themes and goals • Tasks
MURI Investigators Padhraic Smyth UCI David Eppstein UCI Carter Butts UCI Michael Goodrich UCI Mark Handcock U Washington Dave Mount U Maryland Dave Hunter Penn State
MURI Project Participants • Postdocs • UC Irvine • Romain Thibaux (Computer Science) • Graduate Students (all UC Irvine) • Computer Science • Darren Strash • Lowell Trott • Statistics • Chris DuBois • Social Science • Chris Marcum • Lorien Jasny • Emma Spiro • Zack Almquist
Outline • Introductions • Review today’s agenda • Schedule of talks • Logistics • Overview of our MURI research project • Themes and goals • Tasks
Agenda for MURI Kickoff Meeting at UC Irvine November 18th 2008Location: UC Irvine, Bren Hall, Room 4011 MORNING SESSION8:30 Arrive, continental breakfast available9:00 Introductions and overview of MURI proposal Padhraic Smyth (UCI)9:30 Research and application challenges from ONR's perspective Martin Kruger (ONR, MURI Program Manager)10:00 Brief discussion/Q&A session between ONR representatives and PIs
Agenda for MURI Kickoff Meeting at UC Irvine 10:15 Break10:30 Tutorial Session: Statistical models for network data: Mark Handcock (U Washington), Carter Butts (UCI), Dave Hunter (Penn State) - Fundamentals of exponential family random graph models (ERGMs) - Parameter estimation in ERGMs: principles and computational challenges - Alternative statistical frameworks such as latent space models LUNCH12:00 Lunch for PIs and UCI visitors at the University Club (next door to Bren Hall)
Agenda for MURI Kickoff Meeting at UC Irvine AFTERNOON SESSION: 15-minute brief Research Presentations: 1:30 - Studying networks through an algorithmic lens Michael Goodrich (UCI) 1:45 - Fast algorithms for computing network statistics: David Eppstein (UCI)2:00 - Data structures for dynamic and kinetic multidimensional point sets: Dave Mount (U Maryland)2:15 - Modeling dynamic and relational event data: Carter Butts2:30 - Statistical modeling of large text collectons: Padhraic Smyth (UCI)2:45 - Modeling partially observed network data: Mark Handcock (U Washington)
Agenda for MURI Kickoff Meeting at UC Irvine 3:00 Break and Informal Discussion3:30 Brief talks on Software and Data Sets: - R software for network analysis: Dave Hunter (Penn State) - Experimental results on real-world networks PhD students from Carter Butt's group (Sociology Dept, UCI) - Large network data sets for experimentation: Chris DuBois (PhD student, Statistics Department, UCI)
Agenda for MURI Kickoff Meeting at UC Irvine 4:15 Open Discussion on Research Plans - relation of MURI research topics to military applications - further opportunities for collaboration within the team - year 1 research goals 5:00 Organizational Issues and Wrap-up - future meetings (frequency, location) - encouraging interaction between team members (conference calls, weekly research meetings, etc) - use of collaborative Web pages - action items5:30 Adjourn
Logistics • Meals • Lunch at University Club - for PIs and non-UCI folks • Coffee breaks • Wireless • Should be able to get 24-hour guest access • Slides • Will be available online by the end of today
Outline • Introductions • Review today’s agenda • Schedule of talks • Logistics • Overview of our MURI research project • Themes and goals • Tasks
MURI Project: Scalable Methods for Analysis of Network-Based Data • 4 universities collaborating, 7 PIs • Support for (approx) 8 graduate students and 3 postdocs or research associates • 3-year project with possible extension to 5 years • Time Period • Funding arrived at UCI in September 2008 • At other universities in Sept/Oct 2008 • Official project start/end • June 1 2008 to May 30 2011/2013
A Small Social Network Eppstein Butts Butts Hunter Mount Smyth Handcock Goodrich
A Small Social Network Eppstein Butts Hunter Mount Smyth Handcock Goodrich Statistics
A Small Social Network Eppstein Butts Hunter Mount Smyth Handcock Goodrich Statistics Algorithms & Data Structures
Figure from Carter Butts
Statistical Modeling of Network Data • Statistics = principled approach for inference from data • Basis for optimal prediction • querying = computation of conditional probabilities/expectation • Principles for handling noisy measurements • e.g., noisy edge observation process • Integration of different sources of information • e.g., combining edge information with node covariates • Quantification of uncertainty • e.g., which model is a better explanation of the data
Limitations of Existing Methods • Network data over time • Relatively few statistical models for dynamic network data • Heterogeneous data • e.g., few techniques for incorporating text, spatial information, etc, into network models • Computational tractability • Many network modeling algorithms scale exponentially in the number of nodes N
Example • G = {V, E} V = set of N nodes E = set of directed binary edges • Exponential random graph model (ERGM) P(G | q) = f( G ; q ) / normalization constant The normalization constant = sum over all possible graphs How many graphs? 2 N(N-1) e.g., N = 20, we have 2380 ~ 1038 graphs to sum over
Key Themes of our MURI Project • Foundational research on new statistical estimation techniques for network data • e.g., principles of modeling with missing data • New algorithms for heterogeneous network data • Incorporating time, space, text, other covariates • Faster algorithms • E.g., approximation methods for very large data sets • Software • Make network inference software publicly-available (in R)
Key Themes of our MURI Project Fast Algorithms Statistical Methods Richer models Large Heterogeneous Data Sets Software
Tasks A: Fast network estimation algorithms Eppstein, Butts B: Spatial representations and network data Goodrich, Eppstein, Mount C: Advanced network estimation techniques Handcock, Hunter D: Scalable methods for relational events Butts E: Network models with text data Smyth F: Software for network inference and prediction Hunter
Task A: Fast Network Estimation Algorithms Investigators: Eppstein, Butts • Problem: • Statistical inference algorithms can be slow because of repeated computation of various statistics on graphs • Goal • Leverage ideas from computational graph algorithms to enable much faster computation – also enabling computation of more complex and realistic statistics • Projects • Dynamic graph methods for change-score computation • Rapid subgraph automorphism detection for feature counting • Dynamic connectivity
Task B: Spatial Representations and Network Data Investigators: Goodrich, Eppstein, Mount • Problem: • Spatial representations of network data can be quite useful (both latent embeddings and actual spatial information) but current statistical modeling algorithms scale poorly • Goal • Build on recent efficient geometric data indexing techniques in computer science to develop much faster and efficient algorithms • Projects • Improved algorithms for latent-space embeddings • Fast implementations for high-dimensional latent space models • Techniques for integrating actual and latent space geometry
Task C: Advanced Estimation Techniques Investigators: Handcock, Hunter • Problem: • Current statistical network inference models often make unrealistic assumptions, e.g., • Assume complete (non-missing) data • Assume that exact computation is possible • Goal • Develop new theories and techniques that relax these assumptions, i.e., methods for handing missing data and techniques for approximate inference • Projects • Inference with partially observed network data • Approximation methods • Approximate likelihood techniques • Approximate MCMC algorithms • Will leverage new techniques developed in Tasks A and B
Task D: Scalable Temporal Models Investigator: Butts • Problem: • Few statistical methods for modeling temporal sequences of events among a network of actors • Goal • Develop new statistical relational event models to handle an evolving set of events over time in a network context • Projects • Specification of relational event statistics • Rapid likelihood computation for relational event models • Predictive event system queries • Interventions, forecasting, and “network steering” • Can build on ideas from Tasks A, B, C
Task E: Network Models and Text Data Investigator: Smyth • Problem: • Lack of statistical techniques that can combine network and text data within a single framework (e.g., email communication) • Goal • Leverage recent advances in both statistical text mining and statistical network modeling to create new combined models • Projects • Latent variable models for text and network data • Text as exogenous data for statistical network models • Modeling of text and network data over time • Fast algorithms for statistical modeling of text/networks • Can build on ideas from Tasks A, B, C and D
Network of email communication patterns in a corporate research lab
Task F: Software for Network Inference and Prediction Investigator: Hunter • Goal • Disseminate algorithms and software to research and practitioner communities • How? • By incorporating our new algorithms into the R statistical package • R = open source language for stat computing/graphics • MURI team has significant prior experience with developing statistical network modeling packages in R • network (Butts et al, 2007) • latentnet (Handcock et al, 2004) • ergm(Handcock et al, 2003) • sna (Butts, 2000) • Will integrate algorithms and techniques from earlier tasks
Data Sets • Traditional social network data sets • “Next generation” data sets • Dynamic network data • E.g., WTC communications • Network data with text • E.g., political blogs, Enron emails • Often much larger and richer than traditional data sets • See afternoon talks by PhD students Lorien Jasny and Chris DuBois
Evaluation Methods • Traditional statistical metrics • Log-likelihood on training data • Model comparisons using penalized and marginal likelihood • Predictive metrics • E.g., for dynamic networks, prediction of edge and node properties “out of sample”, and assessment of the accuracy of these predictions • Classification accuracy, precision-recall (ROC), etc • Computational metrics • Worst and average-case analysis • Empirical evaluations of computation time • Trade-offs of statistical/predictive accuracy with computation time
Summary • Statistical modeling is a key approach for quantitative analysis and prediction using network data • Existing statistical network modeling techniques are potentially very powerful • But are currently computationally limited to small networks • Leverage ideas from computer science to extend the “reach” of statistical network modeling to larger networks • Benefits • Computationally tractable modeling of much larger networks • More sophisticated representations for network models • New applications of statistical network modeling
Agenda for MURI Kickoff Meeting at UC Irvine November 18th 2008Location: UC Irvine, Bren Hall, Room 4011 MORNING SESSION8:30 Arrive, continental breakfast available9:00 Introductions and overview of MURI proposal Padhraic Smyth (UCI)9:30 Research and application challenges from ONR's perspective Martin Kruger (ONR, MURI Program Manager)10:00 Brief discussion/Q&A session between ONR representatives and PIs