Data from Far and Wide: Finding IT, Managing IT, Using IT
Professor Robert Hollebeek
NSCP - University of Pennsylvania
7th International Conference on High Performance Computing, December 18, 2000, Bangalore, India
Outline
• The importance of Data Intensive Computing
• Data and Medicine
• Data and Maps
• Data Infrastructure
• Conclusions
R. Hollebeek
Data Intensive Computing: particularly interesting (hard) when the data
• comes from distributed sensors
• is controlled or stored in distributed databases or caches
• is secure or semi-private
• is large scale (terabyte to petabyte)
• is made of multiple components
Difficulty Increases with data diversity, size, speed requirements
Current Projects explore all three dimensions
[Figure axes: Diversity and Complexity, Size, Speed. Figure labels: Govt Data, Medical Data, NSCP3-parallel hardware]
The Power of Data Mining
Network Traffic on a 500 node LAN
[Figure axes: Source Computer, Destination Computer]
The network data shown here contains a lot of information but, displayed this way, yields little insight or knowledge about the underlying activity.
[Figure axes: Source Node, Destination Node]
NSCP BlockNess Algorithm
Once the data are rearranged, sorted, and clustered, we see that there are several major groups of processors with joint activities.
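The slides do not describe how the BlockNess algorithm works, so the following is only a minimal sketch of the general idea of reordering a traffic matrix so that groups of communicating nodes show up as blocks. It swaps in a much simpler technique (connected components via union-find), and the function names and the sample flows are made up for illustration.

```python
from collections import defaultdict


def group_nodes(n_nodes, flows):
    """Group node ids so that nodes exchanging traffic land in the same group.

    Stand-in technique: connected components via union-find.
    This is NOT the NSCP BlockNess algorithm, whose details are not given.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in flows:
        parent[find(src)] = find(dst)

    groups = defaultdict(list)
    for node in range(n_nodes):
        groups[find(node)].append(node)
    # Largest groups first; ties broken by smallest node id for a stable order.
    return sorted(groups.values(), key=lambda g: (-len(g), g[0]))


def reorder_matrix(n_nodes, flows):
    """Return the new node order and the traffic matrix permuted into that order."""
    order = [node for group in group_nodes(n_nodes, flows) for node in group]
    position = {node: i for i, node in enumerate(order)}
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for src, dst in flows:
        matrix[position[src]][position[dst]] += 1
    return order, matrix


# Hypothetical (source, destination) pairs on a 6-node network.
flows = [(0, 3), (3, 5), (5, 0), (1, 4), (4, 1), (2, 2)]
order, matrix = reorder_matrix(6, flows)
for row in matrix:
    print(" ".join(str(v) for v in row))
```

After reordering, the nonzero entries sit in square blocks along the diagonal, one block per group of nodes with joint activity.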
Data Mining Prerequisites
• Finding IT: Find Interesting Data
  • Data Intensive Applications
  • Social Science, Economics, Medicine, Science
• Managing IT: Data Infrastructure and Data Organization
  • Parallel Storage above the Terabyte Level
• Using IT: Finally you get to do Mining
  • Data Intensive -> Semi-automated
Talk Will Highlight Examples of Data Intensive Applications from NSCP@PENN (http://nscp.upenn.edu)
• NDMA: National Digital Mammography Archive
• NIS-P: Neighborhood Information System for Philadelphia
• Parallel Data Infrastructure: NSCP
[Slide annotations: Massive, Distributed, Secure, Diverse; Web enabled, Secure; Ultra high speeds for massive data]
Outline - Data and Medicine
• The importance of Data Intensive Computing
• Data and Medicine
  • Finding IT
  • Managing IT
  • Using IT
• Data and Maps
• Data Infrastructure
• Conclusions
Finding IT
[Image sources shown: X-rays, mammograms, MRI, CAT scans, endoscopies, ...]
• Hospitals
  • Very large data sources: great clinical value to digital storage and manipulation, and significant cost savings
  • 7,000 Gigabytes per hospital per year
  • dominated by digital images
• Why we chose Mammography
  • clinical need for film recall
  • large volume (4,000 GB/year)
  • standards exist
  • great clinical value to this application
Managing IT
Major Components
• Hospital Portal Systems
• “RadAR” Large Scale Storage and Indexing
• Network Infrastructure
RadAR: NSCP@PENN
• High capacity radiology storage developed by NSCP 1996-1999
• Radiology Active Repository
RadAR Components
[Diagram labels: Large Disks, Parallel CPU, Control (MA R), Hi-speed Interconnect]
RadAR MetaData
[Diagram labels: Large Disks, MetaData]
RadAR Contents
[Diagram labels, not to scale: Large Disks, Images, MetaData, Logs, Records, DICOM SR, BI-RADS]
RadAR + Portals
Portal Systems at HUP, UNC, UC, SWH
[Diagram labels: Large Disks, Parallel CPU, Control (MA R), Hi-speed Interconnect, Images, MetaData, Logs, Records, MAP/MAQ, NDMA/NSCP]
Map - MA system portal
Two Dual Processor IBM/Netfinity 5100 systems
[Diagram labels: Hospital Network, VPN, Win 2000, Linux]
Portals + RadAR
[Diagram labels: Hospital Network, VPN, Win 2000, Linux, Large Disks, Parallel CPU, Control (MA R), Hi-speed Interconnect]
NSCP High Capacity Archive
100 TB, million-record-per-day pilot system developed by NSCP and demonstrated at SC98
[Diagram label: RadAR]
NSCP - IBM/SP2 Hardware Components
[Diagram labels: Control, spcw, sp01, sp02, MAR, Serial Ports, High Performance Switch, ATM, Primary Node, Backup Primary Node, Disk Pool 1, Disk Pool 2, Data (repeated)]
[Diagram labels, continued: Status, Node, sp03, Serial, HPS, ATM, Disk Pool, Data (repeated)]
Lab Tour
Scale of the Problem
Recent FDA approval, cost, and other advantages of digital devices will encourage digital radiology conversion
• 2000 Hospitals x 7 TB per year x 2
  • 28 Petabytes per year
  • (1 Petabyte = 1 Million Gigabytes)
• Pilot Problem scale in NDMA
  • 4 x 7 x 2 = 56 Terabytes / year
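The arithmetic on this slide can be checked with a short sketch. The slide does not say what the factor of 2 stands for, so it is carried through here as an unexplained multiplier; the function and constant names are illustrative only.

```python
# Decimal units, matching the slide's "1 Petabyte = 1 Million Gigabytes".
TB_PER_PB = 1_000
GB_PER_TB = 1_000


def annual_volume_tb(hospitals, tb_per_hospital=7, factor=2):
    """Yearly archive volume in terabytes.

    `factor` is the slide's "x 2"; the slide does not explain it.
    """
    return hospitals * tb_per_hospital * factor


national_tb = annual_volume_tb(2000)   # national estimate from the slide
pilot_tb = annual_volume_tb(4)         # NDMA pilot with 4 hospitals
national_pb = national_tb / TB_PER_PB

print(f"national: {national_tb} TB/yr = {national_pb} PB/yr")
print(f"pilot:    {pilot_tb} TB/yr")
```

Both results agree with the slide: 28 Petabytes per year nationally and 56 Terabytes per year for the pilot.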
Storage Hierarchy
[Diagram labels: Hospital / Clinic; 7 R @ 4,000 TB/yr; 20 A @ 100 TB/yr; 15 H @ 7 TB/yr]
Goal: Distribute Storage Load and Balance Network and Query Loads
Networks
• 7 TB / yr in each hospital is ~2% of an OC3
• Typical T1 to DS-3 connections at clinics today are almost sufficient
• Study size and transmission time to a remote reader are a more important constraint, requiring higher speeds
  • 1.5 minutes at DS-3
  • 2 sec at OC48
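A rough sketch of the bandwidth reasoning above, using the standard nominal line rates for T1, DS-3, OC3, and OC48. The slide does not give a study size; the 500 MB used here is an assumption, chosen because it roughly reproduces the slide's transfer times. Protocol overhead is ignored. Under these assumptions the sustained average for 7 TB/yr comes out near 1% of an OC3, and near 2% if the factor of 2 from the scale estimate is applied.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# Nominal line rates in megabits per second (payload rates are somewhat lower).
LINE_RATES_MBPS = {"T1": 1.544, "DS-3": 44.736, "OC3": 155.52, "OC48": 2488.32}

# Assumed study size in megabytes; NOT stated on the slide.
STUDY_MB = 500


def average_rate_mbps(tb_per_year):
    """Sustained rate needed to move a yearly volume, in megabits per second."""
    bits = tb_per_year * 1e12 * 8
    return bits / SECONDS_PER_YEAR / 1e6


def transfer_seconds(size_mb, link):
    """Time to send `size_mb` megabytes over `link` at its nominal line rate."""
    return size_mb * 8 / LINE_RATES_MBPS[link]


hospital_rate = average_rate_mbps(7)
share_of_oc3 = hospital_rate / LINE_RATES_MBPS["OC3"]
t_ds3 = transfer_seconds(STUDY_MB, "DS-3")
t_oc48 = transfer_seconds(STUDY_MB, "OC48")

print(f"7 TB/yr averages {hospital_rate:.2f} Mbps ({share_of_oc3:.1%} of OC3)")
print(f"{STUDY_MB} MB study: {t_ds3:.0f} s on DS-3, {t_oc48:.1f} s on OC48")
```

With the assumed 500 MB study, the transfer takes about 89 seconds on DS-3 and about 1.6 seconds on OC48, in line with the slide's 1.5 minutes and 2 seconds. The average rate of roughly 1.8 Mbps also sits just above a T1, consistent with T1 links being "almost sufficient".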
NDMA
• NSCP@Penn:
  • Digital Storage, Search and Retrieval
• Oak Ridge National Lab:
  • Network (VPN) and Security
• Hospitals of
  • University of Pennsylvania
  • University of Chicago
  • University of North Carolina
  • University of Toronto
Large scale radiology testbed
Regional and Area Archives (A)
Layout matches growth pattern of national networks
Portal Systems in the test lab at NSCP/PENN
First Hospital portal systems being installed at the Hospital of the University of Pennsylvania
Construction of the remaining Portal systems
1200 Gigabyte fast disk under test in a joint program with Lucent and CyberStorage Systems.
Using IT
• Store Records for retrieval
  • typical request would retrieve 3-4 yrs
• Audit and log transmissions
• Parse, Index and Store incoming information
• Support Computer Assisted Diagnostics
• Support Radiologist Training and Evaluation
Training, Teaching, Evaluation
Network and Data Security
• Virtual Private Network
  • used to assure system security
• User Authentication
  • password + token or biometric
• Roles
  • Doctor, Administrator, Assistant, ...
• Client Authorization
  • required for Medical Records
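The layered checks listed above could be combined roughly as follows. This is a hypothetical sketch, not the NDMA implementation: the slide names the roles but not what each may do, so the permission table is invented, and the slide does not say what "client" refers to, so it appears here only as a boolean flag.

```python
from dataclasses import dataclass

# Hypothetical role-to-permission table; the slide lists only the role names.
PERMISSIONS = {
    "Doctor": {"read_record", "read_image"},
    "Administrator": {"manage_users"},
    "Assistant": {"read_record"},
}


@dataclass(frozen=True)
class Session:
    user: str
    role: str
    password_ok: bool
    second_factor_ok: bool   # token or biometric, per the slide
    client_authorized: bool  # "Client Authorization"; meaning assumed


def is_allowed(session, action):
    """Apply two-factor authentication, client authorization, then the role check."""
    if not (session.password_ok and session.second_factor_ok):
        return False
    # Assumption: every read_* action touches medical records.
    if action.startswith("read_") and not session.client_authorized:
        return False
    return action in PERMISSIONS.get(session.role, set())


doctor = Session("dr_example", "Doctor", True, True, True)
print(is_allowed(doctor, "read_record"))
```

Each check fails closed: a missing second factor, a missing authorization, or an unknown role all result in a refusal.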
NDMA Data Mining Challenges
• Fuzzy matching for records
• feature matching in images
• clustering - outcomes, other variables
• outlier search in many dimensions
• computer assisted diagnosis
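As one illustration of the first challenge, fuzzy matching of records can be sketched with the string similarity ratio from Python's standard `difflib` module. The slides do not say which matching method NDMA uses; the threshold and the sample names below are made up.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] between two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_match(query, candidates, threshold=0.8):
    """Return the candidate most similar to `query`, or None if none is close enough.

    The 0.8 threshold is an arbitrary illustrative value.
    """
    if not candidates:
        return None
    score, candidate = max((similarity(query, c), c) for c in candidates)
    return candidate if score >= threshold else None


# Made-up patient names standing in for record keys.
records = ["John Smith", "Jane Smyth", "Joan Smithers"]
print(best_match("Jon Smith", records))
print(best_match("Zebra", records))
```

A misspelled query such as "Jon Smith" still resolves to "John Smith", while an unrelated query returns None. A real record matcher would compare several fields (name, date of birth, identifiers) instead of a single string.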
NDMA - http://nscp.upenn.edu/ndma
NSCP with Children’s Hospital
• To provide fast parallel processing over high speed nets so that functional MRI can be used in real time clinically
• On the right: an individual noisy frame of a human brain
Functional MRI
• J. Yu, graduate student; degree in 2000; now on Wall Street