What's on My Dashboard Today?: The SEER Cancer Database. Dr. Brand Niemann, Director and Senior Enterprise Architect – Data Scientist, Semantic Community http://semanticommunity.info/ AOL Government Blogger http://gov.aol.com/bloggers/brand-niemann/ December 21, 2012
SEER*Stat Web Page 1 http://seer.cancer.gov/
Process • ASCII text version of the data (MY NOTE: I did this): • [Windows Executable - 244 MB] [ZIP - 244 MB] Download this file if you would like to use your own programs to analyze the data in text format. You do not need to download these files if you are using SEER*Stat. SEER does not provide programming support for the analysis of these data. To use this configuration, download the file in the preferred format and uncompress it. File descriptions and documentation are included and are available online in Documentation for the ASCII Text Data Files. Storing the downloaded file and its uncompressed contents will consume approximately 2 gigabytes of disk capacity. • Binary version of the data and the SEER*Stat software (MY NOTE: I did this): • [Windows Executable - 914 MB] Download this file if you would like to use your computer system for data storage and processing. These binary data files can only be analyzed using the SEER*Stat software. To use this configuration, download the file, uncompress it, and install SEER*Stat on your local system. Storing the downloaded file and its uncompressed contents will consume approximately 2.1 gigabytes of disk capacity. • SEER Research Data and SEER*Stat on DVD: • The SEER research data and the SEER*Stat software are available on a DVD. This method is preferable if you are using SEER*Stat and do not have high speed Internet access, and have a DVD drive on your computer. The same files are available for download without the delay caused by shipping. However, downloading requires a high speed Internet connection and significant disk space. The DVD includes the SEER*Stat software, the binary format of the data, and the ASCII data files that can be analyzed with your own software. The DVD contains the binary data in uncompressed format and the ASCII data in compressed format. You will not need to be connected to the Internet when using the data, and can access the binary data and run the software directly from the DVD.
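For readers who take the ASCII route and analyze the text files with their own programs, a minimal Python sketch of reading such records might look like the following. It assumes the files hold fixed-width records, and the field names and column positions in EXAMPLE_LAYOUT are hypothetical placeholders; the real layout has to be taken from the Documentation for the ASCII Text Data Files.

```python
from typing import Dict, Iterable, Iterator, Tuple

# HYPOTHETICAL layout: field name -> (start, end), 0-based, end-exclusive.
# These positions are placeholders, not the real SEER record layout.
EXAMPLE_LAYOUT: Dict[str, Tuple[int, int]] = {
    "record_id": (0, 8),
    "year": (8, 12),
    "site_code": (12, 16),
}


def parse_fixed_width(
    lines: Iterable[str], layout: Dict[str, Tuple[int, int]]
) -> Iterator[Dict[str, str]]:
    """Yield one dict per non-empty line, slicing each field by position."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        yield {
            name: line[start:end].strip()
            for name, (start, end) in layout.items()
        }
```

In practice the function would be fed an open file handle (for example `open(path, encoding="ascii")`), so records are streamed one at a time rather than loaded into memory at once.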
http://seer.cancer.gov/data/options.html
SEER*Stat • 14,400 files in three folders: • Bin • Install • Data • Readme • All SEER*Stat files: • Frequency Files (*.sf; *.sfm) • Rate Files (*.si; *.sim) • Survival Files (*.ss; *.ssm) • Prevalence Files (*.sp; *.spm) • MP-SIR Files (*.sm; *.smm) • Case Listing Files (*.sl; *.slm)
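With 14,400 files to sort through, the extension list above can be turned into a small lookup. This is only a sketch: the mapping is copied from the slide, while the function name and the choice to return None for unknown extensions are assumptions of this example.

```python
from pathlib import PurePath
from typing import Optional

# Extension -> session type, as listed on the slide.
SESSION_TYPES = {
    ".sf": "Frequency", ".sfm": "Frequency",
    ".si": "Rate", ".sim": "Rate",
    ".ss": "Survival", ".ssm": "Survival",
    ".sp": "Prevalence", ".spm": "Prevalence",
    ".sm": "MP-SIR", ".smm": "MP-SIR",
    ".sl": "Case Listing", ".slm": "Case Listing",
}


def session_type(filename: str) -> Optional[str]:
    """Return the SEER*Stat session type for a file name, or None if unknown."""
    return SESSION_TYPES.get(PurePath(filename).suffix.lower())
```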
SEER*Stat 1 • The SEER*Stat statistical software provides a mechanism for the analysis of SEER and other cancer databases. It can be used to view individual cancer records and to produce frequency, rate, and survival statistics. These statistics are useful in studying the impact of cancer on a population. • SEER*Stat was produced by Information Management Services, Inc., in consultation with the Surveillance Research Program of the Division of Cancer Control and Population Sciences, National Cancer Institute. The SEER*Stat Web site is located at: http://seer.cancer.gov/seerstat • Read about: Statistics Calculated by SEER*Stat: View the statistics that SEER*Stat calculates, and the sessions used to calculate them. • Features: An overview of SEER*Stat’s selection, reporting, and exporting capabilities. • SEER*Stat Basics: Basic information on how to use SEER*Stat.
SEER*Stat 2 • Citation of Software and Data Source: • Citations for use of the SEER*Stat software and SEER Limited-Use data should follow these formats. • Software Citation • Replace <version number> below with the appropriate number for the version of SEER*Stat used to perform your analyses. This citation is also available on the Help menu under Suggested Citations and on the session and results matrix print-outs. • Surveillance Research Program, National Cancer Institute SEER*Stat software (http://seer.cancer.gov/seerstat) version <version number>. • Data Citation • Each database provided by SEER has a unique suggested citation, which can be found on the Data tab when the appropriate database is selected, and on the session and results matrix print-outs. The citation contains four important pieces of information: • That the data is provided by the SEER Program (and NCHS, in the case of mortality data) • The specific name of the database • The database's submission date (when it was received from the SEER registries) • The database's release date
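The software citation template above is easy to fill in programmatically. The sketch below follows the wording on the slide; the function name is an assumption, and the version string must come from the SEER*Stat installation actually used.

```python
def software_citation(version: str) -> str:
    """Fill the <version number> slot of the suggested software citation."""
    version = version.strip()
    if not version:
        raise ValueError("a SEER*Stat version number is required")
    return (
        "Surveillance Research Program, National Cancer Institute "
        "SEER*Stat software (http://seer.cancer.gov/seerstat) "
        f"version {version}."
    )
```

The data citation is not templated here because each database carries its own suggested citation on the Data tab.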
SEER*Stat 3 • Features: • SEER*Stat provides advanced features for data manipulation and analysis, including: • handling of complex selection statements to subset the data (supports full Boolean logic and multiple primary selections) • multi-dimensional matrices for reporting results • support for copying and pasting data and statistics into spreadsheet or graphing packages MY NOTE: I need this! • ability to export records and statistics to a text data file for input by other statistical systems such as SAS • self-documenting results • unlimited overlapping of categories (groupings) within variables • multiple document interface allows working on more than one session at a time
SEER*Stat 4 • Copying Results to the Windows Clipboard: • Data from a SEER*Stat results matrix may be copied to the Windows clipboard and pasted into other programs. For some purposes, this may be more efficient than exporting the data to a file. • To copy data, follow these steps: • Click on the matrix window first to make sure it is active. • Open the Edit menu, then the Copy sub-menu. • Select Cell to copy the contents of the selected cell, Page to copy the contents of the current page, All Pages to copy the entire contents of the matrix, or Session Information to copy the parameters of the session from which the matrix was generated. • Note that Ctrl+C, Ctrl+P, Ctrl+M, and Ctrl+R are keyboard shortcuts for these functions. • In a Case Listing matrix, the first three options are replaced by Selected Cells (Ctrl+C), which copies the highlighted cell, columns, or rows. • Open the application in which you want to use the data, and paste it in. Look on the application's Edit menu for a Paste command (usually Ctrl+V is a keyboard shortcut), and refer to the application's help files if you have difficulty. • You can paste SEER*Stat data into spreadsheet applications and other software that handles tabular data. • If you copy a whole page or matrix, the column and row headers will be copied as well. Footnotes, flag characters, and titles will also be copied if they are included in the matrix. These features can be switched on or off in the matrix options.
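Once a page has been copied, the pasted text can also be read by a script instead of a spreadsheet. The sketch below assumes the clipboard text is tab-delimited with the column headers on the first row, which is how tabular data is commonly pasted but is not confirmed by the SEER*Stat help text quoted above; the sample values in the usage note are made up.

```python
import csv
import io
from typing import List, Tuple


def parse_pasted_matrix(text: str) -> Tuple[List[str], List[List[str]]]:
    """Split pasted text into a header row and data rows.

    ASSUMPTION: cells are separated by tabs and the first non-blank
    row holds the column headers.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if not rows:
        return [], []
    header, *body = rows
    return header, body
```

For example, `parse_pasted_matrix("Year\tCount\n2003\t10\n")` returns the header `["Year", "Count"]` and one data row `["2003", "10"]`.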
SEER*Stat 5 • Data: • SEER Incidence Data: Cancer incidence data collected by the SEER registries are distributed with SEER*Stat. • U.S. Mortality Data: Mortality data collected by the National Center for Health Statistics can be analyzed with SEER*Stat. • U.S. Population Data: Modified U.S. Census data are provided with the SEER*Stat software. • Citation of Software and Data Source: Use of SEER*Stat and these data for publication purposes should include a citation. • Standard Population Data: Numerous standard populations are provided with SEER*Stat. • Expected Survival: Expected survival tables for calculating survival are provided with SEER*Stat. • Using Your Own Data: Use SEER*Prep to prepare your data for analysis.
SEER*Stat 6 • SEER*Stat Basics: • Session: Use SEER*Stat sessions to define the parameters of your analysis. • Dictionary: Use the dictionary to format the variables in your analysis. • Selection Statements: Use selection statements to subset records in the database and to define a grouping in a merged variable. • Results Matrix: Analysis results are displayed as a SEER*Stat matrix.
SEER*Stat 7 • Frequency Session • Frequency sessions can be used to calculate frequencies and trends in frequencies. • Rate Session • Rate sessions can be used to calculate crude rates, age-adjusted rates, and trends in rates over time. • Survival Session • Survival sessions can be used to calculate observed survival, net survival, conditional survival, and crude probability of death. • Left-Truncated Life Tables • A Left-Truncated Life Tables session allows survival to be figured from a certain age rather than from the DX Date as in a Survival session. Left-Truncated Life Tables sessions make it possible to see if people who survive cancer live longer than those who did not have cancer because they take better care of themselves and whether cancer survivors have the same causes of death as people who did not have cancer. • Limited-Duration Prevalence Session • Prevalence is a statistic of primary interest in public health because it identifies the level of burden of disease or health-related events on the population and health care system. • MP-SIR Session • MP-SIR (Multiple Primary -- Standardized Incidence Ratios) sessions can be used to compare incidence of cancer in a defined cohort of persons previously diagnosed with cancer to the incidence of cancer in the general population. • Case Listing Session • Case Listing sessions can be used to view individual cancer records. • Equations and Algorithms • Rate Algorithms • Trend Algorithms • Survival Algorithms • Limited-Duration Prevalence Algorithms • Standardized Incidence Ratio and Confidence Limits
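To make the Rate and MP-SIR session statistics concrete, here is a minimal sketch of the textbook formulas: a crude rate, an age-adjusted rate by direct standardization, and a standardized incidence ratio. These are the standard epidemiological definitions, not a reproduction of SEER*Stat's own algorithms (which also cover trends and confidence limits), and all function and parameter names are assumptions of this example.

```python
from typing import Dict


def crude_rate(cases: float, population: float, per: int = 100_000) -> float:
    """Cases divided by population at risk, scaled (default per 100,000)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases / population * per


def age_adjusted_rate(
    cases_by_age: Dict[str, float],
    pop_by_age: Dict[str, float],
    std_pop_by_age: Dict[str, float],
    per: int = 100_000,
) -> float:
    """Direct standardization: weight each age-specific rate by the
    share of that age group in the standard population."""
    total_std = sum(std_pop_by_age.values())
    rate = 0.0
    for age, std in std_pop_by_age.items():
        age_specific = cases_by_age[age] / pop_by_age[age]
        rate += age_specific * (std / total_std)
    return rate * per


def standardized_incidence_ratio(observed: float, expected: float) -> float:
    """Observed cases in the cohort divided by the number expected
    from general-population rates."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return observed / expected
```

A ratio above 1 from the last function means the cohort had more cancers than general-population rates would predict.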
SEER*Stat Web Page 2 http://seer.cancer.gov/seerstat/tutorials/
SEER*Stat Web Page 3 http://seer.cancer.gov/seerstat/tutorials/howto/stratify.html
SEER*Stat Sessions • Frequency • Two data sets • Rate • Three data sets • Survival • One data set • Limited-Duration Prevalence • Three data sets • MP-SIR • Two data sets • Case Listing • 16 data sets MY NOTE: Focus here because state, county & HSAs
My 5-Step Method • So what I like to do to illustrate (data science) and explain (data journalism) is the following (like a recipe): • Put the Best Content into a Knowledge Base (e.g. MindTouch*) • The SEER*Stat Web Pages • Put the Knowledge Base into a Spreadsheet (Excel*) • Linked Data to Subparts of the Knowledge Base • Put the Spreadsheet into a Dashboard (Spotfire*) • Data Integration and Interoperability Interface • Put the Dashboard into a Semantic Model (Excel*) • Data Dictionaries and Models • Put the Semantic Model into Dynamic Case Management (Be Informed*) • Structured Process for Updating Data in the Dashboard * Examples of tools used.
To Get to 5-Stars With Open Data * Examples of tools used. Source of Star and Definition: http://www.w3.org/DesignIssues/LinkedData.html
SEER Knowledge Base http://semanticommunity.info/AOL_Government/Government_Information_and_Analytics_Summit/SEER
SEER Knowledge Base in a Spreadsheet http://semanticommunity.info/@api/deki/files/20218/SEERData.xlsx
SEER-Spotfire: Cover Page https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?SEER-Spotfire
SEER-Spotfire: SEER*Stat Export 1 https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?SEER-Spotfire
SEER-Spotfire: SEER Variable Dictionary: November 2011 MY NOTE: Master Data Dictionary https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?SEER-Spotfire
SEER-Spotfire: US_State Shape https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?SEER-Spotfire
SEER-Spotfire: US_County_AK_2000_2004 Shape https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?SEER-Spotfire
SEER-Spotfire: US_County_2000_2004 Shape https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?SEER-Spotfire
SEER-Spotfire: HSA_NCI_Modified_AK_StateShape https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?SEER-Spotfire
SEER-Spotfire: HSA_NCI_ModShape https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?SEER-Spotfire
SEER-Spotfire: HSA Shape https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?SEER-Spotfire
Conclusions • The SEER*Stat software was found to be cumbersome for viewing, exporting, and analyzing the data. • The SEER*Stat Web Pages and data files were readily repurposed into a Knowledge Base and Spreadsheet. • My 5-Step Method to Get to 5-Stars with Open Data was used. • A Spotfire Dashboard was constructed from the SEER*Stat Knowledge Base and Spreadsheet. • The SEER*Stat spatial data is available at the HSA, county, and state levels, but the temporal data is more extensive. • Spotfire provides an inventory of data tables and data elements for Master Data Management. • The six Shape files were used in Spotfire by geo-referencing HSA, county names and FIPS codes, and state names in the data.
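The geo-referencing step in the last conclusion comes down to matching each data row to a shape by a shared key. The sketch below illustrates this for counties outside of Spotfire, using the standard five-digit county FIPS code (two-digit state code followed by a three-digit county code). The record fields, the shape lookup, and the function names are hypothetical; this is not how Spotfire performs the match internally.

```python
from typing import Dict, List, Tuple, Union

Code = Union[int, str]


def county_fips(state_code: Code, county_code: Code) -> str:
    """Build the zero-padded 5-digit county FIPS key (SS + CCC)."""
    return f"{int(state_code):02d}{int(county_code):03d}"


def join_by_fips(
    records: List[dict], shapes: Dict[str, object]
) -> Tuple[List[dict], List[str]]:
    """Attach a shape to each record; also report keys with no shape.

    ASSUMPTION: each record has 'state' and 'county' code fields and
    `shapes` is keyed by 5-digit FIPS strings.
    """
    matched, unmatched = [], []
    for rec in records:
        fips = county_fips(rec["state"], rec["county"])
        if fips in shapes:
            matched.append({**rec, "fips": fips, "shape": shapes[fips]})
        else:
            unmatched.append(fips)
    return matched, unmatched
```

Returning the unmatched keys separately makes it easy to spot codes that lost their leading zeros or counties missing from a given shape file.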