220 likes | 332 Views
Working with Statisticians. At some point, a statistician is likely to be asked to analyze your data. This can lead to much unhappiness. Statisticians come in many shapes and sizes. But. Data formats. Ideally, use a normalized database with validated data entry as part of LIMS…
E N D
Working with Statisticians At some point, a statistician is likely to be asked to analyze your data. This can lead to much unhappiness.
Data formats • Ideally, use a normalized database with validated data entry as part of LIMS… • But 99% of the time => Excel spreadsheet • Some statisticians prefer to work with raw data (i.e. FCS files) but not common • Scott will cover consistent annotation for raw data at another lecture
Basic principle #1 • Statisticians do not like Excel • The first thing they will try to do is export to a CSV or delimited file, for import into SAS or R • If this is difficult to do, they will not like you
Excel rules for happy statisticians • 1 worksheet = 1 table • 1 cell = 1 value • Data? • Metadata? • Formatting? • Validation?
1 worksheet = 1 table • A table has column headers and a number of rows and nothing else – it is RECTANGULAR • Do not put more than 1 table in a worksheet • Do not use non-rectangular tables • Example of good worksheet
1 cell = 1 value • Easy to filter by tube, sample or subject • Easy to write validation rules or lookup table
1 cell = 1 value • ID column has 3 different values • Need to do text parsing to recover information – very error prone
Data: column names • Consistent column names across worksheets • Singlets/Lymphocytes • Singlet/Lymphs • Singlets / Lymphocytes • Singlets/Lymphoctyes • Use full gating path for column name • Singlets/Lymphocytes/Viable/CD4+/CM/IFN+
Data: What to record • Better to have more data than less data • Sample type (PBMC, whole blood) • Recovery • Viability • Better to have basic than derived data • Counts better than relative frequencies • Keep link to raw data for reproducibility • Path to FCS file on server • Use special indicator for missing data (e.g. NAN), not zero • Can have extra column for notes • Ideally codified so Error 23 rather than “Sample sat > 8 hours before processing”
Data: Versioning • Do not change the data in the worksheet once it has been handed to statistician. • If there are errors that must be corrected, make a new copy, label the filename with date and version, and send that to statistician • ArcticRatExperiment_07May2013_Version01.xlsx • ArcticRatExperiment_17May2013_Version02.xlsx
Metadata • Should have SOP document for metadata • How missing data is represented (e.g. NA or blank) • Keys for interpretation – e.g. Table of error codes • Contact person: phone #, email • Metadata can be in 2nd worksheet or separate document • Gating scheme with labeled gates matching cell subsets used in column names (PDF or PPT) • Panel information • Antibodies, clones, batches, fluorochromes, peptide mixes • Path to Flowjo .jo or .xml analysis file
Metadata • There are minimal information standards that should be followed • MiFlowcyte • MIATA • Google for them if you’re not familiar with them – increasingly these are required by journals for publication, so worth making it an SOP for documentation of results
Formatting • Don’t do it. • Avoid putting information via: • Highlighting • Fancy spacing • Different fonts and font effects • Merging cells • Comments • Will it survive a round-trip from Excel to CSV and back again?
Formatting - After Comments are lost Highlighting is lost Bad cell formatting is lost Merged cells become missing information
Validation • Can set up validation rules in Excel to minimize data entry errors: • Number range (0, 10000000) • Can use lookup tables for codes to use • E.g. Error codes with explanation • If possible, once format for data is decided, get local Excel wizard to create template and lookup rules
Questions? • If no questions and need to kill time, watch Biologist talks to Statistician video • http://www.youtube.com/watch?v=Hz1fyhVOjr4