
SMD Data Quality Assessment and Repository Tools Tutorial

November 10, 2007. Catherine Ball, Janos Demeter.


Presentation Transcript


  1. SMD Data Quality Assessment and Repository Tools Tutorial November 10, 2007 Catherine Ball Janos Demeter

  2. SMD: Getting Help • Click on the “Help” menu • Tool-specific links will be listed at the top. • Use the SMD help index to look for specific subjects • Send e-mail to: array@genome.stanford.edu

  3. Quality Assessment and Repository Tools Tutorial • Quality Assessment Tools • Ratios on Array • HEEBO/MEEBO plots • Graphing tool • Q-score • Repository • Repository • SVD • Synthetic Gene Tool • kNNimpute

  4. SMD Data Repository Help • How to use the tool • Limitations of file sizes • Sharing data • Options • Links to help for analysis methods, data file formats, data retrieval and clustering

  5. SMD Help: File Formats

  6. File Formats: Pre-clustering (PCL) File • UID is the unique identifier for the spot/reporter • NAME is the sequence label for the spot/reporter • GWEIGHT indicates the weight the spot/reporter is given in clustering • EWEIGHT indicates the weight the array/experiment is given in clustering • Names and order of the arrays (if arrays are not clustered) • Values are given for each spot/reporter on each array (usually log ratios)
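The PCL fields listed on this slide can be read with a few lines of Python. This is a minimal sketch, not SMD's own parser: it assumes a tab-delimited layout with a header row (UID, NAME, GWEIGHT, then one column per array), an EWEIGHT row, and one row per spot/reporter. The sample data and the function name `parse_pcl` are hypothetical.

```python
import csv
import io

# Hypothetical miniature PCL file; real files come from SMD data retrieval.
SAMPLE_PCL = (
    "UID\tNAME\tGWEIGHT\tarray1\tarray2\n"
    "EWEIGHT\t\t\t1\t1\n"
    "spot1\tgene A\t1\t0.52\t-1.10\n"
    "spot2\tgene B\t1\t\t0.33\n"
)


def parse_pcl(text):
    """Parse PCL text into (array_names, eweights, rows).

    Assumed layout: header row, EWEIGHT row, then one row per spot/reporter.
    Empty cells are returned as None (missing data).
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    array_names = header[3:]                      # columns after UID, NAME, GWEIGHT
    eweights = [float(w) for w in next(reader)[3:]]
    rows = []
    for rec in reader:
        if not rec:
            continue                              # skip blank lines
        values = [float(v) if v.strip() else None for v in rec[3:]]
        rows.append({
            "uid": rec[0],
            "name": rec[1],
            "gweight": float(rec[2]),
            "values": values,
        })
    return array_names, eweights, rows


arrays, eweights, rows = parse_pcl(SAMPLE_PCL)
print(arrays, eweights)
for row in rows:
    print(row["uid"], row["name"], row["values"])
```

Keeping missing cells as `None` rather than 0 matters later, because tools such as SVD need missing values to be estimated explicitly.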

  7. File Formats: Clustering Design Tree (CDT) File

  8. SMD Data Repository • What is the SMD Data Repository? • What is the repository? • Using the repository to save or upload data • Using the repository to share data • Using the repository to analyze data • Options for PCL files via the repository • View • Data • Delete • Edit • Cluster • Filter • SVD • Synthetic Genes • KNN Impute • Options for CDT files via the repository • GeneXplorer • TreeView • View Clusters, spots

  9. What is the SMD Repository? • A method to save data sets to prevent repeatedly performing the same data retrieval • A method to share processed data with others • A way SMD can provide you with access to new and/or computationally-intensive tools

  10. Accessing the SMD Data Repository Here!

  11. SMD Data Repository

  12. Uploading files to Repository • If uploading clustered data, enter “CDT” files • If uploading pre-clustering data, enter “PCL” files • Choose an organism • Give your data set a unique name • Provide a useful description for your data set

  13. Using Your Repository: CDT Deposits • View cluster using GeneXplorer or TreeView • View cluster images • View retrieval and clustering report • Download files • Assign access

  14. Using Your Repository: PCL Deposits • View information about your repository entry • Download data • Delete the repository entry • Edit the entry • Cluster data • Filter data • Apply SVD to data • Apply “Synthetic Genes” to data • Estimate missing data with KNN impute

  15. Using the Repository: CDT File Options • CDT files have a few other options • GeneXplorer • Clustering with Proxy and Spot images • TreeView • Clustering with Spot images • Clustering with Proxy images

  16. Viewing Repository Entries • Name • Organism • Number of genes • Number of arrays • Size of file • Date uploaded • Description • Data retrieval summary

  17. Downloading Repository Entries • Downloading puts the file(s) into a folder, labeled with your SMD user name, on your computer’s desktop

  18. Deleting Repository Entries • Details about your repository entry • Asks you to confirm before deleting!

  19. Editing Entries -- How to Share! • Change repository entry name • Change description • Give a GROUP access to a repository entry • Give an SMD USER access to a repository entry

  20. Filtering Data in Repository Entries • If your repository entry is a PCL file, you can re-enter the SMD filtering pipeline

  21. SVD: Singular Value Decomposition • The goal of SVD is to find a set of patterns that describe the greatest amount of variance in a dataset • SVD determines unique orthogonal (or uncorrelated) gene and corresponding array expression patterns (i.e. "eigengenes" and "eigenarrays," respectively) in the data • Patterns might be correlated with biological processes OR might be correlated with technical artifacts

  22. SVD: The Concept (easy version) • Let’s imagine we have a three-dimensional cigar, as shown in A • We can represent this in one dimension, by looking at its lengthwise shadow (B) • Looking at its cross-wise shadow (C), we get an orthogonal view of the cigar that tells us more about the three-dimensional object than B alone.

  23. SVD: Missing Data Estimation • Some algorithms (such as SVD) cannot operate with missing data • You can use this simple method or you can use KNNImpute to estimate missing data

  24. SVD Display in SMD

  25. SVD: Raster Display • Each row represents an “eigengene” -- an orthogonal representation of the genes in the dataset • The topmost eigengene contributes the most to the data set

  26. SVD: View Projection • Clicking on a row in the Raster Display brings you to the Projection View • You can select genes that have high and low contributions from an eigengene and download them in a PCL file • In this way, you might use SVD to help classify subtypes

  27. SVD: Eigenexpression • Each bar shows the probability of expression of each eigengene • You can compare the probabilities to see which eigengenes contribute more to the overall “view” of the data
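The SVD slides can be illustrated with NumPy. This is a sketch under stated assumptions, not SMD's implementation: the data matrix is hypothetical (genes in rows, arrays in columns, no missing values), and the per-eigengene "probability" is assumed to be the normalized squared singular value, which is one common definition.

```python
import numpy as np

# Hypothetical genes-by-arrays matrix of log ratios with one dominant pattern.
rng = np.random.default_rng(0)
pattern = np.array([1.0, 0.5, 0.0, -0.5, -1.0])        # pattern across 5 arrays
loadings = rng.normal(size=(20, 1))                    # strength per gene (20 genes)
data = loadings * pattern + 0.1 * rng.normal(size=(20, 5))

# data = U @ diag(s) @ Vt; singular values are returned largest first.
U, s, Vt = np.linalg.svd(data, full_matrices=False)

eigengenes = Vt     # each row is an expression pattern across the arrays
eigenarrays = U     # each column is the matching pattern across the genes

# Assumed definition of the per-eigengene contribution:
# squared singular value divided by the sum of all squared singular values.
fractions = s**2 / np.sum(s**2)

for i, frac in enumerate(fractions, start=1):
    print(f"eigengene {i}: {frac:.3f}")
```

Because the singular values are sorted, the first row of `eigengenes` corresponds to the topmost row of the raster display, the pattern that contributes the most to the data set.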

  28. SVD: Plot selected eigengenes • You can plot as many or as few eigengenes as you like • This plot gives you an easy-to-understand view of the behavior of each eigengene

  29. Synthetic Genes • Purpose: average data based on arbitrary groupings of genes - for biological reasons - for technical reasons • Can average data using: - common genelists - your own genelists • After averaging: - a new row for the synthetic gene data - Original data can be removed/included

  30. Synthetic Genes • Common lists available (only mouse and human data): • Unigene (all clones/oligos that report on a given Unigene id will be averaged and shown as the Unigene id) • LocusLink (same as above, but for LocusLink id) • These lists are useful to collapse data by gene, rather than by suid/luid • They allow comparison of experiments between different platforms (oligo print to cDNA print, or spotted arrays to Agilent arrays) where the arrays don’t share common suids • They can also be used to compare cDNA prints with HEEBO/MEEBO arrays • These synthetic gene lists are updated on a regular basis

  31. Synthetic Genes • Other common synthetic gene lists: • Chromosome arms • Cytobands • 5 Mb tiles based on GoldenPath mappings • Tissue types • Tumor types • Processes • For additional lists, see: http://smd.stanford.edu/help/synthGenes.shtml

  32. Synthetic Genes • You can use your own genelists: • 1 genelist for each synthetic gene • Name of the genelist is the synthetic gene’s name • Tab-delimited text file • File must have a header (NAME, WEIGHT) • NAME contains the cloneid • WEIGHT can be -1 to 1 (weight of the clone during averaging) • Can have comment lines (start with #)
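The genelist format described on this slide can be read and applied as follows. This is a sketch only: the slides do not say how the weights enter the average, so the formula here (sum of weight × value divided by the sum of absolute weights, skipping missing values) is an assumption, and the clone ids, data, and function names are hypothetical.

```python
# Hypothetical genelist: comment line, NAME/WEIGHT header, one clone per line.
GENELIST = (
    "# hypothetical genelist for one synthetic gene\n"
    "NAME\tWEIGHT\n"
    "cloneA\t1\n"
    "cloneB\t1\n"
    "cloneC\t-1\n"
)


def read_genelist(text):
    """Return {cloneid: weight} from a tab-delimited genelist."""
    weights = {}
    header_seen = False
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        if not header_seen:
            header_seen = True            # skip the NAME/WEIGHT header row
            continue
        name, weight = line.split("\t")[:2]
        weights[name] = float(weight)
    return weights


def synthetic_gene(weights, data):
    """Weighted average per array over the listed clones.

    Assumed formula: sum(w * x) / sum(|w|), ignoring missing values (None).
    `data` maps cloneid -> list of values, one per array.
    """
    n_arrays = len(next(iter(data.values())))
    result = []
    for j in range(n_arrays):
        num = den = 0.0
        for clone, w in weights.items():
            value = data[clone][j] if clone in data else None
            if value is not None:
                num += w * value
                den += abs(w)
        result.append(num / den if den else None)
    return result


weights = read_genelist(GENELIST)
data = {"cloneA": [1.0, 2.0], "cloneB": [3.0, None], "cloneC": [-2.0, 4.0]}
print(synthetic_gene(weights, data))
```

Under this assumed formula a weight of -1 flips the sign of a clone's values before averaging, which is one plausible reading of a negative weight.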

  33. Synthetic Genes • Tool only works on pcl files in repository • During data retrieval the ‘include UIDs’ option should not be used • After collapsing, file can be downloaded, added to your repository, and/or clustered • Currently works only for human and mouse data

  34. Synthetic Genes/Merge PCL Files • Related tool: Merge PCL Files • On main page (lists menu -> all programs) under tools section • Can be used to combine 2 pcl files from different sources into a single pcl file. • Cloneids that belong to the same gene can be combined into single row (based on a translation file provided).

  35. Synthetic Genes/Merge PCL Files

  36. Synthetic Genes/Merge PCL Files • Same experiments in the pcl files can be averaged • Averaging method can be mean/median • Translation file: • Tab-delimited text file • First column: desired final identifier • Second column: desired final annotation • Third and subsequent columns: identifiers (first column of a pcl file) in the pcl files that should be collapsed to the identifier in the first column. • Data for identifiers not included in the translation file will not be collapsed
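The translation-file layout above can be sketched in Python. This is an illustration under assumptions, not the Merge PCL Files tool itself: the identifiers are hypothetical, the rows are assumed to have no missing values, and the function names are invented for the example.

```python
from statistics import mean, median

# Hypothetical translation file: final id, final annotation, then source ids.
TRANSLATION = "GENE1\tprotein kinase\tcloneA\toligo7\n"


def read_translation(text):
    """Map each source identifier to (final_id, annotation)."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        final_id, annotation, *sources = line.split("\t")
        for src in sources:
            if src:
                mapping[src] = (final_id, annotation)
    return mapping


def collapse(rows, mapping, method=mean):
    """Collapse rows sharing a final identifier, column by column.

    `rows` maps identifier -> list of values. Identifiers that are not in
    the translation file are kept unchanged.
    """
    groups = {}
    for ident, values in rows.items():
        final_id = mapping[ident][0] if ident in mapping else ident
        groups.setdefault(final_id, []).append(values)
    return {
        final_id: [method(column) for column in zip(*members)]
        for final_id, members in groups.items()
    }


mapping = read_translation(TRANSLATION)
rows = {"cloneA": [1.0, 2.0], "oligo7": [3.0, 4.0], "cloneZ": [0.5, 0.5]}
print(collapse(rows, mapping, method=mean))
print(collapse(rows, mapping, method=median))
```

Passing `mean` or `median` as the `method` argument mirrors the choice of averaging method mentioned on the slide.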

  37. KNNImpute: The Missing Values Problem • Microarrays can have systematic or random missing values • Some algorithms aren’t robust to missing values • Large literature on parameter estimation exists • What’s best to do for microarrays?

  38. Why Estimate Missing Values? • Figure panels: complete data set; data set with 30% of entries missing (missing values appear black); data set with missing values estimated by the KNNimpute algorithm

  39. KNNimpute Algorithm • Idea: use genes with similar expression profiles to estimate missing values • Example (the value in column j is missing for Gene X): before imputation, Gene X = 2 | ? | 5 | 7 | 3 | 1, Gene B = 2 | 4 | 5 | 7 | 3 | 2, Gene C = 3 | 5 | 6 | 7 | 3 | 2 • After imputation, Gene X = 2 | 4.3 | 5 | 7 | 3 | 1
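The idea on this slide can be sketched with the three genes shown. The slide does not state the distance measure or the weighting, so this sketch assumes Euclidean distance over the columns observed in Gene X and weights of 1/distance; under those assumptions the estimate is about 4.37, close to but not identical with the 4.3 on the slide. The function name and parameters are hypothetical.

```python
import math


def knn_impute_value(target, neighbors, col, k=2):
    """Estimate target[col] from the k most similar complete rows.

    Assumptions: Euclidean distance over the columns observed in `target`,
    and a weighted average with weights 1 / distance.
    """
    observed = [i for i, v in enumerate(target) if v is not None and i != col]
    scored = []
    for row in neighbors:
        if row[col] is None:
            continue                      # a neighbor must have the value itself
        dist = math.sqrt(sum((target[i] - row[i]) ** 2 for i in observed))
        scored.append((dist, row[col]))
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    for dist, value in nearest:
        if dist == 0:
            return value                  # identical profile: use its value directly
    weights = [1.0 / dist for dist, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


gene_x = [2, None, 5, 7, 3, 1]            # value in column j (index 1) is missing
gene_b = [2, 4, 5, 7, 3, 2]
gene_c = [3, 5, 6, 7, 3, 2]

estimate = knn_impute_value(gene_x, [gene_b, gene_c], col=1, k=2)
print(round(estimate, 2))
```

Gene B is closer to Gene X than Gene C is, so its value of 4 pulls the estimate below the plain average of 4.5.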

  40. Clustering: Cluster Image • Scale is indicated on the color bar • Gene names are at the right • Tree generated by hierarchical clustering is at the left

  41. Clustering Display: Clustered Spot Images • Spot images can also be viewed in a clustered image • This can give you a visual impression of the data that are the basis of your analysis

  42. Clustering Display: Adjacent Cluster and Clustered Spot Images

  43. GENEXPLORER

  44. TREEVIEW

  45. SMD: Getting Help • Click on the “Help” menu • Tool-specific links will be listed at the top. • Use the SMD help index to look for specific subjects • Send e-mail to: array@genome.stanford.edu

  46. Quality Assessment and Repository Tutorial • Quality assessment tools • Ratios on Array • H/Meebo plots • Graphing tool • Q-score • Repository • Repository • SVD • Synthetic Gene Tool • kNNimpute

  47. Ratios on Array Tool • Accessible from the display data -> view data pages • Ratios on array

  48. Ratios on Array Tool • Quick visualization of log-ratio distribution on the slide • Color assignments are based on log-ratio values and also intensity • Can visualize normalized or non-normalized log-ratios • PLUS: ANOVA analysis to detect spatial bias (print-tip or plate)

  49. Ratios on Array Tool • Not normalized vs. normalized (loess intensity, print-tip)

  50. Ratios on Array Tool • One-way ANOVA to test dependence of log-ratios on print-tip and printing plate • The F-statistic is given for the null hypothesis of no bias in the data • In the example, normalization significantly reduced print-tip bias
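The one-way ANOVA test can be illustrated with a small calculation of the F-statistic, grouping log-ratios by print-tip. This is a sketch of the textbook formula rather than SMD's code, and the log-ratio values below are invented for illustration.

```python
def one_way_anova_f(groups):
    """Return F = between-group mean square / within-group mean square."""
    all_values = [v for group in groups for v in group]
    n = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n
    group_means = [sum(group) / len(group) for group in groups]
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for group, m in zip(groups, group_means) for v in group
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical log-ratios for three print-tips, before and after normalization.
before = [[0.9, 1.1, 1.0], [0.1, -0.1, 0.0], [-1.0, -0.9, -1.1]]
after = [[0.1, -0.1, 0.0], [0.1, -0.1, 0.0], [0.0, 0.1, -0.1]]

f_before = one_way_anova_f(before)
f_after = one_way_anova_f(after)
print(f"F before normalization: {f_before:.1f}")
print(f"F after normalization:  {f_after:.1f}")
```

A large F means the print-tip means differ far more than the scatter within each print-tip would explain, which points to spatial bias; a value near zero is consistent with no print-tip effect.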
