Expression Profile Clustering


afries

Presentation Transcript


  1. Expression Profile Clustering
     • Expression Profile = the pattern of signal values for one gene over several chips.
     • Expression Profile Clustering = the clustering of “similar” profiles.
     • Why?
       • Similar expression profiles suggest:
         • regulation (by shared factor or cluster member)
         • related function
       • ALSO: opposite expression profiles suggest:
         • regulation (e.g. inhibition)
     • Software:
       • EpClust
       • J-Express
       • Etc.

  2. Guided Tour of EpClust

  3. Data file format
     • Can enter in a variety of formats.
       • See EpClust’s data upload page.
     • One simple way:
       • Download tab-delimited data from NASC.
       • Open with (or paste into) Excel.
       • Delete all but one “name” column and the Signal value columns.
       • Do not include any hyphens!
         • (I try to avoid all punctuation)
       • Save as tab-delimited text.
     • Annotation can be added as either:
       • A second file with the same first column of names
       • A single second column in the signal file (if specified)
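The Excel steps above could also be scripted. The following is a minimal sketch, assuming a tab-delimited export with a header row. The column names (`Gene-ID`, `Chip1-Signal`, ...) and the two data rows are made up for illustration and are not the actual NASC headers. It keeps one name column plus the signal columns and strips hyphens and other punctuation from names and headers.

```python
import csv
import io
import re

def clean_name(name):
    """Drop hyphens and other punctuation, keeping letters, digits and underscores."""
    return re.sub(r"[^A-Za-z0-9_]", "", name)

def to_epclust_table(raw_text, name_col, signal_cols):
    """Keep one name column plus the signal columns; return tab-delimited text.

    Signal values are copied unchanged, so a negative value would still
    carry a minus sign.
    """
    reader = csv.DictReader(io.StringIO(raw_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([clean_name(name_col)] + [clean_name(c) for c in signal_cols])
    for row in reader:
        writer.writerow([clean_name(row[name_col])] + [row[c] for c in signal_cols])
    return out.getvalue()

# Hypothetical input: column names and values are invented for this example.
raw = (
    "Gene-ID\tDescription\tChip1-Signal\tChip2-Signal\n"
    "gene-1\tsome annotation\t120.5\t98.1\n"
    "gene-2\tother annotation\t15.2\t300.4\n"
)
cleaned = to_epclust_table(raw, "Gene-ID", ["Chip1-Signal", "Chip2-Signal"])
print(cleaned)
```

To work on a real file, read it with `open(path, newline="").read()` and write the returned text back out. Whether EpClust accepts a header row in exactly this form is an assumption; check its data upload page.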

  4. To Upload Data

  5. To select uploaded data

  6. Select Specific Experiments within Input File

  7. To Remove Unreliable Data

  8. To Choose Algorithm Type
     K-means
     • Initially choose a set number (K) of clusters.
       1) Chooses the K most different profiles.
       2) Clusters each remaining profile with one of the K.
     • So it takes much less time than Hierarchical (if there are many genes).
     • Cluster size can vary, so it can return too many or too few genes, with no way to select. You must then repeat with different K values.
     Hierarchical
     • Measures the distance between each profile (i.e. gene) and every other profile.
     • So time increases steeply with each added gene (the number of pairwise distances grows with the square of the number of genes).
     • Then clusters the closest genes together, followed by increasingly distant ones,
     • into a tree of clusters within clusters.
     • The tree can be visualised and finally split at a chosen distance, with knowledge of the tree.
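The two K-means steps on this slide can be sketched in a few lines. This is an illustrative sketch, not EpClust's actual implementation. The seeding rule (farthest-first, starting from the profile farthest from the overall mean), the Euclidean distance, and the centroid-update loop are all assumptions chosen to match the description "choose the K most different profiles, then cluster each remaining profile with one of the K".

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first_seeds(profiles, k):
    """Pick k mutually distant profiles as initial cluster centres (assumed rule)."""
    n = len(profiles[0])
    mean = [sum(p[i] for p in profiles) / len(profiles) for i in range(n)]
    seeds = [max(profiles, key=lambda p: euclidean(p, mean))]
    while len(seeds) < k:
        # Next seed: the profile whose nearest existing seed is farthest away.
        seeds.append(max(profiles, key=lambda p: min(euclidean(p, s) for s in seeds)))
    return seeds

def assign(profiles, centres):
    """Return, for each profile, the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda c: euclidean(p, centres[c]))
            for p in profiles]

def k_means(profiles, k, max_iter=20):
    centres = [list(s) for s in farthest_first_seeds(profiles, k)]
    labels = assign(profiles, centres)
    for _ in range(max_iter):
        # Move each centre to the mean of its current members, then reassign.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = assign(profiles, centres)
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Made-up profiles: two rising and two falling over three chips.
profiles = [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [3.0, 2.0, 1.0], [2.9, 2.2, 1.1]]
labels = k_means(profiles, k=2)
print(labels)
```

Each pass compares every profile with only K centres, whereas the hierarchical approach needs a distance for every pair of profiles, which is why K-means scales better to many genes.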

  9. So Which is “Better”?
     • K-Means is better able to manage large data-sets.
     • Hierarchical seems a more objective approach,
       • in that you don’t need to decide the cluster number at the start.
     • But which is more biologically informative?
       • Opinion is divided.
       • Both are artificial.
       • They don’t prove, just suggest.
     • K-means gives more consistent results.
     • …and then there’s SOTA too!

  10. Alternative Distance Measurements

  11. Pearson-Based Distance Measurement
     • The most commonly used.
     • Measures how similar the SHAPES of the two profiles are.
     • Based on the average of the values and the standard deviation.
     • Rates from identical (1) to completely uncorrelated (0) to perfect opposites (-1).
     Centred Test?
     • If profiles have identical shape, but are offset from each other by a fixed value (or magnitude):
       • Centred: Identical (1)
       • Uncentred: Not (<1)
     Absolute Test?
     • If profiles have perfect opposite expression patterns:
       • Absolute: Identical (1)
       • Non-Absolute: perfect opposites (-1)
     Parametric Test?
     • Parametric: assumes normal distribution. More rigorous where there are no outliers.
     • Non-parametric (= Spearman rank): More rigorous where there are outliers.
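The three tests on this slide can be tried on small made-up profiles. The sketch below uses the standard textbook definitions of centred Pearson, uncentred Pearson (cosine similarity) and Spearman rank correlation; whether EpClust computes them in exactly this way is an assumption. Profiles with zero variance would divide by zero and are not handled here.

```python
import math

def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def uncentred_pearson(x, y):
    """Correlation without subtracting the means (cosine similarity)."""
    return _dot(x, y) / math.sqrt(_dot(x, x) * _dot(y, y))

def centred_pearson(x, y):
    """Standard Pearson correlation: subtract each profile's mean first."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return uncentred_pearson([a - mx for a in x], [b - my for b in y])

def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            result[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Non-parametric version: Pearson correlation of the ranks."""
    return centred_pearson(ranks(x), ranks(y))

# Made-up profiles over four chips.
base = [1.0, 2.0, 3.0, 4.0]
shifted = [11.0, 12.0, 13.0, 14.0]   # same shape, offset by 10
opposite = [4.0, 3.0, 2.0, 1.0]      # mirror image
outlier = [1.0, 2.0, 3.0, 100.0]     # same order, one extreme value

print("centred, shifted:   ", centred_pearson(base, shifted))
print("uncentred, shifted: ", uncentred_pearson(base, shifted))
print("centred, opposite:  ", centred_pearson(base, opposite))
print("absolute, opposite: ", abs(centred_pearson(base, opposite)))
print("pearson, outlier:   ", centred_pearson(base, outlier))
print("spearman, outlier:  ", spearman(base, outlier))
```

The centred score for the shifted profile comes out at about 1 while the uncentred score falls below 1. The opposite profile scores about -1, or 1 once the absolute value is taken. The outlier pulls the parametric score below 1 while the rank-based score stays at about 1.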

  12. Euclidean-Based Distance Measurement
     • Measures the distance between gene expression levels directly.
     • Based on the magnitude of changes.
       • More about signal VALUES. Less about profile shapes.
     • Data must be suitably normalized,
       • e.g. use log-ratios of signals.
     • Euclidean: the shortest path between points.
     • Manhattan: the sum of distances along each dimension.
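As a small worked example of the two distances and of the log-ratio normalisation mentioned above, here is a sketch with made-up values. The choice of base-2 logarithms and of a per-chip reference signal are assumptions for illustration only.

```python
import math

def log_ratios(signals, reference):
    """Convert raw signals to log2(signal / reference), chip by chip."""
    return [math.log2(s / r) for s, r in zip(signals, reference)]

def euclidean(a, b):
    """Straight-line (shortest path) distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences along each dimension (chip)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two made-up profiles over two chips.
p, q = [0.0, 0.0], [3.0, 4.0]
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0

# Made-up raw signals against a reference of 100 on both chips.
print(log_ratios([200.0, 50.0], [100.0, 100.0]))   # [1.0, -1.0]
```

With log-ratios, a doubling and a halving sit the same distance from zero, which is one reason raw signals are usually transformed before a magnitude-based distance is applied.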

  13. To Choose Algorithm Type

  14. RESULTS

  15. RESULTS _MyGene

  16. _MyGene

  17. Extra Options

  18. Extra Options
     (Options labelled in the slide image: Format Results Picture, Statistics, Export Lists as Text, Phylip Format, Search for Promoter Motifs)
     • Export clusters as:
       • text (lists of gene names), with the tree cut at a chosen height
       • or formatted for Phylip (tree drawing software)
     • Search for Promoter Motifs:
       • Visualise the promoter region next to each profile.
       • Highlight motifs.
       • You must tell it the sequences of the motifs though.
         • Can get these by querying a database such as PlantCARE using the upstream sequence of our particular gene of most interest.
       • Must prepare a file of upstream sequences.
         • Can get these from TAIR using a list of gene names.
       • Currently must arrange with EpClust staff to upload.

  19. PlantCARE Cis-Acting Regulatory Elements
     http://intra.psb.ugent.be:8080/PlantCARE/
     • Database of CAREs
     • Tools
       • Search for CARE
         • Enter the upstream sequence of a gene.
         • Lists and then highlights known motifs from the database.
       • Motif Sampler
         • Enter the upstream sequences of your cluster.
         • Highlights 8mers conserved more in cluster members than in the genome.
       • Other tools
         • Clustering
         • Query for info on motifs
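The idea behind "8mers conserved more in cluster members than in the genome" can be illustrated with a naive count. This is not PlantCARE's Motif Sampler algorithm, only a sketch of the comparison it reports. The function names, the `min_gap` threshold and all sequences below are invented for the example.

```python
def kmers_in(seq, k):
    """Set of all length-k substrings present in one sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def presence_fraction(seqs, k):
    """For each k-mer, the fraction of sequences that contain it at least once."""
    counts = {}
    for s in seqs:
        for m in kmers_in(s, k):
            counts[m] = counts.get(m, 0) + 1
    return {m: c / len(seqs) for m, c in counts.items()}

def enriched_kmers(cluster_seqs, background_seqs, k=8, min_gap=0.5):
    """k-mers found in a larger share of cluster than background sequences.

    Returns (kmer, cluster_fraction, background_fraction) tuples, with the
    largest difference first. min_gap is an arbitrary illustrative threshold.
    """
    cl = presence_fraction(cluster_seqs, k)
    bg = presence_fraction(background_seqs, k)
    hits = [(m, f, bg.get(m, 0.0)) for m, f in cl.items()
            if f - bg.get(m, 0.0) >= min_gap]
    return sorted(hits, key=lambda h: (h[2] - h[1], h[0]))

# Made-up upstream sequences: the three cluster members share ACACGTGGC.
cluster = ["TTACACGTGGCAAT", "GGACACGTGGCTTA", "CCACACGTGGCGGA"]
background = ["TTTTGGGGAAAACCCC", "ACACGTGGTTTTTTTT",
              "GATTACAGATTACAGA", "CCCCCCCCGGGGGGGG"]

hits = enriched_kmers(cluster, background)
for kmer, in_cluster, in_background in hits:
    print(kmer, in_cluster, in_background)
```

A real analysis would use the whole-genome set of upstream regions as background and would need a statistical test rather than a fixed threshold.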

  20. Gene Ontology
     • A hierarchical structure to describe gene function.
     • As PlantCARE compares:
       • Expression Profile Clusters to Promoter Motif Conservation
     • There are also tools to compare:
       • Expression Profile Clusters to Gene Ontology
