Visualization and data repositories Using genome browsers
Motivation • Aside from R, genome browsers are arguably the most important tool in computational genomics • …but they are not widely used in the experimental community • The browser gives you an immediate edge - you can look at data, form hypotheses, and upload and download data!
In this course 1: How to use the web interface; understanding the data types 2: How to download and upload data to the browser; interaction with R 3: How to make complex analyses between data types; Galaxy and R
Today's teaching: • Lectures with genome browser examples • Short discussions with your neighbour • Exercises
Kick starting with a challenge • You are a major sequencing center • You have sequenced the killer whale (Orca) genome - you have the whole genome as a stretch of ACGTs • How do you make sense of this and show it to others? What value does the data have in itself? • 2 minutes with your neighbour
Jim Kent, assembly-guru. Some profound words about the genome sequence “Well, it has a lot of G, C, A and Ts”
Genomes are worthless • …without any annotation • What type of annotations do we want to put on genomes? • 2 minutes with your neighbour
Examples: • 'DNA' annotation: • Known genes • Predicted genes • Repeats, transposons, CpG islands • Conservation across species • 'Dynamic' annotation: • Known transcripts • Expression data • DNA modifications
How to present this data? • Plain text files are useless... for most biologists • Use the genome sequence as a frame, on which we map real data or predictions
The idea of the browser • Based on the genome, we can • Zoom in and out, and scroll sideways • See the data in different representations • Select WHAT data we want to see (way too much data to look at all at once) • Important side-effect: if we map all interesting data, all data is in one place, which means that we can download what we are interested in to do analysis!
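To make zooming and scrolling concrete, here is a minimal Python sketch that treats the browser view as a window (start, end) on one chromosome. The function names `zoom` and `scroll` and the clamping at position 0 are assumptions made for illustration; this is not how any particular browser implements it.

```python
def zoom(start, end, factor):
    """Return a new window with the same centre.

    factor < 1 zooms in (narrower window), factor > 1 zooms out.
    The window is clamped so it never starts before position 0.
    """
    center = (start + end) / 2
    half_width = (end - start) * factor / 2
    return max(0, round(center - half_width)), round(center + half_width)


def scroll(start, end, fraction):
    """Shift the window sideways by a fraction of its own width.

    Positive fractions move right, negative fractions move left.
    The width is kept, and the window is stopped at position 0.
    """
    width = end - start
    new_start = max(0, start + round(width * fraction))
    return new_start, new_start + width


print(zoom(1000, 2000, 0.5))    # zoom in 2x
print(scroll(1000, 2000, 0.5))  # move half a window to the right
```

The same two operations are all that is needed to navigate: every view in the browser is just a chromosome name plus a start and an end coordinate.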
The three browsers • UCSC genome browser • http://genome.ucsc.edu • Updated often, simple but powerful interface. Very simple underlying data formats • Ensembl • http://www.ensembl.org • More complex web interface, with multiple zoom levels. Very complex underlying data formats • The Generic Genome Browser (GBrowse) • http://www.gmod.org/GBrowse • Actually more of a software development platform, so that you can build your own. Resembles UCSC more than Ensembl
In this course… • We will only use the UCSC browser due to • Simplicity • Lecturer bias • The Galaxy tool - a very nifty web tool for power-user analysis of UCSC data (more later) • If you know this browser, other browsers are easy to understand
Basic concepts • Zooming • Data tracks
Data tracks - the problem Example: The road from Melby to Ølby [Figure: repeated copies of the Melby to Ølby road map, each with 5 km and 10 km scale markings]
Data tracks - the solution [Figure: a single Melby to Ølby road map with 5 km and 10 km scale markings, with additional labels: houses, trees, Monday, Sunday]
This is how genome browsers show the data [Figure labels: chromosome position, gene track, mRNA track, exons, introns]
Annotation tracks • A track is often one source of data, from a particular place, that is mapped to the genome • Data can be viewed as “blocks” with a start and an end, expressed as chromosome coordinates • It is important to know what the data is before trying to interpret it • We will first look at the “human mRNA” track
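The idea of a track as a set of "blocks" with chromosome coordinates can be sketched in a few lines of Python. The `Block` class, the function names and the example coordinates below are all hypothetical, and the sketch assumes 0-based, half-open coordinates (start included, end excluded), which is the convention of the BED format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    chrom: str   # chromosome name, e.g. "chr1"
    start: int   # 0-based, inclusive (assumed convention)
    end: int     # exclusive
    name: str    # label shown next to the block


def overlaps(block, chrom, start, end):
    """True if the block shares at least one base with the window."""
    return block.chrom == chrom and block.start < end and start < block.end


def visible_blocks(track, chrom, start, end):
    """Return the blocks of one track that fall inside the current view."""
    return [b for b in track if overlaps(b, chrom, start, end)]


# A made-up track with three blocks
track = [
    Block("chr1", 100, 200, "mRNA_A"),
    Block("chr1", 150, 400, "mRNA_B"),
    Block("chr2", 100, 200, "mRNA_C"),
]

print([b.name for b in visible_blocks(track, "chr1", 180, 300)])
# prints ['mRNA_A', 'mRNA_B']
```

Drawing a track then amounts to asking, for the current window, which blocks overlap it. The same overlap test is also the basis for comparing two tracks with each other.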
Human mRNA track • What the guys at UCSC did: • Take all the known mRNAs in GenBank, and map these to the human genome using a program called BLAT (similar to BLAST). Everything that hits will be shown in this track. • What are the pros and cons of this approach? What are the limitations? 2 minutes with your neighbour!
Example answers: Pros • Simple, and no filtering - leaving the interpretation to me Cons • Not real annotation - again, leaving the interpretation to me • Heavily reliant on the quality of the data source • Limited by the extent of the data
A short non-interactive tour • We will use the browser extensively from now on • But first, I will guide you through a few key concepts - otherwise confusion ensues when trying the real thing
What version of the genome do you have? • Genome sequences are based on many short sequenced reads, which then are assembled into a single sequence • This is very tricky, and we get slightly updated genomes at regular intervals • A version of the genome is called an assembly • So, whenever you say that you are using a genome sequence to do something, you have to say what assembly you are working on!
More about assemblies • The official naming system is • [species abbreviation][assembly number] • For instance hg17 (human no. 17), or mm8 (Mus musculus 8) • There is an alternative way: the date of the release. So, hg17 is also called “Human May 2004”
Even more about assemblies Rules of thumb: • The newer an assembly, the “better” • Some older assemblies have more data mapped to them (because they have been around longer) • Some genomes are new and unstable: updates come often, with big differences between them. Some are more mature (like human)
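As a small illustration of the naming scheme, the following Python sketch splits an assembly name into its species abbreviation and assembly number. The function name `parse_assembly` is hypothetical, and the pattern only covers names of the simple "letters followed by digits" form described above.

```python
import re


def parse_assembly(name):
    """Split an assembly name such as 'hg17' into ('hg', 17).

    Assumes the simple [species abbreviation][assembly number] form;
    anything else raises ValueError.
    """
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", name)
    if match is None:
        raise ValueError(f"not a recognised assembly name: {name!r}")
    return match.group(1), int(match.group(2))


print(parse_assembly("hg17"))  # prints ('hg', 17)
print(parse_assembly("mm8"))   # prints ('mm', 8)
```

Keeping the assembly name next to every coordinate you record (for example "hg17 chr1:100-200") avoids mixing positions from different versions of the genome.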
Selecting species & assembly Where on the genome Species Assembly: the genome “version”.
Looking at the genome, with mRNAs Chromosome overview Direction of arrows shows strand Different mRNAs (same gene)
Zooming in (We'll learn how later)
Some points: • Transcription, in this case, is right to left - transcription on the minus strand - shown by the arrows • Two of the mRNAs start here, the others start even further upstream. Probably alternative promoters • The fat, two-colored blocks are predicted to be protein-coding parts • Note that • There are parts of mRNAs that are not translated - so-called UTRs • There is one mRNA that is clearly non-coding (might have a stop codon further upstream)
Zooming even further down - we see the actual DNA Codons Clicking on any of these mRNAs takes you to the corresponding GenBank entry
Different data representations Each data track has a selection 'box' Use this to: - turn tracks on or off - change visualization
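When a transcript lies on the minus strand, its codons are read from the reverse complement of the genomic sequence shown in the browser. The Python sketch below illustrates this; the function names and the six-base toy sequence are made up for the example and do not come from any real gene.

```python
# Translation table for complementing DNA bases (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(genomic_seq, strand="+"):
    """Split a coding stretch into codons, honouring the strand.

    genomic_seq is given as displayed on the plus strand; for a
    minus-strand transcript it is reverse-complemented first.
    Trailing bases that do not fill a codon are dropped.
    """
    seq = reverse_complement(genomic_seq) if strand == "-" else genomic_seq
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


print(codons("TTACAT", strand="-"))  # prints ['ATG', 'TAA']
```

Read left to right on the plus strand, the toy sequence looks meaningless, but on the minus strand it is a start codon followed by a stop codon. This is why the strand arrows matter when reading codons in the browser.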
Examples of display modes: Full, Squished, Dense
Time to try it out... • Important: the genome browser shows many tracks by default, some of which are named in a confusing way • Don’t let this throw you. We will walk through them! • Go to http://genome.ucsc.edu/ • Click 'Genome browser' to the left
We'll use default position for now, so just click the 'Submit' button (which is on the right)
Overwhelmed? Many types of data! We will only use some, others you can explore yourselves Below the image, the data tracks are categorized for easier access:
Let’s look only at the Human mRNA track as before Challenge: Turn off all tracks, except “base position” and “human mRNA”! (Expand/collapse the categories, then hide tracks. Use 'refresh' to update the image.)
Challenge Using the following buttons, and what we already went through, find out: • What is the DNA sequence of the first two codons of mRNA DQ892408? • What is the “gene name” of the mRNAs we are looking at? • Do the two longest RNAs start at exactly the same place? • What are the neighboring genes?
Before we go any further… What are all these data? What can we use them for?
Fast info on a given track: • Click on the actual track name (over the box) • What does the “refseq genes” track hold? • How does it differ from “other refseq” or “Genscan genes”? • When would you use each track?
It is not realistic to go through all tracks in this course • …and not meaningful, because new tracks are added over time • We will go over the main types of tracks, and the relevant experimental methods for producing the tracks • Understanding what we are looking at is necessary for meaningful interpretation
Big groups of things, summarized • Sequence features • CpG islands • Repeats • Transcripts or parts of transcripts • mRNA, ESTs • The so-called genes (predicted or experimental) • Tiling array expression data • ChIP-chip • Variation within species (SNPs) • Conservation and alignments between species • net alignments, phastCons scores • The ENCODE dataset
Between transcription and translation – the modern RNA world • After transcription, RNAs are immature (precursor mRNAs). Processing them gives mature mRNAs, which can be exported to the cytoplasm and translated. As usual, we know only a small part of the mechanisms... • 5' cap structure is added • 3' polyA stretch is added • Splicing (not always!) • RNA editing (rare?)
Problem: We want to know what mRNAs look like... but RNA is unstable and can't be sequenced directly Solution: Turn them into cDNA first.
Into a plasmid – so, we have a library of plasmids each carrying one cDNA This is a “cDNA library” that later can be sequenced or used for other things
General problems with cDNA sequencing: • Reverse transcriptase falls off • Hard to sequence long transcripts • Many cDNAs are identical • Very expensive if you want to sequence all unique molecules
Solving the problem • Only sequence parts of cDNAs - these are called ESTs (more in a few slides) • Semi-recent development: sequencing of full-length cDNAs, using • Cap-trapping • PolyA primers • Subtraction
Subtraction: how to only get RNAs you have not seen yet • Simple concept: • For a cDNA sample, we add an excess of abundant RNAs. These will hybridize • Then, we remove everything which hybridized • …and sequence the rest
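The subtraction idea can be mimicked with a toy Python model in which each cDNA molecule is represented only by the name of its transcript, and "hybridizing" simply means having the same name as one of the abundant RNAs added in excess. The transcript names and counts below are invented for illustration, and real hybridization is of course not all-or-nothing.

```python
from collections import Counter


def subtract_library(cdna_sample, abundant_rnas):
    """Remove every cDNA that would hybridize to an abundant RNA.

    Toy model: a cDNA hybridizes if its transcript name is in the
    set of abundant RNAs. Everything else is kept for sequencing.
    """
    abundant = set(abundant_rnas)
    return [cdna for cdna in cdna_sample if cdna not in abundant]


# Made-up sample: two abundant transcripts and two rare ones
sample = ["abundant_1"] * 5 + ["abundant_2"] * 3 + ["rare_1", "rare_2"]

remaining = subtract_library(sample, ["abundant_1", "abundant_2"])

print(Counter(sample))     # counts before subtraction
print(Counter(remaining))  # counts after subtraction
```

Comparing the two counts shows what the procedure does to the composition of the library: the molecules that remain are the ones not seen before, while the original proportions between transcripts are no longer preserved.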
Discuss with your neighbour (2 min) Say that we have two cDNA libraries - one is subtracted, one is not What are they good for? Expression (how many transcripts of a certain gene)? Annotation and gene discovery?
Visualizing and annotating cDNAs in the genome browser • The genome is actually needed to make sense of cDNAs, especially if they are not protein-coding • A general approach is to map your cDNA to the genome using an alignment algorithm • Here, we will use BLAT and the UCSC browser • Should be straightforward, but... let's try it out: See the course page for 3 mouse sequences in the blat_seqs file – I will do one in real time • Assume these are new sequences, and you must decide whether they are good enough to be part of the genome browser