Working with Mapped Reads in R and Bioconductor
Advanced Genomic Data Analysis, BIOS 691-804, 2012
Mark Reimers
Mapped Read File Formats
• Common standard is Sequence Alignment/Map (SAM)
• Accommodates most kinds of sequence information commonly used
• Efficient for stream processing
• Compressed binary version: Binary Alignment/Map (BAM)
• Coordinate-sorted BAM files may be indexed by genomic position to efficiently retrieve all reads aligning to a locus
• Other formats in use (e.g. SOAP) have mostly similar fields
• Variant Call Format (VCF) is used by the 1000 Genomes Project for SNPs and structural variants
SAM Example - File Format
    @HD VN:1.3 SO:coordinate
    @SQ SN:ref LN:45
    r001 163 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
    r002   0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
    r003   0 ref  9 30 5H6M       *  0   0 AGCTAA            * NM:i:1
    r004   0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
    r003  16 ref 29 30 6H5M       *  0   0 TAGGC             * NM:i:0
    r001  83 ref 37 30 9M         =  7 -39 CAGCGCCAT         *
• r003: read identifier
• ref: reference sequence (e.g. chr18)
• 8M2I4M1D3M: CIGAR string
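To make the CIGAR string concrete, here is a minimal sketch in plain Python (not the Rsamtools or SAMtools API) that splits a CIGAR string into its operations and works out which reference positions a read covers. The function names `parse_cigar` and `reference_span` are hypothetical, chosen for illustration; the set of reference-consuming operations (M, D, N, =, X) follows the SAM specification.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
REF_CONSUMING = set("MDN=X")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Reject strings with leftover characters (including "*", i.e. no CIGAR).
    if "".join(n + op for n, op in pairs) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in pairs]

def reference_span(pos, cigar):
    """Return the 1-based inclusive (start, end) covered on the reference."""
    length = sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)
    return pos, pos + length - 1

# First record of the example: r001 starts at position 7 with 8M2I4M1D3M.
print(parse_cigar("8M2I4M1D3M"))
print(reference_span(7, "8M2I4M1D3M"))   # (7, 22)
# Spliced read r004: the 14N skip still counts toward the reference span.
print(reference_span(16, "6M14N5M"))     # (16, 40)
```

Insertions (I), soft clips (S), hard clips (H) and padding (P) are ignored when computing the span because they do not advance the position on the reference.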
Tools for SAM Data
• SAMtools
• Rsamtools
• GATK: especially good for SNP calling
• Picard: good for pre-processing SAM files, e.g. removing duplicates
SAMtools
• Utilities for manipulating alignments in the SAM format, including sorting, merging, and indexing
• See http://samtools.sourceforge.net/
• Rsamtools: an R interface to most of the SAMtools functionality
IRanges
• Provides efficient low-level S4 classes for storing ranges of integers and RLE (Run-Length Encoding) vectors
• RLE: sequences in which the same data value occurs in many consecutive data elements are stored as a single data value and a count
• Provides several methods for manipulating sequences
• The foundation of GenomicRanges
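The run-length encoding idea can be sketched in a few lines of plain Python. This only illustrates the concept and is not how IRanges implements its RLE vectors; `rle_encode`, `rle_decode` and the toy coverage vector are made up for the example.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the full sequence."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out

# Hypothetical per-base read coverage along a short stretch of a chromosome.
coverage = [0, 0, 0, 1, 1, 2, 2, 2, 2, 1, 0, 0]
runs = rle_encode(coverage)
print(runs)   # [(0, 3), (1, 2), (2, 4), (1, 1), (0, 2)]
assert rle_decode(runs) == coverage
```

Coverage vectors over a genome are dominated by long runs of identical values (often zero), which is why this representation saves so much memory.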
GenomicRanges
• General-purpose containers for storing genomic intervals, as well as more specialized containers for storing alignments against a reference genome
• Usually store many intervals in one object
• Key function: countOverlaps() counts, for each range in one set, how many ranges in another set overlap it
• Ideal for counting reads in exons or genes
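What an overlap count computes can be shown with a small plain-Python sketch. It is not the GenomicRanges implementation: it assumes a single chromosome, ignores strand, and treats ranges as 1-based inclusive (start, end) pairs. The name `count_overlaps` and the exon coordinates are hypothetical; the read ranges are the reference spans of some reads in the SAM example.

```python
from bisect import bisect_left, bisect_right

def count_overlaps(query, subject):
    """For each query range, count the subject ranges that overlap it.

    Ranges are (start, end) pairs, 1-based and inclusive.
    """
    starts = sorted(start for start, _ in subject)
    ends = sorted(end for _, end in subject)
    counts = []
    for q_start, q_end in query:
        begun = bisect_right(starts, q_end)    # subject ranges starting at or before q_end
        finished = bisect_left(ends, q_start)  # subject ranges ending before q_start
        # Every range that finished before q_start also began before q_end,
        # so the difference is exactly the number of overlapping ranges.
        counts.append(begun - finished)
    return counts

exons = [(5, 20), (30, 45), (50, 60)]                      # hypothetical exons
reads = [(7, 22), (9, 18), (16, 40), (29, 33), (37, 45)]   # read spans on "ref"
print(count_overlaps(exons, reads))   # [3, 3, 0]
```

Sorting starts and ends once and using binary search keeps the cost near O((n + m) log m) rather than comparing every exon with every read.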