Agenda
• Where is stuff?
• Automating download of TCGA data
• DAM web service
• DCC web service
• Annotations web service
• Genomic bins
• Copy number polymorphisms/variations (CNVs)
• Mapping CBS segments to genes
/lpg/LPGCommon/schaefec/TCGA subdirectories
• Downloaded TCGA data (with scripts for download & pre-processing)
  • Annotations
  • Clinical
  • Agilent_GE_data
  • Agilent_MI_data
  • Illumina_ME_data
  • RNASeq
  • SNP_6_data
• Other
  • BIN_MAPPING
  • CNV
  • mRNA_meta_data
  • miRNA_meta_data
  • methylation_meta_data
• DAM_WebService, DCC_WS
• Analysis, AnalysisOutput
DAM Web Service
• Generic code in directory DAM_WebService
• Two steps
  • submit request, get back ticket (<disease>.1.xml)
  • poll until <status-message> = OK, then wget <archive-url> (<disease>.2.xml)
• Submit a bunch of requests in parallel
• Example: SNP_6_data/snp6_dam_ws.csh
• Currently, flattenDir does not appear to work
BRCA.1.xml
<job-process>
  <ticket>c55fd0dd-20de-494d-8c7f-4fa01f1900c6</ticket>
  <submission-time>2011-03-05T11:48:17.735-05:00</submission-time>
  <estimated-size>9336767</estimated-size>
  <status-check-url>http://tcga-data.nci.nih.gov/tcga/damws/jobprocess/xml/ticket/c55fd0dd-20de-494d-8c7f-4fa01f1900c6</status-check-url>
  <job-status>
    <status-code>201</status-code>
    <status-message>Created</status-message>
  </job-status>
</job-process>
BRCA.2.xml
<job-process>
  <ticket>c55fd0dd-20de-494d-8c7f-4fa01f1900c6</ticket>
  <submission-time>2011-03-05T11:48:17.735-05:00</submission-time>
  <estimated-size>9336767</estimated-size>
  <status-check-url>http://tcga-data.nci.nih.gov/tcga/damws/jobprocess/xml/ticket/c55fd0dd-20de-494d-8c7f-4fa01f1900c6</status-check-url>
  <job-status>
    <status-code>200</status-code>
    <status-message>OK</status-message>
    <archive-url>http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/userCreatedArchives/043dd58c-a3e2-4a3a-a8f7-8a040ab1d2f3.tar.gz</archive-url>
  </job-status>
</job-process>
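The two-step ticket/poll flow can be sketched in Python roughly as follows. This is an illustration only, not the code in DAM_WebService: the function names, the `fetch` callable (anything that takes a URL and returns the response text), the polling interval and the poll limit are all assumptions, and the XML constants are trimmed copies of the two BRCA responses shown in the slides.

```python
import time
import xml.etree.ElementTree as ET

# Trimmed copies of the BRCA.1.xml / BRCA.2.xml responses from the slides.
CREATED_XML = """<job-process>
  <ticket>c55fd0dd-20de-494d-8c7f-4fa01f1900c6</ticket>
  <status-check-url>http://tcga-data.nci.nih.gov/tcga/damws/jobprocess/xml/ticket/c55fd0dd-20de-494d-8c7f-4fa01f1900c6</status-check-url>
  <job-status>
    <status-code>201</status-code>
    <status-message>Created</status-message>
  </job-status>
</job-process>"""

DONE_XML = """<job-process>
  <ticket>c55fd0dd-20de-494d-8c7f-4fa01f1900c6</ticket>
  <status-check-url>http://tcga-data.nci.nih.gov/tcga/damws/jobprocess/xml/ticket/c55fd0dd-20de-494d-8c7f-4fa01f1900c6</status-check-url>
  <job-status>
    <status-code>200</status-code>
    <status-message>OK</status-message>
    <archive-url>http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/userCreatedArchives/043dd58c-a3e2-4a3a-a8f7-8a040ab1d2f3.tar.gz</archive-url>
  </job-status>
</job-process>"""


def parse_job_status(xml_text):
    """Return (status_code, status_message, archive_url); archive_url is None until ready."""
    status = ET.fromstring(xml_text).find("job-status")
    return (
        int(status.findtext("status-code")),
        status.findtext("status-message"),
        status.findtext("archive-url"),
    )


def poll_until_ready(ticket_xml, fetch, delay_seconds=30.0, max_polls=100):
    """Poll the ticket's status-check-url until status-message is OK.

    `fetch` is a hypothetical callable (url -> response text) supplied by the
    caller; delay_seconds and max_polls are arbitrary illustrative defaults.
    """
    url = ET.fromstring(ticket_xml).findtext("status-check-url")
    for _ in range(max_polls):
        _code, message, archive_url = parse_job_status(fetch(url))
        if message == "OK":
            return archive_url
        time.sleep(delay_seconds)
    raise TimeoutError(f"archive not ready after {max_polls} polls: {url}")
```

Passing `fetch` in as an argument keeps the parsing and polling logic testable without network access; in a real run it could wrap `urllib.request.urlopen`, and the returned archive URL would then be handed to wget as on the slide.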
DCC Web Service
• Complicated (and slow) but powerful interface
• Example request:
  • http://tcga-data.nci.nih.gov/tcgadccws/GetXML?query=Archive[@isLatest=1][Platform[@name=Genome_Wide_SNP_6]][Disease[@abbreviation=BRCA]][ArchiveType[@type=Level_3]]
• Generic parser looks for (class, field, attribute [maybe null]), e.g.
  • (“Archive”, “deployLocation”, undef)
  • (“Archive”, “fileCollection”, “xlink:href”)
• Example script: DCC_WS/FindFiles.pl
  • returns (DISEASE, date, directory, file)
• Example script: DCC_WS/PatientsPerTSS.pl
  • returns (DISEASE, TSS name, number of patients)
• Useful in pulling clinical XML (which is not available via DAM)
  • see Clinical/clinical_dcc_ws.csh
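The example request has a regular shape: a target class, filters on its own attributes, then one bracketed filter per related class. A minimal Python sketch of assembling such a query is shown below; the helper names and argument layout are invented for illustration and are not taken from DCC_WS, and the brackets are left unescaped exactly as they appear on the slide.

```python
BASE_URL = "http://tcga-data.nci.nih.gov/tcgadccws/GetXML"


def build_query(target, own_filters, related_filters):
    """Assemble a query string in the style of the slide's example.

    own_filters:     list of (attribute, value) on the target class
    related_filters: list of (class, attribute, value) on related classes
    """
    parts = [target]
    for attribute, value in own_filters:
        parts.append(f"[@{attribute}={value}]")
    for cls, attribute, value in related_filters:
        parts.append(f"[{cls}[@{attribute}={value}]]")
    return "".join(parts)


def build_url(target, own_filters, related_filters):
    """Full request URL; no percent-encoding is applied (an assumption)."""
    return f"{BASE_URL}?query={build_query(target, own_filters, related_filters)}"


# Reproduces the example request from the slide.
url = build_url(
    "Archive",
    [("isLatest", 1)],
    [
        ("Platform", "name", "Genome_Wide_SNP_6"),
        ("Disease", "abbreviation", "BRCA"),
        ("ArchiveType", "type", "Level_3"),
    ],
)
```

Swapping the disease abbreviation or platform name in the argument lists yields the corresponding request for other data sets, provided the service accepts the same filter pattern for them.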
Annotations
• Replaces the old disease_barcode_status.txt files
• Annotations/process.csh creates annotations.txt
  • (DISEASE, level, barcode, annotation type, TOSS/KEEP)
  • level: patient, sample, portion, aliquot
• FindKeepers.pl
  • inputs: list of candidate aliquots; annotations.txt
  • output: filtered (KEEP) list of candidate aliquots
• PickOneAliquot.pl: old script to avoid over-representation of one patient (cases: multiple portions; native DNA and WGA)
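One way the keeper filter could work is sketched below in Python. It is a guess at the logic, not a port of FindKeepers.pl: it assumes annotations.txt is tab-delimited with the five columns in the order listed above, and that a TOSS annotation at patient, sample or portion level also rules out every aliquot whose barcode extends that barcode. The barcodes and annotation type in the usage note are made up.

```python
def load_toss_barcodes(annotation_lines):
    """Collect barcodes (at any level) marked TOSS.

    Assumes tab-delimited rows: DISEASE, level, barcode, annotation type, TOSS/KEEP.
    """
    toss = set()
    for line in annotation_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip blank or malformed rows
        _disease, _level, barcode, _annotation_type, verdict = fields
        if verdict == "TOSS":
            toss.add(barcode)
    return toss


def find_keepers(candidate_aliquots, toss_barcodes):
    """Keep aliquots that neither match nor descend from a TOSS barcode.

    Assumption: a barcode 'descends' from another if it starts with that
    barcode followed by '-', e.g. an aliquot barcode under a patient barcode.
    """
    keepers = []
    for aliquot in candidate_aliquots:
        tossed = any(
            aliquot == barcode or aliquot.startswith(barcode + "-")
            for barcode in toss_barcodes
        )
        if not tossed:
            keepers.append(aliquot)
    return keepers
```

For example, with a patient-level TOSS row for the invented barcode `TCGA-A1-0001`, the candidate `TCGA-A1-0001-01A-01D-0001-01` would be dropped while `TCGA-A1-0002-01A-01D-0001-01` would be kept.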
BIN_MAPPING
• Continuous bin numbering across the genome
  • probably more complicated than necessary
  • several levels of resolution (200K, 20K, 10K)
  • originally to support the CN heatmaps in CGWB
• Use UCSC chromInfo.txt for chr length
• Use UCSC refFlat.txt for gene, exon coordinates
• Major issue on the horizon: hg18 vs hg19
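Continuous bin numbering can be illustrated with a short Python sketch: each chromosome gets a starting offset equal to the number of bins on all chromosomes before it, so a (chromosome, position) pair maps to a single genome-wide bin number. The function names, the 0-based position convention and the toy chromosome lengths are assumptions; real lengths would come from chromInfo.txt, and the actual BIN_MAPPING scheme may differ.

```python
def build_bin_offsets(chrom_lengths, bin_size):
    """Return ({chrom: first global bin}, total number of bins).

    chrom_lengths is an ordered list of (chrom, length) pairs.
    """
    offsets = {}
    next_bin = 0
    for chrom, length in chrom_lengths:
        offsets[chrom] = next_bin
        next_bin += (length + bin_size - 1) // bin_size  # ceiling division
    return offsets, next_bin


def global_bin(offsets, chrom, position, bin_size):
    """Genome-wide bin number for a 0-based position on a chromosome."""
    return offsets[chrom] + position // bin_size


# Toy lengths chosen for readability, not real genome values.
TOY_LENGTHS = [("chr1", 450), ("chr2", 300)]
offsets, total_bins = build_bin_offsets(TOY_LENGTHS, bin_size=200)
```

With these toy values chr1 occupies bins 0 to 2 and chr2 starts at bin 3. Because the offsets depend on chromosome lengths, bin numbers computed against hg18 are not interchangeable with those computed against hg19.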
CNVs -- DGV
• Based on published studies
• Keep only VariationType == ‘CopyNumber’ (vast majority)
• Min sample size for variation: 30
• Min frequency: 0.30
• Output format like CBS output:
  • (“dgv”, chr, seg-start, seg-end, “1”, “2.0”)
  • 1 == phony number of markers
  • 2.0 == phony log2ratio
• Combine overlapping segments by using bins, size = 1000 bp
  • so there is some loss of resolution
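The filter and the bin-based merge can be sketched in Python as follows. The record field names (`variation_type`, `sample_size`, `frequency`) are hypothetical stand-ins for whatever the DGV download provides, coordinates are assumed to be 0-based with inclusive ends, and the merge is one plausible reading of "combine overlapping segments by using bins", not the original script.

```python
MIN_SAMPLE_SIZE = 30
MIN_FREQUENCY = 0.30


def keep_variation(record):
    """Apply the three slide criteria to one DGV record (a dict)."""
    return (
        record["variation_type"] == "CopyNumber"
        and record["sample_size"] >= MIN_SAMPLE_SIZE
        and record["frequency"] >= MIN_FREQUENCY
    )


def merge_by_bins(segments, bin_size=1000):
    """Merge (chrom, start, end) segments by marking the bins they touch.

    Runs of consecutive marked bins become one CBS-style output row, with
    the phony marker count "1" and phony log2ratio "2.0" from the slide.
    """
    covered = {}
    for chrom, start, end in segments:
        bins = covered.setdefault(chrom, set())
        bins.update(range(start // bin_size, end // bin_size + 1))

    def row(chrom, first_bin, last_bin):
        # Segment boundaries snap to bin edges: this is the loss of resolution.
        return ("dgv", chrom, first_bin * bin_size, (last_bin + 1) * bin_size - 1, "1", "2.0")

    merged = []
    for chrom in sorted(covered):
        bins = sorted(covered[chrom])
        run_start = prev = bins[0]
        for b in bins[1:]:
            if b != prev + 1:
                merged.append(row(chrom, run_start, prev))
                run_start = b
            prev = b
        merged.append(row(chrom, run_start, prev))
    return merged
```

Note that two segments that do not overlap but fall in the same or adjacent bins are also merged under this scheme, which is another face of the resolution loss.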
CNVs -- from normals
• Disease-specific [current] or pooled?
• Disregard chrX, chrY
• Filter normal samples
  • nsegs <= 1000
  • -0.10 <= mean log2ratio <= 0.10
• A segment counts as diploid if -0.20 <= log2ratio <= 0.20
• Tally non-diploid bins, bin size = 1000 bp
• Create CNV segments for contiguous bins where tally >= 5% of samples
• Output format like CBS output:
  • (“normals”, chr, seg-start, seg-end, “1”, “2.0”)
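A rough Python sketch of the normals-based procedure is given below. It assumes each sample is a list of (chrom, start, end, log2ratio) CBS segments with 0-based inclusive coordinates, that "mean log2ratio" is the unweighted mean over segments, and that each sample contributes at most one count per bin; none of these details are confirmed by the slides, and the function names are invented.

```python
from collections import Counter

SEX_CHROMS = {"chrX", "chrY"}


def is_usable_normal(segments, max_segs=1000, max_abs_mean=0.10):
    """Sample-level filter: not over-segmented and centered near zero."""
    if not segments or len(segments) > max_segs:
        return False
    mean = sum(seg[3] for seg in segments) / len(segments)
    return -max_abs_mean <= mean <= max_abs_mean


def tally_non_diploid(samples, bin_size=1000, diploid_abs=0.20):
    """Count, per (chrom, bin), how many samples are non-diploid there."""
    tally = Counter()
    for segments in samples:
        seen = set()
        for chrom, start, end, log2ratio in segments:
            if chrom in SEX_CHROMS or -diploid_abs <= log2ratio <= diploid_abs:
                continue
            seen.update((chrom, b) for b in range(start // bin_size, end // bin_size + 1))
        tally.update(seen)
    return tally


def cnv_segments(tally, n_samples, min_fraction=0.05, bin_size=1000):
    """Join contiguous bins whose tally reaches min_fraction of the samples."""
    hits = sorted(key for key, count in tally.items() if count >= min_fraction * n_samples)
    segments, run = [], None  # run = [chrom, first_bin, last_bin]
    for chrom, b in hits:
        if run and run[0] == chrom and b == run[2] + 1:
            run[2] = b
            continue
        if run:
            segments.append(("normals", run[0], run[1] * bin_size, (run[2] + 1) * bin_size - 1, "1", "2.0"))
        run = [chrom, b, b]
    if run:
        segments.append(("normals", run[0], run[1] * bin_size, (run[2] + 1) * bin_size - 1, "1", "2.0"))
    return segments
```

In use, samples would first be filtered with `is_usable_normal`, then passed to `tally_non_diploid`, and the tally handed to `cnv_segments` together with the number of samples that survived the filter.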
Gene-level copy number values
• SNP_6_data/snp6.csh
  • CBSSeg2Gene.pl
  • ComputePairedGeneValues.pl
• Give up on chrX, chrY
• For each gene, choose the most extreme overlapping CBS segment value
  • MIN_OVERLAP currently set to 1 bp
• Filter out short CBS segments (likely to be artifact/CNV)
  • MIN_SEG currently set to 200 bp
• Binning only to speed up overlap detection; no loss of resolution
• Make gene-level calls separately for tumor and matched normal, then subtract
• Output
  • (aliquot, gene, chr, start, stop, log2ratio, capped log2ratio [-2.0..2.0], PAIRED/UNPAIRED, CNV/NOCNV)
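The per-gene selection and the tumor-minus-normal step can be sketched in Python as below. This is not a port of CBSSeg2Gene.pl or ComputePairedGeneValues.pl: it assumes "extreme" means largest absolute log2ratio, uses inclusive coordinates, does a plain linear scan instead of the binning speed-up, and treats a missing matched-normal value as the UNPAIRED case.

```python
MIN_OVERLAP = 1   # bp, as on the slide
MIN_SEG = 200     # bp, as on the slide
CAP = 2.0         # capped log2ratio range is [-2.0, 2.0]


def gene_value(gene_start, gene_end, segments):
    """Most extreme log2ratio among CBS segments overlapping the gene.

    segments: list of (start, end, log2ratio) on the gene's chromosome.
    Returns None when no sufficiently long segment overlaps the gene.
    """
    best = None
    for seg_start, seg_end, log2ratio in segments:
        if seg_end - seg_start + 1 < MIN_SEG:
            continue  # short segments are treated as likely artifact/CNV
        overlap = min(gene_end, seg_end) - max(gene_start, seg_start) + 1
        if overlap < MIN_OVERLAP:
            continue
        if best is None or abs(log2ratio) > abs(best):
            best = log2ratio
    return best


def cap(value):
    """Clamp a log2ratio into [-CAP, CAP]."""
    return max(-CAP, min(CAP, value))


def paired_value(tumor_value, normal_value):
    """Subtract the matched-normal gene value when one is available."""
    if normal_value is None:
        return tumor_value, "UNPAIRED"
    return tumor_value - normal_value, "PAIRED"
```

For instance, a gene spanning 4500 to 4800 that overlaps segments with values 0.4 and -1.2 takes the value -1.2, and subtracting a matched-normal value of 0.1 gives a paired value of about -1.3.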
miRNA_meta_data
• Genomic positions from mirbase.org
  • but now miRNA locations (with non-standard names) are also in refFlat.txt
• Targets from targetscan.org
  • 179,129 miRNA/target associations
  • 394 miRNA
  • 9,432 targeted mRNA
• Caution: targetscan from UCSC is very reduced
  • 46,841 miRNA/target associations
  • 162 miRNA
  • 7,981 targeted mRNA
mRNA_meta_data
• Just the mechanics of updating gene symbols
• Pulls official symbols from refFlat.txt (to be in sync with the CN data)
• Pulls aliases for official symbols from col 5 of Entrez Gene flat file gene_info
• Maps unofficial symbols in UNC Agilent Level 3 data to aliases
• Creates 2-col file replace_syms.txt for use by FilterMapColumn.pl
• Presumably all this will be unnecessary when RNASeq submissions start using GAF
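The symbol-update mechanics might look roughly like the Python sketch below. It relies on the gene_info layout in which column 3 is the symbol and column 5 holds pipe-separated synonyms ("-" when there are none); everything else is an assumption, in particular the function names, the choice to drop aliases that point at more than one official symbol, and the (old symbol, new symbol) column order of the output rows. The gene names in the usage note are invented.

```python
def build_alias_map(gene_info_lines, official_symbols):
    """Map each unambiguous alias to its official symbol.

    Only rows whose symbol is in official_symbols (e.g. taken from
    refFlat.txt) are used; aliases claimed by two different official
    symbols are dropped (an assumption about how to handle ambiguity).
    """
    alias_map = {}
    ambiguous = set()
    for line in gene_info_lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        symbol, synonyms = fields[2], fields[4]
        if symbol not in official_symbols or synonyms == "-":
            continue
        for alias in synonyms.split("|"):
            if alias in official_symbols:
                continue  # never rewrite something that is already official
            if alias_map.setdefault(alias, symbol) != symbol:
                ambiguous.add(alias)
    return {alias: sym for alias, sym in alias_map.items() if alias not in ambiguous}


def replacement_rows(data_symbols, official_symbols, alias_map):
    """Rows for a two-column replacement file: (symbol in data, official symbol)."""
    rows = []
    for symbol in data_symbols:
        if symbol not in official_symbols and symbol in alias_map:
            rows.append((symbol, alias_map[symbol]))
    return rows
```

With toy rows where GENEA lists OLDA as a synonym, a data file containing OLDA would yield the row (OLDA, GENEA), while symbols that are already official or have no known alias are left alone.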