
Laura Clarke Vertebrate Genomics laura@ebi.ac.uk

The 1000 Genomes project, Data Availability and Accessibility. L Clarke, H Zheng Bradley, R Smith, I Streeter, E Kulesha, B Vaughan, P Flicek and The 1000 Genomes Project. European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK.


Presentation Transcript


  1. The 1000 Genomes project, Data Availability and Accessibility

L Clarke, H Zheng Bradley, R Smith, I Streeter, E Kulesha, B Vaughan, P Flicek and The 1000 Genomes Project. European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK.

The 1000 Genomes data sets represent the largest public variation data resources available to the community. Providing coherent and useful resources based on the project data continues to be a key goal for the project Data Coordination Center (DCC). We present here a selection of the tools built on top of the 1000 Genomes data to make it as useful as possible to the wider community. A copy of this poster is available at http://www.1000genomes.org/biology-genomes-2012-poster. For more information about the work at the DCC, please see Clarke L, et al. The 1000 Genomes Project: data management and community access. Nat Methods 9, 459-462 (2012).

1000 Genomes in AWS

All the 1000 Genomes Phase 1 and Phase 2 data are now available in the Amazon Web Services (AWS) Cloud as a public data set. We have also provided an Amazon Machine Image (AMI) to allow users to run our tools inside AWS. Access to the 1000 Genomes S3 bucket is also preconfigured in CloudBioLinux. There is more information about all of this at http://www.1000genomes.org/using-1000-genomes-data-amazon-web-service-cloud

Finding Data

With more than 250,000 files and 275 Tbytes of data, finding information on the 1000 Genomes FTP site can prove challenging. The DCC provides some tools to assist with this.

At the root of our FTP site (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp | ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/), we present an index file, current.tree, which lists all the files and directories the FTP site contains. This is updated nightly. The file is a 5-column, tab-delimited text file listing the following items for each file or directory:

• Relative file path (from the root of the FTP site)
• Type (file or directory)
• Size (bytes)
• Timestamp (time file was last updated)
• MD5 checksum

We also provide an easy way to search this file on the website (http://www.1000genomes.org/ftpsearch).

Improved Variation Views

Our browser has been updated to Ensembl version 65, which includes improved variation views. The icons give rapid access to different sections of the variation view. These include the population genotype views, which provide pie charts for the 1000 Genomes population genotypes. The Gene Variation tables now also contain the minor allele for each variant and its frequency and, as before, these tables are available in CSV format.

Tools

As part of our browser, we present several tools to aid access and analysis of the 1000 Genomes data sets (http://browser.1000genomes.org/tools.html). These tools include the Ensembl Variant Effect Predictor, the Data Slicer, the Variation Pattern Finder and the VCF to PED converter.

• The Variant Effect Predictor can provide functional annotation of SNVs and indels. This can include SIFT and PolyPhen consequences for non-synonymous variants and overlap with high-information parts of transcription factor binding sites.
• The Data Slicer allows users to get particular genomic subsections of both VCF and BAM files.
• The Variation Pattern Finder allows easy discovery of shared inheritance patterns.
• The VCF to PED converter can turn our VCF files into the PED and locus information files required by LD visualization tools like Haploview, allowing users to explore the LD and haplotype structure of our data.

Announcements and Help

It is also now easier than ever to find out about new releases of data from the project. We have created both RSS and Twitter feeds of our website announcements, and you can now subscribe to announcement emails from 1000Announce@1000genomes.org.

• http://twitter.com/1000genomes
• http://www.1000genomes.org/announcements/rss.xml

We also have a tutorial and an FAQ:

• http://www.1000genomes.org/using-1000-genomes-data
• http://www.1000genomes.org/faq

Acknowledgements: We would like to thank the Ensembl Variation Team, Don Preuss and Christopher Cope at the NCBI, and our funder, the Wellcome Trust.
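As an illustration of the current.tree index described under Finding Data, the following Python sketch reads a file in that 5-column, tab-delimited layout and filters it locally. The column order follows the list given in the poster; everything else is an assumption made for illustration: the names TreeEntry, parse_tree and find_files are hypothetical, the type values are assumed to be spelled "file" and "directory", the MD5 column is assumed to be possibly empty for directories, and the sample rows are invented rather than copied from the real index.

```python
import io
from typing import Iterable, Iterator, List, NamedTuple


class TreeEntry(NamedTuple):
    path: str       # relative to the root of the FTP site
    kind: str       # assumed to be "file" or "directory"
    size: int       # bytes
    timestamp: str  # kept as text; no date format is assumed
    md5: str        # may be empty (assumption, e.g. for directories)


def parse_tree(handle: Iterable[str]) -> Iterator[TreeEntry]:
    """Yield one TreeEntry per usable line of a current.tree-style file."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        fields += [""] * (5 - len(fields))  # pad a missing MD5 column
        path, kind, size, timestamp, md5 = fields[:5]
        yield TreeEntry(path, kind, int(size), timestamp, md5)


def find_files(entries: Iterable[TreeEntry], pattern: str) -> List[TreeEntry]:
    """Return file entries (not directories) whose path contains pattern."""
    return [e for e in entries if e.kind == "file" and pattern in e.path]


# Invented sample rows in the assumed layout; not real index content.
SAMPLE = (
    "ftp/data\tdirectory\t4096\t2012-03-06 10:00:00\t\n"
    "ftp/data/sampleA/sampleA.example.bam\tfile\t1000\t2012-03-06 10:00:00\t"
    "d41d8cd98f00b204e9800998ecf8427e\n"
    "ftp/release/example.genotypes.vcf.gz\tfile\t2500\t2012-03-07 09:30:00\t"
    "0cc175b9c0f1b6a831c399e269772661\n"
)

entries = list(parse_tree(io.StringIO(SAMPLE)))
vcf_files = find_files(entries, ".vcf.gz")
total_bytes = sum(e.size for e in entries if e.kind == "file")
```

To use this on the real index, one would download current.tree from the FTP root and pass an open text file handle to parse_tree in place of the in-memory sample. The MD5 column can then be compared against a checksum computed locally (for example with Python's hashlib) to confirm a download completed correctly.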
