Bioinformatics tools and techniques: Into the heart of darkness • Elaine Kenny, Colm O’Dushlaine • 15/11/07
Summary • Simple overviews of some of the tools and methods used by EK and CO’D • TK notebook • get_hapmap_snps.pl: retrieve HapMap (HM) genotype information for a list of SNPs • GeneViewer.pl & cross_ref.pl: visualise e.g. SNPs in the context of other genomic landmarks, and score SNPs by how many of these landmarks they overlap • ld_expander.pl: find SNPs in LD with SNPs of interest, based on a user-specified r² and “LD window” (distance between SNPs) • STATA • VIM: command-line text editor • Lab website
TK notebook • Application for saving notes, to-do lists, daily logs, and any other kind of textual information in a place where you can find it all again, and where related information is easily found • Easy to edit and rapidly searchable • DEMO – editing • DEMO – search
get_hapmap_snps.pl • Simple script to read in a 1-column list of SNPs and retrieve HapMap genotypes • Can select population and strand • DEMO • Retrieved data can be loaded into HaploView • DEMO
cross_ref_scored.pl • Scores SNPs based on how many putatively functional regions they overlap with • Works on a per-gene or per-chromosome basis • Gene basis, type: perl cross_ref_scored.pl file_A file_B file_C ... • file_A: 2-column file of SNPs (format = id, location) • file_B: 3-column file of EXONS (format = id/name, start, stop) • file_C ...: whatever you want (format = id/name, start, stop), i.e. other regions like CpGs, TFBS, clusters. Any order. …
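The scoring idea on this slide can be sketched in Python. This is not the Perl script itself, only a minimal illustration under stated assumptions: coordinates are treated as inclusive, each landmark file counts at most once per SNP, and the names `overlaps_any` and `score_snps` and all data values are hypothetical.

```python
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (id/name, start, stop); inclusive coordinates (assumption)


def overlaps_any(position: int, regions: List[Region]) -> bool:
    """Return True if the position falls inside at least one region."""
    return any(start <= position <= stop for _, start, stop in regions)


def score_snps(snps: List[Tuple[str, int]],
               region_sets: Dict[str, List[Region]]) -> Dict[str, int]:
    """Score each SNP by the number of region sets (landmark files) it overlaps."""
    return {
        snp_id: sum(overlaps_any(pos, regions) for regions in region_sets.values())
        for snp_id, pos in snps
    }


# Made-up illustration data, not real annotations.
snps = [("rs1", 150), ("rs2", 900), ("rs3", 5000)]
region_sets = {
    "exons": [("exon1", 100, 200), ("exon2", 850, 950)],
    "cpg": [("cpg1", 120, 180)],
}
scores = score_snps(snps, region_sets)
```

A linear scan per SNP is fine for a single gene; for genome-wide data a sorted or indexed interval lookup would be the more usual choice.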
cross_ref_scored.pl example output • Output can then be merged with HapMap / Perlegen data to retrieve MAF data for SNPs
Merge cross_ref_scored data with HapMap / Perlegen data using merge_per_hap.pl • Type: perl merge_per_hap.pl perlegen.txt hapmap.txt overlapped_region_scored.txt • Where: hapmap.txt = 3-column file (format: rsid, ref_allele, ref_allele_freq), perlegen.txt = 3-column file (format: rsid, ref_allele, ref_allele_freq)
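As a rough illustration of what such a merge involves, here is a Python sketch keyed on rsid. It is not merge_per_hap.pl: the function names and data are hypothetical, the frequency files are assumed to be already parsed into dictionaries, and MAF is derived as min(f, 1 - f) from the reference allele frequency, which holds for biallelic SNPs.

```python
from typing import Dict, List, Optional, Tuple

Freq = Tuple[str, float]  # (ref_allele, ref_allele_freq), as in the 3-column files


def maf(ref_allele_freq: float) -> float:
    """Minor allele frequency of a biallelic SNP from its reference allele frequency."""
    return min(ref_allele_freq, 1.0 - ref_allele_freq)


def lookup_maf(table: Dict[str, Freq], rsid: str) -> Optional[float]:
    """Return the MAF for rsid, or None if the SNP is absent from the table."""
    entry = table.get(rsid)
    return None if entry is None else maf(entry[1])


def merge_scores(scores: Dict[str, int],
                 hapmap: Dict[str, Freq],
                 perlegen: Dict[str, Freq]) -> List[dict]:
    """Attach HapMap and Perlegen MAFs to every scored SNP."""
    return [
        {
            "rsid": rsid,
            "score": score,
            "hapmap_maf": lookup_maf(hapmap, rsid),
            "perlegen_maf": lookup_maf(perlegen, rsid),
        }
        for rsid, score in scores.items()
    ]


# Made-up values for illustration only.
scores = {"rs1": 2, "rs2": 1}
hapmap = {"rs1": ("A", 0.75)}
perlegen = {"rs1": ("A", 0.875), "rs2": ("G", 0.25)}
merged = merge_scores(scores, hapmap, perlegen)
```

SNPs missing from one source keep a `None` in that column rather than being dropped, so nothing scored is silently lost.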
cross_ref.pl applied to WGA data
• cross_ref.pl: scoring SNPs throughout the genome
• Data analysed on a coding / non-coding basis
• Coding:
perl cross_ref.pl Overlapped_regions_scored.WTCCC.chr22.coding.txt 22 WTCCC_T2D_chr22_without_inferred.forCrossRef WGA_databases/coding_non_synon_SNPs_UCSC.clean=3 WGA_databases/coding_synon_SNPs_UCSC.clean=2 WGA_databases/RefSeq_Genes_UCSC.byExon.uniqid=1 WGA_databases/Triplexes_may2006.bed=2 WGA_databases/splice_site_SNPs_UCSC.clean=2 > Overlapped_regions_scored.WTCCC.chr22.coding.log &
(input-dependent, coding/non-coding dependent, arbitrary)
• Noncoding:
perl cross_ref.pl Overlapped_regions_scored.WTCCC.chr22.NONcoding.txt 22 WTCCC_T2D_chr22_without_inferred.forCrossRef WGA_databases/TFBS.chr22=1 WGA_databases/CpG_islands_UCSC.uniqid=1 WGA_databases/Most_conserved_phastConsElements17way_UCSC.clean=1 WGA_databases/promoters_knowngene_hg18.txt=1 WGA_databases/sno_or_miRNA_UCSC.uniqid=1 > Overlapped_regions_scored.WTCCC.chr22.NONcoding.log &
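The landmark files in these commands are passed as `path=number`. One plausible reading, which the slide does not spell out, is that the number is a per-file score added to a SNP when it overlaps a region in that file. The Python sketch below illustrates that reading only; `parse_weighted_arg` and `weighted_score` are hypothetical names and the intervals are made up.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, stop); inclusive coordinates (assumption)


def parse_weighted_arg(arg: str) -> Tuple[str, int]:
    """Split 'path=weight' on the last '=' into (path, weight)."""
    path, sep, weight = arg.rpartition("=")
    if not sep or not path:
        raise ValueError(f"expected path=weight, got {arg!r}")
    return path, int(weight)


def weighted_score(position: int,
                   weighted_sets: List[Tuple[int, List[Interval]]]) -> int:
    """Sum the weights of every landmark set with a region covering the position."""
    return sum(
        weight
        for weight, intervals in weighted_sets
        if any(start <= position <= stop for start, stop in intervals)
    )


path, weight = parse_weighted_arg("WGA_databases/TFBS.chr22=1")

# Made-up intervals: a weight-3 set and a weight-1 set.
weighted_sets = [(3, [(100, 200)]), (1, [(150, 400)])]
both = weighted_score(160, weighted_sets)   # inside both sets
one = weighted_score(300, weighted_sets)    # inside the weight-1 set only
none = weighted_score(999, weighted_sets)   # outside every set
```

Splitting on the last `=` keeps paths that themselves contain `=` intact.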
cross_ref.pl output • Load into STATA • If SNPs have e.g. association p-values, calculate an adjusted p-value (R. Anney) as -log10[P] + [cross_ref_score]
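The adjustment formula is simple enough to check by hand. A minimal Python sketch follows (the slide does this step in STATA; the function name `adjusted_score` and the example values are hypothetical):

```python
import math


def adjusted_score(p_value: float, cross_ref_score: float) -> float:
    """Compute -log10(P) + cross_ref_score, as given on the slide."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in the interval (0, 1]")
    return -math.log10(p_value) + cross_ref_score


# A SNP with P = 0.001 overlapping landmarks worth a score of 2.
example = adjusted_score(0.001, 2)
```

Note that the result is on the -log10 scale, so larger values mean stronger combined evidence, and a SNP with no landmark overlap simply keeps its -log10(P).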
GeneViewer.pl • GeneViewer.pl: Visualise overlapping features (e.g. exons, SNPs etc.) along e.g. your gene of interest (html output)
ld_expander.pl • Find proxies (SNPs in LD) for a list of SNPs • User specifies the r² and “LD window” • Currently configured to obtain proxies from HapMap (HM) CEU • Result is a list of additional proxy SNPs obtained by LD expansion • DEMO • Note: don’t LD expand >150000 SNPs, or HapMap will ban you! CO’D has an alternative version that uses local pre-computed pairwise LD SNP files
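LD expansion against a local pairwise LD table can be sketched as a filter on r² and distance. The Python below is an illustration only, not ld_expander.pl or CO’D's local version: the record layout `(snp_a, pos_a, snp_b, pos_b, r2)`, the function name `ld_expand`, and all values are assumptions.

```python
from typing import Iterable, List, Tuple

LDPair = Tuple[str, int, str, int, float]  # (snp_a, pos_a, snp_b, pos_b, r2); hypothetical layout


def ld_expand(query_snps: Iterable[str],
              pairwise_ld: Iterable[LDPair],
              min_r2: float,
              ld_window: int) -> List[str]:
    """Return proxy SNPs with r2 >= min_r2 that lie within ld_window bp of a query SNP."""
    queries = set(query_snps)
    proxies = set()
    for snp_a, pos_a, snp_b, pos_b, r2 in pairwise_ld:
        if r2 < min_r2 or abs(pos_a - pos_b) > ld_window:
            continue
        # A pair can list the query SNP on either side.
        if snp_a in queries and snp_b not in queries:
            proxies.add(snp_b)
        if snp_b in queries and snp_a not in queries:
            proxies.add(snp_a)
    return sorted(proxies)


# Made-up pairwise LD records.
pairs = [
    ("rs1", 100, "rs2", 600, 0.9),     # close and in strong LD: kept
    ("rs1", 100, "rs3", 90000, 0.95),  # outside the LD window: dropped
    ("rs4", 700, "rs1", 100, 0.5),     # r2 too low: dropped
    ("rs5", 1000, "rs6", 1500, 1.0),   # neither SNP was queried: dropped
]
proxies = ld_expand(["rs1"], pairs, min_r2=0.8, ld_window=50000)
```

Working from local files in this way avoids sending large numbers of queries to a remote server.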
STATA • Extremely powerful and flexible • Handles >65k rows (shock horror!) • Can write scripts to automate tasks, e.g. read in a file, do an analysis, save the results • When you run commands through the GUI, they are shown in the command window, so you can save them in a do-file • CO’D, EK and R. Anney strongly advocate this as a platform for both file manipulation and statistical analysis
STATA example using WTCCC data • http://www.wtccc.org.uk/ • Bipolar Disorder, Coronary Artery Disease, Crohn's Disease, Hypertension, Rheumatoid Arthritis, Type 1 Diabetes, Type 2 Diabetes
DATA FORMAT • 3 folders: • Basic: each case collection against the pooled control groups 58C and UKBS • Combined cases: combining phenotypically relevant case collections (e.g. RA/T1D, autoimmune) • Combined controls: combining other case collections as controls • Data are split by chromosome
Questions • How do I get all of the chromosome data for my gene of interest into one file? • How do I easily search all of the SNP information for my gene(s) of interest? • Create a “.do” file for all the manipulations that you want to carry out on the data • DEMO • Good starting resource: http://www.ats.ucla.edu/stat/stata/
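The talk answers these questions with a STATA do-file. For readers who want to see the underlying logic, the same idea is sketched here in Python: scan the per-chromosome tables and keep only rows that fall inside the gene's coordinates. The column names (`snp_id`, `chromosome`, `position`, `p_value`), the function name `snps_in_region`, and the data are all hypothetical and do not reflect the actual WTCCC file layout.

```python
import csv
import io
from typing import Iterable, List, TextIO


def snps_in_region(tables: Iterable[TextIO], chromosome: str,
                   start: int, stop: int) -> List[dict]:
    """Collect rows from tab-separated tables that fall inside chromosome:start-stop."""
    hits = []
    for handle in tables:
        for row in csv.DictReader(handle, delimiter="\t"):
            if row["chromosome"] == chromosome and start <= int(row["position"]) <= stop:
                hits.append(row)
    return hits


# In-memory stand-ins for two per-chromosome files (made-up values).
header = "snp_id\tchromosome\tposition\tp_value\n"
chr21 = io.StringIO(header + "rsA\t21\t1500\t0.2\n")
chr22 = io.StringIO(header + "rsB\t22\t1500\t0.01\nrsC\t22\t9000\t0.5\n")

hits = snps_in_region([chr21, chr22], chromosome="22", start=1000, stop=2000)
hit_ids = [row["snp_id"] for row in hits]
```

With real data the `io.StringIO` objects would be replaced by opened files, and the hits could be written to a single output file with `csv.DictWriter`.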
VIM • “Vi Improved”. Cross-platform text editor, mainly used on UNIX but also available for Windows • Full list of commands is outside the scope of this demonstration • Very fast and efficient, esp. with search-and-replace functions on large datasets • Regular expression pattern matching • DEMO • Integrates with Cygwin (www.cygwin.com), a very useful UNIX-like environment for Windows
Group website • Some useful stuff up there! • Please send information about current projects etc. Good for our image as a group and minimal effort required on your part • DEMO
Conclusions • Small summary of some things you can do • Slides and video demonstrations will be online at: http://www.medicine.tcd.ie/psychiatry/research/neuropsychiatry/Protocols/ • CO’D & EK available for advice (Fridays, 9-9.02am) • These things will help you in your work!!