(Human) Genomics BIOM/PHAR206 – 05/19/2014. Olivier Harismendy, PhD Division of Genome Information Sciences Department of Pediatrics Moores UCSD Cancer Center. UCSC Genome Browser. isPCR BLAT LiftOver Track types BED minimum BED extended WIG Track Display and Shuffle
(Human) Genomics BIOM/PHAR206 – 05/19/2014 • Olivier Harismendy, PhD • Division of Genome Information Sciences • Department of Pediatrics • Moores UCSD Cancer Center
UCSC Genome Browser • isPCR • BLAT • LiftOver • Track types • BED minimum • BED extended • WIG • Track Display and Shuffle • Browser Navigation • Custom Session – Export Figure • Custom Tracks
0-based coordinates
Sequence  A|C|C|G|G|T|C|G|A
1-based   1 2 3 4 5 6 7 8 9
0-based   0 1 2 3 4 5 6 7 8
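The two numbering schemes can be converted into one another mechanically. The sketch below assumes the usual pairing of conventions: 1-based intervals are fully closed (both ends included) and 0-based intervals are half-open (end excluded), as in BED. The function names are hypothetical and chosen for illustration only.

```python
def one_based_to_zero_based(start, end):
    """Convert a 1-based, fully closed interval to 0-based, half-open."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the closed end equals the half-open end.
    return start - 1, end


def zero_based_to_one_based(start, end):
    """Convert a 0-based, half-open interval to 1-based, fully closed."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


sequence = "ACCGGTCGA"  # the sequence from the slide

# Bases 2 to 4 in 1-based numbering are C, C, G.
start0, end0 = one_based_to_zero_based(2, 4)

# Python slices are 0-based and half-open, so they can be used directly.
subseq = sequence[start0:end0]
print(start0, end0, subseq)
```

With half-open intervals the length of a feature is simply end minus start, which is one reason the convention is convenient for interval arithmetic.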
BED Track Formats
track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"
chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
chr7 127473530 127474697 Pos3 0 + 127473530 127474697 255,0,0
chr7 127474697 127475864 Pos4 0 + 127474697 127475864 255,0,0
chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255
chr7 127477031 127478198 Neg2 0 - 127477031 127478198 0,0,255
chr7 127478198 127479365 Neg3 0 - 127478198 127479365 0,0,255
chr7 127479365 127480532 Pos5 0 + 127479365 127480532 255,0,0
chr7 127480532 127481699 Neg4 0 - 127480532 127481699 0,0,255
BED Track Formats
Header: space-separated parameters
• name=<track_label>
• description=<center_label>
• type=<track_type> - Defines the track type. The track type attribute is required for BAM, BED detail, bedGraph, bigBed, bigWig, broadPeak, narrowPeak, Microarray, VCF and WIG tracks.
• visibility=<display_mode> - 0 - hide, 1 - dense, 2 - full, 3 - pack, and 4 - squish.
• color=<RRR,GGG,BBB> - Defines the main color for the annotation track.
• itemRgb=On
• colorByStrand=<RRR,GGG,BBB RRR,GGG,BBB> - Sets colors for + and - strands, in that order.
• useScore=<use_score>
• group=<group>
• priority=<priority> - When the group attribute is set, defines the display position of the track relative to other tracks.
• db=<UCSC_assembly_name> - When set, indicates the specific genome assembly for which the annotation data is intended.
• offset=<offset> - Defines a number to be added to all coordinates in the annotation track. The default is "0".
• maxItems=<#> - Defines the maximum number of items the track can contain.
• url=<external_url> - Defines a URL for an external link associated with this track.
• htmlUrl=<external_url> - Defines a URL for an HTML description page to be displayed with this track.
• bigDataUrl=<external_url> - Defines a URL to the data file for BAM, bigBed, bigWig or VCF tracks.
BED Track Formats
• For intervals
• Header: space-separated configuration parameters
• chrom - The name of the chromosome.
• chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
• chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature.
• name - Defines the name of the BED line.
• score - A score between 0 and 1000.
• strand - Defines the strand, either '+' or '-'.
• thickStart - The starting position at which the feature is drawn thickly.
• thickEnd - The ending position at which the feature is drawn thickly.
• itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0).
• blockCount - The number of blocks (exons) in the BED line.
• blockSizes - A comma-separated list of the block sizes.
• blockStarts - A comma-separated list of block starts.
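As an illustration of the field order described above, here is a minimal sketch of a BED line parser covering the three mandatory fields and the first three optional ones. `BedRecord` and `parse_bed_line` are hypothetical names, not part of any UCSC tool, and the sketch assumes header lines start with "track", "browser" or "#".

```python
from typing import NamedTuple, Optional


class BedRecord(NamedTuple):
    chrom: str
    chrom_start: int  # 0-based, included
    chrom_end: int    # not included in the feature
    name: Optional[str] = None
    score: Optional[int] = None
    strand: Optional[str] = None

    @property
    def length(self) -> int:
        # Half-open coordinates: length is end minus start.
        return self.chrom_end - self.chrom_start


def parse_bed_line(line: str) -> Optional[BedRecord]:
    """Return a BedRecord, or None for blank, header and comment lines."""
    line = line.strip()
    if not line or line.startswith(("track", "browser", "#")):
        return None
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, chromStart, chromEnd")
    name = fields[3] if len(fields) > 3 else None
    score = int(fields[4]) if len(fields) > 4 else None
    strand = fields[5] if len(fields) > 5 else None
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), name, score, strand)


# First data line of the ItemRGBDemo track.
record = parse_bed_line("chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0")
print(record.chrom, record.name, record.strand, record.length)
```

The remaining columns (thickStart through blockStarts) are ignored here to keep the sketch short; they could be added to the record in the same way.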
WIG track format
# 150 base wide bar graph at arbitrarily spaced positions,
# threshold line drawn at y=11.76
# autoScale off viewing range set to [0:25]
# priority = 10 positions this as the first graph
# Note, one-relative coordinate system in use for this format
track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
49304701 10.0
49304901 12.5
49305401 15.0
49305601 17.5
49305901 20.0
49306081 17.5
49306301 15.0
49306691 12.5
49307871 10.0
# 200 base wide points graph at every 300 bases, 50 pixel high graph
# autoScale off and viewing range set to [0:1000]
# priority = 20 positions this as the second graph
# Note, one-relative coordinate system in use for this format
track type=wiggle_0 name="fixedStep" description="fixedStep format" visibility=full autoScale=off viewLimits=0:1000 color=0,200,100 maxHeightPixels=100:50:20 graphType=points priority=20
fixedStep chrom=chr19 start=49307401 step=300 span=200
1000
900
800
700
600
500
400
300
200
100
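Because WIG positions are one-relative while BED is 0-based, mixing the two formats requires a shift of the start coordinate. The following sketch expands a fixedStep declaration into 0-based, half-open intervals; `fixed_step_to_intervals` is a hypothetical helper written for this example, and it assumes the declaration line is well formed.

```python
def fixed_step_to_intervals(declaration, values):
    """Expand a WIG fixedStep block into (chrom, start0, end0, value) tuples.

    The returned coordinates are 0-based and half-open (BED style).
    """
    parts = declaration.split()
    if not parts or parts[0] != "fixedStep":
        raise ValueError("not a fixedStep declaration")
    params = dict(p.split("=", 1) for p in parts[1:])
    chrom = params["chrom"]
    start = int(params["start"])      # 1-based position of the first value
    step = int(params["step"])        # distance between consecutive values
    span = int(params.get("span", 1)) # bases covered by each value

    intervals = []
    for i, value in enumerate(values):
        pos1 = start + i * step       # 1-based start of this data point
        start0 = pos1 - 1             # shift to 0-based
        intervals.append((chrom, start0, start0 + span, float(value)))
    return intervals


# First three values of the fixedStep example track.
intervals = fixed_step_to_intervals(
    "fixedStep chrom=chr19 start=49307401 step=300 span=200",
    [1000, 900, 800],
)
for interval in intervals:
    print(*interval)
```

A variableStep block could be handled the same way, reading the position from each data line instead of computing it from start and step.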
Specific Tracks of interest • UCSC genes • RefSeq Genes • RepeatMasker • Conservation • TF motif predictions • dbSNP • ENCODE • Roadmap
Custom Sessions • Create an account • Customize the tracks displayed • Add your own track (limited in size and time) • Save and Share
Table Browser • Subset gene, region, genome • Output BED or fasta • Intersection • Filters
ENCODE / Roadmap Tracks • Track search • Cell Types / Tissue Types • Raw • Peaks • HMM
UNIX commands
• head
• more (press q to exit)
• cat
  • Example: cat file
  • Example: cat file1 file2
• grep
  • grep -v 'expression' (print lines that do not match)
  • grep -A 1 'expression' (also print 1 line after each match)
  • grep -B 2 'expression' (also print 2 lines before each match)
  • Example: grep -v '^#' file.txt to remove comment lines
• Expression metacharacters
  • $ end of line
  • ^ beginning of line
  • [AB] A or B
  • . any single character
  • * zero or more of the preceding character
  • Example: 'CDKN.*' or 'chr[1-7]'
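The behaviour of the anchors, character classes and repetition operators is easy to check interactively. The sketch below uses Python's `re` module, which treats `^`, `$`, `[...]`, `.` and `*` the same way grep does for these simple patterns; the gene and chromosome names are made-up inputs for illustration.

```python
import re

chromosomes = ["chr1", "chr7", "chr8", "chr12"]
genes = ["CDKN2A", "CDKN1B", "CDK4", "TP53"]
lines = ["# header comment", "chr1\tCDKN2A", "chr7\tTP53 # inline note"]

# Unanchored, 'chr[1-7]' also matches chr12 because it contains 'chr1'.
loose = [c for c in chromosomes if re.search(r"chr[1-7]", c)]

# '^' and '$' anchor the pattern to the beginning and end of the line.
exact = [c for c in chromosomes if re.search(r"^chr[1-7]$", c)]

# '*' repeats the preceding character zero or more times,
# so 'CDKN*' means 'CDK' followed by any number of 'N' and matches CDK4.
star = [g for g in genes if re.search(r"CDKN*", g)]

# '.' is any single character; '.*' is therefore 'anything'.
dot_star = [g for g in genes if re.search(r"^CDKN.*", g)]

# Dropping comment lines: anchor '#' so inline '#' characters are kept.
no_comments = [l for l in lines if not re.search(r"^#", l)]

print(loose, exact, star, dot_star, len(no_comments))
```

The same patterns can be passed to grep unchanged, for instance `grep '^chr[1-7]$' file`.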
UNIX commands
• cut
  • cut -f 1
  • cut -f 3 -d ':'
• sort
  • sort -n
  • sort -nr (or sort -n -r)
  • sort -k 2
• uniq
  • uniq
  • uniq -c
• wc
  • wc -l file.txt
• Example: cut -f 1 file | sort | uniq -c
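To make explicit what the `cut -f 1 file | sort | uniq -c` pipeline computes, here is a rough Python equivalent. It is a sketch only: `count_first_column` is a hypothetical name, the input lines are made up, and Python sorts by code point whereas `sort` follows the current locale, so the order may differ for some inputs.

```python
from collections import Counter


def count_first_column(lines, delimiter="\t"):
    """Count how often each value occurs in the first column.

    Mirrors: cut -f 1 | sort | uniq -c
    """
    counts = Counter(
        line.rstrip("\n").split(delimiter)[0]  # cut -f 1
        for line in lines
        if line.strip()                        # skip blank lines
    )
    return sorted(counts.items())              # sorted values with counts


example_lines = [
    "chr1\t100\t200\n",
    "chr2\t5\t50\n",
    "chr1\t300\t400\n",
]
result = count_first_column(example_lines)
print(result)
```

`uniq` only collapses adjacent duplicates, which is why the shell pipeline needs `sort` first; the `Counter` has no such requirement.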
UNIX commands
• sed
  • sed 's/foo/bar/g' file: find and replace
• awk
  • awk '$3>2000' file: select rows with 3rd field > 2000
  • awk '{if ($3>2000) print $1,$2}' file: same selection, but only print the first 2 columns
  • awk '{sum+=$3} END {print sum}' file: print the sum of column 3
  • awk '{sum+=$3} END {print sum/NR}' file: print the average of column 3
• join
  • join -j 1 sorted_file1 sorted_file2
Human Genetic Variation • Highly similar genomes • DNA variants (sequence differences) • Phenotypic differences (physical traits)
Variant Types • Frazer et al. (2009) • Rahim, Harismendy et al. (2008)
Variants from an individual genome • Within any given individual there are ~4 million genetic variants encompassing ~12 Mb
Variants from multiple genomes • Within a given individual the majority of variants are common.
Next Generation DNA analysis • Whole genome sequencing • Mutations (coding and non-coding) • Translocations • Copy Number Variants • Whole Exome Sequencing • Mutations (coding) • ~Copy number variants (trisomy, gene amplifications) • Gene Panel • Mutations (coding)
Variant Frequencies • Common genetic variants • second allele present at greater than 3% frequency • Rare genetic variant • present at less than 3% frequency, and commonly at very low frequencies • Private variants • in limited families or single individuals
HapMap Project • Genotyped ~3.1 million SNPs in 270 individuals • 90 Yoruba in Ibadan, Nigeria (YRI) • 90 European descent in Utah, USA (CEU) • 45 Han Chinese in Beijing, China (CHB) • 45 Japanese in Tokyo, Japan (JPT) • Map of Genetic Variation: relationships between common SNPs in the human genome • Frazer et al. (2007)
VCF format
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
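The structure of a VCF data line (eight fixed columns, a FORMAT column, then one column per sample) can be unpacked with a few string splits. This is a minimal sketch rather than a full parser: `parse_info` and `parse_vcf_record` are hypothetical names, values are kept as strings, and the line is assumed to be tab-delimited as in a real VCF file.

```python
def parse_info(info):
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    result = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue
        if "=" in entry:
            key, value = entry.split("=", 1)
            result[key] = value
        else:
            result[entry] = True  # flag such as DB or H2
    return result


def parse_vcf_record(line, sample_names):
    """Parse one tab-delimited VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]
    keys = fmt.split(":")  # e.g. GT, GQ, DP, HQ
    samples = {
        name: dict(zip(keys, raw.split(":")))
        for name, raw in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),  # POS is 1-based in VCF
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": parse_info(info),
        "samples": samples,
    }


# First data line of the example, rebuilt with tab separators.
line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
record = parse_vcf_record(line, ["NA00001", "NA00002", "NA00003"])
print(record["pos"], record["alt"], record["info"]["AF"], record["samples"]["NA00002"]["GT"])
```

In the genotype field, `|` marks a phased genotype and `/` an unphased one; 0 refers to REF and 1, 2, ... to the ALT alleles in order.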
Linkage Disequilibrium (LD) • Given two biallelic sites, four allele combinations can be observed • SNP 1 = A/G, SNP 2 = A/C • LD measures the level of correlation between SNPs • LD is the consequence of recombination at preferential sites
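The correlation between two biallelic SNPs is commonly summarized by D and r², computed from the four haplotype counts. The sketch below uses the standard definitions D = p(AB) - p(A)p(B) and r² = D² / (p(A)(1 - p(A)) p(B)(1 - p(B))); the function name and the counts are made up for illustration.

```python
def ld_from_haplotype_counts(n11, n12, n21, n22):
    """Return (D, r2) for two biallelic SNPs.

    n11: haplotypes with allele 1 at SNP 1 and allele 1 at SNP 2
    n12: allele 1 at SNP 1, allele 2 at SNP 2
    n21: allele 2 at SNP 1, allele 1 at SNP 2
    n22: allele 2 at SNP 1, allele 2 at SNP 2
    """
    total = n11 + n12 + n21 + n22
    if total == 0:
        raise ValueError("no haplotypes observed")
    p11 = n11 / total              # frequency of the 1-1 haplotype
    p1 = (n11 + n12) / total       # allele 1 frequency at SNP 1
    q1 = (n11 + n21) / total       # allele 1 frequency at SNP 2
    d = p11 - p1 * q1              # deviation from independence
    denom = p1 * (1 - p1) * q1 * (1 - q1)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d, d * d / denom


# SNP 1 = A/G, SNP 2 = A/C; hypothetical counts of A-A, A-C, G-A, G-C haplotypes.
d, r2 = ld_from_haplotype_counts(40, 10, 10, 40)
print(round(d, 4), round(r2, 4))

# Only two haplotypes observed: the SNPs are perfectly correlated.
d_perfect, r2_perfect = ld_from_haplotype_counts(50, 0, 0, 50)
print(round(d_perfect, 4), round(r2_perfect, 4))
```

An r² of 1 means that the genotype at one SNP fully predicts the other; an r² of 0 means the two sites are independent.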
LD Bin structure example • LD bin = groups of SNPs with r² ≥ 0.8 • The majority of common SNPs are in LD bins in the human genome • Genotypes of a set of ~500,000 "tag SNPs" provide information (r² ≥ 0.8) regarding a large fraction (90%) of all 8 million common SNPs present in humans.
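One simple way to derive bins and tag SNPs from pairwise r² values is a greedy selection: repeatedly pick the SNP that is correlated with the most remaining SNPs and group them. The sketch below is an assumption about how such a procedure could look, not the method used to produce the figures quoted above; the SNP identifiers and r² values are hypothetical.

```python
def greedy_ld_bins(snps, r2, threshold=0.8):
    """Group SNPs into bins, each defined by r2 >= threshold with its tag SNP.

    snps: list of SNP identifiers
    r2:   dict mapping frozenset({snp_a, snp_b}) to their pairwise r2
    Returns a list of (tag_snp, members) tuples.
    """
    remaining = list(snps)
    bins = []
    while remaining:
        def partners(snp):
            # Remaining SNPs correlated with `snp` above the threshold.
            return [
                other for other in remaining
                if other != snp and r2.get(frozenset((snp, other)), 0.0) >= threshold
            ]

        # The SNP with the most partners becomes the tag (first one on ties).
        tag = max(remaining, key=lambda snp: len(partners(snp)))
        members = [tag] + partners(tag)
        bins.append((tag, members))
        remaining = [snp for snp in remaining if snp not in members]
    return bins


snps = ["rs1", "rs2", "rs3", "rs4"]  # hypothetical identifiers
pairwise_r2 = {
    frozenset(("rs1", "rs2")): 0.90,
    frozenset(("rs2", "rs3")): 0.85,
    frozenset(("rs1", "rs3")): 0.50,
    frozenset(("rs3", "rs4")): 0.10,
}
bins = greedy_ld_bins(snps, pairwise_r2)
print(bins)
```

In this tag-centred variant, members of a bin are guaranteed to be correlated with the tag but not necessarily with each other (rs1 and rs3 above), so genotyping the tag alone is enough to recover the bin at the chosen threshold.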
GWAS principle • From phenotype to genotype • Tests if common SNPs tagging an interval in the human genome are "associated" with a disease • http://www.mpg.de
GWAS results • Q1 2011: 221 traits, 1319 studies, >4000 associated SNPs (figure label: PR interval) • The large number of SNPs tested requires a low p-value threshold (5 × 10⁻⁸) • Sample size determines the variant frequencies and effect sizes that can be detected (power) • WTCCC (2007)
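The genome-wide threshold can be motivated by a Bonferroni correction. The sketch below assumes a family-wise error rate of 0.05 and roughly one million independent tests across the genome; both numbers are conventional assumptions and are not stated on the slide.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumed values: alpha = 0.05, about one million independent common-variant tests.
genome_wide = bonferroni_threshold(0.05, 1_000_000)
print(f"{genome_wide:.0e}")

# A SNP is called genome-wide significant only below that threshold.
p_value = 3e-9  # hypothetical association p-value
is_significant = p_value < genome_wide
print(is_significant)
```

Because the threshold is so stringent, detecting small effects or low-frequency variants requires large sample sizes.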
GWAS highlights • Many genes/loci not previously known to be involved in the diseases studied • Newly identified pathways suggest that molecular sub-phenotypes of common diseases may exist • Many common diseases have the same associated genes suggesting similar etiologies
GWAS limitations • Genetic • Small effect sizes: associated variants only explain a small fraction (1-25%) of the heritability • Missing heritability can be hiding in • Rare variants with large effects • Epistasis (Gene x Gene interactions) • Gene x Environment interactions (overlooked in heritability studies) • Clinical • Limited prognostic value: classic markers (family history, lifestyle) work better • Limited by ethnicity • Functional • Proxy SNPs are not the functional ones • Genes are associated by proximity: variants are mostly outside genes • Cell type and condition unknown
Clinical Data Collected • Personal and history • Histology • Clinical • Molecular
Clinical Data Collected • [Figure; labels: "Patients", "Decreasing intrinsic sensitivity", "Days after Dx"]