Zebra Finch Seg Dup Analysis: Genome Parameters for Pipeline Analysis
Zebra Finch Genome • The genome (Jul. 2008 assembly of the zebra finch genome, taeGut1, WUSTL v3.2.4) was downloaded from UCSC. This assembly was produced by the Genome Sequencing Center at the Washington University in St. Louis (WUSTL) School of Medicine. • The zebra finch DNA used for the shotgun sequencing and for the BAC and cosmid libraries was derived from a single male domesticated zebra finch. The initial assembly was generated using PCAP with approximately 6X coverage. About 1.0 Gb of the 1.2-Gb genome has been ordered and oriented along 33 chromosomes and one linkage group. The chromosome names are based on their homologous chromosomes in the chicken (Gallus gallus). • Total genome size (gapped): 1,233,186,341 bp
Seg Dup detection pipelines • WGAC detects Seg Dups in genomic assemblies by looking for homologous pairs (>1 kb in length, >90% identity). • WSSD detects Seg Dups in given sequences based on the depth of coverage of WGS (whole-genome shotgun) reads. Criterion: depth of coverage > average + 3 SD. Done by Ginger Cheng.
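The WSSD criterion (depth of coverage > average + 3 SD) can be sketched as follows. This is only a minimal illustration, assuming the threshold is the mean plus three sample standard deviations of per-window read counts from control regions. The function name `wssd_threshold` and the toy counts are hypothetical and are not taken from the pipeline.

```python
from statistics import mean, stdev

def wssd_threshold(control_counts, n_sd=3.0):
    """Return mean + n_sd * sample SD of per-window read counts (illustrative only)."""
    return mean(control_counts) + n_sd * stdev(control_counts)

# Toy per-window read counts from hypothetical control windows.
control = [50, 55, 60, 65, 70]
threshold = wssd_threshold(control)

# A window is flagged as a candidate duplication when its depth exceeds the threshold.
windows = {"w1": 58, "w2": 120}
flagged = [name for name, depth in windows.items() if depth > threshold]
print(flagged)
```

Whether the pipeline uses the sample or the population standard deviation is not stated on the slide; the sample SD is used here as an assumption.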
Parameters and notes for WGAC pipeline • Repeats • The sequences downloaded from UCSC have been soft-masked. • UCSC rmsk options: RepeatMasker -align -s -species 'Taeniopygia guttata' • The repeat coordinates were reverse-generated from the soft-masked sequences. • Blast parsing seeds in WGAC pipeline: • The seed size is 250 bp.
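Reverse-generating repeat coordinates from a soft-masked sequence amounts to finding runs of lowercase bases. The sketch below shows one way to do this, assuming 0-based, half-open coordinates; the function name `lowercase_runs` and the toy sequence are hypothetical, and the actual script used for the slides is not shown in the source.

```python
import re

def lowercase_runs(seq):
    """Yield 0-based, half-open (start, end) intervals of lowercase (soft-masked) bases."""
    for match in re.finditer(r"[a-z]+", seq):
        yield match.start(), match.end()

# Toy sequence: uppercase is unmasked, lowercase is soft-masked repeat.
seq = "ACGTacgtNNACttA"
repeats = list(lowercase_runs(seq))
print(repeats)
```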
Result from WGAC Pipeline • Total pairs of WGAC detected (>1 kb and >90% identity): 198180 • Inter chromosome pairs: 81415 • Intra chromosome pairs: 116742 • Chromosome inter and intra (excluding chr_random and chrUn): 26510 • ChrUn inter and intra: 172670 • Total WGAC NR (bp): 384,501,909 • Total genome size (with gap): 1,233,186,341 Notes: • The NR space of WGAC is about 31% of the zebra finch genome, which is too high. It is due either to incomplete repeat masking or to redundant sequences in chr_random and chrUn. In 87% of the total WGAC pairs (inter and intra), at least one sequence of the pair is on chrUn. The result indicates that a large portion of the false positive WGAC comes from chrUn.
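The two percentages in the notes follow directly from the counts on this slide. The short check below reproduces them; the variable names are illustrative.

```python
# Values quoted on the slide.
wgac_nr_bp = 384_501_909     # total WGAC NR space (bp)
genome_bp = 1_233_186_341    # total genome size with gaps (bp)
total_pairs = 198_180        # all WGAC pairs
chrun_pairs = 172_670        # pairs with at least one sequence on chrUn

nr_fraction = wgac_nr_bp / genome_bp
chrun_fraction = chrun_pairs / total_pairs

# Prints roughly 31% and 87%, matching the notes.
print(f"NR share of genome: {nr_fraction:.1%}")
print(f"Pairs involving chrUn: {chrun_fraction:.1%}")
```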
General analysis of WGAC length and identity distribution • Length distribution peaked at 1-2 kb, intra > inter, with 87% of WGAC related to chrUn. • Identity distribution peaked at 97-98%. Few are higher than 99%.
General analysis: NR distribution by chromosome • High SD content in chrUn
Global image showing the inter and intra pairs of 10 kb and above with more than 90% identity, without and with chrUn. Red indicates inter chromosomal pairs and blue indicates intra chromosomal pairs. Panels: without chrUn; with chrUn.
WGAC page • http://eichlerlab.gs.washington.edu/help/linchen/zfinch/zfinch_wgac.html
WSSD analysis done by Ginger http://eichlerlab.gs.washington.edu/help/ginger/zebrafinch/ • Downloaded the WGS reads (about 11,683,735 reads) from the trace archive at NCBI. • Downloaded zfinch-finished BACs. These BACs are used to determine the threshold for WGS depth of coverage. For a 5-kb window, the average number of reads is 59. The threshold for a 5-kb window is 110; for a 1-kb window it is 22. • Used the UCSC taeGut1 database rmsk tables as input to mask the genome for repeats with divergence <=10%. (UCSC rmsk options: RepeatMasker -align -s -species 'Taeniopygia guttata')
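The per-window thresholds above (110 reads per 5-kb window, 22 reads per 1-kb window) could be applied roughly as sketched below. This is an illustration only, assuming reads are binned into fixed, non-overlapping windows by their start position; the function name `flag_windows` and the toy read positions are hypothetical, and the real pipeline may use sliding windows or a different read placement rule.

```python
from collections import Counter

# Thresholds quoted on the slide: window size (bp) -> reads per window.
THRESHOLDS = {5000: 110, 1000: 22}

def flag_windows(read_starts, window_size):
    """Return sorted start coordinates of windows whose read count exceeds the threshold."""
    counts = Counter((pos // window_size) * window_size for pos in read_starts)
    limit = THRESHOLDS[window_size]
    return sorted(start for start, n in counts.items() if n > limit)

# Toy data: 30 reads start in the first 1-kb window, 10 reads in the second.
reads = list(range(0, 900, 30)) + list(range(1000, 2000, 100))
hot = flag_windows(reads, 1000)
print(hot)
```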
WSSD results • A total of 16,076 regions with 44,218,871 bp were found in wssdGE10K_nogap.tab (which has a 10-kb cut-off). 13,782 of them are on chrUn. • A summary table of the WGAC intersection with WSSD is at http://eichlerlab.gs.washington.edu/help/linchen/zfinch/data/wgacCMPwssd.out.xls
General view showing WGAC (>5 kb) and WSSD on all chromosomes • Grey, above the lines: WSSD • Brown, below the lines: WGAC
Union of WSSD and WGAC; gene intersect with Seg Dups • A nonredundant union of WGAC and WSSD was generated with a size cut-off of 10 kb (AllDup10kb.tab). There are 3,839 NR regions with 50,902,487 bp, which is about 6.7 Mb more than the 44,218,871 bp found by WSSD alone. • However, be aware that there may be false positive sites, especially on chrUn, since we know there is a high rate of false positive WGACs on chromosomes and chrUn.
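A nonredundant union with a size cut-off can be sketched as an interval merge followed by a length filter. The example below is a minimal sketch, assuming 0-based, half-open (chrom, start, end) intervals and assuming the 10-kb cut-off is applied after merging; the function name `nr_union` and the toy intervals are hypothetical and do not come from AllDup10kb.tab.

```python
def nr_union(intervals, min_len=10_000):
    """Merge overlapping (chrom, start, end) intervals, then keep merged regions >= min_len."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlapping or directly adjacent: extend the previous region.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(r) for r in merged if r[2] - r[1] >= min_len]

# Toy WGAC and WSSD calls.
wgac = [("chr1", 0, 8_000), ("chr2", 100, 5_000)]
wssd = [("chr1", 6_000, 12_000), ("chr2", 20_000, 31_000)]

regions = nr_union(wgac + wssd)
total_bp = sum(end - start for _, start, end in regions)
print(len(regions), total_bp)
```

Applying the cut-off before merging instead would drop short calls that only reach 10 kb when combined, so the order of the two steps changes the result.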
Large SDs >=10 kb • SDs >=10 kb in size were pulled out. There are a total of 3,839 intervals with a total length of 50,902,487 bp in allDup.tab.
Study of the chromosome-only WGAC • Segmental duplications on sequences assigned to chromosomes should be more reliable, with fewer artifacts. • These sequences should reflect the best of the assembly.
Total Dup length: 105,145,288 bp • Intra Dup length: 100,234,309 bp • Inter Dup length: 8,499,428 bp • Most of the Dup is intra chromosome dup (>90%) • These intra chromosome dups are predominantly short-range; see the global view on the next slide
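The intra and inter lengths on this slide sum to more than the total. The check below works through the arithmetic, assuming all three figures are nonredundant base counts, in which case the excess would correspond to bases that take part in both intra and inter chromosome pairs. That interpretation is an assumption and is not stated on the slide.

```python
# Values quoted on the slide (bp).
total_dup = 105_145_288
intra_dup = 100_234_309
inter_dup = 8_499_428

# Share of the total that is intra chromosome duplication.
intra_share = intra_dup / total_dup

# Excess of intra + inter over the total (assumed to be bases counted in both).
shared_bp = intra_dup + inter_dup - total_dup

print(f"Intra share: {intra_share:.1%}")
print(f"Bases counted as both intra and inter: {shared_bp:,}")
```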
Global view of 90%-5k and 94%-5k respectively, showing that a significant number of WGAC pairs are short-range intra chromosome duplications.
Blowup view showing WGAC on chromosome 1 at 5 kb and 94% identity. This is WGAC detected on sequences assigned to chromosomes only.
Detail of a sample region on chr1 • Figure labels: intra chromosome homology pairs; depth of coverage by reads (grey); WSSD; assembly gaps. • Average identity for the reads mapped to the region: Red >99%, Orange >98%, Yellow >97%, Green >96%.
Text description for slide 20 • Each black line represents a chromosome region, as indicated by the ticks. • Blue bars and pairs are the intra chromosome homologous pairs (segmental duplications) found. • Red bars and pairs on the chromosome line represent the inter chromosome homologous pairs (inter chromosome segmental duplications). • The grey bars under the chromosome line represent the depth of coverage of the region by WGS reads in 1-kb windows. The longer the bar, the higher the depth of coverage by sequence reads. • The color bar under the chromosome line represents the average identity of all the reads mapped to the region: Red (>99%), Orange (>98%), Yellow (>97%), Green (>96%). • The black bar above the chromosome line represents WSSD detected. • The purple vertical lines on the chromosome line represent assembly gaps. • Each tick represents 10,000 bp; each line is 100 kb.
Result • Most of the intra chromosomal pairs are very close to each other. In most cases, one sequence within the pair has gaps on both ends, which suggests that the contig is not physically connected to its adjacent sequences and was placed at its current position by mate pairs. • Some of them are also next to each other, separated by a gap. • In the sampled regions, we have not seen a single contig that contains both sequences of an intra chromosome segmental duplication pair. • Considering the observations above, we think there is a high possibility that these pairs are assembly artifacts introduced by the assembler.