Elephant Seg Dup Analysis
Genome Parameters for Pipeline Analysis
Elephant Genome
• The genome assembly was downloaded from ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Loxodonta_africana/Loxafr3.0/
• This assembly contains 693 scaffolds (GL…) and 1658 contigs (AAGU…), none of which are mapped to chromosomes.
• Total gapped length is 3,196 Mb and ungapped sequence length is 3,118 Mb.
Seg Dup detection pipelines
• WGAC (whole-genome assembly comparison) detects Seg Dups in genomic assemblies by looking for homologous pairs (>1 kb in length and >90% identity).
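
The length and identity cut-offs above can be sketched as a simple filter. This is only an illustration, not the WGAC pipeline's own code: the `Pair` record, its field names, and the example scaffold names are hypothetical, and identity is assumed to be stored as a fraction.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """One pairwise alignment (hypothetical record layout)."""
    seq1: str          # name of the first scaffold or contig
    seq2: str          # name of the second scaffold or contig
    aligned_bp: int    # alignment length in bp
    identity: float    # fraction of identical bases, 0.0 to 1.0


MIN_LENGTH_BP = 1_000   # >1 kb, from the slide
MIN_IDENTITY = 0.90     # >90% identity, from the slide


def passes_wgac_cutoffs(pair: Pair) -> bool:
    """Return True when the pair exceeds both cut-offs."""
    return pair.aligned_bp > MIN_LENGTH_BP and pair.identity > MIN_IDENTITY


# Made-up example alignments, for illustration only.
pairs = [
    Pair("scaffold_A", "scaffold_B", 1_500, 0.975),  # kept
    Pair("scaffold_A", "scaffold_A", 800, 0.990),    # too short
    Pair("scaffold_C", "contig_D", 5_000, 0.880),    # identity too low
]
kept = [p for p in pairs if passes_wgac_cutoffs(p)]
```

Both comparisons are strict here because the slide writes ">" for each cut-off.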
Parameters and notes for WGAC pipeline
• Repeats
  • Because an elephant repeat library is not available, we masked the combined sequence space of winMask and RepeatMasker.
  • RepeatMasker alone with default settings does not mask enough (tested by BLAST).
  • The combined masking space is good enough.
• BLAST parsing seeds in WGAC pipeline
  • The seed size is 500 bp.
Result from WGAC Pipeline
• Total pairs of WGAC detected (>1 kb and >90% identity): 64164
• Inter chromosome pairs: 58454
• Intra chromosome pairs: 5709
• Total WGAC NR (nonredundant, bp): 128,672,221
• NR inter (bp): 97,156,068
• NR intra (bp): 55,296,067
• Total genome size (with gap, bp): 3,196,721,236
Notes:
• Inter and intra are defined by scaffolds and contigs rather than by chromosomes.
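
As a rough illustration of the note above, the snippet below labels a pair as inter or intra by comparing sequence names, and expresses the nonredundant WGAC total as a share of the gapped genome. The function names and the example scaffold names are assumptions made for this sketch; the two base-pair totals are taken from the slide.

```python
def pair_type(seq1: str, seq2: str) -> str:
    """Label a pair 'intra' when both ends lie on the same scaffold or contig."""
    return "intra" if seq1 == seq2 else "inter"


def percent_of_genome(nr_bp: int, genome_bp: int) -> float:
    """Share of the genome covered by nonredundant duplicated bases, in percent."""
    return 100.0 * nr_bp / genome_bp


WGAC_NR_BP = 128_672_221        # Total WGAC NR (bp), from the slide
GENOME_BP = 3_196_721_236       # Total genome size with gaps (bp), from the slide

wgac_percent = percent_of_genome(WGAC_NR_BP, GENOME_BP)
print(f"WGAC NR covers {wgac_percent:.2f}% of the gapped assembly")

# Hypothetical sequence names, for illustration only.
print(pair_type("scaffold_A", "scaffold_A"))
print(pair_type("scaffold_A", "contig_B"))
```

With the slide's figures this comes to roughly 4% of the gapped assembly.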
General analysis of WGAC length and identity distribution
• Length distribution peaks at 1-2 kb, with intra > inter; 87% of WGAC pairs are related to chrUn.
• Identity distribution peaks at 97-98%; few pairs are higher than 99%.
NR distribution (AllDupLen.xls)
• Because the scaffolds and contigs are not mapped to chromosomes, there is no NR distribution for each chromosome.
• In general, large scaffolds have less SD and smaller scaffolds have more, especially those shorter than 1 Mb.
• All contigs have a high percentage of SDs.
WSSD analysis done by Tin (not yet)
• Downloaded the WGS reads: about 11,683,735 reads from the trace archive at NCBI.
• Downloaded zfinch finished BACs. These BACs are used to determine the threshold for WGS depth of coverage. For a 5-kb window, the average number of reads is 59. The threshold is 110 reads for a 5-kb window and 22 reads for a 1-kb window.
• Used UCSC taeGut1 database rmsk tables as input to mask the genome for repeats with divergence <=10% (UCSC rmsk options: RepeatMasker -align -s -species 'Taeniopygia guttata').
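
The depth-of-coverage thresholds above can be illustrated with a minimal window-counting sketch. It is not the WSSD pipeline itself: the fixed, non-overlapping windows, the strict "greater than" comparison, and the function names are all assumptions, and the read positions are synthetic. Only the window size of 5 kb and the threshold of 110 reads come from the slide.

```python
from collections import Counter


def window_counts(read_starts, window_bp):
    """Count reads per fixed, non-overlapping window, keyed by window index."""
    counts = Counter()
    for pos in read_starts:
        counts[pos // window_bp] += 1
    return counts


def windows_over_threshold(read_starts, window_bp, threshold):
    """Return (start, end) of windows whose read count exceeds the threshold."""
    counts = window_counts(read_starts, window_bp)
    return sorted(
        (index * window_bp, (index + 1) * window_bp)
        for index, n_reads in counts.items()
        if n_reads > threshold
    )


# Synthetic read start positions: 120 reads in the first 5-kb window,
# 50 reads in the second one.
reads = list(range(0, 4_800, 40)) + list(range(5_000, 10_000, 100))

flagged = windows_over_threshold(reads, window_bp=5_000, threshold=110)
print(flagged)
```

The 1-kb threshold of 22 on the slide equals the 5-kb threshold of 110 divided by five, so the same function can be reused with `window_bp=1_000, threshold=22`.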
WSSD results (not yet available)
• A total of 16,076 regions covering 44,218,871 bp were found in wssdGE10K_nogap.tab (which has a 10-kb cut-off); 13,782 of them are on chrUn.
• A summary table of the WGAC intersect with WSSD is at http://eichlerlab.gs.washington.edu/help/linchen/zfinch/data/wgacCMPwssd.out.xls
General view showing WGAC (>5 kb) and WSSD on all chromosomes (not done yet, may be on large scaffolds)
• Grey lines above are WSSD
• Brown lines below are WGAC
Union of WSSD and WGAC; gene intersect with Seg Dups (not available)
• A nonredundant union of WGAC and WSSD is generated with a cut-off size of 10 kb (AllDup10kb.tab). There are 3,839 NR regions covering 50,902,487 bp, which is about 10 Mb more than WSSD alone.
• However, be aware that there may be false positive sites, especially on chrUn, since WGAC is known to have a high false positive rate on chromosomes and chrUn.
Large SDs >=10 kb
• SDs >=10 kb in size were pulled out. There are a total of 3,839 intervals with a total length of 50,902,487 bp in allDup.tab.
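
A nonredundant union with a size cut-off, as described on the last two slides, might be sketched as below. This is an assumption-laden illustration rather than the code that produced AllDup10kb.tab: intervals are taken to be half-open `(name, start, end)` tuples, the 10-kb filter is applied after merging, and the example coordinates are made up.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching intervals on the same sequence."""
    merged = []
    for name, start, end in sorted(intervals):
        if merged and merged[-1][0] == name and start <= merged[-1][2]:
            # Overlaps the previous interval on the same sequence: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([name, start, end])
    return [tuple(region) for region in merged]


def large_regions(intervals, min_bp=10_000):
    """Keep merged regions that are at least min_bp long."""
    return [r for r in merge_intervals(intervals) if r[2] - r[1] >= min_bp]


# Made-up WGAC and WSSD calls, for illustration only.
wgac = [("scaffold_A", 0, 8_000), ("scaffold_B", 100, 3_000)]
wssd = [("scaffold_A", 6_000, 15_000)]

union = large_regions(wgac + wssd)
total_bp = sum(end - start for _, start, end in union)
print(len(union), total_bp)
```

Merging before filtering lets two short overlapping calls combine into one region that passes the cut-off; filtering first would give a different count, so the order is a design choice to check against the actual pipeline.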