Elephant Seg Dup Analysis

ranae
Presentation Transcript


  1. Elephant Seg Dup Analysis: Genome Parameters for Pipeline Analysis

  2. Elephant Genome
     • The genome assembly was downloaded from ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Loxodonta_africana/Loxafr3.0/
     • This assembly contains 693 scaffolds (GL…) and 1658 contigs (AAGU…), but they are not mapped to chromosomes.
     • Total gapped length is 3,196 Mb and ungapped sequence length is 3,118 Mb.

  3. Seg Dup detection pipelines
     • WGAC (whole-genome assembly comparison) detects Seg Dups in genomic assemblies by looking for homologous pairs (>1 kb in length and >90% identity).
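
The selection criteria on this slide can be sketched in Python. This is only an illustration of the two thresholds, not the WGAC pipeline itself: the `Alignment` record, its field names, and the sample values are hypothetical, and treating both cut-offs as strict inequalities is an assumption based on the ">" signs on the slide.

```python
from dataclasses import dataclass

MIN_LENGTH_BP = 1_000   # ">1 kb in length" from the slide
MIN_IDENTITY = 0.90     # ">90% identity" from the slide


@dataclass(frozen=True)
class Alignment:
    """Hypothetical pairwise alignment record (not the pipeline's own format)."""
    query: str
    target: str
    length_bp: int    # aligned length in bp
    identity: float   # fraction of identical bases, between 0 and 1


def is_seg_dup_candidate(aln: Alignment) -> bool:
    """Apply both thresholds quoted on the slide (assumed strict)."""
    return aln.length_bp > MIN_LENGTH_BP and aln.identity > MIN_IDENTITY


# Made-up alignments, for illustration only.
alignments = [
    Alignment("scaffold_A", "scaffold_B", 1_500, 0.97),
    Alignment("scaffold_A", "scaffold_C", 800, 0.99),    # too short
    Alignment("scaffold_B", "scaffold_C", 5_000, 0.88),  # too divergent
]
candidates = [a for a in alignments if is_seg_dup_candidate(a)]
```

Only the first made-up alignment passes both filters.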

  4. Parameters and notes for WGAC pipeline
     • Repeats
       • Because an elephant repeat library is not available, we masked out the combined sequence space of the winMask and RepeatMasker masks.
       • RepeatMasker alone with default settings is not good enough (tested by BLAST).
       • The combined masking space is good enough.
     • BLAST parsing seeds in the WGAC pipeline
       • The seed size is 500 bp.
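
A minimal sketch of what "combined masking space" means: a base is masked if either tool masks it. The interval lists, the toy sequence, and the function names below are hypothetical; the slide does not say how the pipeline actually merges the two masks or whether it hard-masks with `N`.

```python
def combined_mask(seq_len, *interval_sets):
    """Per-base boolean mask covering the union of all interval sets.

    Intervals are assumed to be 0-based, half-open (start, end) pairs.
    """
    mask = [False] * seq_len
    for intervals in interval_sets:
        for start, end in intervals:
            for pos in range(max(start, 0), min(end, seq_len)):
                mask[pos] = True
    return mask


def apply_mask(seq, mask, mask_char="N"):
    """Replace every masked base with mask_char (hard masking, an assumption)."""
    return "".join(mask_char if m else base for base, m in zip(seq, mask))


# Toy data, for illustration only.
sequence = "ACGTACGTACGTACGT"
winmask_hits = [(0, 3), (10, 12)]        # hypothetical winMask intervals
repeatmasker_hits = [(2, 5), (14, 16)]   # hypothetical RepeatMasker intervals

mask = combined_mask(len(sequence), winmask_hits, repeatmasker_hits)
masked = apply_mask(sequence, mask)
```

A per-base list is fine for a toy sequence; for a 3-Gb assembly an interval-based or bit-array representation would be the more realistic choice.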

  5. Result from WGAC Pipeline
     • Total pairs of WGAC detected (>1 kb and >90% identity): 64,164
     • Inter chromosome pairs: 58,454
     • Intra chromosome pairs: 5,709
     • Total WGAC NR (bp): 128,672,221
     • NR inter (bp): 97,156,068
     • NR intra (bp): 55,296,067
     • Total genome size (with gaps, bp): 3,196,721,236
     Note: inter and intra are based on scaffolds and contigs rather than chromosomes.
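
The figures on this slide can be related to each other with a few lines of arithmetic. The numbers are copied from the slide; the variable names are illustrative only.

```python
# Values quoted on the slide.
total_nr_bp = 128_672_221
genome_bp = 3_196_721_236   # gapped genome size
total_pairs = 64_164
inter_pairs = 58_454
intra_pairs = 5_709


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


nr_pct_of_genome = percent(total_nr_bp, genome_bp)
inter_pct_of_pairs = percent(inter_pairs, total_pairs)
# Difference between the quoted total and the sum of inter and intra pairs.
pair_count_gap = total_pairs - (inter_pairs + intra_pairs)

print(f"NR WGAC as share of gapped genome: {nr_pct_of_genome:.2f}%")
print(f"Inter pairs as share of all pairs: {inter_pct_of_pairs:.2f}%")
print(f"Total minus (inter + intra): {pair_count_gap}")
```

On these numbers the nonredundant WGAC sequence is roughly 4.0% of the gapped genome. The quoted inter and intra counts add up to 64,163, one fewer than the quoted total; the slide does not explain the difference.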

  6. General analysis of WGAC length and identity distribution
     • The length distribution peaks at 1-2 kb (intra > inter), with 87% of WGAC pairs related to chrUn.
     • The identity distribution peaks at 97-98%. Few pairs are higher than 99%.

  7. NR distribution (AllDupLen.xls)
     • Because the scaffolds and contigs are not mapped to chromosomes, there is no NR distribution for each chromosome.
     • In general, large scaffolds have fewer SDs and smaller scaffolds have more, especially those shorter than 1 Mb.
     • All contigs have a high percentage of SDs.

  8. Initial stats are in allstat.xls

  9. WGAC page, not yet set up

  10. WSSD analysis done by Tin (not yet)
      • Downloaded the WGS reads: about 11,683,735 reads from the Trace Archive at NCBI.
      • Downloaded zfinch-finished BACs. These BACs are used to determine the threshold for WGS depth of coverage. For a 5-kb window, the average number of reads is 59. The threshold for a 5-kb window is 110; for a 1-kb window it is 22.
      • Used the UCSC taeGut1 database rmsk tables as input to mask the genome for repeats with divergence <=10%. (UCSC rmsk options: RepeatMasker -align -s -species 'Taeniopygia guttata')
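
A rough sketch of how a depth threshold like the one above could be applied. It counts read starts in fixed, non-overlapping 5-kb windows and flags windows at or above 110 reads. The fixed (rather than sliding) windows, the inclusive comparison, the function names, and the read positions are all assumptions for illustration; the slide gives only the window sizes and thresholds.

```python
from collections import Counter

WINDOW_BP = 5_000
DEPTH_THRESHOLD = 110   # reads per 5-kb window, from the slide


def reads_per_window(read_starts, window_bp=WINDOW_BP):
    """Count reads per fixed window, keyed by the window's start coordinate."""
    counts = Counter()
    for pos in read_starts:
        counts[(pos // window_bp) * window_bp] += 1
    return counts


def windows_over_threshold(counts, threshold=DEPTH_THRESHOLD):
    """Return sorted window starts whose read count reaches the threshold."""
    return sorted(start for start, n in counts.items() if n >= threshold)


# Made-up read start positions: 59 reads in the first window (the average
# quoted on the slide) and 120 reads in the second window.
read_starts = [i * 80 for i in range(59)] + [5_000 + i * 40 for i in range(120)]

counts = reads_per_window(read_starts)
flagged = windows_over_threshold(counts)
```

Here only the second window is flagged, since 59 reads is below the threshold and 120 is above it.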

  11. WSSD results (not yet available)
      • A total of 16,076 regions with 44,218,871 bp were found in wssdGE10K_nogap.tab (which has a 10-kb cut-off); 13,782 of them are on chrUn.
      • A summary table of WGAC intersected with WSSD is at http://eichlerlab.gs.washington.edu/help/linchen/zfinch/data/wgacCMPwssd.out.xls

  12. General view showing WGAC (>5 kb) and WSSD on all chromosomes (not done yet; may be done on large scaffolds)
      • Grey, above the lines: WSSD
      • Brown, below the lines: WGAC

  13. Union of WSSD and WGAC; gene intersect with Seg Dups (not available)
      • A nonredundant union of WGAC and WSSD was generated with the cut-off size at 10 kb (AllDup10kb.tab). There are 3,839 NR regions with 50,902,487 bp, which is about 10 Mb more than WSSD alone.
      • However, be aware that there may be false-positive sites, especially on chrUn, since we know there is a high rate of false-positive WGACs on chromosomes and chrUn.
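
One way to picture a nonredundant union with a size cut-off is to merge overlapping intervals from both call sets and then keep merged regions of at least 10 kb. This is a sketch under assumptions: the slide does not say whether the cut-off is applied before or after merging, and the coordinates, scaffold names, and function names below are invented.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (name, start, end) half-open intervals."""
    merged = []
    for name, start, end in sorted(intervals):
        if merged and merged[-1][0] == name and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (name, prev[1], max(prev[2], end))
        else:
            merged.append((name, start, end))
    return merged


def nonredundant_union(wgac, wssd, min_len_bp=10_000):
    """Union of both call sets, keeping merged regions >= min_len_bp.

    Applying the cut-off after merging is an assumption.
    """
    merged = merge_intervals(list(wgac) + list(wssd))
    return [iv for iv in merged if iv[2] - iv[1] >= min_len_bp]


# Invented intervals, for illustration only.
wgac = [("scaffold_1", 0, 6_000), ("scaffold_2", 100, 4_000)]
wssd = [("scaffold_1", 5_000, 12_000), ("scaffold_2", 50_000, 58_000)]

union_10kb = nonredundant_union(wgac, wssd)
total_bp = sum(end - start for _, start, end in union_10kb)
```

In this toy case the two overlapping calls on `scaffold_1` merge into one 12-kb region, which is the only one that survives the cut-off.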

  14. Summary table 1 (not available)

  15. Large SDs >=10 kb
      • SDs >=10 kb in size were pulled out. There are a total of 3,839 intervals with a combined length of 50,902,487 bp in allDup.tab.

  16. result
