Elephant Seg Dup Analysis

Elephant Seg Dup Analysis Genome Parameters for Pipeline Analysis

Elephant Genome
• The genome assembly was downloaded from ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Loxodonta_africana/Loxafr3.0/
• This assembly contains 693 scaffolds (GL…) and 1,658 contigs (AAGU…), none of which are mapped to chromosomes.
• Total gapped length is 3,196 Mb and ungapped sequence length is 3,118 Mb.

Seg Dup detection pipelines
• WGAC (whole-genome assembly comparison) detects Seg Dups in genomic assemblies by looking for homologous pairs (>1 kb in length and >90% identity).
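
The length and identity cut-offs above can be sketched as a simple filter over pairwise alignments. This is only an illustration of the thresholds, not the WGAC pipeline itself: the `Alignment` record, its field names, and the sequence names in the usage example are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one pairwise alignment (not a WGAC file format)."""
    query: str           # name of the first sequence
    subject: str         # name of the second sequence
    length_bp: int       # aligned length in base pairs
    identity_pct: float  # percent identity of the alignment


# Cut-offs taken from the slide: >1 kb in length and >90% identity.
MIN_LENGTH_BP = 1000
MIN_IDENTITY_PCT = 90.0


def is_wgac_candidate(aln: Alignment) -> bool:
    """Return True when the alignment passes both cut-offs (strictly greater)."""
    return aln.length_bp > MIN_LENGTH_BP and aln.identity_pct > MIN_IDENTITY_PCT


def filter_wgac(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Keep only the alignments that pass both cut-offs."""
    return [aln for aln in alignments if is_wgac_candidate(aln)]


if __name__ == "__main__":
    # Made-up alignments between hypothetical sequences seqA..seqD.
    examples = [
        Alignment("seqA", "seqB", 2500, 97.4),  # passes both cut-offs
        Alignment("seqA", "seqC", 800, 99.0),   # too short
        Alignment("seqB", "seqD", 5000, 88.0),  # identity too low
    ]
    for aln in filter_wgac(examples):
        print(aln)
```

Both comparisons are strict because the slide states ">1 kb" and ">90%"; a pipeline that uses ">=" would differ only at the boundary values.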

Parameters and notes for WGAC pipeline
• Repeats
  • Because an elephant repeat library is not available, we masked the combined sequence space identified by winMask and RepeatMasker.
  • RepeatMasker with default settings alone does not mask enough, as tested by BLAST.
  • The combined masking space is sufficient.
• BLAST parsing seeds in the WGAC pipeline
  • The seed size is 500 bp.

Result from WGAC Pipeline
• Total WGAC pairs detected (>1 kb and >90% identity): 64,164
• Inter-chromosome pairs: 58,454
• Intra-chromosome pairs: 5,709
• Total WGAC NR (bp): 128,672,221
• NR inter (bp): 97,156,068
• NR intra (bp): 55,296,067
• Total genome size (with gaps, bp): 3,196,721,236
Notes:
• Inter and intra are defined relative to scaffolds and contigs rather than chromosomes.
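
The figures above can be put in proportion with a little arithmetic. The sketch below computes the fraction of the gapped genome covered by nonredundant WGAC sequence (roughly 4%) and estimates how many bases are counted in both the inter and intra sets. The second calculation assumes that the total NR figure is the union of the inter and intra NR sets, which the slide does not state explicitly; the function names are illustrative only.

```python
# Figures copied from the WGAC result slide (all in base pairs).
TOTAL_NR_BP = 128_672_221
NR_INTER_BP = 97_156_068
NR_INTRA_BP = 55_296_067
GENOME_BP = 3_196_721_236  # total genome size, gaps included


def percent_of_genome(bp: int, genome_bp: int = GENOME_BP) -> float:
    """Express a base-pair count as a percentage of the gapped genome size."""
    return 100.0 * bp / genome_bp


def shared_bp(inter_bp: int, intra_bp: int, union_bp: int) -> int:
    """Bases present in both sets, by inclusion-exclusion.

    ASSUMPTION: union_bp is the nonredundant union of the two sets.
    """
    return inter_bp + intra_bp - union_bp


if __name__ == "__main__":
    print(f"Total NR: {percent_of_genome(TOTAL_NR_BP):.2f}% of genome")
    print(f"NR inter: {percent_of_genome(NR_INTER_BP):.2f}% of genome")
    print(f"NR intra: {percent_of_genome(NR_INTRA_BP):.2f}% of genome")
    print(f"In both inter and intra: {shared_bp(NR_INTER_BP, NR_INTRA_BP, TOTAL_NR_BP):,} bp")
```

The inter and intra NR totals sum to more than the overall NR total, which is expected if some duplicated bases take part in both inter- and intra-scaffold pairs.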

General analysis of WGAC length and identity distribution
• The length distribution peaks at 1-2 kb, with intra > inter and 87% of WGAC pairs involving chrUn.
• The identity distribution peaks at 97-98%. Few pairs are above 99% identity.

NR distribution (AllDupLen.xls)
• Because the scaffolds and contigs are not mapped to chromosomes, there is no per-chromosome NR distribution.
• In general, large scaffolds have a lower SD content and smaller scaffolds have a higher SD content, especially those shorter than 1 Mb.
• All contigs have a high percentage of SDs.

Initial stats are in allstat.xls

WGAC page: not yet set up

WSSD analysis done by Tin (not yet)
• Downloaded the WGS reads: about 11,683,735 reads from the Trace Archive at NCBI.
• Downloaded finished zfinch BACs. These BACs are used to determine the threshold for WGS depth of coverage. For a 5-kb window, the average number of reads is 59. The threshold is 110 reads for a 5-kb window and 22 for a 1-kb window.
• Used the UCSC taeGut1 database rmsk tables as input to mask genome repeats with divergence <=10% (UCSC rmsk options: RepeatMasker -align -s -species 'Taeniopygia guttata').
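
The window thresholds quoted above (110 reads per 5-kb window, 22 per 1-kb window, which is the same rate scaled by window size) can be illustrated with a small depth-of-coverage sketch. It is not the WSSD pipeline: it assumes each read is reduced to a single 0-based start coordinate, uses non-overlapping windows, and treats a window as high depth when its count is at or above the threshold. All function names are hypothetical.

```python
from typing import Iterable, List


def window_read_counts(read_starts: Iterable[int], seq_length: int,
                       window_bp: int = 5000) -> List[int]:
    """Count reads per non-overlapping window along one sequence.

    ASSUMPTION: each read is represented by its 0-based start position.
    """
    n_windows = (seq_length + window_bp - 1) // window_bp  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < seq_length:  # ignore reads that fall outside the sequence
            counts[pos // window_bp] += 1
    return counts


def flag_high_depth(counts: List[int], threshold: int = 110) -> List[int]:
    """Return indices of windows whose read count reaches the threshold.

    ASSUMPTION: 'at or above' the threshold counts as high depth.
    """
    return [i for i, c in enumerate(counts) if c >= threshold]


if __name__ == "__main__":
    # Toy data: 120 reads in the first 5-kb window, 59 in the second.
    starts = [100] * 120 + [6000] * 59
    counts = window_read_counts(starts, seq_length=10_000)
    print(counts)
    print(flag_high_depth(counts))
```

With the toy data, only the first window is flagged; the second sits at the reported average of 59 reads and stays below the threshold.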

WSSD results (not yet available)
• A total of 16,076 regions covering 44,218,871 bp were found in wssdGE10K_nogap.tab (which has a 10-kb cut-off). 13,782 of them are on chrUn.
• A summary table of the WGAC intersect with WSSD is at http://eichlerlab.gs.washington.edu/help/linchen/zfinch/data/wgacCMPwssd.out.xls

General view showing WGAC (>5 kb) and WSSD on all chromosomes (not done yet; may be done on large scaffolds)
Grey lines above are WSSD
Brown lines below are WGAC

Union of WSSD and WGAC; gene intersect with Seg Dups (not available)
• A nonredundant union of WGAC and WSSD is generated with a size cut-off of 10 kb (AllDup10kb.tab). There are 3,839 NR regions covering 50,902,487 bp, which is about 10 Mb more than WSSD alone.
• However, be aware that there may be false-positive sites, especially on chrUn, since WGAC is known to have a high false-positive rate on both chromosomes and chrUn.
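
A nonredundant union of two interval sets, as described above, can be sketched by merging overlapping intervals per sequence and then applying the size cut-off. This is a minimal illustration and not the code that produced AllDup10kb.tab: it assumes half-open `(sequence, start, end)` coordinates and applies the 10-kb cut-off after merging, neither of which is stated on the slide. The function names and sequence names are hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (sequence name, start, end), half-open


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals on the same sequence."""
    merged: List[List] = []
    for seq, start, end in sorted(intervals):
        if merged and merged[-1][0] == seq and start <= merged[-1][2]:
            # Overlaps (or abuts) the previous interval: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([seq, start, end])
    return [(seq, start, end) for seq, start, end in merged]


def nr_union(wgac: Iterable[Interval], wssd: Iterable[Interval],
             min_size_bp: int = 10_000) -> Tuple[List[Interval], int]:
    """Return merged intervals of at least min_size_bp and their total length.

    ASSUMPTION: the size cut-off is applied after merging the two sets.
    """
    merged = merge_intervals(list(wgac) + list(wssd))
    kept = [iv for iv in merged if iv[2] - iv[1] >= min_size_bp]
    total_bp = sum(end - start for _, start, end in kept)
    return kept, total_bp


if __name__ == "__main__":
    # Toy intervals on hypothetical sequences s1 and s2.
    wgac = [("s1", 0, 6000), ("s1", 20_000, 21_000)]
    wssd = [("s1", 5000, 12_000), ("s2", 0, 10_000)]
    regions, total = nr_union(wgac, wssd)
    print(regions, total)
```

In the toy data, the two overlapping s1 intervals merge into one 12-kb region, while the isolated 1-kb WGAC interval falls below the cut-off and is dropped.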

Summary table 1 (not available)

Large SDs >=10 kb
• SDs >=10 kb in size were pulled out. There are a total of 3,839 intervals covering 50,902,487 bp in allDup.tab.

result
