Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing.
Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing. By Koichiro Doi, Taku Monjo, Pham H. Hoang, Jun Yoshimura, Hideaki Yurino, Jun Mitsui, Hiroyuki Ishiura, Yuji Takahashi, Yaeko Ichikawa, Jun Goto, Shoji Tsuji and Shinichi Morishita (The University of Tokyo). Bioinformatics, November 2013. Presented by KWOK TszPiu (Bill), 26/3/2014
Short tandem repeats (STRs) • Repeating sequences of 2-6 base pairs of DNA • A type of Variable Number Tandem Repeat
Short tandem repeats (STRs) • Many genetic disorders are associated with STRs • E.g. Huntington’s disease • In the coding region of huntingtin • Expansion of the triplet repeat (CAG)n • n < 28: normal • n = 28–35: intermediate • n = 36–40: reduced penetrance • n > 40: full penetrance
Short tandem repeats (STRs) • Besides coding regions, STRs are found in untranslated regions, introns and promoters • E.g. Fragile-X syndrome • Second major cause of intellectual disability • (CGG) repeat in the 5’-UTR
Motivation • Reads longer than ~100 bp are hard to obtain, which makes important long STRs difficult to detect • E.g. (ATTCT)n with n = 800-4500 in SCA10 • With typical short reads (~100 bp), long STRs are hard to identify and their exact positions are hard to determine • STRs have several variants with many mutations • Spontaneous mutation rate of STRs = 3.78 × 10^-4 to 7.44 × 10^-2 in the human Y chromosome • Much higher than the average rate of de novo single-nucleotide variation, 1.18 × 10^-8 • Detecting various STRs is fundamental to the analysis of personal genomes
Motivation • Previous software cannot process billions of short reads quickly
Target • Efficiently sensing and locating long STRs with repeat units of 2-6 bases in personal genomes
Definition: • The basic unit of an STR should be minimal • E.g. the unit of ACACACAC is AC, not ACAC • Repeat unit representative: • Not a repeat of a shorter unit • The lexicographically first motif among all shifts (rotations) of the unit and of its reverse complement
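The representative defined on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function names (`reverse_complement`, `minimal_unit`, `representative`) are hypothetical and the input is assumed to be an uppercase string over A, C, G, T.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def minimal_unit(unit):
    """Shortest prefix whose whole-number repetition equals `unit` (ACAC -> AC)."""
    n = len(unit)
    for k in range(1, n + 1):
        if n % k == 0 and unit[:k] * (n // k) == unit:
            return unit[:k]


def representative(unit):
    """Lexicographically first rotation of the minimal unit or of its reverse complement."""
    u = minimal_unit(unit)
    rc = reverse_complement(u)
    rotations = [s[i:] + s[:i] for s in (u, rc) for i in range(len(u))]
    return min(rotations)


print(representative("ACAC"))   # AC
print(representative("TGGAA"))  # AATGG
```

With this convention, every shift of a motif and every shift of its reverse complement map to the same representative, so occurrences on either strand are counted together.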
Step 1: List approximate STRs in billions of short reads • Some units of an STR should be allowed to contain a few mutations • Listing approximate STRs is computationally intractable
Step 1: List approximate STRs in billions of short reads • Heuristic approach • Definition: • A repetition (STR) is a string of the form (p)^m q • p is a non-empty string, called the unit of the repetition • q is a prefix of p • E.g. (CAG)^3 CA = CAGCAGCAGCA • p = CAG, m = 3, q = CA
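As a small illustration of this definition, the sketch below splits a non-empty string into a unit p, a copy count m and a trailing prefix q, always choosing the shortest possible p. The function name `decompose` is hypothetical and the quadratic scan is chosen for clarity, not speed.

```python
def decompose(s):
    """Return (p, m, q) with s == p * m + q, p the shortest unit, q a proper prefix of p.

    Assumes s is non-empty. Any string trivially decomposes with p == s, m == 1, q == "".
    """
    n = len(s)
    for k in range(1, n + 1):
        # k is a period of s if every character equals the one k positions earlier.
        if all(s[i] == s[i - k] for i in range(k, n)):
            m = n // k
            return s[:k], m, s[m * k:]


print(decompose("CAGCAGCAGCA"))  # ('CAG', 3, 'CA')
```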
Step 1: List approximate STRs in billions of short reads • A repetition is maximal if it is not a proper substring of a repetition that has the same unit • E.g. • (CAG)^2 CA is a maximal repetition with unit CAG • The entire string is also a maximal repetition with unit (CAG)^2 CA
Step 1: List approximate STRs in billions of short reads • To identify all occurrences of approximate STRs: • Enumerate all maximal repetitions in a read using Main’s O(n log n)-time algorithm (1989) • For each maximal repetition Y, identify the minimum unit U such that U is not itself a repetition and Y is a concatenation of multiple occurrences of U followed by a prefix of U. For example, when Y = (CAG)^6 CA, U = CAG • Extend each maximal repetition in both directions by a greedy method to obtain an approximate repetition • CGCCCGCAGCGCAT(CAG)^6 CATCAGGGA • CGCCCGCAGC-GCAT(CAG)^6 CATCAGGGA • Allow 1 mismatch, deletion or insertion • If STRs overlap, remove the shorter one
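The first two bullets can be illustrated with a naive scan. This sketch is not Main's algorithm and does not perform the greedy approximate extension or the overlap removal; it simply tries each unit length from 2 to 6 and reports exact maximal repetitions whose unit is minimal. The function names and the `min_copies` threshold are assumptions made for the example.

```python
def minimal_unit(unit):
    """Shortest prefix whose whole-number repetition equals `unit`."""
    n = len(unit)
    for k in range(1, n + 1):
        if n % k == 0 and unit[:k] * (n // k) == unit:
            return unit[:k]


def maximal_repetitions(read, min_unit=2, max_unit=6, min_copies=2):
    """List exact maximal repetitions as (start, end, unit), with read[start:end] the repeat."""
    n = len(read)
    found = []
    for k in range(min_unit, max_unit + 1):
        i = 0
        while i + k < n:
            # Extend to the right while each base equals the base k positions earlier.
            j = i + k
            while j < n and read[j] == read[j - k]:
                j += 1
            unit = read[i:i + k]
            # Keep runs with enough copies whose unit is not a repeat of a shorter unit.
            if j - i >= min_copies * k and minimal_unit(unit) == unit:
                found.append((i, j, unit))
            # Bases j - k and j differ, so the next run cannot start before j - k + 1.
            i = j - k + 1
    return found


read = "CGCCCGCAGCGCAT" + "CAG" * 6 + "CATCAGGGA"
for start, end, unit in maximal_repetitions(read, min_copies=3):
    print(start, end, unit, read[start:end])
```

On the read from the slide, the scan recovers the exact repetition (CAG)^6 CA with unit CAG; the approximate extension described above would then try to grow it across the neighbouring CAT and CAG.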
Step 2: Sensing expanded STRs by frequency distributions • For every approximate STR found in the reads, generate the frequency distribution of its occurrences by length • The distributions suggested the presence of a long AGAGGC repeat in NA12877
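A frequency distribution of this kind can be tabulated as below. This is a sketch under assumptions: the input is taken to be (unit, length in bp) pairs already produced by Step 1, and the function name `length_distributions`, the bin size and the example values are hypothetical, not data from the paper.

```python
from collections import Counter, defaultdict


def length_distributions(occurrences, bin_size=10):
    """Map each repeat unit to a Counter of occurrence counts per length bin (bin start in bp)."""
    dist = defaultdict(Counter)
    for unit, length in occurrences:
        dist[unit][(length // bin_size) * bin_size] += 1
    return dist


# Hypothetical occurrences: (representative unit, STR length within a read)
occurrences = [("AGAGGC", 96), ("AGAGGC", 100), ("AGAGGC", 42), ("AC", 30)]
dist = length_distributions(occurrences)
for unit, counts in dist.items():
    print(unit, sorted(counts.items()))
```

An unusually large count in the longest bins (occurrences that fill a whole read) for one sample is the kind of signal the slide refers to.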
Step 3: Locating long expansions of STRs • Solvable if the flanking regions can be mapped uniquely • Otherwise, use the information in paired-end reads • If one end read is filled with an STR, test whether the other end can be mapped uniquely by BWA-MEM • Approximate the position of the STR from the mapped ends if its location can be sandwiched by paired-end reads
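The "sandwich" idea can be sketched as simple interval arithmetic. The code assumes the uniquely mapped mates have already been split into those lying upstream and those lying downstream of the repeat, and that only their reference coordinates are needed; the function name `sandwich_interval` and the coordinates are hypothetical, and the alignment step itself (BWA-MEM) is not shown.

```python
def sandwich_interval(upstream_anchor_ends, downstream_anchor_starts):
    """Approximate STR locus as the gap between the closest upstream and downstream anchors.

    upstream_anchor_ends: reference end positions of uniquely mapped mates left of the STR.
    downstream_anchor_starts: reference start positions of uniquely mapped mates right of it.
    Returns (left, right), or None if the anchors do not bracket a locus.
    """
    if not upstream_anchor_ends or not downstream_anchor_starts:
        return None
    left = max(upstream_anchor_ends)
    right = min(downstream_anchor_starts)
    return (left, right) if left <= right else None


print(sandwich_interval([1000, 1040], [1500, 1460]))  # (1040, 1460)
```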
Step 3: Locating long expansions of STRs • A third-generation sequencer is needed • SMRT™ sequencing can read ~5 kb on average • The accurate position of one end is known • Amplify the repeat region using PCR primers
Results (part 1) • Reproducibility of detecting STR expansions in independent biological replicates • Even when a 100-bp-long STR is present, the method might fail to detect it • Collected two independent replicates of NA12878 • DePristo et al. (2011) • Illumina’s platinum genome website
Results (part 1) • Identified 60 STRs with 100-bp occurrences in one replicate (n = 13) or both (n = 47) • Of the 13 STRs with no counts in one replicate, 12 had only one or two occurrences in the other
Results (part 1) • If an STR occurrence in the genome is short (e.g. 100 bp): • Failure rate is high (50% for 50X coverage) • Otherwise, results are essentially consistent between the two replicates
Results (part 2) • Target: • Find the known SCA31 STRs with no prior information • SCA31 is a well-characterized case • (AAAATAGAAT) repeat and (AATGG) repeat • In the introns of BEAN1 and TK2 • (AAAAT) in the reference genome
Results (part 2) • Resequenced the genome of a sample whose parent is an SCA31 case • Collected reads of NA12877, NA12878 and NA18507 • Applied the method to the SCA31 sample, NA12877, NA12878 and NA18507
Results (part 2) • Only one STR was detected: (AAAATAGAAT) • No significant difference for (AATGG) • The (AATGG) repeat is enriched in human centromeres
Results (part 2) • Using paired-end reads containing AAAATAGAAT repeats: • An insertion of ~2.5-3.8 kb was detected • Sequenced the repeat region in 11 SCA31 samples by SMRT™ sequencing
Results (part 2) • Previous studies showed: • i = 2, k = 10, l = 46; j and m undetermined • This study showed: • i = 1-2, j = 220-321, k = 9-13, l = 42-78, m = 90-118 • These ranges show the instability of STR expansions
Conclusion & Discussion • STRs in personal genomes remain largely uncharacterized • Proposed a method for listing long approximate STRs with mutations • Presented a procedure for detecting significant expansions of STRs • Applied the method to NA12878 replicates and showed its reproducibility • Applied it to 11 SCA31 samples and showed the instability of STR expansions