Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing


Presentation Transcript


  1. Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing • By Koichiro Doi, Taku Monjo, Pham H. Hoang, Jun Yoshimura, Hideaki Yurino, Jun Mitsui, Hiroyuki Ishiura, Yuji Takahashi, Yaeko Ichikawa, Jun Goto, Shoji Tsuji and Shinichi Morishita (The University of Tokyo) • Bioinformatics, November 2013 • Presented by KWOK Tsz Piu (Bill), 26/3/2014

  2. Short tandem repeats (STRs) • Repeating sequences of 2-6 base pairs of DNA • A type of Variable Number Tandem Repeat

  3. Short tandem repeats (STRs) • Many genetic disorders are associated with STRs • E.g. Huntington’s disease • In the coding region of huntingtin • Expansion of the triplet repeat (CAG)n • n < 28: normal • n = 28–35: intermediate • n = 36–40: reduced penetrance • n > 40: full penetrance
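
The thresholds on this slide amount to a simple lookup. A minimal sketch, assuming the cut-offs exactly as listed above (the function name `classify_cag` is hypothetical and not from the paper):

```python
def classify_cag(n: int) -> str:
    """Classify a huntingtin (CAG)n repeat count using the slide's cut-offs."""
    if n < 0:
        raise ValueError("repeat count must be non-negative")
    if n < 28:
        return "normal"
    if n <= 35:
        return "intermediate"
    if n <= 40:
        return "reduced penetrance"
    return "full penetrance"


if __name__ == "__main__":
    for n in (20, 30, 38, 45):
        print(n, classify_cag(n))
```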

  4. Short tandem repeats (STRs) • Besides coding regions, STRs are found in untranslated regions, introns and promoters • E.g. Fragile-X syndrome • Second major cause of intellectual disability • (CGG) repeat in the 5’-UTR

  5. Motivation • Short-read sequencers cannot produce reads much longer than ~100 bp, which makes important long STRs difficult to detect • E.g. (ATTCT)n, n = 800-4500 in SCA10 • With reads of ~100 bp, long STRs are hard to identify and their exact positions are hard to determine • STRs have several variants with many mutations • Spontaneous mutation rate of STRs = 3.78 × 10^-4 to 7.44 × 10^-2 in the human Y chromosome • Much higher than the average rate of de novo single-nucleotide variation, 1.18 × 10^-8 • Detecting various STRs is fundamental to the analysis of personal genomes
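
To put "much higher" into numbers, the two rates on this slide can be divided directly. This is only a rough comparison: it assumes both rates are expressed per generation, and the STR rate is per locus while the single-nucleotide rate is per base.

```latex
\[
\frac{3.78 \times 10^{-4}}{1.18 \times 10^{-8}} \approx 3.2 \times 10^{4},
\qquad
\frac{7.44 \times 10^{-2}}{1.18 \times 10^{-8}} \approx 6.3 \times 10^{6}
\]
```

Under those assumptions, the STR mutation rate is roughly four to seven orders of magnitude above the single-nucleotide rate.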

  6. Motivation • Previous software cannot process billions of short reads quickly

  7. Target • Efficiently sense and locate long STRs with repeat units of 2-6 bases in personal genomes

  8. Definition: • The basic unit of an STR should be minimized • E.g. the unit of ACACACAC is AC, not ACAC • Repeat unit representative: • Not a repeat of a shorter unit • The lexicographically first motif among all cyclic shifts of the unit and of its reverse complement
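
The definition on this slide can be sketched in a few lines. This is an illustrative sketch, not the authors' code; the helper names `minimal_unit`, `revcomp` and `representative` are hypothetical, and the input is assumed to contain only A, C, G and T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def minimal_unit(s: str) -> str:
    """Shortest prefix u such that s is u repeated a whole number of times."""
    n = len(s)
    for k in range(1, n + 1):
        if n % k == 0 and s[:k] * (n // k) == s:
            return s[:k]
    return s  # only reached for the empty string


def revcomp(s: str) -> str:
    """Reverse complement of a DNA string over A, C, G, T."""
    return s.translate(_COMPLEMENT)[::-1]


def representative(unit: str) -> str:
    """Lexicographically first cyclic shift of the minimal unit or its reverse complement."""
    u = minimal_unit(unit)
    candidates = [x[i:] + x[:i] for x in (u, revcomp(u)) for i in range(len(u))]
    return min(candidates)


if __name__ == "__main__":
    print(representative("ACAC"))  # unit reduced to AC first
    print(representative("CAG"))
```

With this convention, CAG, AGC, GCA and their reverse complements (CTG, TGC, GCT) all map to the same representative, so counts for one repeat are not split across equivalent spellings.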

  9. Step 1: List approximate STRs in billions of short reads • Some units of an STR should be allowed to contain a few mutations • Listing approximate STRs is computationally intractable

  10. Step 1: List approximate STRs in billions of short reads • Heuristic approach • Definition: • A repetition (STR) is a string of the form (p)^m q • p is a non-empty string, called the unit of the repetition • q is a prefix of p • E.g. (CAG)^3 CA = CAGCAGCAGCA • p = CAG, m = 3, q = CA
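
The definition can be checked mechanically. The sketch below builds a repetition from its parts and, going the other way, splits a string into unit, repeat count and trailing prefix for a given unit length. The function names `expand` and `decompose` are hypothetical.

```python
def expand(p: str, m: int, q: str) -> str:
    """Build the repetition: m copies of p followed by q, where q is a prefix of p."""
    if not p:
        raise ValueError("unit p must be non-empty")
    if not p.startswith(q):
        raise ValueError("q must be a prefix of p")
    return p * m + q


def decompose(s: str, k: int):
    """Return (p, m, q) if s is a repetition with unit length k, else None."""
    if k <= 0 or k > len(s):
        return None
    # s has period k exactly when every character equals the one k positions earlier.
    if any(s[i] != s[i - k] for i in range(k, len(s))):
        return None
    m = len(s) // k
    return s[:k], m, s[m * k:]


if __name__ == "__main__":
    print(expand("CAG", 3, "CA"))
    print(decompose("CAGCAGCAGCA", 3))
```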

  11. Step 1: List approximate STRs in billions of short reads • A repetition is maximal if it is not a proper substring of a repetition that has the same unit. • E.g. • (CAG)^2 CA is a maximal repetition with unit CAG • The entire string is also a maximal repetition with unit (CAG)^2 CA.
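
For a fixed unit length, maximal repetitions can be found with a simple left-to-right scan. The sketch below is a naive illustration of the definition, not Main's algorithm used in the paper. The function name `maximal_repetitions`, the requirement of at least two full units, and the example read are assumptions made here.

```python
def maximal_repetitions(s: str, k: int):
    """Half-open (start, end) intervals of maximal substrings with period k.

    Only runs of length >= 2 * k (at least two full units) are reported.
    """
    runs = []
    n = len(s)
    i = 0
    while i + k < n:
        if s[i] == s[i + k]:
            j = i
            # Advance while each character matches the one k positions ahead.
            while j + k < n and s[j] == s[j + k]:
                j += 1
            # s[i:j + k] has period k and cannot be extended in either direction.
            if j + k - i >= 2 * k:
                runs.append((i, j + k))
            i = j + 1
        else:
            i += 1
    return runs


if __name__ == "__main__":
    read = "TTCAGCAGCATT"  # made-up read containing (CAG)^2 CA
    for start, end in maximal_repetitions(read, 3):
        print(start, end, read[start:end])
```

Running the scan once for each unit length from 2 to 6 covers the STR units targeted in the talk, at a cost linear in the read length per unit length.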

  12. Step 1: List approximate STRs in billions of short reads • To identify all occurrences of approximate STRs: • Enumerate all maximal repetitions in a read using Main’s O(n log n)-time algorithm (1989) • For each maximal repetition Y, identify the minimum unit U such that U is not a repetition and Y is a concatenation of multiple occurrences of U and a prefix of U. For example, when Y = (CAG)^6 CA, U = CAG. • Extend each maximal repetition in both directions by a greedy method to obtain an approximate repetition • CGCCCGCAGCGCAT(CAG)^6 CATCAGGGA • CGCCCGCAGC-GCAT(CAG)^6 CATCAGGGA • Allow 1 mismatch/deletion/insertion • If STRs overlap, remove the shorter one
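
The greedy extension step is only outlined on the slide, so the following is one possible reading rather than the authors' procedure. It extends an exact repetition to the right: at each break it tries a single mismatch, insertion or deletion, and accepts the edit only if at least one further full unit then matches exactly. That acceptance rule, the tie-breaking order, and the names `match_len` and `extend_right` are all assumptions. Leftward extension would be symmetric, for example by reversing the read and the unit.

```python
def match_len(read: str, pos: int, unit: str, phase: int) -> int:
    """Number of characters of read[pos:] that match the periodic pattern of unit,
    where the first expected character is unit[phase]."""
    k = len(unit)
    n = 0
    while pos + n < len(read) and read[pos + n] == unit[(phase + n) % k]:
        n += 1
    return n


def extend_right(read: str, end: int, unit: str, phase: int) -> int:
    """Greedily extend a repetition ending at read[:end]; unit[phase] is the next
    expected character. Returns the new end position (assumed acceptance rule:
    one edit must be followed by at least len(unit) exactly matching characters)."""
    k = len(unit)
    # (read characters consumed, pattern characters consumed) for each edit type.
    edits = [(1, 1), (1, 0), (0, 1)]  # mismatch, insertion in read, deletion from read
    while end < len(read):
        best = None
        for d_read, d_unit in edits:
            m = match_len(read, end + d_read, unit, (phase + d_unit) % k)
            if best is None or m > best[0]:
                best = (m, d_read, d_unit)
        m, d_read, d_unit = best
        if m < k:
            break  # no single edit recovers a full unit, so stop extending
        end += d_read + m
        phase = (phase + d_unit + m) % k
    return end


if __name__ == "__main__":
    read = "CAG" * 6 + "CATCAGGGA"  # right-hand part of the slide's example
    # The exact repetition (CAG)^6 CA covers read[:20]; the next expected base is G.
    new_end = extend_right(read, 20, "CAG", 2)
    print(read[:new_end])
```

On the slide's example, this accepts the single mismatch in CAT, continues through the following CAG, and stops before GGA.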

  13. Step 1: List approximate STRs in billions of short reads

  14. Step 2: Sensing expanded STRs by frequency distributions • Generate frequency distributions of all approximate STRs in reads by their length • Suggested the presence of a long AGAGGC repeat in NA12877
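
A length-frequency table of this kind is straightforward to build once each read has been reduced to (unit, length) pairs. A minimal sketch, assuming that input format and a hypothetical 100-bp threshold for "read filled with an STR" (function names are illustrative):

```python
from collections import Counter, defaultdict


def length_distributions(str_calls):
    """Map each repeat unit to a Counter of {STR length: number of occurrences}."""
    dist = defaultdict(Counter)
    for unit, length in str_calls:
        dist[unit][length] += 1
    return dist


def long_str_counts(dist, min_len=100):
    """For each unit, count the occurrences whose length is at least min_len."""
    return {
        unit: sum(count for length, count in counter.items() if length >= min_len)
        for unit, counter in dist.items()
    }


if __name__ == "__main__":
    # Made-up calls for illustration only; not data from the paper.
    calls = [("AGAGGC", 100), ("AGAGGC", 100), ("AGAGGC", 36), ("AC", 24)]
    dist = length_distributions(calls)
    print(long_str_counts(dist))
```

Comparing such counts between a sample and controls is one way to flag units with an excess of read-filling occurrences.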

  15. Step 3: Locating long expansions of STRs • Solvable if flanking regions can be uniquely mapped • Otherwise, use the information from paired-end reads • If one end read is filled with an STR, test whether the other end can be mapped uniquely by BWA-MEM • Approximate the position of the STR from the mapped ends if its location can be sandwiched by paired-end reads
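
The "sandwich" idea can be sketched without running an aligner. The code below assumes the uniquely mapped mates have already been obtained (for example from BWA-MEM output) and that the library uses the usual forward/reverse paired-end orientation, so a forward-strand mate lies to the left of its STR-filled partner and a reverse-strand mate lies to the right. The `Anchor` type and `sandwich` function are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Anchor(NamedTuple):
    """Uniquely mapped mate of a read pair whose other end is filled with the STR."""
    start: int     # leftmost reference coordinate of the mapped mate
    end: int       # rightmost reference coordinate of the mapped mate
    forward: bool  # True if the mate maps to the forward strand


def sandwich(anchors: List[Anchor]) -> Optional[Tuple[int, int]]:
    """Estimate the interval containing the STR, or None if it is not bracketed."""
    left = [a.end for a in anchors if a.forward]          # mates upstream of the STR
    right = [a.start for a in anchors if not a.forward]   # mates downstream of the STR
    if not left or not right:
        return None  # the STR is anchored on one side only
    lo, hi = max(left), min(right)
    return (lo, hi) if lo <= hi else None


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    anchors = [
        Anchor(100, 200, True),
        Anchor(150, 250, True),
        Anchor(400, 500, False),
        Anchor(380, 480, False),
    ]
    print(sandwich(anchors))
```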

  16. Step 3: Locating long expansions of STRs • A third-generation sequencer is needed • SMRT™ sequencing can read ~5 kb on average • We know the accurate position for one end • Amplify the repeat region using PCR primers

  17. Results (part 1) • Reproducibility of detecting STR expansions for independent biological replicates • Even when a 100-bp-long STR is present, their method might fail to detect it • Collected two independent replicates of NA12878 • DePristo et al. (2011) • Illumina’s Platinum Genomes website

  18. Results (part 1) • Identified 60 STRs with 100-bp occurrences in one replicate (n = 13) or both replicates (n = 47) • Of the 13 STRs with no counts in one replicate, 12 had only one or two occurrences in the other

  19. Results (part 1) • If an STR occurrence in the genome is short (e.g. 100 bp), the failure rate is high (50% at 50X coverage) • Essentially consistent results for the two replicates

  20. Results (part 2) • Target: • Find the STRs below in an SCA31 sample with no prior information • SCA31 is a well-characterized case sample • (AAAATAGAAT) repeat and (AATGG) repeat • In the introns of BEAN1 and TK2 • (AAAAT) in the reference genome

  21. Results (part 2) • Resequenced the genome of a sample whose parent is an SCA31 case • Collected reads of NA12877, NA12878 and NA18507 • Applied their method to the SCA31 sample, NA12877, NA12878 and NA18507

  22. Results (part 2) • Only one STR is detected: (AAAATAGAAT) • No significant difference for (AATGG) • The (AATGG) repeat is enriched in human centromeres

  23. Results (part 2) • By using paired-end reads with AAAATAGAAT repeats • ~2.5-3.8 kb insertion detected • Sequenced the repeat region in 11 SCA31 samples by SMRT™ sequencing

  24. Results (part 2) • Previous studies showed: • i = 2, k = 10, l = 46, j and m undetermined • Their study showed: • i = 1-2, j = 220-321, k = 9-13, l = 42-78, m = 90-118 • This indicates the instability of STR expansions

  25. Conclusion & Discussion • STRs in personal genomes remain largely uncharacterized • Proposed a method for listing long approximate STRs with mutations • Presented a procedure for detecting significant expansions of STRs • Applied the method to NA12878 and showed its reproducibility • Applied it to 11 SCA31 samples and showed the instability of STR expansions

  26. Thanks!
