Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing

By Koichiro Doi, Taku Monjo, Pham H. Hoang, Jun Yoshimura, Hideaki Yurino, Jun Mitsui, Hiroyuki Ishiura, Yuji Takahashi, Yaeko Ichikawa, Jun Goto, Shoji Tsuji and Shinichi Morishita (The University of Tokyo) • Bioinformatics, November 2013 • Presented by KWOK Tsz Piu (Bill), 26/3/2014

Short tandem repeats (STRs) • Repeating sequences of 2-6 base pairs of DNA • A type of Variable Number Tandem Repeat

Short tandem repeats (STRs) • Many genetic disorders are associated with STRs • E.g. Huntington’s disease • Located in the coding region of the huntingtin gene • Expansion of the triplet repeat (CAG)n • n < 28: normal • n = 28–35: intermediate • n = 36–40: reduced penetrance • n > 40: full penetrance

Short tandem repeats (STRs) • Besides coding regions, STRs are found in untranslated regions, introns and promoters • E.g. Fragile-X syndrome • Second major cause of intellectual disability • (CGG) repeat in the 5’-UTR

Motivation • Reads much longer than 100 bp are often infeasible to obtain, which makes important STRs difficult to detect • E.g. (ATTCT)n, n = 800–4500 in SCA10 • With typical short reads (~100 bp), long STRs are hard to identify and their accurate positions are hard to determine • STRs have several variants with many mutations • Spontaneous mutation rate of STRs = 3.78 × 10^-4 to 7.44 × 10^-2 in the human Y chromosome • Much higher than the average rate of de novo single-nucleotide variation, 1.18 × 10^-8 • Detecting various STRs is fundamental to the analysis of personal genomes

Motivation • Previous software cannot process billions of short reads quickly

Target • Efficiently sensing and locating long STRs with repeat units of 2–6 bases in personal genomes

Definition: • The basic unit of an STR should be minimized • E.g. the unit of ACACACAC is AC, not ACAC • Repeat unit representative: • Not a repeat of a shorter unit • Lexicographically first motif among all possible shifts of the unit and of its reverse complement
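The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (revcomp, minimal_unit, representative) are hypothetical, the alphabet is assumed to be plain A/C/G/T, and "shifts" is read as cyclic rotations of the unit.

```python
# Hypothetical helper names; not taken from the paper or its software.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string over A/C/G/T."""
    return seq.translate(_COMPLEMENT)[::-1]

def minimal_unit(unit):
    """Shortest prefix whose whole-number repetition gives `unit` (ACAC -> AC)."""
    n = len(unit)
    for k in range(1, n + 1):
        if n % k == 0 and unit[:k] * (n // k) == unit:
            return unit[:k]
    return unit  # only reached for the empty string

def representative(unit):
    """Lexicographically first rotation of the minimal unit or its reverse complement."""
    u = minimal_unit(unit)
    candidates = [s[i:] + s[:i] for s in (u, revcomp(u)) for i in range(len(s))]
    return min(candidates)

print(representative("ACAC"))  # AC
print(representative("CAG"))   # AGC
print(representative("CTG"))   # AGC (same repeat read from the other strand)
```

With this convention CAG and CTG repeats collapse to the same key, so counts from both strands can be pooled.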

Step 1: List approximate STRs in billions of short reads • Some units of an STR should be allowed to contain a few mutations • Listing approximate STRs is computationally intractable

Step 1: List approximate STRs in billions of short reads • Heuristic approach • Definition: • A repetition (STR) is a string of the form • (p)^m q • p is a non-empty string, called the unit of the repetition • q is a prefix of p • E.g. (CAG)^3 CA = CAGCAGCAGCA • p = CAG, m = 3, q = CA
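A minimal sketch of this definition, assuming the repetition is written as m copies of p followed by q. The function names build_repetition and decompose are hypothetical and serve only to make the notation concrete.

```python
def build_repetition(p, m, q):
    """Return m copies of unit p followed by q, where q must be a prefix of p."""
    if not p:
        raise ValueError("the unit p must be non-empty")
    if not p.startswith(q):
        raise ValueError("q must be a prefix of p")
    return p * m + q

def decompose(s, unit_len):
    """Return (p, m, q) if s is a repetition with a unit of length unit_len, else None."""
    if unit_len <= 0 or unit_len > len(s):
        return None
    p = s[:unit_len]
    m, rest = divmod(len(s), unit_len)
    q = p[:rest]
    return (p, m, q) if s == p * m + q else None

print(build_repetition("CAG", 3, "CA"))  # CAGCAGCAGCA
print(decompose("CAGCAGCAGCA", 3))       # ('CAG', 3, 'CA')
print(decompose("CAGCATCAG", 3))         # None
```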

Step 1: List approximate STRs in billions of short reads • A repetition is maximal if it is not a proper substring of a repetition that has the same unit • E.g. • (CAG)^2 CA is a maximal repetition with unit CAG • The entire string is also a maximal repetition with unit (CAG)^2 CA

Step 1: List approximate STRs in billions of short reads • To identify all occurrences of approximate STRs: • Enumerate all maximal repetitions in a read using Main’s O(n log n)-time algorithm (1989) • For each maximal repetition Y, identify the minimum unit U such that U is not a repetition and Y is a concatenation of multiple occurrences of U and a prefix of U. For example, when Y = (CAG)^6 CA, U = CAG • Extend each maximal repetition into an approximate repetition in both directions by a greedy method • CGCCCGCAGCGCAT(CAG)^6 CATCAGGGA • CGCCCGCAGC-GCAT(CAG)^6 CATCAGGGA • Allow 1 mismatch/deletion/insertion • Remove overlapping STRs, if any (drop the shorter one)
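The first two bullets can be illustrated with a deliberately naive sketch. It does not implement Main's algorithm: it scans each unit length from 2 to 6 separately, which is slower but easier to follow. It also covers only exact repetitions; the greedy extension with mismatches and indels is left out. All function names and the min_copies threshold are assumptions made for this example.

```python
def minimum_unit(y):
    """Shortest U such that y is copies of U followed by a prefix of U."""
    for k in range(1, len(y) + 1):
        if all(y[t] == y[t - k] for t in range(k, len(y))):
            return y[:k]
    return y  # only reached for the empty string

def maximal_repetitions(read, min_unit=2, max_unit=6, min_copies=2):
    """Naive scan for maximal exact repetitions; returns (start, end, period) tuples."""
    n = len(read)
    found = []
    for p in range(min_unit, max_unit + 1):
        i = 0
        while i + p < n:
            if read[i] != read[i + p]:
                i += 1
                continue
            j = i
            while j + p < n and read[j] == read[j + p]:
                j += 1
            # read[i:j + p] has period p and cannot be extended on either side
            if j + p - i >= p * min_copies:
                found.append((i, j + p, p))
            i = j + 1
    return found

def exact_strs(read):
    """Maximal repetitions reduced to their minimum unit, units of 2-6 bases only."""
    hits = set()
    for start, end, _ in maximal_repetitions(read):
        unit = minimum_unit(read[start:end])
        if 2 <= len(unit) <= 6:
            hits.add((start, end, unit))
    return sorted(hits)

read = "GT" + "CAG" * 6 + "CAT"
print(exact_strs(read))  # [(2, 22, 'CAG')]
```

The period-6 scan reports the same interval as the period-3 scan here; reducing each hit to its minimum unit and collecting the results in a set removes that duplicate.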


Step 2: Sensing expanded STRs by frequency distributions • Generate frequency distributions of all approximate STRs in reads by their length • The distributions suggested the presence of a long AGAGGC repeat in NA12877
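As a rough sketch of this step, the per-unit length distributions can be built with a counter. The input format (pairs of repeat unit and STR length in bases), the function names, and the idea of flagging units whose occurrences fill a whole 100-bp read are assumptions for illustration; the example calls are made-up toy data, not results from the paper.

```python
from collections import Counter, defaultdict

def length_distributions(str_calls):
    """Map each repeat unit to a Counter of {STR length in bases: number of reads}."""
    dist = defaultdict(Counter)
    for unit, length in str_calls:
        dist[unit][length] += 1
    return dist

def read_filling_counts(dist, read_len=100):
    """Units with at least one STR occurrence as long as a whole read."""
    return {unit: c[read_len] for unit, c in dist.items() if c[read_len] > 0}

# Toy input: (unit, length) pairs as they might come out of Step 1.
calls = [("AGAGGC", 100), ("AGAGGC", 100), ("AGAGGC", 36), ("AC", 24), ("AC", 30)]
dist = length_distributions(calls)
print(read_filling_counts(dist))  # {'AGAGGC': 2}
```

A unit that piles up at the full read length hints at an STR longer than a read, which is the signal the slide describes.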

Step 3: Locating long expansions of STRs • Solvable if the flanking regions can be uniquely mapped • Otherwise, use the information in paired-end reads • If one end read is filled with an STR, test whether the other end can be mapped uniquely by BWA-MEM • Approximate the position of the STR from the mapped ends if its location can be sandwiched between paired-end reads
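The sandwiching idea can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes the uniquely mapped mates have already been obtained (for example from an aligner's output), that the library has the usual inward-facing orientation so a forward-strand mate lies to the left of the STR and a reverse-strand mate to its right, and the Anchor class and sandwich function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Anchor:
    """Uniquely mapped mate of a read pair whose other end is filled with the STR."""
    chrom: str
    start: int     # leftmost reference coordinate of the mapped mate
    end: int       # rightmost reference coordinate of the mapped mate
    forward: bool  # True if the mate maps to the forward strand

def sandwich(anchors, chrom):
    """Approximate interval containing the STR, or None if it cannot be bracketed."""
    left = [a.end for a in anchors if a.chrom == chrom and a.forward]
    right = [a.start for a in anchors if a.chrom == chrom and not a.forward]
    if not left or not right:
        return None  # anchors on one side only: the STR is not sandwiched
    lo, hi = max(left), min(right)
    return (lo, hi) if lo <= hi else None

# Toy coordinates for illustration only.
anchors = [
    Anchor("chr16", 1000, 1100, True),
    Anchor("chr16", 1150, 1250, True),
    Anchor("chr16", 1300, 1400, False),
    Anchor("chr16", 1400, 1500, False),
]
print(sandwich(anchors, "chr16"))  # (1250, 1300)
```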

Step 3: Locating long expansions of STRs • A third-generation sequencer is needed • SMRT™ sequencing can read ~5 kb on average • The accurate position is known for one end • Amplify the repeat region using PCR primers

Results (part 1) • Reproducibility of detecting STR expansions in independent biological replicates • Even when a 100-bp-long STR is present, the method might fail to detect it • Collected two independent replicates of NA12878 • DePristo et al. (2011) • Illumina’s platinum genome website

Results (part 1) • Identified 60 STRs with 100-bp occurrences in one (n = 13) or both (n = 47) replicates • Of the 13 STRs with no counts in one replicate, 12 had only one or two occurrences in the other

Results (part 1) • If an STR occurrence in the genome is short (e.g. 100 bp) • The failure rate is high (50% for 50X coverage) • Otherwise, the results for the two replicates are essentially consistent

Results (part 2) • Target: • Find the STRs of an SCA31 sample with no prior information • SCA31 is a well-characterized case sample • (AAAATAGAAT) repeat and (AATGG) repeat • In the introns of BEAN1 and TK2 • (AAAAT) in the reference genome

Results (part 2) • Resequenced the genome of a sample whose parent is a case of SCA31 • Collected reads of NA12877, NA12878 and NA18507 • Applied the method to SCA31, NA12877, NA12878 and NA18507

Results (part 2) • Only one STR was detected: (AAAATAGAAT) • No significant difference for (AATGG) • The (AATGG) repeat is enriched in human centromeres

Results (part 2) • By using paired-end reads with AAAATAGAAT repeats • An insertion of ~2.5–3.8 kb was detected • Sequenced the repeat region in 11 SCA31 samples by SMRT™ sequencing

Results (part 2) • Previous studies showed: • i = 2, k = 10, l = 46, j and m undetermined • Their study showed: • i = 1–2, j = 220–321, k = 9–13, l = 42–78, m = 90–118 • This variation indicates the instability of STR expansions

Conclusion & Discussion • STRs in personal genomes remain largely uncharacterized • Proposed a method for listing long approximate STRs with mutations • Presented a procedure for detecting significant expansions of STRs • Applied the method to NA12878 replicates and showed its reproducibility • Applied it to 11 SCA31 samples and showed the instability of STR expansions

Thanks!
