CAP5510 – Bioinformatics Sequence Assembly

Tamer Kahveci, CISE Department, University of Florida

What is Sequence Assembly?
• We can only sequence short fragments (100 – 500 bases).
• How can we sequence long sequences (e.g., a single chromosome can have hundreds of millions of bases)?
  • Chop the long sequence into many small fragments.
  • Sequence all fragments.
  • Put them together to reconstruct the long sequence.
• Problem: Consider a long sequence S. Given a collection R = {r1, r2, …, rn} of subsequences (aka fragments or reads) of S, construct S from R.

Sequence Assembly
• Coverage: average number of reads in R containing a base in S.
• Issues:
  • Errors in R
  • Repeats in S
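The coverage definition above can be made concrete with a small sketch. It assumes the placement of each read on S is already known, which is only true in a simulation or after assembly; the function names `coverage` and `per_base_depth` are illustrative and do not come from the slides.

```python
def coverage(read_lengths, genome_length):
    """Average number of reads containing a base: total read bases / |S|."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(read_lengths) / genome_length


def per_base_depth(genome_length, placements):
    """Count, for every position of S, how many reads contain it.

    placements is a list of (start, length) pairs, 0-based, one per read.
    Known placements are an assumption made for illustration only.
    """
    depth = [0] * genome_length
    for start, length in placements:
        for i in range(start, min(start + length, genome_length)):
            depth[i] += 1
    return depth


placements = [(0, 4), (2, 4), (1, 4)]
depth = per_base_depth(6, placements)
average = coverage([length for _, length in placements], 6)
```

The average of the per-base depths equals the coverage value, while the individual depths show how unevenly the reads may be spread over S.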

Assemblers
• De novo: No prior knowledge of S.
  • Slow
  • Phusion (Mullikin & Ning 2003)
  • Arachne (Batzoglou et al. 2002)
  • CAP (Huang & Madan, 1992)
• Mapping: A sequence similar to S is known.
  • Needs prior knowledge of S.
  • SHRiMP (Rumble et al. 2009)


Phusion (Mullikin & Ning 2003)
• Clipping: Remove low-quality reads, clip ends.
• Clustering: Group similar reads together.
  • Create a histogram of k-mers (k = 17).
  • Remove repetitive ones (13 or more occurrences).
  • Keep a list for each k-mer showing the reads that contain it.
  • Find all pairs of reads sharing at least one k-mer.
  • Keep the number of common k-mers for each such pair.
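The clustering steps above can be sketched in a few lines. This is a minimal illustration, not Phusion's actual implementation: it ignores reverse complements and sequencing errors, and the names `kmers`, `shared_kmer_counts`, and `max_occurrences` are hypothetical. The default `max_occurrences=12` reflects the slide's rule of removing k-mers seen 13 or more times.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kmers(read, k):
    """All length-k substrings of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def shared_kmer_counts(reads, k=17, max_occurrences=12):
    """Map each pair of read ids to the number of distinct k-mers they share."""
    # Histogram of k-mer occurrences over all reads.
    histogram = Counter(km for read in reads for km in kmers(read, k))

    # For each non-repetitive k-mer, the set of reads that contain it.
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for km in kmers(read, k):
            if histogram[km] <= max_occurrences:
                index[km].add(read_id)

    # Every pair of reads listed under the same k-mer shares that k-mer.
    pair_counts = Counter()
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


reads = ["ACGTAC", "GTACGG", "TTTTTT"]
pairs = shared_kmer_counts(reads, k=4, max_occurrences=2)
```

Pairs with a nonzero count are the candidates for grouping into the same cluster; the count itself can serve as a rough similarity score.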

Phusion (Mullikin & Ning 2003)
• Clipping: Remove low-quality reads, clip ends.
• Clustering: Group similar reads together.
• Assemble each cluster into a contig.
  • Given a pair of reads, extend their matching k-mers.
• Join overlapping contigs.
  • If two contigs share a read, try to put them together into a longer contig by splicing them first.
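As a rough sketch of "extend their matching k-mers", the function below uses one shared k-mer as a seed, derives the offset it implies between the two reads, and joins them if the whole overlapping region agrees. It assumes error-free reads and exact matching, and it only tries the first occurrence of each k-mer in the left read, so it is a simplification rather than the method Phusion uses. The name `merge_by_seed` is hypothetical.

```python
def merge_by_seed(a, b, k):
    """Join read b onto read a using a shared k-mer as a seed.

    Returns the merged sequence, or None if no seed gives a consistent overlap.
    Only placements where b starts at or after the start of a are tried;
    a caller can swap the arguments to test the other order.
    """
    # First position of every k-mer in a.
    positions = {}
    for i in range(len(a) - k + 1):
        positions.setdefault(a[i:i + k], i)

    for j in range(len(b) - k + 1):
        i = positions.get(b[j:j + k])
        if i is None:
            continue
        offset = i - j  # position in a where b[0] would align
        if offset < 0:
            continue
        overlap = min(len(a) - offset, len(b))
        # Extend the seed: the entire overlapping region must match.
        if a[offset:offset + overlap] == b[:overlap]:
            return a + b[overlap:]
    return None


contig = merge_by_seed("ACGTACGGA", "ACGGATTC", k=4)
```

Repeating this over the read pairs in a cluster, ordered by their shared k-mer counts, would grow a contig step by step.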

Euler (Pevzner et al. 2001)
• Clipping: Remove low-quality reads, clip ends.
• Clustering: Group similar reads together.
• Assemble each cluster into a contig.
  • Create a de Bruijn graph.
    • Each node is a k-mer.
    • A directed edge indicates a dovetail overlap of k-1 positions.
  • Find an Eulerian path on this graph (visit each edge once) – polynomial
  • Not a Hamiltonian path (visit each vertex once) – NP-complete
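The graph construction and path search above can be sketched as follows. Nodes are k-mers, as on the slide. As an assumption of this sketch, an edge is added only when the k-1 overlap is witnessed by a (k+1)-mer that occurs in some read, and each distinct (k+1)-mer contributes one edge. The path is found with Hierholzer's algorithm, a standard linear-time method that the slides do not name. Reads are assumed error-free, and the function names are illustrative.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Edges between k-mer nodes, one per distinct (k+1)-mer seen in the reads."""
    seen = set()
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k):
            word = read[i:i + k + 1]
            if word not in seen:
                seen.add(word)
                graph[word[:-1]].append(word[1:])  # prefix k-mer -> suffix k-mer
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists in the graph."""
    if not graph:
        return []
    in_degree = defaultdict(int)
    for successors in graph.values():
        for v in successors:
            in_degree[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = min(graph)
    for u, successors in graph.items():
        if len(successors) - in_degree.get(u, 0) == 1:
            start = u
            break
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit node
    return path[::-1]


def spell(path):
    """Sequence spelled by a path of k-mers overlapping by k-1 positions."""
    return path[0] + "".join(node[-1] for node in path[1:]) if path else ""


reads = ["ATGGCG", "GCGTGC", "GTGCA"]
path = eulerian_path(de_bruijn_graph(reads, k=3))
assembled = spell(path)
```

Each edge is used exactly once, so the search runs in time linear in the number of edges. The sketch does not check that the graph is connected or that an Eulerian path exists; with repeats in S several Eulerian paths may exist, and this code returns only one of them.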
