HKU CS Bioinformatics Research

Siu Ming Yiu, Department of Computer Science, The University of Hong Kong
Other faculty members: Prof. Francis Chin, Prof. TW Lam, Dr HF Ting

Impact of bioinformatics
• Medical research: e.g. finding a cancer-causing gene?
• Biological research: e.g. can we make rice grow faster?
• Environmental study: e.g. how to remove harmful bacteria?
• Biofuel: e.g. how do bacteria digest food to produce energy?
• Huge volume of data: e.g. human genome: 3G long; medical study: 100 persons; e.g. human gut contains 1000+ bacteria (data: 500G) → obesity

The de novo assembly problem (single genome)
Given an unknown genome, Genome X: NO existing technology is able to read out its DNA sequence (ACCG…..) in one piece, as the sequence is too long (e.g. human = 3 billion long; even bacterial genomes are about 10K to several million long).
What can we do? High-throughput sequencing technology (next generation sequencing, NGS): multiple copies of Genome X go into a DNA sequencing machine. [Inside the machine, the genomes are randomly cut into short fragments (reads); the machine can read out the DNA sequence of the reads.]
Example reads: ACCG, GTCG, CTTG, AACG, CTCG, GTCG, CTAG, CAAG, GGAG, GTTG

Bad news
• The reads are really short: 100-150 bp (cf. the genome of a bacterium: 10K to several million).
• They are mixed together (no idea which part of the genome each read comes from!!).
• There are errors in the reads. [AACCGTTC => AACGGTCC]
The (de novo) assembly problem: can we reconstruct the original genome from the reads?
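To make the three difficulties concrete, here is a minimal Python sketch that simulates a sequencing run on a toy genome. It is only an illustration: the function name `simulate_reads`, the uniform choice of start positions, the substitution-only error model and the toy genome are all assumptions, not a description of how a real sequencing machine behaves.

```python
import random

def simulate_reads(genome, read_len, depth, error_rate=0.0, seed=0):
    """Hypothetical simulator: cut random short reads, add errors, mix them up."""
    rng = random.Random(seed)
    n_reads = len(genome) * depth // read_len
    reads = []
    for _ in range(n_reads):
        # Short reads: each read is a random window of the genome.
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        # Errors: each base is replaced by a different base with prob. error_rate.
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    # Mixed together: the start positions are thrown away and the order shuffled.
    rng.shuffle(reads)
    return reads

genome = "AACGACGTGTACCTCAGT"   # toy genome, assumed for illustration
clean_reads = simulate_reads(genome, read_len=9, depth=5)
noisy_reads = simulate_reads(genome, read_len=9, depth=5, error_rate=1.0)
```

With `error_rate=0.0` every read is an exact substring of the genome; any positive rate breaks that property, which is what makes the assembly problem hard.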

Data volume: HUGE!!
Take the human genome as an example. The genome is 3 × 10^9 (3 billion) long. Recall: multiple copies are cut (fragmented), so any position of the genome may be covered by several reads. The average number of reads covering each position of a genome is referred to as the depth of the sequencing.
For depth = 30 and reads of length 100, # of reads: (3 × 10^9 × 30)/100 ≈ 10^9.
Note that they are mixed together, with no ordering information.
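The read count on this slide can be checked with a few lines of Python. The function name `number_of_reads` is made up for this sketch; the read length of 100 is the value used in the slide's own formula.

```python
def number_of_reads(genome_len, depth, read_len):
    """Reads needed so that each position is covered `depth` times on average."""
    return genome_len * depth // read_len

n = number_of_reads(genome_len=3 * 10**9, depth=30, read_len=100)
print(n)  # 900000000, i.e. roughly 10^9 reads
```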

Good news
There are some clues inside the reads: the reads are overlapping!
Unknown genome: AACCGGTTGTCACGTTCCACTTGGCC………
Reads: AACCGGTTG, CCGGTTGTC, GGTTGTCAC, TGTCACGTT, CGGTTGTCA, ACCGGTTGT, TTGTCACGT, GTTGTCACG
Ideal case: every position has at least one read and there are no errors in the reads, then…. [But the reality…. is a lot worse]
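In the ideal case the overlaps alone are enough to put the reads back in order. The sketch below is a toy illustration, not a real assembler: it assumes error-free reads of equal length, a read starting at every position (so consecutive reads overlap in all but one base) and no repeated patterns. The function name `chain_reads` is hypothetical.

```python
def chain_reads(reads):
    """Ideal case only: chain reads that overlap in all but one base."""
    overlap = len(reads[0]) - 1
    by_prefix = {r[:overlap]: r for r in reads}
    suffixes = {r[-overlap:] for r in reads}
    # The first read is the only one whose prefix is not the suffix of another read.
    starts = [r for r in reads if r[:overlap] not in suffixes]
    if len(starts) != 1:
        raise ValueError("not the ideal case: no unique starting read")
    current = starts[0]
    sequence = current
    # Follow suffix -> prefix matches; each step adds one new base.
    while current[-overlap:] in by_prefix:
        current = by_prefix[current[-overlap:]]
        sequence += current[-1]
    return sequence

# The eight reads from the slide, in their mixed-up order.
reads = ["AACCGGTTG", "CCGGTTGTC", "GGTTGTCAC", "TGTCACGTT",
         "CGGTTGTCA", "ACCGGTTGT", "TTGTCACGT", "GTTGTCACG"]
print(chain_reads(reads))  # AACCGGTTGTCACGTT
```

A single wrong base or a missing read breaks the exact suffix-to-prefix match, so this simple chaining stops working in the realistic setting.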

Unknown genome: AACCGGTTGTCACGTTCCACTTGGCC………
Reads: AACCGCTTG, CCGGTTGTC, GGTTGTCAC, TGTCACGTT, CGGTTGTCA, ACCGGTTTT, TTGTCACGT, GTTGTCACG
The reality: (a) There are errors in the reads (here AACCGCTTG and ACCGGTTTT each contain a wrong base), so it is not easy to locate the next read! (b) At some positions, we may have no reads.

Publications
• Journals: Bioinformatics (impact factor: 5.323), BMC Genomics (impact factor: 4.4), PLoS ONE (impact factor: 3.73), BMC Bioinformatics (impact factor: 3.02), Journal of Computational Biology (impact factor: 1.56), IEEE/ACM TCBB (impact factor: 1.54), ……
• Top conferences: RECOMB, ISMB, ECCB
• Nature papers with our collaborators
• HKU-BGI research center: BGI (Shenzhen) is the largest genomic center in the world
• Other international collaborators: JGI, Department of Energy, US (biofuel); SickKids hospital, Canada (diabetes); CAS-MPG PICB, Shanghai (C4 Rice project); UC San Francisco (optical mapping data analysis); NUS, Singapore (RNA study); ….

How do we solve the problem?
A few general approaches: string graph, de Bruijn graph, …
Genome: …. A C G T G T A C C T C …….
Idea: we still make use of the overlapping parts of the reads to connect them together. We do not need reads at every position.
Graph: Vertex: a k-mer (k consecutive nucleotides in a read). Edge: two k-mers that appear consecutively in a read.
Example: read GTGTACCTC (k = 4) gives GTGT → TGTA → GTAC → TACC → ACCT → CCTC
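The vertex and edge definitions above can be written down directly. The following is a minimal sketch, assuming the graph is stored as a dictionary from each k-mer to the set of k-mers that follow it; the names `kmers` and `add_read` are chosen for this example only.

```python
def kmers(read, k):
    """All k consecutive nucleotides in a read, from left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def add_read(graph, read, k):
    """Add one read to the graph: k-mers are vertices, consecutive k-mers are edges."""
    ks = kmers(read, k)
    for vertex in ks:
        graph.setdefault(vertex, set())
    for a, b in zip(ks, ks[1:]):
        graph[a].add(b)
    return graph

graph = add_read({}, "GTGTACCTC", k=4)
print(kmers("GTGTACCTC", 4))
# ['GTGT', 'TGTA', 'GTAC', 'TACC', 'ACCT', 'CCTC']
```

Calling `add_read` once per read merges identical k-mers from different reads into the same vertex, which is how overlapping reads end up connected.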

Ideal case
• No errors
• Reads at every position
• The graph can be read out as one single path, which will be the genome!
Genome: AACGACGTGTACCTCAGT
Reads (len = 9): AACGACGTG, ACGACGTGT, CGACGTGTA, GACGTGTAC, ACGTGTACC, CGTGTACCT, GTGTACCTC, TGTACCTCA, GTACCTCAG, TACCTCAGT
Graph vertices (4-mers): GTGT, AACG, CGAC, ACGT, GTAC, CGTG, ACGA, GACG, TGTA, CCTC, TCAG, TACC, CAGT, CTCA, ACCT
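For this toy example the whole procedure fits in a short Python sketch: build the graph from the ten reads with k = 4, then walk the single path. This is an illustration under the ideal-case assumptions only; the function names `build_graph` and `read_out_path` are hypothetical, and the walk deliberately refuses to continue at a branch or a cycle.

```python
def build_graph(reads, k):
    """k-mers are vertices; two k-mers consecutive in a read are joined by an edge."""
    graph = {}
    for read in reads:
        ks = [read[i:i + k] for i in range(len(read) - k + 1)]
        for vertex in ks:
            graph.setdefault(vertex, set())
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)
    return graph

def read_out_path(graph):
    """Spell the sequence along the path; valid only if the graph is one simple path."""
    has_predecessor = {v for successors in graph.values() for v in successors}
    starts = [v for v in graph if v not in has_predecessor]
    if len(starts) != 1:
        raise ValueError("expected exactly one start vertex")
    vertex = starts[0]
    sequence, visited = vertex, {vertex}
    while graph[vertex]:
        if len(graph[vertex]) != 1:
            raise ValueError("branch found: not the ideal case")
        (vertex,) = graph[vertex]
        if vertex in visited:
            raise ValueError("cycle found: not the ideal case")
        visited.add(vertex)
        sequence += vertex[-1]      # each step contributes one new base
    return sequence

# The ten reads of length 9 from the slide, deliberately out of order.
reads = ["GTGTACCTC", "AACGACGTG", "TACCTCAGT", "CGACGTGTA", "ACGTGTACC",
         "GTACCTCAG", "ACGACGTGT", "TGTACCTCA", "GACGTGTAC", "CGTGTACCT"]
print(read_out_path(build_graph(reads, k=4)))  # AACGACGTGTACCTCAGT
```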

Note: even if a few reads are missing, we are still OK! Can anyone see that the number of reads that can be missing depends on the value of k (used when constructing the graph)?
Genome, reads (len = 9) and graph vertices: the same as in the ideal case (genome AACGACGTGTACCTCAGT).
Q: to allow more missing reads, is a larger or a smaller k better?
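One way to explore the question is a small experiment: remove a run of consecutive reads and check whether every edge of the genome's path is still present in the graph. The sketch below is an assumption-laden illustration (the helper names `edges`, `still_connected` and `drop` are made up, and only runs of reads in the middle of the toy genome are tested).

```python
def edges(seqs, k):
    """Edges of the graph as (k+1)-mers: each one is a pair of consecutive k-mers."""
    return {s[i:i + k + 1] for s in seqs for i in range(len(s) - k)}

def still_connected(genome, reads, k):
    """True if the reads still supply every edge on the genome's path."""
    return edges([genome], k) <= edges(reads, k)

def drop(reads, first, count):
    """Remove `count` consecutive reads, starting at index `first`."""
    return reads[:first] + reads[first + count:]

genome = "AACGACGTGTACCTCAGT"
reads = [genome[i:i + 9] for i in range(len(genome) - 9 + 1)]  # the 10 reads

print(still_connected(genome, drop(reads, 3, 4), k=4))  # True
print(still_connected(genome, drop(reads, 3, 5), k=4))  # False
print(still_connected(genome, drop(reads, 3, 1), k=7))  # True
print(still_connected(genome, drop(reads, 3, 2), k=7))  # False
```

In this toy setting a read of length 9 contributes 9 − k edges, so with k = 4 the path survives four consecutive missing reads, while with k = 7 it survives only one. That suggests a smaller k tolerates more missing reads, at the price that short k-mers are more likely to repeat within the genome.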

Real case is more complicated: even with no errors, some patterns in a genome may repeat! In reality, we can seldom construct the whole genome in one piece; we stop at junctions, resulting in a set of contigs.
Contig: a maximal path without branches.
Genome: AACGACGTGTACCTCAGT, with reads (len = 9) as in the ideal case.
[Figure: part of the graph with vertices CGAC, GACG, ACGT, CGTG and CGTC; the vertices CGAC, GACG, ACGT form the contig CGACGT.]
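The definition of a contig as a maximal path without branches can be sketched in Python as follows. To force a repeat in the toy genome, the example uses k = 3, which is an assumption made for this illustration (the 3-mer ACG occurs twice in AACGACGTGTACCTCAGT). The function names are hypothetical, and isolated cycles, which a full implementation would also report, are not handled.

```python
def build_graph(seqs, k):
    """Successor and predecessor sets for every k-mer seen in the sequences."""
    succ, pred = {}, {}
    for s in seqs:
        for i in range(len(s) - k):
            a, b = s[i:i + k], s[i + 1:i + k + 1]
            succ.setdefault(a, set()).add(b)
            pred.setdefault(b, set()).add(a)
    return succ, pred

def contigs(succ, pred):
    """Spell every maximal non-branching path (isolated cycles are ignored)."""
    def is_simple(v):
        # A vertex inside a contig has exactly one way in and one way out.
        return len(pred.get(v, ())) == 1 and len(succ.get(v, ())) == 1

    result = []
    for v in set(succ) | set(pred):
        if is_simple(v):
            continue                      # contigs start only at junctions or ends
        for w in succ.get(v, ()):
            sequence = v + w[-1]
            while is_simple(w):           # extend until the next junction or end
                (w,) = succ[w]
                sequence += w[-1]
            result.append(sequence)
    return sorted(result)

genome = "AACGACGTGTACCTCAGT"
reads = [genome[i:i + 9] for i in range(len(genome) - 9 + 1)]
print(contigs(*build_graph(reads, k=3)))
# ['AACG', 'ACGACG', 'ACGTGTACCTCAGT']
```

The repeated vertex ACG has two incoming and two outgoing edges, so the walk stops there and the genome comes out in three pieces instead of one.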

A part of the de Bruijn graph for E. coli (~4M long); you can imagine how complicated the graph is for the human genome (3G long).

Conclusions
• Our team:
  • Core faculty members: Prof. Francis Chin, Prof. TW Lam, me
  • 1 Research Assistant Professor (Henry Leung)
  • 1 Postdoc (Jianyu Shi)
  • about 8 PhD/master students + a team in HKU-BGI Lab
• Some collaborators:
  • Beijing Genomics Institute at Shenzhen (BGI): HKU-BGI Laboratory
  • HKU medical schools; life science departments
  • SickKids hospital, Canada
  • JGI, DoE, US
  • CAS-MPG PICB, Shanghai (C4 Rice project)
  • UC San Francisco (Pui's group)
  • GIS (Genome Institute of Singapore)
  • Universities: NUS, CUHK, U of Liverpool etc.
Thank you
