150 likes | 352 Views
HKU CS Bioinformatics Research. Siu Ming Yiu Department of Computer Science The University of Hong Kong. Other faculty members: Prof. Francis Chin Prof. TW Lam Dr HF Ting. Impact of bioinformatics. Medical research. Biological research. Huge volume of data.
E N D
HKU CS Bioinformatics Research Siu Ming Yiu Department of Computer Science The University of Hong Kong Other faculty members: Prof. Francis Chin Prof. TW Lam Dr HF Ting
Impact of bioinformatics Medical research Biological research Huge volume of data e.g. finding a cancer-causing gene? e.g. can we make rice grow faster? Environmental study Biofuel e.g. human genome: 3G long; Medical study: 100 persons e.g. human gut contains 1000+ bacteria (data: 500G) obesity e.g. how to remove harmful bacteria e.g. how bacteria digest food to produce energy?
The de novo assembly problem (single genome) Given an unknown genome, Genome X NO existing technology is able to read out the DNA sequence (ACCG…..) of it as the sequence is too long (e.g. human = 3 billions long; even bacteria are about 10k – several millions). What we can do? High-throughput sequencing technology (next generation sequencing (NGS)): …………………. Multiple copies of Genome X ACCG GTCG DNA sequencing machine CTTG [Inside the machine, the genomes are randomly cut into short fragments (reads), the machine can read out the DNA sequence of the reads.] AACG CTCG GTCG CTAG CAAG GGAG GTTG
Bad news Multiple copies of Genome X • The reads are really short: 100-150 bp (c.f. genome of a bacterium – 10K to several millions). • They are mixing together (no idea where from the genome each read is from!!). • There are errors in the read. [AACCGTTC => AACGGTCC] The (de novo) assembly problem: Can we reconstruct the original genome from the reads?
Data volume: HUGE!! Take human genome as an example. The genome is of 3x109 (3 billion) long. Recall: multiple copies are cut (fragmented). At any position of the genome, multiple copies of reads may be obtained. The average number of copies of reads from each position of a genome is referred as the depth of the sequencing. ………………. For depth = 30, # of reads: (3x109x30)/100 ≈ 109 Note that they are mixed together, no ordering information
Good news There are some clues inside the reads: The reads are overlapping! Unknown genome: AACCGGTTGCACGTTCCACTTGGCC……… AACCGGTTG CCGGTTGTC GGTTGTCAC TGTCACGTT CGGTTGTCA ACCGGTTGT TTGTCACGT GTTGTCACG Ideal case: every position has at least one read, no errors in the read, then…. [But the reality…. is a lot worse]
Unknown genome: AACCGGTTGCACGTTCCACTTGGCC……… AACCGCTTG CCGGTTGTC GGTTGTCAC TGTCACGTT CGGTTGTCA ACCGGTTTT TTGTCACGT GTTGTCACG The reality: (a) There are errors in the reads; not easy to locate the next read! (b) At some positions, we may have no reads.
Publications Bioinformatics (impact factor: 5.323) BMC genomics (impact factor: 4.4) PloS One (impact factor: 3.73) BMC bioinformatics (impact factor: 3.02) Journal of Computational Biology (impact factor: 1.56) IEEE/ACM TCBB (impact factor: 1.54) …… Top conferences: RECOMB, ISMB, ECCB Nature papers with our collaborators HKU-BGI research center: BGI (Shenzhen) is the largest genomic center in the world Other international collaborators: JGI, dept. of energy, US (biofuel); Sidekid hospital, Canada (diabetes); CAS-MPG PICB, Shanghai (C4 Rice project); UC San Francisco (Optical mapping data analysis); NUS, Singapore (RNA study); ….
How to solve the problem? A few general approaches String graph, de Bruijn graph, … Genome …. A C G T G T A C C T C……. Idea: we still make use of the overlapping parts in reads to connect them together. We do not need reads of every position. -------------------------- Graph: Vertex: k-mer (k consecutive nucleotides in a read) Edge: two k-mers appear consecutively in a read Read G T G T A C C T C (k = 4) GTGT TGTA GTAC TACC ACCT CCTC
Ideal case • No errors • Reads at every position • The graph can read out one single path, that will be the genome! Genome: A A C G A C G T G T A C C T C A G T A A C G A C G T G A C G A C G T G T C G A C G T G T A G A C G T G T A C A C G T G T A C C C G T G T A C C T G T G T A C C T C T G T A C C T C A G T A C C T C A G T A C C T C A G T Reads (len = 9) GTGT AACG CGAC ACGT GTAC CGTG ACGA GACG TGTA CCTC TCAG TACC CAGT CTCA ACCT
Note: even a few reads are missing, we are still ok! Can anyone see that how many reads can be missed depends on the value of k (when constructing the graph!)? Genome: A A C G A C G T G T A C C T C A G T A A C G A C G T G A C G A C G T G T C G A C G T G T A G A C G T G T A C A C G T G T A C C C G T G T A C C T G T G T A C C T C T G T A C C T C A G T A C C T C A G T A C C T C A G T Q: to allow more missing reads, larger or smaller k is better? Reads (len = 9) GTGT AACG CGAC ACGT GTAC CGTG ACGA GACG TGTA CCTC TCAG TACC CAGT CTCA ACCT
G CGTG Genome: A A C G A C G T G T A C C T C A G T ACGT Contigs: Maximal path without branches/paths CGTC A A C G A C G T G A C G A C G T G T C G A C G T G T A G A C G T G T A C A C G T G T A C C C G T G T A C C T G T G T A C C T C T G T A C C T C A G T A C C T C A G T A C C T C A G T contig G Reads (len = 9) CGAC GACG ACGT CGACGT Real case is more complicated: Even no error, in a genome, some patterns may repeat! In reality, we seldom can construct the whole genome in one piece, but stop at junctions, resulting with a set of contigs
A part of the de Bruijn graph for Ecoli (~4M long); you can imagine how complicated for human genome (3G long)
Conclusions • Our team: • Core Faculty members: • Prof. Francis Chin, Prof. TW Lam, me • 1 Research Assistant Professor (Henry Leung) • 1 Postdoc (Jianyu Shi) • about 8 PhD/master students + a team in HKU-BGI Lab • Some collaborators: • Beijing Genome Institute at Shenzhen (BGI) • - HKU-BGI Laboratory • HKU medical schools; life science departments • Sickkids hospital, Canada • JGI, DoE, US • CAS-MPG PICB, Shanghai (C4 Rice project) • UC San Francisco (Pui’s group) • GIS (Genome Institute at Singapore) • Universities: NUS, CUHK, U of Liverpool etc. <Thank you>