Weekly Report: Start learning GPU • Ph.D. Student: Leo Lee • Supervisor: Dr. Xiaowen Chu • Date: Sep. 11, 2009
Outline • Protein identification and pFind • GPU and data mining • Research Plan
Protein identification and pFind • Background • Identify flow • Challenges • Could GPU be used?
Background • Human Plasma Proteome Project, USA • Human Disease Glycomics/Proteome Initiative (HGPI), Japan • Human Proteome Program: China is in charge of the liver
Protein identification and pFind • Background • Identify flow • Challenges • Could GPU be used?
Mass Spectrometry Based Protein Identification • Mixed proteins → digest → mixed peptides → LC-MS/MS (tandem MS) → data analysis • Protein sequence: >ipi|IPI00243451|IPI00243451.6 MDQHQHLNKTAESASSEKKKTRRCNGFKMFLAALSFSYIAKALGGIIMKISITQIERRFD… • Peptide sequences: TAESASSEK, MFLAALSFSYIAK, … • Merge: peptide sequences → protein sequence
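The digest step on this slide can be sketched in Python. The cleavage rule used here (cut after K or R, except before P) is the usual trypsin rule and is an assumption; the slide does not name the enzyme. The protein string is the visible prefix of the IPI00243451 entry, which is truncated on the slide, so the last fragment is not a real peptide end.

```python
def tryptic_digest(protein):
    """Split a protein after every K or R, unless the next residue is P.

    Assumed trypsin-style rule; the slide only says "Digest".
    """
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_is_proline = i + 1 < len(protein) and protein[i + 1] == "P"
        if residue in "KR" and not next_is_proline:
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Trailing fragment with no cleavage site at its end.
        peptides.append(protein[start:])
    return peptides


# Visible (truncated) prefix of the sequence shown on the slide.
protein = "MDQHQHLNKTAESASSEKKKTRRCNGFKMFLAALSFSYIAKALGGIIMKISITQIERRFD"
peptides = tryptic_digest(protein)
print(peptides)
```

Under this rule the two peptides named on the slide, TAESASSEK and MFLAALSFSYIAK, both appear in the output, and concatenating the fragments gives back the input.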
Protein identification SE • pFind • Peptides: TAESA, MFLAALS, …, FSYIAK • Sequence database: …KFDTGIPDGFAGFFGHYAQGGITFRHEWTRJQIDF… [Figure: spectra with axis ticks from 200 to 1200; diagram labels: query, score, Go]
Protein identification SE • Protein sequence database: >IQPSKANME TEPDQ… >DEAVPPPAL QLQFN… ….. • Digestion gives a peptide list with masses: 400.15 EVDG; 400.15 AAEE; 400.15 PSTD; …; 698.48 SVKKKK; 699.78 TLKHLK; 699.78 WDRDL; …… • Query: lower bound of mass 699.70, upper bound of mass 699.90 • Query results [Figure: spectrum with axis ticks from 200 to 1200]
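The mass-window query on this slide can be sketched as a binary search over a mass-sorted peptide list. The (mass, peptide) pairs are the ones shown on the slide; the names `peptide_index` and `candidates_in_window` are hypothetical, and keeping the list sorted by mass is an assumption about how such an index would be stored.

```python
from bisect import bisect_left, bisect_right

# (mass, peptide) pairs copied from the slide, kept sorted by mass.
peptide_index = [
    (400.15, "EVDG"),
    (400.15, "AAEE"),
    (400.15, "PSTD"),
    (698.48, "SVKKKK"),
    (699.78, "TLKHLK"),
    (699.78, "WDRDL"),
]
masses = [mass for mass, _ in peptide_index]


def candidates_in_window(lower, upper):
    """Return all peptides whose mass lies in [lower, upper]."""
    first = bisect_left(masses, lower)   # first index with mass >= lower
    last = bisect_right(masses, upper)   # one past the last mass <= upper
    return peptide_index[first:last]


hits = candidates_in_window(699.70, 699.90)
print(hits)
```

For the window from 699.70 to 699.90 this returns the two entries at 699.78 (TLKHLK and WDRDL). Each lookup costs two binary searches, and lookups are independent of one another.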
Protein identification SE • MS [Figure: spectra with axis ticks from 200 to 1200] • Protein database: >IQPSKANME TEPDQ… >DEAVPPPAL QLQFN… >RQRAILKVM NTIGGE… … ……
Protein identification SE • Protein database: >IQPSKANME TEPDQ… >DEAVPPPAL QLQFN… >RQRAILKVM NTIGGE… … • Digest: protein database → peptides: 400 EVDG; 400 AAEE; 400 PSTD; 698 SVKKKK; 699 TLKHLK; 699 WDRDL; …… • Matching: MS ↔ peptides
Protein identification and pFind • Background • Identify flow • Challenges • Could GPU be used?
Challenges of PISE • Protein databases grow exponentially • MS generation speed keeps increasing • PTMs lead to a huge number of peptides [Figure: protein database (>IQPSKANME TEPDQ… >DEAVPPPAL QLQFN… >RQRAILKVM NTIGGE…) → digest → peptides (EVDG, AAEE, PSTD, SVKKKK, TLKHLK, WDRDL, ……) → matching ← MS]
E.g. Phosphorylation • Occurs on amino acids S, T and Y (HPO3, 80 Da) • EMSVPSCQYILSATNR has five candidate sites (one PO3 each), so 2^5 = 32 combinations are possible
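The combinatorics on this slide can be checked with a short enumeration. The peptide on the slide has five S/T/Y residues, so each site being on or off gives 2^5 = 32 forms, counting the unmodified one. The 80 Da shift is the nominal value quoted on the slide; the function name `phospho_forms` is hypothetical.

```python
from itertools import product

PHOSPHO_RESIDUES = "STY"
PHOSPHO_SHIFT_DA = 80.0  # nominal HPO3 mass shift quoted on the slide


def phospho_forms(peptide):
    """Enumerate every on/off combination of phosphorylation sites.

    Returns the candidate site positions (0-based) and a list of
    (modified_positions, total_mass_shift) pairs.
    """
    sites = [i for i, aa in enumerate(peptide) if aa in PHOSPHO_RESIDUES]
    forms = []
    for flags in product([False, True], repeat=len(sites)):
        modified = tuple(s for s, on in zip(sites, flags) if on)
        forms.append((modified, PHOSPHO_SHIFT_DA * len(modified)))
    return sites, forms


sites, forms = phospho_forms("EMSVPSCQYILSATNR")
print(len(sites), "sites,", len(forms), "forms")
```

Every additional modifiable residue doubles the number of candidate peptides that must be scored, which is why PTMs inflate the search space so quickly.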
Identification of PTM • Protein: >IQPSKANME TEPDQ… >DEAVPPPAL QLQFN… >RQRAILKVM NTIGGE… … • Peptide: 400 EVDG; 400 AAEE; 400 PSTD; 631 EMSVPS; 699 TLKHLK; 699 WDRDL; ……
Protein identification and pFind • Background • Identify flow • Challenges • Could GPU be used? • http://bioinformatics.oxfordjournals.org/cgi/content/full/25/15/1937
Protein identification on GPU • One thread per MS spectrum • One thread per score • One thread per “query” • V1 match V2 • Seems worth thinking about further!
Outline • Protein identification and pFind • GPU and data mining • Research Plan
GPU and data mining • Characteristics of the GPU • GPU vs. CPU • CUDA • Data mining on GPU
GPU vs. CPU • Based on slide 7 of S. Green, “GPU Physics,” SIGGRAPH 2007 GPGPU Course. http://www.gpgpu.org/s2007/slides/15-GPGPU-physics.pdf
Design philosophies are different. • The GPU is specialized for compute-intensive, massively data-parallel computation (exactly what graphics rendering is about) • So, more transistors can be devoted to data processing rather than data caching and flow control • The fast-growing video game industry exerts strong economic pressure for constant innovation [Figure: CPU and GPU block diagrams with Control, ALU, Cache and DRAM blocks]
What is the GPU Good at? • The GPU is good at data-parallel processing • The same computation executed on many data elements in parallel – low control flow overhead with high SP floating point arithmetic intensity • Many calculations per memory access • Currently also needs a high floating-point to integer ratio • High floating-point arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches – Still need to avoid bandwidth saturation!
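"Many calculations per memory access" can be made concrete as arithmetic intensity, that is, floating-point operations divided by bytes moved. The sketch below compares two textbook kernels. The byte counts are idealized assumptions (single-precision values, each matrix read or written exactly once), not measurements.

```python
FLOAT_BYTES = 4  # single-precision (SP) float


def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def saxpy_intensity(n):
    # y[i] = a * x[i] + y[i]: 2 FLOPs per element;
    # traffic per element: read x, read y, write y.
    return arithmetic_intensity(2 * n, 3 * n * FLOAT_BYTES)


def matmul_intensity(n):
    # n x n matrix product: 2 * n^3 FLOPs;
    # idealized traffic: read A and B once, write C once.
    return arithmetic_intensity(2 * n ** 3, 3 * n * n * FLOAT_BYTES)


print(saxpy_intensity(1_000_000))  # constant, independent of n
print(matmul_intensity(1024))      # grows linearly with n
```

SAXPY stays at 1/6 FLOP per byte whatever the size, so it is bandwidth-bound. The idealized matrix product reaches n/6 FLOPs per byte, which is the kind of workload where latency can be hidden behind calculation.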
CUDA - No more shader functions. • CUDA: integrated CPU+GPU application C program • Serial or modestly parallel C code executes on CPU • Highly parallel SPMD kernel C code executes on GPU • Execution sequence: CPU serial code → GPU parallel kernel, Grid 0: KernelA<<< nBlk, nTid >>>(args); → CPU serial code → GPU parallel kernel, Grid 1: KernelB<<< nBlk, nTid >>>(args); → . . .
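The launch parameters `nBlk` and `nTid` determine how many threads run and which data element each one handles. The following is a serial Python emulation of that index arithmetic, not CUDA code. The helper names `blocks_needed` and `emulate_launch` are hypothetical, and a one-dimensional grid is assumed, where the global index is the block index times the block size plus the thread index.

```python
def blocks_needed(n_items, n_tid):
    """Smallest number of blocks so that n_blk * n_tid >= n_items."""
    return (n_items + n_tid - 1) // n_tid


def emulate_launch(n_items, n_tid, work):
    """Serially emulate a 1-D launch: one logical thread per item."""
    n_blk = blocks_needed(n_items, n_tid)
    out = [None] * n_items
    for block_idx in range(n_blk):
        for thread_idx in range(n_tid):
            i = block_idx * n_tid + thread_idx  # global thread index
            if i < n_items:  # threads past the end do nothing
                out[i] = work(i)
    return n_blk, out


n_blk, squares = emulate_launch(10, 4, lambda i: i * i)
print(n_blk, squares)
```

With 10 items and 4 threads per block, three blocks are needed and the last two threads of the final block are idle. The bounds check plays the same role as the guard usually placed at the top of a kernel.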
CUDA • Basic • Memory • Threads • Application performance
Data mining on GPU • K-means • K-nn • Apriori • SVM
K-means on GPU • A team at the University of Virginia, led by Professor Skadron • HKUST and MSRA • GPUMiner • HP Labs
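For orientation, one iteration of the standard k-means (Lloyd) procedure is sketched below in plain Python. This is a generic textbook version, not the implementation of any group listed above. The assignment step treats every point independently, which is the part that maps naturally onto one GPU thread per point.

```python
def assign_points(points, centroids):
    """Label each point with the index of its nearest centroid."""
    labels = []
    for p in points:  # each iteration is independent of the others
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            updated.append(old)  # keep an empty cluster where it was
            continue
        updated.append(tuple(
            sum(m[d] for m in members) / len(members)
            for d in range(len(old))
        ))
    return updated


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
labels = assign_points(points, centroids)
centroids = update_centroids(points, labels, centroids)
print(labels, centroids)
```

The update step is a reduction over the points in each cluster, so it parallelizes less directly than the assignment step.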
Data mining on GPU • The achieved speed-up depends heavily on the implementation • Data transfer • Memory • CPU-GPU cooperation
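The effect of data transfer on the observed speed-up can be illustrated with a simple timing model. The timings below are hypothetical numbers chosen for the example, not measurements, and the model assumes that transfer and kernel execution do not overlap.

```python
def end_to_end_speedup(cpu_s, kernel_s, transfer_s):
    """CPU time divided by total GPU time (kernel plus transfers)."""
    return cpu_s / (kernel_s + transfer_s)


# Hypothetical timings in seconds, for illustration only.
cpu_s, kernel_s, transfer_s = 10.0, 0.5, 1.5

kernel_only = end_to_end_speedup(cpu_s, kernel_s, 0.0)
with_transfer = end_to_end_speedup(cpu_s, kernel_s, transfer_s)
print(kernel_only, with_transfer)
```

In this example a kernel that is 20 times faster than the CPU yields only a 5 times end-to-end gain once host-device copies are counted, which is why reported speed-ups vary so much between implementations.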
Outline • Protein identification and pFind • GPU and data mining • Research Plan
Research Plan • Keep reading related papers – GPU, data mining • Development – Read our k-means program – Try to speed it up – Try protein identification on GPU
Time schedule • Courses – Thu. 6:30-9:30pm, Data mining • TA – Tue. 11:30am-12:20pm, Network security – Fri. 9:30-11:30am, Network security