Genomic Signal Processing Dr. C.Q. Chang Dept. of EEE
Outline
• Basic Genomics
• Signal Processing for Genomic Sequences
• Signal Processing for Gene Expression
• Resources and Co-operations
• Challenges and Future Work
Genome
• Every human cell contains about 6 feet of double-stranded (ds) DNA
• This DNA has 3,000,000,000 base pairs representing 50,000-100,000 genes (an early estimate; later counts put the number of protein-coding genes at roughly 20,000-25,000)
• This DNA contains our complete genetic code, or genome
• DNA regulates all cell functions, including response to disease, aging and development
• Gene expression pattern: snapshot of the genes being expressed in a cell
• Gene expression profile: DNA mutation or polymorphism over time
• Genetic pathways: changes in genetic code accompanying metabolic and functional changes, e.g. disease or aging
Gene: protein-coding DNA
DNA -> (transcription) -> mRNA -> (translation) -> Protein
DNA: CCTGAGCCAACTATTGATGAA
mRNA: CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
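The DNA -> mRNA -> protein example on this slide can be reproduced with a short sketch. The function names `transcribe` and `translate` are illustrative, and the codon table holds only the seven standard-genetic-code entries this example needs; a real tool would use the full 64-codon table.

```python
# Subset of the standard genetic code: only the codons used in the slide's example.
CODON_TABLE = {
    "CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
    "AUU": "I", "GAU": "D", "GAA": "E",
}

def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: every T becomes U."""
    return dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Read the mRNA three letters at a time; unknown codons become '?'."""
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE.get(codon, "?") for codon in codons)

dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```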
In more detail (color ~state)
The Problem
• Genomic information is digital: a string over the letters A, T, C and G
• Signal processing deals with numerical sequences, so character strings have to be mapped into one or more numerical sequences
• Identification of protein-coding regions
• Prediction of whether or not a given DNA segment is part of a protein-coding region
• Prediction of the proper reading frame
• Compared with traditional methods, signal processing methods are much quicker and, in some cases, can be even more accurate
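One common way to map a character string into numerical sequences is to build four binary indicator sequences, one per base. The sketch below assumes that mapping; the slides do not say which mapping the author uses, and the name `indicator_sequences` is illustrative.

```python
def indicator_sequences(dna: str) -> dict:
    """Map a DNA string to four 0/1 sequences, one for each base.

    The sequence for base b has a 1 wherever the string has b, else 0.
    """
    dna = dna.upper()
    return {base: [1 if ch == base else 0 for ch in dna] for base in "ACGT"}

seqs = indicator_sequences("ATGCA")
for base, values in seqs.items():
    print(base, values)
```

At every position exactly one of the four sequences is 1, so the sequences always sum to an all-ones sequence.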
Signal Analysis
• Spectral analysis (Fourier transform, periodogram)
• Spectrogram
• Wavelet analysis
• HMT: wavelet-based Hidden Markov Tree
• Spectral envelope (using an optimal string-to-numerical-value mapping)
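As an illustration of the spectral analysis item, protein-coding regions often show a spectral peak at period 3 (DFT bin N/3), because codons are three bases long. The sketch below is an assumption about how such a check could look, not the author's method: it sums the DFT power of the four base-indicator sequences at bin N/3 and compares it with the average power over all non-zero bins. The names `spectral_power` and `period3_ratio` are hypothetical.

```python
import cmath

def spectral_power(dna: str, k: int) -> float:
    """Total power at DFT bin k, summed over the four indicator sequences."""
    n = len(dna)
    total = 0.0
    for base in "ACGT":
        # DFT of the 0/1 indicator sequence of `base`, evaluated at bin k.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * k * i / n)
            for i, ch in enumerate(dna)
            if ch == base
        )
        total += abs(coeff) ** 2
    return total

def period3_ratio(dna: str) -> float:
    """Power at bin N/3 divided by the mean power over bins 1..N-1."""
    dna = dna.upper()
    dna = dna[: len(dna) - len(dna) % 3]  # make the length a multiple of 3
    n = len(dna)
    peak = spectral_power(dna, n // 3)
    average = sum(spectral_power(dna, k) for k in range(1, n)) / (n - 1)
    return peak / average

print(period3_ratio("ATG" * 20))   # strongly 3-periodic toy sequence
print(period3_ratio("ATGC" * 15))  # 4-periodic toy sequence, no period-3 peak
```

A ratio well above 1 suggests period-3 structure. In practice the ratio would be computed over a sliding window along the sequence, which gives a spectrogram-like view.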
Spectral envelope of the BNRF1 gene from the Epstein-Barr virus
• (a) 1st section (1000 bp), (b) 2nd section (1000 bp), (c) 3rd section (1000 bp), (d) 4th section (954 bp)
• Conjecture: the 4th quarter is actually non-coding
Microarray Life Cycle
Biological Question -> Sample Preparation -> Microarray Reaction -> Microarray Detection -> Data Analysis & Modeling -> Biological Question
Taken from Schena & Davis
Microarray experiment (figure labels): cDNA clones (probes) • PCR product amplification and purification • printing (0.1 nl/spot) • hybridise mRNA (target) to microarray • scanning: excitation with laser 1 and laser 2, emission • overlay images and normalise • microarray analysis
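The slide does not say how the two laser channels are normalised. One simple and widely used option is global median normalisation of log ratios, sketched below as an assumption; `red` and `green` are hypothetical per-spot intensities from the two scans.

```python
import math
import statistics

def normalised_log_ratios(red, green):
    """log2(red/green) per spot, shifted so the median log ratio is zero."""
    ratios = [math.log2(r / g) for r, g in zip(red, green)]
    centre = statistics.median(ratios)
    return [m - centre for m in ratios]

# Hypothetical intensities: the red channel is uniformly twice as bright,
# which global normalisation treats as a channel bias and removes.
red = [200.0, 400.0, 800.0]
green = [100.0, 200.0, 400.0]
print(normalised_log_ratios(red, green))
```

Median centring assumes that most genes are not differentially expressed between the two samples.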
Image Segmentation
• Simple way: fixed circle method
• Advanced: fast marching level set segmentation
(Figure panels: advanced, fixed circle)
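The fixed circle method can be sketched as follows: every pixel within a fixed radius of the spot centre counts as foreground, and the rest of the patch counts as background. The function name, the list-of-rows image layout, and the toy patch are assumptions made for illustration.

```python
def fixed_circle_intensity(image, cx, cy, radius):
    """Mean foreground and background intensity for one spot.

    image is a list of rows of pixel values covering the patch around the
    spot; (cx, cy) is the spot centre in (column, row) coordinates.
    """
    inside, outside = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                inside.append(value)
            else:
                outside.append(value)
    foreground = sum(inside) / len(inside)
    background = sum(outside) / len(outside) if outside else 0.0
    return foreground, background

# Hypothetical 5x5 patch with a bright spot in the middle.
patch = [
    [10, 10, 10, 10, 10],
    [10, 10, 100, 10, 10],
    [10, 100, 100, 100, 10],
    [10, 10, 100, 10, 10],
    [10, 10, 10, 10, 10],
]
fg, bg = fixed_circle_intensity(patch, cx=2, cy=2, radius=1)
print(fg, bg)  # 100.0 10.0
```

The method is fast but assumes that every spot is circular and of the same size, which is the limitation that more advanced segmentation addresses.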
Clustering and filtering methods
Principal approaches:
• Hierarchical clustering (kdb trees, CART, gene shaving)
• K-means clustering
• Self-organizing (Kohonen) maps
• Support vector machines
• Gene filtering via multiobjective optimization
• Independent Component Analysis (ICA)
Validation approaches:
• Significance analysis of microarrays (SAM)
• Bootstrapping cluster analysis
• Leave-one-out cross-validation
• Replication (additional gene chip experiments, quantitative PCR)
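Of the approaches listed, k-means is the simplest to sketch. The version below is a minimal illustration under stated assumptions: each gene is a tuple of expression values, distance is squared Euclidean, and the starting centroids are supplied by the caller so the result is deterministic. The data are made up.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # New centroid is the per-dimension mean; keep the old one if empty.
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids

# Hypothetical expression profiles: two up-regulated and two down-regulated genes.
profiles = [(1.0, 1.1), (0.9, 1.0), (-1.0, -0.9), (-1.1, -1.0)]
labels, centres = kmeans(profiles, centroids=[profiles[0], profiles[2]])
print(labels)  # [0, 0, 1, 1]
```

The result of k-means depends on the starting centroids, which is one reason validation steps such as bootstrapping are listed alongside the clustering methods.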
ICA for B-cell lymphoma data
Data: 96 samples of normal and malignant lymphocytes.
Results: scatter plots of 12 independent components.
Comparison: closely related to the results of hierarchical clustering.
Resources and Co-operations
Resources: databases on the internet such as
• GenBank
• ProteinBank
• Some small databases of microarray data
Co-operation needed:
• First-hand microarray data
• Biological experiments for validation
Challenges and Future Work
• Genomic signal processing opens a new signal processing frontier
• Sequence analysis: genomic sequences are symbolic (categorical) signals, so classical signal processing methods are not directly applicable
• The increasingly high dimensionality of genetic data sets, and the complexity involved, call for fast, high-throughput implementations of genomic signal processing algorithms
• Future work: spectral analysis of DNA sequences and clustering of microarray data; modify classical signal processing methods and develop new ones