Deep Learning in Bioinformatics

Deep Learning in Bioinformatics. Asmitha Rathis. Why Bioinformatics? Protein structure, genetic variants, anomaly classification, protein classification, segmentation/splicing. Why is Deep Learning beneficial?

rsanchez

Presentation Transcript


  1. Deep Learning in Bioinformatics Asmitha Rathis

  2. Why Bioinformatics? • Protein structure • Genetic Variants • Anomaly classification • Protein classification • Segmentation/Splicing

  3. Why is Deep Learning beneficial? • Deep learning models scale to large datasets and are effective at identifying complex patterns in feature-rich data • They learn high-level abstractions through multiple layers of non-linear transformations

  4. Terms • What are motifs? • Short, recurring patterns in DNA that are presumed to have a biological function • What is non-coding DNA? • DNA that does not encode protein sequences

  5. Papers • DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences - Daniel Quang and Xiaohui Xie [2016] • Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning - Babak Alipanahi et al. [2015] • Exploiting the past and the future in protein secondary structure prediction - Pierre Baldi et al. [1999]

  6. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences • A predictive model for the function of non-coding DNA would have enormous benefit for translational research • 98% of the human genome is non-coding DNA, and 93% of disease-associated variants lie in these regions • Previous work: the DeepSEA model • DanQ proposes a novel hybrid framework that combines a convolutional network with a bi-directional long short-term memory (BLSTM) recurrent network

  7. Network Model • Convolution layer: captures motifs • Recurrent layer: captures dependencies between the motifs, i.e. a regulatory grammar
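
As a rough illustration of why a convolution layer suits motif detection, the sketch below slides a single hand-built kernel over a one-hot encoded DNA string. The kernel values, the example sequence, and the helper names (`one_hot`, `conv1d_scan`, `motif_kernel`) are assumptions made for this example; in DanQ the kernels are learned, and this is not the authors' code.

```python
# Toy sketch: one 1-D convolution kernel acting as a motif detector.
# Kernel values and the example sequence are made up for illustration.

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def motif_kernel(motif, match=1.0, mismatch=-1.0):
    """Hypothetical kernel: reward the motif's base, penalize the others."""
    return [[match if b == base else mismatch for b in BASES] for base in motif]


def conv1d_scan(encoded, kernel):
    """Slide the kernel along the sequence ('valid' mode), then apply ReLU."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(max(0.0, total))  # ReLU keeps only positive evidence
    return scores


sequence = "GGCTATAAGC"
scores = conv1d_scan(one_hot(sequence), motif_kernel("TATA"))
best_position = scores.index(max(scores))  # start index of the strongest match
```

The activation peaks where the kernel lines up with the motif, which is the per-position signal a recurrent layer could then read along the sequence.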

  8. Training Details • Kernels are initialized either randomly or from known motifs • Dropout is included • RMSprop algorithm with a minibatch size of 100 • Full training takes 60 epochs, at ∼6 h per epoch
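
For readers unfamiliar with RMSprop, here is a minimal sketch of its update rule on a toy one-parameter problem. The learning rate, decay rate `rho`, `eps`, and the quadratic objective are assumptions chosen for the example; the slide does not state the hyperparameters used for DanQ beyond the minibatch size.

```python
import math


def rmsprop_step(weights, grads, cache, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSprop update: scale each gradient by a running RMS of its history."""
    new_weights, new_cache = [], []
    for w, g, c in zip(weights, grads, cache):
        c = rho * c + (1.0 - rho) * g * g  # running average of squared gradients
        new_cache.append(c)
        new_weights.append(w - lr * g / (math.sqrt(c) + eps))
    return new_weights, new_cache


# Toy objective (an assumption): minimize f(w) = w^2, whose gradient is 2w.
weights, cache = [1.0], [0.0]
for _ in range(500):
    grads = [2.0 * w for w in weights]
    weights, cache = rmsprop_step(weights, grads, cache, lr=0.01)
```

Because each step is divided by the running gradient magnitude, the step size stays roughly proportional to `lr` regardless of the raw gradient scale.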

  9. Results • ROC curves were calculated for each of the 919 binary targets on the test set • The predicted probability was the average over each forward and reverse-complement sequence pair
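
The two evaluation steps on this slide can be sketched as follows: averaging a model's output over a sequence and its reverse complement, and computing the area under the ROC curve for one binary target. `toy_predict` is a hypothetical stand-in for a trained model, and the pairwise AUC formula is a simple textbook version, not the evaluation code used in the paper.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def averaged_prediction(predict, seq):
    """Average a model's output on a sequence and its reverse complement."""
    return 0.5 * (predict(seq) + predict(reverse_complement(seq)))


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def toy_predict(seq):
    """Hypothetical model stand-in: the fraction of G bases in the sequence."""
    return seq.count("G") / len(seq)


strand_averaged = averaged_prediction(toy_predict, "GGCA")
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

With 919 targets, `roc_auc` would be called once per target column of the label and prediction matrices.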

  10. Results Precision recall curve

  11. Future Work • Better initialization techniques • Half of the kernels are initialized with known motifs from the JASPAR database • Datasets from more cell types

  12. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning: DeepBind • DNA- and RNA-binding proteins play a central role in gene regulation, including transcription and alternative splicing. • In transcription, sequence specificity usually refers to how specifically a protein, typically a transcription factor, recognizes its target DNA motif.

  13. Challenges • Data come in qualitatively different forms, e.g., microarray and sequencing data • The quantity of data is very large • Need to overcome the biases of existing technologies

  14. Data • For training, DeepBind uses a set of sequences and, for each sequence, an experimentally determined binding score.

  15. Binding score :

  16. Training/Testing Details • Training on in vitro data and testing on in vivo data • In vitro: a procedure performed in a controlled environment outside of a living organism • In vivo: tested on whole, living organisms or cells, usually animals (including humans) and plants

  17. Results

  18. Analysis of potentially disease-causing genomic variants • Use binding models to identify, group and visualize variants that potentially change protein binding • The height of each letter shows the importance of that base • The mutation map indicates how much each possible mutation will increase or decrease the binding score • Example: a cancer risk variant in a MYC enhancer weakens a TCF7L2 binding site
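
A mutation map of this kind can be sketched by substituting every base at every position and recording the change in the model's score. The scoring function below is a hypothetical placeholder (best match count against an arbitrary motif), not DeepBind's binding model, and the sequence is invented.

```python
BASES = "ACGT"


def mutation_map(score, seq):
    """Return {(position, new_base): score change} for every single-base substitution."""
    reference = score(seq)
    changes = {}
    for pos, original in enumerate(seq):
        for base in BASES:
            if base == original:
                continue  # skip the unmutated base
            mutated = seq[:pos] + base + seq[pos + 1:]
            changes[(pos, base)] = score(mutated) - reference
    return changes


def toy_score(seq, motif="GATTA"):
    """Hypothetical binding score: best match count against an arbitrary motif."""
    width = len(motif)
    return max(
        sum(1 for a, b in zip(seq[i:i + width], motif) if a == b)
        for i in range(len(seq) - width + 1)
    )


sequence = "CGATTAC"
changes = mutation_map(toy_score, sequence)
# Negative entries weaken the predicted binding; positive entries strengthen it.
```

Plotting `changes` as a 4 x L grid gives the heat-map view described on the slide; substitutions inside the motif show up as negative entries.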

  19. Analysis of Splicing Patterns

  20. Exploiting the past and the future in protein secondary structure prediction • Predicting the secondary structure of a protein (alpha-helix, beta-sheet, coil) is an important step towards understanding its three-dimensional structure as well as its function. • Old methods: ML models with a fixed input window that don’t capture variable long-range information; increasing the size of the window leads to overfitting

  21. Results

  22. Results • Overall performance close to 76% correct classification with 6 BRNNs • Use a range to limit the size of the window
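
To make the "past and future" idea behind a bidirectional RNN (BRNN) concrete, the sketch below runs one recurrent pass left to right and another right to left, then pairs the two states at each position. The scalar weights, the tanh cell, and the input features are assumptions for illustration; the architecture in the paper is considerably richer.

```python
import math


def run_rnn(inputs, w_in, w_rec):
    """Simple scalar recurrent pass: state_t = tanh(w_in * x_t + w_rec * state_{t-1})."""
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def bidirectional_states(inputs, w_in=1.0, w_rec=0.5):
    """Pair a forward (past) state and a backward (future) state for each position."""
    forward = run_rnn(inputs, w_in, w_rec)
    backward = run_rnn(inputs[::-1], w_in, w_rec)[::-1]
    return list(zip(forward, backward))


# Hypothetical per-residue feature, e.g. a binary property of each amino acid.
features = [1.0, 0.0, 0.0, 1.0]
states = bidirectional_states(features)

# Changing only the last residue alters the backward state at position 0,
# while the forward state at position 0 is unaffected.
altered = bidirectional_states([1.0, 0.0, 0.0, 0.0])
```

A classifier reading both states at position t sees context from the whole chain, without needing a fixed window wide enough to cover it.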

  23. Questions • Based on the more recent models and technologies seen in class, which of them can be applied to these problems? • Can these techniques be applied to other bioinformatics tasks?
