RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach

[Introduction]   [Code and Dataset]  


RNA plays key roles in cells through the interactions with proteins known as the RNA-binding proteins (RBP) and their binding motifs enable crucial understanding of the post-transcriptional regulation of RNAs. How the RBPs correctly recognize the target RNAs and why they bind specific positions is still far from clear. Machine learning-based algorithms are widely acknowledged to be capable of speeding up this process. Although many automatic tools have been developed to predict the RNA-protein binding sites from the rapidly growing multi-resource data, e.g. sequence, structure, their domain specific features and formats have posed significant computational challenges. One of current difficulties is that the cross-source shared common knowledge is at a higher abstraction level beyond the observed data, resulting in a low efficiency of direct integration of observed data across domains. The other difficulty is how to interpret the prediction results. Existing approaches tend to terminate after outputting the potential discrete binding sites on the sequences, but how to assemble them into the meaningful binding motifs is a topic worth of further investigation.
In viewing of these challenges, we propose a deep learning-based framework (iDeep) by using a novel hybrid convolutional neural network and deep belief network to predict the RBP interaction sites and motifs on RNAs. This new protocol is featured by transforming the original observed data into a high-level abstraction feature space using multiple layers of learning blocks, where the shared representations across different domains are integrated. To validate our iDeep method, we performed experiments on 31 large-scale CLIP-seq datasets, and our results show that by integrating multiple sources of data, the average AUC can be improved by 8% compared to the best single-source-based predictor; and through cross-domain knowledge integration at an abstraction level, it outperforms the state-of-the-art predictors by 6%. Besides the overall enhanced prediction performance, the convolutional neural network module embedded in iDeep is also able to automatically capture the interpretable binding motifs for RBPs. Large-scale experiments demonstrate that these mined binding motifs agree well with the experimentally verified results, suggesting iDeep is a promising approach in the real-world applications.

Figure 1 The flowchart of proposed iDeep for predicting RNA-protein binding sites on RNAs. It firstly extracted different representation for RNA-protein binding sites within a windows size 101, then use multimodal deep learning consisting of DBNs and CNNs to integrate these extracted representations to predict RBP interaction sites. 

Code and Datasets

The program package consists of main python program and the dataset, including all extracted features. To install the programs, download the package (available at github (iDeep)

deep learning lib Keras-1.0.0 keras
machine learning lib scikit-learn


Xiaoyong Pan and Hong-Bin Shen, RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach, BMC Bioinformatics, 2017, 18: 136.

© 2011 Computational Systems Biology/Shen Group.