IPMiner: Hidden ncRNA-protein interaction sequential pattern mining with stacked autoencoder for accurate computational prediction


Introduction

Non-coding RNAs (ncRNAs) play crucial roles in many biological processes, such as post-transcriptional gene regulation, and they typically function by interacting with proteins. A fundamental step toward understanding the function of an ncRNA is therefore to identify the proteins it interacts with, which makes computational prediction of RNA-binding proteins (RBPs) promising. In this study, we propose IPMiner, a computational method that predicts ncRNA-protein interactions from sequences using deep learning and further improves its performance with stacked ensembling. A stacked autoencoder automatically extracts high-level features from the conjoint triad features of protein and RNA sequences, and these high-level features are fed into a Random Forest to predict ncRNA-protein interactions. Stacked ensembling then integrates different predictors to further improve prediction performance. The experimental results indicate that IPMiner achieves high performance on our constructed lncRNA-protein interaction dataset, with an accuracy of 0.891, sensitivity of 0.939, specificity of 0.831, precision of 0.945 and MCC of 0.784. We further evaluated IPMiner on other RNA-protein interaction datasets, where it performs much better than state-of-the-art methods, with an improvement of over 20% on some datasets. We also applied IPMiner to large-scale prediction of ncRNA-protein networks, where it achieved high prediction performance.
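The raw sequence features mentioned above can be sketched as follows. This is a minimal illustration, not the code shipped in the package: it assumes the commonly used 7-class amino acid grouping for conjoint triads (343 protein features) and 4-mers over ACGU for RNA (256 features), and it normalizes counts to frequencies. The function names and the normalization are assumptions made for this example.

```python
from itertools import product

# Assumed 7-class amino acid grouping often used for conjoint triad features.
AA_GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: str(i) for i, group in enumerate(AA_GROUPS) for aa in group}


def kmer_frequencies(seq, alphabet, k):
    """Return the frequency of every k-mer over `alphabet`, in a fixed order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing unknown symbols are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def protein_features(seq):
    """Conjoint triad features: 3-mers over the 7 residue classes (7^3 = 343)."""
    reduced = "".join(AA_TO_GROUP.get(aa, "X") for aa in seq.upper())
    return kmer_frequencies(reduced, "0123456", 3)


def rna_features(seq):
    """4-mer features over A, C, G, U (4^4 = 256); T is treated as U."""
    return kmer_frequencies(seq.upper().replace("T", "U"), "ACGU", 4)


# An RNA-protein pair would then be described by both vectors together.
pair_vector = rna_features("ACGUACGUGGCA") + protein_features("MKVLAAGICDE")
```

In IPMiner these raw vectors are the input to the stacked autoencoder rather than the final features, so the sketch only covers the first stage of the pipeline.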




Figure 1. Flowchart of the proposed IPMiner, which proceeds in two main steps. a) Train stacked autoencoder models for RNA and protein separately, then fine-tune the stacked model using the label information of RNA-protein pairs. b) Apply stacked ensembling to integrate SDA-RF, SDA-TF-RF and RPISeq-RF, which use the high-level features before fine-tuning, the high-level features after fine-tuning and the raw conjoint (k-mer) features, respectively.
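Step b) can be illustrated with a small stacking sketch. It assumes that each of the three base predictors outputs an interaction probability on held-out pairs and that a logistic regression combiner learns how to weight them. The combiner choice, the function names and the toy scores are assumptions made for illustration, not details taken from the package.

```python
import math


def sigmoid(z):
    # Clamp to keep math.exp from overflowing on extreme inputs.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fit_stacker(base_scores, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression on base-predictor scores by gradient descent."""
    n_models = len(base_scores[0])
    weights = [0.0] * n_models
    bias = 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for scores, y in zip(base_scores, labels):
            p = sigmoid(bias + sum(w * s for w, s in zip(weights, scores)))
            err = p - y
            for j, s in enumerate(scores):
                grad_w[j] += err * s
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def stacked_score(scores, weights, bias):
    """Combine the base scores of one RNA-protein pair into a final score."""
    return sigmoid(bias + sum(w * s for w, s in zip(weights, scores)))


# Made-up held-out scores, ordered as (SDA-RF, SDA-TF-RF, RPISeq-RF).
base_scores = [
    (0.9, 0.8, 0.6), (0.8, 0.7, 0.7), (0.7, 0.9, 0.4),
    (0.2, 0.3, 0.5), (0.1, 0.2, 0.4), (0.3, 0.1, 0.6),
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_stacker(base_scores, labels)
final = [stacked_score(s, weights, bias) for s in base_scores]
```

Fitting the combiner on scores from pairs the base predictors did not see during training keeps it from simply trusting whichever base model overfits most.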

Code and Datasets

The program package consists of the main Python program and the RNA-protein interaction datasets, including protein and RNA sequences. To install the programs, download the package (available as a tar.gz file) or get it from GitHub (IPMiner).

Dependencies:
Deep learning library: Keras 0.1.2
Machine learning library: scikit-learn

Usage:

python IPMiner.py -datatype=RPI488
where RPI488 is a lncRNA-protein interaction dataset, on which IPMiner runs 5-fold cross-validation. You can also choose other datasets, such as RPI1807, RPI369, RPI2241, RPI13254 and NPInter.
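For readers unfamiliar with the evaluation protocol, the sketch below shows how a 5-fold split and the reported metrics (accuracy, sensitivity, specificity, precision, MCC) can be computed. It is a standard-library illustration only; the function names, the shuffling seed and the unstratified split are assumptions and may differ from what IPMiner.py does internally.

```python
import math
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def binary_metrics(y_true, y_pred):
    """Compute the five metrics from binary labels (1 = interacting pair)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Each fold trains on four fifths of the pairs and tests on the rest.
for train_idx, test_idx in kfold_indices(10, n_folds=5):
    pass  # fit a model on train_idx, then call binary_metrics on test_idx
```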

python IPMiner.py -r=RNA_fasta_file -p=protein_fasta_file
It predicts pairwise interaction scores for the RNAs and proteins in the input files.
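The pairwise mode can be pictured with the sketch below, which assumes that every RNA in the first FASTA file is scored against every protein in the second. The parser, the `score_all_pairs` helper and the placeholder scorer are hypothetical names used only for illustration; the actual scores come from the trained IPMiner model, and its output format is not shown here.

```python
from itertools import product


def parse_fasta(lines):
    """Parse FASTA-formatted lines into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def score_all_pairs(rnas, proteins, score_fn):
    """Return (rna_name, protein_name, score) for every RNA-protein pair."""
    return [
        (r, p, score_fn(rnas[r], proteins[p]))
        for r, p in product(rnas, proteins)
    ]


def placeholder_score(rna_seq, protein_seq):
    # Hypothetical stand-in for the trained model; always returns 0.0.
    return 0.0


rnas = parse_fasta([">rna1 example", "ACGU", "ACG", ">rna2", "GGCC"])
proteins = parse_fasta([">prot1", "MKVLAAGIC"])
pairs = score_all_pairs(rnas, proteins, placeholder_score)
```

To read real input, pass an open file handle to `parse_fasta`, since iterating over a file yields its lines.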


Reference

Xiaoyong Pan, Yong-Xian Fan, Junchi Yan and Hong-Bin Shen, IPMiner: Hidden ncRNA-protein interaction sequential pattern mining with stacked autoencoder for accurate computational prediction. Submitted.

© 2011 Computational Systems Biology/Shen Group.