Introduction
    To function properly, an RNA usually depends on an RNA-binding protein (RBP). A deregulated RBP may therefore prevent a certain type of RNA from performing its regulatory or translational function. RBPs are key players in post-transcriptional events. The versatility and structural flexibility of their RNA-binding domains allow RBPs to control the metabolism of a large number of transcripts. Approximately 1542 human RBPs have been identified, accounting for 7.5% of all proteins. RBPs are involved in almost every step of the post-transcriptional regulatory layer. They establish highly dynamic interactions with other proteins and RNAs, regulating RNA splicing, polyadenylation, stability, localization, translation and degradation. Studies have found that RBPs are dysregulated in different cancers. Deciphering the intricate and interconnected network between RBPs and their cancer-related RNA targets will therefore provide a better understanding of tumor biology and may lead to new cancer treatments. Because most RNAs can bind to more than one RBP, finding RBPs with similar binding capacity has become an important research direction.
    Many methods use machine learning models to identify RBP binding sites from RNA sequences. They mainly rely on the sequence or structural characteristics of the original RNA sequence to predict the binding site.
    iDeepMV is designed to predict which RBPs an unexplored RNA can bind to. It integrates multi-view feature learning, deep feature learning, and multi-label classification for RBP recognition. First, the amino acid sequence view and the dipeptide component view are extracted from the raw RNA sequences. Then, for each view, a dedicated deep neural network learns deep features, and the extracted deep features are used to train multi-label classifiers that take the correlation between labels into account. Finally, a voting mechanism combines the results of all views into a comprehensive multi-view decision to further improve prediction accuracy.
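    The view extraction step could be sketched as below. This is only an illustration, not the iDeepMV implementation: it assumes the amino acid view is obtained by translating codons with the standard genetic code and that the dipeptide view is a normalized 400-dimensional frequency vector, neither of which is stated on this page. The names translate and dipeptide_composition are hypothetical.

```python
from itertools import product

# ASSUMPTION: the amino acid view comes from codon translation with the
# standard genetic code; this page does not specify the mapping.
BASES = "TCAG"  # conventional ordering of the standard codon table (U written as T)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

RESIDUES = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
DIPEPTIDES = ["".join(p) for p in product(RESIDUES, repeat=2)]  # 400 pairs
DIPEPTIDE_INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def translate(rna):
    """Hypothetical helper: translate an RNA string codon by codon.

    Stop codons become '*'; codons with unknown bases become 'X'.
    A trailing partial codon is ignored.
    """
    seq = rna.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def dipeptide_composition(protein):
    """Hypothetical helper: frequency of each of the 400 dipeptides.

    Pairs containing '*' or 'X' are skipped. Returns all zeros when the
    sequence contains no countable pair.
    """
    counts = [0] * len(DIPEPTIDES)
    for first, second in zip(protein, protein[1:]):
        idx = DIPEPTIDE_INDEX.get(first + second)
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


protein = translate("AUGGCCAUGGCC")       # "MAMA"
features = dipeptide_composition(protein)  # 400 values summing to 1
```

    A fixed-length vector like this can be fed to a network regardless of the input RNA length, which is one plausible reason for using a composition view alongside the sequence view.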
Overall framework of iDeepMV
Dataset
    The data we used come from the AURA website. AURA stands for "Atlas of UTR Regulatory Activities". It is a manually curated, comprehensive catalog of human UTRs and UTR regulatory annotations. Its intuitive web interface gives full access to a wealth of information that integrates RNA sequence and structural data, regulatory and mutation sites, gene synonymy, gene and protein expression, and gene function descriptions from the scientific literature and specialized databases. All of this information is available through a variety of data mining tools. The full dataset from this website contains 137003 RNA sequences, 1264 regulatory factors and 2549510 binding sites between them. Because our aim is to study associations in RBP binding, and because not all RNAs involved in these binding sites are among the 137003 sequences, we finally selected 67 RBPs, 73681 RNA sequences and 550386 binding sites between them as the dataset for this research project.
About our model
    Our model is based on multi-view deep feature extraction and multi-label learning. First, the data processing module decomposes the input sequence into the initial features of the three views. Next, a trained convolutional neural network (CNN) extracts the deep features of each view. These deep features are then fed into a multi-label classifier trained with the classifier chain (CC) model to produce a preliminary prediction. Finally, a voting mechanism makes the final decision.
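    The last two stages can be illustrated with a toy sketch. It assumes that CC refers to a classifier chain, in which each label's classifier also sees the labels predicted earlier in the chain, and that the voting rule is a simple per-label majority across views; this page does not specify either detail. The functions chain_predict and majority_vote and the toy classifiers are hypothetical stand-ins for the trained models.

```python
def chain_predict(features, chain):
    """Predict labels one at a time along a classifier chain.

    Each classifier receives the original features followed by every
    label predicted earlier in the chain, so later labels can exploit
    correlations with earlier ones.
    """
    predicted = []
    for classify in chain:
        extended = list(features) + predicted
        predicted.append(int(classify(extended)))
    return predicted


def majority_vote(view_predictions):
    """Combine per-view binary label vectors by per-label majority.

    ASSUMPTION: a label is kept when more than half of the views
    predict it; the actual voting rule may differ.
    """
    n_views = len(view_predictions)
    n_labels = len(view_predictions[0])
    if any(len(p) != n_labels for p in view_predictions):
        raise ValueError("all views must predict the same number of labels")
    return [
        1 if 2 * sum(p[j] for p in view_predictions) > n_views else 0
        for j in range(n_labels)
    ]


# Toy chain for three labels over two features (hypothetical rules).
# The third classifier reads x[2], which is the first predicted label.
toy_chain = [
    lambda x: x[0] > 0.5,
    lambda x: x[1] > 0.5,
    lambda x: x[2] == 1 and x[1] > 0.2,
]

# For brevity the same toy chain serves all three views; in practice each
# view would have its own feature extractor and its own trained chain.
views = [[0.9, 0.3], [0.8, 0.7], [0.2, 0.1]]
per_view = [chain_predict(v, toy_chain) for v in views]
final = majority_vote(per_view)
print(per_view)  # [[1, 0, 1], [1, 1, 1], [0, 0, 0]]
print(final)     # [1, 0, 1]
```

    With an odd number of views, a per-label majority can never tie, which makes it a natural fit for a three-view setup.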
    Since our model is trained on the above dataset, it can only predict binding between the input RNA sequence and the 67 RNA-binding proteins in the library, giving 68 output labels in total once the negative class is included. The labels are: ALKBH5, FXR1, IGF2BP1, LIN28B, LIN28A, CELF1, PUM1, ADAR1, FMR1_iso1, RC3H1, MOV10, TNRC6B, PABPC1, FXR2, LARP4B, RBFOX2, ZC3H12A, HNRNPU, PARK7, ATXN2, WDR33, EIF3D, CPEB1, STAU1, ELAVL1, TARDBP, U2AF2, HNRNPA1, PCBP2, YTHDF2, EIF3A, DGCR8, IGF2BP2, PUM2, HNRNPA2B1, TAF15, HNRNPH1, TIA1, AGO1, DDX21, QKI, IGF2BP3, RBPMS, SRRM4, RBM10, EIF4A3, YTHDF1, ZC3H7B, C22ORF28, C17ORF85, HNRNPD, AGO2, METTL3, EWSR1, RBM47, FUS, EIF3G, ZFP36, EIF3B, HNRNPC, FMR1_iso7, HNRNPF, CAPRIN1, TARBP2, TIAL1, MSI1, AGO4, negative. The negative class indicates that the input RNA is predicted to bind none of the 67 RBPs.
Reference
  • Yang, Haitao; Deng, Zhaohong; Pan, Xiaoyong; Shen, Hongbin; Choi, Kup-Sze; Wang, Lei; Wang, Shitong; Wu, Jing.
    RNA-binding protein recognition based on multi-view deep feature and multi-label learning.
    Briefings in Bioinformatics, 2020, in press.