Background
RNA-binding proteins (RBPs) are involved in many biological processes, and their binding sites on RNAs can give insights into the mechanisms behind diseases involving RBPs. Identifying RBP binding sites on RNAs is therefore crucial for follow-up analyses, such as assessing the impact of mutations on binding sites. With the development of high-throughput sequencing, the number of experimentally verified RBP binding sites has grown explosively, e.g. the eCLIP data in ENCODE. However, detecting RBP binding sites with CLIP-seq relies on gene expression, which can be highly variable between experiments, so CLIP-seq cannot provide a complete picture of the RBP binding landscape. Computational methods are urgently needed to predict the missing binding sites for individual RBPs.
Because RBPs have different binding preferences, machine learning-based methods train protein-specific models, i.e. one model per RBP. In addition, RBPs can bind to both linear RNAs and circular RNAs (circRNAs), and an RBP may show different binding preferences for linear RNAs and circRNAs.
RBPsuite mainly contains two deep learning-based approaches, iDeepS and CRIP. iDeepS is developed for predicting RBP binding sites on linear RNAs, and CRIP is developed for predicting RBP binding sites on circular RNAs.
iDeepS for predicting RBP sites on linear RNAs
Here we modify iDeepS to handle sequences of variable length; it encodes sequence and structure jointly into a single one-hot encoded matrix.
1. To this end, we first convert the sequence into a one-hot encoded matrix. The given sequence string over an alphabet of size N (A, C, G, U) and the corresponding structure string over an alphabet of size M (F, T, I, H, M, S) are merged, position by position, into a single new string over an extended alphabet of size N*M. Concretely, the extended alphabet has size 4*6 = 24, and its symbols are
Alphabet = {AF, AT, AI, AH, AM, AS, CF, CT, CI, CH, CM, CS, GF, GT, GI, GH, GM, GS, UF, UT, UI, UH, UM, US} with indices from 0 to 23.
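The joint encoding in step 1 can be illustrated with a minimal sketch. It is not the iDeepS source code: the function name encode_joint and the constant names are hypothetical, and the sketch assumes that the structure string has already been computed and has the same length as the sequence. The index order follows the alphabet listed above (AF = 0, ..., US = 23).

```python
from itertools import product

NUCLEOTIDES = "ACGU"    # sequence alphabet, N = 4
STRUCTURES = "FTIHMS"   # structure alphabet, M = 6

# Extended alphabet in the order listed in the text: AF, AT, ..., US (0..23).
EXTENDED = [n + s for n, s in product(NUCLEOTIDES, STRUCTURES)]
INDEX = {symbol: i for i, symbol in enumerate(EXTENDED)}


def encode_joint(sequence, structure):
    """Return a len(sequence) x 24 one-hot matrix as a list of rows.

    Hypothetical helper for illustration only. Characters outside the two
    alphabets (for example N or T) raise KeyError in this sketch.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    matrix = []
    for nucleotide, element in zip(sequence.upper(), structure.upper()):
        row = [0] * len(EXTENDED)
        row[INDEX[nucleotide + element]] = 1  # one active column per position
        matrix.append(row)
    return matrix


# Example: a 4-nt sequence with its structure annotation.
example = encode_joint("ACGU", "FSSH")
active_columns = [row.index(1) for row in example]
```

Because each position yields exactly one row, the matrix has as many rows as the input has nucleotides, which is what allows sequences of variable length to be encoded without changing the number of columns.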
2. The one-hot encoded matrix is fed into a 1D CNN, whose feature maps are further fed into a bidirectional LSTM.
3. Two fully connected layers are used to determine whether the given linear RNA is a binding site or not.
CRIP for predicting RBP sites on circular RNAs
CRIP is developed to predict RBP binding sites on circular RNAs using deep learning. CRIP consists of a stacked codon-based encoding scheme and a hybrid deep learning architecture, in which a convolutional neural network (CNN) learns high-level abstract features and a recurrent neural network (RNN) learns long-range dependencies in the sequences. It is motivated by the following: 1) The mechanisms by which circRNAs interact with RBPs differ from those of other types of RNAs, so existing methods may not generalize well to circRNAs. 2) circRNAs provide limited information for the prediction.
1. CRIP uses a stacked codon-based encoding to get an initial representation for the RNA sequences with a one-hot encoded matrix, whose value is .
2. A CNN learns high-level features from the initial representation, and a long short-term memory (LSTM) network learns dependencies within the sequences.
3. Two fully connected layers are used to determine whether the given circRNA is a binding site or not.
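As a rough illustration of what a codon-based one-hot encoding (step 1) can look like, the sketch below slides a window of three nucleotides over the sequence and one-hot encodes each triplet. This is an assumption-laden sketch, not CRIP's implementation: the stride of 1, the 64-column layout (one column per triplet), the conversion of T to U, and the function name encode_codons are all assumptions made here for illustration, and CRIP's actual stacked scheme may group or stack codons differently.

```python
from itertools import product

# All 64 nucleotide triplets in a fixed order: AAA, AAC, ..., UUU.
# The 64-way layout is an assumption of this sketch, not taken from CRIP.
CODONS = ["".join(triplet) for triplet in product("ACGU", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def encode_codons(sequence, stride=1):
    """One-hot encode overlapping triplets of an RNA sequence.

    Hypothetical helper: returns one 64-long row per triplet. With
    stride=1, a sequence of length L gives L - 2 rows (none if L < 3).
    """
    seq = sequence.upper().replace("T", "U")  # accept DNA-style input
    rows = []
    for start in range(0, len(seq) - 2, stride):
        row = [0] * len(CODONS)
        row[CODON_INDEX[seq[start:start + 3]]] = 1
        rows.append(row)
    return rows


# Example: ACGUA contains the overlapping triplets ACG, CGU, GUA.
codon_matrix = encode_codons("ACGUA")
```

Overlapping triplets keep local context of neighbouring nucleotides in every row, which is one plausible way to enrich the limited information available from a circRNA fragment.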