Background
RNA-binding proteins (RBPs) are involved in many biological processes, and their binding sites on RNAs can give insight into the mechanisms behind diseases involving RBPs. Identifying RBP binding sites on RNAs is therefore crucial for follow-up analyses, such as assessing the impact of mutations on binding sites. With the development of high-throughput sequencing, the number of experimentally verified RBP binding sites has grown rapidly, e.g. the eCLIP data in ENCODE. However, detecting RBP binding sites with CLIP-seq relies on gene expression, which can be highly variable between experiments, so CLIP-seq cannot provide a complete picture of the RBP binding landscape. Computational methods are therefore urgently needed to predict missing binding sites for individual RBPs.
Because RBPs have different binding preferences, machine learning-based methods train protein-specific models, one model per RBP. In addition, RBPs can bind to both linear RNAs and circular RNAs (circRNAs), and an RBP may show different binding preferences for linear RNAs and circRNAs. RBPsuite therefore contains two deep learning-based approaches: iDeepS, developed for predicting RBP binding sites on linear RNAs, and iDeepC, developed for predicting RBP binding sites on circRNAs.
iDeepS for predicting RBP sites on linear RNAs
Here we modify iDeepS to handle sequences of variable length. It encodes the sequence and the structure jointly into a single one-hot encoded matrix.
1. To this end, we first convert the sequence into a one-hot encoded matrix. The given sequence string over an alphabet of size N (A, C, G, U) and the corresponding structure string over an alphabet of size M (F, T, I, H, M, S) are merged into a single new string over an extended alphabet of size N*M. A specific example is given below. The extended alphabet has size 4*6 = 24: Alphabet = {AF, AT, AI, AH, AM, AS, CF, CT, CI, CH, CM, CS, GF, GT, GI, GH, GM, GS, UF, UT, UI, UH, UM, US}, with indices from 0 to 23.
2. The one-hot encoded matrix is fed into a 1D CNN, whose feature maps are further fed into a bidirectional LSTM.
3. Two fully connected layers are used to determine whether the given linear RNA is a binding site or not.
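The joint encoding in step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RBPsuite code: the function name encode_joint is hypothetical, and mapping symbols outside the two alphabets (for example N) to an all-zero row is an assumption that the text above does not specify. The index order follows the alphabet listed in step 1.

```python
from itertools import product

SEQ_ALPHABET = "ACGU"       # sequence alphabet, size N = 4
STRUCT_ALPHABET = "FTIHMS"  # structure alphabet, size M = 6

# Extended alphabet in the order given in the text: AF, AT, ..., US -> 0..23
JOINT_INDEX = {
    nuc + struct: i
    for i, (nuc, struct) in enumerate(product(SEQ_ALPHABET, STRUCT_ALPHABET))
}


def encode_joint(sequence, structure):
    """Return a (length x 24) one-hot matrix as a list of rows.

    Hypothetical helper: each position is encoded by the pair
    (nucleotide, structure symbol) at that position.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    matrix = []
    for nuc, struct in zip(sequence.upper(), structure.upper()):
        row = [0] * len(JOINT_INDEX)
        idx = JOINT_INDEX.get(nuc + struct)
        # Assumption: unknown symbols are left as an all-zero row.
        if idx is not None:
            row[idx] = 1
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    encoded = encode_joint("ACGU", "FTHS")
    # Print the joint index that is set at each position
    print([row.index(1) for row in encoded])
```

Because the matrix has one row per nucleotide, its height follows the input length, which is what allows variable-length sequences to be passed to the convolutional layer in step 2.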
iDeepC for predicting RBP sites on circular RNAs
We present iDeepC, an RBP-specific method for predicting RBP binding sites on circRNAs from sequences. iDeepC adopts a Siamese-like neural network consisting of a network module with a lightweight attention mechanism and a metric module. The pre-trained network module learns embeddings for a pair of sequences, and the difference between the two embeddings is fed to the metric module to estimate the binding potential. iDeepC is able to capture mutual information between circRNAs and thus mitigates data scarcity for poorly characterized RBPs.
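The Siamese-style data flow described above (one shared embedding function applied to both sequences, with the embedding difference passed to a scoring step) can be illustrated with a toy Python sketch. Everything here is a stand-in and not the iDeepC model: embed uses 2-mer frequencies in place of the pre-trained attention network, binding_score uses a logistic function in place of the metric module, the element-wise absolute difference is an assumption, and the weights and bias are hypothetical values that would normally be learned.

```python
import math
from itertools import product

# All 16 dinucleotides over the RNA alphabet
KMERS = ["".join(p) for p in product("ACGU", repeat=2)]


def embed(seq):
    """Stand-in for the shared network module: normalized 2-mer frequencies."""
    counts = {k: 0 for k in KMERS}
    for i in range(len(seq) - 1):
        kmer = seq[i:i + 2]
        if kmer in counts:
            counts[kmer] += 1
    total = max(sum(counts.values()), 1)  # avoid division by zero
    return [counts[k] / total for k in KMERS]


def binding_score(seq_a, seq_b, weights, bias):
    """Stand-in for the metric module applied to the embedding difference."""
    emb_a = embed(seq_a)  # the same embedding function is used
    emb_b = embed(seq_b)  # for both inputs (the Siamese part)
    # Assumption: element-wise absolute difference of the two embeddings
    diff = [abs(a - b) for a, b in zip(emb_a, emb_b)]
    z = bias + sum(w * d for w, d in zip(weights, diff))
    return 1.0 / (1.0 + math.exp(-z))  # logistic score in (0, 1)


if __name__ == "__main__":
    weights = [-1.0] * len(KMERS)  # hypothetical, untrained weights
    print(binding_score("ACGUACGUAC", "ACGUUCGUAC", weights, bias=0.0))
```

Scoring pairs rather than single sequences means that a small set of known sites yields many training pairs, which is one way to read the claim that the pairwise design helps with data scarcity.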