Background
RNA-binding proteins (RBPs) are involved in many biological processes, and their binding sites on RNAs can give insights into the mechanisms behind diseases involving RBPs. Identifying RBP binding sites on RNAs is therefore crucial for follow-up analyses, such as assessing the impact of mutations on binding sites. With the development of high-throughput sequencing, the number of experimentally verified RBP binding sites has grown rapidly, e.g. the eCLIP data in ENCODE. However, detecting RBP binding sites with CLIP-seq relies on gene expression, which can be highly variable between experiments, so CLIP-seq cannot provide a complete picture of the RBP binding landscape. Computational methods are therefore urgently needed to predict missing binding sites for individual RBPs.
Because RBPs have different binding preferences, machine learning-based methods typically train protein-specific models, with one model trained per RBP. In addition, an RBP can bind to both linear RNAs and circular RNAs (circRNAs), and may show different binding preferences for the two. RBPsuite mainly contains two deep learning-based RBP-specific approaches, iDeepS and iDeepC, and we recently added an RBP-general method, iDeepG. iDeepS is developed for predicting RBP binding sites on linear RNAs. iDeepC is developed for predicting RBP binding sites on circular RNAs. iDeepG is an RBP-general model designed to predict binding sites on RNAs for any RBP.
iDeepS for predicting RBP sites on linear RNAs
Here we modify iDeepS to handle sequences of variable length. It encodes the sequence and its structure into a single one-hot encoded matrix.
1. To this end, we first combine the given sequence string, over an alphabet of size N (A, C, G, U), and the corresponding structure string, over an alphabet of size M (F, T, I, H, M, S), into a single new string over an extended alphabet of size N*M, and then convert this string into a one-hot encoded matrix. The extended alphabet has size 4*6 = 24: Alphabet = {AF, AT, AI, AH, AM, AS, CF, CT, CI, CH, CM, CS, GF, GT, GI, GH, GM, GS, UF, UT, UI, UH, UM, US}, indexed from 0 to 23. For example, nucleotide A with structure label S becomes the combined symbol AS, which has index 5.
2. The one-hot encoded matrix is fed into a 1D CNN, whose feature maps are further fed into a bidirectional LSTM.
3. Two fully connected layers are used to determine whether the given linear RNA is a binding site or not.
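The encoding in step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the iDeepS source code: the names SEQ_ALPHABET, STRUCT_ALPHABET, EXTENDED_INDEX and encode_one_hot are hypothetical, and the sketch assumes the sequence already uses U rather than T and that the structure string has one label per nucleotide.

```python
from itertools import product

SEQ_ALPHABET = "ACGU"        # N = 4 nucleotides
STRUCT_ALPHABET = "FTIHMS"   # M = 6 structure labels

# Extended alphabet in the order listed in the text: AF, AT, ..., US -> 0..23
EXTENDED_INDEX = {
    nt + st: i
    for i, (nt, st) in enumerate(product(SEQ_ALPHABET, STRUCT_ALPHABET))
}


def encode_one_hot(sequence, structure):
    """Return an L x 24 one-hot matrix (list of rows) for a sequence/structure pair.

    Assumption: both strings have the same length and only use the
    symbols above; any other symbol raises KeyError.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    matrix = []
    for nt, st in zip(sequence, structure):
        row = [0] * len(EXTENDED_INDEX)
        row[EXTENDED_INDEX[nt + st]] = 1  # mark the combined symbol, e.g. "AS"
        matrix.append(row)
    return matrix


# Usage: a 4-nucleotide toy input gives a 4 x 24 matrix
toy = encode_one_hot("ACGU", "FSHT")
print(len(toy), len(toy[0]))
```

Because the matrix has one row per nucleotide, its height simply follows the input length, which is what lets a 1D CNN followed by a recurrent layer process sequences of variable length.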
iDeepC for predicting RBP sites on circular RNAs
We present an RBP-specific method, iDeepC, for predicting RBP binding sites on circRNAs from sequences. iDeepC adopts a Siamese-like neural network consisting of a network module with lightweight attention and a metric module. The pre-trained network module learns embeddings for a pair of sequences, and the difference between the two embeddings is fed to the metric module to estimate the binding potential. iDeepC is able to capture mutual information between circRNAs and thus mitigates data scarcity for poorly characterized RBPs.
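The Siamese-like data flow described above (one shared encoder applied to both sequences, then a metric on the embedding difference) can be illustrated with a toy sketch. Everything here is a stand-in and not the iDeepC implementation: the encoder is a fixed dinucleotide-frequency vector rather than a learned attention network, the difference is assumed to be an element-wise absolute difference, and the metric is a plain logistic function. The names embed, metric and pair_score are hypothetical.

```python
import math
from itertools import product

KMERS = ["".join(p) for p in product("ACGU", repeat=2)]  # 16 dinucleotides


def embed(sequence):
    """Toy shared encoder: dinucleotide frequencies (stand-in for the learned network module)."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(sequence) - 1):
        kmer = sequence[i:i + 2]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values()) or 1  # avoid division by zero for very short input
    return [counts[k] / total for k in KMERS]


def metric(diff, weights, bias):
    """Toy metric module: logistic function of a weighted sum of the embedding difference."""
    z = sum(w * d for w, d in zip(weights, diff)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def pair_score(seq_a, seq_b, weights, bias=0.0):
    """Embed both sequences with the same encoder, then score their absolute difference."""
    emb_a, emb_b = embed(seq_a), embed(seq_b)
    diff = [abs(a - b) for a, b in zip(emb_a, emb_b)]
    return metric(diff, weights, bias)


# Usage with arbitrary (untrained) weights
weights = [1.0] * len(KMERS)
print(pair_score("ACGUACGU", "GGGGCCCC", weights))
```

The point of the sketch is the weight sharing: both inputs pass through the same embed function, so the score depends only on how the two embeddings differ, and swapping the two inputs does not change it.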
iDeepG for predicting RBP-general binding sites on RNAs
We present an RBP-general method, iDeepG, for predicting RBP binding sites on RNAs from sequences for any RBP. iDeepG uses cross-attention networks with language models to integrate multi-modal RNA and RBP information, and is capable of precisely predicting binding sites on RNAs across a wide range of RBPs. iDeepG can accurately predict binding sites across RNAs for 168 RBPs in ENCODE, especially for unseen RNAs and RBPs in the inductive setting, demonstrating robust generalization and strong predictive performance.
Figure. The flowchart of RBP-general iDeepG for predicting RBP binding sites on RNAs
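To make the cross-attention idea behind iDeepG more concrete, the following is a generic scaled dot-product cross-attention sketch in plain Python. It is not the iDeepG code: it omits the learned query, key and value projections, and the choice of RNA token embeddings as queries and RBP token embeddings as keys and values is an assumption made only for illustration. The function names softmax and cross_attention are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key/value rows.

    queries: list of vectors (assumed here: RNA token embeddings)
    keys, values: lists of vectors (assumed here: RBP token embeddings)
    Returns one output vector per query.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Usage: 2 RNA tokens attend over 3 protein tokens (toy 2-dimensional embeddings)
rna = [[1.0, 0.0], [0.0, 1.0]]
protein = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(rna, protein, protein)
print(len(fused), len(fused[0]))
```

Each output row is a weighted mix of the protein-side vectors, with weights that depend on the RNA-side query, which is one common way to let one modality condition on another.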