CircSite_introduction

Background

Circular RNAs (circRNAs) interact with RNA-binding proteins (RBPs) to modulate gene expression. To date, most computational methods for predicting RBP binding sites on circRNAs operate on circRNA fragments rather than on full-length circRNAs. These methods detect whether a circRNA fragment contains a binding site, but cannot determine where the binding sites are located or how many binding sites the whole circRNA contains. We report a hybrid deep learning-based tool, called CircSite, that predicts RBP binding sites at single-nucleotide resolution and detects the key contributing sequence contents on circRNAs. CircSite takes advantage of a convolutional neural network (CNN) and a Transformer to learn local and global representations, respectively. We construct 37 datasets of RBP-binding circRNAs, and the experimental results show that CircSite offers accurate predictions of RBP binding nucleotides and detects known binding motifs. To the best of our knowledge, CircSite is the first computational tool to explore the binding nucleotides of RBPs on circRNAs. The source code of CircSite can also be found at CircSite

Method

We design a hybrid deep network consisting of a CNN, a BiGRU and a Transformer to predict RBP binding nucleotides on a circRNA. First, we use a sliding window with a step size of one to split each circRNA into fragments, and each fragment is represented as a one-hot encoded matrix. These matrices are first fed into a 1-D CNN, whose output is then passed to the BiGRU and the Transformer, respectively. Then, the two learned representations are concatenated and fed into an MLP classifier to obtain binding scores for individual fragments. Finally, these scores are post-processed using a median filter and threshold binarization to obtain the binding nucleotides on the circRNAs.
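The input preparation described above (sliding window with a step size of one, then one-hot encoding) can be sketched as follows. This is a minimal illustration, not the CircSite implementation: the function names, the `ACGU` column order, the window length of 5 and the treatment of sequence ends (no padding or wrap-around) are all assumptions made for the example.

```python
# Minimal sketch of fragment extraction and one-hot encoding.
# All names and parameter values here are hypothetical, chosen for illustration.

ALPHABET = "ACGU"  # assumed nucleotide order for the one-hot columns


def sliding_fragments(seq, window, step=1):
    """Return every fragment of length `window`, advancing `step` nucleotides at a time."""
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, step)]


def one_hot(fragment):
    """Encode a fragment as a len(fragment) x 4 matrix.

    Bases outside ALPHABET (for example N) are encoded as an all-zero row.
    """
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in fragment]


seq = "AUGGCUAC"                                # toy sequence, not real data
fragments = sliding_fragments(seq, window=5)    # 8 - 5 + 1 = 4 fragments
encoded = [one_hot(f) for f in fragments]       # one 5 x 4 matrix per fragment
```

Each matrix in `encoded` corresponds to one fragment and would be the input to the 1-D CNN. A real pipeline would use a much longer window than the toy value here.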

Figure. The pipeline of CircSite. The circRNAs are split into fragments, which are first fed into a 1-D CNN, followed by a BiGRU and a Transformer, respectively. The concatenated representations are then fed into an MLP classifier, followed by a median filter and threshold binarization to obtain the binding nucleotides. L is the number of repetitions of the corresponding block. LN is layer normalization. h_t is the output of the last time step of the BiGRU and Z_0 is the Transformer output at the token position, where the token is set at dimension 0 of the input matrix. A is the predicted binding scores of individual nucleotides on a circRNA, B is the result after median filtering, and C is the final binding nucleotides on the circRNA after binarization.
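The post-processing step that turns A into B and then C can be sketched as below. This is a hedged illustration only: the kernel size of 3, the threshold of 0.5, the truncated windows at the sequence ends and the toy scores are assumptions, since the text does not state the values CircSite uses.

```python
# Minimal sketch of median filtering followed by threshold binarization.
# Kernel size, threshold and edge handling are hypothetical choices.
from statistics import median


def median_filter(scores, kernel=3):
    """Replace each score by the median of a window centred on it.

    At the two ends the window is truncated, so it may hold an even number
    of values; statistics.median then returns the mean of the two middle ones.
    """
    half = kernel // 2
    return [median(scores[max(0, i - half):i + half + 1]) for i in range(len(scores))]


def binarize(scores, threshold=0.5):
    """Mark a nucleotide as binding (1) when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


A = [0.1, 0.2, 0.9, 0.1, 0.8, 0.9, 0.7, 0.2, 0.1]  # toy per-nucleotide scores
B = median_filter(A, kernel=3)                      # smoothed scores
C = binarize(B, threshold=0.5)                      # final 0/1 binding labels
```

In this toy input the isolated high score at index 2 is suppressed and the dip at index 3 is filled, so `C` contains a single contiguous binding region. That is the usual motivation for smoothing before thresholding.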