Benchmark datasets |
In this study, we construct a benchmark dataset of RBP-binding circRNAs for 37 RBPs; it consists of training and test sets at the nucleotide level. In addition, we construct an independent test set of circRNAs for 37 RBPs, which is used to evaluate how well CircSite predicts whether a circRNA can interact with a given RBP. |
Nucleotide-level training and test sets (Click here (56.4M) to download): We first construct benchmark datasets of RBP binding sites on circRNAs. Over 120,000 circRNA sequences for 37 RBPs are extracted from the CircInteractome database (https://circinteractome.nia.nih.gov/). For each RBP, we first split the binding sequences into training and test sets at a ratio of 8:2. Because high sequence similarity between the training and test sets may lead to overestimated performance, we use CD-HIT with a similarity threshold of 0.8 to remove redundant sequences from the test sets. In addition, to avoid potential bias caused by sequences that are too short or too long, we keep only sequences with a length between 200 and 6000 nucleotides; sequences within this interval account for over 90% of all sequences. |
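The length filter and the 8:2 split described above can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: the function names, the inclusive length bounds, the fixed random seed, and the order in which the two steps are applied are all assumptions, and the CD-HIT redundancy removal is not shown here.

```python
import random

# Length bounds stated in the text; treating them as inclusive is an assumption.
MIN_LEN, MAX_LEN = 200, 6000


def filter_by_length(seqs, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep only sequences whose length lies in [min_len, max_len].

    `seqs` is a dict mapping a sequence ID to its nucleotide string.
    """
    return {sid: s for sid, s in seqs.items() if min_len <= len(s) <= max_len}


def split_train_test(seqs, train_fraction=0.8, seed=0):
    """Randomly split sequences into training and test sets (8:2 by default).

    The seed is a hypothetical choice so that the split is reproducible.
    """
    ids = sorted(seqs)                 # sort first so the shuffle is deterministic
    random.Random(seed).shuffle(ids)
    cut = int(round(len(ids) * train_fraction))
    train = {sid: seqs[sid] for sid in ids[:cut]}
    test = {sid: seqs[sid] for sid in ids[cut:]}
    return train, test


if __name__ == "__main__":
    # Toy sequences for one RBP: ten in range, one too short, one too long.
    toy = {f"circ{i}": "A" * (200 + i) for i in range(10)}
    toy["too_short"] = "A" * 50
    toy["too_long"] = "A" * 7000

    kept = filter_by_length(toy)
    train, test = split_train_test(kept)
    print(len(kept), len(train), len(test))
```

In this sketch the split would be run separately for each RBP, so that every RBP keeps its own 8:2 ratio.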
Independent circRNA test set (Click here (5.2M) to download): To better explore the advantages of CircSite, we try to predict whether a circRNA can interact with a given RBP. A total of 37 datasets, one per RBP, are collected; each dataset consists of 100 positive and negative circRNAs randomly selected from the CircInteractome database. The positive samples are circRNAs that have binding sites for the given RBP and do not appear in the training set. The negative samples are circRNAs that have no binding sites for the given RBP. To make the evaluation objective, we use CD-HIT with a similarity threshold of 0.8 to remove sequences in the independent test set that are redundant with the training set. In this independent test set, each sample is a whole circRNA. |
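A rough sketch of how such an independent set could be assembled is given below. It is an illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, sampling 100 circRNAs per class is one reading of the text, and the choice of `cd-hit-est-2d` (the CD-HIT program for comparing two nucleotide datasets) with word size 5 is an assumption, since the text only says "CD-HIT" with a threshold of 0.8.

```python
import random


def sample_independent_set(binding_ids, non_binding_ids, training_ids,
                           n_per_class=100, seed=0):
    """Sample positive and negative circRNA IDs for one RBP.

    Positives bind the RBP and are absent from the training set;
    negatives have no binding site for the RBP.
    n_per_class=100 is an assumed reading of the text.
    """
    rng = random.Random(seed)
    pos_pool = sorted(set(binding_ids) - set(training_ids))
    neg_pool = sorted(set(non_binding_ids))
    positives = rng.sample(pos_pool, min(n_per_class, len(pos_pool)))
    negatives = rng.sample(neg_pool, min(n_per_class, len(neg_pool)))
    return positives, negatives


def cd_hit_2d_command(train_fasta, test_fasta, out_fasta, identity=0.8):
    """Build a cd-hit-est-2d command line (assumed tool choice).

    The output FASTA holds the test sequences that are NOT similar to any
    training sequence at the given identity threshold.
    """
    return [
        "cd-hit-est-2d",
        "-i", train_fasta,    # reference set (training sequences)
        "-i2", test_fasta,    # set to be filtered (independent test set)
        "-o", out_fasta,
        "-c", str(identity),  # sequence identity threshold
        "-n", "5",            # word size suggested for thresholds 0.80 to 0.85
    ]


if __name__ == "__main__":
    binding = [f"circ{i}" for i in range(300)]
    non_binding = [f"neg{i}" for i in range(300)]
    training = binding[:150]
    pos, neg = sample_independent_set(binding, non_binding, training)
    print(len(pos), len(neg))
    # File names below are placeholders.
    print(" ".join(cd_hit_2d_command("train.fa", "independent.fa", "independent_nr.fa")))
```

The command is only built here, not executed; it could be passed to `subprocess.run` on a machine where CD-HIT is installed.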