INSP dataset for validation
To evaluate how well our models predict Nuclear Localization Signals (NLS), we use the INSP dataset as a validation benchmark. INSP was sourced from the latest NLSdb release (2017) and from Swiss-Prot entries available before the INSP publication date (2020). It has three parts: a training set and hybrid test sets, which contain proteins from a diverse range of species, and a specialized yeast test set, which contains only yeast proteins.
Various support levels data for NLS
In addition to NLS data supported by 'ECO:0000269', we extract NLS data supported by other levels of evidence. In Swiss-Prot, 'ECO:0000255' marks information generated by sequence analysis programs and confirmed by a curator. 'ECO:0000303' marks information retrieved from scientific articles without experimental support. 'ECO:0000305' marks information inferred manually by curators based on their scientific knowledge. 'ECO:0000256' marks information generated by the automatic annotation system of UniProtKB.
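As a rough illustration of how annotations could be separated by evidence level, the sketch below groups NLS records by their ECO code. The record layout (a dict with `protein`, `start`, `end`, and `evidence` keys), the function name `group_by_evidence`, and the toy identifiers are assumptions made for this example; they are not the Swiss-Prot file format or the pipeline used to build these datasets.

```python
from collections import defaultdict

# Evidence codes and their meanings, as described in the text above.
ECO_LABELS = {
    "ECO:0000269": "experimentally validated",
    "ECO:0000255": "sequence analysis program, confirmed by a curator",
    "ECO:0000303": "from scientific articles, no experimental support",
    "ECO:0000305": "inferred by curators",
    "ECO:0000256": "automatic annotation system",
}


def group_by_evidence(annotations):
    """Group NLS annotations by ECO evidence code.

    `annotations` is a hypothetical list of dicts, each with an
    'evidence' key holding one ECO code string. Codes that are not
    listed in ECO_LABELS are collected under the key 'other'.
    """
    groups = defaultdict(list)
    for ann in annotations:
        code = ann["evidence"] if ann["evidence"] in ECO_LABELS else "other"
        groups[code].append(ann)
    return dict(groups)


# Toy records with made-up identifiers and coordinates.
example = [
    {"protein": "toy_1", "start": 10, "end": 20, "evidence": "ECO:0000269"},
    {"protein": "toy_2", "start": 5, "end": 12, "evidence": "ECO:0000255"},
    {"protein": "toy_3", "start": 30, "end": 41, "evidence": "ECO:0000269"},
]
grouped = group_by_evidence(example)
for code, records in grouped.items():
    print(code, ECO_LABELS.get(code, "other"), len(records))
```

Keeping the groups separate makes it possible to evaluate a model on experimentally validated signals alone, or to add the weaker evidence levels one at a time.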
Characteristic domain recognition datasets
To test the ability of NLSExplorer to detect characteristic areas, we extract proteins with labels for DNA-binding, tRNA interaction, and RNA cap binding, all with 'ECO:0000269' evidence, indicating they are experimentally validated.
Training dataset of A2KA
We collect all Swiss-Prot protein sequences located within the nucleus with experimental evidence, which gives 14,316 proteins from 416 species as positive samples. To build a balanced training dataset, we keep the numbers of positive and negative samples equal: we select 14,316 negative samples from the 528,826 Swiss-Prot proteins that are not localized within the nucleus. We denote this dataset NLSExplorer-p.
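The balancing step can be sketched as follows. This is a minimal example under stated assumptions: the text does not say how the negative samples were chosen, so uniform random sampling with a fixed seed is an assumption here, and the function name `build_balanced_dataset` and the toy sequences are hypothetical.

```python
import random


def build_balanced_dataset(positives, negative_pool, seed=0):
    """Return a shuffled list of (sequence, label) pairs with equal class sizes.

    Every positive is kept with label 1. The same number of negatives is
    drawn without replacement from `negative_pool` and given label 0.
    Uniform random sampling is an assumption, not the documented procedure.
    """
    if len(negative_pool) < len(positives):
        raise ValueError("negative pool is smaller than the positive set")
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    negatives = rng.sample(negative_pool, k=len(positives))
    dataset = [(seq, 1) for seq in positives] + [(seq, 0) for seq in negatives]
    rng.shuffle(dataset)
    return dataset


# Toy stand-ins for the nuclear and non-nuclear protein sequences.
toy_positives = ["MKRPAATKKAGQAKKKK", "MPKKKRKVEDP", "MSRRRKANPTKLSE"]
toy_negative_pool = ["MALWMRLLPL", "MKTAYIAKQR", "MGSSHHHHHH", "MVLSPADKTN", "MTEYKLVVVG"]

balanced = build_balanced_dataset(toy_positives, toy_negative_pool)
n_pos = sum(label for _, label in balanced)
n_neg = len(balanced) - n_pos
print(n_pos, n_neg)
```

With the real data, `positives` would hold the 14,316 nuclear proteins and `negative_pool` the 528,826 non-nuclear ones, so the result would contain 28,632 labeled sequences.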
Datasets download
All of our datasets are open to the public and are packaged in CSV format unless noted otherwise in the list below. In addition, our NLS Candidate Library lets you download the potential NLSs predicted by NLSExplorer that match your interests.
Various support levels data for NLS
DNA-binding
RNA cap binding
Nuclear Export Signals
tRNA interaction
NLSExplorer-p (in pkl format)
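Once downloaded, the files can be read with the Python standard library. The sketch below is one possible loader, not an official one: the function name `load_dataset` and the file names in the usage comment are hypothetical, and the column names and the structure of the pkl object are not specified here, so the code makes no assumption about them.

```python
import csv
import pickle
from pathlib import Path


def load_dataset(path):
    """Load a downloaded dataset file based on its extension.

    CSV files are returned as a list of dicts keyed by the header row.
    Pickle files are returned as whatever object they contain. Only
    unpickle files from a source you trust, because unpickling can
    execute arbitrary code.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    if suffix in {".pkl", ".pickle"}:
        with path.open("rb") as handle:
            return pickle.load(handle)
    raise ValueError(f"unsupported file type: {path.suffix}")


# Example usage with hypothetical local file names:
# dna_binding = load_dataset("dna_binding.csv")
# nlsexplorer_p = load_dataset("nlsexplorer_p.pkl")
```

Inspecting the first CSV row, or the type of the unpickled object, is a quick way to see which fields each file provides before building on it.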