We collect two nucleic acid-binding protein benchmark datasets from the BioLiP database (1) and split them into
training and test sets according to release date. In addition, we compare GraphBind with
state-of-the-art methods for predicting small ligand-binding residues, including Ca2+,
Mn2+, Mg2+, ATP and HEME.
1. Nucleic acid-binding proteins
We collect high-quality DNA- and RNA-binding protein datasets from the BioLiP database on December 5, 2018.
According to the release date, protein chains released before January 6, 2016 are assigned to the original
training sets, and the remaining chains are assigned to the original test sets.
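The date-based split can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `split_by_release_date`, the `(chain_id, release_date)` input layout, and the example identifiers are all assumptions; only the cutoff date comes from the text.

```python
from datetime import date

# Cutoff from the text: chains released before this date go to training.
CUTOFF = date(2016, 1, 6)


def split_by_release_date(chains, cutoff=CUTOFF):
    """Split (chain_id, release_date) pairs into training and test ID lists."""
    train = [cid for cid, released in chains if released < cutoff]
    test = [cid for cid, released in chains if released >= cutoff]
    return train, test


# Hypothetical chain identifiers and dates, for illustration only.
example = [
    ("1abc_A", date(2015, 3, 1)),
    ("2xyz_B", date(2016, 1, 6)),
    ("3def_C", date(2017, 9, 12)),
]
train_ids, test_ids = split_by_release_date(example)
```

A chain released exactly on the cutoff date falls into the test set here, because the text assigns only chains released before January 6, 2016 to training.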
We apply data augmentation to increase the number of training samples while reducing sequence redundancy.
To reduce the sequence similarity between the training set and the test set,
we remove from the test set every sequence with over 30% sequence similarity to any sequence in the training set.
After transferring binding annotations, we further remove redundant proteins so that the sequence identity within the training set is less than 30%.
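The two filtering steps can be illustrated with a small sketch. The text does not say which tool computes sequence similarity, so `toy_identity` below is only a placeholder (position-wise matching without alignment) that stands in for a real alignment or clustering program; the function names and the greedy strategy in `reduce_redundancy` are likewise assumptions.

```python
def toy_identity(a, b):
    """Placeholder score: matching positions over the longer length (no alignment)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def filter_test_set(train_seqs, test_seqs, identity=toy_identity, threshold=0.30):
    """Drop test sequences scoring over the threshold against any training sequence."""
    return [
        t for t in test_seqs
        if all(identity(t, s) <= threshold for s in train_seqs)
    ]


def reduce_redundancy(seqs, identity=toy_identity, threshold=0.30):
    """Greedy filter: keep a sequence only if it scores below the threshold
    against every sequence already kept."""
    kept = []
    for s in seqs:
        if all(identity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept


# Hypothetical sequences, for illustration only.
filtered_test = filter_test_set(["ACDEFGHIKL"], ["ACDEFGHIKL", "WWWWWWWWWW"])
non_redundant = reduce_redundancy(["AAAA", "AAAT", "CCCC"])
```

The greedy filter depends on input order, so a real pipeline would more likely rely on a dedicated clustering tool; the sketch only shows where the 30% threshold enters each step.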
Nucleic acid-binding protein datasets can be downloaded from Download datasets.
For nucleic acid-binding protein training sets, every 4 lines represent a protein chain:
(1) PDB ID;
(2) Chain;
(3) Transferred binding annotations, where "1" and "0" indicate nucleic acid-binding and non-nucleic acid-binding residues;
(4) Non-transferred binding annotations, where "1" and "0" indicate nucleic acid-binding and non-nucleic acid-binding residues.
For nucleic acid-binding protein test sets, every 3 lines represent a protein chain:
(1) PDB ID;
(2) Chain;
(3) Binding annotations, where "1" and "0" indicate nucleic acid-binding and non-nucleic acid-binding residues.
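The two record layouts above can be read with a short parser such as the one below. It is a sketch under stated assumptions: blank lines are ignored, each annotation line is a contiguous string of "0" and "1" characters with no separators, and the second line of each record is stored verbatim under the key `chain`. The function names and the sample record are hypothetical.

```python
def _group(lines, size):
    """Group non-empty lines into consecutive records of `size` lines."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % size != 0:
        raise ValueError(f"expected a multiple of {size} lines, got {len(rows)}")
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def _labels(text):
    """Convert an annotation string of '0'/'1' characters to a list of ints."""
    if set(text) - {"0", "1"}:
        raise ValueError(f"unexpected annotation characters in {text!r}")
    return [int(c) for c in text]


def parse_training(lines):
    """Parse 4-line records: PDB ID, chain, transferred, non-transferred labels."""
    return [
        {"pdb_id": pdb_id, "chain": chain,
         "transferred": _labels(transferred),
         "non_transferred": _labels(non_transferred)}
        for pdb_id, chain, transferred, non_transferred in _group(lines, 4)
    ]


def parse_test(lines):
    """Parse 3-line records: PDB ID, chain, binding labels."""
    return [
        {"pdb_id": pdb_id, "chain": chain, "labels": _labels(labels)}
        for pdb_id, chain, labels in _group(lines, 3)
    ]


# Hypothetical record contents, for illustration only.
train_records = parse_training(["1abc", "A", "00110", "00100"])
test_records = parse_test(["2xyz", "B", "01000"])
```

To read a downloaded file, pass an open file handle, for example `parse_training(open(path))`, where `path` points to one of the training set files.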
Table 1. Composition of the nucleic acid-binding protein datasets.
Type | Dataset | NProtein (a) | Npos (b) | Nneg (c) | PNratio (d) |
---|---|---|---|---|---|
DNA | DNA-573_Train (Transferred) | 573 | 14479 | 145404 | 0.100 |
DNA | DNA-573_Train (Non-transferred) | 573 | 11074 | 148809 | 0.074 |
DNA | DNA-129_Test | 129 | 2240 | 35275 | 0.064 |
RNA | RNA-495_Train (Transferred) | 495 | 14609 | 122290 | 0.119 |
RNA | RNA-495_Train (Non-transferred) | 495 | 11756 | 125143 | 0.094 |
RNA | RNA-117_Test | 117 | 2031 | 35314 | 0.058 |
Note:
a Number of proteins;
b Number of binding residues;
c Number of non-binding residues;
d PNratio = Npos/Nneg;
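As a quick check of the PNratio definition, the snippet below recomputes the ratio from the residue counts listed in Table 1. The helper name `pn_ratio` and the dictionary layout are illustrative choices; the counts are copied from the table.

```python
def pn_ratio(n_pos, n_neg):
    """Ratio of binding to non-binding residues (PNratio = Npos/Nneg)."""
    return n_pos / n_neg


# (Npos, Nneg) pairs copied from Table 1.
counts = {
    "DNA-573_Train (Transferred)": (14479, 145404),
    "DNA-573_Train (Non-transferred)": (11074, 148809),
    "DNA-129_Test": (2240, 35275),
    "RNA-495_Train (Transferred)": (14609, 122290),
    "RNA-495_Train (Non-transferred)": (11756, 125143),
    "RNA-117_Test": (2031, 35314),
}

ratios = {name: round(pn_ratio(pos, neg), 3) for name, (pos, neg) in counts.items()}
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

The low ratios show that binding residues are a small minority in every set, which is worth keeping in mind when choosing evaluation metrics.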
2. Small ligand-binding proteins
The Ca2+-, Mn2+-, Mg2+- and HEME-binding protein datasets can be
downloaded from DELIA(2).
The ATP-binding protein training set ATP-388_Train and test set ATP-41_Test can be downloaded from
ATPbind(3).
3. General ligand-binding proteins
The ligand-general benchmark sets used to train and evaluate the ligand-general model GraphBind-G can be downloaded from P2Rank(4).
(1) Yang, J., Roy, A. and Zhang, Y. (2013) BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res, 41, D1096-1103.
(2) Xia, C.-Q., Pan, X. and Shen, H.-B. Protein-ligand binding residue prediction enhancement through hybrid deep heterogeneous learning of sequence and structure data. Bioinformatics.
(3) Hu, J., Li, Y., Zhang, Y. and Yu, D.-J. (2018) ATPbind: accurate protein–ATP binding site prediction by combining sequence-profiling and structure-based comparisons. Journal of Chemical Information and Modeling, 58, 501-510.
(4) Krivák, R. and Hoksza, D. (2018) P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. Journal of Cheminformatics, 10, 39.