We collect two nucleic acid-binding protein benchmark datasets from the BioLiP database(1) and split each into training and test sets according to release date. In addition, we compare GraphBind with state-of-the-art methods for predicting residues that bind small ligands, including Ca2+, Mn2+, Mg2+, ATP and HEME.
1. Nucleic acid-binding proteins
      We collect high-quality DNA- and RNA-binding protein datasets from the BioLiP database on December 5, 2018. Protein chains released before January 6, 2016 are assigned to the original training sets, and all remaining chains are assigned to the original test sets. We apply data augmentation to increase the number of training samples while reducing sequence redundancy. To reduce the sequence similarity between the training and test sets, we remove from the test set every sequence with more than 30% sequence similarity to any sequence in the training set. After transferring binding annotations, we further remove redundant proteins so that the sequence identity within the training set is below 30%.
      Nucleic acid-binding protein datasets can be downloaded from Download datasets.
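The date-based split and the test-set filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the dictionary of release dates, and the `similarity` callable are all assumptions introduced here, and a real workflow would compute sequence similarity with a dedicated alignment or clustering tool.

```python
from datetime import date
from typing import Callable, Dict, List, Tuple

CUTOFF = date(2016, 1, 6)   # release-date cutoff stated in the text
MAX_SIMILARITY = 0.30       # 30% threshold stated in the text


def split_by_release_date(
    release_dates: Dict[str, date], cutoff: date = CUTOFF
) -> Tuple[List[str], List[str]]:
    """Chains released before the cutoff go to training, the rest to test."""
    train = [chain for chain, day in release_dates.items() if day < cutoff]
    test = [chain for chain, day in release_dates.items() if day >= cutoff]
    return train, test


def drop_similar_test_chains(
    train: List[str],
    test: List[str],
    similarity: Callable[[str, str], float],  # hypothetical: returns a value in [0, 1]
    max_similarity: float = MAX_SIMILARITY,
) -> List[str]:
    """Keep a test chain only if no training chain exceeds the similarity threshold."""
    return [
        t for t in test
        if all(similarity(t, tr) <= max_similarity for tr in train)
    ]
```

The `similarity` argument is left as a plain callable so that any scoring method can be plugged in without changing the split logic.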
      For nucleic acid-binding protein training sets, every 4 lines represent a protein chain:
(1) PDB ID;
(2) Chain;
(3) Transferred binding annotations, where "1" and "0" indicate nucleic acid-binding and non-nucleic acid-binding residues;
(4) Non-transferred binding annotations, where "1" and "0" indicate nucleic acid-binding and non-nucleic acid-binding residues.
      For nucleic acid-binding protein test sets, every 3 lines represent a protein chain:
(1) PDB ID;
(2) Chain;
(3) Binding annotations, where "1" and "0" indicate nucleic acid-binding and non-nucleic acid-binding residues.
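The two layouts above can be read with one small parser. The sketch below assumes that blank lines may be skipped, that each annotation line is a single string of "0" and "1" characters, and that the "Chain" line can be kept as an opaque string; the function and field names are hypothetical and not part of the released data.

```python
from typing import Dict, List


def parse_chain_records(text: str, lines_per_chain: int) -> List[Dict[str, str]]:
    """Group non-empty lines into fixed-size records, one per protein chain."""
    if lines_per_chain == 4:    # training-set layout
        fields = ["pdb_id", "chain", "transferred", "non_transferred"]
    elif lines_per_chain == 3:  # test-set layout
        fields = ["pdb_id", "chain", "annotation"]
    else:
        raise ValueError("lines_per_chain must be 3 (test) or 4 (training)")

    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % lines_per_chain != 0:
        raise ValueError("line count is not a multiple of lines_per_chain")

    records = []
    for start in range(0, len(lines), lines_per_chain):
        record = dict(zip(fields, lines[start:start + lines_per_chain]))
        # Annotation lines may contain only the labels "0" and "1".
        for key in fields[2:]:
            if set(record[key]) - {"0", "1"}:
                raise ValueError(f"unexpected label in {key} of {record['pdb_id']}")
        records.append(record)
    return records
```

For example, `parse_chain_records(open(path).read(), 4)` would return one dictionary per chain of a training file, where `path` is whichever file was downloaded.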

Table 1. Composition of the nucleic acid-binding protein datasets.

Type  Dataset                          NProtein(a)  Npos(b)  Nneg(c)  PNratio(d)
DNA   DNA-573_Train(Transferred)       573          14479    145404   0.100
      DNA-573_Train(Non-transferred)   573          11074    148809   0.074
      DNA-129_Test                     129          2240     35275    0.064
RNA   RNA-495_Train(Transferred)       495          14609    122290   0.119
      RNA-495_Train(Non-transferred)   495          11756    125143   0.094
      RNA-117_Test                     117          2031     35314    0.058

Note:
a Number of proteins;
b Number of binding residues;
c Number of non-binding residues;
d Ratio of binding to non-binding residues, PNratio = Npos/Nneg.
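As a quick check, the PNratio column can be recomputed from the residue counts, or from the annotation strings themselves. The helper names below are illustrative assumptions; only the formula PNratio = Npos/Nneg and the counts come from Table 1.

```python
from typing import Iterable, Tuple


def count_labels(annotations: Iterable[str]) -> Tuple[int, int]:
    """Count binding ("1") and non-binding ("0") residues over all chains."""
    n_pos = n_neg = 0
    for labels in annotations:
        n_pos += labels.count("1")
        n_neg += labels.count("0")
    return n_pos, n_neg


def pn_ratio(n_pos: int, n_neg: int) -> float:
    """PNratio = Npos / Nneg, as defined in the table note."""
    return n_pos / n_neg


# Counts taken from Table 1 (transferred DNA training set and RNA test set).
dna_train_ratio = round(pn_ratio(14479, 145404), 3)
rna_test_ratio = round(pn_ratio(2031, 35314), 3)
```

Rounded to three decimals, these two values agree with the 0.100 and 0.058 entries in the table.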

2. Small ligand-binding proteins
      The Ca2+-, Mn2+-, Mg2+- and HEME-binding protein datasets can be downloaded from DELIA(2).
      The ATP-binding protein training set ATP-388_Train and test set ATP-41_Test can be downloaded from ATPbind(3).

3. General ligand-binding proteins
      The ligand-general benchmark sets used to train and evaluate the ligand-general model GraphBind-G can be downloaded from P2Rank(4).

(1) Yang, J., Roy, A. and Zhang, Y. (2013) BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res, 41, D1096-1103.
(2) Xia, C.-Q., Pan, X. and Shen, H.-B. Protein-ligand binding residue prediction enhancement through hybrid deep heterogeneous learning of sequence and structure data. Bioinformatics.
(3) Hu, J., Li, Y., Zhang, Y. and Yu, D.-J. (2018) ATPbind: accurate protein–ATP binding site prediction by combining sequence-profiling and structure-based comparisons. Journal of Chemical Information and Modeling, 58, 501-510.
(4) Krivák, R. and Hoksza, D. (2018) P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. Journal of Cheminformatics, 10, 39.