GraphBind

      We collect two nucleic acid-binding protein benchmark datasets from the BioLiP database (1) and split them into training and test sets according to release date. In addition, we compare GraphBind with state-of-the-art methods for predicting the binding residues of small ligands, including Ca²⁺, Mn²⁺, Mg²⁺, ATP and HEME.
1. Nucleic acid-binding proteins
      The high-quality DNA- and RNA-binding protein datasets were collected from the BioLiP database on December 5, 2018. Protein chains released before January 6, 2016 are assigned to the original training sets, and all remaining chains are assigned to the original test sets. We apply data augmentation to increase the number of training samples while reducing sequence redundancy. To reduce the sequence similarity between the training and test sets, we remove from the test set every sequence with more than 30% sequence similarity to any sequence in the training set. After transferring binding annotations, we further remove redundant proteins so that the sequence identity within the training set is below 30%.
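The release-date split described above can be illustrated with a minimal sketch. The chain identifiers and dates below are hypothetical placeholders, and the function name is an assumption made for this example, not part of GraphBind. The 30% similarity filter is not shown, since it depends on an external sequence comparison tool that this page does not name.

```python
from datetime import date

# Cutoff stated in the text: chains released before this date go to training.
CUTOFF = date(2016, 1, 6)

def split_by_release_date(chains, cutoff=CUTOFF):
    """Split (chain_id, release_date) pairs into training and test ID lists."""
    train = [cid for cid, released in chains if released < cutoff]
    test = [cid for cid, released in chains if released >= cutoff]
    return train, test

# Hypothetical chains and release dates, for illustration only.
example_chains = [
    ("chainA", date(2014, 3, 12)),
    ("chainB", date(2016, 1, 6)),
    ("chainC", date(2017, 9, 30)),
]

train_ids, test_ids = split_by_release_date(example_chains)
print(train_ids)
print(test_ids)
```

In this sketch a chain released exactly on the cutoff date falls into the test set, following the wording "released before January 6, 2016" for the training set.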
      Nucleic acid-binding protein datasets can be downloaded from Download datasets.
      For nucleic acid-binding protein training sets, every 4 lines represent a protein chain:
(1) PDB ID;
(2) Chain;
(3) Transferred binding annotations, where "1" and "0" indicate nucleic acid-binding and non-nucleic acid-binding residues;
(4) Non-transferred binding annotations, where "1" and "0" indicate nucleic acid-binding and non-nucleic acid-binding residues.
      For nucleic acid-binding protein test sets, every 3 lines represent a protein chain:
(1) PDB ID;
(2) Chain;
(3) Binding annotations, where "1" and "0" indicate nucleic acid-binding and non-nucleic acid-binding residues.
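The two layouts above (4 lines per chain for training sets, 3 lines per chain for test sets) could be read with a small parser such as the following sketch. It assumes that each annotation line is a contiguous string of "0" and "1" characters with no separators and that blank lines can be ignored; neither detail is confirmed by this page. The function names and the sample records are hypothetical.

```python
def group_records(lines, lines_per_record):
    """Group non-empty lines into fixed-size records, one per protein chain."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % lines_per_record != 0:
        raise ValueError("line count is not a multiple of the record size")
    return [rows[i:i + lines_per_record]
            for i in range(0, len(rows), lines_per_record)]

def to_labels(annotation):
    """Convert a string such as '00110' into a list of 0/1 integers (assumed format)."""
    return [int(ch) for ch in annotation]

def parse_training(lines):
    """Training sets: PDB ID, chain, transferred and non-transferred annotations."""
    return [
        {"pdb_id": pdb_id, "chain": chain,
         "transferred": to_labels(transferred),
         "non_transferred": to_labels(non_transferred)}
        for pdb_id, chain, transferred, non_transferred in group_records(lines, 4)
    ]

def parse_test(lines):
    """Test sets: PDB ID, chain, binding annotations."""
    return [
        {"pdb_id": pdb_id, "chain": chain, "labels": to_labels(labels)}
        for pdb_id, chain, labels in group_records(lines, 3)
    ]

# Made-up records for illustration; these are not entries from the datasets.
train_records = parse_training(["1abc", "A", "00110", "00100"])
test_records = parse_test(["2xyz", "B", "01000"])
print(train_records[0]["transferred"])
print(test_records[0]["labels"])
```

To read a real file, pass the open file object (or its list of lines) to `parse_training` or `parse_test`, after checking that the file matches the assumptions stated above.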

Table 1. Composition of the nucleic acid-binding protein datasets.

Type	Dataset	N_Protein^a	N_pos^b	N_neg^c	PNratio^d
DNA	DNA-573_Train(Transferred)	573	14479	145404	0.100
	DNA-573_Train(Non-transferred)	573	11074	148809	0.074
	DNA-129_Test	129	2240	35275	0.064
RNA	RNA-495_Train(Transferred)	495	14609	122290	0.119
	RNA-495_Train(Non-transferred)	495	11756	125143	0.094
	RNA-117_Test	117	2031	35314	0.058

Note:
^a Number of proteins;
^b Number of binding residues;
^c Number of non-binding residues;
^d PNratio = N_pos/N_neg.
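As a quick check of note d, the PNratio column can be recomputed from the N_pos and N_neg counts in Table 1. The dictionary keys below are informal labels chosen for this sketch, and rounding to three decimals is assumed from the precision shown in the table.

```python
# (N_pos, N_neg) counts copied from Table 1.
counts = {
    "DNA-573_Train (transferred)": (14479, 145404),
    "DNA-573_Train (non-transferred)": (11074, 148809),
    "DNA-129_Test": (2240, 35275),
    "RNA-495_Train (transferred)": (14609, 122290),
    "RNA-495_Train (non-transferred)": (11756, 125143),
    "RNA-117_Test": (2031, 35314),
}

def pn_ratio(n_pos, n_neg):
    """PNratio = N_pos / N_neg, as defined in note d."""
    return n_pos / n_neg

# Round to three decimals to match the table's precision (an assumption).
ratios = {name: round(pn_ratio(n_pos, n_neg), 3)
          for name, (n_pos, n_neg) in counts.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

The ratios show the class imbalance: binding residues are roughly 6% to 12% as numerous as non-binding residues in every set.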

2. Small ligand-binding proteins
The Ca²⁺-, Mn²⁺-, Mg²⁺- and HEME-binding protein datasets can be downloaded from DELIA(2).
The ATP-binding protein training set ATP-388_Train and test set ATP-41_Test can be downloaded from ATPbind(3).

3. General ligand-binding proteins
The ligand-general benchmark sets used to train and evaluate the ligand-general model GraphBind-G can be downloaded from P2Rank(4).

(1) Yang, J., Roy, A. and Zhang, Y. (2013) BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res, 41, D1096-1103.
(2) Xia, C.-Q., Pan, X. and Shen, H.-B. Protein-ligand binding residue prediction enhancement through hybrid deep heterogeneous learning of sequence and structure data. Bioinformatics.
(3) Hu, J., Li, Y., Zhang, Y. and Yu, D.-J. (2018) ATPbind: accurate protein-ATP binding site prediction by combining sequence-profiling and structure-based comparisons. Journal of Chemical Information and Modeling, 58, 501-510.
(4) Krivák, R. and Hoksza, D. (2018) P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. Journal of Cheminformatics, 10, 39.