In this study, we construct three types of datasets from the BioLip database, which provides binding annotations for residues. The first comprises ligand-specific training and test sets of binding residues for 1159 ligands, the second is a pre-training dataset of binding residues for 1301 ligands, and the third is an independent test dataset of binding residues for 16 ligands that are unseen during training.
Download datasets
1. Ligand-specific training and test sets of 1159 ligands
      For each ligand, we collect its binding proteins from the BioLip database (February 2021 release) to construct a ligand-specific binding dataset. We remove redundant proteins with CD-HIT so that the sequence identity among proteins in the dataset is below 30%, and split the proteins into ligand-specific training and test sets according to their release date in BioLip. Proteins released before January 2017 are assigned to the training set, and the remaining proteins are assigned to the test set. To ensure that the proposed method can be both trained and evaluated, we keep only ligands with at least two binding proteins in the training set and at least one binding protein in the test set. In total, 1159 ligand-specific benchmark datasets are collected, comprising 27738 proteins.
2. Pre-training dataset of 1301 ligands
      To train an accurate predictor for ligands with a limited number of binding residues, we construct a dataset for model pre-training. As in the benchmark dataset construction, we collect 187905 binding proteins released before January 2017 in BioLip. To keep sequence redundancy low between the pre-training dataset and the 1159 ligand-specific test sets, CD-HIT is applied to remove from the pre-training dataset any protein with over 30% sequence identity to any protein chain in those test sets. Finally, 6093 proteins of 1301 ligands comprise the pre-training dataset, and the binding and non-binding residues from these proteins are processed in the same way as for the ligand-specific datasets.
3. Independent test sets of 16 unseen ligands
      During construction of the ligand-specific benchmark datasets, we divide the proteins into ligand-specific training and test sets based on their release date in BioLip. Some ligands are filtered out because their training sets contain too few proteins to train a model. Among these ligands, those with enough proteins are further selected to investigate the generalization ability of LigBind to unseen ligands that never appear in the pre-training or fine-tuning phases. First, ligands that have at least five proteins in their test sets (released after January 2017 in BioLip) are collected. Then, for each ligand, proteins with over 30% sequence identity to any protein chain in the 1159 ligand-specific training sets or the pre-training dataset are removed to ensure low redundancy. In the end, we obtain 16 test sets of unseen ligands, each with at least five proteins, as the independent test sets (Supplementary Table S3); the binding and non-binding residues are obtained in the same way as for the above datasets.
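The date-based split and the ligand filters described in the three sections above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the function names, the tuple layout, and the YYYYMM integer encoding of the release time (as suggested by values such as 201808 in the dataset files) are hypothetical, and the CD-HIT redundancy removal is not modeled.

```python
from collections import defaultdict

# Assumption: release time encoded as a YYYYMM integer; proteins released
# before January 2017 go to training, all others to test.
SPLIT_YYYYMM = 201701


def split_by_release(records):
    """Split (ligand_id, protein_id, release_yyyymm) tuples by release date."""
    train, test = defaultdict(list), defaultdict(list)
    for ligand, protein, released in records:
        target = train if released < SPLIT_YYYYMM else test
        target[ligand].append(protein)
    return train, test


def classify_ligands(train, test, min_train=2, min_test=1, min_unseen_test=5):
    """Return (benchmark ligands, candidate unseen ligands).

    Benchmark ligands have >= min_train training and >= min_test test proteins.
    Candidate unseen ligands were filtered out for having too few training
    proteins but still have >= min_unseen_test test proteins.
    """
    benchmark, unseen_candidates = [], []
    for ligand in sorted(set(train) | set(test)):
        n_train = len(train.get(ligand, []))
        n_test = len(test.get(ligand, []))
        if n_train >= min_train and n_test >= min_test:
            benchmark.append(ligand)
        elif n_train < min_train and n_test >= min_unseen_test:
            unseen_candidates.append(ligand)
    return benchmark, unseen_candidates
```

The candidate list is only a starting point: in the procedure described above, proteins of each candidate ligand are additionally filtered at 30% sequence identity against the training and pre-training data before the final 16 test sets are obtained.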
Dataset format:
In each txt file, each line contains the following five fields for one sequence:
      release time, PDBID_ChainID, binding residues in structure numbering, binding residues in sequence numbering, and sequence
These five fields are separated by colons, as follows:
      201808:5onl_A:K58 E59 E196 F199:52 53 190 193:EDVYQNFEELKNNE...SYL
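A line in this format can be read with a few lines of Python. The sketch below is a minimal, unofficial parser: the `BindingRecord` class and `parse_line` function are hypothetical names introduced here for illustration, and the toy input line in the usage example is made up rather than taken from the datasets. Whether the sequence positions are 0-based or 1-based is not stated here, so the parser keeps them as plain integers without interpreting them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BindingRecord:
    released: str               # release time, e.g. "201808"
    chain_id: str               # PDBID_ChainID, e.g. "5onl_A"
    structure_sites: List[str]  # residue labels in structure numbering, e.g. "K58"
    sequence_sites: List[int]   # residue positions in sequence numbering
    sequence: str               # amino-acid sequence


def parse_line(line: str) -> BindingRecord:
    """Parse one colon-separated dataset line into a BindingRecord."""
    fields = line.rstrip("\n").split(":")
    if len(fields) != 5:
        raise ValueError(f"expected 5 colon-separated fields, got {len(fields)}")
    released, chain_id, struct_sites, seq_sites, sequence = fields
    return BindingRecord(
        released=released,
        chain_id=chain_id,
        structure_sites=struct_sites.split(),
        sequence_sites=[int(i) for i in seq_sites.split()],
        sequence=sequence,
    )


if __name__ == "__main__":
    # Made-up toy line, for illustration only.
    record = parse_line("201808:1xyz_A:K5 E6:2 3:AKEG")
    print(record.chain_id, record.structure_sites, record.sequence_sites)
```

Whole files can be read by applying `parse_line` to each non-empty line; malformed lines raise `ValueError` instead of being silently skipped.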