Dataset used in TargetS (Download)


Dataset description:

    We constructed benchmark datasets from BioLip, a semi-manually curated database of biologically relevant protein-ligand interactions. The present study considers 12 types of ligands, grouped into 4 categories: Metal Ion, Nucleotide, Nucleic Acid, and HEME. For each of the 12 ligand types, we constructed a training dataset and an independent validation dataset as follows.
    Training datasets: from BioLip, we extracted all protein sequences that interact with the given ligand and were released in PDB before 10 March 2010. We then culled these sequences with the PISCES software so that the maximal pairwise sequence identity is 40%; the resulting sequences constitute the training dataset for that ligand.
    Validation datasets: from BioLip, we extracted all protein sequences that interact with the ligand and were released in PDB after 10 March 2010. Again, the maximal pairwise sequence identity was reduced to 40%. In addition, any validation sequence sharing more than 40% identity with a sequence in the training dataset was removed. This ensures that the validation sequences are independent of the training sequences and can be used to test the generalization capability of the prediction models built on the training dataset. Table 1 summarizes the composition of the training and independent validation datasets for the 12 types of ligands.
    As Table 1 shows, the number of protein sequences known to interact with the 12 ligand types is still very limited. To evaluate the proposed method objectively, we nevertheless partitioned the available sequences into a training dataset and a validation dataset for each ligand type. To make full use of the limited data, however, the final online prediction model for each ligand is trained on the merged training and validation datasets.
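    The release-date split and the independence filter described above can be sketched as follows. This is only an illustration under stated assumptions: the names Entry, split_by_release, drop_similar, and toy_identity are hypothetical, the PISCES culling within each set is not reproduced, and toy_identity is a crude alignment-free placeholder rather than the identity measure actually used. The text does not say how entries released exactly on 10 March 2010 are handled, so the sketch leaves them out of both sets.

```python
from datetime import date
from typing import Callable, List, NamedTuple, Tuple

CUTOFF = date(2010, 3, 10)   # release-date boundary stated in the text
MAX_IDENTITY = 0.40          # 40% sequence identity threshold


class Entry(NamedTuple):
    """Hypothetical record: one ligand-binding protein chain."""
    pdb_id: str
    released: date
    sequence: str


def split_by_release(entries: List[Entry]) -> Tuple[List[Entry], List[Entry]]:
    """Entries released before the cutoff go to training, after it to validation."""
    training = [e for e in entries if e.released < CUTOFF]
    validation = [e for e in entries if e.released > CUTOFF]
    return training, validation


def drop_similar(validation: List[Entry], training: List[Entry],
                 identity: Callable[[str, str], float]) -> List[Entry]:
    """Keep only validation entries sharing <= 40% identity with every training entry."""
    return [v for v in validation
            if all(identity(v.sequence, t.sequence) <= MAX_IDENTITY for t in training)]


def toy_identity(a: str, b: str) -> float:
    """Placeholder (assumption): matching positions over the longer length, no alignment."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


# Made-up entries, for illustration only.
entries = [
    Entry("1AAA", date(2009, 1, 1), "ACDEFGHIKL"),
    Entry("2BBB", date(2011, 1, 1), "ACDEFGHIKL"),  # identical to 1AAA
    Entry("3CCC", date(2012, 1, 1), "WWWWWWWWWW"),  # unrelated to 1AAA
]
training, validation = split_by_release(entries)
independent = drop_similar(validation, training, toy_identity)
print([e.pdb_id for e in independent])
```

    In practice the identity function would come from a real alignment tool; passing it in as a parameter keeps the filtering logic independent of that choice.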

Table 1. Composition of the training datasets and the validation datasets for the 12 types of ligands

Ligand Category  Ligand Type  Training Dataset                    Validation Dataset                Total No. of Sequences
                              No. of Sequences  (numP, numN)      No. of Sequences  (numP, numN)
Metal Ion        Ca2+         965               (4914, 287801)    165               (785, 53779)    1130
                 Zn2+         1168              (4705, 315235)    176               (744, 47851)    1344
                 Mg2+         1138              (3860, 350716)    217               (852, 72002)    1355
                 Mn2+         335               (1496, 112312)    58                (237, 17484)    393
                 Fe3+         173               (818, 50453)      26                (120, 9092)     199
Nucleotide       ATP          221               (3021, 72334)     50                (647, 16639)    271
                 ADP          296               (3833, 98740)     47                (686, 20327)    343
                 AMP          145               (1603, 44401)     33                (392, 10355)    178
                 GDP          82                (1101, 26244)     14                (194, 4180)     96
                 GTP          54                (745, 21205)      7                 (89, 1868)      61
Nucleic Acid     DNA          335               (6461, 71320)     52                (973, 16225)    387
HEME             HEME         207               (4380, 49768)     27                (580, 8630)     234

In each 2-tuple (numP, numN), numP and numN are the numbers of positive (binding residue) and negative (non-binding residue) samples, respectively.


Dataset format:

    The training and validation datasets for each type of ligand are stored in the Training and Validation directories, respectively.
    Taking the ATP ligand as an example, its training dataset is a text file named:
                  Training\ATP_Training.txt
    and its validation dataset is a text file named:
                  Validation\ATP_Validation.txt
    In each text file, each line contains the following six fields for one sequence:
                  PDBID, Chain, Binding Site Number, Ligand Type, Binding Residues, and Sequence
    The six fields are separated by semicolons, for example:
                  3C16;A;BS03;ATP;D21 I22 E23 F25 T26 L63 G64;DMMFHKIY...ERNAYLKEHSIETFLI
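    A minimal sketch of reading this format in Python is given below. The names BindingRecord, parse_line, and read_dataset are hypothetical and not part of the dataset distribution. Binding residues are kept as raw tokens such as D21, because the numbering scheme behind the residue numbers is not specified here. The example line is the truncated one shown above, so its sequence field still contains the "..." elision.

```python
from typing import Iterator, List, NamedTuple


class BindingRecord(NamedTuple):
    """One line of a dataset file (field names are illustrative)."""
    pdb_id: str
    chain: str
    site: str            # binding site number, e.g. "BS03"
    ligand: str
    residues: List[str]  # raw tokens such as "D21", left uninterpreted
    sequence: str


def parse_line(line: str) -> BindingRecord:
    """Split one semicolon-separated line into its six fields."""
    fields = [f.strip() for f in line.strip().split(";")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 semicolon-separated fields, got {len(fields)}")
    pdb_id, chain, site, ligand, residues, sequence = fields
    # Binding residues are separated by whitespace within their field.
    return BindingRecord(pdb_id, chain, site, ligand, residues.split(), sequence)


def read_dataset(path: str) -> Iterator[BindingRecord]:
    """Yield one record per non-empty line of a dataset file."""
    with open(path) as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


# The truncated example line from the text, used verbatim.
example = "3C16;A;BS03;ATP;D21 I22 E23 F25 T26 L63 G64;DMMFHKIY...ERNAYLKEHSIETFLI"
record = parse_line(example)
print(record.pdb_id, record.chain, record.site, record.ligand)
print(record.residues)
```

    A file would then be read with something like read_dataset("Training/ATP_Training.txt"), adjusting the path separator to the local system.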