Dataset used in DELIA (Download)


Dataset description:

    We constructed benchmark datasets from BioLip, a semi-manually curated database of biologically relevant protein-ligand interactions. Five types of ligands, i.e., Ca2+, Mg2+, Mn2+, ATP and HEME, were considered in the present study. For each of them except ATP, we constructed a training dataset and a corresponding independent testing dataset.
    Training datasets: we extracted from BioLip the proteins that interact with the given ligand and were released in PDB before 6 January 2016. The extracted protein sequences were culled with the cd-hit software so that the maximal pairwise sequence identity is 30%, and the resulting proteins constitute the training dataset for that ligand. We are sorry that the numbers of positive and negative samples in MG-1194_train are recorded incorrectly in our paper. The correct numbers are summarized in Table 1 below.
    Testing datasets: we extracted from BioLip the proteins that interact with the ligand and were released in PDB after 6 January 2016. Again, the maximal pairwise sequence identity of the extracted protein sequences was reduced to 30%, and the resulting proteins constitute the testing dataset. Moreover, if a sequence in the testing dataset shares >30% identity with any sequence in the training dataset, we remove it from the testing dataset. This guarantees that proteins in the testing dataset are independent of those in the training dataset and can be used to test the generalization capability of the predictor built on the training dataset.
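    The independence filter described above can be sketched as follows. This is only an illustration, not the pipeline used to build the datasets: the function names are hypothetical, and toy_identity is a crude placeholder (position-wise matches, no alignment) standing in for a real sequence identity measure such as the one cd-hit computes.

```python
from typing import Callable, Dict


def filter_independent(
    test: Dict[str, str],
    train: Dict[str, str],
    identity: Callable[[str, str], float],
    threshold: float = 0.30,
) -> Dict[str, str]:
    """Keep test sequences whose identity to every training sequence is <= threshold."""
    return {
        name: seq
        for name, seq in test.items()
        if all(identity(seq, train_seq) <= threshold for train_seq in train.values())
    }


def toy_identity(a: str, b: str) -> float:
    """Placeholder only: fraction of matching positions over the shorter length."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


# Made-up sequences for illustration.
train = {"t1": "AAAAAAAAAA"}
test = {"p1": "AAAAAAAAAA", "p2": "CCCCCCCCCC"}
kept = filter_independent(test, train, toy_identity)
print(sorted(kept))
```

    Passing the identity measure as a parameter keeps the filter itself independent of the tool used to compare sequences.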
    Table 1 summarizes the detailed composition of the training datasets and the independent testing datasets for the 4 types of ligands.

Table 1. Composition of the training datasets and the testing datasets for the 4 types of ligands

Ligand Category   Ligand Type   Training Dataset                     Testing Dataset                      Total No. of Proteins
                                No. of Proteins   (numP, numN)       No. of Proteins   (numP, numN)

Metal Ion         Ca2+          1022              (4830, 255917)     515               (2958, 186678)     1537
Metal Ion         Mg2+          1194              (4147, 320736)     651               (2321, 244088)     1845
Metal Ion         Mn2+          440               (1931, 150299)     144               (612, 50838)       584
                  HEME          175               (3851, 44477)      96                (2012, 26341)      271

  In each 2-tuple (numP, numN), numP and numN are the numbers of positive (binding residue) and negative (non-binding residue) samples, respectively.
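  The (numP, numN) pairs show how strongly the residue labels are imbalanced. As a small worked example, the sketch below computes the fraction of positive samples for each dataset; the numbers are copied from Table 1, while the dictionary layout and the function name positive_fraction are illustrative choices.

```python
# (numP, numN) pairs copied from Table 1.
TABLE_1 = {
    "Ca2+": {"train": (4830, 255917), "test": (2958, 186678)},
    "Mg2+": {"train": (4147, 320736), "test": (2321, 244088)},
    "Mn2+": {"train": (1931, 150299), "test": (612, 50838)},
    "HEME": {"train": (3851, 44477), "test": (2012, 26341)},
}


def positive_fraction(num_p: int, num_n: int) -> float:
    """Share of binding residues among all residues in a dataset."""
    return num_p / (num_p + num_n)


for ligand, splits in TABLE_1.items():
    for split, (num_p, num_n) in splits.items():
        frac = positive_fraction(num_p, num_n)
        print(f"{ligand:5s} {split:5s} positives: {frac:.2%}")
```

  Binding residues make up less than 2% of the samples for the metal ions and roughly 7-8% for HEME.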


Dataset format:

    The training dataset and the testing dataset for each type of ligand are stored in the training and the testing directories, respectively.
   
    Taking the HEME ligand as an example, its training dataset is a text file named as follows:
                  HEM-175_Train.txt
    and its testing dataset is also a text file named as follows:
                  HEM-96_Test.txt
    In each text file, each line contains the following four types of information about a sequence:
                  PDBID, Chain, Binding Residues, and Sequence
    These four types of information are separated by colons as follows:
                  3C16 : A : D21 I22 E23 F25 T26 L63 G64 : DMMFHKIY...ERNAYLKEHSIETFLI
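
    A minimal sketch of reading one such line in Python is given below. It assumes the layout shown in the example above (four fields separated by colons, binding residues separated by spaces). The names Record, parse_line and count_samples are hypothetical, the sample line is made up, and count_samples further assumes that every listed binding residue belongs to the given sequence.

```python
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    pdb_id: str
    chain: str
    binding_residues: List[str]  # tokens such as "D21", kept as plain strings
    sequence: str


def parse_line(line: str) -> Record:
    """Split one dataset line into its four colon-separated fields."""
    fields = [field.strip() for field in line.split(":")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    pdb_id, chain, residues, sequence = fields
    return Record(pdb_id, chain, residues.split(), sequence)


def count_samples(record: Record) -> Tuple[int, int]:
    """Return (numP, numN), assuming each binding residue lies in the sequence."""
    num_p = len(record.binding_residues)
    return num_p, len(record.sequence) - num_p


# Made-up line in the same layout as the example above.
rec = parse_line("1ABC : A : M1 K3 : MAKL")
print(rec.pdb_id, rec.chain, rec.binding_residues, count_samples(rec))
```

    Summing count_samples over all lines of a file gives totals that can be compared with the (numP, numN) values in Table 1, subject to the assumption stated above.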