Dataset description:
We constructed benchmark datasets from BioLip, a semi-manually
curated database of biologically relevant protein-ligand interactions.
Twelve types of ligands, which can be grouped into four categories
(Metal Ion, Nucleotide, Nucleic Acid, and HEME), were considered in
the present study. For each of the 12 ligand types, we constructed a
training dataset and an independent validation dataset as follows.
Training datasets: from BioLip, we extracted all protein sequences that
interact with the given ligand and were released in the PDB before 10
March 2010. The extracted sequences were then culled with the PISCES
software so that the maximal pairwise sequence identity was 40%, and the
resulting sequences constitute the training dataset for that ligand.
Validation datasets: from BioLip, we extracted all protein sequences
that interact with the ligand and were released in the PDB after 10
March 2010. Again, the maximal pairwise sequence identity of the
extracted sequences was reduced to 40%, and the resulting sequences
constitute the validation dataset. Moreover, any sequence in the
validation dataset that shares >40% identity with a sequence in the
training dataset was removed from the validation dataset. This ensures
that the sequences in the validation dataset are independent of those
in the training dataset and can be used to test the generalization
capability of prediction models built on the training dataset.
Table 1 summarizes the detailed composition of the training datasets and
the independent validation datasets for the 12 types of ligands.
Table 1. Composition of the training datasets and the validation datasets for the 12 types of ligands
The figures numP and numN in each 2-tuple (numP, numN) represent the
numbers of positive (binding residues) and negative (non-binding
residues) samples, respectively.
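The construction procedure described above (date-based split, redundancy removal within each set, then removal of validation sequences similar to any training sequence) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Chain` record, the `naive_identity` function, and the greedy `cull` routine are hypothetical stand-ins and do not reproduce PISCES, which relies on alignment-based sequence identity. Only the 10 March 2010 cutoff and the 40% threshold are taken from the description.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Sequence, Tuple

CUTOFF_DATE = date(2010, 3, 10)  # release-date split stated in the text
MAX_IDENTITY = 0.40              # identity threshold stated in the text


@dataclass(frozen=True)
class Chain:
    """Hypothetical record for one ligand-binding protein sequence."""
    chain_id: str
    sequence: str
    released: date


def naive_identity(a: str, b: str) -> float:
    """Toy identity: matching positions divided by the longer length.

    Assumption for illustration only; PISCES uses alignment-based identity.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def cull(chains: Sequence[Chain],
         identity: Callable[[str, str], float] = naive_identity) -> List[Chain]:
    """Greedy stand-in for PISCES: keep a chain only if its identity to
    every chain already kept does not exceed MAX_IDENTITY."""
    kept: List[Chain] = []
    for chain in chains:
        if all(identity(chain.sequence, k.sequence) <= MAX_IDENTITY for k in kept):
            kept.append(chain)
    return kept


def build_datasets(chains: Sequence[Chain],
                   identity: Callable[[str, str], float] = naive_identity
                   ) -> Tuple[List[Chain], List[Chain]]:
    """Split by release date, cull each set, then drop validation chains
    that share more than MAX_IDENTITY with any training chain."""
    train = cull([c for c in chains if c.released < CUTOFF_DATE], identity)
    valid = cull([c for c in chains if c.released > CUTOFF_DATE], identity)
    valid = [v for v in valid
             if all(identity(v.sequence, t.sequence) <= MAX_IDENTITY for t in train)]
    return train, valid


# Made-up example sequences for one ligand type.
chains = [
    Chain("A", "AAAAAAAAAA", date(2009, 1, 1)),
    Chain("B", "AAAAAAAAAC", date(2009, 6, 1)),  # redundant with A
    Chain("C", "CCCCCCCCCC", date(2009, 7, 1)),
    Chain("D", "AAAAAAAAAA", date(2011, 1, 1)),  # too similar to training chain A
    Chain("E", "GGGGGGGGGG", date(2011, 2, 1)),
]
train, valid = build_datasets(chains)
print([c.chain_id for c in train], [c.chain_id for c in valid])
```

In this toy run, chain B is dropped as redundant within the training set and chain D is dropped from the validation set because it matches training chain A, leaving A and C for training and E for validation. A real pipeline would pass an alignment-based identity function in place of `naive_identity`.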
Dataset format:
The training dataset and the validation dataset for each type of
ligand are contained in the training and the validation directories,
respectively.
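As a minimal sketch of how the two directories might be inspected after download, the helper below lists the files found under each one. It assumes the directories are literally named `training` and `validation` and sit under a common root; the function name `list_dataset_files` is hypothetical, and nothing is assumed about how the individual files are named or formatted.

```python
from pathlib import Path
from typing import Dict, List


def list_dataset_files(root) -> Dict[str, List[str]]:
    """Return the sorted file names found under root/training and
    root/validation; a missing directory yields an empty list."""
    root = Path(root)
    layout: Dict[str, List[str]] = {}
    for split in ("training", "validation"):  # directory names assumed from the text
        split_dir = root / split
        if split_dir.is_dir():
            layout[split] = sorted(p.name for p in split_dir.iterdir() if p.is_file())
        else:
            layout[split] = []
    return layout
```

Calling `list_dataset_files(".")` from the unpacked download location would return a dictionary with one list of file names per directory.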