The benchmark dataset is constructed from the BindingDB database. Following previous studies [8, 32], we apply several preprocessing steps to the original BindingDB records:
1) Remove any record missing either the compound’s unique identifier (PubChem CID) or its chemical structure (SMILES).
2) Remove any record whose protein lacks a unique identifier (UniProt ID), an amino acid sequence, or a tertiary structure predicted by AlphaFold2 [31].
3) Remove records whose protein size is greater than 1000.
4) Keep only records with a measured IC50 value.
Following the activity threshold used in [32, 36], a record is labeled active (positive) if its IC50 is less than 100 nM and inactive (negative) if its IC50 is greater than 10,000 nM. The processed dataset contains 497,064 active pairs and 319,178 inactive pairs between 4,663 proteins and 549,020 compounds.
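The filtering and labeling rules above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: the record keys (`cid`, `smiles`, `uniprot_id`, `sequence`, `has_af2_structure`, `ic50_nm`) are hypothetical, "protein size" is assumed to mean sequence length in residues, and records with an IC50 between the two thresholds are assumed to be left unlabeled, since the description only defines the active and inactive classes.

```python
from typing import Optional

ACTIVE_MAX_NM = 100.0       # IC50 below this value -> active (1)
INACTIVE_MIN_NM = 10_000.0  # IC50 above this value -> inactive (0)
MAX_PROTEIN_SIZE = 1000     # assumption: size = number of residues


def label_from_ic50(ic50_nm: float) -> Optional[int]:
    """Return 1 (active), 0 (inactive), or None for the range in between."""
    if ic50_nm < ACTIVE_MAX_NM:
        return 1
    if ic50_nm > INACTIVE_MIN_NM:
        return 0
    return None  # assumption: intermediate values receive no label


def keep_record(rec: dict) -> bool:
    """Apply preprocessing steps 1-4 to one record (hypothetical keys)."""
    # Step 1: the compound needs a PubChem CID and a SMILES string.
    if not rec.get("cid") or not rec.get("smiles"):
        return False
    # Step 2: the protein needs a UniProt ID, a sequence, and an
    # AlphaFold2-predicted structure.
    if not rec.get("uniprot_id") or not rec.get("sequence"):
        return False
    if not rec.get("has_af2_structure"):
        return False
    # Step 3: drop proteins larger than the size limit.
    if len(rec["sequence"]) > MAX_PROTEIN_SIZE:
        return False
    # Step 4: a measured IC50 value is required.
    return rec.get("ic50_nm") is not None
```

A record that passes `keep_record` would then enter the dataset only if `label_from_ic50` returns 0 or 1.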
We split the benchmark dataset into training and test sets for three different settings: 1) transductive setting, where both the test proteins and the test compounds are present in the training set; 2) semi-inductive setting, where the test compounds are present in the training set but the test proteins are not; and 3) inductive setting, where neither the test proteins nor the test compounds are present in the training set.
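To make the three settings concrete, the sketch below classifies a single test pair by checking whether its compound and protein occur anywhere in the training pairs. The function name and the `(compound_id, protein_id)` tuple layout are assumptions for illustration; this is not the splitting code used to produce the released files.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (compound_id, protein_id), hypothetical layout


def split_setting(train_pairs: Iterable[Pair], test_pair: Pair) -> str:
    """Name the setting a test pair falls under, given the training pairs."""
    train_pairs = list(train_pairs)
    train_compounds = {compound for compound, _ in train_pairs}
    train_proteins = {protein for _, protein in train_pairs}

    compound, protein = test_pair
    seen_compound = compound in train_compounds
    seen_protein = protein in train_proteins

    if seen_compound and seen_protein:
        return "transductive"
    if seen_compound and not seen_protein:
        return "semi-inductive"
    if not seen_compound and not seen_protein:
        return "inductive"
    # Unseen compound with a seen protein is not one of the three settings.
    return "other"
```

Running this check over every pair in a test file is a quick way to confirm that a split contains no unintended overlap with the training set.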
Dataset format:
In each txt file, each line describes one compound-protein pair and contains the following five fields:
Compound PubChem CID, Compound SMILES, Protein PDBID, Protein sequence, label (1: active, 0: inactive)
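A line in this format could be parsed along the following lines. The field delimiter is not stated above, so this sketch assumes whitespace by default and lets the caller pass a different separator; the `Record` type and `parse_line` helper are hypothetical names, not part of the released code.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    cid: str         # Compound PubChem CID
    smiles: str      # Compound SMILES
    protein_id: str  # Protein identifier (PDBID field)
    sequence: str    # Protein amino acid sequence
    label: int       # 1: active, 0: inactive


def parse_line(line: str, sep: Optional[str] = None) -> Record:
    """Parse one dataset line; sep=None splits on whitespace (assumption)."""
    fields = [field.strip() for field in line.strip().split(sep)]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    cid, smiles, protein_id, sequence, label = fields
    if label not in ("0", "1"):
        raise ValueError(f"label must be 0 or 1, got {label!r}")
    return Record(cid, smiles, protein_id, sequence, int(label))
```

If the files turn out to be comma- or tab-separated, calling `parse_line(line, sep=",")` or `parse_line(line, sep="\t")` covers those cases without other changes.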
Download datasets
Download compound and protein features of datasets