ProtDAT Dataset
The ProtDAT dataset is derived from the SwissProt dataset through deduplication and filtering, and comprises 469,395 protein sequence-text pairs. Each protein description text covers key information such as protein function, subcellular localization, and protein family. The cleaned sequence data and text data can be accessed here.
Dataset Splitting
The ProtDAT dataset was randomly partitioned into three subsets: a training set of 402,395 protein sequence-text pairs, a validation set of 47,000 pairs, and a test set of 20,000 pairs. The training results obtained on ProtDAT with this partitioning are shown in the figure below.
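As an illustration of the random partitioning described above, the following is a minimal sketch, not the authors' actual splitting script. The function name `split_pairs`, the fixed seed, and the use of integer indices in place of real sequence-text pairs are all assumptions made for the example; only the three subset sizes come from the text.

```python
import random


def split_pairs(pairs, n_val=47_000, n_test=20_000, seed=0):
    """Randomly partition `pairs` into (train, val, test).

    The seed and the order in which subsets are cut are assumptions;
    the README only states that the split was random.
    """
    shuffled = list(pairs)
    if n_val + n_test > len(shuffled):
        raise ValueError("validation and test sizes exceed the dataset size")
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Stand-in for the 469,395 sequence-text pairs: one index per pair.
indices = range(469_395)
train, val, test = split_pairs(indices)
print(len(train), len(val), len(test))  # 402395 47000 20000
```

Splitting indices rather than the records themselves keeps the partition cheap to store and lets the same split be reapplied to the sequence file and the text file separately.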
Protein Structure Dataset
Protein structure data can be obtained in either of two ways:
(1) Direct retrieval: Fetch the corresponding protein structure from the AlphaFold Protein Structure Database.
(2) ESMFold prediction: Use ESMFold to fold the protein sequences and generate the corresponding structural data. See the ESMFold tutorial for the specific steps.
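For the first method above, a structure file can be fetched by UniProt accession. The sketch below is illustrative only: the helper names `alphafold_pdb_url` and `download_structure` are hypothetical, and the file URL pattern and the default model version (`v4`) are assumptions based on the AlphaFold database's public file naming at the time of writing, so check the database's download page before relying on them.

```python
import re
import urllib.request
from pathlib import Path

# UniProt accession format (6 or 10 characters).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def alphafold_pdb_url(accession, version=4):
    """Build the assumed AlphaFold DB download URL for one accession."""
    if not UNIPROT_ACCESSION.fullmatch(accession):
        raise ValueError(f"not a valid UniProt accession: {accession!r}")
    # Assumed pattern: AF-<accession>-F1-model_v<version>.pdb
    return (
        "https://alphafold.ebi.ac.uk/files/"
        f"AF-{accession}-F1-model_v{version}.pdb"
    )


def download_structure(accession, out_dir="structures", version=4):
    """Download the predicted structure and return the local file path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{accession}.pdb"
    urllib.request.urlretrieve(alphafold_pdb_url(accession, version), target)
    return target
```

Calling `download_structure("P69905")` would then save the file as `structures/P69905.pdb`, provided the entry exists in the database under the assumed version. Sequences without an AlphaFold entry would need the second method instead.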
Protein description texts can be processed following the methods in the PubMedBERT tutorial.