The training dataset was taken from DeepAccNet and GNNRefine.
We randomly selected 5% of the decoys from the two datasets to form the validation dataset.
DeepAccNet contains 1,104,080 decoys generated from 7,992 protein sequences.
The proteins were retrieved from the PISCES server, were deposited in the PDB by May 1st, 2018, and range from 50 to 300 residues in length.
GNNRefine contains 509,443 decoys generated from 29,455 protein sequences. The protein chains are selected from CASP7-12 and CATH domains released in March 2018.
All decoys are subject to dual-space relaxation in Rosetta to mitigate the possible difference in modeling procedures between different methods.
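The 5% validation hold-out described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `split_decoys`, the fixed seed, and the decoy identifiers are all assumptions made for the example.

```python
import random


def split_decoys(decoy_ids, val_fraction=0.05, seed=0):
    """Randomly hold out a fraction of decoys for validation.

    Hypothetical helper; the seed and rounding rule are assumptions.
    """
    rng = random.Random(seed)      # local RNG so the split is reproducible
    shuffled = list(decoy_ids)     # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Made-up identifiers standing in for the pooled DeepAccNet and GNNRefine decoys
decoy_ids = [f"decoy_{i:04d}" for i in range(1000)]
train_ids, val_ids = split_decoys(decoy_ids)
```

Note that this splits at the decoy level, as the text states, so decoys of the same protein may appear in both the training and validation sets.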
For testing purposes, 26 CASP14 proteins and 3897 corresponding decoys are used.
These decoys were submitted by participating servers and screened by the organizers for the EMA (estimation of model accuracy) experiments.
To guarantee that at least 90% of each sequence is structurally resolved, we discarded eight protein chains whose experimental structures, as provided by the organizers, have many missing coordinates.
All decoy structures are dual-space relaxed in Rosetta. The CASP14_GDT and CASP14_LDDT datasets are composed of the decoys with GDT-TS higher than 0.5 and the decoys with average LDDT higher than 0.5, respectively.
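The two threshold-based subsets can be illustrated with a short sketch. The `Decoy` class, the `build_subsets` helper, and the scores below are hypothetical, and both scores are assumed to be on a 0 to 1 scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decoy:
    """Hypothetical record holding a decoy's global quality scores."""
    name: str
    gdt_ts: float      # GDT-TS, assumed to be scaled to [0, 1]
    mean_lddt: float   # LDDT averaged over residues, assumed in [0, 1]


def build_subsets(decoys, threshold=0.5):
    """Return (GDT subset, LDDT subset), keeping scores strictly above threshold."""
    gdt_subset = [d for d in decoys if d.gdt_ts > threshold]
    lddt_subset = [d for d in decoys if d.mean_lddt > threshold]
    return gdt_subset, lddt_subset


# Made-up scores for illustration only
decoys = [
    Decoy("model_a", gdt_ts=0.72, mean_lddt=0.68),
    Decoy("model_b", gdt_ts=0.41, mean_lddt=0.55),
    Decoy("model_c", gdt_ts=0.30, mean_lddt=0.35),
]
casp14_gdt, casp14_lddt = build_subsets(decoys)
```

The two filters are applied independently, so a decoy such as `model_b` can belong to one subset but not the other.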
RCSB_alphafold & RCSB_rosettafold: 250 experimentally solved, high-accuracy protein structures (resolution better than 1.2 Å) were downloaded from RCSB PDB (https://www.rcsb.org/), where they were deposited between July 2021 and May 2022.
In this dataset, each protein has a complete experimental structure derived from X-ray diffraction, and the complete chain A is isolated for evaluation.
The length of these protein sequences ranges from 20 to 800, and most proteins contain 80 to 400 amino acids.
The amino acid sequence of each protein structure was fed into AlphaFold2 and RosettaFold, each of which predicted five decoys.
Accordingly, RCSB_alphafold has 1250 decoys and RCSB_rosettafold has 1230 decoys (RosettaFold failed to predict four protein chains).
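The decoy counts follow directly from the numbers stated above, as this small check shows. The constant names are chosen for the example only.

```python
N_PROTEINS = 250            # experimentally solved structures in the set
DECOYS_PER_PROTEIN = 5      # decoys predicted per protein by each method
ROSETTAFOLD_FAILURES = 4    # chains RosettaFold failed to predict

n_alphafold = N_PROTEINS * DECOYS_PER_PROTEIN
n_rosettafold = (N_PROTEINS - ROSETTAFOLD_FAILURES) * DECOYS_PER_PROTEIN

print(n_alphafold, n_rosettafold)
```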
In addition, we generated comparison graphs for all decoys for which P obtained by QATEN is higher than that of the confidence predictor in AlphaFold2 or that of the QA module in RosettaFold.
Download links for these sets are shown below.
Download links
The DeepAccNet dataset
The GNNRefine dataset
The Test dataset
The Comparison graphs