ToxDL 2.0: protein toxicity predictor

Data

Click here to download the benchmark dataset of ToxDL2.

The benchmark dataset consists of training, validation, test, and independent test sets in FASTA format. In each header line, label 0 denotes a non-toxic protein and label 1 denotes a toxic protein. The sets are outlined below:

a) Training Set: This set contains 4,900 toxic and 9,800 non-toxic protein sequences, selected randomly from the refined positive and negative samples.

b) Validation Set: This set comprises 78 toxic and 853 non-toxic protein sequences, ensuring no redundancy with the training set.

c) Test Set: This set contains 112 toxic and 1,735 non-toxic protein sequences, ensuring no redundancy with the training set.

d) Independent Test Set: This set consists of 152 toxic and 4,710 non-toxic protein sequences. It is non-redundant with the training set and includes proteins collected after January 1, 2022.
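The label convention above can be checked with a short script. The following is a minimal Python sketch, not part of the ToxDL2 code base: it reads FASTA records and counts toxic and non-toxic entries. The function names are hypothetical, and the sketch assumes that the label is the final token of each header line, separated by whitespace or "|". The actual header layout of the benchmark files should be confirmed before relying on it.

```python
import re
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def label_of(header):
    """Return 0 (non-toxic) or 1 (toxic) for a FASTA header.

    ASSUMPTION: the label is the last token of the header, separated
    by whitespace or '|'. Adjust this if the files use another layout.
    """
    token = re.split(r"[\s|]+", header.strip())[-1]
    if token not in ("0", "1"):
        raise ValueError(f"no 0/1 label found in header: {header!r}")
    return int(token)


def count_labels(lines):
    """Count toxic and non-toxic records in an iterable of FASTA lines."""
    counts = Counter(label_of(header) for header, _ in read_fasta(lines))
    return {"toxic": counts[1], "non_toxic": counts[0]}
```

If the header assumption holds, passing an open file handle for the training set to count_labels should reproduce the 4,900 toxic and 9,800 non-toxic sequences stated above, which makes it a quick sanity check after download.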

Click here to access the source code of ToxDL2 in the GitHub repository.

They are associated with protein toxicity and have the keyword "toxin" or "toxic" in their names.

Click here to download the domain embeddings learned by Word2Vec.

To better represent domains, we use word2vec with the Skip-gram model to learn domain embeddings, so that each domain is represented as a vector of continuous values. We download the file protein2ipr.dat.gz, which maps proteins to domains, from https://www.ebi.ac.uk/interpro/download/InterPro. This file lists the InterPro entries and individual signatures that UniProtKB proteins match. It covers 200,810,128 proteins and 45,151 domains, which are used to train word2vec to learn the domain embeddings.
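As an illustration of how such a corpus could be prepared, the sketch below turns lines of protein2ipr.dat into one "sentence" of InterPro entry IDs per protein and passes the sentences to a Skip-gram model. It is a hedged example, not the authors' pipeline. It assumes the tab-separated layout of protein2ipr.dat (protein accession, InterPro ID, entry name, signature ID, start, end). Ordering domains by their first match position, keeping each domain once per protein, and the hyperparameters in train_skipgram are all assumptions made for illustration. The function names are hypothetical, and gensim is a third-party dependency.

```python
from collections import defaultdict


def domain_sentences(lines):
    """Map each protein to its InterPro entry IDs, ordered by first match start.

    ASSUMPTION: each line is tab-separated as
    protein, InterPro ID, entry name, signature ID, start, end.
    """
    first_start = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip malformed or empty lines
        protein, ipr_id, start = fields[0], fields[1], int(fields[4])
        seen = first_start[protein]
        # Keep one entry per domain, at its earliest position (an assumption).
        seen[ipr_id] = min(start, seen.get(ipr_id, start))
    return {
        protein: [d for d, _ in sorted(seen.items(), key=lambda kv: (kv[1], kv[0]))]
        for protein, seen in first_start.items()
    }


def train_skipgram(sentences, vector_size=256):
    """Train word2vec with Skip-gram (sg=1) on domain sentences.

    Hyperparameters here are illustrative, not those used for ToxDL2.
    """
    from gensim.models import Word2Vec  # third-party: pip install gensim

    return Word2Vec(
        sentences=list(sentences),
        vector_size=vector_size,
        window=5,
        min_count=1,
        sg=1,  # 1 selects Skip-gram; 0 would select CBOW
        workers=4,
    )
```

A small sample of the file can be read with gzip.open("protein2ipr.dat.gz", "rt") and passed to domain_sentences. After train_skipgram(sentences.values()), the vector for a domain is available as model.wv[domain_id]. Because this sketch holds every protein in memory, the full file with its 200,810,128 proteins would need a streaming variant.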