Data
Click here to download the benchmark dataset of ToxDL.
The benchmark dataset consists of training, validation, and test sets in FASTA format. In each header line, label 0 denotes a non-toxic protein and label 1 denotes a toxic protein. The toxic proteins can be downloaded from the Animal Toxin Annotation Project in UniProt (https://www.uniprot.org/program/Toxins), which contains 6164 toxic proteins. The non-toxic proteins can be downloaded for the species in the animal list.
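As an illustration of how the labeled FASTA files might be read, the following is a minimal sketch rather than the ToxDL loader itself. It assumes that the label (0 or 1) is the last whitespace-separated token of each header line; the exact header layout is not specified here, so adjust the parsing to match the downloaded files. The function name `read_labeled_fasta` and the example records are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def _label_from_header(header: str) -> int:
    # ASSUMPTION: the label is the last whitespace-separated token of the header.
    label = int(header.split()[-1])
    if label not in (0, 1):
        raise ValueError(f"unexpected label in header: {header!r}")
    return label


def read_labeled_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, int, str]]:
    """Yield (header, label, sequence) for each FASTA record."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, _label_from_header(header), "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, _label_from_header(header), "".join(chunks)


# Hypothetical records, used only to show the expected shape of the output.
example = [">protA 1", "MKT", "LLV", ">protB 0", "MAA"]
records = list(read_labeled_fasta(example))
toxic = [r for r in records if r[1] == 1]
```

For a file on disk, the same function can be fed an open file handle, for example `read_labeled_fasta(open(path))`, where `path` points to one of the downloaded FASTA files.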
Click here to access the source code of ToxDL in the GitHub repository.
Click here to download the toxic domain list with 269 toxic domains.
These 269 domains are associated with protein toxicity; each has the keyword toxin or toxic in its domain name.
Click here to download the 75 toxic Pfam entries.
They are associated with protein toxicity; each has the keyword toxin or toxic in its name.
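The keyword-based selection used for both lists can be sketched as follows. This is an assumption about how such a filter could look, not the procedure actually used to build the lists: it does a case-insensitive substring match, so names such as "antitoxin" would also be selected. The function name `select_toxic_entries` and the example identifiers are hypothetical.

```python
import re

# Matches "toxin" or "toxic" anywhere in a name, ignoring case.
# ASSUMPTION: plain substring matching; no exclusion rules are applied.
TOXIN_KEYWORD = re.compile(r"toxi[nc]", re.IGNORECASE)


def select_toxic_entries(names):
    """Keep the entries of an {identifier: name} mapping whose name has the keyword."""
    return {
        identifier: name
        for identifier, name in names.items()
        if TOXIN_KEYWORD.search(name)
    }


# Hypothetical identifiers and names, for illustration only.
example_names = {
    "DOM_1": "Snake toxin-like domain",
    "DOM_2": "Protein kinase domain",
    "DOM_3": "Cytotoxic protein family",
}
selected = select_toxic_entries(example_names)
```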
Click here to download the learned embedding of domains by Word2Vec.
To better represent domains, we use Word2Vec with Skip-gram to learn domain embeddings, so that each domain is represented as a vector of continuous values. We download the protein domain file protein2ipr.dat.gz from https://www.ebi.ac.uk/interpro/download.html. This file lists the InterPro entries and individual signatures that UniProtKB proteins match. It covers 126,780,787 proteins and 36,713 domains, which are used to train Word2Vec to learn the domain embeddings.
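The embedding step described above can be sketched as follows; this is a minimal illustration under stated assumptions, not the training script used for the released embeddings. It assumes that each line of protein2ipr.dat is tab-separated with the UniProtKB accession in the first column and the InterPro entry ID in the second, and that lines for the same protein are adjacent. Each protein is treated as a "sentence" whose "words" are its domain IDs. The function names, the toy rows, and all hyperparameters (`dim`, `window`, `min_count`) are hypothetical choices. Training relies on the third-party gensim library, which is imported only when the training function is called.

```python
import itertools


def domain_sentences(lines):
    """Yield (accession, [InterPro IDs]) for each protein, in file order.

    ASSUMPTION: tab-separated columns with the accession first and the
    InterPro ID second, and all rows of a protein stored contiguously.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for accession, group in itertools.groupby(rows, key=lambda row: row[0]):
        yield accession, [row[1] for row in group]


def train_domain_embeddings(sentences, dim=100):
    """Train Skip-gram Word2Vec on a list of domain-ID lists (requires gensim)."""
    from gensim.models import Word2Vec  # third-party dependency

    # sg=1 selects Skip-gram; the other settings are illustrative defaults.
    return Word2Vec(
        sentences=sentences, vector_size=dim, sg=1, window=5, min_count=1
    )


# Toy rows with made-up accessions and domain IDs, for illustration only.
toy_lines = [
    "PROT_A\tIPR_X\tname x\tSIG_1\t10\t90\n",
    "PROT_A\tIPR_Y\tname y\tSIG_2\t100\t180\n",
    "PROT_B\tIPR_X\tname x\tSIG_1\t5\t85\n",
]
grouped = list(domain_sentences(toy_lines))
corpus = [domains for _, domains in grouped]
```

On the real file, the lines could be streamed with `gzip.open(path, "rt")` from the standard library, and `train_domain_embeddings(corpus)` would then return a model whose `wv` attribute maps each domain ID to its vector. Note that gensim needs a re-iterable corpus (such as a list), not a one-shot generator.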