ToxDL: Deep learning using primary structure and domain embeddings for assessing protein toxicity



Introduction

Developing genetically engineered (GE) food crops involves introducing proteins from one species into a crop plant species, or modifying an existing protein, to improve agriculturally important traits such as yield, pest resistance and herbicide tolerance. For both research and regulatory purposes, it is crucial to assess the potential allergenicity and toxicity of the introduced or gene-edited protein to ensure the food and environmental safety of the crop products.
In this study, we develop an interpretable deep learning-based method, ToxDL, to distinguish toxic proteins from non-toxic proteins using sequences alone. The multi-modal ToxDL has two main components. The first component is based on convolutional neural networks (CNNs): each sequence is encoded as a one-hot matrix, which is fed into a CNN. The second component is a multilayer perceptron that uses domain information. Domains are first identified in the protein sequences using InterProScan. Instead of a high-dimensional one-hot encoding, the domains of each protein are encoded as embeddings learned by word2vec, which are fed into a fully connected layer together with the feature maps from the CNN.
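The one-hot sequence encoding described above can be illustrated with a minimal sketch. This is not the ToxDL implementation: the function name `one_hot_encode`, the residue ordering, the fixed length `max_len`, and the choice to map non-standard residues and padding positions to all-zero rows are assumptions made for illustration only.

```python
# Minimal sketch of one-hot encoding a protein sequence (illustrative only).
# Assumptions: 20 standard amino acids in an arbitrary fixed order; sequences
# are truncated or zero-padded to max_len; unknown residues become zero rows.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Return a max_len x 20 matrix (list of lists) for the given sequence."""
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, residue in enumerate(sequence.upper()[:max_len]):
        idx = AA_INDEX.get(residue)
        if idx is not None:  # non-standard residues (e.g. X) stay all-zero
            matrix[pos][idx] = 1
    return matrix


# Hypothetical toy sequence: three standard residues and one unknown (X).
encoded = one_hot_encode("MKTX", max_len=6)
print(len(encoded), len(encoded[0]))
```

A matrix of this shape (sequence positions by amino acid types) is the kind of input a 1D CNN can scan along the sequence axis; the actual dimensions and padding scheme used by ToxDL are not stated here.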


Availability: ToxDL is available at www.csbio.sjtu.edu.cn/bioinf/ToxDL.




Figure 1. The flowchart of the proposed ToxDL  

Output of ToxDL

Figure 2. The output of the proposed ToxDL  


The output consists of three parts: 1) the predicted score, i.e. the probability from the deep network classifier that the protein is toxic (a higher score indicates a higher predicted likelihood of toxicity); 2) the motifs detected by ToxDL, which show which amino acids are important for the prediction; 3) any toxic domains detected by InterProScan in the protein.

© 2017 Computational Systems Biology/Shen Group.