Workflow

Protein design has become a critical technique with significant potential for applications such as drug development and enzyme engineering. However, protein design methods that rely solely on pretraining and fine-tuning large language models struggle to capture the relationships in multi-modal protein data. To address this, we propose ProtDAT, a de novo fine-grained framework capable of designing proteins from any descriptive protein text input. ProtDAT builds on the inherent characteristics of protein data to unify sequences and text as a cohesive whole rather than treating them as separate entities. It leverages an innovative multi-modal cross-attention mechanism that integrates protein sequences and textual information at a foundational level for seamless generation.

Note: The input and output must strictly adhere to the formats shown in the examples; otherwise, ProtDAT's default tokenizer may adversely affect subsequent sequence generation.

The workflow is as follows:
(1) Step 1: Input a protein description and/or a corresponding protein sequence fragment.
(2) Step 2: Select the hyperparameters for ProtDAT's generation method.
(3) Step 3: Click the 'Submit' button and wait about a minute.

You will then receive protein sequences conditioned on the text and/or the provided sequence fragments.
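The exact hyperparameters ProtDAT exposes are not listed here, but autoregressive sequence generators typically offer sampling controls such as temperature and top-k filtering. The sketch below illustrates how these two knobs reshape a next-token distribution; all names are illustrative and do not reflect ProtDAT's actual API.

```python
import numpy as np

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Turn raw logits into a sampling distribution.

    temperature < 1 sharpens the distribution; top_k keeps only the
    k most likely tokens. Illustrative only, not ProtDAT's API.
    """
    scaled = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(scaled)[-top_k]           # k-th largest logit
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)
    exp = np.exp(scaled - np.max(scaled))          # numerically stable softmax
    return exp / exp.sum()

# Toy logits over a 5-token "amino-acid" vocabulary
probs = next_token_probs([2.0, 1.0, 0.5, -1.0, -2.0], temperature=0.8, top_k=3)
```

Lower temperatures and smaller top-k values make generation more deterministic; higher values increase sequence diversity.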

Framework

Emerging protein generation methods either focus on information from a single modality, such as protein sequences, or rely on simple word prompts, lacking the ability to integrate other modalities to guide the generation process. Multi-modal pretraining models for proteins face significant challenges: they require high compatibility across modalities, are susceptible to overfitting on training data, and encounter difficulties with modality alignment and expansion.
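In a cross-attention layer of this kind, protein-sequence tokens act as queries while the text description supplies keys and values, so the description steers each generation step. The following is a generic single-head sketch of that mechanism; the projection weights are random stand-ins for learned parameters, and ProtDAT's actual layer shapes and parameterization may differ.

```python
import numpy as np

def cross_attention(seq_states, text_states, rng=None):
    """Single-head cross-attention: sequence tokens attend to text tokens.

    seq_states:  (seq_len, d)  protein-sequence hidden states (queries)
    text_states: (text_len, d) text hidden states (keys and values)
    Generic sketch, not ProtDAT's exact architecture.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d = seq_states.shape[-1]
    # Random projections stand in for learned weight matrices.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = seq_states @ Wq, text_states @ Wk, text_states @ Wv
    scores = Q @ K.T / np.sqrt(d)                   # (seq_len, text_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over text tokens
    return weights @ V                              # (seq_len, d)

seq = np.random.default_rng(1).standard_normal((6, 16))   # 6 residue states
txt = np.random.default_rng(2).standard_normal((10, 16))  # 10 text-token states
out = cross_attention(seq, txt)                           # shape (6, 16)
```

The output keeps the sequence's length and width, so the fused representation can flow into subsequent decoder layers unchanged.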

The differences between ProtDAT and other protein sequence design models are shown below.


The overall pipeline of ProtDAT is shown below.