Background

Understanding nuclear localization signals (NLSs) is a long-standing fundamental problem in biology, and developing NLS prediction models from limited experimental data is challenging. NLSExplorer builds on the biological knowledge encoded in protein embeddings generated by large language models, with the attention mechanism acting as a bridge across prediction tasks. By leveraging the knowledge in these pre-trained representations, NLSExplorer opens the door to exploring the NLS space. It can discover new types of NLS and depends less on the size of the training dataset, resulting in superior generalization performance.



Attention to Key Area

NLSExplorer is equipped with an explorer module called Attention to Key Area (A2KA). It learns to identify the key areas that are crucial for prediction by extracting biological information from the embedding space of large language models. We train the A2KA model to predict the subcellular locations of all nuclear proteins from Swiss-Prot using representations generated by ESM1b-650M. This training gives the model the ability to detect, at large scale, NLS segments and other important nuclear transport fragments within entire sequences.


A2KA is a general framework that excels at integrating with language models. The attention mechanism within it serves as a medium, providing a connection to various downstream tasks and endowing the model with interpretability.
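To make the role of attention concrete, here is a minimal sketch of attention pooling over per-residue embeddings. It is not the A2KA architecture itself: the function names, the single query vector, and the toy numbers are assumptions chosen for illustration. It only shows how attention weights can both summarize a sequence for a downstream task and point at the residues that mattered most.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(embeddings, query):
    """Score each residue embedding against a query vector and return
    (attention weights, attention-weighted mean embedding)."""
    scores = [sum(q * x for q, x in zip(query, emb)) for emb in embeddings]
    weights = softmax(scores)
    dim = len(query)
    pooled = [sum(w * emb[d] for w, emb in zip(weights, embeddings))
              for d in range(dim)]
    return weights, pooled

# Toy per-residue embeddings: 4 residues, 2 dimensions (hypothetical values;
# a real language model would produce much larger vectors).
toy_embeddings = [[0.1, 0.0], [2.0, 1.0], [1.5, 1.2], [0.0, 0.2]]
toy_query = [1.0, 1.0]  # stands in for a learned parameter

weights, pooled = attention_pool(toy_embeddings, toy_query)

# The residue with the largest weight is the one the pooled summary relies on most.
top_residue = max(range(len(weights)), key=weights.__getitem__)
```

In this sketch the pooled vector would feed a classifier, while the weights can be read back per residue, which is the kind of interpretability the paragraph above describes.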


Exploration for NLS

NLSExplorer-prediction achieves high accuracy in NLS prediction. In addition, the NLSExplorer-SCNLS component provides the Search and Collect NLS (SCNLS) algorithm for post-analysis of recommended segments. We use NLSExplorer-SCNLS to detect potential NLSs and analyze their patterns across all nuclear proteins in Swiss-Prot, shedding light on the potential universe of NLSs. We also use A2KA to explore nuclear import segments in nuclear proteins, unveiling a potential multi-species relationship among 416 species in Swiss-Prot.


NLS Candidate Library

To advance the investigation and identification of NLSs, we place all potential NLSs that NLSExplorer recommends for the nuclear proteins in Swiss-Prot with experimental evidence into an interactive visualization map, which we call the NLS Candidate Library. Each point represents an NLS, and its coordinates are obtained by using UMAP to project the embedding of that NLS segment from the NLSExplorer recommendation system. The map offers several functions for exploring the NLS space. Each NLS is labeled with the probability predicted by NLSExplorer, and users can select a threshold to filter segments and view NLSs at different confidence levels. Clicking on a point shows relevant information. The map also provides a search function: entering an amino acid segment highlights all potential NLSs that contain it.
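The threshold filter and segment search described above can be sketched in a few lines. This is an illustration only, not the map's actual implementation: the Candidate record, the function names, and the probabilities are hypothetical, and the case-insensitive substring match is an assumption about how the search behaves.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    segment: str        # amino acid sequence of the candidate NLS
    probability: float  # predicted probability (made-up values below)

def filter_by_threshold(candidates, threshold):
    """Keep candidates whose predicted probability is at least the threshold."""
    return [c for c in candidates if c.probability >= threshold]

def search_segment(candidates, query):
    """Return candidates whose segment contains the query (case-insensitive)."""
    q = query.upper()
    return [c for c in candidates if q in c.segment.upper()]

# Toy library; the probabilities are invented for illustration.
library = [
    Candidate("PKKKRKV", 0.95),
    Candidate("KRPAATKKAGQAKKKK", 0.90),
    Candidate("AAGGLLSS", 0.20),
]

confident = filter_by_threshold(library, 0.5)
hits = search_segment(library, "kkk")
```

Here `confident` holds the two high-probability entries and `hits` holds every entry containing the queried segment, mirroring the filter and highlight behaviour of the map.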


Nuclear Transport Pattern Map

We develop another interactive map, the Nuclear Transport Pattern Map, to visualize the potential continuous segment patterns mined by the SCNLS algorithm from the Swiss-Prot database. The segments used for pattern analysis are obtained by running A2KA with the selective strategy 'Perc_80', which keeps the NLS at a maximum retrieval scale. As a result, the model maintains an optimal ability to detect potential NLS segments and other segments that may be important for protein nuclear transport, while keeping a low probability of introducing irrelevant segments.
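The text does not define 'Perc_80'. One plausible reading, assumed here purely for illustration, is a percentile cutoff on per-residue attention: keep residues whose weight is at or above the 80th percentile, then merge neighbouring kept residues into continuous segments. The sketch below follows that assumption; the function names, the nearest-rank percentile, and the toy weights are hypothetical and may differ from the actual strategy.

```python
def percentile_cutoff(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the values are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

def continuous_segments(sequence, weights, pct=80):
    """Return (start index, segment) pairs for runs of consecutive residues
    whose attention weight is at or above the pct-th percentile."""
    cutoff = percentile_cutoff(weights, pct)
    segments, start = [], None
    for i, w in enumerate(weights):
        if w >= cutoff:
            if start is None:
                start = i          # open a new run
        elif start is not None:
            segments.append((start, sequence[start:i]))  # close the run
            start = None
    if start is not None:          # run extends to the end of the sequence
        segments.append((start, sequence[start:]))
    return segments

# Toy sequence and attention weights (hypothetical values).
toy_sequence = "MAPKKKRKVE"
toy_weights = [0.01, 0.02, 0.05, 0.20, 0.25, 0.15, 0.10, 0.08, 0.09, 0.05]

segments = continuous_segments(toy_sequence, toy_weights)
```

Under this reading, a lower percentile would retrieve more and longer segments at the cost of more irrelevant residues, which is the trade-off the paragraph above refers to.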


Species Map

Using search correlation as a standard for assessing segment relationships between species allows us to explore patterns and tendencies. We visualize this search correlation by building a search map of species with an in-degree greater than 20. The map shows that the search tendency converges towards species with rich segment patterns, such as Homo sapiens. Interestingly, mammalian species display mutual search correlation with one another and are frequently and centrally connected to other species. Overall, the potential nuclear-mediated segments in Swiss-Prot show both interspecies differences and shared characteristics among species.
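The in-degree filter used to build the map can be sketched as follows, assuming the search correlation is stored as directed edges (source species, target species). That edge representation, the function names, and the toy edges are assumptions for illustration and are not results from the study; the toy threshold is 1 rather than 20 so the example stays small.

```python
from collections import Counter

def in_degrees(edges):
    """Count incoming edges for each target species."""
    return Counter(target for _source, target in edges)

def filter_by_in_degree(edges, min_in_degree):
    """Keep species whose in-degree is strictly greater than min_in_degree,
    along with the edges that connect two kept species."""
    degrees = in_degrees(edges)
    kept = {node for node, d in degrees.items() if d > min_in_degree}
    kept_edges = [(s, t) for s, t in edges if s in kept and t in kept]
    return kept, kept_edges

# Toy directed edges (hypothetical, for illustration only).
toy_edges = [
    ("Mus musculus", "Homo sapiens"),
    ("Rattus norvegicus", "Homo sapiens"),
    ("Danio rerio", "Homo sapiens"),
    ("Homo sapiens", "Mus musculus"),
    ("Rattus norvegicus", "Mus musculus"),
    ("Homo sapiens", "Danio rerio"),
]

kept_nodes, kept_edges = filter_by_in_degree(toy_edges, min_in_degree=1)
```

Species that are searched towards by many others survive the filter, so a hub with rich segment patterns stays in the map while sparsely connected species drop out.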