Shanghai Jiao Tong University - Hong-Bin Shen - Pattern Recognition and Bioinformatics Group

Online Services

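Group Overview

The Pattern Recognition and Bioinformatics Group at Shanghai Jiao Tong University focuses on the fundamental theory and algorithms of pattern recognition and artificial intelligence and on their application to modeling biomedical big data. The group develops new theories and algorithms driven by complex problems and explores new methods for interdisciplinary research between information science and the life sciences.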

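Research Directions

Fundamental theory of artificial intelligence and pattern recognition
Machine learning algorithms and applications
Bioinformatics
Protein and RNA engineering
Biomedical image processing
Complex networks and biomedical big data mining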

Group News

RNA-binding proteins (RBPs) play crucial roles in many biological processes, and computationally identifying RNA-RBP interactions provides insights into the biological mechanisms of diseases associated with RBPs. To make RBP-specific deep learning-based methods for predicting RBP binding sites easily accessible, we developed an updated, easy-to-use webserver, RBPsuite 2.0, with a new web interface for predicting RBP binding sites from linear and circular RNA sequences. Compared with the original RBPsuite, RBPsuite 2.0 increases the number of supported RBPs from 154 to 353 and expands the supported species from one to seven. This study is published in BMC Biology, 2025.
Traditional alignment-based methods are designed for precise pairwise comparisons and offer high accuracy. However, they face challenges when searching within large databases. In response, we propose a novel deep-learning approach, FoldExplorer. It combines graph attention neural networks and protein large language models to process both protein structure and sequence data and to generate embeddings for protein structures. FoldExplorer demonstrates excellent performance in both protein geometric similarity search and fold-type classification tasks. Notably, it remains effective even when handling low-confidence predicted structures. Moreover, FoldExplorer is highly efficient when searching large-scale databases. Its ability to generate accurate embedding spaces enables a comprehensive view of the protein structure landscape, offering novel insights into clustering and boundaries within the protein universe.
RNA velocity is closely related to cell fate and is an important indicator for predicting cell states, with an elegant physical explanation derived from single-cell RNA-seq data. Motivated by the finding that RNA velocity can be driven by transcriptional regulation, we propose TFvelo, which extends the RNA velocity concept to single-cell datasets without splicing information by introducing gene regulatory network information. Experiments on synthetic data, scRNA-Seq and MERFISH data demonstrate that TFvelo can model gene dynamics, infer cell pseudo-time and trajectory, and detect key TF-target regulation simultaneously. This study is published in Li et al., Nature Communications, 2024.
Spatial transcriptomics data can provide high-throughput gene expression profiling and the spatial structure of tissues simultaneously. Taking advantage of spatial transcriptomics and graph neural networks, we introduce cell clustering for spatial transcriptomics data with graph neural networks (CCST), an unsupervised cell clustering method based on graph convolutional networks that improves ab initio cell clustering and the discovery of cell subtypes based on curated cell category annotation. Based on its application to five in vitro and in vivo spatial datasets, we show that CCST outperforms other spatial clustering approaches on spatial transcriptomics datasets and can clearly identify all four cell cycle phases from multiplexed error-robust fluorescence in situ hybridization data of cultured cells. This study is published in Li et al., Nature Computational Science, 2022.
Large language models can capture intrinsic correlations and biological knowledge from massive pre-training sequences through self-supervised learning. This presents major opportunities for exploring and identifying key functional regions, specific structural features, mutation sites, and other crucial areas on protein sequences. NLSExplorer utilizes A2KA to establish connections between protein nuclear localization and language models, effectively transferring pre-trained representations and nuclear localization information to the task of predicting NLSs through attention mechanisms. We use NLSExplorer to detect potential NLSs in the Swiss-Prot database and build the NLS Candidate Library, an online interactive map with global and local search functions that sheds light on the potential universe of NLSs. This study is available at https://doi.org/10.1101/2024.08.10.606103.
GraphBind is a webserver designed for structure-based prediction of nucleic acid- and small ligand-binding residues. GraphBind consists of two modules: 1) Graph construction based on structure, which integrates the local neighborhood around each residue to construct graphs. Figure 1 shows the flowchart of constructing a graph based on structure context in GraphBind. 2) Hierarchical graph neural networks (HGNNs), which progressively update the edge, node and graph features and further learn high-level features for classifying the binding residues. This study is published in Xia et al., Nucleic Acids Research, 2021.
Transmembrane proteins (TMPs) play important roles in many biological processes. Their structures are crucial for revealing complex functions but are hard to obtain. In this study, we focus mainly on α-helical TMPs and develop a multiscale deep learning pipeline, MemBrain 3.0, to improve topology prediction. This new protocol includes two submodules. The first module is transmembrane helix (TMH) prediction, which can accurately predict TMHs together with their tail parts by incorporating tail modeling. The second module is orientation prediction, which consists of a support vector machine classifier and a new Max-Min assignment strategy. The study is published in Journal of Molecular Biology, 2019.
We used immunohistochemistry images from the Human Protein Atlas as the source of subcellular location information, and built classification models for the complex protein spatial distribution in normal and cancerous tissues. The models can automatically estimate the fractions of protein in different subcellular locations, and can help to quantify the changes of protein distribution from normal to cancer tissues. In addition, we examined the extent to which different annotated protein pathways and complexes showed similarity in the locations of their member proteins, and then predicted new potential proteins in these networks. This study is published in Bioinformatics, 2019.
We have developed the AIR software, an Artificial Intelligence-based protein 3D structure Refinement method. AIR is built on a multi-objective particle swarm optimization (PSO) protocol. We use multiple energy functions as the objectives so as to correct the potential inaccuracy of any single function. Given several initial 3D models for one sequence, AIR treats each initial structure as a particle, and the particles move around as the structure refinement proceeds. The quality of the current particles (structures) is evaluated by three energy functions, and the non-dominated particles are put into a set called the Pareto set. After the iteration converges, the particles in the Pareto set are screened, and some of them are output as the final refined structures. AIR is published in Bioinformatics, 2019.
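The notion of non-dominated particles can be illustrated with a minimal sketch. It is not the AIR implementation: the function names and the energy values are hypothetical, and it assumes that each structure is scored by three energy functions for which lower is better.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly
    better on at least one (assumption: lower energy is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_set(scores: List[Sequence[float]]) -> List[int]:
    """Return the indices of score vectors that no other vector dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical scores of four candidate structures under three energy functions.
energies = [
    (-120.0, -35.0, 4.1),
    (-118.0, -36.5, 4.0),
    (-110.0, -30.0, 5.2),  # worse than the first candidate on all three
    (-121.0, -34.0, 4.3),
]
front = pareto_set(energies)
print(front)
```

Only the third candidate is dropped here, because the first one is better on every objective; the others each win on at least one energy function and therefore stay in the set.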
We propose NCEM, a new clustering protocol based on a network structural similarity metric, for clustering noisy cryo-EM images. We first construct an image complex network for all the cryo-EM single-particle images, where each image is represented as a node in the network. The similarity between two images is then refined using the structural geometry of the network. By extending the similarity measurement from two independent images to their corresponding neighbor sets in the network, NCEM is more resistant to noise than a direct comparison of the two images, because it uses the structural information of the network. This study is published in Yin et al., Journal of Chemical Information and Modeling, 2019.
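The idea of comparing neighbor sets rather than two images directly can be sketched as follows. This is only an illustration: the Jaccard overlap of closed neighborhoods is a stand-in chosen for simplicity, not the similarity metric defined in NCEM, and the graph and image names are hypothetical.

```python
from typing import Dict, Set


def neighbor_jaccard(graph: Dict[str, Set[str]], a: str, b: str) -> float:
    """Similarity of two nodes measured as the Jaccard overlap of their
    closed neighborhoods (each node plus its neighbors in the network)."""
    na = graph[a] | {a}
    nb = graph[b] | {b}
    return len(na & nb) / len(na | nb)


# Hypothetical image network: an edge links two images judged similar.
graph = {
    "img1": {"img2", "img3"},
    "img2": {"img1", "img3"},
    "img3": {"img1", "img2", "img4"},
    "img4": {"img3"},
}

print(neighbor_jaccard(graph, "img1", "img2"))
print(neighbor_jaccard(graph, "img1", "img4"))
```

Because the score depends on many edges at once, a single noisy pairwise comparison has less influence than it would in a direct image-to-image measurement.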
We develop a new annotator for fruit fly embryonic images, AnnoFly. Driven by an attention-enhanced RNN model, it can weigh images of different qualities so as to focus on the most informative image patterns. We assess the new model on three standard data sets. The experimental results reveal that the attention-based model provides a transparent approach for identifying the important images for labeling, and that it substantially improves accuracy compared with existing annotation methods, including both single-instance and multi-instance learning methods. This study is published in Yang et al., Bioinformatics, 2019.
We propose a deep learning-based method, iDeepS, to simultaneously identify the binding sequence and structure motifs from RNA sequences using convolutional neural networks (CNNs) and a bidirectional long short-term memory network (BLSTM). We first perform one-hot encoding of both the sequence and the predicted secondary structure to enable the subsequent convolution operations. To reveal the hidden binding knowledge in the observed sequences, the CNNs are applied to learn abstract features. Considering the close relationship between sequence and predicted structure, we use the BLSTM to capture possible long-range dependencies between the binding sequence and structure motifs identified by the CNNs. Finally, the learned weighted representations are fed into a classification layer to predict the RBP binding sites. This study is published in Pan et al., BMC Genomics, 2018.
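The one-hot encoding step can be sketched in a few lines. This is a minimal illustration rather than the iDeepS code: the helper name is hypothetical, and the dot-bracket structure alphabet is an assumption made for the example, since the text does not specify which structure annotation is used.

```python
from typing import List


def one_hot(seq: str, alphabet: str) -> List[List[int]]:
    """Encode each symbol as a 0/1 vector over `alphabet`.
    Symbols outside the alphabet (e.g. 'N') become all-zero rows."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in seq:
        row = [0] * len(alphabet)
        if ch in index:
            row[index[ch]] = 1
        rows.append(row)
    return rows


rna = "ACGUN"            # toy RNA sequence with one ambiguous base
structure = "((.))"      # toy dot-bracket string (assumed structure alphabet)

seq_matrix = one_hot(rna, "ACGU")          # shape: 5 positions x 4 channels
struct_matrix = one_hot(structure, "(.)")  # shape: 5 positions x 3 channels

print(seq_matrix)
print(struct_matrix)
```

Each matrix has one row per position and one column per symbol, which is the layout a 1D convolution over the sequence axis typically expects.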
Measuring the resolution of a reconstructed 3D density map is an important problem in Single-Particle Reconstruction (SPR) from cryo-EM images. It plays a critical role in promoting methodology development for SPR and structural biology. Because no benchmark map exists when a new structure is generated, estimating the resolution of a new map is still an open question. We propose a new self-reference-based resolution estimation protocol, SRes, which requires only a single reconstructed 3D map to measure resolution. The core idea of SRes is to perform a multi-scale spectral analysis on the map by segmenting it with multiple masks of variable size. SRes provides a new routine for measuring the resolution from a single density map. This study is published in Yang et al., Journal of Chemical Information and Modeling, 2018.
lncLocator is a new ensemble classifier-based predictor of lncRNA subcellular localization developed by the Shen lab. Long non-coding RNAs (lncRNAs) have been a hot topic in the field of RNA biology. Recent studies have shown that their subcellular localizations carry important information for understanding their complex biological functions. Because experiments for identifying the subcellular localization of lncRNAs are costly and time-consuming, computational methods are urgently needed. To fully exploit lncRNA sequence information, we adopt both k-mer features and high-level abstraction features generated by unsupervised deep models, and construct four classifiers by feeding these two types of features to support vector machines and random forests for prediction. This study is published in Cao et al., Bioinformatics, 2018.
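The k-mer features mentioned above can be sketched as a normalized frequency vector. This is an illustrative example, not the lncLocator feature extractor: the function name, the choice of k = 2, and the DNA-style alphabet are assumptions.

```python
from itertools import product
from typing import List


def kmer_frequencies(seq: str, k: int, alphabet: str = "ACGT") -> List[float]:
    """Return the frequency of every possible k-mer over `alphabet`,
    in lexicographic order of the alphabet. Windows containing symbols
    outside the alphabet are skipped."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = {kmer: 0 for kmer in kmers}
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


# Toy sequence; its 2-mers are AC, CG, GT, TA, AC.
features = kmer_frequencies("ACGTAC", 2)
print(len(features))
print(features[1])  # frequency of "AC"
```

A fixed-length vector like this (4^k entries) can be passed directly to a classifier such as a support vector machine or a random forest, regardless of the length of the input sequence.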
Enriched RNA-protein binding motifs revealed by the new iDeepE model. RNA-binding proteins account for 5–10% of the eukaryotic proteome and play key roles in many biological processes, e.g. gene regulation. We present a deep learning-based method, iDeepE, to predict RBP binding sites from sequences alone by fusing local multi-channel convolutional neural networks and global convolutional neural networks. It is able to mine new binding motifs from large data pools efficiently. This study is published in Pan and Shen, Bioinformatics, 2018.
We have developed a new cell tracking approach, Hift, to construct the cell lineage. Quantitative analysis of cell population trajectories has wide applications in revealing the complex mechanisms of organisms in the micro-world, such as microtubules, stem cells and embryos. For instance, to understand how a drug affects cells, to study the propagation of embryo cells, or to analyze the cell cycle, it is critical to track the cell population accurately and extract its motion features. Accurate cell tracking and lineage construction under microscopy play an important role in analyzing cell migration, mitosis and proliferation. In the last decade, labor-intensive manual analysis has gradually been replaced by automated cell tracking methods. The new hierarchical tracking method Hift is robust to variations in cell morphology and staining. The paper is published in Zhi et al., Neurocomputing, 2018.
Inter-residue contacts in proteins have been widely acknowledged to be valuable for protein 3D structure prediction. Accurate prediction of long-range transmembrane inter-helix residue contacts can significantly improve the quality of simulated membrane protein models. We found that deep convolutional neural networks can mine latent residue contact patterns and thus improve inter-helix residue contact prediction. The new MemBrain is a two-stage inter-helix contact predictor. The first stage takes sequence-based features as inputs and outputs coarse contact probabilities for each residue pair. In the second stage, these probabilities are fed into a convolutional neural network together with predictions from three direct-coupling analysis approaches. The study is published in Jing Yang and Hong-Bin Shen, Bioinformatics, 2018, 34: 230-238.
AdipoCount, a new adipocyte segmentation and counting system for obesity research. Obesity has spread worldwide and become a common health problem in modern society. One typical feature of obesity is the excessive accumulation of fat in adipocytes, which occurs through two physiological phenomena: hyperplasia (increase in quantity) and hypertrophy (increase in size) of adipocytes. In clinical and scientific research, accurate quantification of the number and diameter of adipocytes is necessary for assessing obesity. We have developed a new automatic adipocyte counting system based on bioimage understanding, AdipoCount, which is accurate and supports further manual interaction. The outputs of this system are the labels and the statistical data of all adipose cells in the image. AdipoCount is published in Zhi et al., Frontiers in Physiology, 2018, 9: 85.
The Shen Group's project "Artificial Intelligence Algorithm Development for Biological Medical Big Data Understanding and Its Online Prediction Application Systems" has been selected for the final list of the SAIL award of Artificial Intelligence World Innovations 2018.