Incorporating organelle correlations into protein subcellular localization prediction


Introduction


Bioimages of subcellular protein distribution, as a new data source, have attracted much attention in the field of automated prediction of protein subcellular localization. The performance of existing systems is significantly limited by the small number of high-quality images with explicit annotations, which leads to a small-sample-size learning problem. The problem is more serious for multi-location proteins, which co-exist in two or more organelles, because such proteins are difficult to annotate accurately by either biological experiments or automated systems.

In this study, we built protein subcellular localization predictors designed to deal with the small-sample-size problem and with multi-location proteins. Five semi-supervised algorithms that can make use of lower-quality data were integrated, and a new multi-label classification approach that incorporates the correlations among different organelles in cells was designed. The organelle correlations were modeled by Bayesian network learning, and the topology of the resulting correlation graph guides both the feature space and the order in which the binary classifiers are trained in multi-label classification. The proposed protocol was applied to both immunohistochemistry and immunofluorescence image datasets, and our experimental results demonstrated its efficiency.

The flow chart of incorporating label correlations into classification is shown in Figure 1.

Fig. 1. Flow chart of incorporating label correlations into classification.
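
To illustrate the idea of letting a label-correlation graph decide the training order and the feature space, the following Python sketch may help. It is not the released MATLAB code and not the authors' exact method: it assumes the Bayesian network structure has already been learned and is given as a dictionary of parent labels, it uses a tiny nearest-centroid classifier as a stand-in for the real binary classifiers, and it leaves out the semi-supervised step entirely. The organelle names, the toy feature vectors, and all function names (`topological_order`, `fit_chain`, `predict_chain`) are hypothetical.

```python
from collections import deque


def topological_order(parents):
    """Return labels so that every label comes after all of its parents.

    parents: dict mapping each label to a list of its parent labels (a DAG).
    """
    indegree = {label: len(ps) for label, ps in parents.items()}
    children = {label: [] for label in parents}
    for label, ps in parents.items():
        for p in ps:
            children[p].append(label)
    queue = deque(sorted(l for l, d in indegree.items() if d == 0))
    order = []
    while queue:
        label = queue.popleft()
        order.append(label)
        for child in sorted(children[label]):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(parents):
        raise ValueError("label graph contains a cycle")
    return order


class CentroidClassifier:
    """Stand-in binary classifier: predicts the class with the nearest mean."""

    def fit(self, X, y):
        self.centroids = {}
        for cls in (0, 1):
            rows = [x for x, t in zip(X, y) if t == cls]
            if rows:
                self.centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        def dist2(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda cls: dist2(self.centroids[cls]))


def fit_chain(X, Y, parents):
    """Train one binary classifier per label, parents first.

    Each label's feature space is the image features plus the (true)
    labels of its parents in the correlation graph.
    """
    order = topological_order(parents)
    models = {}
    for label in order:
        X_aug = [list(x) + [Y[p][i] for p in parents[label]]
                 for i, x in enumerate(X)]
        models[label] = CentroidClassifier().fit(X_aug, Y[label])
    return order, models


def predict_chain(x, order, models, parents):
    """Predict labels in graph order, feeding parent predictions forward."""
    predicted = {}
    for label in order:
        x_aug = list(x) + [predicted[p] for p in parents[label]]
        predicted[label] = models[label].predict_one(x_aug)
    return predicted


# Hypothetical correlation graph: nucleus -> cytoplasm -> mitochondria.
parents = {"nucleus": [], "cytoplasm": ["nucleus"], "mitochondria": ["cytoplasm"]}
# Toy two-dimensional "image features" and multi-label annotations.
X = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
Y = {"nucleus": [0, 0, 1, 1], "cytoplasm": [0, 0, 1, 1], "mitochondria": [1, 1, 0, 0]}

order, models = fit_chain(X, Y, parents)
result = predict_chain([0.95, 0.9], order, models, parents)
print(order)
print(result)
```

In this sketch the graph topology plays two roles: the topological order fixes which binary classifier is trained first, and the parent set of each label determines which extra columns are appended to its feature vector. At prediction time the true parent labels are unknown, so the predictions of the earlier classifiers are used in their place.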



Code and datasets

The data and code are contained in the following compressed files:
Click here to download the datasets (309 MB), and click here to download the source code (23 MB). The code package has been tested using MATLAB 2014a under Windows 7 on a 64-bit architecture.



Reference

Ying-Ying Xu, Fan Yang, Hong-Bin Shen, Incorporating organelle correlations into semi-supervised learning for protein subcellular localization prediction, Submitted.