Consistency and Variation of Protein Subcellular Location Annotations


Introduction


A major challenge for protein databases is reconciling information from diverse sources. This is especially difficult when some of the information consists of secondary, human-interpreted data rather than primary data. For example, the Swiss-Prot database contains curated annotations of subcellular location that are based on predictions from protein sequence, statements in scientific articles, and published experimental evidence. The Human Protein Atlas (HPA) consists of millions of high-resolution microscopic images that show the spatial distribution of proteins at the cellular and subcellular levels. These images are manually annotated with protein subcellular locations by trained experts. The image annotations in HPA can capture the variation of subcellular location across different cell lines, tissues, or tissue states. A systematic investigation of the consistency between HPA and Swiss-Prot assignments of subcellular location, which is important for understanding and using protein location data from the two databases, has not been described previously. In this paper, we quantitatively evaluate the consistency of subcellular location annotations between HPA and Swiss-Prot at multiple levels, as well as the variation of protein locations across cell lines and tissues. Our results show that the annotations in these two databases differ significantly in many cases, and we therefore propose procedures for deriving and integrating protein subcellular location data. We also find that proteins with highly variable locations are more likely to be disease biomarkers, which supports incorporating analysis of subcellular location into protein biomarker identification and screening.



Code and Data

The data and code are contained in the following compressed files:

Two archives are available for download: the supplementary files (19 MB), which include protein subcellular location annotations from HPA and Swiss-Prot, subcellular location mapping relationships, and low-consistency proteins; and the source code (38 MB). The code package has been tested with MATLAB R2019a under Mac OS X Yosemite 10.10.5 and Windows 7 on 64-bit architectures.



Intermediate Data

Intermediate HPA data can be generated by running the downloadProteins.m function from code.zip. Alternatively, the data can be downloaded directly (966 MB). To use the downloaded data, unzip the file and place its contents under the 'data/HPA' directory.
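
For readers who prefer to script the unpacking step, the following is a minimal sketch in Python, not part of the distributed MATLAB package. It assumes the intermediate data arrives as an ordinary zip archive; the function name install_intermediate_data and the project_root argument are hypothetical and introduced here only for illustration. Unzipping by hand, or with MATLAB's built-in unzip function, achieves the same result.

```python
import zipfile
from pathlib import Path


def install_intermediate_data(archive_path, project_root="."):
    """Extract a downloaded zip archive into <project_root>/data/HPA.

    Both arguments are assumptions for this sketch: archive_path is wherever
    the intermediate-data archive was saved, and project_root is the folder
    where the code package was unpacked.
    """
    target = Path(project_root) / "data" / "HPA"
    # Create data/HPA if it does not exist yet; leave it alone if it does.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        # Writes every archive member beneath data/HPA, keeping any
        # subfolder structure stored inside the archive.
        archive.extractall(target)
    return target
```

A call such as install_intermediate_data("intermediate_data.zip", "code") would then populate code/data/HPA; both names are placeholders. If the archive wraps its contents in a single top-level folder, the files end up one level deeper than 'data/HPA' and need to be moved up so that the contents themselves sit directly under that directory.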



Reference

Ying-Ying Xu, Hang Zhou, Robert F. Murphy, and Hong-Bin Shen. Consistency and variation of protein subcellular location annotations. Proteins: Structure, Function, and Bioinformatics, 2020.