ReadMe

INSP: Identification of Nucleus Signal Peptide from Protein Primary Sequence

1. Introduction
Nuclear localization signals (NLSs) are protein peptides binding to carrier proteins. They can transport nuclear proteins into the nucleus. The identification of NLS can help elucidate protein function. However, experimental identification of such signals is expensive and only a limited number of NLSs have been currently determined. It is therefore important to develop an algorithm for computational prediction of NLS. Till now, there are several automatic methods to predict NLS.The biggest problem with nuclear localization signal prediction is that the accuracy and recall rate are difficult to balance. Since the number of existing validated NLSs is limited and mostly rich in lysine and arginine, the machine learning based NLS prediction algorithm tends to be NLSs with higher coverage of lysine and arginine. As long as a motif contains many lysine and arginine residues, it is easy to think of NLS, resulting in excessive redundancy, and ignore some other types of NLS, especially some NLSs rare in lysine and arginine. Here, we establish a nuclear localization signal prediction algorithm, named INSP. It is a consensus scoring system based on frequent pattern and machine learning. Firstly, the frequent pattern mining knowledge is used to obtain some motifs frequently appearing in the nuclear database, so as to solve the tendency problem in machine learning, and it is easy to find some special NLSs. Then, in the scoring mechanism based on machine learning, comprehensive utilization of evolutionary information (PSSM), disorder score, sequence feature information (word-vector) and statistical information (mean) are used to enhance some filtering conditions to reduce redundancy. Combining the two scoring mechanisms, it is possible to obtain some NLSs with higher matching with known NLSs, and to find some special NLSs rich in nuclear sequences. INSP is a web server developed for nuclear localization signals prediction. The algorithm flowchart show as follows:
Figure 1. Flow chart of Statistical knowledge-based and machine learning SVM-based searching for NLS segments in a query sequence.
Figure 2. The Flowchart for searching for NLS peptide based on large-scale frequent pattern mining.
Figure 3. The algorithm of INSP for searching NLS motifs by fusing the frequent pattern mining and machine learning predictions.

2. Inputs
(A). Protein sequence Input the protein sequence into the input box in Fasta format.
(B). Email address You need to input your email address to receive an email notification of your prediction results.

3. Outputs

The prediction result will be showed on the web and sent to your email. And the result of each sequence can be combined by three parts:
1)INSP: The model predicts NLS by the consensus model combined by large-scale frequent pattern mining model and statistical knowledge-based and machine learning SVM-based model. The score is from statistical knowledge-based and machine learning SVM-based model, thus the result of large-scale frequent pattern mining model does not have score.
2)Statistical knowledge-based and machine learning SVM-based model without merging: The model selects top5 NLSs predicted by our statistical knowledge-based and machine learning SVM-based model without merging the fragments.
3)Large-scale frequent pattern mining model: The model selects top3 NLSs in the enrichment score and top3 in the ranking score from the frequent pattern database.