1. What is LocSigDB?

LocSigDB is a manually curated database of experimental and predicted localization signals for eight distinct subcellular locations in eukaryotic cells. Through an extensive literature survey, we compiled a set of 535 experimentally determined localization signals. In addition, we developed an in silico mutagenesis-based scoring method that identified 785 potential sorting candidates likely to function as protein localization signals. Each signal is annotated with its localization, its source, a confidence score or PubMed references as applicable, and all proteins in the Swissprot database that contain the signal. LocSigDB is the only database that covers targeting signals for eight different subcellular organelles and, with a collective count of 1320 signals, it is the most comprehensive compendium of localization signals to date.

2. Query

2.1. Signal

Users who know the precise signal they want to search for can enter it either as a plain amino acid sequence or as a regular expression over amino acids, as shown in the examples below.

b) TEKK[QG]KSILYDCA (Occurrence of amino acid 'Q' or 'G' at the fifth position)
c) RRRx{11}KRRK (Any amino acid 'x', occurring 11 times)
d) KRx{7,9}PQPKKKP (Any amino acid 'x', occurring 7 to 9 times)

A combination of the above regular expressions is also accepted:

e) [RK]x{2}Lx{1}[VY]x{2}[VI]x{1}[KR]x{3}[KR]
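
The pattern syntax in these examples maps closely onto standard regular expressions. As a rough illustration only (this is not LocSigDB's own code), the Python sketch below assumes that lowercase 'x' stands for any of the 20 standard amino acids and that the bracket and brace syntax keeps its usual regular-expression meaning. The function name signal_to_regex is hypothetical.

```python
import re

# Assumption: 'x' in a signal means "any one of the 20 standard amino acids".
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def signal_to_regex(signal: str) -> re.Pattern:
    """Translate a signal such as 'RRRx{11}KRRK' into a compiled Python regex.

    Uppercase letters are literal residues, [..] lists the residues allowed
    at one position, and {n} or {m,n} repeats the preceding element.
    Only the lowercase wildcard 'x' needs to be rewritten.
    """
    return re.compile(signal.replace("x", f"[{AMINO_ACIDS}]"))


# Example (e) from the text, checked against a candidate peptide of your own.
combined = signal_to_regex("[RK]x{2}Lx{1}[VY]x{2}[VI]x{1}[KR]x{3}[KR]")
```

Calling combined.fullmatch(peptide) would then test whether a whole peptide fits the pattern, while combined.search(sequence) would look for the pattern anywhere inside a longer sequence.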

Location information is optional in this case, but if the user already knows the location of the signal, specifying it will give more meaningful results. The optional checkbox 'Check to avoid substring search' limits the search to exact matches.

2.2. Protein ID

Users can query the database using the RefSeq ID or the Uniprot ID/AC of the protein of interest. The database retrieves the sequence of the protein with the given ID and reports the localization signals in it (if any). An example of an acceptable RefSeq ID is shown below.


Similarly, an acceptable Uniprot ID or accession number (AC) is:

HNRPK_HUMAN (or) P61978
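
Before submitting a query it can help to check which kind of identifier is in hand. The sketch below is an assumption-laden illustration, not part of LocSigDB: the accession pattern follows the format published in the UniProt documentation, the entry-name and RefSeq protein patterns are simplified approximations, and the function name classify_protein_id is hypothetical.

```python
import re

# RefSeq protein accessions: two-letter prefix ending in 'P', underscore,
# digits, and an optional version suffix (simplified approximation).
REFSEQ_PROTEIN = re.compile(r"[ANWXY]P_\d+(\.\d+)?")

# UniProt accession number (AC), per the format in the UniProt documentation.
UNIPROT_AC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# UniProt entry name (ID): mnemonic, underscore, species code (approximation).
UNIPROT_ID = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")


def classify_protein_id(identifier: str) -> str:
    """Return 'refseq', 'uniprot_ac', 'uniprot_id' or 'unknown'."""
    text = identifier.strip().upper()
    if REFSEQ_PROTEIN.fullmatch(text):
        return "refseq"
    if UNIPROT_AC.fullmatch(text):
        return "uniprot_ac"
    if UNIPROT_ID.fullmatch(text):
        return "uniprot_id"
    return "unknown"


# The two Uniprot forms given in the text above.
kinds = [classify_protein_id(i) for i in ("HNRPK_HUMAN", "P61978")]
```

The RefSeq check runs first because its underscore-separated form could otherwise be mistaken for a Uniprot entry name.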

Location information is required when searching by protein ID. If the location of the protein is not known, users can use ngLOC to obtain it.

2.3. Sequence (in FASTA format)

Users can search for localization signals in a protein of interest by submitting its sequence, which must be in FASTA format. An example of the acceptable sequence format (FASTA) is shown below.


Location information is required when searching by sequence. If the location of the protein is not known, users can use ngLOC to obtain it.
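
Because only FASTA input is accepted, it may be worth checking a sequence locally before submitting it. The following is a minimal sketch of a FASTA reader under the usual conventions (a header line starting with '>', followed by one or more sequence lines). The function name parse_fasta is hypothetical and the code makes no claim about how LocSigDB itself validates input.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into a {header: sequence} dictionary.

    Raises ValueError if sequence data appears before any '>' header line
    or if the same header occurs twice.
    """
    records = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if header in records:
                raise ValueError(f"duplicate header: {header!r}")
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            # Sequence lines may be wrapped; collect and join them later.
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}
```

A sequence pasted without its '>' header line is rejected by this sketch, which is the most common way for plain sequence text to fail a FASTA-only input check.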

Note: All the examples above (for signal, sequence and protein ID) are for the nucleus. Please select 'Nucleus' as the localization from the dropdown menu when running the example searches by Protein ID or Sequence. For a search by Signal, selecting the localization is optional.

3. Interpreting the output

Signal: The localization motif (signal peptide or targeting peptide), represented as a combination of single-letter amino acid codes. It can appear either as a complete signal peptide written out in amino acids or as a regular expression that accommodates specific patterns. The signal cross-links to more information.

Origin: Indicates whether the localization motif was taken from the literature (experimentally validated) or generated through in silico mutagenesis.

Protein(s): The protein(s) in which the experimental localization motif was reported in the literature.

Localization: One of the eight subcellular locations, for both experimental and potential motifs.

Reference(s): Literature related to the detection of the localization motif (experimental signals only). This field cross-links to the PubMed citation.

Confidence Score: The score assigned to a potential signal based on its length and its frequency of occurrence (counted only once per sequence), normalized with respect to the size of the proteome for that particular localization.

Coordinate(s): The start and end positions of the signals found in a given sequence. This parameter is displayed only when searching by 'Protein Id' or 'Sequence'.
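
To make the idea of start and end positions concrete, the sketch below scans a sequence for a signal and reports where each occurrence lies. It rests on several assumptions that the text does not confirm: positions are taken to be 1-based and inclusive, lowercase 'x' is taken to mean any standard amino acid, and overlapping occurrences are all reported. The function name find_signal_coordinates is hypothetical.

```python
import re

# Assumption: 'x' in a signal means "any one of the 20 standard amino acids".
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def find_signal_coordinates(signal: str, sequence: str) -> list:
    """Return (start, end) pairs for every occurrence of signal in sequence.

    Coordinates are 1-based and inclusive (an assumption, see above).
    A lookahead wrapper lets the scan report overlapping occurrences,
    which a plain left-to-right search would skip.
    """
    body = signal.replace("x", f"[{AMINO_ACIDS}]")
    scanner = re.compile(f"(?=({body}))")
    return [
        (match.start(1) + 1, match.end(1))
        for match in scanner.finditer(sequence.upper())
    ]
```

For a variable-length pattern such as KRx{7,9}PQPKKKP, this sketch reports one match per start position (the longest one the regex engine finds first), so the end coordinate can differ between hits.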

Swissprot id(s): Swissprot IDs of the proteins that contain a given signal. For potential signals, the Swissprot IDs are derived from the non-redundant combination of two datasets: (i) proteins in Swissprot with annotated subcellular localization (ignoring annotations qualified as 'potential' or similarity-based), and (ii) manually annotated and reviewed proteins from Uniprot (i.e. the Swissprot section) assigned to distinct locations by ngLOC. For literature signals, the Swissprot IDs come solely from proteins in Swissprot with annotated subcellular localization (again ignoring annotations qualified as 'potential' or similarity-based).