1. What is LocSigDB?

LocSigDB is a manually curated database of experimentally determined localization signals for eight distinct subcellular locations, primarily in eukaryotic cells, with brief coverage of bacterial proteins. Through an extensive literature survey, we compiled a set of 533 experimentally determined localization signals, making LocSigDB the most comprehensive compendium of localization signals to date. Each signal in LocSigDB is annotated with the protein(s) in which it was reported in the literature, the exclusive subcellular location where the protein containing the targeting signal is found, PubMed references, and the UniProt IDs of all proteins that contain the signal or the same amino acid pattern.

2. Query

Users can query the database in three ways:

2.1. Signal

If users know the precise signal they want to search for, they can enter it either as a plain amino acid sequence or as a regular expression of amino acids, as shown in the examples below.

  2. TEKK[QG]KSILYDCA (either amino acid 'Q' or 'G' at the fifth position)
  3. RRRx{11}KRRK (any amino acid 'x', repeated 11 times)
  4. KRx{7,9}PQPKKKP (any amino acid 'x', repeated 7 to 9 times)
  5. Combinations of the above regular expressions

  6. [RK]x{2}Lx{1}[VY]x{2}[VI]x{1}[KR]x{3}[KR]
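The signal notation above is close to standard regular-expression syntax, so it can be tried out locally. The following is a minimal sketch, not LocSigDB's own implementation: it assumes that lowercase 'x' is the only wildcard and stands for any single amino acid, and that character classes and repeat counts behave as in Python's `re` module. The function names and the `exact` flag are hypothetical, introduced only for illustration.

```python
import re


def signal_to_regex(signal: str) -> str:
    """Translate an FAQ-style signal into a Python regular expression.

    Assumption: amino acids are written in uppercase, so every lowercase
    'x' in the signal is a wildcard meaning "any single amino acid".
    """
    return signal.replace("x", "[A-Z]")


def signal_matches(signal: str, sequence: str, exact: bool = False) -> bool:
    """Return True if the signal occurs in the sequence.

    With exact=True the whole sequence must match the signal; otherwise
    the signal may occur anywhere inside the sequence (substring search).
    """
    pattern = re.compile(signal_to_regex(signal))
    if exact:
        return pattern.fullmatch(sequence) is not None
    return pattern.search(sequence) is not None


# 'G' at the fifth position satisfies the [QG] class.
print(signal_matches("TEKK[QG]KSILYDCA", "TEKKGKSILYDCA"))
# Eleven arbitrary residues between RRR and KRRK.
print(signal_matches("RRRx{11}KRRK", "RRR" + "A" * 11 + "KRRK"))
```

The `exact` flag is only a rough analogue of restricting a search to exact matches; how the web interface applies its checkbox is not specified here.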

Location information is optional in this case. However, if the user already knows the location of the signal, supplying it will give more meaningful results. The optional checkbox 'Check to avoid substring search' limits the search to exact matches.

2.2. Protein ID:

Users can query the database using the RefSeq ID or UniProt ID/AC of the protein of interest. The database retrieves the sequence of the protein with the given ID and reports the localization signals in it (if any). An example of an acceptable RefSeq ID is shown below.


Similarly, acceptable UniProt protein IDs are:

THA_RAT (or) P63059
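Before submitting a query, it can help to check which kind of identifier one has. The sketch below is an illustration under stated assumptions, not LocSigDB's validation logic: the accession pattern follows the format UniProt documents for its accession numbers, while the entry-name and RefSeq patterns are simplified approximations. The function name `classify_protein_id` and its return labels are hypothetical.

```python
import re

# UniProt accession (AC) format as documented by UniProt.
UNIPROT_AC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Simplified UniProt entry name (ID): mnemonic, underscore, species code.
UNIPROT_ID = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")
# Simplified RefSeq protein accession: prefix, underscore, digits,
# optional ".version" suffix. Only common protein prefixes are covered.
REFSEQ = re.compile(r"(?:AP|NP|XP|YP|WP)_[0-9]+(?:\.[0-9]+)?")


def classify_protein_id(query: str) -> str:
    """Return 'refseq', 'uniprot_ac', 'uniprot_id', or 'unknown'."""
    query = query.strip()
    if REFSEQ.fullmatch(query):
        return "refseq"
    if UNIPROT_AC.fullmatch(query):
        return "uniprot_ac"
    if UNIPROT_ID.fullmatch(query):
        return "uniprot_id"
    return "unknown"


# The two UniProt examples given above.
print(classify_protein_id("THA_RAT"))
print(classify_protein_id("P63059"))
```

RefSeq is checked first because its prefix-underscore-digits shape could otherwise be confused with a UniProt entry name.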

2.3. Sequence (in FASTA format):

Users can search for localization signals in a protein of interest by submitting its sequence, but only in FASTA format. An example of the acceptable sequence format (FASTA) is shown below.


Location information is optional in all three searches. However, if the user already knows the subcellular location of the protein, supplying it will yield more appropriate results. The optional checkbox 'Check to search the pattern as a substring' can be used when the user wants the results to include signals that are part of the input pattern.
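For users preparing sequence input programmatically, FASTA text consists of a header line beginning with '>' followed by one or more lines of sequence. The sketch below shows one way to read such text before submitting it; it is a generic parser, not part of LocSigDB, and the function name `parse_fasta` and the example record are hypothetical.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into a {record_id: sequence} dictionary.

    The record ID is taken as the first whitespace-delimited word after
    '>'. Sequence lines are joined and converted to uppercase.
    """
    chunks: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            chunks[current] = []
        elif current is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks[current].append(line.upper())
    return {name: "".join(parts) for name, parts in chunks.items()}


# Hypothetical record, for illustration only (not a real protein).
example = """>example_protein hypothetical record
MATEKKGKSILYDCAQ
LLSA
"""
print(parse_fasta(example))
```

Raising an error when sequence data precedes any header mirrors the requirement that input be in FASTA format rather than a bare sequence.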

Figure 1: An outline of the query and the search result interface in LocSigDB with examples from the FAQ file itself.

LocSigDB provides three search functions to retrieve localization signal information. (A) A query by signal displays links to all the available descriptors for the signal in question; (B) a query by protein ID or protein sequence (FASTA format) retrieves the corresponding sequence from the public database. In case B, the sequence is displayed in the result interface, and the user can mouse over each signal identified in the protein to highlight its coordinates in red on the protein sequence. Each signal is in turn linked to the signal attribute window, which displays the annotations for that signal.

3. Interpreting the output

Signal: The localization motif (signal peptide or targeting peptide), represented as a combination of single-letter amino acid codes. It may be given as a complete signal peptide written only in amino acids, or as a regular expression to accommodate specific patterns. This field cross-links to more information.

Protein(s): The protein(s) in which the experimental localization motif was reported in the literature.

Localization: The subcellular location of the signal, one of the eight locations for which experimental signals were collected.

Reference(s): Literature reporting the experimental localization signal. This field cross-links to the PubMed citation.

Coordinate(s): The start and end positions of the signals found in a given sequence. This field is displayed only for searches by 'Protein ID' or 'Sequence'.
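To make the idea of signal coordinates concrete, the sketch below locates every occurrence of a pattern in a sequence and reports its start and end positions. It is an illustration only: the numbering convention (1-based, inclusive end) is an assumption, since the convention used by LocSigDB is not stated here, and the function name `find_signal_coordinates` is hypothetical. The pattern is expected in Python regex syntax, so any 'x' wildcard would first need translating to a class such as `[A-Z]`.

```python
import re


def find_signal_coordinates(signal_regex: str, sequence: str) -> list[tuple[int, int]]:
    """Return (start, end) positions of every match of the pattern.

    Assumption: positions are 1-based and the end position is inclusive.
    A lookahead is used so that overlapping occurrences are all reported.
    """
    pattern = re.compile(f"(?=({signal_regex}))")
    return [(m.start(1) + 1, m.end(1)) for m in pattern.finditer(sequence)]


# Hypothetical sequence containing one of the example signals.
print(find_signal_coordinates("TEKK[QG]KSILYDCA", "MATEKKGKSILYDCAQ"))
# → [(3, 15)]
```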

UniProt ID(s): UniProt accessions of all proteins that contain the same amino acid pattern as a given signal. Swiss-Prot accessions are presented where possible; in the few cases where a signal could not be mapped onto a Swiss-Prot annotated protein, TrEMBL accessions are reported instead. This helps achieve complete integration with UniProt.

Organism(s): The organism(s) in which each signal is found. This information is retrieved from UniProt and corresponds to the organisms of the listed UniProt accessions. Each organism is reported only once, even when more than one protein sequence (accession) from the same organism contains the signal of interest.