Frequently asked questions


What is FASTA format?

FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line always starts with a greater-than (">") symbol. For example:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
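
The layout described above can be read with a few lines of Python. This is a minimal sketch, not part of ngLOC: the function name parse_fasta is hypothetical, and the code assumes only what the answer states, namely that a description line starts with ">" and that the lines after it hold the sequence.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
"""

records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```

The two sequence lines are joined into one string, so the record length is the total number of residues, independent of how the file wraps its lines.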

What is ngLOC?

ngLOC is an n-gram-based Bayesian classifier that predicts the subcellular localization of protein sequences in both prokaryotic and eukaryotic species.

Can I run the program locally on my Windows or Mac OS computer?

Yes, you can download the source code and compile it with the compiler of your choice. In future releases of this software, we plan to provide the program as a Windows console application and a Mac OS version.

Can I run the program locally on my Linux/Unix computer?

Yes, we provide executables for Linux OS version xxxx, and also a makefile to compile the source code in other Unix-based environments. Refer to the ngLOC manual or ReadMe files to learn more.

  • Linux version 2.6.18
  • gcc version 4.1.2

How do I set up the standalone version on a Linux/Unix machine?

This is explained in the ReadMe.txt file, which comes with the source code.

How do I adjust the n-gram length?

To adjust this option, you need to download the standalone version. In the web version, the n-gram length is fixed at 6.
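
To illustrate what the n-gram length controls, the sketch below lists the n-grams of a short peptide for n = 6. It is not taken from the ngLOC source: the function name ngrams is hypothetical, and the use of overlapping windows that advance one residue at a time is an assumption about how the n-grams are extracted.

```python
def ngrams(sequence, n=6):
    """Return all overlapping substrings of length n, left to right.

    Assumption: windows advance one residue at a time. A sequence
    shorter than n yields an empty list.
    """
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


peptide = "MTEITAAMVK"  # first 10 residues of the FASTA example
for gram in ngrams(peptide, n=6):
    print(gram)
```

Under this assumption a sequence of length L yields L - n + 1 n-grams, so a larger n gives fewer and more specific n-grams per sequence.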

How do I change the species in the standalone version?

The current distribution is set up to work with animal sequences, but it is simple to switch to another species. Open the def.h file in your favorite text editor (e.g. vim, emacs or Notepad), remove the two slashes ("//") at the start of the line for the species you want to use, and add "//" at the start of the line for the species you want to deactivate. Only one species should be selected at a time. Then recompile the program and use it to make predictions for the new species.

What is the minimum length for my query sequence?

The minimum sequence length accepted by the web version is 10 amino acids, but you can change this in the standalone version if necessary.

Why do I get fewer sequences than I entered for prediction?

By default, the program discards sequences that are shorter than 10 amino acids.
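
To see in advance which sequences would be dropped, you can apply the same length rule to your input. The sketch below is not part of ngLOC: the names MIN_LENGTH and split_by_length and the sample records are hypothetical, and only the 10-residue threshold comes from the answer above.

```python
MIN_LENGTH = 10  # default minimum stated in this FAQ


def split_by_length(records, min_length=MIN_LENGTH):
    """Split (name, sequence) pairs into kept and dropped lists.

    Sequences shorter than min_length are treated as dropped.
    """
    kept, dropped = [], []
    for name, seq in records:
        if len(seq) >= min_length:
            kept.append((name, seq))
        else:
            dropped.append((name, seq))
    return kept, dropped


# Made-up records for illustration only.
records = [
    ("seq_a", "MTEITAAMVKELRESTGAG"),
    ("seq_b", "MKV"),
    ("seq_c", "MTEITAAMVK"),
]

kept, dropped = split_by_length(records)
print("kept:", [name for name, _ in kept])
print("dropped:", [name for name, _ in dropped])
```

A sequence of exactly 10 residues is kept, since only sequences shorter than the minimum are discarded.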

How do I interpret the output of ngLOC?

ngLOC results are printed in a structured format so users can load them directly into an Excel spreadsheet for further processing. There are 9 columns in the output file. You can find the descriptions for these columns below.

  • Column 1: Input sequence ID. Only the first word, up to a maximum of 30 characters, is printed.
  • Column 2: Location(s) predicted by ngLOC. Some sequences can have two locations predicted. In those cases, the two location codes are separated by a slash.
  • Columns 3, 5 and 7: Pred1-3: These columns hold the top three predictions in descending order.
  • Columns 4, 6 and 8: Prob1-3: These columns hold the probabilities (times 100) of the top three predictions in descending order. The probabilities of all predicted locations (not just the top three) sum to one, i.e. 100 on this scale.
  • Column 9: MLCS is the multi-localized confidence score. A sequence is assigned to two locations only when its MLCS is greater than or equal to 60.
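
The nine columns above can also be read programmatically. The following is a minimal sketch under stated assumptions: the output is taken to be tab-delimited (the delimiter is not specified here, so adjust it to match your file), the names Prediction and parse_ngloc_output are hypothetical, and the sample rows are made up for illustration rather than real ngLOC output.

```python
import csv
import io
from typing import NamedTuple


class Prediction(NamedTuple):
    seq_id: str      # column 1
    locations: list  # column 2, split on "/"
    top3: list       # columns 3-8 as [(code, probability_times_100), ...]
    mlcs: float      # column 9


def parse_ngloc_output(text, delimiter="\t"):
    """Parse 9-column result lines; the delimiter is an assumption."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(fields) != 9:
            continue  # skip blank or malformed lines
        try:
            top3 = [(fields[i], float(fields[i + 1])) for i in (2, 4, 6)]
            mlcs = float(fields[8])
        except ValueError:
            continue  # skip lines with non-numeric fields, e.g. a header
        rows.append(Prediction(fields[0], fields[1].split("/"), top3, mlcs))
    return rows


# Made-up rows for illustration only.
sample = (
    "ID\tLoc\tPred1\tProb1\tPred2\tProb2\tPred3\tProb3\tMLCS\n"
    "seq1\tCYT/NUC\tCYT\t48.0\tNUC\t42.0\tMIT\t5.0\t71.5\n"
    "seq2\tEXC\tEXC\t90.0\tPLA\t6.0\tGOL\t2.0\t12.3\n"
)

for row in parse_ngloc_output(sample):
    print(row.seq_id, row.locations, row.top3[0], row.mlcs)
```

Splitting column 2 on the slash gives a list with one or two location codes, which makes it easy to select multi-localized sequences afterwards.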

What do the three-letter codes stand for in the prediction results?

Each set of three-letter codes applies only to a given species. Below, you can find the codes and their corresponding locations by species.

  • Animal datasets
    • CYT - Cytoplasm
    • CSK - Cytoskeleton
    • END - Endoplasmic Reticulum
    • EXC - Extracellular or Secreted
    • GOL - Golgi
    • JNC - Junction
    • LYS - Lysosome
    • MIT - Mitochondria
    • NUC - Nucleus
    • POX - Peroxisome
    • PLA - Plasma Membrane
  • Plant datasets
    • CHL - Chloroplast
    • CYT - Cytoplasm
    • CSK - Cytoskeleton
    • END - Endoplasmic Reticulum
    • EXC - Extracellular or Secreted
    • GOL - Golgi
    • MIT - Mitochondria
    • NUC - Nucleus
    • POX - Peroxisome
    • PLA - Plasma Membrane
    • VAC - Vacuole
  • Gram-negative bacterial datasets
    • CYT - Cytoplasm
    • EXC - Extracellular or secreted
    • IMB - Inner membrane
    • OMB - Outer membrane
    • PER - Periplasm
  • Gram-positive bacterial datasets
    • CYT - Cytoplasm
    • EXC - Extracellular or secreted
    • IMB - Inner membrane
    • WAL - Cell wall
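
For scripted post-processing, the code tables above can be kept in a lookup structure. This is a sketch only: the dictionary keys ("animal", "plant", "gram_negative", "gram_positive") and the function name describe are hypothetical labels chosen for this example, while the codes and location names are copied from the lists above.

```python
LOCATION_CODES = {
    "animal": {
        "CYT": "Cytoplasm", "CSK": "Cytoskeleton",
        "END": "Endoplasmic Reticulum", "EXC": "Extracellular or Secreted",
        "GOL": "Golgi", "JNC": "Junction", "LYS": "Lysosome",
        "MIT": "Mitochondria", "NUC": "Nucleus", "POX": "Peroxisome",
        "PLA": "Plasma Membrane",
    },
    "plant": {
        "CHL": "Chloroplast", "CYT": "Cytoplasm", "CSK": "Cytoskeleton",
        "END": "Endoplasmic Reticulum", "EXC": "Extracellular or Secreted",
        "GOL": "Golgi", "MIT": "Mitochondria", "NUC": "Nucleus",
        "POX": "Peroxisome", "PLA": "Plasma Membrane", "VAC": "Vacuole",
    },
    "gram_negative": {
        "CYT": "Cytoplasm", "EXC": "Extracellular or Secreted",
        "IMB": "Inner membrane", "OMB": "Outer membrane", "PER": "Periplasm",
    },
    "gram_positive": {
        "CYT": "Cytoplasm", "EXC": "Extracellular or Secreted",
        "IMB": "Inner membrane", "WAL": "Cell wall",
    },
}


def describe(prediction, dataset):
    """Translate a prediction such as "CYT/NUC" into location names.

    Raises KeyError if a code does not belong to the chosen dataset.
    """
    codes = LOCATION_CODES[dataset]
    return [codes[code] for code in prediction.split("/")]


print(describe("CYT/NUC", "animal"))
print(describe("VAC", "plant"))
```

Keeping one table per dataset means a code that is not defined for the selected species raises an error instead of being silently mistranslated.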

What is MLCS in the prediction table?

MLCS stands for multi-localization confidence score. It is the likelihood that the sequence is present in more than one subcellular location in the cell.
