Categorizer


Mirror sites

Asia server: http://ssbio.cau.ac.kr/software/categorizer
North America server: https://gsponerlab.msl.ubc.ca/software/categorizer/

Citation

Categorizer is published in BMC Genomics 2014, 15:1091.

Contact

For any inquiries, please contact Dokyun Na (dna@ssbio.cau.ac.kr) or Joerg Gsponer (gsponer@msl.ubc.ca).

Download

You can download Categorizer from our GitHub repository: https://github.com/ubc-msl/categorizer

Please do not extract the zipped files into a folder with non-alphabetic characters.


Introduction

Categorizer v1.0 is a tool to classify genes into user-defined groups (categories) based on Gene Ontology (GO) annotations and their semantic similarities. Most GO-based analysis tools are designed to identify enrichments of individual GO terms in a set of genes, and they frequently output lists of redundant or highly specific GO terms that can be difficult to interpret. Categorizer instead assigns genes to user-defined categories and calculates p-values for the enrichment of each category. It takes advantage of the hierarchical structure of GO annotations and the semantic similarity between GO terms for a reliable categorization. Categorizer will help experimental and computational biologists analyze genomic and proteomic data according to their specific interests.

For detailed information on semantic similarities, refer to the supplementary information of our paper and the following articles:

Lord PW, Stevens RD, Brass A, Goble CA. 2003. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19:1275-1283

Wu X, Pang E, Lin K, Pei Z-M. 2013. Improving the measurement of semantic similarity between gene ontology terms and gene products: insights from an edge- and IC-based hybrid method. PLoS ONE 8:e66745


Installation

Categorizer is implemented in Python, a platform-independent programming language. The program can be run on any operating system where Python and the required libraries are installed. For users' convenience, we provide a pre-compiled version of Categorizer that runs on the Windows operating system. For other operating systems, the program must be run from the source code.

For Windows users, download the three tools below:

Categorizer with GUI: Categorizer with a graphical user interface.
Categorizer without GUI: A command-line Categorizer tool.
Rebuild: A tool for building indexes for semantic similarity scores.

Please note that the compiled version uses only a single core due to a compiling issue. If you need to categorize thousands of genes/proteins, we recommend running from the source code, which supports the use of multiple cores. All versions and files are now hosted on GitHub.

For those who want to run from the source code, please download the source files from the GitHub repository.

In order to run Categorizer from the source code, you need to install the following software:

Python 2.7 or higher
Numpy 1.8.1 or higher
Scipy 0.13.0 or higher
matplotlib 1.3.1 or higher
wxPython 3.0 or higher
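
Before running from the source code, it can help to confirm that the required libraries are importable. The following is a minimal sketch and is not part of Categorizer. The helper name missing_modules is hypothetical, the script is written for Python 3, and it checks only that each library is present, not that it meets the minimum version listed above.

```python
import importlib.util

# Import names of the libraries listed above (wxPython is imported as "wx").
REQUIRED_MODULES = ["numpy", "scipy", "matplotlib", "wx"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names in `names` that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing libraries: " + ", ".join(missing))
    else:
        print("All required libraries were found.")
```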

If you are not familiar with Python or installing libraries, we recommend installing “Enthought Canopy” (the free and academic versions are sufficient), a Python distribution containing many scientific libraries including those listed above.

The zipped file contains the following tools:

APP_Categorizer.py: GUI version of Categorizer
Categorizer.py: command-line version of Categorizer
rebuild.py: a tool to build indexes for semantic similarity scores from your data.
GOTermSeeker.py: a simple tool to search for GO terms containing a certain keyword. This tool may be helpful when creating categories.


Execution

Categorizer is shipped with example files that can be found in the ./data folder:

example_categories.txt
example_gene_association.fb
example_genes.txt
example_background_genes.txt

Categorizer (GUI version)

Please note that the compiled version supports use of only a single core, while running from the source code supports use of multiple cores.

Run the pre-compiled version

For those who downloaded the pre-compiled Windows version of Categorizer, double-click CategorizerGUI.exe.

[Figure: main_win_small]

Run from the source code

Run the source code named APP_Categorizer.py.

python APP_Categorizer.py

[Figure: cat_exe]

To run a categorization, you need to provide at least three files (highlighted in yellow): a category file, a gene annotation file, and a gene list file. A background gene list file is optional.

  • Step 1 Category file

The category file contains a list of biological categories and the GO terms belonging to each category. We provide three category files that we expect to be commonly used: biological processes, enzyme classification, and cellular localization. For instance, the biological processes file (biological_processes.txt in the ./data folder) contains 27 categories, which can be copied into a custom category file. If you want to create your own categories, please see Category file format for more detailed information.

Click on the button below “Step 1”; a window used to select a category file will show up. After loading, a list of categories and the number of GO terms belonging to each category are shown.

  • Step 2 Annotation file

The annotation file contains gene and protein IDs, their names, and related GO terms. You can download a variety of annotation files from GeneOntology. Categorizer is shipped with a Drosophila gene annotation file, which can be downloaded from FlyBase or GeneOntology.

[Figure: annotation_file_format]

Categorizer reads all the columns, but uses only the three marked columns: IDs and names (green), and GO terms (orange).

  • Step 3 Gene list file

This file contains a list of gene identifiers. Categorizer loads both gene IDs and gene names (please see the green columns in the above figure), so either identifier may be used.

  • Step 4 Background gene list file (optional)

Categorizer provides an enrichment analysis function that determines which categories are significantly enriched for a given set of genes. For this analysis, a background list of genes is required, which could be a whole genome or a set of genes.
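
To illustrate what an enrichment p-value measures, here is a minimal sketch based on the hypergeometric test, a common choice for this kind of analysis. This document does not state which statistical test Categorizer uses, so the test itself, the function name enrichment_p_value, and the numbers in the usage example are all assumptions made for illustration.

```python
from math import comb


def enrichment_p_value(n_background, n_category_background,
                       n_selected, n_category_selected):
    """Upper-tail hypergeometric probability.

    Probability of seeing at least `n_category_selected` genes of a category
    among `n_selected` genes drawn from a background of `n_background` genes,
    of which `n_category_background` belong to the category.
    """
    total = comb(n_background, n_selected)
    upper = min(n_category_background, n_selected)
    tail = sum(
        comb(n_category_background, k)
        * comb(n_background - n_category_background, n_selected - k)
        for k in range(n_category_selected, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 20 of 100 selected genes fall into a category
    # that holds 500 of 10000 background genes.
    print(enrichment_p_value(10000, 500, 100, 20))
```

A small p-value indicates that the category holds more of the selected genes than would be expected from the background alone.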

  • Step 5 Options

A gene or protein may belong to one biological process or to several. For example, the protein p53 belongs to both the signaling process category and the transcription category. Categorizer can classify a gene/protein either into the single category with the highest similarity score or into multiple categories with a similarity score over the user-defined cutoff.

Categorizer is able to utilize multiple cores. This feature is disabled in the pre-compiled version by default due to a compiling issue.
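
The single-category and multiple-category modes described above can be sketched as follows. This is an illustration, not Categorizer's code: the function name assign_categories and the scores are hypothetical, and treating a score equal to the cutoff as passing is an assumption, since the text only says "over the cutoff".

```python
def assign_categories(scores, cutoff, multiple=False):
    """Pick categories for one gene from its per-category similarity scores.

    scores   -- dict mapping category name to similarity score
    cutoff   -- minimum score a category needs to be reported
    multiple -- if True, return every passing category; otherwise only the best
    """
    # Assumption: a score equal to the cutoff counts as passing.
    passing = [(name, score) for name, score in scores.items() if score >= cutoff]
    passing.sort(key=lambda item: item[1], reverse=True)
    return passing if multiple else passing[:1]


if __name__ == "__main__":
    # Hypothetical scores for a p53-like protein.
    example = {"Signaling": 0.8, "Transcription": 0.5, "Metabolism": 0.1}
    print(assign_categories(example, 0.3, multiple=True))
    print(assign_categories(example, 0.3, multiple=False))
```

A gene with no category at or above the cutoff gets an empty list, which corresponds to an uncategorized gene.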

  • Step 6 RUN!

Clicking on the ‘RUN’ button will begin the computations. Categorizer loads all required files and categorizes user-entered genes. When complete, a window with results will show up.

  • Results

[Figure: result_win]
Upon completion of categorization, a window like the one in the figure above appears.

  1. Category statistics (Left)

    On the left side, categorization statistics are shown as a pie chart and a table.
    In this figure, the metabolism category is the largest one, while protein folding is the smallest. There are also uncategorized genes, which could be due to a lack of annotation information for those genes.

  2. Categorization result (Middle)

    In the middle of the result window, a list of all entered genes and their categories with a similarity score are shown. In this example Categorizer was allowed to classify genes into multiple categories with a cutoff value of 0.3; thus some of the genes in the figure belong to more than one category.

  3. Enrichment analysis result (Right)

    If background genes are entered, the enrichment analysis result will be shown. Statistical enrichment is expressed as a p-value, and the log10(p-value) is shown in the graph. Dark red represents a significantly enriched category. The lower bound of the graph can be adjusted by moving the slider bar up and down and then clicking on the Redraw button.

  4. Save

    You can save graphs and result tables from the menu: Menu > Save results.

Categorizer (non-GUI version)

Please note that this non-GUI version does not support enrichment analysis.

 

Run the pre-compiled version

Open a DOS terminal, change to the directory where the Categorizer.exe file is, and enter the command below. Please note that this compiled version supports only a single core.

Categorizer.exe -d [category file] -a [annotation file] -i [gene list file] -m/-s [cutoff]

Example:

Categorizer.exe -d .\data\example_categories.txt -a .\data\example_gene_association.fb -i .\data\example_genes.txt -m 0.3

 

Run from the source code

python Categorizer.py -d [category file] -a [annotation file] -i [gene list file] -m/-s [cutoff] -cpu [integer]  

Example:

python Categorizer.py -d ./data/example_categories.txt -a ./data/example_gene_association.fb -i ./data/example_genes.txt -m 0.3 -cpu 3

Parameters

  • -d [category file]: Category file. Please see Category file format.
  • -a [annotation file]: Annotation file. This file can be downloaded from http://www.geneontology.org/GO.downloads.annotations.shtml or created using the file format described below.
  • -i [gene list file]: Input file. This file contains a list of gene IDs or names.
  • -m [cutoff]: When this option is specified, Categorizer may classify one gene/protein into multiple categories with a semantic similarity score over a cutoff value (0<cutoff<=1).
  • -s [cutoff]: When this option is specified, Categorizer classifies one gene/protein into only the category which has the highest similarity score and which is over the specified threshold.
    Note: -m and -s are mutually exclusive.
  • -cpu [integer]: Specifies number of cores to be used (default=1). This is disabled in the pre-compiled version.
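
For batch runs, the command line can be assembled from Python. The sketch below only builds the argument list from the parameters documented above. The function name build_command is hypothetical, and applying the 0<cutoff<=1 range to -s as well as -m is an assumption.

```python
def build_command(category_file, annotation_file, gene_list_file, cutoff,
                  multiple=True, cpu=1, python="python", script="Categorizer.py"):
    """Return the Categorizer command line as a list of arguments."""
    if not 0 < cutoff <= 1:
        raise ValueError("cutoff must satisfy 0 < cutoff <= 1")
    # -m and -s are mutually exclusive, so exactly one of them is emitted.
    mode_flag = "-m" if multiple else "-s"
    return [
        python, script,
        "-d", category_file,
        "-a", annotation_file,
        "-i", gene_list_file,
        mode_flag, str(cutoff),
        "-cpu", str(cpu),
    ]


if __name__ == "__main__":
    command = build_command(
        "./data/example_categories.txt",
        "./data/example_gene_association.fb",
        "./data/example_genes.txt",
        0.3,
        cpu=3,
    )
    print(" ".join(command))
```

The returned list can be passed to subprocess.run to start the categorization from a script.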

Additional information

Category file format

The current version of Categorizer contains three example category files: biological processes, cellular localizations, and enzyme functions. However, Categorizer allows users to define their own categories.

Example category file

[Figure: def_file_format]

The format of a category file is quite simple. Determine a category name and add GO term IDs that belong to the category. In this example file, the Cell cycle category has four GO terms related to cell cycle. # can be used for comments.

Simply put, if a gene is annotated with one of the four terms, say “GO:0000910” (cytokinesis), it will be categorized into Cell cycle. If a gene has an annotation that is merely close to the terms defined for Cell cycle, Categorizer calculates pairwise semantic similarity scores between the gene’s GO term and the four defined GO terms, and takes the maximal score. If the score obtained for Cell cycle is larger than those obtained for the other categories, the gene will be classified into the Cell cycle category. If multiple categories are allowed, the gene will be classified into every category scoring above the selected cutoff.
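
The scoring rule described above (take the maximal pairwise similarity per category) can be sketched as follows. This is not Categorizer's code. The similarity function is passed in as a parameter, and the toy_similarity table, its values, and the two example categories are made up for illustration. Taking the maximum over all of a gene's GO terms when it has several is also an assumption.

```python
def category_scores(gene_terms, categories, similarity):
    """Score each category by the best pairwise similarity to the gene's terms.

    gene_terms -- GO term IDs annotated to the gene
    categories -- dict mapping category name to its list of GO term IDs
    similarity -- function (term_a, term_b) -> score between 0 and 1
    """
    scores = {}
    for name, category_terms in categories.items():
        scores[name] = max(
            (similarity(g, c) for g in gene_terms for c in category_terms),
            default=0.0,
        )
    return scores


# Made-up similarity values, for illustration only.
TOY_TABLE = {("GO:0000278", "GO:0007049"): 0.8}


def toy_similarity(term_a, term_b):
    """Identical terms score 1.0; other pairs are looked up in TOY_TABLE."""
    if term_a == term_b:
        return 1.0
    return TOY_TABLE.get((term_a, term_b), TOY_TABLE.get((term_b, term_a), 0.0))


# Hypothetical category definitions.
TOY_CATEGORIES = {
    "Cell cycle": ["GO:0000910", "GO:0007049"],
    "Translation": ["GO:0006412"],
}

if __name__ == "__main__":
    print(category_scores(["GO:0000278"], TOY_CATEGORIES, toy_similarity))
    print(category_scores(["GO:0000910"], TOY_CATEGORIES, toy_similarity))
```

A gene annotated directly with one of a category's terms gets the top score for that category under this toy function, matching the direct-match case described above.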

If you have trouble finding the proper GO terms, please use GOTermSeeker.py to find GO terms that contain a keyword of interest.

python GOTermSeeker.py [ontology file] [keyword] [output file name]

Example:

python GOTermSeeker.py ./data/gene_ontology_ext.obo "cell cycle" ./cell_cycle.txt
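
To show the kind of search involved, here is a minimal sketch that scans an OBO ontology file for terms whose name contains a keyword. It is not the GOTermSeeker.py implementation: the function name find_terms is hypothetical, and matching only the name field, case-insensitively, is an assumption.

```python
def find_terms(obo_lines, keyword):
    """Return (term ID, name) pairs for [Term] stanzas whose name contains keyword."""
    keyword = keyword.lower()
    matches = []
    in_term = False
    term_id = None
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas describe GO terms.
            in_term = (line == "[Term]")
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[len("id: "):]
        elif in_term and line.startswith("name: ") and term_id is not None:
            name = line[len("name: "):]
            if keyword in name.lower():
                matches.append((term_id, name))
    return matches


if __name__ == "__main__":
    with open("./data/gene_ontology_ext.obo") as handle:
        for term_id, name in find_terms(handle, "cell cycle"):
            print(term_id + "\t" + name)
```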

 

Annotation file format

The annotation file contains gene IDs and names, and their annotated GO IDs. Species-specific annotation files as well as integrated annotation files like UniProt can be downloaded from GeneOntology.

[Figure: annotation_file_format]

The annotation file format is as above. Categorizer reads all the information in the file but uses only the three marked columns: the second and third columns for gene/protein IDs and names, and the fifth column for GO terms. Thus, you can enter either a gene/protein ID or a name into Categorizer.

When annotation files are created, these three columns must be provided.
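
The three columns described above can be read with a few lines of Python. The sketch below is an illustration rather than Categorizer's own loader. It assumes tab-separated columns and comment lines starting with "!", as in the annotation files distributed by GeneOntology, and the function name load_annotations is hypothetical.

```python
from collections import defaultdict


def load_annotations(lines):
    """Map each gene ID and gene name to the set of GO terms annotated to it."""
    terms_by_gene = defaultdict(set)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and header comments
        columns = line.split("\t")
        if len(columns) < 5:
            continue  # the ID, name, and GO term columns are all required
        gene_id, gene_name, go_id = columns[1], columns[2], columns[4]
        # Index by both ID and name so either identifier can be looked up.
        terms_by_gene[gene_id].add(go_id)
        terms_by_gene[gene_name].add(go_id)
    return dict(terms_by_gene)


if __name__ == "__main__":
    with open("./data/example_gene_association.fb") as handle:
        annotations = load_annotations(handle)
    print(len(annotations), "identifiers loaded")
```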

 

Rebuilding semantic similarity scores with your data

Categorizer employs an algorithm to calculate semantic similarity scores from the occurrence of GO terms in the GO annotations of UniProt proteins compiled in 2013. For detailed information, please see the Introduction section and cited articles.
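
As a rough illustration of how term occurrence can be turned into a similarity score, the sketch below computes information content from annotation counts and combines it with Lin's measure, a widely used information-content-based similarity that yields scores between 0 and 1. This is an assumption made for illustration; the formula Categorizer actually uses is described in the paper and the cited articles. The toy ontology, its counts, and all function names are hypothetical.

```python
import math


def information_content(term_counts, total):
    """IC(t) = -log(count(t) / total); counts should include descendant terms."""
    return {t: -math.log(c / total) for t, c in term_counts.items() if c > 0}


def lin_similarity(term_a, term_b, ic, ancestors):
    """Lin's measure: 2 * IC(most informative common ancestor) / (IC(a) + IC(b))."""
    common = (ancestors[term_a] | {term_a}) & (ancestors[term_b] | {term_b})
    denominator = ic.get(term_a, 0.0) + ic.get(term_b, 0.0)
    if not common or denominator <= 0:
        return 0.0
    best_common = max(ic.get(t, 0.0) for t in common)
    return 2.0 * best_common / denominator


# Toy ontology: root -> A -> {B, C}, with made-up annotation counts.
TOY_COUNTS = {"root": 100, "A": 20, "B": 10, "C": 10}
TOY_ANCESTORS = {
    "root": set(),
    "A": {"root"},
    "B": {"A", "root"},
    "C": {"A", "root"},
}
TOY_IC = information_content(TOY_COUNTS, total=100)

if __name__ == "__main__":
    print(lin_similarity("B", "C", TOY_IC, TOY_ANCESTORS))
    print(lin_similarity("B", "B", TOY_IC, TOY_ANCESTORS))
```

Rare terms carry more information than frequent ones, so two terms that share a specific, rarely used ancestor score higher than two terms that only share the root.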

We built indexes to speed up the calculation. If you need to use a different dataset, for example UniProt without IEA annotations or human genes only, you must rebuild the indexes using rebuild.exe or rebuild.py.

Run the pre-compiled version

Open a DOS terminal, change to the directory where the rebuild.exe file is, and enter the command below.

rebuild.exe [annotation file] [ontology file]

Example:

rebuild.exe .\data\gene_association.goa_uniprot_noiea.txt .\data\gene_ontology_ext.obo

Run from the source code

python rebuild.py [annotation file] [ontology file]

Example:

python rebuild.py ./data/gene_association.goa_uniprot_noiea.txt ./data/gene_ontology_ext.obo

When done, two index files (go_index.txt and go_prob.txt) will be created. Copy these two files into the Categorizer folder. Categorizer loads these index files automatically, so no further steps are needed once they are copied.