MoRFchibi

OVERVIEW

MoRFchibi SYSTEM (Malhis et al 2016) is a computational approach for fast and accurate prediction of MoRFs in protein sequences. It includes three different predictors:

  1. MoRFCHiBi is most appropriate as a component predictor in other applications.
  2. MoRFCHiBi_Light is most appropriate for high-throughput MoRF predictions. It is also the most accurate in targeting longer MoRFs (with more than 30 residues).
  3. MoRFCHiBi_Web is most appropriate for high-accuracy MoRF predictions. MoRFCHiBi_Web outperforms previously developed predictors by producing less than half their false positive rate at the same true positive rate, at any practical threshold value.

BENCHMARKING

MoRFchibi SYSTEM predictors were evaluated on three test datasets. They outperformed previously developed predictors, achieving more than twice their precision, and they also have relatively high processing speed. For more details, please see the BENCHMARKING page.

DATASETS

  • Two training sets:
    • TRAINING_HT (_HT; for high-throughput collection) with 421 sequences collected by (Disfani et al 2012) is used to train MoRFchibi SYSTEM predictors.
    • TRAINING_SY (_SY; for synthetic) with 4000 synthetic sequences. Each includes a MoRF region (10 to 20 amino acids) between two flanking regions (8 amino acids each), and a section that represents a general protein region (Other). The amino acid composition of each region was derived from the composition of the corresponding regions in TRAINING_HT; that is, each residue in each region R ∈ {MoRF, Flanks, Other} in TRAINING_SY is chosen by selecting uniformly at random a residue from the collection of all R regions in TRAINING_HT. Therefore, sequences in TRAINING_SY have no direct similarity to those in TRAINING_HT, but they do contain all the composition contrast information (Malhis and Gsponer 2015).
  • Three test sets that share up to 30% identity with the training data are used in evaluation:
    • TEST_EXP53 is our main test set. It contains 53 non-redundant sequences (up to 30% identity to each other) with a total of 25,186 residues, including 2,432 MoRF residues. Of these MoRF residues, 729 come from sections of up to 30 residues that were identified as MoRFs based on one or more PDB structures, and 1,703 come from sections longer than 30 residues. TEST_EXP53 sequences have been experimentally validated for their disordered properties in isolation. The annotation of TEST_EXP53 sequences is not limited to one MoRF section per sequence. For more on TEST_EXP53, please see (Malhis et al 2015).
    • TEST_HT has 464 sequences with a total of 296,362 residues, including 5,779 MoRF residues. It is composed of two sets, ‘Test dataset’ and ‘Test 2012’, both collected by (Disfani et al 2012). Each TEST_HT sequence is annotated with only a single MoRF section of 5 to 25 residues.
    • TEST_EXP9 was collected by (Jones and Cozzetto 2015) and consists of 9 sequences with a total of 2,209 residues, including 12 MoRF regions with 163 MoRF residues. This test set is used to compare the prediction quality with MFSPSSMpred and DISOPRED3.
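
The composition-based sampling described for TRAINING_SY can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' generator: the function names, the toy residue pools, the length of the Other section (50 here) and its placement after the right flank are all hypothetical, since the text above specifies only the MoRF size range, the flank size, and the uniform per-region sampling rule.

```python
import random

FLANK_LEN = 8              # flank size stated in the dataset description
MORF_LEN_RANGE = (10, 20)  # MoRF size range stated in the dataset description


def sample_region(pool, length, rng):
    """Draw `length` residues uniformly, with replacement, from `pool`.

    `pool` is a string holding every residue of one region type, so each
    amino acid is drawn with its empirical frequency in that region type.
    """
    return "".join(rng.choice(pool) for _ in range(length))


def synthetic_sequence(pools, rng, other_len=50):
    """Return (sequence, (morf_start, morf_end)) with an exclusive end index."""
    morf_len = rng.randint(*MORF_LEN_RANGE)  # randint is inclusive on both ends
    parts = [
        sample_region(pools["Flanks"], FLANK_LEN, rng),
        sample_region(pools["MoRF"], morf_len, rng),
        sample_region(pools["Flanks"], FLANK_LEN, rng),
        # Assumption: length and position of the Other section are not
        # given in the text; both are placeholders here.
        sample_region(pools["Other"], other_len, rng),
    ]
    return "".join(parts), (FLANK_LEN, FLANK_LEN + morf_len)


# Toy pools standing in for the residues pooled from TRAINING_HT regions.
toy_pools = {"MoRF": "LLIWFLM", "Flanks": "PPSEEK", "Other": "GASTKDNQ"}
rng = random.Random(0)
seq, (start, end) = synthetic_sequence(toy_pools, rng)
```

Because every residue is drawn independently from a pooled composition, a sequence produced this way keeps the per-region amino acid frequencies while sharing no ordered similarity with any real sequence, which is the property the description above relies on.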

OUTPUT

MoRFchibi SYSTEM predictors generate, for each residue, a propensity score for being a MoRF residue. If a cut-off value is needed, we suggest a value around 0.725 for both MoRFCHiBi_Web and MoRFCHiBi_Light. At this cut-off, MoRFCHiBi_Web has a (TPR, FPR) of (0.567, 0.077) on TEST_EXP53 and (0.407, 0.057) on TEST_HT.
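
As a minimal sketch of how such a cut-off might be applied, the Python below turns per-residue scores into binary calls and computes TPR and FPR against known annotations. The function names and the toy scores are hypothetical, and treating a score equal to the cut-off as a positive call is an assumption; the text above does not say whether the comparison is strict.

```python
def call_morf_residues(scores, cutoff=0.725):
    """Flag residues whose propensity score reaches the cut-off.

    Assumption: scores equal to the cut-off count as MoRF calls.
    """
    return [s >= cutoff for s in scores]


def tpr_fpr(predicted, labels):
    """Return (TPR, FPR); assumes both classes occur in `labels`."""
    tp = sum(1 for p, l in zip(predicted, labels) if p and l)
    fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
    n_pos = sum(1 for l in labels if l)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg


def precision_from_rates(tpr, fpr, n_pos, n_neg):
    """Precision implied by a (TPR, FPR) pair and the class sizes."""
    tp = tpr * n_pos
    fp = fpr * n_neg
    return tp / (tp + fp)


# Toy example: six residues, three of them annotated as MoRF (label 1).
toy_scores = [0.90, 0.80, 0.30, 0.73, 0.10, 0.50]
toy_labels = [1, 1, 0, 0, 0, 1]
toy_calls = call_morf_residues(toy_scores)
toy_tpr, toy_fpr = tpr_fpr(toy_calls, toy_labels)

# Derived (not reported) precision on TEST_EXP53 from the figures above:
# 2,432 MoRF residues out of 25,186 residues in total.
exp53_precision = precision_from_rates(0.567, 0.077, 2432, 25186 - 2432)
```

With the TEST_EXP53 residue counts listed under DATASETS, the stated (TPR, FPR) pair works out to a precision of roughly 0.44. That figure is derived here from the numbers on this page and is not a value reported by the authors.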

REFERENCES

Disfani FM, Hsu WL, Mizianty MJ, Oldfield CJ, Xue B, Dunker AK, et al. MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins. Bioinformatics, 2012 Jun 15; 28 (12): i75–i83. doi: 10.1093/bioinformatics/bts209. pmid:22689782.

Jones DT and Cozzetto D. DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics, 2015 Mar 15; 31 (6): 857–863. doi: 10.1093/bioinformatics/btu744. pmid:25391399.

Malhis N and Gsponer J. Computational Identification of MoRFs in Protein Sequences. Bioinformatics (2015) 31 (11): 1738–1744. pmid: 25637562.

Malhis N, Wong TCE, Nassar R, Gsponer J. Computational Identification of MoRFs in Protein Sequences Using Hierarchical Application of Bayes Rule. PLOS ONE (2015), DOI: 10.1371/journal.pone.0141603. pmid: 26517836.

Malhis N, Jacobson M, and Gsponer J. MoRFchibi SYSTEM: Software Tools for the Identification of MoRFs in Protein Sequences. Nucleic Acids Research (2016), doi: 10.1093/nar/gkw409. pmid: 27174932.

Contact Nawar Malhis: nmalhis@msl.ubc.ca