Authors: Francy Liliana Camacho
This study demonstrates the importance of obtaining statistically stable results when using machine learning methods to predict the activity of antimicrobial peptides in cases where, due to the cost and complexity of the chemical processes involved, datasets are particularly small (fewer than a few hundred instances). As in other fields with similar problems, this results in large variability in the performance of predictive models, hindering any attempt to transfer them to lab practice. Rather than targeting good peak performance obtained from very particular experimental setups, as reported in related literature, we focused on characterizing the behavior of the machine learning methods as a preliminary step toward obtaining reproducible results across experimental setups and, ultimately, good performance. We propose a methodology that integrates feature learning (autoencoders) and selection methods (genetic algorithms) through the exhaustive use of performance metrics (permutation tests and bootstrapping), which provide stronger statistical evidence to support investment decisions with the lab resources at hand. We show evidence for the usefulness of 1) the extensive use of computational resources, and 2) adopting a wider range of metrics than those reported in the literature to assess method performance. This approach allowed us to guide our search for suitable machine learning methods and to obtain results comparable to those in the literature with strong statistical stability.
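To make the role of permutation tests and bootstrapping on small datasets more concrete, the following is a minimal sketch and not the pipeline used in the study. It assumes a single synthetic descriptor, a plain least-squares line in place of the autoencoder and genetic-algorithm stages, and RMSE as the metric; all function names, the data, and the parameter values are hypothetical choices made for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b for a single descriptor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate sample: fall back to predicting the mean
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def loo_rmse(xs, ys):
    """Leave-one-out cross-validated RMSE of the line model."""
    sq_errors = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        sq_errors.append((ys[i] - (a * xs[i] + b)) ** 2)
    return statistics.fmean(sq_errors) ** 0.5


def permutation_p_value(xs, ys, n_perm=200, seed=0):
    """Fraction of label shufflings that score at least as well as the real labels."""
    rng = random.Random(seed)
    observed = loo_rmse(xs, ys)
    hits = 0
    for _ in range(n_perm):
        shuffled = ys[:]
        rng.shuffle(shuffled)  # break the descriptor/activity link
        if loo_rmse(xs, shuffled) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


def bootstrap_oob_interval(xs, ys, n_boot=200, alpha=0.05, seed=0):
    """Percentile interval of RMSE measured on out-of-bag points."""
    rng = random.Random(seed)
    n = len(xs)
    scores = []
    for _ in range(n_boot):
        drawn = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        held_out = sorted(set(range(n)) - set(drawn))
        if not held_out:
            continue
        a, b = fit_line([xs[i] for i in drawn], [ys[i] for i in drawn])
        sq = [(ys[i] - (a * xs[i] + b)) ** 2 for i in held_out]
        scores.append(statistics.fmean(sq) ** 0.5)
    scores.sort()
    lo = scores[int(alpha / 2 * len(scores))]
    hi = scores[min(len(scores) - 1, int((1 - alpha / 2) * len(scores)))]
    return lo, hi


# Synthetic stand-in for a small peptide dataset (30 instances, one descriptor).
rng = random.Random(42)
descriptor = [i / 10 for i in range(30)]
activity = [2.0 * x + rng.gauss(0.0, 0.3) for x in descriptor]

p_value = permutation_p_value(descriptor, activity)
low, high = bootstrap_oob_interval(descriptor, activity)
print(f"LOO RMSE: {loo_rmse(descriptor, activity):.3f}")
print(f"permutation p-value: {p_value:.4f}")
print(f"95% bootstrap interval for out-of-bag RMSE: [{low:.3f}, {high:.3f}]")
```

A small permutation p-value suggests the model captures a real descriptor/activity relationship and not chance structure, while the width of the bootstrap interval gives a rough indication of how much the reported error could vary across resampled versions of the same small dataset.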
All articles included in the Revista Facultad de Ingeniería are published under the Creative Commons (BY) license.
Authors must complete, sign, and submit the Review and Publication Authorization Form of the manuscript provided by the Journal; this form should contain all the originality and copyright information of the manuscript.
The authors who publish in this Journal accept the following conditions:
a. The authors retain the copyright and transfer the right of the first publication to the journal, with the work registered under the Creative Commons attribution license, which allows third parties to use what is published as long as they mention the authorship of the work and the first publication in this Journal.
b. Authors can make other independent and additional contractual agreements for the non-exclusive distribution of the version of the article published in this journal (e.g., include it in an institutional repository or publish it in a book), provided they clearly indicate that the work was first published in this Journal.
c. Authors are allowed and encouraged to publish their work on the Internet (for example, on institutional or personal pages) before and during the review and publication process, as it can lead to productive exchanges and a greater and faster dissemination of published work.
d. The Journal authorizes the total or partial reproduction of the content of the publication, as long as the source is cited, that is, the name of the Journal, name of the author(s), year, volume, publication number, and pages of the article.
e. The ideas and statements issued by the authors are their responsibility and in no case bind the Journal.