Assessing the behavior of machine learning methods to predict the activity of antimicrobial peptides
Abstract
This study demonstrates the importance of obtaining statistically stable results when using machine learning methods to predict the activity of antimicrobial peptides, particularly when datasets are small (fewer than a few hundred instances) because of the cost and complexity of the chemical processes involved. As in other fields with similar problems, such small datasets result in large variability in the performance of predictive models, hindering any attempt to transfer them to lab practice. Rather than targeting good peak performance obtained from very particular experimental setups, as reported in related literature, we focused on characterizing the behavior of the machine learning methods as a preliminary step toward reproducible results across experimental setups and, ultimately, good performance. We propose a methodology that integrates feature learning (autoencoders) and feature selection (genetic algorithms) through the exhaustive use of performance metrics (permutation tests and bootstrapping), which provide stronger statistical evidence to support investment decisions with the lab resources at hand. We show evidence for the usefulness of 1) the extensive use of computational resources, and 2) adopting a wider range of metrics than those reported in the literature to assess method performance. This approach guided our search for suitable machine learning methods and yielded results comparable to those in the literature with strong statistical stability.
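As a rough illustration of the two stability checks named above, the following minimal sketch runs a permutation test and an out-of-bag bootstrap on a small synthetic regression problem. It is not the study's pipeline: a one-variable least-squares line stands in for the actual regressors, the data are simulated, and every function name (`fit_line`, `r_squared`, `permutation_p_value`, `bootstrap_interval`) is hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (stand-in model); returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def r_squared(xs, ys, model):
    """Coefficient of determination of `model` evaluated on (xs, ys)."""
    a, b = model
    my = statistics.fmean(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0


def permutation_p_value(xs, ys, n_perm, rng):
    """Compare the observed score with scores obtained on shuffled targets."""
    observed = r_squared(xs, ys, fit_line(xs, ys))
    hits = 0
    for _ in range(n_perm):
        shuffled = ys[:]
        rng.shuffle(shuffled)  # break any link between inputs and targets
        if r_squared(xs, shuffled, fit_line(xs, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)


def bootstrap_interval(xs, ys, n_boot, rng, alpha=0.05):
    """Percentile interval of out-of-bag scores over bootstrap resamples."""
    n = len(xs)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        chosen = set(idx)
        oob = [i for i in range(n) if i not in chosen]
        if len(oob) < 2:
            continue  # too few held-out instances to score
        model = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        scores.append(
            r_squared([xs[i] for i in oob], [ys[i] for i in oob], model)
        )
    scores.sort()
    low = scores[int((alpha / 2) * (len(scores) - 1))]
    high = scores[int((1 - alpha / 2) * (len(scores) - 1))]
    return low, high


# Synthetic data only: 60 instances, mimicking a small dataset.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2.0 + 0.8 * x + rng.gauss(0, 1.0) for x in xs]

observed, p_value = permutation_p_value(xs, ys, 200, rng)
low, high = bootstrap_interval(xs, ys, 200, rng)
print(f"R^2 = {observed:.3f}, permutation p-value = {p_value:.4f}")
print(f"95% bootstrap interval for out-of-bag R^2: [{low:.3f}, {high:.3f}]")
```

A small p-value suggests the model captures real structure rather than chance, while a wide bootstrap interval signals that the reported score depends heavily on which instances happened to be sampled.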
Keywords
antimicrobial peptides, learning curves, machine learning, statistical stability, support vector regression