Machine Translation of a Training Set for Semantic Relation Extraction
Abstract
Machine translation (MT) is used to obtain annotated corpora from English-language corpora, which can then be applied to different natural language processing (NLP) tasks. Given that more resources and datasets for training NLP models exist in English, this article explores the use of MT to automate NLP tasks in Spanish. To that end, it describes a dataset for generic relation extraction (reACE) and the construction of a semantic relation extraction (RE) model for Spanish, based on the set of samples translated from English into Spanish. The results show that the MT task requires a pre-editing step on the English corpus, in order to avoid translation and post-editing errors and to preserve the annotations of the original corpus. The Spanish RE models achieve precision, recall, and F-measure values comparable to those obtained by the model for English, which suggests that machine translation is a useful tool for carrying out NLP tasks in Spanish.
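The three evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes micro-averaging over relation labels and a dedicated "no relation" label; the function name, the `NONE` label, and the sample labels are hypothetical choices for illustration and are not taken from the article.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    gold: Sequence[str],
    predicted: Sequence[str],
    no_relation: str = "NONE",  # assumed label for "no relation holds"
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over relation labels.

    A prediction counts as a true positive only when it names a relation
    (not `no_relation`) and matches the gold label exactly.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    true_pos = sum(
        1 for g, p in zip(gold, predicted) if p != no_relation and g == p
    )
    predicted_pos = sum(1 for p in predicted if p != no_relation)
    gold_pos = sum(1 for g in gold if g != no_relation)

    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels only; they are not results from the article.
gold_labels = ["EMP-ORG", "NONE", "GPE-AFF", "EMP-ORG"]
pred_labels = ["EMP-ORG", "EMP-ORG", "NONE", "EMP-ORG"]
p, r, f = precision_recall_f1(gold_labels, pred_labels)
```

Running the same function on the predictions of the English model and of the Spanish model, each against its own gold annotations, would give the kind of side-by-side comparison the abstract refers to.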
Keywords
computational linguistics, machine translation, corpus linguistics, relation extraction