Towards a supervised rescoring system for unstructured data bases used to build specialized dictionaries

Authors

Antonio Rico-Sulayes

Abstract

This article proposes the architecture for a system that uses previously learned weights to sort query results from unstructured data bases when building specialized dictionaries. A common resource in dictionary construction, unstructured data bases have been especially useful for providing information about the frequency of lexical items and examples of their use. However, when building specialized dictionaries, whose selection of lexical items does not rely on frequency, these data bases are reduced to a simple source of examples. Even for this task, the information they provide may not be very useful when the lexicographer is looking for specialized uses of lexical items that have several meanings and return very long lists of results. To address this problem, long lists of hits can be rescored with a supervised learning model that relies on previously helpful results. The allocation of a vast set of high-quality training data for this rescoring system is reported here. Finally, the architecture of such a system, an unprecedented tool in specialized lexicography, is proposed.
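The rescoring idea described in the abstract can be illustrated with a minimal sketch. It is not the architecture the article proposes: the learning method (smoothed log-odds term weights, in the style of a naive Bayes scorer), the function names `tokenize`, `learn_weights` and `rescore`, and the toy English sentences are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (deliberately naive)."""
    return re.findall(r"\w+", text.lower())


def learn_weights(labeled_hits):
    """Learn one weight per term from (text, was_helpful) pairs.

    Each weight is the add-one smoothed log-odds of the term occurring
    in helpful hits versus unhelpful hits.
    """
    helpful, unhelpful = Counter(), Counter()
    for text, was_helpful in labeled_hits:
        (helpful if was_helpful else unhelpful).update(tokenize(text))
    vocab = set(helpful) | set(unhelpful)
    n_pos = sum(helpful.values()) + len(vocab)
    n_neg = sum(unhelpful.values()) + len(vocab)
    return {
        term: math.log((helpful[term] + 1) / n_pos)
        - math.log((unhelpful[term] + 1) / n_neg)
        for term in vocab
    }


def rescore(hits, weights):
    """Return hits sorted by mean term weight, highest first.

    Terms never seen in training contribute a weight of 0.
    """
    def score(text):
        tokens = tokenize(text)
        if not tokens:
            return 0.0
        return sum(weights.get(t, 0.0) for t in tokens) / len(tokens)

    return sorted(hits, key=score, reverse=True)


# Hypothetical training data: earlier query results that a lexicographer
# marked as helpful (True) or unhelpful (False) for a slang dictionary.
training = [
    ("he rolled a joint behind the bar", True),
    ("they smoked a joint at the party", True),
    ("the knee joint was swollen", False),
    ("a joint statement from both ministers", False),
]

# Hypothetical new query results for the same polysemous word.
new_hits = [
    "the hip joint needs surgery",
    "she lit a joint after the party",
    "joint venture between two banks",
]

weights = learn_weights(training)
ranked = rescore(new_hits, weights)
for hit in ranked:
    print(hit)
```

With this toy data, the hit that shares vocabulary with the previously helpful examples moves to the top of the list. A real system would need far more training data and a stronger model; the sketch only shows how learned weights can reorder a list of hits.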

Licence

All articles included in the Revista Facultad de Ingeniería are published under the Creative Commons (BY) license.

Authors must complete, sign, and submit the Review and Publication Authorization Form of the manuscript provided by the Journal; this form should contain all the originality and copyright information of the manuscript.

The authors who publish in this Journal accept the following conditions:

a. The authors retain the copyright and grant the Journal the right of first publication, with the work registered under the Creative Commons Attribution license, which allows third parties to use what is published as long as they credit the authorship of the work and its first publication in this Journal.

b. Authors can make other independent and additional contractual agreements for the non-exclusive distribution of the version of the article published in this Journal (e.g., include it in an institutional repository or publish it in a book), provided they clearly indicate that the work was first published in this Journal.

c. Authors are allowed and recommended to publish their work on the Internet (for example on institutional or personal pages) before and during the review and publication process, as it can lead to productive exchanges and a greater and faster dissemination of published work.

d. The Journal authorizes the total or partial reproduction of the content of the publication, as long as the source is cited, that is, the name of the Journal, name of the author(s), year, volume, issue number, and pages of the article.

e. The ideas and statements issued by the authors are their responsibility and in no case bind the Journal.
