Towards a supervised rescoring system for unstructured data bases used to build specialized dictionaries

Abstract

This article proposes an architecture for a system that uses previously learned weights to sort query results from unstructured data bases when building specialized dictionaries. A common resource in dictionary construction, unstructured data bases have been especially useful in providing information about the frequency of lexical items and examples of their use. However, when building specialized dictionaries, whose selection of lexical items does not rely on frequency, these data bases are reduced to a simple source of examples. Even in this task, the information they provide may not be very useful when one is looking for specialized uses of lexical items that have various meanings and return very long lists of results. To address this problem, long lists of hits can be rescored with a supervised learning model that relies on previously helpful results. The compilation of a large set of high-quality training data for this rescoring system is reported here. Finally, the architecture of such a system, an unprecedented tool in specialized lexicography, is proposed.

Keywords

unstructured data bases, supervised rescoring, specialized lexicography, dictionary making

