Measuring Representativeness Using Covering Array Principles

Abstract

Representativeness is an important data quality characteristic in data science processes: a data sample is said to be representative when it reflects the larger group from which it was drawn as accurately as possible. Low representativeness in the data can lead to biased models. This study therefore presents the elements of a new model for measuring representativeness using the "P Matrix", a mathematical object drawn from the theory of covering arrays used in combinatorial testing. To test the model, an experiment was designed in which a dataset is divided into training and test subsets using two sampling strategies, random and stratified, and their representativeness values are compared; if the split is adequate, the two strategies should yield similar representativeness indexes. The model was implemented in a prototype built with Python (data processing) and Vue (data visualization); this version of the model can, for now, analyze only binary datasets. The model was evaluated on the "Wine" dataset from the UC Irvine Machine Learning Repository. Both sampling strategies produced similar representativeness results for this dataset. Although this outcome was predictable, it confirms that adequate representativeness of the data is important when generating the training and test subsets. As future work, we therefore plan to extend the model to categorical data and to explore more complex datasets.
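The abstract does not specify how the P Matrix computes its representativeness index, so as a rough, hypothetical illustration of the covering-array idea behind the experiment, the sketch below measures what fraction of the full binary dataset's pairwise (strength-2) value combinations also appear in a sample. All names here (`pairwise_coverage`, `representativeness`) are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the paper's P Matrix): a strength-2 coverage-based
# representativeness index for a binary dataset, comparing a sample to the
# full data, in the spirit of covering arrays.
from itertools import combinations
import random

def pairwise_coverage(rows, n_cols):
    """Set of (col_i, col_j, val_i, val_j) value combinations present in rows."""
    covered = set()
    for row in rows:
        for i, j in combinations(range(n_cols), 2):
            covered.add((i, j, row[i], row[j]))
    return covered

def representativeness(sample, full, n_cols):
    """Fraction of the full data's pairwise combinations that the sample covers."""
    full_cov = pairwise_coverage(full, n_cols)
    samp_cov = pairwise_coverage(sample, n_cols)
    return len(samp_cov & full_cov) / len(full_cov)

if __name__ == "__main__":
    random.seed(0)
    # Synthetic binary dataset: 200 rows, 4 columns.
    data = [[random.randint(0, 1) for _ in range(4)] for _ in range(200)]
    # Random split: pick 70% of the rows uniformly, as one of the two strategies.
    sample = random.sample(data, 140)
    print(round(representativeness(sample, data, 4), 3))
```

A stratified split would draw the same proportion from each class before calling `representativeness`; per the experiment, an adequate split is one where both strategies yield similar index values.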

Keywords

classification algorithms, covering arrays, data quality, data sets, data representativeness

References

  • D. Srivastava, M. Scannapieco, T. C. Redman, “Ensuring high-quality private data for responsible data science: Vision and challenges,” Journal of Data and Information Quality, vol. 11, no. 1, pp. 1–9, 2019. https://doi.org/10.1145/3287168
  • R. Clarke, “Big data, big risks,” Information Systems Journal, vol. 26, no. 1, pp. 77–90, 2016. https://doi.org/10.1111/isj.12088
  • A. Alsudais, “Incorrect Data in the Widely Used Inside Airbnb Dataset,” arXiv preprint arXiv:2007.03019, 2020. http://arxiv.org/abs/2007.03019
  • A. Yapo, J. Weiss, “Ethical Implications of Bias in Machine Learning,” in Proceedings of the 51st Hawaii International Conference on System Sciences, 2018. https://doi.org/10.24251/hicss.2018.668
  • N. Polyzotis, S. Roy, S. E. Whang, M. Zinkevich, “Data lifecycle challenges in production machine learning: A survey,” SIGMOD Record, vol. 47, no. 2, pp. 17–28, 2018. https://doi.org/10.1145/3299887.3299891
  • J. A. Rojas, M. Beth Kery, S. Rosenthal, A. Dey, “Sampling techniques to improve big data exploration”, in 7th Symposium on Large Data Analysis and Visualization, 2017, pp. 26–35. https://doi.org/10.1109/LDAV.2017.8231848
  • V. Mayer-Schönberger, K. Cukier, Big data: La revolución de los datos masivos, Turner, 2013.
  • J. Torres-Jimenez, I. Izquierdo-Marquez, “Survey of covering arrays,” in 15th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, 2013, pp. 20–27. https://doi.org/10.1109/SYNASC.2013.10
  • J. A. Timaná-Peña, C. A. Cobos-Lozada, J. Torres-Jimenez, “Metaheuristic algorithms for building Covering Arrays: A review,” Revista Facultad de Ingeniería, vol. 25, no. 43, pp. 31–45, 2016. https://doi.org/10.19053/01211129.v25.n43.2016.5295
  • C. L. Blake, C. J. Merz, UCI Repository of Machine Learning Databases, University of California, Irvine, 1998. https://archive.ics.uci.edu/ml/datasets/wine

