Anomalies detection for big data

Main Article Content

Autores

Omar Torres-Domínguez http://orcid.org/0000-0003-4582-7549
Samuel Sabater-Fernández http://orcid.org/0000-0003-4552-9103
Lisandra Bravo-Ilisatigui, M.Sc. http://orcid.org/0000-0002-8209-4121
Diana Martin-Rodríguez, Ph. D.
Milton García-Borroto, Ph. D. http://orcid.org/0000-0002-3154-177X

Abstract

The development of the digital age has resulted in a considerable increase in data volumes. These large volumes of data have been called big data since they exceed the processing capacity of conventional database systems. Several sectors consider various opportunities and applications in the detection of anomalies in big data problems. This type of analysis can be very useful the use of data mining techniques because it allows extracting patterns and relationships from large amounts of data. The processing and analysis of these data volumes need tools capable of processing them as Apache Spark and Hadoop. These tools do not have specific algorithms for detecting anomalies. The general objective of the work is to develop a new algorithm for the detection of neighborhood-based anomalies in big data problems. From a comparative study, the KNNW algorithm was selected by its results, in order to design a big data variant. The implementation of the big data algorithm was done in the Apache Spark tool, using the parallel programming paradigm MapReduce. Subsequently different experiments were performed to analyze the behavior of the algorithm with different configurations. Within the experiments, the execution times and the quality of the results were compared between the sequential variant and the big data variant. Getting better results, the big data variant with significant difference. Getting the big data variant, KNNW-BigData, can process large volumes of data.


Keywords: big data; data mining; detecting anomalies; MapReduce.

Keywords:

Article Details

Licence

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

The journal authorizes the total or partial reproduction of the published article, as long as the source, including the name of the Journal, author(s), year, volume, issue, and pages are cited.

The ideas and assertions expressed by the authors are their solely responsibility and do not represent the views and opinions of the Journal or its editors.

All articles included in the Revista Facultad de Ingeniería are published under the Creative Commons (BY) license.

Authors must complete, sign, and submit the Review and Publication Authorization Form of the manuscript provided by the Journal; this form should contain all the originality and copyright information of the manuscript.

The authors  keep copyright, however, once the work in the Journal has been published, the authors must always allude to it.

The Journal allows and invites authors to publish their work in repositories or on their website after the presentation of the number in which the work is published with the aim of generating greater dissemination of the work.

References

[1] R. Bolton, and D. Hand, "Statistical fraud detection: A review," Statistical science, pp. 235-249, 2002.

[2] K. Chitra, and B. Subashini, "Data mining techniques and its applications in banking sector," International Journal of Emerging Technology and Advanced Engineering, vol. 3, pp. 219-226, 2013.

[3] S.-H. Li, D. C. Yen, W.-H. Lu, and C. Wang, "Identifying the signs of fraudulent accounts using data mining techniques," Computers in Human Behavior, vol. 28 (3), pp. 1002-1013, May. 2012. DOI: https://doi.org/10.1016/j.chb.2012.01.002.

[4] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM computing surveys (CSUR), vol. 41 (3), pp. 1-15, Jul. 2009. DOI: https://doi.org/10.1145/1541880.1541882.

[5] J. Zhang, "Advancements of outlier detection: A survey," ICST Transactions on Scalable Information Systems, vol. 13 (1), pp. 1-26, Feb. 2013. DOI: https://doi.org/10.4108/trans.sis.2013.01-03.e2.

[6] L. M. Cruz-Quispe, and M. T. Rantes-García, "Detección de fraudes usando técnicas de clustering," 2010.

[7] M. Vadoodparast, and A. R. Hamdan, "Fraudulent Electronic Transaction Detection using dynamic KDA model," International Journal of Computer Science and Information Security, vol. 13, p. 90, 2015.

[8] M.Zhang, J.Salerno, and P.Yu, "Applying data mining in investigating money laundering crimes," in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 2003, pp. 747-752. DOI: https://doi.org/10.1145/956750.956851.

[9] B. Baesens, Analytics in a big data world: The essential guide to data science and its applications. New Jersey: John Wiley & Sons, 2014.

[10] J. Coumaros, S. D. Roys, L. Chretien, J. Buvat, S. KVJ, V. Clerk, et al., "Big Data Alchemy: How can Banks Maximize the Value of their Customer Data?," Capgemini Consulting, 2014.

[11] V. Mayer-Schönberger, and K. Cukier, Big data: A revolution that will transform how we live, work, and think. New York: Houghton Mifflin Harcourt, 2013.

[12] N. Marz, and J. Warren, Big Data: Principles and best practices of scalable realtime data systems. Manning Publications Co., 2015.

[13] S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced Analytics with Spark: Patterns for Learning from Data at Scale. O'Reilly Media, Inc., 2015.

[14] H. Karau, Fast Data Processing with Spark: Packt Publishing Ltd, 2013.

[15] H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning spark: lightning-fast big data analysis. O'Reilly Media, Inc., 2015.

[16] M. Breungi, P. Kriegel, R. Ng, and J. Sander, "LOF: identifying density-based local outliers," in ACM sigmod record, 2000, pp. 93-104. DOI: https://doi.org/10.1145/335191.335388.

Downloads

Download data is not yet available.