Anomalies detection for big data

Main Article Content

Autores

Omar Torres-Domínguez http://orcid.org/0000-0003-4582-7549
Samuel Sabater-Fernández http://orcid.org/0000-0003-4552-9103
Lisandra Bravo-Ilisatigui, M.Sc. http://orcid.org/0000-0002-8209-4121
Diana Martin-Rodríguez, Ph. D.
Milton García-Borroto, Ph. D. http://orcid.org/0000-0002-3154-177X

Abstract

The development of the digital age has resulted in a considerable increase in data volumes. These large volumes of data have been called big data since they exceed the processing capacity of conventional database systems. Several sectors consider various opportunities and applications in the detection of anomalies in big data problems. This type of analysis can be very useful the use of data mining techniques because it allows extracting patterns and relationships from large amounts of data. The processing and analysis of these data volumes need tools capable of processing them as Apache Spark and Hadoop. These tools do not have specific algorithms for detecting anomalies. The general objective of the work is to develop a new algorithm for the detection of neighborhood-based anomalies in big data problems. From a comparative study, the KNNW algorithm was selected by its results, in order to design a big data variant. The implementation of the big data algorithm was done in the Apache Spark tool, using the parallel programming paradigm MapReduce. Subsequently different experiments were performed to analyze the behavior of the algorithm with different configurations. Within the experiments, the execution times and the quality of the results were compared between the sequential variant and the big data variant. Getting better results, the big data variant with significant difference. Getting the big data variant, KNNW-BigData, can process large volumes of data.


Keywords: big data; data mining; detecting anomalies; MapReduce.

Keywords:

Article Details

Licence

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

All articles included in the Revista Facultad de Ingeniería are published under the Creative Commons (BY) license.

Authors must complete, sign, and submit the Review and Publication Authorization Form of the manuscript provided by the Journal; this form should contain all the originality and copyright information of the manuscript.

The authors who publish in this Journal accept the following conditions:

a. The authors retain the copyright and transfer the right of the first publication to the journal, with the work registered under the Creative Commons attribution license, which allows third parties to use what is published as long as they mention the authorship of the work and the first publication in this Journal.

b. Authors can make other independent and additional contractual agreements for the non-exclusive distribution of the version of the article published in this journal (eg, include it in an institutional repository or publish it in a book) provided they clearly indicate that the work It was first published in this Journal.

c. Authors are allowed and recommended to publish their work on the Internet (for example on institutional or personal pages) before and during the process.
review and publication, as it can lead to productive exchanges and a greater and faster dissemination of published work.

d. The Journal authorizes the total or partial reproduction of the content of the publication, as long as the source is cited, that is, the name of the Journal, name of the author (s), year, volume, publication number and pages of the article.

e. The ideas and statements issued by the authors are their responsibility and in no case bind the Journal.

References

[1] R. Bolton, and D. Hand, "Statistical fraud detection: A review," Statistical science, pp. 235-249, 2002.

[2] K. Chitra, and B. Subashini, "Data mining techniques and its applications in banking sector," International Journal of Emerging Technology and Advanced Engineering, vol. 3, pp. 219-226, 2013.

[3] S.-H. Li, D. C. Yen, W.-H. Lu, and C. Wang, "Identifying the signs of fraudulent accounts using data mining techniques," Computers in Human Behavior, vol. 28 (3), pp. 1002-1013, May. 2012. DOI: https://doi.org/10.1016/j.chb.2012.01.002.

[4] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM computing surveys (CSUR), vol. 41 (3), pp. 1-15, Jul. 2009. DOI: https://doi.org/10.1145/1541880.1541882.

[5] J. Zhang, "Advancements of outlier detection: A survey," ICST Transactions on Scalable Information Systems, vol. 13 (1), pp. 1-26, Feb. 2013. DOI: https://doi.org/10.4108/trans.sis.2013.01-03.e2.

[6] L. M. Cruz-Quispe, and M. T. Rantes-García, "Detección de fraudes usando técnicas de clustering," 2010.

[7] M. Vadoodparast, and A. R. Hamdan, "Fraudulent Electronic Transaction Detection using dynamic KDA model," International Journal of Computer Science and Information Security, vol. 13, p. 90, 2015.

[8] M.Zhang, J.Salerno, and P.Yu, "Applying data mining in investigating money laundering crimes," in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 2003, pp. 747-752. DOI: https://doi.org/10.1145/956750.956851.

[9] B. Baesens, Analytics in a big data world: The essential guide to data science and its applications. New Jersey: John Wiley & Sons, 2014.

[10] J. Coumaros, S. D. Roys, L. Chretien, J. Buvat, S. KVJ, V. Clerk, et al., "Big Data Alchemy: How can Banks Maximize the Value of their Customer Data?," Capgemini Consulting, 2014.

[11] V. Mayer-Schönberger, and K. Cukier, Big data: A revolution that will transform how we live, work, and think. New York: Houghton Mifflin Harcourt, 2013.

[12] N. Marz, and J. Warren, Big Data: Principles and best practices of scalable realtime data systems. Manning Publications Co., 2015.

[13] S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced Analytics with Spark: Patterns for Learning from Data at Scale. O'Reilly Media, Inc., 2015.

[14] H. Karau, Fast Data Processing with Spark: Packt Publishing Ltd, 2013.

[15] H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning spark: lightning-fast big data analysis. O'Reilly Media, Inc., 2015.

[16] M. Breungi, P. Kriegel, R. Ng, and J. Sander, "LOF: identifying density-based local outliers," in ACM sigmod record, 2000, pp. 93-104. DOI: https://doi.org/10.1145/335191.335388.

Downloads

Download data is not yet available.