Anomaly detection for big data
Abstract
The development of the digital age has resulted in a considerable increase in data volumes. These large volumes of data are called big data because they exceed the processing capacity of conventional database systems. Several sectors see opportunities and applications for anomaly detection in big data problems. Data mining techniques are very useful for this type of analysis because they allow patterns and relationships to be extracted from large amounts of data. Processing and analyzing these data volumes requires tools capable of handling them, such as Apache Spark and Hadoop. These tools do not have specific algorithms for detecting anomalies. The general objective of this work is to develop a new neighborhood-based anomaly detection algorithm for big data problems. Based on the results of a comparative study, the KNNW algorithm was selected as the basis for a big data variant. The big data algorithm was implemented in Apache Spark using the MapReduce parallel programming paradigm. Different experiments were then performed to analyze the behavior of the algorithm under different configurations. The experiments compared the execution times and the quality of the results of the sequential variant and the big data variant. The big data variant obtained better results, with a significant difference. The resulting big data variant, KNNW-BigData, can process large volumes of data.
Keywords: big data; data mining; detecting anomalies; MapReduce.
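To make the neighborhood-based idea concrete, the following is a minimal sketch in plain Python, not the KNNW-BigData implementation itself. It assumes that KNNW scores each point by the sum of the distances to its k nearest neighbors, which is how the name is commonly used; the abstract does not spell out the definition. The partitioning scheme and the function names (`split`, `map_partition`, `merge`, `knn_weight_scores`) are hypothetical and only imitate the map and reduce steps that a Spark job would distribute across workers.

```python
import heapq
import math
from functools import reduce


def split(points, n_parts):
    """Assign indexed points to n_parts chunks round-robin (hypothetical partitioning)."""
    indexed = list(enumerate(points))
    return [indexed[i::n_parts] for i in range(n_parts)]


def map_partition(chunk, queries, k):
    """Map step: for each query point, keep the k smallest distances to points in this chunk."""
    local = {}
    for qi, q in queries:
        # Skip the query point itself so it is not counted as its own neighbor.
        dists = [math.dist(q, p) for pi, p in chunk if pi != qi]
        local[qi] = heapq.nsmallest(k, dists)
    return local


def merge(a, b, k):
    """Reduce step: merge two candidate maps, keeping the k smallest distances per point."""
    return {qi: heapq.nsmallest(k, a[qi] + b[qi]) for qi in a}


def knn_weight_scores(points, k, n_parts=2):
    """Assumed KNNW score: sum of distances to the k nearest neighbors, computed chunk by chunk."""
    queries = list(enumerate(points))
    partials = [map_partition(chunk, queries, k) for chunk in split(points, n_parts)]
    merged = reduce(lambda a, b: merge(a, b, k), partials)
    return [sum(merged[i]) for i in range(len(points))]


# Four points on a unit square plus one distant point that should receive the largest score.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = knn_weight_scores(points, k=2, n_parts=2)
```

Because each chunk only reports its k best candidates per point, the merged result matches what a single sequential pass would compute, while the per-chunk work can in principle run in parallel. This sketch still compares every point against every other point; any pruning or broadcasting strategy used by the actual algorithm is outside what the abstract describes.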