Energy-based anomaly detection for mixed data

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Anomalies are data points that deviate significantly from the norm. Anomaly detection thus amounts to finding data points located far from their neighbors, i.e., those lying in low-density regions. Classic anomaly detection methods are largely designed for a single data type, either continuous or discrete. However, real-world data is increasingly heterogeneous: a data point can have both discrete and continuous attributes. Mixed data poses multiple challenges, including (a) capturing the inter-type correlation structures and (b) measuring deviation from the norm under multiple types. These challenges are exacerbated in (c) high-dimensional regimes. In this paper, we propose a new scalable unsupervised anomaly detection method for mixed data based on the Mixed-variate Restricted Boltzmann Machine (Mv.RBM). The Mv.RBM is a principled probabilistic method that estimates the density of mixed data. We propose to use the free energy derived from Mv.RBM as the anomaly score, as it is identical to the negative log-density of the data up to an additive constant. We then extend this method to detect anomalies across multiple levels of data abstraction, an effective approach for high-dimensional settings. The extension is dubbed \(\mathtt {MIXMAD}\), which stands for MIXed data Multilevel Anomaly Detection. In \(\mathtt {MIXMAD}\), we sequentially construct an ensemble of mixed-data Deep Belief Nets (DBNs) with varying depths. Each DBN is an energy-based detector at a predefined abstraction level. Predictions across the ensemble are finally combined via a simple rank aggregation method. The proposed methods are evaluated on a comprehensive suite of synthetic and real high-dimensional datasets. The results demonstrate that for anomaly detection, (a) proper handling of mixed types is necessary, (b) free energy is a powerful anomaly scoring method, (c) multilevel abstraction of data is important for high-dimensional data, and (d) empirically, Mv.RBM and \(\mathtt {MIXMAD}\) are superior to popular unsupervised detection methods for both homogeneous and mixed data.
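
The two scoring ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy RBM whose visible units are either binary or unit-variance Gaussian, for which the free energy takes the standard form F(v) = Σ_cont (v_i − a_i)²/2 − Σ_bin a_i v_i − Σ_k log(1 + exp(b_k + Σ_i W_ik v_i)). The abstract does not specify the rank aggregation rule, so average rank is used here purely as an assumption. All function and parameter names (`free_energy`, `aggregate_ranks`, `a`, `b`, `W`) are hypothetical.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def free_energy(v, is_continuous, a, b, W):
    """Free energy of one data point under a toy mixed RBM (assumed form).

    v             -- visible values (0/1 for binary, real for continuous)
    is_continuous -- one flag per visible unit
    a, b          -- visible and hidden biases
    W             -- weights, W[i][k] links visible i to hidden k
    Higher free energy means lower model density, i.e. more anomalous.
    """
    visible_term = 0.0
    for v_i, cont, a_i in zip(v, is_continuous, a):
        if cont:
            # Unit-variance Gaussian visible unit (an assumption).
            visible_term += 0.5 * (v_i - a_i) ** 2
        else:
            # Binary visible unit.
            visible_term -= a_i * v_i
    hidden_term = 0.0
    for k, b_k in enumerate(b):
        activation = b_k + sum(W[i][k] * v[i] for i in range(len(v)))
        hidden_term += softplus(activation)
    return visible_term - hidden_term


def ranks_descending(scores):
    # Rank 1 goes to the highest score; ties are broken by index order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def aggregate_ranks(score_lists):
    """Average rank per data point across detectors (assumed rule).

    score_lists -- one list of anomaly scores per detector, all of
                   the same length. A smaller average rank means the
                   point is more consistently flagged as anomalous.
    """
    per_detector = [ranks_descending(s) for s in score_lists]
    n_points = len(score_lists[0])
    n_detectors = len(per_detector)
    return [
        sum(r[i] for r in per_detector) / n_detectors
        for i in range(n_points)
    ]
```

In this sketch, each detector in an ensemble would score every data point with `free_energy`, and `aggregate_ranks` would combine the per-detector score lists. Working with ranks rather than raw scores sidesteps the fact that free energies from models of different depths are not on a common scale.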

Notes

  1. A preliminary version of this paper has been published in [16].

  2. The original Mv.RBM also covers rank data, but we do not consider it in this paper.

  3. http://yann.lecun.com/exdb/mnist/.

  4. https://archive.ics.uci.edu/ml/datasets.html.

References

  1. Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional space. In: International conference on database theory, Springer, pp 420–434

  2. Aggarwal CC, Sathe S (2015) Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explor Newsl 17(1):24–47

  3. Akoglu L, Tong H, Vreeken J, Faloutsos C (2012) Fast and reliable anomaly detection in categorical data. In: Proceedings of the 21st ACM international conference on information and knowledge management, ACM, pp 415–424

  4. Angiulli F, Pizzuti C (2002) Fast outlier detection in high dimensional spaces. In: European conference on principles of data mining and knowledge discovery, Springer, pp 15–27

  5. Becker J, Havens TC, Pinar A, Schulz TJ (2015) Deep belief networks for false alarm rejection in forward-looking ground-penetrating radar. In: SPIE defense+ security, International Society for Optics and Photonics, pp 94540W–94540W

  6. Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828

  7. Bontemps L, McDermott J, Le-Khac NA et al (2016) Collective anomaly detection based on long short-term memory recurrent neural networks. In: International conference on future data and security engineering, Springer, pp 141–152

  8. Bouguessa M (2015) A practical outlier detection approach for mixed-attribute data. Expert Syst Appl 42(22):8637–8649

  9. Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: ACM sigmod record, vol 29. ACM, pp 93–104

  10. Campos GO, Zimek A, Sander J, Campello RJGB, Micenková B, Schubert E, Assent I, Houle ME (2015) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min Knowl Discov 30(4):891–927

  11. Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv (CSUR) 41(3):15

  12. Chauhan S, Vig L (2015) Anomaly detection in ECG time signals via deep long short-term memory networks. In: IEEE international conference on data science and advanced analytics (DSAA), 2015, IEEE, pp 1–7

  13. Cheng M, Xu Q, Lv J, Liu W, Li Q, Wang J (2016) MS-LSTM: a multi-scale LSTM model for BGP anomaly detection. In: IEEE 24th international conference on network protocols (ICNP), 2016, IEEE, pp 1–6

  14. Das K, Schneider J, Neill DB (2008) Anomaly pattern detection in categorical datasets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 169–176

  15. De Leon AR, Chough KC (2013) Analysis of mixed data: methods & applications. CRC Press, Boca Raton

  16. Do K, Tran T, Phung D, Venkatesh S (2016) Outlier detection on mixed-type data: an energy-based approach. In: International conference on advanced data mining and applications (ADMA 2016)

  17. Fiore U, Palmieri F, Castiglione A, De Santis A (2013) Network anomaly detection with the restricted Boltzmann machine. Neurocomputing 122:13–23

  18. Gao N, Gao L, Gao Q, Wang H (2014) An intrusion detection model based on deep belief networks. In: Second international conference on advanced cloud and big data (CBD), 2014, IEEE, pp 247–252

  19. Ghoting A, Otey ME, Parthasarathy S (2004) Loaded: link-based outlier and anomaly detection in evolving data sets. In: ICDM, pp 387–390

  20. Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14:1771–1800

  21. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507

  22. Ienco D, Pensa RG, Meo R (2016) A semisupervised approach to the detection and characterization of outliers in categorical data. IEEE Trans Neural Netw Learn Syst 28(5):1017–1029

  23. Kamyshanska H, Memisevic R (2015) The potential energy of an autoencoder. IEEE Trans Pattern Anal Mach Intell 37(6):1261–1273

  24. Kingma D, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980

  25. Koufakou A, Georgiopoulos M (2010) A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes. Data Min Knowl Discov 20(2):259–289

  26. Koufakou A, Georgiopoulos M, Anagnostopoulos GC (2008) Detecting outliers in high-dimensional datasets with mixed attributes. In: DMIN, Citeseer, pp 427–433

  27. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444

  28. Lu YC, Feng C, Yating W, Lu CT (2016) Discovering anomalies on mixed-type data using a generalized student-t based approach. IEEE Trans Knowl Data Eng. https://doi.org/10.1109/TKDE.2016.2583429

  29. Malhotra P, Vig L, Shroff G, Agarwal P (2015) Long short term memory networks for anomaly detection in time series. In: Proceedings of ESANN, Presses universitaires de Louvain, pp 89–94

  30. Mehta P, Schwab DJ (2014) An exact mapping between the variational renormalization group and deep learning. arXiv preprint arXiv:1410.3831

  31. Nguyen TD, Tran T, Phung D, Venkatesh S (2013) Latent patient profile modelling and applications with mixed-variate restricted Boltzmann machine. In: Proceedings of Pacific-Asia conference on knowledge discovery and data mining (PAKDD), Gold Coast, Queensland, Australia

  32. Nguyen TD, Tran T, Phung D, Venkatesh S (2013) Learning sparse latent representation and distance metric for image retrieval. In: Proceedings of IEEE international conference on multimedia & expo, California, USA, July 15–19

  33. Otey ME, Parthasarathy S, Ghoting A (2005) Fast lightweight outlier detection in mixed-attribute data. Technical report, OSU–CISRC–6/05–TR43

  34. Pai HT, Wu F, Hsueh PYSS (2014) A relative patterns discovery for enhancing outlier detection in categorical data. Dec Support Syst 67:90–99

  35. Papadimitriou S, Kitagawa H, Gibbons PB, Faloutsos C (2003) Loci: fast outlier detection using the local correlation integral. In: Proceedings. 19th international conference on data engineering, 2003. IEEE, pp 315–326

  36. Salakhutdinov R, Hinton G (2009) Semantic hashing. Int J Approx Reas 50(7):969–978

  37. Serfling R, Wang S (2014) General foundations for studying masking and swamping robustness of outlier identifiers. Statis Methodol 20:79–90

  38. Sun J, Wyss R, Steinecker A, Glocker P (2014) Automated fault detection using deep belief networks for the quality inspection of electromotors. tm-Technisches Messen 81(5):255–263

  39. Tagawa T, Tadokoro Y, Yairi T (2014) Structured denoising autoencoder for fault detection and analysis. In: ACML

  40. Tang G, Pei J, Bailey J, Dong G (2015) Mining multidimensional contextual outliers from categorical relational data. Intell Data Anal 19(5):1171–1192

  41. Taylor A, Leblanc S, Japkowicz N (2016) Anomaly detection in automobile control network data with long short-term memory networks. In: IEEE international conference on data science and advanced analytics (DSAA), 2016, IEEE, pp 130–139

  42. Tran N, Jin H (2012) Detecting network anomalies in mixed-attribute data sets. In: Third international conference on knowledge discovery and data mining, 2010. WKDD’10, IEEE, pp 383–386

  43. Tran T, Phung D, Venkatesh S (2013) Thurstonian Boltzmann machines: learning from multiple inequalities. In: International conference on machine learning (ICML), Atlanta, USA, June 16–21

  44. Tran T, Phung DQ, Venkatesh S (2011) Mixed-variate restricted Boltzmann machines. In: Proceedings of 3rd Asian conference on machine learning (ACML), Taoyuan, Taiwan

  45. Tran T, Luo W, Phung D, Morris J, Rickard K, Venkatesh S (2016) Preterm birth prediction: deriving stable and interpretable rules from high dimensional data. In: Conference on machine learning in healthcare, LA, USA

  46. Tuor A, Kaplan S, Hutchinson B, Nichols N, Robinson S (2017) Deep learning for unsupervised insider threat detection in structured cybersecurity data streams. In: Proceedings of the AAAI-17 Workshop on Artificial Intelligence for Cyber Security, pp 224–231

  47. Wang Y, Cai W, Wei P (2016) A deep learning approach for detecting malicious JavaScript code. Secur Commun Netw 9:1520–1534

  48. Ye M, Li X, Orlowska ME (2009) Projected outlier detection in high-dimensional mixed-attributes data set. Expert Syst Appl 36(3):7104–7113

  49. Zhai S, Cheng Y, Lu W, Zhang Z (2016) Deep structured energy based models for anomaly detection. arXiv preprint arXiv:1605.07717

  50. Zhang K, Jin H (2010) An effective pattern based outlier detection approach for mixed attribute data. In: Australasian joint conference on artificial intelligence, Springer, pp 122–131

  51. Zimek A, Schubert E, Kriegel HP (2012) A survey on unsupervised outlier detection in high-dimensional numerical data. Statis Anal Data Mining 5(5):363–387

Acknowledgements

This work is partially supported by the Telstra-Deakin Centre of Excellence in Big Data and Machine Learning.

Author information

Corresponding author

Correspondence to Truyen Tran.

About this article

Cite this article

Do, K., Tran, T. & Venkatesh, S. Energy-based anomaly detection for mixed data. Knowl Inf Syst 57, 413–435 (2018). https://doi.org/10.1007/s10115-018-1168-z

  • DOI: https://doi.org/10.1007/s10115-018-1168-z