
DEMass: a new density estimator for big data

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Density estimation is the ubiquitous base modelling mechanism employed for many tasks including clustering, classification, anomaly detection and information retrieval. Commonly used density estimation methods such as the kernel density estimator and the \(k\)-nearest neighbour density estimator have high time and space complexities, which render them inapplicable to problems with big data. This weakness sets a fundamental limit on existing algorithms for all of these tasks. We propose the first density estimation method, having average-case sub-linear time complexity and constant space complexity in the number of instances, that stretches this fundamental limit to the extent that millions of instances can now be handled easily and quickly. We provide an asymptotic analysis of the new density estimator and verify the generality of the method by replacing existing density estimators with the new one in three current density-based algorithms, namely DBSCAN, LOF and Bayesian classifiers, which represent three different data mining tasks: clustering, anomaly detection and classification. Our empirical evaluation shows that the new density estimation method significantly improves their time and space complexities, while maintaining or improving their task-specific performances in clustering, anomaly detection and classification. The new method empowers these algorithms, currently limited to small data sizes only, to process big data, setting a new benchmark for what density-based algorithms can achieve.
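To make the complexity argument concrete, the sketch below shows the standard textbook kernel density estimator that the abstract contrasts against: every query must touch all \(n\) stored instances, giving linear time per query and linear space overall. This is generic illustrative code, not the authors' implementation; the Gaussian kernel and the bandwidth value are arbitrary choices.

```python
import numpy as np

def kde_density(query, data, bandwidth=0.5):
    """Naive Gaussian kernel density estimate at one query point.

    Cost is O(n * d) per query and the full data set must be kept in
    memory -- the burden that a sub-linear, constant-space estimator avoids.
    """
    n, d = data.shape
    diffs = (data - query) / bandwidth                   # (n, d) scaled offsets
    kernel = np.exp(-0.5 * np.sum(diffs ** 2, axis=1))   # Gaussian kernel values
    norm = n * (bandwidth ** d) * (2 * np.pi) ** (d / 2)
    return kernel.sum() / norm

# Example: densities of 3 query points against one million 2-D instances.
rng = np.random.default_rng(0)
data = rng.normal(size=(1_000_000, 2))
queries = rng.normal(size=(3, 2))
print([kde_density(q, data) for q in queries])
```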



Notes

  1. While there are ways to reduce the computational cost of KDE, \(k\)-NN and \(\epsilon \)-neighbourhood, they are usually limited to low-dimensional problems or incur significant preprocessing cost. See Sect. 7 for a discussion.

  2. The implementation of \(T(\cdot )\) used in this paper is a tree-based nonparametric method (one possible form is sketched below). The binomial distribution is required for the error analysis only.
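The note above only states that \(T(\cdot )\) is tree-based, so the following is a rough sketch of one plausible mass-based estimator under assumed choices: an ensemble of random trees built by halving a randomly chosen attribute at each level, with the density of a point taken as the average of mass/(\(n \times\) volume) over the leaf regions that contain it. These splitting and depth choices are illustrative assumptions, not the paper's exact procedure; in the actual method the leaf masses would be recorded when the trees are built, so a query costs only a tree traversal, whereas the linear scan below is kept for brevity.

```python
import numpy as np

def build_tree(lo, hi, depth, rng):
    """Recursively halve a randomly chosen dimension of the region [lo, hi]."""
    if depth == 0:
        return None
    dim = rng.integers(len(lo))
    mid = (lo[dim] + hi[dim]) / 2.0
    left_hi, right_lo = hi.copy(), lo.copy()
    left_hi[dim], right_lo[dim] = mid, mid
    return {
        "dim": dim, "split": mid,
        "left": build_tree(lo, left_hi, depth - 1, rng),
        "right": build_tree(right_lo, hi, depth - 1, rng),
    }

def leaf_volume_and_mass(tree, lo, hi, data, x):
    """Follow x down the tree; return the volume of its leaf region and the
    number of instances (mass) falling in that region.  Assumes every
    attribute has a non-zero range so the volume never collapses to zero."""
    mask = np.ones(len(data), dtype=bool)
    lo, hi = lo.copy(), hi.copy()
    node = tree
    while node is not None:
        dim, split = node["dim"], node["split"]
        if x[dim] < split:
            hi[dim] = split
            mask &= data[:, dim] < split
            node = node["left"]
        else:
            lo[dim] = split
            mask &= data[:, dim] >= split
            node = node["right"]
    return np.prod(hi - lo), mask.sum()

def mass_based_density(x, data, n_trees=25, depth=8, seed=0):
    """Average of mass / (n * volume) over an ensemble of random trees."""
    rng = np.random.default_rng(seed)
    lo, hi = data.min(axis=0), data.max(axis=0)
    n = len(data)
    estimates = []
    for _ in range(n_trees):
        tree = build_tree(lo, hi, depth, rng)
        vol, mass = leaf_volume_and_mass(tree, lo, hi, data, x)
        estimates.append(mass / (n * vol))
    return float(np.mean(estimates))
```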

References

  1. Achtert E, Kriegel H-P, Zimek A (2008) ELKI: a software system for evaluation of subspace clustering algorithms. In: Proceedings of the 20th international conference on scientific and statistical database management, pp 580–585

  2. Angiulli F, Fassetti F (2009) DOLPHIN: an efficient algorithm for mining distance-based outliers in very large datasets. ACM Trans Knowl Discov Data 3(1):4:1–4:57

  3. Bay SD, Schwabacher M (2003) Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 29–38

  4. Beyer KS, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Proceedings of the 7th international conference on database theory, pp 217–235

  5. Beygelzimer A, Kakade S, Langford J (2006) Cover trees for nearest neighbor. In: Proceedings of the 23rd international conference on machine learning, pp 97–104

  6. Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of ACM SIGMOD international conference on management of data, pp 93–104

  7. Catlett J (1991) On changing continuous attributes into ordered discrete attributes. In: Proceedings of the European working session on learning, pp 164–178

  8. Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access method for similarity search in metric spaces. In: Proceedings of the 23rd international conference on very large data bases, pp 426–435

  9. Deegalla S, Bostrom H (2006) Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification. In: Proceedings of the 5th international conference on machine learning and applications, IEEE Computer Society, Washington, pp 245–250

  10. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc Ser B 39(1):1–38

  11. Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: Proceedings of the 12th international conference on machine learning, Morgan Kaufmann, pp 194–202

  12. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD, AAAI Press, pp 226–231

  13. Fayyad UM, Irani KB (1995) Multi-interval discretization of continuous valued attributes for classification learning. In: Proceedings of 14th international joint conference on artificial intelligence, pp 1034–1040

  14. Frank A, Asuncion A (2010) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. URL: http://archive.ics.uci.edu/ml

  15. Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Mach Learn 29:131–163

  16. Hastie T, Tibshirani R, Friedman J (2001) Chapter 8.5: The EM algorithm. In: The elements of statistical learning. Springer, pp 236–243

  17. Hido S, Tsuboi Y, Kashima H, Sugiyama M, Kanamori T (2011) Statistical outlier detection using direct density ratio estimation. Knowl Inf Syst 26(2):309–336

  18. Hinneburg A, Aggarwal CC, Keim DA (2000) What is the nearest neighbor in high dimensional spaces? In: Proceedings of the 26th international conference on very large data bases, pp 506–515

  19. Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of KDD, AAAI Press, pp 58–65

  20. Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. In: Proceedings of conference in modern analysis and probability, contemporary mathematics, vol 26. American Mathematical Society, pp 189–206

  21. Langley P, Iba W, Thompson K (1992) An analysis of Bayesian classifiers. In: Proceedings of the tenth national conference on artificial intelligence, pp 399–406

  22. Langley P, John GH (1995) Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence

  23. Lazarevic A, Kumar V (2005) Feature bagging for outlier detection. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 157–166

  24. Liu FT, Ting KM, Zhou Z-H (2010) On detecting clustered anomalies using sciforest. In: Proceedings of ECML PKDD, pp 274–290

  25. Liu FT, Ting KM, Zhou Z-H (2012) Isolation-based anomaly detection. ACM Trans Knowl Discov Data 6(1):3:1–3:39

  26. Nanopoulos A, Theodoridis Y, Manolopoulos Y (2006) Indexed-based density biased sampling for clustering applications. Data Knowl Eng 57(1):37–63

  27. Rocke DM, Woodruff DL (1996) Identification of outliers in multivariate data. J Am Stat Assoc 91(435):1047–1061

  28. Schölkopf B, Platt JC, Shawe-Taylor JC, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471

  29. Silverman BW (1986) Density estimation for statistics and data analysis. Chapman & Hall, London

  30. Tan P-N, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, Reading

  31. Tan SC, Ting KM, Liu FT (2011) Fast anomaly detection for streaming data. In: Proceedings of IJCAI, pp 1151–1156

  32. Tax DMJ, Duin RPW (2004) Support vector data description. Mach Learn 54(1):45–66

  33. Ting KM, Washio T, Wells JR, Liu FT (2011) Density estimation based on mass. In: Proceedings of the 2011 IEEE 11th international conference on data mining, IEEE Computer Society, pp 715–724

  34. Ting KM, Wells JR (2010) Multi-dimensional mass estimation and mass-based clustering. In: Proceedings of IEEE international conference on data mining, pp 511–520

  35. Ting KM, Zhou G-T, Liu FT, Tan SC (2012) Mass estimation. Mach Learn, pp 1–34. doi:10.1007/s10994-012-5303-x

  36. Vapnik VN (2000) The nature of statistical learning theory, 2nd edn. Springer, Berlin

  37. Vries TD, Chawla S, Houle M (2012) Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst 32:25–52

  38. Webb GI, Boughton JR, Wang Z (2005) Aggregating one-dependence estimators. Mach Learn 58:5–24

  39. Witten IH, Frank E, Hall MA (2011) Data mining: Practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann, San Francisco

  40. Yamanishi K, Takeuchi J-I, Williams G, Milne P (2000) On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining, pp 320–324

Acknowledgments

This work is partially supported by the Air Force Research Laboratory under agreement # FA2386-10-1-4052. The U.S. Government is authorised to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Takashi Washio is partially supported by JSPS (Japan Society for the Promotion of Science) Grant-in-Aid for Scientific Research (B) # 22300054. Sunil Aryal is supported by a Monash University Postgraduate Publications Award to complete the work on DEMass-Bayes. Xiao Yu Ge assisted in the experiments using ELKI. Hiroshi Motoda, Zhouyu Fu and the anonymous reviewers provided many helpful comments that improved this paper.

Author information

Corresponding author

Correspondence to Sunil Aryal.

Additional information

The source codes of DEMass-DBSCAN and DEMass-Bayes are available at http://sourceforge.net/projects/mass-estimation/.

The preliminary version of this paper appeared in Proceedings of the 2011 IEEE International Conference on Data Mining [33].

Appendix: Data characteristic of the RingCurve-Wave-TriGaussian data set

The characteristics of the RingCurve-Wave-TriGaussian data set, used in Sect. 5.1, are shown in Fig. 10. Each of the Ring-Curve, Wave and Triangular-Gaussian subsets is a two-dimensional data set, and together they contain a total of seven clusters, each with 10,000 instances. When used in the scale-up experiment, the data size of each cluster was scaled by factors of 0.1, 1, 75, 150 and up to 1,500.

Fig. 10 Scatter plot of the clusters in the RingCurve-Wave-TriGaussian data set
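Since the exact composition of the seven clusters is not spelled out here, the generator below is only a hypothetical reconstruction for readers who want a similar test bed: one ring, one curve, two waves and three Gaussian blobs of 10,000 points each. All shapes, offsets and noise levels are illustrative guesses rather than the authors' settings; scaling n_per_cluster by the factors listed above mimics the spirit of the scale-up experiment.

```python
import numpy as np

def ring_curve_wave_trigaussian(n_per_cluster=10_000, seed=0):
    """Generate a 2-D data set loosely resembling the appendix description:
    seven clusters of n_per_cluster points each (illustrative parameters)."""
    rng = np.random.default_rng(seed)
    n = n_per_cluster
    clusters = []

    # Ring-Curve: a noisy circle and a parabolic curve, placed side by side.
    angles = rng.uniform(0, 2 * np.pi, n)
    ring = np.column_stack([np.cos(angles), np.sin(angles)]) + rng.normal(0, 0.05, (n, 2))
    xs = rng.uniform(-1, 1, n)
    curve = np.column_stack([xs, xs ** 2]) + rng.normal(0, 0.05, (n, 2))
    clusters += [ring, curve + [3.0, 0.0]]

    # Wave: two parallel sine waves.
    for offset in (0.0, 1.0):
        xs = rng.uniform(0, 4, n)
        wave = np.column_stack([xs, np.sin(np.pi * xs) + offset]) + rng.normal(0, 0.05, (n, 2))
        clusters.append(wave + [6.0, 0.0])

    # Tri-Gaussian: three Gaussian blobs at the corners of a triangle.
    for centre in ([12.0, 0.0], [14.0, 0.0], [13.0, 2.0]):
        clusters.append(rng.normal(centre, 0.3, (n, 2)))

    labels = np.repeat(np.arange(7), n)
    return np.vstack(clusters), labels

# Example: build the base-size data set (7 x 10,000 instances).
X, y = ring_curve_wave_trigaussian()
print(X.shape, np.bincount(y))
```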


About this article

Cite this article

Ting, K.M., Washio, T., Wells, J.R. et al. DEMass: a new density estimator for big data. Knowl Inf Syst 35, 493–524 (2013). https://doi.org/10.1007/s10115-013-0612-3

