Dirichlet Process Mixture Models with Pairwise Constraints for Data Clustering

Annals of Data Science

Abstract

The Dirichlet process mixture (DPM) model, a typical Bayesian nonparametric model, can infer the number of clusters automatically, which makes it attractive for data clustering. This paper investigates the influence of pairwise constraints on the DPM model. Pairwise constraints come in two types, must-link (ML) and cannot-link (CL), and each indicates the relationship between two data points. We propose two models that incorporate pairwise constraints: the constrained DPM (C-DPM) and the constrained DPM with selected constraints (SC-DPM). C-DPM introduces the concept of a chunklet: ML constraints are compiled into chunklets, and CL constraints exist between chunklets. We derive a Gibbs sampler for C-DPM based on chunklets. We further propose a principled approach to selecting the most useful constraints, which are then incorporated into SC-DPM. We evaluate the proposed models on three real datasets: the 20 Newsgroups dataset, the NUS-WIDE image dataset, and a Facebook comments dataset that we collected ourselves. SC-DPM achieves superior clustering performance. In addition, SC-DPM can potentially be used for short-text clustering.
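The chunklet idea in the abstract can be illustrated with a small sketch. The paper's own construction is not visible in this preview, so the code below assumes the common reading: a chunklet is a connected component of the must-link graph (the transitive closure of the ML constraints), and each point-level CL constraint is lifted to a constraint between the two chunklets that contain its endpoints. The function names `build_chunklets` and `lift_cannot_links` are hypothetical and chosen only for illustration.

```python
def build_chunklets(n_points, must_links):
    """Return a chunklet id for every point (assumption: chunklets are
    connected components of the must-link graph)."""
    parent = list(range(n_points))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in must_links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Relabel roots as consecutive chunklet ids in order of first appearance.
    root_to_id = {}
    labels = []
    for i in range(n_points):
        root = find(i)
        if root not in root_to_id:
            root_to_id[root] = len(root_to_id)
        labels.append(root_to_id[root])
    return labels


def lift_cannot_links(labels, cannot_links):
    """Map point-level CL pairs to unordered chunklet-level pairs."""
    pairs = set()
    for a, b in cannot_links:
        ca, cb = labels[a], labels[b]
        if ca == cb:
            # The same two points cannot be both linked and separated.
            raise ValueError(f"CL constraint ({a}, {b}) contradicts the ML constraints")
        pairs.add((min(ca, cb), max(ca, cb)))
    return pairs


# Toy example: six points, three ML constraints, three CL constraints.
labels = build_chunklets(6, [(0, 1), (1, 2), (3, 4)])
chunklet_cl = lift_cannot_links(labels, [(2, 3), (0, 4), (5, 0)])
print(labels)               # [0, 0, 0, 1, 1, 2]
print(sorted(chunklet_cl))  # [(0, 1), (0, 2)]
```

In this toy example, points 0, 1 and 2 form one chunklet, points 3 and 4 form another, and the unconstrained point 5 is a chunklet of its own. A sampler built on this representation would assign one cluster label per chunklet rather than per point, which is how we read "Gibbs sampling based on chunklets"; the sketch does not implement the sampler itself.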

Notes

  1. http://qwone.com/~jason/20Newsgroups/.

Corresponding author

Correspondence to Cheng Li.

About this article

Cite this article

Li, C., Rana, S., Phung, D. et al. Dirichlet Process Mixture Models with Pairwise Constraints for Data Clustering. Ann. Data. Sci. 3, 205–223 (2016). https://doi.org/10.1007/s40745-016-0082-z

  • DOI: https://doi.org/10.1007/s40745-016-0082-z
