Data clustering using side information dependent Chinese restaurant processes

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Side information, that is, auxiliary information associated with documents or image content, provides hints for clustering. We propose a new model, the side information dependent Chinese restaurant process, which exploits side information in a Bayesian nonparametric model to improve data clustering. We introduce side information into the framework of the distance dependent Chinese restaurant process, using a robust decay function to handle noisy side information. The threshold parameter of the decay function is updated automatically during Gibbs sampling, and we propose a fast inference algorithm. We evaluate our approach on four datasets: Cora, 20 Newsgroups, NUS-WIDE and one medical dataset. The types of side information explored in this paper include citations, authors, tags, keywords and auxiliary clinical information. In comparisons with state-of-the-art approaches on standard performance measures (NMI, F1), our approach achieves better clustering performance.
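
As a rough illustration of the mechanism described in the abstract, the sketch below shows how a distance dependent Chinese restaurant process prior can be driven by side information similarities. The abstract does not give the exact form of the robust decay function, so `thresholded_decay` is an assumed stand-in that simply discards similarities below a threshold; the names `link_prior`, `clusters_from_links`, `alpha` and `threshold` are likewise hypothetical. Only the prior is shown; in a full Gibbs sampler it would be combined with the data likelihood.

```python
def thresholded_decay(similarity, threshold):
    """Assumed decay: keep a similarity only if it reaches the threshold."""
    return similarity if similarity >= threshold else 0.0


def link_prior(similarities, i, alpha, threshold):
    """Prior over the customer that customer i links to.

    A self-link (j == i) gets weight alpha; a link to another customer j
    gets the decayed side information similarity between i and j.
    """
    weights = [
        alpha if j == i else thresholded_decay(s, threshold)
        for j, s in enumerate(similarities[i])
    ]
    total = sum(weights)  # positive whenever alpha > 0
    return [w / total for w in weights]


def clusters_from_links(links):
    """Label customers by connected component of the link graph."""
    parent = list(range(len(links)))

    def find(x):
        # Follow parent pointers to the root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in enumerate(links):
        parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(links))]


# Toy side information similarities between three customers (made-up values).
similarities = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.05],
    [0.1, 0.05, 1.0],
]
prior_for_first = link_prior(similarities, 0, alpha=1.0, threshold=0.5)
cluster_labels = clusters_from_links([1, 1, 2])
```

With the threshold at 0.5, the weak similarity between the first and third customers contributes nothing to the prior, which is one simple way noisy side information could be suppressed. Customers joined by a chain of links end up at the same table.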

Notes

  1. Two kinds of links are excluded from the computation of the threshold \(\text {T}\). First, a link connecting a customer to itself has similarity equal to 1, which does not affect the sampling process. Second, a link with similarity equal to 0 cannot reflect the real relationship between two customers.

  2. http://people.cs.umass.edu/~mccallum/code-data.html.

  3. http://scgroup20.ceid.upatras.gr:8000/tmg/.

  4. http://qwone.com/~jason/20Newsgroups/.

  5. Ethics approval obtained through University and the hospital Number 12/83.

Acknowledgments

We thank anonymous reviewers for their very useful comments and suggestions.

Author information

Corresponding author

Correspondence to Cheng Li.

About this article

Cite this article

Li, C., Rana, S., Phung, D. et al. Data clustering using side information dependent Chinese restaurant processes. Knowl Inf Syst 47, 463–488 (2016). https://doi.org/10.1007/s10115-015-0834-7

  • DOI: https://doi.org/10.1007/s10115-015-0834-7