Abstract
Side information, or auxiliary information associated with documents or image content, provides hints for clustering. We propose a new model, the side information dependent Chinese restaurant process, which exploits side information in a Bayesian nonparametric model to improve data clustering. We introduce side information into the framework of the distance dependent Chinese restaurant process using a robust decay function to handle noisy side information. The threshold parameter of the decay function is updated automatically during Gibbs sampling. We also propose a fast inference algorithm. We evaluate our approach on four datasets: Cora, 20 Newsgroups, NUS-WIDE and a medical dataset. The types of side information explored in this paper include citations, authors, tags, keywords and auxiliary clinical information. A comparison with state-of-the-art approaches on standard performance measures (NMI, F1) shows the superiority of our approach.
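To make the role of the decay function concrete, the following is a minimal sketch of a link prior in the style of the distance dependent Chinese restaurant process, in which customer i links to another customer j with weight given by a decay of their side-information similarity and links to itself with weight alpha. The hard-threshold decay used here, and the names `thresholded_decay`, `link_prior`, `alpha` and `threshold`, are assumptions made for illustration only; the abstract does not specify the exact form of the decay function.

```python
def thresholded_decay(similarity, threshold):
    """Hypothetical decay: keep a similarity only if it reaches the threshold."""
    return similarity if similarity >= threshold else 0.0


def link_prior(similarities, alpha, threshold):
    """Return, for each customer i, the prior probability of linking to each j.

    similarities[i][j] is a side-information similarity in [0, 1].
    A self-link (j == i) receives weight alpha (assumed > 0), so every
    row has a positive normalizer.
    """
    n = len(similarities)
    prior = []
    for i in range(n):
        weights = [
            alpha if j == i else thresholded_decay(similarities[i][j], threshold)
            for j in range(n)
        ]
        total = sum(weights)
        prior.append([w / total for w in weights])
    return prior


# Toy similarity matrix for three customers (illustrative values only).
sims = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]
prior = link_prior(sims, alpha=1.0, threshold=0.3)
```

In this toy example the similarity 0.1 between customers 0 and 2 falls below the threshold, so the corresponding links get zero prior probability, which is how a thresholded decay can suppress weak and possibly noisy side information.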
Notes
Two kinds of links are excluded from the computation of the threshold \(T\). The first is a link connecting a customer to itself, which has similarity equal to 1 and does not affect the sampling process. The second is a link with similarity equal to 0, which cannot reflect the real relationship between two customers.
Ethics approval was obtained through the University and the hospital (Number 12/83).
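The link exclusion described in the first note can be sketched as follows. This is only an illustration of the filtering step: the function name `threshold_candidates` and the nested-list similarity matrix are assumptions, and the sketch deliberately stops before computing the threshold itself, since the note does not say how the threshold is derived from the remaining links.

```python
def threshold_candidates(similarities):
    """Collect the similarities that remain eligible for the threshold computation.

    Self-links (i == j, similarity 1) and links with similarity 0 are skipped,
    following the two cases described in the note.
    """
    n = len(similarities)
    values = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # self-link: does not affect the sampling process
            s = similarities[i][j]
            if s == 0:
                continue  # zero similarity: carries no real relationship
            values.append(s)
    return values


# Toy similarity matrix for three customers (illustrative values only).
sims = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.4],
    [0.0, 0.4, 1.0],
]
candidates = threshold_candidates(sims)
```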
Acknowledgments
We thank the anonymous reviewers for their very useful comments and suggestions.
Cite this article
Li, C., Rana, S., Phung, D. et al. Data clustering using side information dependent Chinese restaurant processes. Knowl Inf Syst 47, 463–488 (2016). https://doi.org/10.1007/s10115-015-0834-7
DOI: https://doi.org/10.1007/s10115-015-0834-7