A comparative study of data-dependent approaches without learning in measuring similarities of data objects

Published in: Data Mining and Knowledge Discovery 34, 124–162 (2020)

Abstract

Conventional general-purpose distance-based similarity measures, such as Minkowski distance (also known as the \(\ell _p\)-norm with \(p>0\)), are data-independent and sensitive to units or scales of measurement. There are existing general-purpose data-dependent measures, such as rank difference, Lin’s probabilistic measure and \(m_p\)-dissimilarity (\(p>0\)), which are not sensitive to units or scales of measurement. Although they have been shown to be more effective than traditional distance measures, their characteristics and relative performances have not been investigated. In this paper, we study the characteristics and relationships of different general-purpose data-dependent measures. We generalise \(m_p\)-dissimilarity to \(p\ge 0\) by introducing \(m_0\)-dissimilarity and show that it is a generic data-dependent measure with data-dependent self-similarity, of which rank difference and Lin’s measure are special cases with data-independent self-similarity. We evaluate the effectiveness of a wide range of general-purpose data-dependent and data-independent measures in content-based information retrieval and kNN classification tasks. Our findings show that the fully data-dependent measure \(m_p\)-dissimilarity is a more effective alternative to other data-dependent and commonly used distance-based similarity measures, as its task-specific performance is more consistent across a wide range of datasets.
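
As a concrete illustration of the contrast drawn in the abstract, the following Python sketch compares Minkowski distance with a simplified mass-based dissimilarity in the spirit of \(m_p\), using equal-width binning per dimension. It is our own illustrative approximation, not the exact formulation defined in the paper, and the function and parameter names (minkowski, mp_dissimilarity) are ours.

```python
import numpy as np

def minkowski(x, y, p=2.0):
    """Data-independent ell_p distance; sensitive to units and scales of measurement."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mp_dissimilarity(x, y, data, p=1.0):
    """Simplified m_p-style data-dependent dissimilarity.

    In each dimension the term is the fraction of data points whose values fall in
    the span of equal-width bins enclosing x_i and y_i, i.e. probability mass rather
    than geometric gap.  With p = 0 the geometric mean of the masses is used (an
    m_0-style measure), so self-dissimilarity depends on the local data density.
    """
    N, M = data.shape
    eta = int(np.floor(np.log2(N))) + 1          # number of intervals per dimension
    terms = np.empty(M)
    for i in range(M):
        col = data[:, i]
        edges = np.linspace(col.min(), col.max(), eta + 1)
        bins = lambda v: np.clip(np.searchsorted(edges, v, side="right") - 1, 0, eta - 1)
        bx, by, bd = bins(x[i]), bins(y[i]), bins(col)
        lo, hi = min(bx, by), max(bx, by)
        mass = max(np.sum((bd >= lo) & (bd <= hi)), 1)   # data mass of the enclosing region
        terms[i] = mass / N
    if p > 0:
        return float(np.mean(terms ** p) ** (1.0 / p))
    return float(np.exp(np.mean(np.log(terms))))          # p = 0: geometric mean

# Toy usage: rescaling one feature changes the Minkowski distance
# but leaves the mass-based dissimilarity unchanged.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
a, b = X[0], X[1]
X_scaled = X * np.array([1.0, 1000.0, 1.0])               # change the unit of one feature
print(minkowski(a, b), minkowski(X_scaled[0], X_scaled[1]))
print(mp_dissimilarity(a, b, X), mp_dissimilarity(X_scaled[0], X_scaled[1], X_scaled))
```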

Notes

  1. Similarity is the inverse of dissimilarity. We use dissimilarity in this paper to be consistent with distance measures.

  2. Author(s) defined it as a similarity measure, but we define it as a dissimilarity measure to be consistent with other measures.

  3. Author(s) defined it as a similarity measure, but we define it as a dissimilarity measure to be consistent with other measures.

  4. Because \(A=\frac{1}{M}\), both numerator and denominator approach 0 when \(p\rightarrow 0\).

  5. http://web.ist.utl.pt/acardoso/datasets/.

  6. http://www.cs.waikato.ac.nz/ml/weka/datasets.html.

  7. http://aloi.science.uva.nl/.

  8. https://www.cs.toronto.edu/~kriz/cifar.html.

  9. http://tunedit.org/challenge/music-retrieval.

  10. https://archive.ics.uci.edu/ml/datasets.html.

  11. Fernando and Webb (2017) have shown that it is a better alternative than other settings of p. Hereafter, to simplify notation, we refer to \(d_{rank}(\mathbf{x}, \mathbf{y}, 1)\) as \(d_{rank}(\mathbf{x}, \mathbf{y})\).

  12. We also examined whether the geometric mean of rank differences produced better results than the arithmetic mean (\(d_{rank}\)), but observed that it produced worse results than \(d_{rank}\) in all six datasets.

  13. https://github.com/michaelstewart/metric-learn.

  14. The BoW text datasets were not used because there is no issue of scales and units of measurement as feature values are frequency counts.

  15. We did not use the other datasets from Sect. 6.3.2 because we only had the BoW vectors, not the actual text of the documents.

  16. https://github.com/hank110/bag-of-concepts.

  17. \(d_{cos}\) here is equivalent to \(d_{cosIdf}\) in Sect. 6.3.2 because IDF weights were applied as a part of BoC vector representation.

References

  • Ariyaratne HB, Zhang D (2012) A novel automatic hierachical approach to music genre classification. In: Proceedings of the 2012 IEEE international conference on multimedia and expo workshops. IEEE Computer Society, Washington DC, pp 564–569

  • Aryal S (2018) Anomaly detection technique robust to units and scales of measurement. In: Proceedings of the 22nd Pacific-Asia conference on knowledge discovery and data mining. Springer, Cham, pp 589–601

  • Aryal S, Ting KM, Haffari G, Washio T (2014a) Mp-dissimilarity: a data dependent dissimilarity measure. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 707–712

  • Aryal S, Ting K, Wells JR, Washio T (2014b) Improving iforest with relative mass. In: Proceedings of the 18th Pacific-Asia conference on knowledge discovery and data mining, pp 510–521

  • Aryal S, Ting KM, Haffari G, Washio T (2015) Beyond tf-idf and cosine distance in documents dissimilarity measure. In: Proceedings of the 11th Asia information retrieval societies conference, pp 400–406

  • Aryal S, Ting KM, Washio T, Haffari G (2017) Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl Inf Syst 53(2):479–506

  • Aryal S, Ting KM, Washio T, Haffari G (2019) A new simple and effective measure for bag-of-word inter-document similarity measurement. CoRR arXiv:1902.03402

  • Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828

  • Beyer KS, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “Nearest Neighbor” meaningful? In: Proceedings of the 7th international conference on database theory. Springer, London, pp 217–235

  • Black M (1952) The identity of indiscernibles. MIND: Q Rev Psychol Philos 61(242):153–164

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  • Cardoso-Cachopo A (2007) Improving methods for single-label text categorization. PhD thesis, Instituto Superior Tecnico, Technical University of Lisbon, Lisbon

  • Conover WJ, Iman RL (1981) Rank transformations as a bridge between parametric and nonparametric statistics. Am Stat 35(3):124–129

  • Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge

  • Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the twentieth annual symposium on computational geometry, pp 253–262

  • Deza MM, Deza E (2009) Encyclopedia of distances. Springer, Berlin

  • Dua D, Graff C (2017) UCI machine learning repository. School of Information and Computer Sciences, University of California, Irvine. http://archive.ics.uci.edu/ml

  • Duda RO, Hart PE, Stork DG (2000) Pattern classification, 2nd edn. Wiley-Interscience, New York

  • Fernando TL, Webb GI (2017) SimUSF: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption. Data Min Knowl Disc 31(1):264–286

  • François D, Wertz V, Verleysen M (2007) The concentration of fractional distances. IEEE Trans Knowl Data Eng 19(7):873–886

  • Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Ranzato M, Mikolov T (2013) DeViSE: a deep visual-semantic embedding model. In: Proceedings of NIPS, pp 2121–2129

  • Geusebroek J-M, Burghouts GJ, Smeulders AW (2005) The Amsterdam library of object images. Int J Comput Vis 61(1):103

  • Gong Y, Kumar S, Rowley HA, Lazebnik S (2013) Learning binary codes for high-dimensional data using bilinear projections. In: Proceedings of the 2013 IEEE conference on computer vision and pattern recognition, pp 484–491

  • Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge

  • Han E-H, Karypis G (2000) Centroid-based document classification: analysis and experimental results. In: Proceedings of the 4th European conference on principles of data mining and knowledge discovery. Springer, London, pp 424–431

  • Kiela D, Bottou L (2014) Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, pp 36–45

  • Kim HK, Kim H, Cho S (2017) Bag-of-concepts: comprehending document representation through clustering words in distributed representation. Neurocomputing 266:336–352

  • Krizhevsky A (2009) Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto

  • Krumhansl CL (1978) Concerning the applicability of geometric models to similarity data: the interrelationship between similarity and spatial density. Psychol Rev 85(5):445–463

  • Kulis B (2013) Metric learning: a survey. Found Trends Mach Learn 5(4):287–364

  • LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444

  • Li P, Shrivastava A, Moore JL, König AC (2011) Hashing algorithms for large-scale learning. Adv Neural Inf Process Syst 24:2672–2680

  • Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the fifteenth international conference on machine learning (ICML). Morgan Kaufmann Publishers Inc., San Francisco, pp 296–304

  • Lin K, Yang H, Hsiao J, Chen C (2015) Deep learning of binary hash codes for fast image retrieval. In: 2015 IEEE conference on computer vision and pattern recognition workshops (CVPRW), pp 27–35

  • Liu F, Ting KM, Zhou Z-H (2008) Isolation forest. In: Proceedings of the eighth IEEE international conference on data mining, pp 413–422

  • Mahalanobis PC (1936) On the generalized distance in statistics. Proc Natl Inst Sci India 2:49–55

  • Mansouri J, Khademi M (2015) Multiplicative distance: a method to alleviate distance instability for high-dimensional data. Knowl Inf Syst 45(3):783–805

  • Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013a) Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th international conference on neural information processing systems. Curran Associates Inc., USA, pp 3111–3119

  • Nguyen N, Guo Y (2008) Metric learning: a support vector approach. In: Proceedings of the ECML PKDD 2008. Springer, Berlin, pp 125–136

  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830

  • Ren W, Yu Y, Zhang J, Huang K (2014) Learning convolutional nonlinear features for k nearest neighbor image classification. In: Proceedings of the 22nd international conference on pattern recognition, pp 4358–4363

  • Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manag 24(5):513–523

  • Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill Inc, New York

  • Shi T, Horvath S (2006) Unsupervised learning with random forest predictors. J Comput Graph Stat 15(1):118–138

  • Song D, Liu W, Ji R, Meyer DA, Smith JR (2015) Top rank supervised binary coding for visual search. In: Proceedings of the 2015 IEEE international conference on computer vision (ICCV), pp 1922–1930

  • Stevens SS (1946) On the theory of scales of measurement. Science 103(2684):677–680

  • Stewart M (2015) Metric learning algorithms in python. GitHub repository. https://github.com/michaelstewart/metric-learn

  • Sturges HA (1926) The choice of a class interval. J Am Stat Assoc 21(153):65–66

  • Tan P-N, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, Boston

  • Ting KM, Zhu Y, Carman M, Zhu Y, Zhou Z-H (2016) Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1205–1214

  • Torkkola K, Tuv E (2005) Ensemble learning with supervised kernels. In: Proceedings of the 16th European conference on machine learning, ECML’05. Springer, Berlin, pp 400–411

  • Tsang IW, Kwok JT, Bay CW (2003) Distance metric learning with kernels. In: Proceedings of the international conference on artificial neural networks, pp 126–129

  • Tversky A (1977) Features of similarity. Psychol Rev 84(4):327–352

  • Wang F, Sun J (2015) Survey on distance metric learning and dimensionality reduction in data mining. Data Min Knowl Disc 29(2):534–564

  • Wang J, Do HT, Woznica A, Kalousis A (2011) Metric learning with multiple kernels. In: Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ (eds) Advances in neural information processing systems, vol 24. Curran Associates, Inc., pp 1170–1178

  • Wang J, Shen HT, Song J, Ji J (2014) Hashing for similarity search: a survey. CoRR arXiv:1408.2927

  • Wang J, Zhang T, Song J, Sebe N, Shen HT (2018) A survey on learning to hash. IEEE Trans Pattern Anal Mach Intell 40(4):769–790

  • Weinberger K, Blitzer J, Saul L (2005) Distance metric learning for large margin nearest neighbor classification. In: Proceedings of the advances in neural information processing systems. MIT Press, Cambridge

  • Xu Z, Weinberger KQ, Chapelle O (2013) Distance metric learning for kernel machines. Technical report, arXiv:1208.3422v2

  • Zhong G, Wang L-N, Ling X, Dong J (2016) An overview on data representation learning: from traditional feature learning to recent deep learning. J Finance Data Sci 2(4):265–278

  • Zhou G-T, Ting KM, Liu FT, Yin Y (2012) Relevance feature mapping for content-based multimedia information retrieval. Pattern Recognit 45(4):1707–1720

Acknowledgements

The authors would like to thank A/Prof Peter Vamplew for interesting discussions and useful feedback on the first draft of the manuscript, and the anonymous reviewers for their valuable comments and suggestions to improve the manuscript.

Author information

Corresponding author

Correspondence to Sunil Aryal.

Additional information

Responsible editor: Srinivasan Parthasarathy, Johannes Fürnkranz.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Empirical evaluation in documents represented as bag-of-concepts

Here we report the performance of the BoW versions of the data-dependent measures \(d_{rank}, m_1, d_{lin}\) and \(m_0\) (discussed in Sect. 4) against cosine distance (\(d_{cos}\)) and a dissimilarity based on the simple dot product (\(d_{dot}\)) on text datasets where documents are represented as bag-of-concepts (BoC) vectors (Kim et al. 2017).
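
For reference, the sketch below gives a rank-difference dissimilarity in the spirit of \(d_{rank}\) (cf. footnotes 11 and 12): the arithmetic mean of per-dimension absolute rank differences, normalised by the data size. The exact definition used in the paper is given in Sect. 4; the treatment of ties and the normalisation here, as well as the function name, are our own simplifications.

```python
import numpy as np

def rank_dissimilarity(x, y, data):
    """d_rank-style measure: arithmetic mean over dimensions of the
    normalised absolute difference of the ranks of x_i and y_i."""
    N, M = data.shape
    diffs = np.empty(M)
    for i in range(M):
        # rank of a query value = number of data points with value <= it
        rx = np.sum(data[:, i] <= x[i])
        ry = np.sum(data[:, i] <= y[i])
        diffs[i] = abs(rx - ry) / N
    return float(diffs.mean())      # arithmetic mean (cf. footnote 12)
```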

In the BoC representation, concepts are defined by clustering similar words in the embedding space produced by word2vec (Mikolov et al. 2013a), a word embedding technique based on deep neural networks (LeCun et al. 2015; Goodfellow et al. 2016). The BoC representation has been shown to produce better results than BoW or doc2vec (a word-embedding-based vector representation of documents) in the document classification task (Kim et al. 2017). We conducted experiments in the 5NN document classification task using three text datasets, NG20, R52 and R8 (the same datasets used by Kim et al. (2017))Footnote 15, where documents were represented by 500-dimensional BoC vectors. We used the Python implementation of the BoC representation provided by the authors (Kim et al. 2017).Footnote 16
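
As a rough sketch of how such BoC vectors can be constructed, the code below clusters word embeddings into concepts with scikit-learn's KMeans and then counts and IDF-weights concept occurrences per document. The experiments reported here used the authors' own implementation (see footnote 16); this sketch only conveys the idea, and names such as build_boc and word_vectors (an assumed dict of pre-trained embeddings) are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_boc(docs, word_vectors, n_concepts=500):
    """Build IDF-weighted bag-of-concepts vectors.

    docs: list of tokenised documents (lists of words).
    word_vectors: dict mapping word -> embedding vector (assumed pre-trained).
    Returns an (n_docs, n_concepts) matrix of IDF-weighted concept counts.
    """
    vocab = sorted({w for d in docs for w in d if w in word_vectors})
    # assumes len(vocab) >= n_concepts
    emb = np.vstack([word_vectors[w] for w in vocab])
    # group similar words into concepts by clustering their embeddings
    labels = KMeans(n_clusters=n_concepts, n_init=10, random_state=0).fit_predict(emb)
    concept_of = dict(zip(vocab, labels))
    counts = np.zeros((len(docs), n_concepts))
    for j, d in enumerate(docs):
        for w in d:
            if w in concept_of:
                counts[j, concept_of[w]] += 1
    # IDF weighting (cf. footnote 17: cosine on BoC vectors already includes IDF)
    df = np.count_nonzero(counts, axis=0) + 1
    return counts * np.log((1 + len(docs)) / df)
```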

The average 5NN classification errors and standard errors over a 10-fold cross-validation of \(d_{cos}\)Footnote 17, \(d_{dot}\) and the BoW variants of \(d_{rank}, m_1, d_{lin}\) and \(m_0\) are provided in Table 14. The results show that \(m_0\) produced the best result, or one equivalent to the best, in all three datasets, followed by \(d_{rank}\) in two datasets, and \(d_{lin}\) and \(d_{cos}\) in one dataset each. This is consistent with the results in Sect. 6.3.2, where \(m_0\) also produced the best results overall.

The dissimilarity measure based on the simple dot product (\(d_{dot}\)) produced significantly worse results than the other contenders in all datasets. It is interesting to note that, like \(m_p\), \(d_{dot}\) does not have a constant self-similarity. However, the self-similarity of \(d_{dot}\) is not data-dependent, as it depends solely on the feature values of the instance; moreover, \(d_{dot}\) does not consider self-similarity in each dimension separately.
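
To make this point concrete, the following small sketch contrasts the two baselines. The exact way the dot product is converted into a dissimilarity in the paper may differ; a simple negation is used here purely for illustration, and the function names are ours.

```python
import numpy as np

def d_cos(x, y):
    """Cosine distance: self-dissimilarity d_cos(x, x) is always 0."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def d_dot(x, y):
    """Dot-product-based dissimilarity (illustrative negation): the self-value
    -x.x depends only on x's own feature values, not on the data distribution,
    and there is no per-dimension treatment of self-similarity."""
    return -float(np.dot(x, y))

x = np.array([3.0, 0.0, 1.0])
print(d_cos(x, x))   # ~0.0 regardless of x
print(d_dot(x, x))   # -10.0, varies with the magnitude of x
```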

Table 14 Average 5NN classification error and standard error (within the parentheses in small font) over a 10-fold cross-validation in three text datasets with BoC vector representations

Appendix B: Effect of ensemble size in tree-based data-dependent measures

In order to investigate the effect of the ensemble size (t) on the tree-based data-dependent measures (\(d_{USF}\) and \(m_{IF}\)), we evaluated their task-specific performance with varying numbers of trees. The CBIR performances of \(d_{USF}\) and \(m_{IF}\) with up to \(t=1000\) trees in the Corel and Hba datasets are shown in Fig. 8.

Fig. 8 Average MAP@25 in the Corel (\(M=67\)) and Hba (\(M=187\)) datasets with different ensemble sizes

As expected, the MAP@25 of both measures increased as t increased in both datasets. However, even with \(t=1000\), they did not produce retrieval results competitive with the one-dimensional data-dependent measure \(m_0\), although the number of dimensions (M) in both datasets is much less than 1000: Corel (\(M=67\)) and Hba (\(M=187\)). This result shows that tree-based methods require a large ensemble size (much larger than M) to produce good results, and using such a large t makes them expensive to run. For example, the average total runtime (building trees, pre-processing and retrieval) of one run on the Corel dataset with \(t=1000\) was 809 seconds for \(d_{USF}\) and 2216 seconds for \(m_{IF}\), whereas \(m_0\) took only 281 seconds.
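
To illustrate how the tree-based measures work and why the ensemble size t matters, below is a simplified sketch of a mass-based dissimilarity in the spirit of \(m_{IF}\) (Ting et al. 2016): each random tree partitions the data iForest-style, and the dissimilarity of two points is the average over trees of the data mass of the deepest node containing both. This is our own minimal reimplementation for illustration (function names are ours), not the code evaluated above; real implementations differ (e.g., per-tree subsampling), but it shows why each individual tree gives only a coarse random partition, so many trees are needed for a stable estimate.

```python
import numpy as np

def build_tree(idx, data, rng, depth=0, max_depth=10):
    """Random axis-parallel partitioning (iForest-style).
    Each node stores the data mass (number of training points) it covers."""
    node = {"mass": len(idx)}
    if depth >= max_depth or len(idx) <= 1:
        return node
    f = rng.integers(data.shape[1])
    vals = data[idx, f]
    if vals.min() == vals.max():
        return node
    split = rng.uniform(vals.min(), vals.max())
    node.update(feature=f, split=split,
                left=build_tree(idx[vals < split], data, rng, depth + 1, max_depth),
                right=build_tree(idx[vals >= split], data, rng, depth + 1, max_depth))
    return node

def shared_mass(node, x, y):
    """Mass of the deepest node that contains both x and y."""
    while "feature" in node:
        gx = x[node["feature"]] < node["split"]
        gy = y[node["feature"]] < node["split"]
        if gx != gy:
            break
        child = node["left"] if gx else node["right"]
        if child["mass"] == 0:
            break
        node = child
    return node["mass"]

def mass_dissimilarity(x, y, data, t=100, max_depth=10, seed=0):
    """Average, over t random trees, of the normalised mass of the smallest
    region covering both x and y: large shared mass = dissimilar points."""
    rng = np.random.default_rng(seed)
    trees = [build_tree(np.arange(len(data)), data, rng, 0, max_depth) for _ in range(t)]
    return float(np.mean([shared_mass(tr, x, y) / len(data) for tr in trees]))
```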

Appendix C: Effectiveness of equal-frequency and equal-width discretisation approaches to speed up one-dimensional data-dependent measures

We evaluated the performance of the one-dimensional data-dependent measures with equal-width discretisation (EWD) and equal-frequency discretisation (EFD) in the CBIR task using the Corel and Hba datasets. We used the same number of intervals \(\eta =\lfloor \log _2 N\rfloor +1\) with both approaches, so the only difference between them was how the interval boundaries were set. The average MAP@k of \(d_{rank}\), \(d_{lin}\), \(m_1\) and \(m_0\) over 10 runs in the Corel and Hba datasets with EFD and EWD is provided in Fig. 9.
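
For clarity, a minimal sketch of the two discretisation schemes is given below, using the same number of intervals \(\eta =\lfloor \log _2 N\rfloor +1\) as above; the helper names are ours.

```python
import numpy as np

def n_intervals(N):
    """eta = floor(log2(N)) + 1 intervals, as used in the text."""
    return int(np.floor(np.log2(N))) + 1

def ewd_edges(col, eta):
    """Equal-width discretisation: eta intervals of equal length."""
    return np.linspace(col.min(), col.max(), eta + 1)

def efd_edges(col, eta):
    """Equal-frequency discretisation: eta intervals holding roughly equal numbers of points."""
    return np.quantile(col, np.linspace(0.0, 1.0, eta + 1))

# Example: on skewed data the two schemes place interval boundaries very differently.
col = np.random.default_rng(0).exponential(size=1000)
eta = n_intervals(len(col))          # 10 intervals for N = 1000
print(ewd_edges(col, eta))
print(efd_edges(col, eta))
```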

Fig. 9 Average MAP@25 of \(d_{rank}\), \(d_{lin}\), \(m_1\) and \(m_0\) over 10 runs in the Corel and Hba datasets with equal-frequency discretisation (EFD) and equal-width discretisation (EWD)

The CBIR results in Fig. 9 show that EFD produced results that were better than, or at least competitive with, EWD; it did not produce worse retrieval results than EWD in any case.

About this article

Cite this article

Aryal, S., Ting, K.M., Washio, T. et al. A comparative study of data-dependent approaches without learning in measuring similarities of data objects. Data Min Knowl Disc 34, 124–162 (2020). https://doi.org/10.1007/s10618-019-00660-0
