A comparative study of data-dependent approaches without learning in measuring similarities of data objects

Published in: Data Mining and Knowledge Discovery 34, 124–162 (2020)

Abstract

Conventional general-purpose distance-based similarity measures, such as Minkowski distance (also known as the \(\ell _p\)-norm with \(p>0\)), are data-independent and sensitive to units or scales of measurement. There are existing general-purpose data-dependent measures, such as rank difference, Lin’s probabilistic measure and \(m_p\)-dissimilarity (\(p>0\)), which are not sensitive to units or scales of measurement. Although they have been shown to be more effective than traditional distance measures, their characteristics and relative performances have not been investigated. In this paper, we study the characteristics and relationships of different general-purpose data-dependent measures. We generalise \(m_p\)-dissimilarity to \(p\ge 0\) by introducing \(m_0\)-dissimilarity and show that it is a generic data-dependent measure with data-dependent self-similarity, of which rank difference and Lin’s measure are special cases with data-independent self-similarity. We evaluate the effectiveness of a wide range of general-purpose data-dependent and data-independent measures in content-based information retrieval and kNN classification tasks. Our findings show that the fully data-dependent measure \(m_p\)-dissimilarity is a more effective alternative to other data-dependent and commonly used distance-based similarity measures, as its task-specific performance is more consistent across a wide range of datasets.
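
As a concrete illustration of the contrast drawn in the abstract, the following Python sketch compares Minkowski distance with a simplified mass-based dissimilarity in the spirit of \(m_p\), using equal-width binning per dimension. It is our own illustrative approximation, not the exact formulation defined in the paper, and the function and parameter names (minkowski, mp_dissimilarity) are ours.

```python
import numpy as np

def minkowski(x, y, p=2.0):
    """Data-independent ell_p distance; sensitive to units and scales of measurement."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mp_dissimilarity(x, y, data, p=1.0):
    """Simplified m_p-style data-dependent dissimilarity.

    In each dimension the term is the fraction of data points whose values fall in
    the span of equal-width bins enclosing x_i and y_i, i.e. probability mass rather
    than geometric gap.  With p = 0 the geometric mean of the masses is used (an
    m_0-style measure), so self-dissimilarity depends on the local data density.
    """
    N, M = data.shape
    eta = int(np.floor(np.log2(N))) + 1          # number of intervals per dimension
    terms = np.empty(M)
    for i in range(M):
        col = data[:, i]
        edges = np.linspace(col.min(), col.max(), eta + 1)
        bins = lambda v: np.clip(np.searchsorted(edges, v, side="right") - 1, 0, eta - 1)
        bx, by, bd = bins(x[i]), bins(y[i]), bins(col)
        lo, hi = min(bx, by), max(bx, by)
        mass = max(np.sum((bd >= lo) & (bd <= hi)), 1)   # data mass of the enclosing region
        terms[i] = mass / N
    if p > 0:
        return float(np.mean(terms ** p) ** (1.0 / p))
    return float(np.exp(np.mean(np.log(terms))))          # p = 0: geometric mean

# Toy usage: rescaling one feature changes the Minkowski distance
# but leaves the mass-based dissimilarity unchanged.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
a, b = X[0], X[1]
X_scaled = X * np.array([1.0, 1000.0, 1.0])               # change the unit of one feature
print(minkowski(a, b), minkowski(X_scaled[0], X_scaled[1]))
print(mp_dissimilarity(a, b, X), mp_dissimilarity(X_scaled[0], X_scaled[1], X_scaled))
```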

Notes

  1. Similarity is the inverse of dissimilarity. We use dissimilarity in this paper to be consistent with distance measures.

  2. Author(s) defined it as a similarity measure, but we define it as a dissimilarity measure to be consistent with other measures.

  3. Author(s) defined it as a similarity measure, but we define it as a dissimilarity measure to be consistent with other measures.

  4. Because \(A=\frac{1}{M}\), both numerator and denominator approach 0 when \(p\rightarrow 0\).

  5. http://web.ist.utl.pt/acardoso/datasets/.

  6. http://www.cs.waikato.ac.nz/ml/weka/datasets.html.

  7. http://aloi.science.uva.nl/.

  8. https://www.cs.toronto.edu/~kriz/cifar.html.

  9. http://tunedit.org/challenge/music-retrieval.

  10. https://archive.ics.uci.edu/ml/datasets.html.

  11. Fernando and Webb (2017) have shown that it is a better alternative than other settings of p. Hereafter, to simplify notation, we refer to \(d_{rank}(\mathbf{x}, \mathbf{y}, 1)\) as \(d_{rank}(\mathbf{x}, \mathbf{y})\).

  12. We also examined whether the geometric mean of rank differences produced better results than the arithmetic mean (\(d_{rank}\)), but observed that it produced worse results than \(d_{rank}\) in all six datasets.

  13. https://github.com/michaelstewart/metric-learn.

  14. The BoW text datasets were not used because there is no issue of scales and units of measurement as feature values are frequency counts.

  15. We did not use the other datasets from Sect. 6.3.2 because we only had the BoW vectors, not the actual text of the documents.

  16. https://github.com/hank110/bag-of-concepts.

  17. \(d_{cos}\) here is equivalent to \(d_{cosIdf}\) in Sect. 6.3.2 because IDF weights were applied as a part of BoC vector representation.

References

  • Ariyaratne HB, Zhang D (2012) A novel automatic hierachical approach to music genre classification. In: Proceedings of the 2012 IEEE international conference on multimedia and expo workshops. IEEE Computer Society, Washington DC, pp 564–569

  • Aryal S (2018) Anomaly detection technique robust to units and scales of measurement. In: Proceedings of the 22nd Pacific-Asia conference on knowledge discovery and data mining. Springer, Cham, pp 589–601

  • Aryal S, Ting KM, Haffari G, Washio T (2014a) Mp-dissimilarity: a data dependent dissimilarity measure. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 707–712

  • Aryal S, Ting K, Wells JR, Washio T (2014b) Improving iforest with relative mass. In: Proceedings of the 18th Pacific-Asia conference on knowledge discovery and data mining, pp 510–521

  • Aryal S, Ting KM, Haffari G, Washio T (2015) Beyond tf-idf and cosine distance in documents dissimilarity measure. In: Proceedings of the 11th Asia information retrieval societies conference, pp 400–406

  • Aryal S, Ting KM, Washio T, Haffari G (2017) Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl Inf Syst 53(2):479–506

  • Aryal S, Ting KM, Washio T, Haffari G (2019) A new simple and effective measure for bag-of-word inter-document similarity measurement. CoRR arXiv:1902.03402

  • Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828

  • Beyer KS, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “Nearest Neighbor” meaningful? In: Proceedings of the 7th international conference on database theory. Springer, London, pp 217–235

  • Black M (1952) The identity of indiscernibles. MIND: Q Rev Psychol Philos 61(242):153–164

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  • Cardoso-Cachopo A (2007) Improving methods for single-label text categorization. PhD thesis, Instituto Superior Tecnico, Technical University of Lisbon, Lisbon

  • Conover WJ, Iman RL (1981) Rank transformations as a bridge between parametric and nonparametric statistics. Am Stat 35(3):124–129

  • Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge

  • Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the twentieth annual symposium on computational geometry, pp 253–262

  • Deza MM, Deza E (2009) Encyclopedia of distances. Springer, Berlin

  • Dua D, Graff C (2017) UCI machine learning repository. School of Information and Computer Sciences, University of California, Irvine. http://archive.ics.uci.edu/ml

  • Duda RO, Hart PE, Stork DG (2000) Pattern classification, 2nd edn. Wiley-Interscience, New York

  • Fernando TL, Webb GI (2017) SimUSF: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption. Data Min Knowl Disc 31(1):264–286

  • François D, Wertz V, Verleysen M (2007) The concentration of fractional distances. IEEE Trans Knowl Data Eng 19(7):873–886

  • Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Ranzato M, Mikolov T (2013) DeViSE: a deep visual-semantic embedding model. In: Proceedings of NIPS, pp 2121–2129

  • Geusebroek J-M, Burghouts GJ, Smeulders AW (2005) The Amsterdam library of object images. Int J Comput Vis 61(1):103

  • Gong Y, Kumar S, Rowley HA, Lazebnik S (2013) Learning binary codes for high-dimensional data using bilinear projections. In: Proceedings of the 2013 IEEE conference on computer vision and pattern recognition, pp 484–491

  • Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge

  • Han E-H, Karypis G (2000) Centroid-based document classification: analysis and experimental results. In: Proceedings of the 4th European conference on principles of data mining and knowledge discovery. Springer, London, pp 424–431

  • Kiela D, Bottou L (2014) Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, pp 36–45

  • Kim HK, Kim H, Cho S (2017) Bag-of-concepts: comprehending document representation through clustering words in distributed representation. Neurocomputing 266:336–352

  • Krizhevsky A (2009) Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto

  • Krumhansl CL (1978) Concerning the applicability of geometric models to similarity data: the interrelationship between similarity and spatial density. Psychol Rev 85(5):445–463

  • Kulis B (2013) Metric learning: a survey. Found Trends Mach Learn 5(4):287–364

  • LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444

  • Li P, Shrivastava A, Moore JL, König AC (2011) Hashing algorithms for large-scale learning. Adv Neural Inf Process Syst 24:2672–2680

  • Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the fifteenth international conference on machine learning (ICML). Morgan Kaufmann Publishers Inc., San Francisco, pp 296–304

  • Lin K, Yang H, Hsiao J, Chen C (2015) Deep learning of binary hash codes for fast image retrieval. In: 2015 IEEE conference on computer vision and pattern recognition workshops (CVPRW), pp 27–35

  • Liu F, Ting KM, Zhou Z-H (2008) Isolation forest. In: Proceedings of the eighth IEEE international conference on data mining, pp 413–422

  • Mahalanobis PC (1936) On the generalized distance in statistics. Proc Natl Inst Sci India 2:49–55

  • Mansouri J, Khademi M (2015) Multiplicative distance: a method to alleviate distance instability for high-dimensional data. Knowl Inf Syst 45(3):783–805

  • Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013a) Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th international conference on neural information processing systems. Curran Associates Inc., USA, pp 3111–3119

  • Nguyen N, Guo Y (2008) Metric learning: a support vector approach. In: Proceedings of the ECML PKDD 2008. Springer, Berlin, pp 125–136

  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830

  • Ren W, Yu Y, Zhang J, Huang K (2014) Learning convolutional nonlinear features for k nearest neighbor image classification. In: Proceedings of the 22nd international conference on pattern recognition, pp 4358–4363

  • Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manag 24(5):513–523

  • Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill Inc, New York

  • Shi T, Horvath S (2006) Unsupervised learning with random forest predictors. J Comput Graph Stat 15(1):118–138

  • Song D, Liu W, Ji R, Meyer DA, Smith JR (2015) Top rank supervised binary coding for visual search. In: Proceedings of the 2015 IEEE international conference on computer vision (ICCV), pp 1922–1930

  • Stevens SS (1946) On the theory of scales of measurement. Science 103(2684):677–680

  • Stewart M (2015) Metric learning algorithms in python. GitHub repository. https://github.com/michaelstewart/metric-learn

  • Sturges HA (1926) The choice of a class interval. J Am Stat Assoc 21(153):65–66

  • Tan P-N, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, Boston

  • Ting KM, Zhu Y, Carman M, Zhu Y, Zhou Z-H (2016) Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1205–1214

  • Torkkola K, Tuv E (2005) Ensemble learning with supervised kernels. In: Proceedings of the 16th European conference on machine learning, ECML’05. Springer, Berlin, pp 400–411

  • Tsang IW, Kwok JT, Bay CW (2003) Distance metric learning with kernels. In: Proceedings of the international conference on artificial neural networks, pp 126–129

  • Tversky A (1977) Features of similarity. Psychol Rev 84(4):327–352

  • Wang F, Sun J (2015) Survey on distance metric learning and dimensionality reduction in data mining. Data Min Knowl Disc 29(2):534–564

  • Wang J, Do HT, Woznica A, Kalousis A (2011) Metric learning with multiple kernels. In: Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ (eds) Advances in neural information processing systems, vol 24. Curran Associates, Inc., pp 1170–1178

  • Wang J, Shen HT, Song J, Ji J (2014) Hashing for similarity search: a survey. CoRR arXiv:1408.2927

  • Wang J, Zhang T, Song J, Sebe N, Shen HT (2018) A survey on learning to hash. IEEE Trans Pattern Anal Mach Intell 40(4):769–790

  • Weinberger K, Blitzer J, Saul L (2005) Distance metric learning for large margin nearest neighbor classification. In: Proceedings of the advances in neural information processing systems. MIT Press, Cambridge

  • Xu Z, Weinberger KQ, Chapelle O (2013) Distance metric learning for kernel machines. Technical report, arXiv:1208.3422v2

  • Zhong G, Wang L-N, Ling X, Dong J (2016) An overview on data representation learning: from traditional feature learning to recent deep learning. J Finance Data Sci 2(4):265–278

  • Zhou G-T, Ting KM, Liu FT, Yin Y (2012) Relevance feature mapping for content-based multimedia information retrieval. Pattern Recognit 45(4):1707–1720

Acknowledgements

The authors would like to thank A/Prof Peter Vamplew for interesting discussions and useful feedback on the first draft of the manuscript, and the anonymous reviewers for their valuable comments and suggestions to improve the manuscript.

Author information

Corresponding author

Correspondence to Sunil Aryal.

Additional information

Responsible editor: Srinivasan Parthasarathy, Johannes Fürnkranz.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Empirical evaluation in documents represented as bag-of-concepts

Here we report the performance of the BoW versions of the data-dependent measures \(d_{rank}, m_1, d_{lin}\) and \(m_0\) (discussed in Sect. 4) against cosine distance (\(d_{cos}\)) and a dissimilarity based on the simple dot product (\(d_{dot}\)) on text datasets where documents are represented as bag-of-concepts (BoC) vectors (Kim et al. 2017).
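
For reference, the sketch below gives a rank-difference dissimilarity in the spirit of \(d_{rank}\) (cf. footnotes 11 and 12): the arithmetic mean of per-dimension absolute rank differences, normalised by the data size. The exact definition used in the paper is given in Sect. 4; the treatment of ties and the normalisation here, as well as the function name, are our own simplifications.

```python
import numpy as np

def rank_dissimilarity(x, y, data):
    """d_rank-style measure: arithmetic mean over dimensions of the
    normalised absolute difference of the ranks of x_i and y_i."""
    N, M = data.shape
    diffs = np.empty(M)
    for i in range(M):
        # rank of a query value = number of data points with value <= it
        rx = np.sum(data[:, i] <= x[i])
        ry = np.sum(data[:, i] <= y[i])
        diffs[i] = abs(rx - ry) / N
    return float(diffs.mean())      # arithmetic mean (cf. footnote 12)
```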

In the BoC representation, concepts are defined by clustering similar words in the embedding space produced by word2vec (Mikolov et al. 2013a), a word embedding technique based on deep neural networks (LeCun et al. 2015; Goodfellow et al. 2016). The BoC representation has been shown to produce better results than BoW or doc2vec (a word-embedding-based vector representation of documents) in the document classification task (Kim et al. 2017). We conducted experiments in the 5NN document classification task using three text datasets, NG20, R52 and R8 (the same datasets used by Kim et al. (2017))Footnote 15, where documents were represented by 500-dimensional BoC vectors. We used the Python implementation of the BoC representation provided by the authors (Kim et al. 2017).Footnote 16
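
As a rough sketch of how such BoC vectors can be constructed, the code below clusters word embeddings into concepts with scikit-learn's KMeans and then counts and IDF-weights concept occurrences per document. The experiments reported here used the authors' own implementation (see footnote 16); this sketch only conveys the idea, and names such as build_boc and word_vectors (an assumed dict of pre-trained embeddings) are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_boc(docs, word_vectors, n_concepts=500):
    """Build IDF-weighted bag-of-concepts vectors.

    docs: list of tokenised documents (lists of words).
    word_vectors: dict mapping word -> embedding vector (assumed pre-trained).
    Returns an (n_docs, n_concepts) matrix of IDF-weighted concept counts.
    """
    vocab = sorted({w for d in docs for w in d if w in word_vectors})
    # assumes len(vocab) >= n_concepts
    emb = np.vstack([word_vectors[w] for w in vocab])
    # group similar words into concepts by clustering their embeddings
    labels = KMeans(n_clusters=n_concepts, n_init=10, random_state=0).fit_predict(emb)
    concept_of = dict(zip(vocab, labels))
    counts = np.zeros((len(docs), n_concepts))
    for j, d in enumerate(docs):
        for w in d:
            if w in concept_of:
                counts[j, concept_of[w]] += 1
    # IDF weighting (cf. footnote 17: cosine on BoC vectors already includes IDF)
    df = np.count_nonzero(counts, axis=0) + 1
    return counts * np.log((1 + len(docs)) / df)
```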

The average 5NN classification errors and standard errors over a 10-fold cross-validation of \(d_{cos}\)Footnote 17, \(d_{dot}\) and the BoW variants of \(d_{rank}, m_1, d_{lin}\) and \(m_0\) are provided in Table 14. The results show that \(m_0\) produced the best result, or one equivalent to the best, in all three datasets, followed by \(d_{rank}\) in two datasets, and \(d_{lin}\) and \(d_{cos}\) in one dataset each. This is consistent with the results in Sect. 6.3.2, where \(m_0\) also produced the best results overall.

The dissimilarity measure based on the simple dot product (\(d_{dot}\)) produced significantly worse results than the other contenders in all datasets. It is interesting to note that, like \(m_p\), \(d_{dot}\) does not have a constant self-similarity. However, the self-similarity of \(d_{dot}\) is not data-dependent, as it depends solely on the feature values of the instance; moreover, \(d_{dot}\) does not consider self-similarity in each dimension separately.
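
To make this point concrete, the following small sketch contrasts the two baselines. The exact way the dot product is converted into a dissimilarity in the paper may differ; a simple negation is used here purely for illustration, and the function names are ours.

```python
import numpy as np

def d_cos(x, y):
    """Cosine distance: self-dissimilarity d_cos(x, x) is always 0."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def d_dot(x, y):
    """Dot-product-based dissimilarity (illustrative negation): the self-value
    -x.x depends only on x's own feature values, not on the data distribution,
    and there is no per-dimension treatment of self-similarity."""
    return -float(np.dot(x, y))

x = np.array([3.0, 0.0, 1.0])
print(d_cos(x, x))   # ~0.0 regardless of x
print(d_dot(x, x))   # -10.0, varies with the magnitude of x
```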

Table 14 Average 5NN classification error and standard error (within the parentheses in small font) over a 10-fold cross-validation in three text datasets with BoC vector representations

Appendix B: Effect of ensemble size in tree-based data-dependent measures

In order to investigate the effect of the ensemble size (t) on the tree-based data-dependent measures (\(d_{USF}\) and \(m_{IF}\)), we evaluated their task-specific performance with varying numbers of trees. The CBIR performances of \(d_{USF}\) and \(m_{IF}\) with up to \(t=1000\) trees in the Corel and Hba datasets are shown in Fig. 8.

Fig. 8 Average MAP@25 in the Corel (\(M=67\)) and Hba (\(M=187\)) datasets with different ensemble sizes

As expected, the MAP@25 of both measures increased as t increased in both datasets. However, even with \(t=1000\), they did not produce retrieval results competitive with the one-dimensional data-dependent measure \(m_0\), although the number of dimensions (M) in both datasets is much less than 1000: Corel (\(M=67\)) and Hba (\(M=187\)). This result shows that tree-based methods require a large ensemble size (much larger than M) to produce good results, and using such a large t makes them expensive to run. For example, the average total runtime (building trees, pre-processing and retrieval) of one run on the Corel dataset with \(t=1000\) was 809 seconds for \(d_{USF}\) and 2216 seconds for \(m_{IF}\), whereas \(m_0\) took only 281 seconds.
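
To illustrate how the tree-based measures work and why the ensemble size t matters, below is a simplified sketch of a mass-based dissimilarity in the spirit of \(m_{IF}\) (Ting et al. 2016): each random tree partitions the data iForest-style, and the dissimilarity of two points is the average over trees of the data mass of the deepest node containing both. This is our own minimal reimplementation for illustration (function names are ours), not the code evaluated above; real implementations differ (e.g., per-tree subsampling), but it shows why each individual tree gives only a coarse random partition, so many trees are needed for a stable estimate.

```python
import numpy as np

def build_tree(idx, data, rng, depth=0, max_depth=10):
    """Random axis-parallel partitioning (iForest-style).
    Each node stores the data mass (number of training points) it covers."""
    node = {"mass": len(idx)}
    if depth >= max_depth or len(idx) <= 1:
        return node
    f = rng.integers(data.shape[1])
    vals = data[idx, f]
    if vals.min() == vals.max():
        return node
    split = rng.uniform(vals.min(), vals.max())
    node.update(feature=f, split=split,
                left=build_tree(idx[vals < split], data, rng, depth + 1, max_depth),
                right=build_tree(idx[vals >= split], data, rng, depth + 1, max_depth))
    return node

def shared_mass(node, x, y):
    """Mass of the deepest node that contains both x and y."""
    while "feature" in node:
        gx = x[node["feature"]] < node["split"]
        gy = y[node["feature"]] < node["split"]
        if gx != gy:
            break
        child = node["left"] if gx else node["right"]
        if child["mass"] == 0:
            break
        node = child
    return node["mass"]

def mass_dissimilarity(x, y, data, t=100, max_depth=10, seed=0):
    """Average, over t random trees, of the normalised mass of the smallest
    region covering both x and y: large shared mass = dissimilar points."""
    rng = np.random.default_rng(seed)
    trees = [build_tree(np.arange(len(data)), data, rng, 0, max_depth) for _ in range(t)]
    return float(np.mean([shared_mass(tr, x, y) / len(data) for tr in trees]))
```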

Appendix C: Effectiveness of equal-frequency and equal-width discretisation approaches to speed up one-dimensional data-dependent measures

We evaluated the performance of the one-dimensional data-dependent measures with equal-width discretisation (EWD) and equal-frequency discretisation (EFD) in the CBIR task using the Corel and Hba datasets. We used the same number of intervals \(\eta =\lfloor \log _2 N\rfloor +1\) with both approaches, so the only difference between them was how the interval boundaries were set. The average MAP@k of \(d_{rank}\), \(d_{lin}\), \(m_1\) and \(m_0\) over 10 runs in the Corel and Hba datasets with EFD and EWD is provided in Fig. 9.
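
For clarity, a minimal sketch of the two discretisation schemes is given below, using the same number of intervals \(\eta =\lfloor \log _2 N\rfloor +1\) as above; the helper names are ours.

```python
import numpy as np

def n_intervals(N):
    """eta = floor(log2(N)) + 1 intervals, as used in the text."""
    return int(np.floor(np.log2(N))) + 1

def ewd_edges(col, eta):
    """Equal-width discretisation: eta intervals of equal length."""
    return np.linspace(col.min(), col.max(), eta + 1)

def efd_edges(col, eta):
    """Equal-frequency discretisation: eta intervals holding roughly equal numbers of points."""
    return np.quantile(col, np.linspace(0.0, 1.0, eta + 1))

# Example: on skewed data the two schemes place interval boundaries very differently.
col = np.random.default_rng(0).exponential(size=1000)
eta = n_intervals(len(col))          # 10 intervals for N = 1000
print(ewd_edges(col, eta))
print(efd_edges(col, eta))
```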

Fig. 9 Average MAP@25 of \(d_{rank}\), \(d_{lin}\), \(m_1\) and \(m_0\) over 10 runs in the Corel and Hba datasets with equal-frequency discretisation (EFD) and equal-width discretisation (EWD)

The CBIR results in Fig. 9 show that EFD produced results that were better than, or at least competitive with, EWD; it did not produce worse retrieval results than EWD in any case.

About this article

Cite this article

Aryal, S., Ting, K.M., Washio, T. et al. A comparative study of data-dependent approaches without learning in measuring similarities of data objects. Data Min Knowl Disc 34, 124–162 (2020). https://doi.org/10.1007/s10618-019-00660-0
