1 Introduction and motivation

Many machine learning algorithms rely on a distance measure to find the closest matches between a test point and the example points in a database, i.e., its nearest neighbours. This distance calculation is a core operation in many data mining and machine learning tasks, including density estimation, clustering, anomaly detection and classification.

Despite its widespread application, research in psychology has pointed out since the 1970s that distance measures do not possess a key property of dissimilarity as judged by humans (Krumhansl 1978; Tversky 1977), namely that two points in a dense region of the space are less similar to each other than two points of the same interpoint distance in a sparse region. Researchers have suggested that a data dependent dissimilarity provides better performance than data independent, geometric-model-based distance measures in psychological tests (Krumhansl 1978). For example, two Caucasians will be judged as less similar when compared in Europe (where there are many Caucasians) than in East Asia (where there are few Caucasians and many East Asians).

We introduce a new generic data dependent measure called mass-based dissimilarity which has the above-mentioned characteristic, and we provide concrete evidence that it is a better measure than standard distance measures for two existing algorithms which rely on distance calculations to perform classification and clustering tasks.

The new measure violates two metric axioms, namely, the constancy and minimality of self-dissimilarity. We show that the data dependency has two components: data dependent dissimilarity between identical points (aka self-dissimilarity) and data dependent dissimilarity between two non-identical points. Though there exist data dependent measures which are metrics or pseudo-metrics, they are data dependent on the second component only. The data dependent self-dissimilarity is a unique feature of mass-based dissimilarity.

Mass-based dissimilarity uses an estimate of the probability mass rather than distance as the means to find the closest match neighbourhood. This heralds a fundamental change of perspective: the neighbourhood is now determined by the lowest probability mass neighbours rather than the nearest neighbours.

Simply replacing the distance measure with mass-based dissimilarity effectively converts nearest neighbour (NN) algorithms to Lowest Probability Mass Neighbour (LMN) algorithms. Both types of algorithms employ exactly the same algorithmic procedures, except for the substitution of the dissimilarity measure.

This paper makes the following contributions:

  1. Providing a generic data dependent dissimilarity, named mass-based dissimilarity, whose data dependency has two components: self-dissimilarity and dissimilarity between two non-identical points \(\mathbf{x} \ne \mathbf{y}\).

  2. Analysing the conditions under which mass-based dissimilarity will overcome key shortcomings of distance-based neighbourhood methods in classification and clustering tasks.

  3. Through simulations and empirical evaluation, demonstrating that LMN algorithms overcome key shortcomings of NN algorithms in classification and clustering tasks. This is achieved through the replacement of the distance measure with the mass-based dissimilarity in existing algorithms.

  4. Investigating the similarities and differences with existing data dependent measures.

The remainder of the paper is organised as follows. Section 2 presents the proposed mass-based dissimilarity. Section 3 describes the lowest probability mass neighbour (kLMN) algorithm and analyses the condition under which the kLMN classifier reduces the error rate of the kNN classifier. Section 4 describes how mass-based clustering can be obtained by simply replacing the distance measure with mass-based dissimilarity in an existing density-based clustering algorithm; it also analyses the condition under which mass-based clustering performs better than density-based clustering. Section 5 describes the shortcomings of existing distance-based neighbourhood methods. Section 6 presents the empirical evaluation results. Related data dependent dissimilarities and metric axioms are discussed in Sect. 7. The relation to distance metric learning, data dependent kernels and similarity-based learning is presented in Sect. 8. A discussion of other issues and the conclusions are provided in the last two sections.

2 Mass-based dissimilarity

Geometric model based measures depend solely on the geometric positions of data points to derive distance measures. Instead, mass-based dissimilarity measures mainly depend on the distribution of the data.

The intuition underlying the proposed measures is that the dissimilarity between two points primarily depends on the amount of probability mass of a region of a space covering the two points. Specifically, two points in a dense region are less similar to each other than two points of the same interpoint distance in a sparse region.

In order to turn the intuition above into a useful measure, we need to (a) define what the region covering two points is; and (b) provide a method for estimating its probability mass. Since we do not wish to make any parametric assumptions about the form of the underlying probability distribution that generated the data, we turn to non-parametric tree-based partitioning techniques (described in Sect. 2.2) to define a hierarchy of regions and calculate their probabilities. We can then define a distribution-sensitive dissimilarity measure as the probability mass (Footnote 1) of the smallest region in the hierarchy covering both points. We name the proposed measure: mass-based dissimilarity.

We note that the new measure still makes use of a geometric model in order to define a region which encloses neighbouring points. However, once the regions are defined, the dissimilarity between any two points is determined by the probability mass of the smallest region covering the two points. We now formalise the concepts as follows.

Let H denote a hierarchical model that partitions the space \(\mathbb {R}^q\) into a set of non-overlapping axis-aligned regions that collectively span \(\mathbb {R}^q\). Each internal node (representing a region) in the hierarchy corresponds to the union of its child nodes/regions. Let \(\mathcal {H}(D)\) denote the set of all such hierarchical partitions H that are admissible under a data set D such that each non-overlapping region contains at least one point from D. As a result, each hierarchy \(H \in \mathcal {H}(D)\) has a finite height and a finite number of external nodes, both of which are at most |D|.

The smallest region covering two points is defined as follows:

Definition 1

\(R(\mathbf{x},\mathbf{y}|H)\), the smallest local region covering \(\mathbf{x} \in \mathbb {R}^q\) and \(\mathbf{y} \in \mathbb {R}^q\) with respect to the hierarchical partitioning model H of \(\mathbb {R}^q\), is defined as:

$$\begin{aligned} R(\mathbf{x},\mathbf{y}|H)=\mathop {{{\mathrm{argmax}}}}\limits _{r \in H\ s.t. \{ \mathbf{x},\mathbf{y} \} \in r} depth(r;H) \end{aligned}$$
(1)

where \(depth(r;H)\) is the depth of node r in the hierarchical model H.

Note that the probability of data falling into the smallest region containing both \(\mathbf{x}\) and \(\mathbf{y}\), denoted \(P(R(\mathbf{x},\mathbf{y}|H))\), is analogous to the shortest distance between \(\mathbf{x}\) and \(\mathbf{y}\) used in the geometric model.

Let \(D=\{\mathbf{x}_1,\ldots ,\mathbf{x}_n\}, \mathbf{x}_i \in \mathbb {R}^q\) be a dataset sampled from an unknown probability density function, \(\mathbf{x}_i \sim F\).

Definition 2

Mass-based dissimilarity of \(\mathbf{x}\) and \(\mathbf{y}\) wrt D and F is defined as the expectation, over all possible partitionings of the data, of the probability that a randomly chosen point \(\mathbf{z} \sim F\) would lie in the region \(R(\mathbf{x},\mathbf{y}|H)\):

$$\begin{aligned} m(\mathbf{x},\mathbf{y}|D,F) = E_{\mathcal {H}(D)} [ P_F(R(\mathbf{x},\mathbf{y}|H)) ] \end{aligned}$$
(2)

where \(P_F(\cdot )\) is the probability wrt F; and the expectation is taken over the probability distribution on all hierarchical partitionings \(H \in {{\mathcal {H}}}(D)\) of the dataset D.

In practice, the mass-based dissimilarity would be estimated from a finite number of models \(H_i \in \mathcal {H}(D), i=1,\ldots ,t\) as follows:

$$\begin{aligned} m_e(\mathbf{x},\mathbf{y}|D) = \frac{1}{t} \sum _{i=1}^t \tilde{P}(R(\mathbf{x},\mathbf{y}|H_i;D)) \end{aligned}$$
(3)

where \(\tilde{P}(R) = \frac{1}{|D|} \sum _{\mathbf{z} \in D} \mathbb {1}(\mathbf{z} \in R)\) estimates the probability of the region R using the count of data points in that region; and \(\mathbb {1}(\cdot )\) is an indicator function.
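As a small worked illustration of Eq. 3 (the numbers are hypothetical and chosen purely for exposition), suppose \(|D|=10\) and \(t=2\) partitioning models are used. If the smallest region covering \(\mathbf{x}\) and \(\mathbf{y}\) contains 3 points under \(H_1\) and 5 points under \(H_2\), then

$$\begin{aligned} m_e(\mathbf{x},\mathbf{y}|D) = \frac{1}{2}\left( \frac{3}{10} + \frac{5}{10}\right) = 0.4 \end{aligned}$$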

Hereafter we drop D from the notations when the context is clear.

2.1 Self-dissimilarity

A unique feature of mass-based dissimilarity is self-dissimilarity, whereby the dissimilarity between a point and itself, \(m_e(\mathbf{x},\mathbf{x})\), is non-constant across the space \(\mathbb {R}^q\): it ranges over [0, 1] with its value depending on the data distribution and the partitioning strategy used. This relaxes the metric axioms which require self-dissimilarity to be minimum and constant.

The non-constancy stems from the fact that the partitions in H can contain different numbers of data points and thus have different probability mass estimates. The non-constancy of self-dissimilarity implies that it cannot be adjusted by simply subtracting a constant from all values. This is unlike the self-dissimilarity of other measures, which usually takes a constant value equal to the minimum dissimilarity.

The differences between \(m_e\) and \(\ell _p\) are shown in Table 1. The reasons for the two properties of \(m_e\) are given below:

  • \(\forall \mathbf{xy} ~~ (\mathbf{x} \ne \mathbf{y}) \rightarrow (m_{e}(\mathbf{x},\mathbf{x}) \le m_{e}(\mathbf{x},\mathbf{y}))\)

    This is true because \(R(\mathbf{x},\mathbf{x}|H) \subseteq R(\mathbf{x},\mathbf{y}|H)\). The self-dissimilarity is the base value from which the dissimilarity \(m_{e}(\mathbf{x},\mathbf{y})\) between two points is measured, i.e., \(m_{e}(\mathbf{x},\mathbf{y}) \ge \max (m_{e}(\mathbf{x},\mathbf{x}), m_{e}(\mathbf{y},\mathbf{y}))\). Thus the self-dissimilarity has a direct impact on \(m_{e}(\mathbf{x},\mathbf{y})\) even though no duplicate points may exist in the dataset.

  • \(\exists \mathbf{x},\mathbf{y},\mathbf{z} ~~ (\mathbf{y} \ne \mathbf{z}) \wedge (m_{e}(\mathbf{x},\mathbf{x}) > m_{e}(\mathbf{y},\mathbf{z}))\)

    This is because the mass distribution for \(m_e(\mathbf{x},\mathbf{x})\) is not constant (Footnote 2) over the space. If \(\mathbf{x}\) is in the region that includes the maximum mass point, then its mass value will be larger than the dissimilarity of some points \(\mathbf{y}\) and \(\mathbf{z}\) which are close to the fringe or have the minimum mass values. Both mass distributions shown in Fig. 1b, c provide such an example: \(m_e(\mathbf{z},\mathbf{y})\) at either fringe of the distribution is less than the maximum mass \(m_e(\mathbf{x},\mathbf{x})\). This property always holds when the half-space partitioning strategy is used because the mass distribution is concave (see below).

Table 1 \(m_e\) versus \(\ell _p\) (the usual metric based on p-norm)
Fig. 1 (a) A true density distribution; (b) \({m}_{e}(x, x\vert D)\) based on half-space and (c) \({m}_{e}(x, x\vert D)\) based on random trees (level = 8) are two mass distributions generated using two different partitioning strategies from a dataset D with 1500 points randomly sampled from (a)

In general, the distribution of the data dependent self-dissimilarity is equivalent to the mass distribution defined by Ting et al. (2010), and its properties depend on the hierarchical partitioning strategy \({{\mathcal {H}}}\):

  • If a half-space partitioning strategy is used to define the regions in H, then \({m}_{e}(\mathbf{x},\mathbf{x})\) reduces to the half-space mass defined in Chen et al. (2015). Here the mass distribution is always concave within the area bounded by the data, irrespective of the density distribution of the given data set. It has a unique maximum mass point which can be treated as the median of the distribution. The minimum mass values are at the fringes of the distribution (Chen et al. 2015). An example is shown in Fig. 1b.

  • If instead a random tree partitioning strategy is used to define H, then \({m}_e(\mathbf{x},\mathbf{x})\) reduces to level-h mass estimation as defined by Ting et al. (2013b). An example is shown in Fig. 1c.

This paper uses the random trees partitioning strategy rather than half-space because the former provides finer granularity in mass-based dissimilarity. The relationships between these two partitioning strategies, and between mass estimation and mass-based dissimilarity are provided in Sect. 2.4.

An intuitive example is as follows. An East Asian may be considered more similar to themselves in Europe (where there are fewer East Asians) than they are in East Asia (where there are many East Asians), even though they have exactly the same physical features regardless of whether they are in East Asia or Europe. This is also reflected in Fig. 1c, where high (low) density regions have high (low) self-dissimilarity. This self-similarity influences the similarity measurement of two different persons: a consequence of its data dependency is that two East Asians are judged more similar in Europe than they are in East Asia.

This example shows that the shortest distance (i.e., zero distance) is not equivalent to the closest (judged) similarity or smallest probability mass; and data dependent self-dissimilarity is an important aspect of a mass-based measure akin to judged dissimilarity.

2.2 Hierarchical partitioning method used to define regions

There are many methods to implement a model to define regions for mass estimation (Ting et al. 2013b). In this paper, we employ a method based on completely random trees (Footnote 3) to implement mass-based dissimilarity.

We use a recursive partitioning scheme called iForest (isolation Forest) (Liu et al. 2008) to define regions. Though iForest was initially designed for anomaly detection (Liu et al. 2008), it has been shown that it is a special case of mass estimation (Ting et al. 2013b), which will be discussed in Sect. 2.4.

The implementation can be divided into two steps, with two input parameters: t, the number of iTrees (isolation Trees); and \(\psi \), the sub-sampling size used to build each iTree. The height limit h of each iTree is set automatically from \(\psi \): \(h=\lceil \log _{2}\psi \rceil \).

The first step is to build an iForest consisting of t iTrees as the partitioning structures. Each iTree is built independently using a subset \({{\mathcal {D}}} \subset D\), sampled without replacement from D, where \(\vert {\mathcal {D}} \vert =\psi \). A randomly selected split is employed at each internal node of an iTree to partition the sample set at the node into two non-empty subsets until every point is isolated or the maximum tree height h is reached. Unless stated otherwise, axis-parallel splits are used at each node of an iTree to build an iForest. The details of the iTree building process can be found in “Appendix A”.

Figure 2 provides an example of six points partitioned by an iTree: \(m_e(a,b)=5\), \(m_e(b,f)=6\), \(m_e(d,d)=4\) and \(m_e(f,f)=1\) (ignoring the normalising term \(|{{\mathcal {D}}}|=6\)).

After an iForest is built, all points in D are traversed through each tree in order to record the mass of each node.

Fig. 2 Example: six points partitioned by a 2-level iTree

The second step is the evaluation step. Test points \(\mathbf{x}\) and \(\mathbf{y}\) are passed through each iTree to find the mass of the deepest node containing both \(\mathbf{x}\) and \(\mathbf{y}\), i.e., \(|R(\mathbf{x},\mathbf{y}|H_i)|\) for tree \(H_i\). Finally, \(m_e(\mathbf{x},\mathbf{y})\) is the mean of these mass values over the t iTrees, as defined below:

$$\begin{aligned} m_e(\mathbf{x},\mathbf{y})= \frac{1}{t} \sum _{i=1}^{t} \frac{| R(\mathbf{x},\mathbf{y}|H_i)|}{|D|} \end{aligned}$$
(4)
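The two steps above can be sketched in a few dozen lines. The code below is only a minimal illustration of the procedure just described, not a reference implementation: the names (build_iforest, mass_dissimilarity, _Node, _build_itree, _record_mass) are invented for this sketch, and corner cases (e.g., constant attributes) are handled in the simplest possible way.

```python
import math
import numpy as np

class _Node:
    def __init__(self, left=None, right=None, split_dim=None, split_val=None):
        self.left, self.right = left, right
        self.split_dim, self.split_val = split_dim, split_val
        self.mass = 0                      # number of points of D falling into this node

def _build_itree(X, height, rng):
    # Completely random axis-parallel splits until isolation or the height limit is reached.
    if height == 0 or len(X) <= 1 or np.all(X.max(axis=0) == X.min(axis=0)):
        return _Node()
    dim = rng.integers(X.shape[1])
    lo, hi = X[:, dim].min(), X[:, dim].max()
    if lo == hi:                           # simplification: stop if the chosen attribute is constant
        return _Node()
    val = rng.uniform(lo, hi)
    return _Node(_build_itree(X[X[:, dim] < val], height - 1, rng),
                 _build_itree(X[X[:, dim] >= val], height - 1, rng),
                 dim, val)

def _record_mass(node, X):
    # Traverse all points in D through the tree to record the mass of every node.
    node.mass += len(X)
    if node.split_dim is None or len(X) == 0:
        return
    mask = X[:, node.split_dim] < node.split_val
    _record_mass(node.left, X[mask])
    _record_mass(node.right, X[~mask])

def build_iforest(D, t=100, psi=256, seed=0):
    # Step 1: t iTrees, each built on a subsample of size psi, with height limit ceil(log2(psi)).
    rng = np.random.default_rng(seed)
    psi = min(psi, len(D))
    h = math.ceil(math.log2(psi)) if psi > 1 else 0
    forest = []
    for _ in range(t):
        idx = rng.choice(len(D), size=psi, replace=False)
        tree = _build_itree(D[idx], h, rng)
        _record_mass(tree, D)              # record node masses using all of D
        forest.append(tree)
    return forest

def mass_dissimilarity(x, y, forest, n):
    # Step 2 (Eq. 4): mean, over trees, of the mass of the deepest node containing both x and y.
    total = 0.0
    for tree in forest:
        node = tree
        while node.split_dim is not None:
            left_x = x[node.split_dim] < node.split_val
            left_y = y[node.split_dim] < node.split_val
            if left_x != left_y:           # x and y separate here: the current node is R(x, y | H_i)
                break
            node = node.left if left_x else node.right
        total += node.mass / n
    return total / len(forest)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    D = rng.normal(size=(1000, 2))
    forest = build_iforest(D, t=50, psi=128)
    print(mass_dissimilarity(D[0], D[1], forest, len(D)))
    print(mass_dissimilarity(D[0], D[0], forest, len(D)))   # self-dissimilarity is non-zero
```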

2.3 Visualising the effects of a data dependent dissimilarity measure

The visualisation in Fig. 3 provides a different perspective on the advantage of using a data dependent dissimilarity measure. Here we performed multidimensional scaling (MDS) (Footnote 4) with the mass-based dissimilarity and distance measures. The red dense region in Fig. 3a becomes sparser and the two sparse regions in Fig. 3a become denser in the MDS plot using the mass-based dissimilarity measure, shown in Fig. 3b. This modified distribution enables all clusters to be detected by an existing clustering algorithm that would not have succeeded otherwise (see Sect. 4.2.2 for details).
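One way to reproduce this kind of plot is to feed a precomputed pairwise dissimilarity matrix to metric MDS; scikit-learn supports this via dissimilarity='precomputed'. The sketch below assumes the matrix has already been filled with pairwise \(m_e\) values (e.g., using the hypothetical mass_dissimilarity helper sketched in Sect. 2.2) or with pairwise distances for the baseline plot.

```python
import numpy as np
from sklearn.manifold import MDS

def mds_embedding(dissim, seed=0):
    # dissim: symmetric n x n matrix of pairwise dissimilarities (m_e values or distances).
    return MDS(n_components=2, dissimilarity='precomputed',
               random_state=seed).fit_transform(dissim)
```

The two resulting embeddings (one from distances, one from \(m_e\)) can then be plotted side by side, as in Fig. 3.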

Fig. 3 MDS plots on the Thyroid dataset with two different dissimilarity measures. (a) MDS using distance. (b) MDS using \(m_e\)

2.4 Mass estimation and mass-based dissimilarity

Mass-based dissimilarity is a direct descendant of another line of research, called mass estimation (Ting et al. 2010, 2013b; Chen et al. 2015). Data distribution is often modelled in terms of density distribution. Mass estimation offers an alternative way to model data distribution, in terms of mass distribution. Before the advent of mass estimation, iForest (Liu et al. 2008) was created for the sole purpose of anomaly detection. Its effectiveness in discerning and ranking anomalous points in a dataset enables it to be used for other tasks which require ranking. Indeed, iForest has been adapted in a content-based information retrieval system called ReFeat (Zhou et al. 2012), which finds points in a database that are relevant to a query, coupled with relevance feedback. iForest was then recognised as an implementation of mass estimation (Ting et al. 2013b), where the path length (the anomaly score) traversed by a test point along each iTree is a proxy to mass. A direct use of mass facilitated improved versions of iForest and ReFeat (Aryal et al. 2014b).

From another perspective, a special case of mass estimation, called half-space mass (Chen et al. 2015), shares common characteristics with data depth (Liu et al. 1999; Mosler 2013), which aims to find the median in a multi-dimensional space. Both model data distribution in terms of centre-outward ranking rather than density or linear ranking. They have the following characteristics: (i) the resultant mass distribution is concave (convex for data depth) regardless of the underlying density distribution, and (ii) the maximum mass point or the minimum data depth point can be considered as a generalisation of the median.

The half-space mass (Chen et al. 2015) is a generalisation of the univariate mass estimation (Ting et al. 2010, 2013b) to multi-dimensional spaces; and it facilitates the extension of the mass estimation of one point to a dissimilarity measure of two points based on mass. The generic definition of mass-based dissimilarity, presented in this paper, encompasses \(m_p\)-dissimilarity (Aryal et al. 2014a) as its special case. The details of \(m_p\)-dissimilarity are provided in Sect. 7.1.

SNN (dis)similarity (Jarvis and Patrick 1973) can be viewed as an early form of mass-based dissimilarity. The \(\mu \)-neighbourhood function introduced in Sect. 4.1 is a new form of mass estimator which is defined based on mass-based dissimilarity. Using this function as a template, we can now see that the neighbourhood function based on SNN is an early variant of the \(\mu \)-neighbourhood function. The details of SNN similarity and its relation to mass-based dissimilarity are provided in Sect. 4.4.

It is interesting to examine the versatility of mass estimation. Thus far, mass estimation has been implemented using trees [including iForest (Liu et al. 2008) and its variants (Ting and Wells 2010; Tan et al. 2011; Ting et al. 2013a)], half-space (Chen et al. 2015) and nearest neighbour (Wells et al. 2014). These implementations have been applied to anomaly detection, clustering, information retrieval, classification, the emerging new class problem in data streams (Mu et al. 2017), and even mass-based density estimation, called DEMass (Ting et al. 2013a; Wells et al. 2014).

Given the strong connection between mass estimation and mass-based dissimilarity, the above implementations of mass estimation can potentially be channelled for use in mass-based dissimilarity—iForest is an example used in this paper.

3 Lowest probability mass neighbour classifiers

We now describe the lowest probability mass neighbour (LMN) algorithm, which is formed by simply replacing the distance measure in the nearest neighbour (NN) algorithm with the mass-based dissimilarity. We focus on classification here. The nearest neighbour and lowest probability mass neighbour for the NN and LMN classifiers, respectively, are obtained as follows:

$$\begin{aligned} {NN}(\mathbf{x}; D)= & {} \mathop {{{\mathrm{argmin}}}}\limits _{\mathbf{y} \in {D}} \ \ell _p(\mathbf{x},\mathbf{y}) \\ {LMN}(\mathbf{x}; D)= & {} \mathop {{{\mathrm{argmin}}}}\limits _{\mathbf{y} \in {D}} \ m_e(\mathbf{x},\mathbf{y}) \end{aligned}$$

In this paper, we assume that Euclidean distance is used for NN classifiers, i.e., \(p=2\) for \(\ell _p\).
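The substitution can be made explicit in code. The following sketch (reusing the hypothetical build_iforest/mass_dissimilarity helpers from Sect. 2.2; it is not the authors' implementation) shows that the only line that changes between a kNN and a kLMN prediction is the one computing the (dis)similarity to the training points.

```python
import numpy as np
from collections import Counter

def knn_predict(x, D, labels, k):
    # kNN: rank training points by Euclidean distance and take a majority vote.
    scores = np.linalg.norm(D - x, axis=1)
    nbrs = np.argsort(scores)[:k]
    return Counter(labels[nbrs]).most_common(1)[0][0]

def klmn_predict(x, D, labels, k, forest):
    # kLMN: identical procedure, except points are ranked by mass-based dissimilarity m_e.
    scores = np.array([mass_dissimilarity(x, z, forest, len(D)) for z in D])
    nbrs = np.argsort(scores)[:k]
    return Counter(labels[nbrs]).most_common(1)[0][0]
```

Here D is an array of training points, labels is the corresponding array of class labels, and forest is an iForest built from D.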

The shortcoming of the kNN classifier is given in the first subsection; the condition under which kLMN has a lower error rate than kNN is provided in the second subsection. Simulations demonstrating the analytical results are presented in the third subsection.

3.1 Shortcoming of the kNN classifier

Let \(\Gamma \subseteq \mathbb {R}^q\) be a q-dimensional real space which is also a metric space \((\Gamma , \ell _p)\) and a probability space \((\Gamma , 2^{\Gamma }, P)\). Let \({{\mathcal {X}}}\) be a subset of \(\Gamma \) with finite volume, on which a classification problem is defined. It is partitioned into two non-overlapping open subsets \({{\mathcal {X}}}_T\) and \({{\mathcal {X}}}_S\), where \({{\mathcal {X}}}_T\) contains positive instances only and \({{\mathcal {X}}}_S\) contains negative instances only, such that the problem is deterministic with Bayes error rate = 0.

Furthermore, let the probability density of instances be non-zero everywhere in \({{\mathcal {X}}}\); and zero outside \({{\mathcal {X}}}\), i.e., \(P(\mathbf{x}) > 0\ \forall \mathbf{x} \in {{\mathcal {X}}}\) and \(P(\mathbf{x}) = 0\ \forall \mathbf{x} \in \Gamma \backslash {{\mathcal {X}}}\). Assume \({{\mathcal {X}}}\) is contiguous and therefore no ‘border region’ with zero probability can exist between the two classes. Let the density of region \({{\mathcal {X}}}_T\) be higher than that of region \({{\mathcal {X}}}_S\): \(\forall \mathbf{x} \in {{\mathcal {X}}}_T, \forall \mathbf{y} \in {{\mathcal {X}}}_S, P(\mathbf{x}) > P(\mathbf{y})\).

Let a training dataset D be the union of a positive instance set \(D_T\) and a negative instance set \(D_S\) belonging to the dense subset \({{\mathcal {X}}}_T\) and the sparse subset \({{\mathcal {X}}}_S\), respectively. We consider a k nearest neighbours (kNN) classifier built from D with \(k \ll |D|\). Let the set of k nearest neighbours of \(\mathbf{x}\) be \(NN_k(\mathbf{x})\), where the numbers of instances belonging to \({{\mathcal {X}}}_T\) and \({{\mathcal {X}}}_S\) are \(k_T(\mathbf{x})\) and \(k_S(\mathbf{x})\), respectively, and \(k_T(\mathbf{x})+k_S(\mathbf{x})=k\). For \(\mathbf{x} \in {{\mathcal {X}}}\), let \(B(\mathbf{x})\) be a ball centred at \(\mathbf{x}\) with the radius being the k-th nearest neighbour distance from \(\mathbf{x}\) in D.

Assume that the curvature of the border between \({{\mathcal {X}}}_T\) and \({{\mathcal {X}}}_S\) is sufficiently small. Hence, we presume that the border section of \({{\mathcal {X}}}_T\) and \({{\mathcal {X}}}_S\) covered by any \(B(\mathbf{x})\) is effectively linear or straight, when \(B(\mathbf{x})\) covers the border. We also assume that the densities of instances in \({{\mathcal {X}}}_T\) and \({{\mathcal {X}}}_S\) vary smoothly; thus we presume that the densities of instances in the intersections of any \(B(\mathbf{x})\) with \({{\mathcal {X}}}_T\) and \({{\mathcal {X}}}_S\) are almost uniform, and denote them as \(\rho _T(\mathbf{x})\) and \(\rho _S(\mathbf{x})\), respectively. These assumptions mostly hold when \(D_T\) and \(D_S\) are massive and smoothly distributed, and the k nearest neighbour distance of \(\mathbf{x}\) is very small because \(k \ll |D|\).

Fig. 4 A dataset has two clusters with different densities. The points in three circles are k nearest neighbours of \(\mathbf{x}_1\), \(\mathbf{x}_2\) and \(\mathbf{x}_3\), respectively (\(k=10\)). \(\mathbf{x}_2\) is a border point with \(k_S({\mathbf{x}_2})=3\) and \(k_T({\mathbf{x}_2})=7\), while \(\mathbf{x}_1\) and \(\mathbf{x}_3\) are non-border points

A kNN border point \(\mathbf{x} \in D_S\) is defined as one which has \(NN_k(\mathbf{x})\) such that \(k_T(\mathbf{x}) \ge 1\); and a kNN non-border point \(\mathbf{x} \in D_S\) is one which has \(NN_k(\mathbf{x})\) such that \(k_T(\mathbf{x}) =0\). Similarly, a kNN border point and a non-border point \(\mathbf{x} \in D_T\) are defined to have \(NN_k(\mathbf{x})\) such that \(k_S(\mathbf{x}) \ge 1\) and \(k_S(\mathbf{x}) = 0\), respectively. Figure 4 illustrates some example border and non-border points in a dataset. Here, we further introduce expectations of \(\rho _T(\mathbf{x})\) and \(\rho _S(\mathbf{x})\) over the instances which are the kNN border points in D and denote them as \(\bar{\rho }_T\) and \(\bar{\rho }_S\), respectively. Then, the following theorem holds.

Theorem 1

In a dataset consisting of a dense subset (\({{\mathcal {X}}}_T\)) and a sparse subset (\({{\mathcal {X}}}_S\)) which do not overlap, the kNN classifier’s misclassification rate in \({{\mathcal {X}}}_S\), \(\varepsilon _S\), is a probabilistically increasing function of the density ratio \(\bar{\rho }_T/\bar{\rho }_S\), and that in \({{\mathcal {X}}}_T\), \(\varepsilon _T\), is most probably zero. The majority of the misclassification errors occur in the region of the sparse \({{\mathcal {X}}}_S\) bordering the dense \({{\mathcal {X}}}_T\).

The proof is given in “Appendix B”.

3.2 The kLMN classifier has a smaller error rate than the kNN classifier under a certain condition

Under the identical problem setting as the kNN classifier, we consider a k lowest probability mass neighbours (kLMN) classifier having \(k \ (\ll |D|)\) that is built from the dataset D and the mass-based dissimilarity. The mass-based dissimilarity \(m_e(\mathbf{x},\mathbf{y})\) between \(\mathbf{x}\) and \(\mathbf{y}\) is given by the expected data mass in a region \(R(\mathbf{x},\mathbf{y}|H;D)\) enclosing both \(\mathbf{x}\) and \(\mathbf{y}\) and containing the smallest number of data points in D within a hierarchical partition model H, where axis-parallel partitions are generated randomly. Because the region \({{\mathcal {X}}}\) is finite, \(R(\mathbf{x},\mathbf{y}|H;D)\) is effectively finite even if the partitions include some open regions. Therefore, we assume a finite \(R(\mathbf{x},\mathbf{y}|H;D)\) without loss of generality.

Let \(LMN_k(\mathbf{x})\) be the set of k lowest probability mass neighbours of \(\mathbf{x}\), and let \(m_e(\mathbf{x})\) be the mass-based dissimilarity between \(\mathbf{x}\) and the k-th lowest probability mass neighbour of \(\mathbf{x}\), i.e., \(m_e(\mathbf{x})=\max _{\mathbf{y} \in LMN_k(\mathbf{x})} m_e(\mathbf{x},\mathbf{y})\). Moreover, define a kLMN region \(R(\mathbf{x})\) as a set of all points where the mass-based dissimilarity between \(\mathbf{x}\) and any point in \(R(\mathbf{x})\) is less than or equal to \(m_e(\mathbf{x})\), i.e.,

$$\begin{aligned} R(\mathbf{x})=\{\mathbf{y} \in {{\mathcal {X}}}\ |\ m_e(\mathbf{x},\mathbf{y}) \le m_e(\mathbf{x})\}. \end{aligned}$$

By definition, \(R(\mathbf{x})\) includes all points in \(LMN_k(\mathbf{x})\).

Here, we introduce two open subsets of \(\Gamma \), \({{\mathcal {X}}}_A\) and \({{\mathcal {X}}}_B\), having the same associated mathematical definitions and assumptions as \({{\mathcal {X}}}_T\) and \({{\mathcal {X}}}_S\) specified in Sect. 3.1. Their densities \(\rho _A(\mathbf{x})\) and \(\rho _B(\mathbf{x})\) are assumed to be different. \({{\mathcal {X}}}_A\) and \({{\mathcal {X}}}_B\) are interchangeably used to represent \({{\mathcal {X}}}_T\) and \({{\mathcal {X}}}_S\) in the proof of Theorem 2.

Fig. 5 Two regions \({{\mathcal {X}}}_A\) and \({{\mathcal {X}}}_B\) have uniform but different densities \(\rho _A(\mathbf{x}) \ne \rho _B(\mathbf{x})\). \(R(\mathbf{x})\) is a kLMN region of a border point \(\mathbf{x} \in {{\mathcal {X}}}_A\) and thus intersects with both \({{\mathcal {X}}}_A\) and \({{\mathcal {X}}}_B\). \(R(\mathbf{x},\mathbf{y}|H;D)\), shown as a rectangle because of axis-parallel partitions, is a region defining \(m_e(\mathbf{x},\mathbf{y})\). \(V(\mathbf{x},\mathbf{y})\) is the intersection of \(R(\mathbf{x})\) with a solid angle cone of \(\mathbf{x}\) capturing \(\mathbf{y}\). \(R_A^i(\mathbf{x})\) and \(R_A^o(\mathbf{x})\) are sub-regions of \(R(\mathbf{x}) \cap {{\mathcal {X}}}_A\) partitioned by a line crossing \(\mathbf{x}\) and parallel to the border between \({{\mathcal {X}}}_A\) and \({{\mathcal {X}}}_B\); they are farther from and closer to the border, respectively. The sub-region \(R_B(\mathbf{x})\) is \(R(\mathbf{x}) \cap {{\mathcal {X}}}_B\)

Figure 5 shows a kLMN region \(R(\mathbf{x})\) over \({{\mathcal {X}}}_A\) and \({{\mathcal {X}}}_B\), where \(\mathbf{x}\) is a border point in terms of \(LMN_k(\mathbf{x})\) and located in \({{\mathcal {X}}}_A\). Consider a point \(\mathbf{y}\) on the edge of \(R(\mathbf{x})\). By the above definition of \(R(\mathbf{x})\), any \(\mathbf{y}\) on the edge of \(R(\mathbf{x})\) has identical \(m_e(\mathbf{x},\mathbf{y})=m_e(\mathbf{x})\). For example, points \(\mathbf{y}_A\) and \(\mathbf{y}_B\) in Fig. 5 have dissimilarities \(m_e(\mathbf{x},\mathbf{y}_A)\) and \(m_e(\mathbf{x},\mathbf{y}_B)\), respectively, which are equal to \(m_e(\mathbf{x})\). We define sub-regions \(R_A(\mathbf{x})\) and \(R_B(\mathbf{x})\) which are the intersections of \(R(\mathbf{x})\) with \({{\mathcal {X}}}_A\) and \({{\mathcal {X}}}_B\), respectively. We split \(R_A(\mathbf{x})\) into \(R_A^i(\mathbf{x})\) and \(R_A^o(\mathbf{x})\) by a line crossing \(\mathbf{x}\) and parallel to the border between \({{\mathcal {X}}}_A\) and \({{\mathcal {X}}}_B\), where \(R_A^o(\mathbf{x})\) is the sub-region closer to the border; and \(R_A^i(\mathbf{x}) = R_A(\mathbf{x}) \setminus R_A^o(\mathbf{x})\). We further introduce \(V(\mathbf{x},\mathbf{y})\): an intersection of \(R(\mathbf{x})\) with a cone of \(\mathbf{x}\) having a solid angle \(\delta \Omega \) which includes point \(\mathbf{y}\); and let the probability mass in \(V(\mathbf{x},\mathbf{y})\) be \(\delta P(V(\mathbf{x},\mathbf{y}))\).

Let \(\tilde{E}_{{{\mathcal {H}}}(D)}[\bar{\rho }(R(\mathbf{x},\mathbf{y}|H;D))]\) be an expectation of the average density in region \(R(\mathbf{x},\mathbf{y}|H;D)\) over \({{\mathcal {H}}}(D)\), and let \(\bar{\rho }(V(\mathbf{x},\mathbf{y}))\) be the average density in \(V(\mathbf{x},\mathbf{y})\), respectively. \(\bar{\rho }(V(\mathbf{x},\mathbf{y}))\) and \(\tilde{E}_{{{\mathcal {H}}}(D)}[\bar{\rho }(R(\mathbf{x},\mathbf{y}|H;D))]\) are considered to be almost identical, and they are in the interval between \(\rho _A(\mathbf{x})\) and \(\rho _B(\mathbf{x})\), since they are both average densities in some vicinity of \(\mathbf{x}\) and \(\mathbf{y}\). We further introduce \(\phi (\mathbf{x})\) which is an upper bound of a ratio \(\{P(R_A^o(\mathbf{x}))+P(R_B(\mathbf{x}))\}/P(R_A^i(\mathbf{x}))\) at \(\mathbf{x}\) where \(P(R_*(\mathbf{x}))\) is the probability mass in \(R_*(\mathbf{x})\), and let \(\bar{\phi }\) be the average of \(\phi (\mathbf{x})\) taken over all the kLMN border points in D. Their rigorous definitions are provided in “Appendix C”.

To simplify the analysis of the kLMN classifier, we introduce the concept of ‘effective dimensions’ of \(R(\mathbf{x},\mathbf{y}|H,D)\). Assume that the border section of \({{\mathcal {X}}}_T\) and \({{\mathcal {X}}}_S\) is effectively linear in \(R(\mathbf{x})\), as in the case in Sect. 3.1. Effective dimensions are defined as the dimensions of \(R(\mathbf{x},\mathbf{y}|H,D)\) which are non-orthogonal to the normal line of the border between \({{\mathcal {X}}}_T\) and \({{\mathcal {X}}}_S\). We let the number of the effective dimensions of \(R(\mathbf{x},\mathbf{y}|H,D)\) be \(\tilde{q}\). For example, \(\tilde{q}\) of the hyper-rectangle \(R(\mathbf{x},\mathbf{y}|H,D)\) depicted in Fig. 6a is 1, since its horizontal dimension is parallel to the border and thus orthogonal to the normal line. On the other hand, \(\tilde{q}\) in Fig. 6b is 2, because both dimensions are non-orthogonal to the normal line. Note that \(1 \le \tilde{q} \le q\) holds. Then, we obtain the following theorem.

Theorem 2

In a dataset consisting of a dense subset (\({{\mathcal {X}}}_T\)) and a sparse subset (\({{\mathcal {X}}}_S\)) which do not overlap, assuming \(\bar{\rho }(V(\mathbf{x},\mathbf{y})) \simeq \tilde{E}_{{{\mathcal {H}}}(D)}[\bar{\rho }(R(\mathbf{x},\mathbf{y}|H,D))]\), the kLMN classifier’s misclassification rate in \({{\mathcal {X}}}_S\): \(\varepsilon _S\) and that in \({{\mathcal {X}}}_T\): \(\varepsilon _T\) have the following properties:

  (i) If the effective dimension of the kLMN classifier \(\tilde{q}=1\), both \(\varepsilon _S\) and \(\varepsilon _T\) are most probably very small.

  (ii) If \(\tilde{q}>1\), \(\varepsilon _S\) is most probably a non-negligible and increasing function of the density ratio \((\bar{\rho }_T/\bar{\rho }_S)^{1-1/\tilde{q}}\) when \(\bar{\phi }\) is close to the upper bound in the interval \([1, (\bar{\rho }_T/\bar{\rho }_S)^{1-1/\tilde{q}}]\), and

  (iii) If \(\tilde{q}>1\), \(\varepsilon _T\) is most probably zero.

The proof is given in “Appendix C”.

Theorem 2 indicates that the kLMN classifier shows a non-negligible error \(\varepsilon _S\) which probabilistically increases with \((\bar{\rho }_T/\bar{\rho }_S)^{1-1/\tilde{q}}\), while \(\varepsilon _T\) is almost zero. It also suggests the possibility of making both \(\varepsilon _S\) and \(\varepsilon _T\) very small by keeping the number of effective dimensions \(\tilde{q}\) at unity.

In summary, Theorems 1 and 2 unveil that

  (a) The majority of misclassification errors of both the kNN and kLMN classifiers occur in the region of the sparse subset, bordering the dense subset. Intuitively, the fact that the sparse subset has significantly fewer points means that it is harder to find points from the sparse subset (than those from the dense subset) to form the k lowest probability mass neighbours (or k nearest neighbours) of a border point. As a result, a border point of either subset is more likely to be predicted to belong to the dense subset. This is the reason why the error rate is higher in the sparse subset than in the dense subset for either kNN or kLMN.

  (b) The kLMN classifier has a lower misclassification error than the kNN classifier. This is because the rates of increase of \(\varepsilon _S\) of the kNN and kLMN classifiers become identical, i.e., \(\bar{\rho }_T/\bar{\rho }_S\), only if \(\tilde{q} \rightarrow \infty \). Otherwise, kLMN’s rate of increase of \(\varepsilon _S\) is slower than that of kNN by a factor of \((\bar{\rho }_T/\bar{\rho }_S)^{1/\tilde{q}}\) as \(\bar{\rho }_T/\bar{\rho }_S\) increases. This conclusion can be made despite the fact that an analytical comparison of \(\varepsilon _S\) between the kNN and kLMN classifiers is intractable.

3.3 Simulations

This section provides two simulations to demonstrate the analytical results presented in the last two sections.

We built one dataset for each of the two simulations. Both datasets have two classes, and each class is generated using a uniform density distribution. The only difference is that the class boundary of the first dataset is axis-parallel while that of the second is non-axis-parallel, as shown in Fig. 6a, b.

Fig. 6 Data distributions used in the two simulations. Labels “P” and “N” indicate positive and negative points, respectively. The dashed lines indicate the boundaries due to kNN (\(\ell _p\) border) and kLMN (\(m_e\) border). An additional boundary due to kLMN, where \(m_e\) is implemented using non-axis-parallel splits, is also shown. (a) Data distribution of simulation 1. (b) Data distribution of simulation 2

In each simulation, we gradually increased the density ratio between the two classes in the training set, and then evaluated the performance of the kNN and kLMN classifiers (where \(k=\sqrt{n}\); n = training set size). We report the results in terms of false negative rate (FNR), false positive rate (FPR), and error rate (ERR), which is the average of FNR and FPR, where the positive class is the one having the higher density. The test set consists of 1250 instances for each class. This ensures a sufficient number of points along the entire border so that the FPR and FNR can be compared across different density ratios (Footnote 5).
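For clarity, the reported measures can be computed as in the short sketch below (the positive class, labelled 1 here, is the denser class, and ERR is the average of FNR and FPR as defined above).

```python
import numpy as np

def error_rates(y_true, y_pred):
    # y_true, y_pred: arrays of 0/1 labels, with 1 = positive (denser) class.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fnr = np.mean(y_pred[y_true == 1] == 0)   # positives misclassified as negative
    fpr = np.mean(y_pred[y_true == 0] == 1)   # negatives misclassified as positive
    return fnr, fpr, (fnr + fpr) / 2          # ERR = average of FNR and FPR
```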

Tables 2 and 3 show the results of the two simulations. In agreement with Theorems 1 and 2, the results show that (i) the majority of the errors of the kNN and kLMN classifiers are in the sparse region, bordering the dense region (having high false positive rate); and (ii) kLMN has a lower false positive rate than kNN, and this leads to a lower error rate.

Table 2 Performance of the kNN and kLMN classifiers in simulation 1. Each result is an average over 10 runs
Table 3 Performance of the kNN and kLMN classifiers in simulation 2. Average over 10 runs

Figure 6 further demonstrates that the boundary of kLMN, unlike that of kNN, is not parallel to the (true) boundary between the dense and sparse regions. Rather, the \(m_e\) border and the true border are closer to each other in the centre, and the gap widens towards the edges. This is because the mass distribution has higher mass in the centre and lower mass at the edges (Ting et al. 2010), even for a uniform density distribution.

4 Mass-based clustering

Here we describe how density-based clustering can be transformed into mass-based clustering by simply replacing the distance measure with mass-based dissimilarity.

A new neighbourhood function using mass-based dissimilarity is introduced in the first subsection. Its characteristics are described in the second subsection. The condition under which mass-based clustering will provide a better clustering result than density-based clustering is presented in the third subsection. The fourth subsection discusses a closely related similarity measure.

4.1 \(\mu \)-neighbourhood mass

We introduce a new function, \(\mu \)-neighbourhood mass, which counts the number of points that have mass-based dissimilarity less than or equal to a maximum value \(\mu \). This is similar to \(\epsilon \)-neighbourhood density (Footnote 6), which counts the set of points within a maximum distance \(\epsilon \) defined by a distance measure. It is defined as follows:

$$\begin{aligned} M_\mu (\mathbf{x}) = \# \{ \mathbf{y} \in D\ |\ m_e(\mathbf{x},\mathbf{y}) \le \mu \} \end{aligned}$$

Like standard \(\epsilon \)-neighbourhood density, \(\mu \)-neighbourhood mass makes an estimate based on a dissimilarity measure. However, the estimate is defined in terms of the expected probability mass (instead of distance). Like \(\epsilon \), the parameter \(\mu \) controls the set size: a large \(\mu \) defines a large set; and a small \(\mu \) defines a small set.
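A direct (brute-force) sketch of this estimator, reusing the hypothetical mass_dissimilarity helper from Sect. 2.2, is:

```python
def mu_neighbourhood_mass(x, D, forest, mu):
    # M_mu(x) = #{ y in D : m_e(x, y) <= mu }
    n = len(D)
    return sum(1 for y in D if mass_dissimilarity(x, y, forest, n) <= mu)
```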

In addition, the model used determines the general shape of \(M_\mu (\mathbf{x})\) (e.g., diamond or circle); and the shape is symmetric only if the distribution of self-dissimilarity is symmetric wrt \(\mathbf{x}\). The boundary of \(M_\mu (\mathbf{x})\) is closer to (further from) \(\mathbf{x}\) if the self-dissimilarities of points between \(\mathbf{x}\) and the boundary are high (low).

Figure 7a, b compares \(\epsilon \)-neighbourhood with \(\mu \)-neighbourhood on a dataset having three areas of different densities. Note that the volume of the region occupied by the points of \(\mu \)-neighbourhood mass depends on the data distribution: it is small in dense areas and large in sparse areas, as demonstrated by areas A, B and C in Fig. 7b. Note also that the overall shape is not symmetric. In contrast, the \(\epsilon \)-neighbourhood forms a region of constant volume with a regular, symmetric shape, independent of the data distribution.

Fig. 7 (a) and (b) show the two sets of points defined by \(\epsilon \)-neighbourhood density (\(\epsilon =0.25\)) and \(\mu \)-neighbourhood mass (level = 4 iForest and \(\mu =0.5\)), respectively, on a dataset having three areas of different densities, with reference to the red point (0.5, 0.5). The dark-coloured dots denote the set of points defined by either the \(\epsilon \)-neighbourhood or the \(\mu \)-neighbourhood estimator; and the blue-coloured dots denote the points outside. (a) Points defined by \(\epsilon \)-neighbourhood. (b) Points defined by \(\mu \)-neighbourhood (Color figure online)

Fig. 8 (a) and (b) show the two sets of points defined by \(\mu \)-neighbourhood mass (\(\mu =0.55\)) using axis-parallel iForest and non-axis-parallel iForest, respectively, on a dataset with a uniform density distribution, with reference to the red point (0.5, 0.5). (a) \(\mu \)-neighbourhood (axis-parallel iForest). (b) \(\mu \)-neighbourhood (non-axis-parallel iForest) (Color figure online)

Fig. 9 Self-dissimilarity \(m_e(\mathbf{x},\mathbf{x})\) shown in (a) directly influences the measurement \(M_{\mu }(\mathbf{x})\) (\(\mu =0.38\)) shown in (b). \(M_{\mu }(\mathbf{x})\) (\(\mu =0.60\)) at three different points is shown in (c), (d) and (e), where the underlay is the self-dissimilarity. The set of points defined by \(M_{\mu }(\mathbf{x})\) is indicated as boldfaced points. (a) Self-dissimilarity \(m_e\). (b) \(\mu \)-neighbourhood mass \(M_\mu \). (c) \(M_\mu \) at \(\mathbf{x}=(0.25, 0.25)\). (d) \(M_\mu \) at \(\mathbf{x}=(0.25, 0.75)\). (e) \(M_\mu \) at \(\mathbf{x}=(0.75, 0.75)\)

Figure 8a shows that the region occupied by the set of points of the \(\mu \)-neighbourhood becomes symmetric only in the case of a uniform density distribution. Figure 8 also shows that the shape of the region depends on the implementation: axis-parallel and non-axis-parallel random trees yield diamond and spherical shapes, respectively. These shapes are similar to those of \(\ell _p\) with \(p=1\) and \(p=2\), respectively.

Figure 9 shows the influence of self-dissimilarity on \(M_\mu \): for a fixed \(\mu \), the \(\mu \)-neighbourhood \(M_\mu (\mathbf{x})\) covers a small region when \(\mathbf{x}\) is in a high mass area of self-dissimilarity (and vice versa).

A conceptual comparison between the two neighbourhood functions is given in Table 4. In order to obtain non-zero \(M_\mu \), \(\mu \) must be set higher than \(\max _{\mathbf{z} \in D} m_e(\mathbf{z},\mathbf{z})\).

Table 4 \(\mu \)-neighbourhood mass versus \(\epsilon \)-neighbourhood density

For the purpose of clustering, mass can be used in a similar way as density to identify core points, i.e., points having high mass values are core points; and those having low mass values are noise.
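As a rough illustration of this idea (a simplified DBSCAN-style grouping, not the exact procedure evaluated later in the paper; the function name and the treatment of non-core points are our own simplifications), core points can be selected with a single global threshold on \(M_\mu \) and then linked whenever their mutual \(m_e\) is within \(\mu \):

```python
import numpy as np

def mass_based_clusters(D, forest, mu, tau):
    # 1) Compute all pairwise m_e values (brute force) and the mu-neighbourhood mass of each point.
    n = len(D)
    m = np.array([[mass_dissimilarity(D[i], D[j], forest, n) for j in range(n)] for i in range(n)])
    M_mu = (m <= mu).sum(axis=1)
    # 2) Core points: mu-neighbourhood mass of at least tau (a single global threshold).
    core = np.where(M_mu >= tau)[0]
    # 3) Link core points whose mutual dissimilarity is within mu; other points are left as noise (-1).
    labels = -np.ones(n, dtype=int)
    cid = 0
    for c in core:
        if labels[c] != -1:
            continue
        labels[c] = cid
        stack = [c]
        while stack:
            p = stack.pop()
            for q in core:
                if labels[q] == -1 and m[p, q] <= mu:
                    labels[q] = cid
                    stack.append(q)
        cid += 1
    return labels
```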

The next subsection describes characteristics of \(\mu \)-neighbourhood mass which make it a better candidate than \(\epsilon \)-neighbourhood density to identify core points, especially in a dataset which has clusters of varying densities.

4.2 Characteristics of \(\mu \)-neighbourhood mass

4.2.1 Rate of change of \(M_\mu \)

For a distribution of self-dissimilarity \(m_e(\mathbf{x},\mathbf{x})\) which has the same number of modes as that in the density distribution, we have the following observation for the rate of change of \(M_\mu (\mathbf{x})\):

Observation 1

The rate of change of \(M_\mu (\mathbf{x})\) (per unit change in \(\mu \)) is larger in dense regions than in sparse regions.

While this is the same characteristic as in the \(\epsilon \)-neighbourhood density function, there are two key differences. First, the non-zero \(M_\mu \) values do not start at the same \(\mu \), whereas all non-zero \(N_\epsilon \) values start (i.e., take a value just above zero) at the same \(\epsilon \). Second, the rate of change of \(M_\mu (\mathbf{x})\) is proportional to the self-dissimilarity \(m_e(\mathbf{x},\mathbf{x})\) and its curvature (a concave curvature slows the rate and a convex curvature accelerates it). In contrast, the rate of change of \(\epsilon \)-neighbourhood density is proportional to the density of \(\mathbf{x}\) only.

In addition, while \(m_e(\mathbf{x},\mathbf{x})\) is high in dense regions and low in sparse regions in general, its distribution is centre-outward which is high at the centre and low at the edges (Ting et al. 2010; Chen et al. 2015), even in the case of a uniform density distribution.

Let \(\mathbf{x}_{p1}\), \(\mathbf{x}_{p2}\) and \(\mathbf{x}_{p3}\) be the points having the largest self-dissimilarity in each of the three clusters in Fig. 10a; and \(\mathbf{x}_{v1}\) and \(\mathbf{x}_{v2}\) be the points having the smallest self-dissimilarity in each of the two valleys. Figure 10b, c shows the rates of change of \(M_\mu \) and \(N_\epsilon \), respectively, for the five points.

Figure 10b shows that the five points have different starting \(\mu \) values which produce the first non-zero \(M_\mu \); and each starting \(\mu \) is the self-dissimilarity of the point. For example, the curve for \(\mathbf{x}_{v1}\) starts at \(\mu = m_e(\mathbf{x}_{v1}, \mathbf{x}_{v1})\) and \(M_\mu (\mathbf{x}_{v1}) = 1\); and it starts after \(\mathbf{x}_{v2}\) and \(\mathbf{x}_{p3}\), and before \(\mathbf{x}_{p1}\) and \(\mathbf{x}_{p2}\).

Fig. 10 Rates of change of \(N_\epsilon \) and \(M_\mu \) (shown in (c) and (d), respectively) for the five points in the distribution of \(m_e(\mathbf{x},\mathbf{x})\) shown in (a). (b) The starting \(\mu \) value for each line (which has the first non-zero \(M_\mu \)) is the self-dissimilarity of the point, e.g., Peak 1 starts with \(\mu = m_e(\mathbf{x}_{p1},\mathbf{x}_{p1})\) and \(M_\mu (\mathbf{x}_{p1})=1\). (d) This is the same plot as (b), except that all the starting points are re-aligned to \(\mu =0\) so that the relative rates of change can be compared readily. (a) Self-dissimilarity distribution of Fig. 1a. (b) \(M_\mu \) versus \(\mu \). (c) \(N_\epsilon \) versus \(\epsilon \). (d) \(M_\mu \) versus \(\Delta \mu \) (all starting points aligned)

The relative rate of change for the five points becomes apparent when the starting \(\mu \) values are re-aligned to the same point. This is shown in Fig. 10d. Peak 1 and Peak 2 have approximately the same fastest rate and they are the highest two peaks. Peak 3 has the next fastest rate. The convex curvature for each peak increases the rate, in comparison with valleys which have the concave curvature. This is the reason why Peak 3 has a faster rate than Valley 1, though the latter has higher self-dissimilarity. Valley 2 has the lowest self-dissimilarity and rate of change.

4.2.2 Converting from inseparable distributions to separable distributions for clustering

Another interesting characteristic of \(\mu \)-neighbourhood mass is that the valleys of a data distribution can be constrained within a small range of mass values by setting an appropriate \(\mu \). This characteristic is especially important for clustering algorithms which rely on a global threshold to identify core points before grouping the core points into separate clusters. With all the valleys close to a small mass value, a global threshold slightly larger than this value will identify the majority of the core points of all clusters, irrespective of the densities of the clusters.

In contrast, \(\epsilon \)-neighbourhood density does not possess this characteristic because it estimates the density distribution, whose valleys can have hugely different densities. As a result, a clustering algorithm such as DBSCAN (Ester et al. 1996), which employs the \(\epsilon \)-neighbourhood density estimator and relies on a global threshold to identify core points, is unable to detect all clusters of varying densities. This kind of distribution, shown in Fig. 11a, is called an inseparable distribution because it is impossible for DBSCAN to identify all the clusters in the distribution using a global threshold.

Fig. 11 Density distribution of an “inseparable distribution” and its estimations using \(\epsilon \)-neighbourhood density (\(\epsilon =0.03\)) and \(\mu \)-neighbourhood mass based on iForest (level = 8 and \(\mu =0.2\)), from a sample of 1500 points of (a). (a) True density distribution. (b) \(\epsilon \)-neighbourhood density. (c) \(\mu \)-neighbourhood mass

Possessing the above-mentioned characteristic, a distribution that is inseparable in terms of density will appear as a separable distribution in terms of mass. Examples of \(\epsilon \)-neighbourhood density and \(\mu \)-neighbourhood mass are given in Fig. 11b, c, respectively. A single threshold can then be used to identify all three clusters in the reshaped distribution in Fig. 11c. This is impossible if the density distribution is estimated instead, as shown in Fig. 11b.

In summary, \(\mu \)-neighbourhood mass is able to convert all valleys of hugely different densities to become valleys of low mass of approximately the same value by using an appropriate \(\mu \). As a result of the mass-based dissimilarity measure used, the estimated distribution has a characteristic which enables an inseparable distribution in terms of density to be reshaped into a separable distribution in terms of mass for clustering.

4.3 Condition under which mass-based clustering performs better than density-based clustering

The exact conditions under which density-based clustering fails and mass-based clustering succeeds in identifying all clusters in a dataset are given in the following two subsections.

4.3.1 Condition under which density-based clustering fails

Here we reiterate the condition under which density-based clustering fails, recently disclosed by Zhu et al. (2016).

If a density-based clustering algorithm uses a global threshold on the estimated density to identify core points and links neighbouring core points together to form clusters, then the requirement given in Eq. 5, based on the estimated density and the density-based clusters, provides a necessary condition for the algorithm to be able to identify all clusters (Zhu et al. 2016):

$$\begin{aligned} \min _{\begin{array}{c} k\in \lbrace 1,\ldots ,\varsigma \rbrace \end{array}} h_{k} > \max _{\begin{array}{c} i\ne j\in \lbrace 1,\ldots ,\varsigma \rbrace \end{array}} g_{ij} \end{aligned}$$
(5)

where \(h_{k}\) is the density of the mode of each cluster \(C_{k}\) from a total of \(\varsigma \) clusters; and \(g_{ij}\) is the largest of the minimum estimated density along any path linking clusters \(C_{i}\) and \(C_{j}\).

The condition requires that the estimated density at the mode of any cluster is greater than the maximum \(g_{ij}\) along any path linking any two modes. It implies that there must exist a threshold \(\tau \) that can be used to break all paths between the modes by assigning regions with the estimated density less than \(\tau \) to noise, i.e.,

$$\begin{aligned} \exists _{\tau } \forall _{k,i\ne j \in \lbrace 1,\ldots ,\varsigma \rbrace } \ h_{k} \geqslant \tau > g_{ij} \end{aligned}$$

In summary, on a dataset whose density distribution satisfies the following condition (Zhu et al. 2016):

$$\begin{aligned} \min _{\begin{array}{c} k\in \lbrace 1,\ldots ,\varsigma \rbrace \end{array}} h_{k} \ngtr \max _{\begin{array}{c} i\ne j\in \lbrace 1,\ldots ,\varsigma \rbrace \end{array}} g_{ij} \end{aligned}$$
(6)

the density-based clustering will fail to detect all clusters in the dataset.
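In one dimension, condition (5) is easy to check numerically. The following toy sketch (the grid values are hypothetical, chosen only to mimic the scenario described above) takes an estimated density on a grid together with the indices of the cluster modes, and reports whether a single global threshold can separate all clusters:

```python
import numpy as np

def separable_1d(est, mode_idx):
    # Check condition (5) on a 1-D grid: min_k h_k > max_{i != j} g_ij.
    # In 1-D the only path between two modes is the interval between them, so g_ij is that
    # interval's deepest point, and the overall maximum is attained between adjacent modes.
    mode_idx = sorted(mode_idx)
    h = [est[k] for k in mode_idx]
    g = [est[a:b + 1].min() for a, b in zip(mode_idx[:-1], mode_idx[1:])]
    return min(h) > max(g)

# Two dense clusters and one sparse cluster whose mode (0.2) is lower than the valley (0.4)
# separating the dense clusters: condition (6) holds, so no single global threshold works.
est = np.array([0.05, 0.9, 0.4, 0.8, 0.1, 0.2, 0.05])
print(separable_1d(est, mode_idx=[1, 3, 5]))   # False
```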

4.3.2 Condition under which density-based clustering fails but mass-based clustering succeeds

By using \(\mu \)-neighbourhood mass function instead of \(\epsilon \)-neighbourhood density function in DBSCAN, we effectively convert the density-based clustering algorithm to a mass-based clustering algorithm, where clusters are defined in terms of mass rather than density.

Let \(\mathbf{c}_k\) be the mode of cluster \(C_k, {k\in \lbrace 1,\ldots ,\varsigma \rbrace }\), in the distribution of self-dissimilarity \(m_e(\mathbf{x},\mathbf{x})\) (Footnote 7); and when using \(M_\mu (\cdot )\), \(\mu \) must be set larger than the maximum value of self-dissimilarity (Footnote 8).

Observation 2

For a density distribution satisfying condition (6), there exists some \( \mu > \max _{k\in \lbrace 1,\ldots ,\varsigma \rbrace } m_e(\mathbf{c}_k, \mathbf{c}_k)\) such that the distribution of \(M_\mu (\cdot )\) satisfies the following condition:

$$\begin{aligned} \min _{\begin{array}{c} k\in \lbrace 1,\ldots ,\varsigma \rbrace \end{array}} M_\mu (\mathbf{c}_k) > \max _{\begin{array}{c} i\ne j\in \lbrace 1,\ldots ,\varsigma \rbrace \end{array}} \acute{{{\mathfrak {g}}}}_{ij} \end{aligned}$$
(7)

where \(\acute{{{\mathfrak {g}}}}_{ij}\) is the largest of the minimum estimated \(M_\mu (\cdot )\) along any path linking cluster \(C_{i}\) and \(C_{j}\).

The following reasoning relies on Observation 1: the rate of change of \(M_\mu (\cdot )\) is faster in dense regions than in sparse regions.

Let \({{{\mathfrak {g}}}}_a\) and \({{{\mathfrak {g}}}}_b\) be the maximum and minimum of \({{{\mathfrak {g}}}}_{ij}\) in the distribution of self-dissimilarity \(m_e(\cdot ,\cdot )\); and \(\min _{\begin{array}{c} k\in \lbrace 1,\ldots ,\varsigma \rbrace \end{array}} m_e(\mathbf{c}_k,\mathbf{c}_k) \ngtr \max _{\begin{array}{c} i\ne j\in \lbrace 1,\ldots ,\varsigma \rbrace \end{array}} {{{\mathfrak {g}}}}_{ij}\), as in condition (6) of the density distribution of the same dataset. Let \(\mathbf{x}_a\) and \(\mathbf{x}_b\) be the points for \({{{\mathfrak {g}}}}_a\) and \({{{\mathfrak {g}}}}_b\), respectively; and \(\mathbf{x}_h = {{\mathrm{argmax}}}_{\mathbf{c}_k, \begin{array}{c} k\in \lbrace 1,\ldots ,\varsigma \rbrace \end{array}} m_e(\mathbf{c}_k,\mathbf{c}_k)\) is the peak of all clusters. In plain language, \(\mathbf{x}_a\) and \(\mathbf{x}_b\) are valleys in the dense and sparse regions, respectively; and \(\mathbf{x}_h\) is the peak of all dense regions. Therefore, the self-dissimilarity has the characteristic: \(m_e(\mathbf{x}_h,\mathbf{x}_h)> m_e(\mathbf{x}_a,\mathbf{x}_a) > m_e(\mathbf{x}_b,\mathbf{x}_b)\).

In the distribution of \(M_\mu (\cdot )\), the aim is to find \({\mu }\) such that \(M_\mu (\mathbf{x}_a) \approx M_\mu (\mathbf{x}_b)\) and \(\min _{k\in \lbrace 1,\ldots ,\varsigma \rbrace } M_\mu (\mathbf{c}_k) > \max (M_\mu (\mathbf{x}_a), M_\mu (\mathbf{x}_b))\). That is, condition (7) will be satisfied because every mode will have a mass greater than that of all valleys.

Let \(\rho (M_\mu )\) be the rate of increase of \(M_\mu (\cdot )\) for per unit increase in \(\mu \). For \(\mu > m_e(\mathbf{x}_a,\mathbf{x}_a)\), \(\rho (M_\mu (\mathbf{x}_a)) > \rho (M_\mu (\mathbf{x}_b))\) based on Observation 1 and that \(\mathbf{x}_a\) is in a denser region than \(\mathbf{x}_b\).

Note that \(\mu = m_e(\mathbf{x}_a,\mathbf{x}_a)\) will give \(M_\mu (\mathbf{x}_a)=1\) and \(M_\mu (\mathbf{x}_b) > 1\) because \(m_e(\mathbf{x}_a,\mathbf{x}_a) > m_e(\mathbf{x}_b,\mathbf{x}_b)\). Because \(M_\mu (\cdot )\) is a monotonically increasing function of \(\mu \), and \(M_\mu (\mathbf{x}_a)\) begins from a lower base but increases at a faster rate, \(M_\mu (\mathbf{x}_a)\) will catch up with \(M_\mu (\mathbf{x}_b)\) at some \(\mu > m_e(\mathbf{x}_a,\mathbf{x}_a)\).

Because \(m_e(\mathbf{x}_h,\mathbf{x}_h) > m_e(\mathbf{x}_a,\mathbf{x}_a)\), \(\mu = m_e(\mathbf{x}_h,\mathbf{x}_h)\) will give \(M_\mu (\mathbf{x}_h)=1\) and \(M_\mu (\mathbf{x}_a) > 1\). A value of \(\mu > m_e(\mathbf{x}_h,\mathbf{x}_h)\) can then be set such that \(\min _{k\in \lbrace 1,\ldots ,\varsigma \rbrace } M_\mu (\mathbf{c}_k) > \max (M_\mu (\mathbf{x}_a), M_\mu (\mathbf{x}_b))\). An example of these increases can be found in Fig. 10b, where \(\mu \ge 0.12\) satisfies this requirement.

The above increase in \(\mu \) assumes that the increases in \(M_\mu (\mathbf{x}_a)\) and \(M_\mu (\mathbf{c}_k)\) occur in the dense region and the increase in \(M_\mu (\mathbf{x}_b)\) occurs in the sparse region such that the relative rate of change between dense and sparse regions stays approximately the same.

Thus, the above aim can be achieved by increasing \(\mu \) beyond \(m_e(\mathbf{x}_h,\mathbf{x}_h)\), provided the relative rate of increase of \(M_\mu \) satisfies the stated assumption. \(\square \)

Once the above condition is satisfied, the mass-based clustering algorithm employing a threshold \({\tau }\) can identify all clusters, because it breaks all paths between the modes by assigning regions with estimated \(M_\mu (\cdot )\) less than \({\tau }\) to noise, i.e.,

$$\begin{aligned} \exists _{{\tau }} \forall _{k,i\ne j \in \lbrace 1,\ldots ,\varsigma \rbrace } M_{{\mu }}(\mathbf{c}_k) \geqslant {\tau } > \acute{{{\mathfrak {g}}}}_{ij} \end{aligned}$$
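To make the role of the global threshold concrete, the following toy sketch (our own illustration; the \(M_\mu \) values and \(\tau \) are invented) applies a single threshold to a 1-D sequence of estimated masses: the valley points fall below \(\tau \) and become noise, which breaks the path between the two modes and leaves each cluster as a separate connected run of points.

```python
import numpy as np

# Hypothetical mu-neighbourhood mass estimates along a 1-D ordering of points;
# the two high plateaus play the role of cluster modes, the dip is a valley.
M_mu = np.array([9, 8, 9, 7, 3, 2, 3, 8, 9, 8], dtype=float)
tau = 5.0  # a global threshold with M_mu(c_k) >= tau > valley mass for every mode

is_noise = M_mu < tau          # valley points become noise ...
core = np.where(~is_noise)[0]  # ... which breaks the path between the two modes

# Contiguous runs of non-noise points form the recovered clusters.
clusters = np.split(core, np.where(np.diff(core) > 1)[0] + 1)
print([c.tolist() for c in clusters])   # [[0, 1, 2, 3], [7, 8, 9]]
```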

Is it possible that \(\mu \)-neighbourhood mass may convert a separable (density) distribution to an inseparable (mass) distribution? We are unaware of any such conditions. We believe that such conditions are rare in practice because the parameter \(\mu \) provides sufficient flexibility to morph from an inseparable distribution to a separable distribution. Such conditions did not appear in our experiments reported in Sect. 6.2.

4.3.3 Simulations

Figure 12 shows simulations of two inseparable distributions in which density-based clustering fails to identify all clusters in each case; and the conversion to separable distributions using \(\mu \)-neighbourhood mass where mass-based clustering successfully identifies all clusters in each case. This is despite the fact that both density-based clustering and mass-based clustering are using exactly the same clustering procedure, except for the substitution of the dissimilarity measure.

Fig. 12

a and c are the distributions of the \(\epsilon \)-neighbourhood density of two “inseparable distribution” datasets. b and d are the distributions of the \(\mu \)-neighbourhood mass of (a) and (c), based on iForest (level \(=8\) and \(\mu =0.22\)), respectively

4.4 Relation to SNN similarity

A measure based on shared nearest neighbours (SNN) in k nearest neighbours has been proposed for clustering (Jarvis and Patrick 1973):

“Data points are similar to the extent that they share the same nearest neighbours; in particular, two data points are similar to the extent that their respective k nearest neighbour lists match. In addition, for this similarity measure to be valid, it is required that the tested points themselves belong to the common neighbourhood.”

SNN (dis)similarity has been used to replace the distance measure in DBSCAN as a way to overcome its inability to find all clusters of varying densities (Ertöz et al. 2003).

Here we show that SNN similarity is a mass-based similarity, and the corresponding neighbourhood estimator is a variant of \(\mu \)-neighbourhood mass estimator.

Using a similar notation, let \(R(\mathbf{x}|H)\) be the (implicit) region which covers the k nearest neighbours of \(\mathbf{x}\), where H is kNN. \(SNN(\mathbf{x},\mathbf{y})\) can be defined as the neighbourhood mass (i.e., the number of nearest neighbours) shared by \(\mathbf{x}\) and \(\mathbf{y}\):

$$\begin{aligned} SNN(\mathbf{x},\mathbf{y}) = |R(\mathbf{x}|H) \cap R(\mathbf{y}|H)| \end{aligned}$$
(8)

where the intersection must include both \(\mathbf{x}\) and \(\mathbf{y}\); otherwise \(SNN(\mathbf{x},\mathbf{y})=0\).

Let \(s_k(\mathbf{x},\mathbf{y}) = SNN(\mathbf{x},\mathbf{y})/k\). The neighbourhood function based on the SNN similarity can be expressed as:

$$\begin{aligned}M_\sigma (\mathbf{x}) = \# \{ \mathbf{y} \in D\ |\ s_k(\mathbf{x},\mathbf{y}) \ge \sigma \} \end{aligned}$$

Note that \(M_\sigma (\mathbf{x})\), like \(M_\mu (\mathbf{x})\), cannot be treated as a density because the volume used to compute \(M_\sigma \) for every \(\mathbf{x}\) is not constant for a fixed \(\sigma \). In other words, we disclose that the SNN clustering algorithm (Ertöz et al. 2003) is a type of mass-based clustering method, not a density-based clustering method. This is despite the fact that SNN employs the DBSCAN procedure by replacing the distance measure with the (inverse) SNN similarity (Ertöz et al. 2003; Tan et al. 2005).
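As an illustration, the following sketch computes the SNN similarity of Eq. (8) and the neighbourhood function \(M_\sigma \) directly from their definitions. It is our own minimal implementation (the neighbourhood lists here include the query point itself), not the code of Ertöz et al. (2003).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_matrix(X, k):
    """SNN similarity of Eq. (8): number of shared k-nearest neighbours,
    set to 0 unless each point appears in the other's neighbourhood list."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    knn = nn.kneighbors(X, return_distance=False)   # each row: k NN indices (incl. self)
    sets = [set(row) for row in knn]
    n = len(X)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i in sets[j] and j in sets[i]:
                S[i, j] = len(sets[i] & sets[j])
    return S

def mass_sigma(X, k, sigma):
    """Neighbourhood function M_sigma(x) = #{y : SNN(x, y)/k >= sigma}."""
    s = snn_matrix(X, k) / k
    return (s >= sigma).sum(axis=1)

X = np.random.RandomState(0).rand(50, 2)
print(mass_sigma(X, k=10, sigma=0.5)[:5])
```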

The advantages of using iForest instead of k nearest neighbours to estimate the neighbourhood mass are:

  • The SNN similarity matrix is sensitive to the neighbourhood list size k. In contrast, iForest works well with a default setting.

  • The SNN clustering algorithm (Ertöz et al. 2003) has \(O(k^2n^2)\) time complexity, or \(O(n^3)\) when \(k=\sqrt{n}\) or larger, where \(n = |D|\). In contrast, the same algorithm using mass-based dissimilarity has the same \(O(n^2)\) time complexity as DBSCAN, apart from an additional preprocessing step to compute the dissimilarity matrix, which takes \(O(t\log \psi (\psi + n^2))\), or \(O(n^2)\) since \(\psi \ll n\). When a distance measure is used, the cost of the dissimilarity matrix is also \(O(n^2)\).

5 Shortcomings of existing distance-based algorithms

The neighbourhood of a point has been used for different functions in various data mining tasks. Table 5 summarises the key functions and key shortcomings of four existing algorithms relying on the distance measure. It is instructive to see that a mere replacement of distance with the mass-based dissimilarity in these algorithms changes the perspective from finding the nearest neighbourhood to the lowest probability mass neighbourhood, though both denote the least dissimilar neighbourhood. The corresponding ‘new’ functions are described in the last column of Table 5. Although the rest of the procedures in each algorithm are unchanged, the mass-based dissimilarity overcomes the key shortcomings of these algorithms through finding the lowest probability mass neighbourhood, rather than the nearest neighbourhood.

Table 5 Key functions and key shortcomings of algorithms that rely on distance in four tasks and their replacement functions due to mass-based dissimilarity
Table 6 Properties of datasets used in the experiments

As our analyses are focused on classification and clustering, the empirical evaluations in the next section are conducted on these two tasks. The results of the evaluations in multi-label classification and anomaly detection are presented by Ting et al. (2016).

6 Empirical evaluation

The evaluations in classification and clustering are provided in the following two subsections; and the runtime evaluation is presented in the third subsection.

Thirty datasets are used in the experiments: twenty-eight are from the UCI Machine Learning Repository (Lichman 2013) and two are synthetic (S1 and S2). Table 6 presents the properties of these datasets. We have focused on datasets containing numeric attributes only, where Euclidean distance has a natural interpretation. The majority of the first twenty datasets are 2-class problems, which are the focus of the analysis for classification; therefore, they are used for evaluating the kNN and kLMN classifiers. More than half of the last twenty datasets have more than two classes; they are chosen to examine the capability of the clustering algorithms. All these datasets represent a good mix of different numbers of attributes (2–147), data sizes (150–11055) and classes (2–100).

All datasets are normalised using min-max normalisation (unless stated otherwise), so that each attribute lies in [0, 1], before the experiments begin.

Fig. 13

Density distributions of S1 and S2. a S1: “inseparable distribution”. b S2: “separable distribution”

S1 and S2 are the synthetic datasets created to examine the condition under which density-based clustering fails (Zhu et al. 2016), stated in Sect. 4.3. S1 is an “inseparable distribution” which contains 3 Gaussian clusters N(mean, std) with means located at \((x_1,x_2)\)=(3.3, 9.3), (8, 5), (12, 12), and standard deviations \(std=3, 3, 8\) in each dimension; each cluster has 300 points. S2 is a “separable distribution” which has 3 Gaussian clusters N(mean, std) with means located at \((x_1,x_2)=(10,10), (20,20), (60,60)\), and \(std=2, 2, 11\) in each dimension; each cluster has 500 points. The density plots of S1 and S2 are shown in Fig. 13.
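The two synthetic datasets can be reproduced, up to sampling noise, with a few lines; the random seed and the use of isotropic Gaussians with the stated per-dimension standard deviations are our assumptions.

```python
import numpy as np

rng = np.random.RandomState(42)  # seed is arbitrary

def gaussian_clusters(means, stds, size):
    """Isotropic 2-D Gaussian clusters: one N(mean, std^2 I) blob of `size` points each."""
    return np.vstack([rng.normal(loc=m, scale=s, size=(size, 2))
                      for m, s in zip(means, stds)])

# S1: "inseparable distribution" -- 3 clusters x 300 points
S1 = gaussian_clusters(means=[(3.3, 9.3), (8, 5), (12, 12)], stds=[3, 3, 8], size=300)

# S2: "separable distribution" -- 3 clusters x 500 points
S2 = gaussian_clusters(means=[(10, 10), (20, 20), (60, 60)], stds=[2, 2, 11], size=500)

print(S1.shape, S2.shape)   # (900, 2) (1500, 2)
```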

6.1 kNN classifiers versus kLMN classifiers

This section compares the kNN classifier with the kLMN classifier to assess the relative utility of distance measure and mass-based dissimilarity.

In addition, we also include the Extended Nearest Neighbour classifier (ENN) (Tang and He 2015) and two supervised approaches to distance metric learning called large margin nearest neighbour (LMNN) (Weinberger and Saul 2009) and geometric mean metric learning (GMML) (Zadeh et al. 2016) in the comparison.

ENN is an improved version of kNN which makes its prediction based not only on the nearest neighbours of each test point, but also on those instances which have the test point among their own nearest neighbours.

LMNN learns a distance metric such that the k nearest neighbours of a test point belong to the same class, and points from different classes are separated from the k nearest neighbours by a large margin. Because LMNN uses the maximum margin in its objective function, there is no closed-form solution.

GMML is a recent method for distance metric learning. It learns a linear transformation (distance metric) such that instances of different classes become more dissimilar and those of the same class become more similar. The learned metric is a Mahalanobis distance parameterised by a symmetric positive definite matrix, which has a closed-form solution.

A conceptual comparison between distance metric learning and mass-based dissimilarity is provided in Sect. 8. Here we denote LMNN as the kNN classifier which employs the LMNN learned metric; and GMML as the kNN classifier which employs the GMML learned metric.

All classifiers used in the experiments are implemented in MatLab.Footnote 9

The parameters used are as follows. The number of nearest neighbours k is set to 5 for kNN, ENN, LMNN, GMML and kLMN. The number of dimensions of the learned metric in LMNN and GMML is set to the original number of dimensions of each dataset. The default setting (i.e., \(\psi =256\) and \(t=100\)) (Liu et al. 2008) is used to generate iForest, which is used to compute \(m_e\) for kLMN.

We report the average accuracy of 5-fold cross-validation on each dataset. A post-hoc Nemenyi testFootnote 10 is used to examine whether the difference between any two classifiers is significant.
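For the distance-based baseline, the protocol is straightforward to reproduce. The sketch below runs 5-fold cross-validation of kNN with \(k=5\) on min-max normalised data; the dataset is a stand-in, and the mass-based and metric-learning variants are omitted.

```python
from sklearn.datasets import load_breast_cancer       # stand-in for a UCI dataset
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
# kNN with k = 5 on min-max normalised attributes, 5-fold cross-validation
clf = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```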

Table 7 A comparison of kNN, ENN, LMNN, GMML and kLMN on data with min-max normalisation

The result shown in Table 7 (on normalised datasets) demonstrates that kLMN has the best overall result, with the best accuracy on 7 datasets; it is followed by GMML, LMNN, ENN and kNN with the best accuracy on 5, 3, 3 and 2 datasets, respectively. kLMN also has the highest average rank among the five algorithms, as shown in Fig. 14a. This result shows that mass-based dissimilarity provides a significantly better closest match neighbourhood than the distance measure used in kNN for classification. Although kLMN is not significantly better than LMNN and GMML, kLMN is still the preferred choice because it does not need a computationally expensive learning process.

The result on the unnormalised data shown in Table 8 discloses an interesting phenomenon: normalisation has the highest impact on kNN and ENN, where huge differences can be seen on Heart, Parkinson, Wilt, GPS, SPAM and Urban (comparing their individual results between Tables 7 and 8). The sum of the absolute differences between the accuracies with and without normalisation for each of the five algorithms is shown in Table 9. The impact is smaller for LMNN and GMML, but huge differences can still be seen on Austra (LMNN only), Parkinson (GMML only) and Urban. The smallest impact is on kLMN, where there are no huge differences on any dataset. This shows that kLMN has the highest capability to deal with varying data scales across the attributes of the same dataset, followed by LMNN and GMML.

Fig. 14

Critical difference (CD) diagram of the post-hoc Nemenyi test \((\alpha = 0.05)\). The difference between two algorithms is significant if the gap between their ranks is larger than the CD. A line connecting two algorithms indicates that the rank gap between them is smaller than the CD. a Normalised datasets shown in Table 7. b Unnormalised datasets shown in Table 8

Table 8 A comparison of kNN, ENN, LMNN, GMML and kLMN on data without normalisation
Table 9 The absolute difference between the accuracies with and without normalisation in each dataset shown in Tables 7 and 8. The result shown is the sum over the 20 datasets

The source of this capability in kLMN is iForest, where the random split in each node of a tree largely nullifies the effect of the value range of each attribute while maintaining the relative order of the two subsets resulting from the split (i.e., one subset has larger values than the other). In evaluating a test point, its value on an attribute influences only which of the two branches of a node it traverses. In other words, the outcome of mass-based dissimilarity is invariant to linear scaling of an attribute.Footnote 11 Thus, the value ranges of different attributes can differ hugely without substantially influencing the outcome of the dissimilarity measure, unlike in the case of \(\ell _p\). This is a bonus advantage of mass-based dissimilarity over distance measures.
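The invariance argument can be checked with a small Monte Carlo experiment of our own: for a split threshold drawn uniformly between an attribute's minimum and maximum (as in an iForest split node), the probability that two points are sent to different branches is unchanged by a linear rescaling of that attribute.

```python
import numpy as np

rng = np.random.RandomState(0)

def p_split_separates(x, y, values, trials=100_000):
    """Probability that a uniform random split in [min, max] of `values`
    puts x and y into different branches (an iForest-style split on one attribute)."""
    cuts = rng.uniform(values.min(), values.max(), size=trials)
    return np.mean((cuts > min(x, y)) & (cuts <= max(x, y)))

values = rng.rand(1000)           # an attribute in [0, 1]
x, y = values[0], values[1]

a, b = 1000.0, -7.0               # arbitrary linear rescaling: v -> a*v + b
p1 = p_split_separates(x, y, values)
p2 = p_split_separates(a * x + b, a * y + b, a * values + b)
print(p1, p2)                     # approximately equal
```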

Although LMNN and GMML have almost identical rankings in both Fig. 14a and b, they show large differences on some datasets, e.g., Breast, Corel, ILPD, GPS and Spam among the normalised datasets. Both LMNN and GMML are significantly better than kNN on the normalised datasets, and better than both kNN and ENN on the unnormalised datasets. We also found no significant difference between kNN and ENN on either the normalised or the unnormalised datasets, though ENN has a slightly higher average ranking over the 20 datasets.

Figure 15 shows examples of varying k in all five algorithms. As expected, k influences the predictive accuracy of every algorithm. In a specific application, the best k should be searched for in order to obtain the best accuracy.

Fig. 15

Accuracies of algorithms with different k values. a Urban. b Ionosphere. c GPS. d Breast. e Heart. f Vote

6.2 Density-based clustering versus mass-based clustering

DBSCAN (Ester et al. 1996) is a natural choice for our evaluation, not only because it is a commonly used clustering algorithm, but also because it employs \(\epsilon \)-neighbourhood density estimation (i.e., the key contender used in the analysis in Sect. 4). Here we convert DBSCAN to MBSCAN, i.e., from density-based to mass-based, by simply replacing the distance measure with mass-based dissimilarity, leaving the rest of the procedure unchanged. This effectively changes the use of \(\epsilon \)-neighbourhood density estimation to \(\mu \)-neighbourhood mass estimation, as described in Sect. 4.1. This enables a global threshold to be used in a mass distribution to identify all clusters of varying densities in a density distribution, as shown in Figs. 11 and 12.
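Because the clustering procedure itself is untouched, the conversion amounts to swapping the input dissimilarity matrix. A minimal sketch using scikit-learn's DBSCAN with a precomputed matrix is shown below; mass_dissimilarity_matrix is a placeholder for the \(m_e\) computation from iForest, not an existing library function.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

def run_scan(D, eps, min_pts):
    """DBSCAN on a precomputed dissimilarity matrix; only the matrix changes."""
    return DBSCAN(eps=eps, min_samples=min_pts, metric="precomputed").fit_predict(D)

X = np.random.RandomState(0).rand(300, 2)

D_dist = pairwise_distances(X)                  # distance matrix -> DBSCAN
labels_dbscan = run_scan(D_dist, eps=0.1, min_pts=5)

# D_mass = mass_dissimilarity_matrix(X)         # m_e from iForest (placeholder) -> MBSCAN
# labels_mbscan = run_scan(D_mass, eps=0.2, min_pts=5)   # eps now plays the role of mu
```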

We evaluated MBSCAN and compared it with DBSCAN, SNN (Ertöz et al. 2003) and OPTICS (Ankerst et al. 1999). Note that the only difference among DBSCAN, SNN and MBSCAN is the dissimilarity matrix, which is preprocessed and serves as input to these algorithms. We used iForest with the default setting (i.e., \(\psi =256\) and \(t=100\)) (Liu et al. 2008) to generate the mass-based dissimilarity matrix as the input for MBSCAN.

For each clustering algorithm, the search range of \(\epsilon \), \(\mu \) or \(\sigma \) was from the minimum to the maximum value of pairwise dissimilarity in the given dataset. The search range of MinPts in DBSCAN, SNN and MBSCAN was \(\lbrace 2, 3,\ldots ,10 \rbrace \). The parameter k in SNN was set to the square root of the data size, as suggested by Silverman (1986). For OPTICS, we searched MinPts to produce the required hierarchical plots, and then searched the threshold \(\xi \)Footnote 12 in the range \(\lbrace 0.01, 0.02,\ldots , 0.99 \rbrace \) to extract clusters from each plot.

We recorded the best F-measureFootnote 13 of a clustering algorithm on a dataset. Because iForest is a randomised method, we reported the average result over 10 trials.

Table 10 shows the best F-measures of DBSCAN, OPTICS, SNN and MBSCAN. It demonstrates that MBSCAN and SNN performed the best on 11 and 7 datasets, respectively, whereas OPTICS performed the best on only 2 datasets. In terms of the performance ratio with reference to DBSCAN, MBSCAN enhances DBSCAN the most, by 33%; SNN enhances DBSCAN by 29%; and OPTICS enhances DBSCAN by 12%.

Table 10 Best F-measures of DBSCAN, OPTICS, SNN and MBSCAN on 20 datasets

Although MBSCAN and SNN have similar F-measures on many datasets, each is better than the other on a few datasets. MBSCAN is significantly better than SNN on two datasets: S1 and WDBC. For example, MBSCAN enhances DBSCAN by more than 80% on S1, compared with less than 50% achieved by SNN. The reverse is true on the Forest dataset, where SNN enhances DBSCAN by 300% compared with 216% by MBSCAN. MBSCAN and SNN enhance DBSCAN by more than 15% on 12 and 10 datasets, respectively; OPTICS does so on only 8 datasets.

The post-hoc Nemenyi test, shown in Fig. 16, demonstrates that MBSCAN is the only algorithm which performs significantly better than OPTICS and DBSCAN. SNN is significantly better than DBSCAN, but not better than OPTICS.

Fig. 16

Critical difference (CD) diagram of the post-hoc Nemenyi test (\(\alpha =0.05\)) for the results shown in Table 10. The difference between two algorithms is significant if the gap between their ranks is larger than the CD. There is a line between two algorithms if the rank gap between them is smaller than the CD

6.3 Evaluation in runtime

For unsupervised learning tasks, the only difference between the \(\ell _p\) and \(m_e\) versions of the algorithms is the computation time for the dissimilarity matrix. After the matrix is computed and serves as input, the algorithm has the same runtime regardless of the dissimilarity used to compute the matrix.

Using \(\ell _p\) to compute the dissimilarity matrix has \(O(dn^{2})\) time complexity. \(m_e\) builds iForest and computes the dissimilarity matrix based on iForest, which yields \(O(t \psi \log \psi + n^{2} t \log \psi )\) time complexity. For large datasets, where \(\psi \ll n\), the time cost is \(O(n^{2})\). The time complexity of SNN similarity is \(O(n^{3})\) in the worst case, when \(k = \sqrt{n}\) or larger. Table 11 gives the time and space complexities of the dissimilarity matrix calculation based on \(\ell _p\), \(m_e\) and SNN.
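To make the cost structure concrete, the following is a minimal sketch of computing the \(m_e\) matrix from a forest of random axis-parallel trees: each tree is built on a subsample of size \(\psi \), and the dissimilarity of a pair is the mass (relative to |D|) of the deepest tree node shared by the two points, averaged over t trees. It follows the smallest-region idea of Sect. 2 but is our simplified illustration, not the authors' iForest implementation; the \(O(n^{2} t \log \psi )\) pairwise step is visible in the nested loops.

```python
import numpy as np

rng = np.random.RandomState(0)

def build_tree(X_sub, height, node_id=0):
    """Random axis-parallel splits on a subsample (iForest-style iTree)."""
    if height == 0 or len(X_sub) <= 1:
        return {"id": node_id, "leaf": True}
    q = rng.randint(X_sub.shape[1])                       # random attribute
    lo, hi = X_sub[:, q].min(), X_sub[:, q].max()
    if lo == hi:
        return {"id": node_id, "leaf": True}
    p = rng.uniform(lo, hi)                               # random split value
    left = build_tree(X_sub[X_sub[:, q] < p], height - 1, 2 * node_id + 1)
    right = build_tree(X_sub[X_sub[:, q] >= p], height - 1, 2 * node_id + 2)
    return {"id": node_id, "leaf": False, "q": q, "p": p, "left": left, "right": right}

def node_path(tree, x):
    """Node ids on the root-to-leaf path that x traverses."""
    path, node = [], tree
    while True:
        path.append(node["id"])
        if node["leaf"]:
            return path
        node = node["left"] if x[node["q"]] < node["p"] else node["right"]

def me_matrix(X, t=100, psi=256, height=8):
    """m_e(x, y): mass of the smallest tree region covering both points,
    relative to |D| and averaged over t random trees (a sketch of Sect. 2)."""
    n = len(X)
    M = np.zeros((n, n))
    for _ in range(t):
        sub = X[rng.choice(n, size=min(psi, n), replace=False)]
        tree = build_tree(sub, height)
        paths = [node_path(tree, x) for x in X]
        mass = {}                                         # node id -> #points of D in region
        for p in paths:
            for nid in p:
                mass[nid] = mass.get(nid, 0) + 1
        for i in range(n):
            for j in range(n):
                k = 0                                     # find deepest common node of the two paths
                while k < min(len(paths[i]), len(paths[j])) and paths[i][k] == paths[j][k]:
                    k += 1
                M[i, j] += mass[paths[i][k - 1]] / n
    return M / t

X = rng.rand(100, 2)
D_mass = me_matrix(X, t=20, psi=64)
```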

Table 11 Time and space complexities of dissimilarity matrix calculation based on \(\ell _p\), \(m_e\) and SNN

Table 12 shows the runtime of the dissimilarity matrix calculation for the three dissimilarities on three datasets. On low-dimensional datasets, \(m_e\) has almost the same runtime as SNN, but takes longer to compute than \(\ell _p\) due to the use of iForest. However, on large and high-dimensional datasets such as p53Mutant, \(m_e\) is much faster than both \(\ell _p\) and SNN because it is independent of the data dimensionality, i.e., each split node of a tree chooses one attribute at random, and trees are grown only up to a fixed height limit.

For supervised learning tasks, the current implementation of \(m_e\) using iForest demands a traversal of all the trees for every measurement of dissimilarity between a test point and one point in the training set. This is more costly than the \(\ell _p\) measurement. For example, on the Corel dataset used in Sect. 6.1, kNN took 1700 s while kLMN took 116,600 s to complete the 5-fold cross-validation experiment.

Table 12 Runtime of the dissimilarity matrix calculation (in seconds)

Thus, a more efficient implementation of \(m_e\) is required for supervised learning tasks.

7 Related data dependent dissimilarities and metric axioms

7.1 \(m_p\) and \(\ell _{p,cdf}\)

The first version of mass-based dissimilarity is called \(m_p\) (Aryal et al. 2014a) and it is defined as follows:

$$\begin{aligned} m_p(\mathbf{x}, \mathbf{y}) = \displaystyle \left( \frac{1}{q}\sum _{i=1}^q \left( \frac{|R_i(\mathbf{x}, \mathbf{y})|}{n}\right) ^p\right) ^{\frac{1}{p}} \end{aligned}$$
(9)

where \(|R_i(\mathbf{x}, \mathbf{y})| = |\{z_i:\min (x_i,y_i)-\delta \le z_i \le \max (x_i,y_i)+\delta \}|\) is the data mass in dimension-i region \(R_i(\mathbf{x}, \mathbf{y}) = \left[ \min (x_i,y_i)-\delta , \max (x_i,y_i)+\delta \right] \), \(\delta \ge 0\); \(p>0\); q is the number of dimensions; and n is the number of points in the given dataset.

Note that this formulation is the same as \(\ell _p\), except that the data mass in \(R_i(\mathbf{x}, \mathbf{y})\) is used instead of the distance between \(\mathbf{x}\) and \(\mathbf{y}\) in each dimension. This formulation is a special case of \(m_e\) which considers the data mass in each dimension independently.

The single dimensional implementation of \(m_p\) has the advantage that it can be easily extended to deal with categorical attributes and mixed attribute types. There is no obvious way to deal with these attributes in the iForest implementation. The disadvantage is that \(m_p\) can perform poorly on datasets in which there is strong dependency between multiple attributes. This is where the iForest implementation of \(m_e\) can perform better than \(m_p\).
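Because \(m_p\) treats each dimension independently, computing it amounts to a per-dimension range count, which can be done by binary search on pre-sorted attribute values (cf. the \(O(d \log n)\) cost noted in Sect. 9.3). A minimal sketch of Eq. (9):

```python
import numpy as np

def mp_dissimilarity(x, y, D_sorted, p=2.0, delta=0.0):
    """m_p of Eq. (9): per-dimension data mass in [min(x_i, y_i)-delta, max(x_i, y_i)+delta],
    combined like an l_p norm. D_sorted[:, i] is the i-th attribute of D, pre-sorted."""
    n, q = D_sorted.shape
    lo = np.minimum(x, y) - delta
    hi = np.maximum(x, y) + delta
    masses = np.empty(q)
    for i in range(q):
        # binary-search range count: O(log n) per dimension
        masses[i] = (np.searchsorted(D_sorted[:, i], hi[i], side="right")
                     - np.searchsorted(D_sorted[:, i], lo[i], side="left"))
    return (np.mean((masses / n) ** p)) ** (1.0 / p)

rng = np.random.RandomState(0)
D = rng.rand(1000, 3)
D_sorted = np.sort(D, axis=0)        # sort each dimension once, reuse for all pairs
print(mp_dissimilarity(D[0], D[1], D_sorted))
```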

Transformation using cdf One can view \(m_p\) (Aryal et al. 2014a) as almost equivalent to \(\ell _p\) applied to a transformed data set \(D'\), where \(x_i'=cdf(x_i)\), i.e., points are represented by their cumulative distribution function in each dimension. Note that

$$\begin{aligned}m_p(x_i,y_i) = P(x_i \le X_i \le y_i) = cdf(y_i)-cdf(x_i)+P(x_i) \end{aligned}$$

whereas \(\ell _p(x'_i,y'_i) = P(x_i < X_i \le y_i) = cdf(y_i)-cdf(x_i)\).

If every point is unique, then the difference between \(m_p\) and \(\ell _p\) in dimension i is \(P(x_i) = 1/n\), i.e., the self-dissimilarity is constant. However, the difference between \(m_p\) and \(\ell _p\) grows when there are duplicate points.Footnote 14 In real high-dimensional datasets, many points often have the same value in many dimensions, e.g., many documents in a collection have the same occurrence frequency of a term, or different individuals have the same age. In the extreme case where all the points have the same value in dimension i, \(m_p(x_i,x_i) = \frac{n}{n} = 1\) (maximal dissimilarity) whereas \(\ell _p(x'_i,x'_i) = cdf(x_i)-cdf(x_i) = 0\) (minimal dissimilarity).

Note that \(\ell _p\) with the cdf transformation (\(\ell _{p,cdf}\)) is a measure in-between the data independent \(\ell _p\) and the data dependent \(m_p\): it is data dependent for \(\mathbf{x} \ne \mathbf{y}\) only, and its self-dissimilarity is zero and constant. As a result, it is still a metric.
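The \(\ell _{p,cdf}\) measure can be sketched by replacing every attribute value with its empirical cdf and then applying an ordinary \(\ell _p\) norm; using the fraction of points \(\le \) the value as the empirical cdf is our simplification.

```python
import numpy as np

def cdf_transform(D):
    """Replace each value by its empirical cdf in that dimension: x_i' = cdf(x_i)."""
    n, q = D.shape
    D_sorted = np.sort(D, axis=0)
    out = np.empty_like(D, dtype=float)
    for i in range(q):
        out[:, i] = np.searchsorted(D_sorted[:, i], D[:, i], side="right") / n
    return out

rng = np.random.RandomState(0)
D = rng.rand(500, 2)
Dc = cdf_transform(D)

# l_{2,cdf}: ordinary l_2 distance computed on the cdf-transformed data
x, y = Dc[0], Dc[1]
print(np.linalg.norm(x - y))
```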

7.2 Compliance with metric axioms

A comparison of \(m_e\) and \(m_{p}\) in terms of compliance with the metric axioms is given in Table 13. \(m_e\) is in compliance with the triangular inequality and symmetry axioms when the iForest implementation is used.

Table 13 Compliance with the metric axioms. Note that for \(\ell _{p,cdf}\) or \(\ell _{p}\) to be a metric, \(p \ge 1\)

Let a, b, c be values of a real attribute with \(a< b < c\). A binary iTree partitions the points as follows: the split at R(a, c), the smallest node covering both a and c, separates a from c, and b must fall on one of the two sides. If b falls on a’s side, then \(R(b,c) = R(a,c)\) and \(R(a,b) \subseteq R(a,c)\); otherwise \(R(a,b) = R(a,c)\) and \(R(b,c) \subseteq R(a,c)\). In either case, \(|R(a,b)| + |R(b,c)| > |R(a,c)|\), satisfying the triangular inequality condition.

It is interesting to note that the smallest region in mass-based dissimilarity is analogous to the shortest path in distance metric; and both of them lead to triangular inequality.

\(m_e\) is symmetric because R(ab) and R(ba) occupy the same node in a tree.

\(m_e\) is not in compliance with the first two axioms because of the properties described in Table 1. Because of its unique feature of data dependent self-dissimilarity, we name it datad-metric to differentiate it from existing measures such as quasi-metric, meta-metric, semi-metric, peri-metric (which violate one or two of the metric axioms).

As our focus is on data dependent measures, a comparison with generalised data independent measures (e.g., \(\ell _p\) with \(0<p<1\), which violates the triangular inequality axiom) is outside the scope of this paper. Interested readers may refer to Jacobs et al. (2000) and Tan et al. (2009). Whether these generalised data independent measures can overcome the shortcomings of the four algorithms studied here (including those shown in the Appendices) is an interesting topic for future research.

There are other existing data dependent dissimilarities such as Mahalanobis distance (Mahalanobis 1936) and Lin’s probabilistic measure (Lin 1998). As they are metrics, their self-dissimilarity is constant, unlike \(m_e\).

A summary of different types of dissimilarities is provided in Table 14. We describe the relation to distance metric learning, data dependent kernel and similarity-based learning in the next section.

Table 14 Different types of dissimilarities

8 Relation to distance metric learning, data dependent kernel and similarity-based learning

8.1 Distance metric learning

Distance metric learning can be viewed as a restricted form of data dependent dissimilarity. Typically, the aim is to learn a generalised (or parameterised) Mahalanobis distance, subject to some optimality constraint, from a dataset. In simple terms, the aim is to reduce the dissimilarity between points of the same class; and increase the dissimilarity between points of different classes.

Wang and Sun (2015) define a common form of distance metric learning as:

“The problem of learning a distance function \(\partial \) for a pair of data points \(\mathbf{x}\) and \(\mathbf{y}\) is to learn a mapping function f, such that \(f(\mathbf{x})\) and \(f(\mathbf{y})\) will be in the Euclidean space and \(\partial (\mathbf{x}, \mathbf{y}) = || f(\mathbf{x}) - f(\mathbf{y}) ||\), where \(||\cdot ||\) is the \(\ell _2\) norm.”

A more general formulation must still conform to some norm and is a pseudo-metric, i.e., it relaxes one metric condition from “\(\partial (\mathbf{x},\mathbf{y})=0\) iff \(\mathbf{x}=\mathbf{y}\)” to “if \(\mathbf{x}=\mathbf{y}\), then \(\partial (\mathbf{x},\mathbf{y})=0\)”, in addition to the learning requirement, which needs the class label of each point.

In contrast, mass-based dissimilarity, without learning, derives the dissimilarity of \(\mathbf{x}\) and \(\mathbf{y}\) directly from data based on probability mass of the region covering \(\mathbf{x}\) and \(\mathbf{y}\), without class information.

Mass-based dissimilarity is conceptually simpler, yet more generic, than distance metric learning; and it is a data dependent measure that is not restricted to a single form (which requires Euclidean distance, for example).

The implementations of mass-based dissimilarity differ in how the region is defined, including one similar to \(\ell _p\)-norm called \(m_p\) dissimilarity (Aryal et al. 2014a), which permits all valid p values including \(p=2\).

In other words, distance metric learning exploits information from Euclidean metric space, with a minor relaxation to become a pseudo metric. The focus is to learn a mapping from data based on some optimality criterion. It is an indirect way to get a restricted form of data-dependent dissimilarity. In contrast, mass-based dissimilarity has no such insistence or assumption, and it derives the dissimilarity directly from data, resulting in a generic form of data-dependent dissimilarity.

The empirical comparison between \(m_e\) and two distance metric learning methods [i.e., large margin nearest neighbour (Weinberger and Saul 2009) and GMML (Zadeh et al. 2016)] has been shown in Sect. 6.1. The former is unsupervised, requiring neither labelled data nor learning; the latter are supervised, requiring both labelled data and learning. Even with the advantage of class information and learning, distance metric learning is not found to be significantly better than \(m_e\).

A summary of the key differences between mass-based dissimilarity and distance metric learning is provided in Table 15. In a nutshell, less is more in solving the deficiency of distance measures, i.e., one gets more out of mass-based dissimilarity than distance metric learning with significantly less computational requirements.

Table 15 Key differences between mass-based dissimilarity and distance metric learning

8.2 Data dependent kernel

Designing a good kernel function is at the heart of kernel methods. The use of a poorly designed kernel function severely affects a kernel method’s predictive accuracy. One of the first methods to adapt a kernel function to the structure of the data is through a conformal transformation of kernel functions (Amari and Wu 1999; Wu and Amari 2002; Xiong et al. 2007). In the classification context, the idea is to modify the kernel function such that the spatial distances around the boundary between two classes are enlarged (and those outside the boundary region reduced). To achieve this objective, some knowledge of the boundary is required in order to learn a data dependent kernel for a given dataset. This objective bears some resemblance to that of distance metric learning (described in the last subsection), i.e., to reduce the distance between points of the same class and increase the distance between points of different classes. Both require a computationally expensive optimisation learning process. The resultant dissimilarity measure of a data dependent kernel is a metric. Here, the data dependency relies on class information.

In contrast, mass-based dissimilarity is derived directly from the given dataset, requiring neither learning nor information about class labels. The term ‘data dependent’ is used for mass-based dissimilarity in a more general context (than data dependent kernel or distance metric learning) to denote the dependency on data distribution without the class information.

A point worth noting is that a kernel family needs to be determined by a user of kernel methods, even if a kernel can be made data dependent. The equivalent requirement for mass-based dissimilarity is the model (or the partitioning strategy) used to define regions. It is arguably easier to determine the latter than the former.

The early development of data dependent kernel can be traced back to learning an optimal global nearest neighbour metric (Short and Fukunaga 1981; Fukunaga and Flick 1984). Fukunaga (1990) (pages 318–319) has commented that it is unclear how to estimate the metric such that the bias could be minimised because of the requirement of a positive definite matrix (a similar requirement of many kernel methods).

8.3 Similarity-based learning

The key concern of similarity-based (non-metric) learning (Chen et al. 2009; Schleif and Tino 2015) is how to deal with the given, often naturally obtained, dissimilarity which is non-metric (e.g., violating symmetry and/or the triangular inequality), where the data often have no vectorial representation. In this setting, two general approaches (Schleif and Tino 2015) are: (i) transforming the non-metric dissimilarity to a metric dissimilarity so that existing methods which rely on the metric constraint can be applied; and (ii) creating learning methods that deal with the given non-metric dissimilarity directly. Most of these methods are data independent, i.e., the dissimilarity between two points depends on these two points only.

The current framework of mass-based dissimilarity (and also distance metric learning and data dependent kernel described in the last two subsections) assumes that points are given in a vectorial representation. With this assumption, there is a choice in using either metric or non-metric dissimilarity; albeit the default choice is often metric by using distance measures (as in the cases of distance metric learning and data dependent kernel). The mass-based dissimilarity offers an unconventional choice to employ data dependent non-metric dissimilarity under this assumption.

In other words, the premises of mass-based dissimilarity and similarity-based learning are different. A comparison with similarity-based learning is only meaningful when mass-based dissimilarity is further developed into one which converts the given (data independent) non-metric dissimilarity into a data dependent dissimilarity. Whether this conversion necessitates a learning process is an open question.

Having said that, there is still one distinguishing feature of mass-based dissimilarity: it is a general approach which can be used for a variety of applications. In contrast, many similarity-based learning methods are based on application-specific studies, where the naturally obtained dissimilarity is application specific.

9 Discussion

9.1 Concepts

Except in the case of a given non-metric dissimilarity, dissimilarity measures are assumed to be metrics or pseudo-metrics as a necessary criterion for all data mining tasks. This work shows that the metric assumptions may be an impediment to producing well-performing models in two tasks. The fact that mass-based dissimilarity \(m_e\), which violates two metric axioms, can be used to overcome the shortcomings of two algorithms demonstrates the inadequacy of the metric axioms. The data dependent property is the overarching factor which leads to this outcome.

Our results imply that distance measures are the root cause of key shortcomings in two existing algorithms, highlighted in Table 5. Having identified the root cause and created an effective alternative to distance measures, the solution becomes simple: merely replace the distance measure with the mass-based dissimilarity, and the otherwise unchanged algorithm can now overcome its shortcoming.

Not recognising the root cause often leads to a solution which is more complicated than necessary and may not resolve the issue completely. An example is the inability of density-based clustering to find all clusters of varying densities. This issue is well-known, and many suggestions have focused on density-based solutions (Ankerst et al. 1999; Ertöz et al. 2003). The fact that the \(\epsilon \)-neighbourhood density estimator relies on a distance measure, which is the root cause of the shortcoming, has been overlooked.

It is interesting to note that one of the existing solutions, SNN clustering, has thus far been incorrectly designated as density-based (Ertöz et al. 2003; Tan et al. 2005). Our analysis in Sect. 4.4 has unveiled that when the distance measure in the \(\epsilon \)-neighbourhood density estimator is replaced with the SNN (dis)similarity, the result is a mass estimator, not a density estimator.

\(\mu \)-neighbourhood mass can be viewed as a generalised version of \(\epsilon \)-neighbourhood density. The shapes and volumes of \(\epsilon \)-neighbourhood density regions are fixed for a given \(\epsilon \). In contrast, the shapes and volumes of \(\mu \)-neighbourhood mass regions depend on the data distribution. Figure 8 provides an example showing that a \(\mu \)-neighbourhood mass region has a regular shape and fixed volume, like that of the \(\epsilon \)-neighbourhood estimator, only under a uniform density distribution.

9.2 Implementations

The use of iForest can be viewed as estimating probability from multiple variable-size multi-dimensional histograms.

The parameter \(\psi \) in iForest, used in the \(\mu \)-neighbourhood estimator, is a smoothing parameter similar to k in a k-nearest neighbour density estimator. A high \(\psi \) yields large trees which are sensitive to local variations in the data distribution, similar to the effect of setting a small k.

Since the default setting of iForest (\(\psi =256\) and \(t=100\)) can be used to provide good performance on many datasets, the implementation of mass-based dissimilarity based on iForest does not create additional limitations or parameters that need to be tuned.

The generic formulation of mass-based dissimilarity allows different implementations, including different variants of iForest (see Sect. 2.4); and \(m_p\) dissimilarity (Aryal et al. 2014a) and SNN (dis)similarity are its special cases—all of them possess the characteristic of judged dissimilarity as prescribed by psychologists (Krumhansl 1978).

It is possible to use SNN in kNN algorithms (e.g., kNN anomaly detection, and kNN and MLkNN classification). However, its use has two issues. First, there are two k parameters, as kNN is employed separately in the dissimilarity calculation and in the decision-making process. Second, the high time complexity shown in Table 11 makes SNN prohibitive on large datasets.

Note that a mass-based neighbourhood function can be implemented using a distance measure, as in the case of SNN. But, it is not only an indirect way to estimate mass but also an expensive one, as mentioned in Sect. 4.4.

9.3 Limitations of current implementations

The iForest implementation of mass-based dissimilarity has five limitations. First, the time complexity is higher than that of a distance measure in supervised learning, as described in Sect. 6.3. Second, it is limited to numeric attributes only. Third, like all tree implementations, it can deal with low- and medium-dimensional data only, because each tree considers only a small subset of the available attributes. Fourth, the current implementations (either \(m_e\) or \(m_p\)) produce a datad-metric only; they do not suit applications which demand violations of the triangular inequality and/or symmetry. Fifth, the current implementations assume a vectorial representation of data points. As such, it is unclear how mass-based dissimilarity can be created for applications which have no vectorial representation and in which only the dissimilarities between points are given.

The SNN implementation is not a good alternative to iForest because of its high time complexity; and it introduces an additional parameter k which is sensitive and needs to be tuned. This is despite the fact that it may be able to better deal with categorical attributes and high-dimensional datasets.

The \(m_p\) implementation simplifies to single-dimensional probability estimations. Its current implementation [shown in Eq. (9)] requires a range search in each dimension, which costs \(O(d \log n)\) using a binary search tree for each dimension (compared with O(d) for \(\ell _p\)). This time complexity has the potential to be reduced further. In addition, this implementation is more readily applied to categorical attributes and high-dimensional datasets. However, its key shortcoming is that every dimension is considered independently in the probability estimation. This can have a negative impact in many real-world applications where the dependency between attributes needs to be considered.

9.4 Probabilistic kNN classifiers and related techniques

Various techniques have been developed over the years to tackle some of the limitations of the kNN classifier, such as its sensitivity to high dimensions or to settings of the parameter k.

For instance, the probabilistic kNN (PNN) model (Holmes and Adams 2002) made use of Bayesian techniques to remove the need to choose k. The PNN model can be seen as using hold-one-out cross-validation to generate a smoothed ensemble of kNN classifiers over different values for k (weighted by their posterior probabilities).

The Discriminant Adaptive Nearest Neighbor (DANN) model (Hastie and Tibshirani 1996) attempts to deal with the susceptibility of kNN classifiers to high dimensionality by making use of ideas from Linear Discriminant Analysis. More specifically, the model modifies the distance metric at points near class boundaries such that the k-nearest neighbourhood is warped and stretched out along the boundaries. The ellipsoid-shaped neighbourhoods mean that points on the other side of the boundary are less likely to fall within the neighbourhood, which allows more reliable determination of the true class boundary in high dimensions. We note that the method does not, to the best of our knowledge, deal with the issue of differences in density on either side of the class boundary.

Finally, the Bayesian Adaptive Nearest Neighbor (BANN) model (Guo and Chakraborty 2010) combines the ideas of both PNN and DANN into a single model.

Overall, these techniques can be seen as using ensembling methods to remove the dependence on k, and distance-warping methods to reduce bias at points close to the decision boundary. We note that the modified distance measure, while data dependent, makes use of class label information, which distinguishes these techniques from the mass-based approach outlined in this paper. Furthermore, the computational requirements of these techniques are substantial due to the need to perform both leave-one-out cross-validation and matrix inversion.

10 Concluding remarks

We introduce a generic mass-based dissimilarity which is readily applied to existing algorithms in different tasks. The mass-based dissimilarity implemented with iForest overcomes key shortcomings of two existing algorithms that rely on distance, and effectively improves their task-specific performance on distance-based classification and density-based clustering.

These existing algorithms are transformed by simply replacing the distance measure with the mass-based dissimilarity, leaving the rest of the procedures unchanged.

As the transformation heralds a fundamental change of perspective in finding the closest match neighbourhood, the converted algorithms are more aptly called lowest probability mass neighbour algorithms than nearest neighbour algorithms, since the lowest mass represents the most similar.

Our analyses provide an insight into the conditions under which the distance-based neighbourhood methods fail and the mass-based neighbourhood methods succeed in classification and clustering tasks.

The proposed mass-based dissimilarity has a unique feature in comparison with existing data dependent measures, that is, its self-dissimilarity is data dependent and not constant. We call it datad-metric, as opposed to existing data dependent metrics and generalised data independent metrics (e.g., quasi-metric, meta-metric, semi-metric, peri-metric).

We disclose that two of the four metric axioms are not necessary in developing a model that performs well. This opens up research to (i) other forms of data dependent dissimilarities that work in practice, and (ii) incorporating learning with datad-metrics, where the research effort thus far has been largely restricted to metric learning and non-metric learning; and the latter is largely data independent. In addition, we will explore other implementations of mass-based dissimilarity and investigate their influence in different data mining tasks.