1 Introduction

Many real-world applications such as web mining, text mining, bio-informatics, system diagnosis, and action recognition have to deal with sequential data. The core task of such applications is to apply machine learning methods, for example, K-means or Support Vector Machines (SVMs), to sequential data to find insightful patterns or build effective predictive models. However, this task is challenging since machine learning methods typically require fixed-length vectors as inputs, a form that sequences do not naturally have.

A well-known solution in data mining is to use sequential patterns (SPs) as features [6]. This approach first mines SPs from the dataset and then represents each sequence as a feature vector whose binary components indicate whether the sequence contains a particular sequential pattern. Since the number of SPs is often large, the dimension of the feature space is huge, which leads to the high-dimensionality and data-sparsity problems.
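To make this encoding concrete, here is a minimal sketch (our own illustration with hypothetical helper names; it assumes the SPs have already been mined and that containment means plain subsequence containment):

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` with the symbol order preserved."""
    it = iter(sequence)
    return all(symbol in it for symbol in pattern)

def binary_sp_features(sequences, patterns):
    """Represent each sequence as a 0/1 vector over the mined SPs."""
    return [[int(is_subsequence(p, s)) for p in patterns] for s in sequences]

# Toy example: two mined patterns, three sequences.
patterns = [("a", "g"), ("g", "t")]
sequences = [("g", "a", "a", "g", "t"), ("a", "g"), ("t", "g")]
print(binary_sp_features(sequences, patterns))  # [[1, 1], [1, 0], [0, 0]]
```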

To reduce the dimension of the feature space, many researchers have tried to extract only interesting SPs under an unsupervised setting [5, 9, 17] or discriminative SPs under a supervised setting [3, 6, 19]. The unsupervised methods discover interesting SPs, e.g., closed SPs [17], compressing SPs [9], and relevant SPs [5], without using the sequence labels. Although these methods can reduce the number of generated patterns, thus solving the high-dimensionality problem, they still suffer from the data sparsity problem. The supervised methods discover discriminative SPs using different measures, e.g., information gain [6], support-cohesion [19], and behavioral constraint [3], which involve the sequence labels. Although these methods often show good performance in sequence classification, they usually require labels for all training examples, which is often unrealistic in real applications.

Recently, neural embedding approaches have been introduced to learn low-dimensional continuous embedding vectors for sequences using neural networks in a fully unsupervised manner. These methods primarily focus on text, where they learn embedding vectors for documents [2, 10] and show significant improvements over non-embedding methods, e.g., Bag-of-Words, in several applications such as document classification and sentiment analysis. However, they have two limitations. First, they mostly learn embedding vectors based on atoms in the data (i.e., words) but do not consider sets of atoms (i.e., phrases). Second, they often perform poorly on datasets with a relatively small vocabulary [8]. In our experiments, the performance of document embedding methods degrades dramatically on sequential datasets whose vocabulary size is less than 300.

Our Approach. To overcome the disadvantages of traditional pattern-based methods and recent embedding methods, we propose a novel unsupervised method (named Sqn2Vec) for learning sequence embeddings. In particular, we first extract a set of SPs which satisfy a gap constraint from the dataset. We then adapt a document embedding model to learn a vector for each sequence by predicting not only its constituent symbols but also its SPs. By doing this, we can learn low-dimensional continuous vectors for sequences, which addresses the weakness of pattern-based methods. We also take sets of atoms (i.e., SPs) into account during the learning process, which addresses the weakness of embedding methods. More importantly, by considering both singleton symbols and SPs, we increase the vocabulary size, which yields better embeddings on sequential datasets with a small vocabulary. Moreover, since Sqn2Vec is fully unsupervised, it can be directly used for learning sequence embeddings in domains where labeled examples are difficult to obtain, and the learned representations generalize well to different tasks such as sequence classification, clustering, and visualization.

To summarize, we make the following contributions:

  1. We propose Sqn2Vec, an unsupervised embedding method, for learning low-dimensional continuous feature vectors for sequences.

  2. We propose two models in Sqn2Vec, which learn sequence embeddings by predicting a sequence's constituent singleton symbols and its SPs. The learned embeddings are meaningful and discriminative.

  3. We demonstrate Sqn2Vec in both sequence classification and sequence clustering tasks, where it significantly outperforms the state-of-the-art baselines on 10 real-world sequential datasets.

2 Related Work

2.1 Sequential Pattern Based Methods for Sequence Representation

SPs have been widely used to construct feature vectors for sequences [6], which are essential inputs for different machine learning tasks. However, using SPs as features often suffers from the data sparsity and high-dimensionality problems. Recent SP-based methods have tried to extract only interesting or discriminative SPs. Several approaches have been proposed for mining interesting SPs. For example, Lam et al. [9] discovered compressing SPs which can optimally compress the dataset w.r.t. an encoding scheme. In [5], a probabilistic approach was developed to mine relevant SPs which are able to reconstruct the dataset. Although interesting SPs can help to reduce the number of generated patterns (i.e., the feature space), they still suffer from the data sparsity problem.

To discover discriminative SPs, existing approaches have used the sequence labels during the mining process. For example, they use the label information to compute information gain [6], support-cohesion [19], or behavioral constraint [3]. Although discriminative SPs are useful for classification, they require sequence labels, making the mining process supervised. Related to sequence classification, SPs have also been used to build a set of predictive rules for classification, often called sequential classification rules. These rules represent strong associations between SPs and labels, which can be used either directly for prediction (i.e., as rule-based classifiers) [19] or indirectly (i.e., as features in other classifiers) [4].

2.2 Embedding Methods for Sequence Representation

Most existing approaches for sequence embedding learning mainly focus on text, where they learn embedding vectors for documents [2, 10]. These methods have shown impressive successes in many natural language processing tasks such as document classification and sentiment analysis. They, however, are not suitable for sequential data in bio-informatics, navigation systems, and action recognition since, unlike text, these sequential datasets have a very small vocabulary (i.e., a small number of distinct symbols). For example, human DNA sequences consist of only four nucleotides A, C, G, and T. Related to sequence classification, several deep neural network models (also called supervised embedding methods) have been introduced, such as long short-term memory (LSTM) networks and bidirectional LSTM (Bi-LSTM) networks [16]. Since these methods require labels, their embeddings are not general enough to be effectively applied to unsupervised tasks such as sequence clustering.

As far as we know, learning embedding vectors for sequences based on SPs has not been studied yet. In this paper, we propose the first approach which utilizes information from both singleton symbols and SPs to learn sequence embeddings. Different from discriminative pattern-based methods and supervised embedding methods, our method is fully unsupervised. Moreover, our method leverages SPs to capture the sequential relations among symbols, as SP-based methods do, while it learns dense representations, as embedding methods do.

3 Framework

3.1 Problem Definition

Given a set of symbols \(\mathcal{I}=\{e_{1},e_{2},...,e_{M}\}\), a sequential dataset \(\mathcal{D}=\{S_{1},S_{2},...,S_{N}\}\) is a set of sequences where each sequence \(S_{i}\) is an ordered list of symbols [18]. The symbol at position j in \(S_{i}\) is denoted as \(S_{i}[j]\), where \(S_{i}[j]\in \mathcal{I}\).

Our goal is to learn a mapping function \(f:\mathcal{D}\rightarrow \mathbb {R}^{d}\) such that every sequence \(S_{i}\in \mathcal{D}\) is mapped to a d-dimensional continuous vector. The mapping needs to capture the similarity among the sequences in \(\mathcal{D}\), in the sense that if \(S_{i}\) and \(S_{j}\) are similar then \(f(S_{i})\) and \(f(S_{j})\) are close to each other in the vector space, and vice versa. The matrix \(\mathbf {X}=[f(S_{1}),f(S_{2}),...,f(S_{N})]\) then contains the feature vectors of the sequences, which can serve as direct inputs for many traditional machine learning and data mining tasks, particularly classification and clustering.

3.2 Learning Sequence Embeddings Based on Sequential Patterns

To learn sequence embeddings, one direct solution is to apply document embedding models [2, 10] to the sequential dataset, where each sequence is treated as a document and symbols are treated as words. However, as we discussed in Sect. 1, existing document embedding methods are not suitable for sequential datasets in bio-informatics or system diagnosis since these datasets have a relatively small vocabulary (i.e., a very small number of distinct symbols).

To improve the performances of document embedding models on such kind of sequential data, we propose to learn sequence embeddings based on SPs instead of singleton symbols. By doing this, we can increase the vocabulary size since the number of SPs is much larger than the number of symbols.

Sequential Pattern Discovery. Following the notations in [18], we define a sequential pattern as follows. Let \(\mathcal{I}=\{e_{1},e_{2},...,e_{M}\}\) be a set of symbols and \(\mathcal{D}=\{S_{1},S_{2},...,S_{N}\}\) be a sequential dataset.

Definition 1

(Subsequence). Given two sequences \(S_{1}=\{e_{1},e_{2},...,e_{n}\}\) and \(S_{2}=\{e'_{1},e'_{2},...,e'_{m}\}\), \(S_{1}\) is said to be a subsequence of \(S_{2}\) or \(S_{1}\) is contained in \(S_{2}\) (denoted \(S_{1}\subseteq S_{2}\)), if there exists a one-to-one mapping \(\phi :[1,n]\rightarrow [1,m]\), such that \(S_{1}[i]=S_{2}[\phi (i)]\) and for any positions i, j in \(S_{1}\), \(i<j\Rightarrow \phi (i)<\phi (j)\). In other words, each position in \(S_{1}\) is mapped to a position in \(S_{2}\), and the order of symbols is preserved.

Definition 2

(Subsequence occurrence). Given a sequence \(S=\{e'_{1},e'_{2},...,e'_{m}\}\) and a subsequence \(X=\{e_{1},e_{2},...,e_{n}\}\) of S, a sequence of positions \(o=\{i_{1},...,i_{n}\}\) is an occurrence of X in S if \(1\le i_{k}\le m\) and \(X[k]=S[i_{k}]\) for each \(1\le k\le n\), and \(i_{k}<i_{k+1}\) for each \(1\le k<n\).

Example 1

\(X=\{g,t\}\) (or \(X=gt\) for short) is a subsequence of \(S=gaagt\). There are two occurrences of X in S, namely \(o_{1}=\{1,5\}\) and \(o_{2}=\{4,5\}\).
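As an illustration (our own sketch, hypothetical names), the following function enumerates all occurrences of a subsequence with 1-based positions and reproduces Example 1:

```python
def occurrences(X, S):
    """Enumerate all occurrences of subsequence X in sequence S as tuples
    of 1-based positions (i_1 < ... < i_n with X[k] == S[i_k])."""
    results = []

    def extend(k, start, positions):
        if k == len(X):                      # every symbol of X has been matched
            results.append(tuple(positions))
            return
        for i in range(start, len(S)):
            if S[i] == X[k]:
                extend(k + 1, i + 1, positions + [i + 1])

    extend(0, 0, [])
    return results

print(occurrences("gt", "gaagt"))  # [(1, 5), (4, 5)], as in Example 1
```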

Definition 3

(Subsequence support). Given a sequential dataset \(\mathcal{D}\), the support of a subsequence X is defined as \(sup(X)=\frac{|\{S_{i}\in \mathcal{D}\mid X\subseteq S_{i}\}|}{|\mathcal{D}|}\), i.e., the fraction of sequences in \(\mathcal{D}\), which contain X.

Definition 4

(Sequential pattern). Given a minimum support threshold \(\delta \in [0,1]\), a subsequence X is said to be a sequential pattern if \(sup(X)\ge \delta \).

A sequential pattern can capture the sequential relation among symbols, but it pays no attention to the gap between its elements. In biological data and text data, this gap is very important because SPs whose symbols are far away from each other are often less meaningful than those whose symbols are close together in the sequences. For example, consider a text dataset with two sentences \(S_{1}=\) “machine learning is a field of computer science” and \(S_{2}=\) “machine learning gives computer systems the ability to learn”. Although the two SPs \(X_{1}=\{\text {machine, learning}\}\) and \(X_{2}=\{\text {machine, computer}\}\) are found in both \(S_{1}\) and \(S_{2}\), \(X_{2}\) is less meaningful than \(X_{1}\) due to the large gap between “machine” and “computer”. In other words, the two words “machine” and “computer” occur in two different contexts. We believe that if we restrict the distance between two neighboring elements in a sequential pattern, then the pattern is more meaningful and discriminative. We define a sequential pattern satisfying a gap constraint as follows.

Definition 5

(Gap constraint and satisfaction). A gap is a positive integer, \(\triangle >0\). Given a sequence \(S=\{e'_{1},e'_{2},...,e'_{m}\}\) and an occurrence \(o=\{i_{1},...,i_{n}\}\) of a subsequence X of S, if \(i_{k+1}\le i_{k}+\triangle \) for all \(k\in [1,n-1]\), then we say that o satisfies the \(\triangle \)-gap constraint. If at least one occurrence of X satisfies the \(\triangle \)-gap constraint, we say that X satisfies the \(\triangle \)-gap constraint.

Example 2

Among two occurrences of \(X=gt\) in \(S=gaagt\), namely \(o_{1}=\{1,5\}\) and \(o_{2}=\{4,5\}\), only \(o_{2}\) satisfies the 1-gap constraint (i.e., \(\triangle =1\)) since \(5\le 4+\triangle \). We say that X satisfies the 1-gap constraint because at least one of its occurrences does.

Definition 6

(Sequential pattern satisfying a \(\triangle \)-gap constraint). Given a sequential dataset \(\mathcal{D}\), a gap constraint \(\triangle >0\), and a minimum support threshold \(\delta \in [0,1]\), the support of a subsequence X in \(\mathcal{D}\) with the \(\triangle \)-gap constraint, denoted \(sup(X,\triangle )\), is the fraction of sequences in \(\mathcal{D}\), where X appears as a subsequence satisfying the \(\triangle \)-gap constraint. X is called a sequential pattern which satisfies the \(\triangle \)-gap constraint if \(sup(X,\triangle )\ge \delta \).

Note that we consider subsequences of length 1 (i.e., those containing only one symbol) to satisfy any \(\triangle \)-gap constraint. Hereafter, when we call a subsequence X a sequential pattern, we mean that X is a sequential pattern satisfying a \(\triangle \)-gap constraint.
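The following toy sketch (our own illustration, not the pattern miner used in the paper) puts Definitions 3–6 together: it checks whether a subsequence has at least one occurrence satisfying the \(\triangle \)-gap constraint and computes its gap-constrained support over a dataset.

```python
def has_gap_constrained_occurrence(X, S, gap):
    """True if some occurrence of X in S satisfies i_{k+1} <= i_k + gap.
    Length-1 subsequences satisfy any gap constraint by convention."""
    def search(k, prev):
        if k == len(X):
            return True
        lo = 0 if prev is None else prev + 1
        hi = len(S) if prev is None else min(len(S), prev + 1 + gap)
        return any(S[i] == X[k] and search(k + 1, i) for i in range(lo, hi))
    return search(0, None)

def support(X, dataset, gap):
    """sup(X, gap): fraction of sequences containing X under the gap constraint."""
    return sum(has_gap_constrained_occurrence(X, S, gap) for S in dataset) / len(dataset)

def is_sequential_pattern(X, dataset, gap, min_sup):
    return support(X, dataset, gap) >= min_sup

toy_dataset = ["gaagt", "agct", "ttgc", "cagt"]          # toy data, not the dataset in Fig. 1
print(support("ag", toy_dataset, gap=1))                 # 0.75
print(is_sequential_pattern("ag", toy_dataset, 1, 0.7))  # True
```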

Example 3

Let us consider the example sequential dataset shown in Fig. 1(a). Assume that \(\triangle =1\) and \(\delta =0.7\). The subsequence \(X=ag\) is contained in three sequences \(S_{1}\), \(S_{2}\), and \(S_{4}\), and it also satisfies the 1-gap constraint in these three sequences. Thus, its support is \(sup(X,\triangle )=3/4=0.75\). We say that \(X=ag\) is a sequential pattern since \(\sup (X,\triangle )\ge \delta \). With \(\triangle =1\) and \(\delta =0.7\), there are in total five SPs discovered from the dataset, as shown in Fig. 1(b), and each sequence can now be represented by a set of SPs, as shown in Fig. 1(c).

Fig. 1. Two forms of a sequence: a set of single symbols and a set of SPs. Table (a) shows a sequential dataset with four sequences where each of them is a set of symbols. Table (b) shows five SPs discovered from the dataset (here, \(\triangle =1\) and \(\delta =0.7\)). Table (c) shows each sequence represented by a set of SPs.

Sequence Embedding Learning. After associating each sequence with a set of SPs, we follow the Paragraph Vector-Distributed Bag-of-Words (PV-DBOW) model introduced in [10] to learn embedding vectors for sequences. Given a target sequence \(S_{t}\) whose representation needs to be learned, and a set of SPs \(\mathcal{F}(S_{t})=\{X_{1},X_{2},...,X_{l}\}\) contained in \(S_{t}\), our goal is to maximize the log probability of predicting the SPs \(X_{1},X_{2},...,X_{l}\) which appear in \(S_{t}\):

$$\begin{aligned} \max \sum _{i=1}^{l}\log \Pr (X_{i}\mid S_{t}) \end{aligned}$$
(1)

Furthermore, \(\Pr (X_{i}\mid S_{t})\) is defined by a softmax function:

$$\begin{aligned} \Pr (X_{i}\mid S_{t})=\frac{\exp (g(X_{i})\cdot f(S_{t}))}{\sum _{X_{j}\in \mathcal{F}(\mathcal{D})}\exp (g(X_{j})\cdot f(S_{t}))}, \end{aligned}$$
(2)

where \(g(X_{i})\in \mathbb {R}^{d}\) and \(f(S_{t})\in \mathbb {R}^{d}\) are the embedding vectors of the sequential pattern \(X_{i}\in \mathcal{F}(S_{t})\) and the sequence \(S_{t}\) respectively, and \(\mathcal{F}(\mathcal{D})\) is the set of all SPs discovered from the dataset \(\mathcal{D}\).

Calculating the summation \(\sum _{X_{j}\in \mathcal{F}(\mathcal{D})}\exp (g(X_{j})\cdot f(S_{t}))\) in Eq. 2 is very expensive since the number of SPs in \(\mathcal{F}(\mathcal{D})\) is often very large. To solve this problem, we approximate it using the negative sampling technique [13]. The idea is that instead of iterating over all SPs in \(\mathcal{F}(\mathcal{D})\), we randomly select a relatively small number of SPs which are not contained in the target sequence \(S_{t}\) (these SPs are called negative SPs). We then attempt to distinguish the SPs contained in \(S_{t}\) from the negative SPs by minimizing the following binary objective function of logistic regression:

$$\begin{aligned} \mathcal{O}_{1}=-\left[ \log \sigma (g(X_{i})\cdot f(S_{t}))+\sum _{n=1}^{K}\mathbb {E}_{X^{n}\sim \mathcal{P}(X)}\log \sigma (-g(X^{n})\cdot f(S_{t}))\right] , \end{aligned}$$
(3)

where \(\sigma (x)=\frac{1}{1+e^{-x}}\) is the sigmoid function, \(\mathcal{P}(X)\) is the distribution over negative SPs, \(X^{n}\) is a negative sequential pattern drawn from \(\mathcal{P}(X)\) (K negative patterns are sampled in total), and \(g(X^{n})\in \mathbb {R}^{d}\) is the embedding vector of \(X^{n}\).

We minimize \(\mathcal{O}_{1}\) in Eq. 3 using stochastic gradient descent (SGD) where the gradients are derived as follows:

$$\begin{aligned} \frac{\partial \mathcal{O}_{1}}{\partial g(X^{n})}= & {} \left( \sigma (g(X^{n})\cdot f(S_{t}))-\mathbb {I}_{X_{i}}[X^{n}]\right) \cdot f(S_{t})\nonumber \\ \frac{\partial \mathcal{O}_{1}}{\partial f(S_{t})}= & {} \sum _{n=0}^{K}\left( \sigma (g(X^{n})\cdot f(S_{t}))-\mathbb {I}_{X_{i}}[X^{n}]\right) \cdot g(X^{n}), \end{aligned}$$
(4)

where \(\mathbb {I}_{X_{i}}[X^{n}]\) is an indicator function which equals 1 if \(X^{n}\) is the positive sequential pattern \(X_{i}\in \mathcal{F}(S_{t})\) (i.e., the sampled pattern appears in the target sequence \(S_{t}\)) and 0 otherwise, and the index \(n=0\) refers to the positive pattern, i.e., \(X^{0}=X_{i}\).
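To make the update rule concrete, below is a minimal numpy sketch of one negative-sampling SGD step (Eqs. 3 and 4) for a single positive pattern \(X_{i}\) and K negative patterns. The function names, the learning rate, and the toy data are our assumptions rather than the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(f_St, g, positive, negatives, lr=0.025):
    """One SGD update of O_1 for a target sequence vector f_St.

    f_St      : (d,) embedding f(S_t) of the target sequence
    g         : dict mapping pattern id -> (d,) embedding g(X)
    positive  : id of a pattern X_i contained in S_t (indicator = 1)
    negatives : ids of K sampled patterns not contained in S_t (indicator = 0)
    """
    grad_f = np.zeros_like(f_St)
    for pid, indicator in [(positive, 1.0)] + [(n, 0.0) for n in negatives]:
        score = sigmoid(np.dot(g[pid], f_St)) - indicator  # sigma(g.f) - I_{X_i}[X^n]
        grad_f += score * g[pid]                           # contribution to dO_1/df(S_t)
        g[pid] -= lr * score * f_St                        # update g(X^n) (Eq. 4, first line)
    f_St -= lr * grad_f                                    # update f(S_t) (Eq. 4, second line)
    return f_St

# Toy usage with random 8-dimensional embeddings for 5 patterns.
rng = np.random.default_rng(0)
g = {i: rng.normal(scale=0.1, size=8) for i in range(5)}
f_St = rng.normal(scale=0.1, size=8)
f_St = negative_sampling_step(f_St, g, positive=0, negatives=[2, 4])
```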

3.3 Sqn2Vec Method for Learning Sequence Embeddings

When associating a sequence \(S_{t}\) with a set of SPs, \(S_{t}\) may not contain any SP at all. In this case, we cannot learn a meaningful embedding vector for \(S_{t}\). To avoid this problem, we propose two models which combine information from both single symbols and SPs to learn embedding vectors for sequences. These two models, named Sqn2Vec-SEP and Sqn2Vec-SIM, are presented next.

Sqn2Vec-SEP Model to Learn Sequence Embeddings. Given a sequence \(S_{t}\), we separately learn an embedding vector \(f_{1}(S_{t})\) for \(S_{t}\) based on its symbols using the document embedding model PV-DBOW [10] and an embedding vector \(f_{2}(S_{t})\) for \(S_{t}\) based on its SPs (see Sect. 3.2). We then take the average of two embedding vectors to obtain the final embedding vector \(f(S_{t})=\frac{f_{1}(S_{t})+f_{2}(S_{t})}{2}\) for that sequence. The basic idea of Sqn2Vec-SEP is illustrated in Fig. 2.

Fig. 2. Sqn2Vec-SEP model. Given a target sequence \(S_{t}\), we learn the embedding vector \(f_{1}(S_{t})\) to predict its constituent symbols and the embedding vector \(f_{2}(S_{t})\) to predict its SPs. We then take the average of \(f_{1}(S_{t})\) and \(f_{2}(S_{t})\) to obtain the final embedding vector \(f(S_{t})\) for \(S_{t}\).
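The paper does not give implementation details for Sqn2Vec-SEP; the sketch below reconstructs it with gensim's Doc2Vec in PV-DBOW mode (dm=0). The helper names, the serialization of each SP as a single token, and all hyper-parameters other than d=128 are our assumptions.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_pv_dbow(token_lists, dim=128):
    """Train PV-DBOW (dm=0) on tokenized sequences tagged by their index."""
    docs = [TaggedDocument(tokens, [str(i)]) for i, tokens in enumerate(token_lists)]
    model = Doc2Vec(docs, vector_size=dim, dm=0, min_count=1, epochs=100, workers=4)
    return np.vstack([model.dv[str(i)] for i in range(len(token_lists))])

def sqn2vec_sep(sequences, sp_sets, dim=128):
    """sequences: list of symbol lists; sp_sets: list of SP-token lists
    (each mined SP serialized as one token, e.g. "a->g"), mined beforehand."""
    f1 = train_pv_dbow(sequences, dim)  # embeddings learned from singleton symbols
    f2 = train_pv_dbow(sp_sets, dim)    # embeddings learned from sequential patterns
    return (f1 + f2) / 2.0              # final embedding: the average of the two
```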

Sqn2Vec-SIM Model to Learn Sequence Embeddings. In the Sqn2Vec-SEP model, the sequence embeddings only capture the latent relationships between sequences and symbols and those between sequences and SPs separately. To overcome this weakness, we further propose the Sqn2Vec-SIM model which uses information of both single symbols and SPs of a sequence simultaneously. The overview of this model is shown in Fig. 3. More specifically, given a sequence \(S_{t}\), our goal is to minimize the following objective function:

$$\begin{aligned} \mathcal{O}_{2}= & {} -\left[ \sum _{e_{i}\in \mathcal{I}(S_{t})}\log \Pr (e_{i}\mid S_{t})+\sum _{X_{i}\in \mathcal{F}(S_{t})}\log \Pr (X_{i}\mid S_{t})\right] , \end{aligned}$$
(5)

where \(\mathcal{I}(S_{t})\) is the set of singleton symbols contained in \(S_{t}\) and \(\mathcal{F}(S_{t})\) is the set of SPs contained in \(S_{t}\).

Fig. 3. Sqn2Vec-SIM model. Given a target sequence \(S_{t}\), \(\mathcal{I}(S_{t})=\{e_{1},e_{2},...,e_{k}\}\) is the set of symbols contained in \(S_{t}\) and \(\mathcal{F}(S_{t})=\{X_{1},X_{2},...,X_{l}\}\) is the set of SPs contained in \(S_{t}\). We learn the embedding vector \(f(S_{t})\) for \(S_{t}\) to predict both its constituent symbols and its SPs.

Equation 5 can be simplified to:

$$\begin{aligned} \mathcal{O}_{2}= & {} -\sum _{p_{i}\in \mathcal{I}(S_{t})\cup \mathcal{F}(S_{t})}\log \Pr (p_{i}\mid S_{t}), \end{aligned}$$
(6)

where \(p_{i}\subseteq S_{t}\) is a symbol or a sequential pattern.

Following the same procedure in Sect. 3.2, we learn the embedding vector \(f(S_{t})\) for \(S_{t}\), and the embedding vectors of two sequences \(S_{i}\) and \(S_{j}\) are close to each other if they contain similar symbols and SPs.
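Correspondingly, Sqn2Vec-SIM can be sketched with a single PV-DBOW model whose "document" for each sequence is the union of its singleton symbols and its SP tokens, so that one sequence vector is trained to predict both (Eq. 6). Again, this is our reconstruction with gensim; the names and all hyper-parameters other than d=128 are assumptions.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def sqn2vec_sim(sequences, sp_sets, dim=128):
    """Learn one vector per sequence from its symbols and SPs jointly.

    sequences : list of symbol lists, e.g. [["a", "g", "t"], ...]
    sp_sets   : list of SP-token lists for the same sequences,
                e.g. [["a->g", "g->t"], ...], mined beforehand
    """
    docs = [TaggedDocument(symbols + sps, [str(i)])
            for i, (symbols, sps) in enumerate(zip(sequences, sp_sets))]
    model = Doc2Vec(docs, vector_size=dim, dm=0, min_count=1, epochs=100)
    return [model.dv[str(i)] for i in range(len(docs))]
```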

4 Experiments

4.1 Sequence Classification

The first experiment focuses on sequence classification, in which we compare our method with 11 baselines using sequential data from four application domains: text mining, action recognition, navigation analysis, and system diagnosis.

Datasets. We use eight benchmark datasets which are widely used for sequence classification. Their characteristics are summarized in Table 1. The reuters dataset consists of the four largest subsets of the Reuters-21578 dataset of news stories [19]. The three datasets aslbu, aslgt, and auslan2 are derived from videos of American and Australian Sign Language expressions [6]. The context dataset records different locations of mobile devices carried by end-users [12]. The two datasets pioneer and skating, introduced in [6], are used in action recognition. The final dataset unix contains the command-line histories of nine end-users of a Unix system [19]. All datasets were used to evaluate the accuracy of sequence classification in [4,5,6, 9, 19].

Table 1. Statistics of eight sequential datasets.

Baselines. For a comprehensive comparison, we employ 11 state-of-the-art baselines which can be categorized into four main groups:

  • Unsupervised SP-based methods: We compare our method – Sqn2Vec with two state-of-the-art methods GoKrimp [9] and ISM [5]. We adopt their classification performances from their corresponding papers. We also employ another unsupervised baseline which constructs a binary feature vector for each sequence, with components indicating whether this sequence contains a sequential pattern with a \(\triangle \)-gap constraint (see Sect. 3.2). We name this baseline SP-BIN.

  • Supervised SP-based methods: We select three representative and up-to-date baselines, namely BIDE-DC [6], SCIP [19], and MiSeRe [4]. We adopt the classification results of BIDE-DC reported in its supplemental appendix, those of SCIP from Table 10 in [19] and Fig. 12 in [4], and those of MiSeRe from Fig. 8 in [4].

  • Unsupervised embedding methods: By considering a sequence as a document and symbols as words, we apply two recent state-of-the-art document embedding models for learning sequence embeddings, which are PV-DBOW [10] and Doc2Vec-C [2]. We also learn embedding vectors for sequences based on SPs (see Sect. 3.2), which we name Doc2Vec-SP.

  • Supervised embedding methods: We implement two deep recurrent neural network models for sequence classification, LSTM and Bi-LSTM [16].

Our method Sqn2Vec has two different models which use different combinations of symbols and SPs. The Sqn2Vec-SEP model learns sequence embedding vectors from symbols and SPs separately (see Sect. 3.3) while the Sqn2Vec-SIM model learns sequence embedding vectors from symbols and SPs simultaneously (see Sect. 3.3).

Evaluation Metrics. After the feature vectors of sequences are constructed or learned, we feed them to an SVM with a linear kernel [1] to predict the sequence labels. We use the linear-kernel SVM (a simple classifier) since our focus is on sequence embedding learning rather than on the classifier, and this classifier was also used in [4, 5, 9, 19]. The hyper-parameter C of the SVM is set to 1, the same setting used in previous studies [5, 9, 19]. Each dataset is randomly split into 10 folds, with 9 folds for training and 1 fold for testing. We repeat the classification process on each dataset 10 times and report the average classification accuracy. The standard deviation is not reported since all methods are very stable (their standard deviations are less than \(10^{-1}\)).
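This protocol can be reproduced with scikit-learn roughly as follows (a sketch; we assume that a stratified 10-fold split is an acceptable stand-in for the paper's random 9/1 splits):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def evaluate_embeddings(X, y, seed=0):
    """10-fold cross-validated accuracy of a linear-kernel SVM with C=1."""
    clf = SVC(kernel="linear", C=1)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    return np.mean(cross_val_score(clf, X, y, cv=cv, scoring="accuracy"))
```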

Parameter Settings. Our method Sqn2Vec has three important parameters: the minimum support threshold \(\delta \), the gap constraint \(\triangle \) for discovering SPs, and the embedding dimension d for learning sequence embeddings. Since we develop Sqn2Vec in a fully unsupervised fashion, the values for \(\delta \), \(\triangle \), and d are assigned without using sequence labels. We set \(d=128\) (a common value used in embedding methods [7, 14]), set \(\triangle =4\) (a small gap which is able to capture the context of each symbol [15]), and set \(\delta \) following the elbow method in [15]. Figure 4 illustrates the elbow method. From the figure, we can see that as the \(\delta \) value decreases, the number of SPs increases slowly until a certain \(\delta \) value, after which it increases sharply. This \(\delta \) value, highlighted in red in the figure and chosen by the elbow method without considering the sequence labels, is used in our experiments. In Sect. 4.1, we analyze the impact of the three parameters \(\delta \), \(\triangle \), and d on the classification performance.

Fig. 4. The number of SPs discovered from the reuters dataset per \(\delta \) (here, \(\triangle =4\)). The red dot indicates the \(\delta \) value selected via the elbow method. (Color figure online)
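One way to automate this elbow choice is sketched below; the heuristic (return the last \(\delta \) before the SP count jumps sharply) and the jump threshold are our own approximation of the procedure in [15].

```python
def choose_delta_by_elbow(count_sps, deltas, jump_ratio=2.0):
    """Pick the last delta before the number of SPs starts to explode.

    count_sps : function mapping a delta value to the number of SPs mined at it
    deltas    : candidate thresholds in decreasing order, e.g. [0.9, 0.8, ..., 0.1]
    jump_ratio: relative increase treated as the 'elbow' (an assumed heuristic)
    """
    counts = [count_sps(d) for d in deltas]
    for i in range(1, len(counts)):
        if counts[i - 1] > 0 and counts[i] / counts[i - 1] >= jump_ratio:
            return deltas[i - 1]   # the delta just before the explosion
    return deltas[-1]              # no clear elbow: fall back to the smallest delta
```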

For a fair comparison, we use the same minimum support thresholds and gap constraints for our method and the baseline SP-BIN. We also set the embedding dimension required by the three baselines PV-DBOW, Doc2Vec-C, and Doc2Vec-SP to 128, the same as the one used in our method. For Doc2Vec-C, we use the source code provided by the author with the same parameter settings except \(d=128\). We implement LSTM and Bi-LSTM with the following details: the dimension of symbol embedding is 128, the number of LSTM hidden units is 100, the number of epochs is 50, the mini-batch size is 64, the drop-out rate for symbol embedding and LSTM is 0.2, and the optimizer is Adam.
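With the settings above, the Bi-LSTM baseline can be sketched in Keras as follows (our reconstruction; the input padding and the softmax output layer are assumptions):

```python
from tensorflow.keras import layers, models

def build_bilstm(vocab_size, num_classes):
    model = models.Sequential([
        layers.Embedding(vocab_size, 128),                    # symbol embedding, d = 128
        layers.Dropout(0.2),                                  # drop-out on embeddings
        layers.Bidirectional(layers.LSTM(100, dropout=0.2)),  # 100 hidden units
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model.fit(X_padded, y, batch_size=64, epochs=50)  # batch size and epochs as listed above
```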

Results and Discussion. From Table 2, we can see that the two models of our method Sqn2Vec clearly result in better classification on all datasets compared with the unsupervised embedding methods. Sqn2Vec-SEP achieves 2–97%, 5–232%, and 1–17% improvements over PV-DBOW, Doc2Vec-C, and Doc2Vec-SP respectively. As discussed in Sect. 1, the two document embedding methods PV-DBOW and Doc2Vec-C perform poorly on sequential datasets with a small number of symbols, namely aslbu, aslgt, auslan2, context, pioneer, and skating. In particular, on the dataset auslan2, whose vocabulary size is only 16, their performance drops dramatically: they are 97–232% and 106–247% worse than Sqn2Vec-SEP and Sqn2Vec-SIM respectively. In contrast, on unix and the text dataset reuters, where the vocabulary size is large enough, their performances are quite good. Doc2Vec-C even achieves the second best result on reuters.

On all the datasets with a small vocabulary, Doc2Vec-SP significantly outperforms PV-DBOW and Doc2Vec-C. This demonstrates that learning sequence embeddings from SPs is more effective than learning sequence embeddings from symbols, as discussed in Sect. 3.2. Our two models (Sqn2Vec-SEP and Sqn2Vec-SIM) are always superior to Doc2Vec-SP. This confirms that incorporating the information of both singleton symbols and SPs into the sequence embedding learning is a better strategy than learning the sequence embeddings from SPs only (see Sect. 3.3).

In most cases, our method Sqn2Vec is better than the unsupervised pattern-based methods. On three datasets auslan2, skating, and unix, Sqn2Vec-SIM outperforms these approaches by large margins (achieving 7–17%, 12–32%, and 33–34% gains over SP-BIN, GoKrimp, and ISM). Interestingly, our developed baseline SP-BIN, which uses SPs with a \(\triangle \)-gap constraint, is generally better than the two state-of-the-art methods GoKrimp and ISM. This verifies our intuition in Sect. 3.2 that SPs satisfying a \(\triangle \)-gap constraint are more meaningful and discriminative since they can capture the context of each symbol.

Compared with supervised pattern-based methods and supervised embedding methods, our Sqn2Vec produces comparable performance on most datasets. It outperforms BIDE-DC, SCIP, and the deep recurrent neural networks (LSTM and Bi-LSTM) on all datasets except aslgt and unix. Note that these methods leverage the sequence labels when they construct/learn sequence representations, a requirement that benefits the supervised methods but is often impractical.

Table 2. Accuracy of our Sqn2Vec and 11 baselines on eight sequential datasets. Bold font marks the best performance in a column. The last row denotes the \(\delta \) values used by our method for each dataset; they are determined using the elbow method (see Fig. 4). “–” means the accuracy is not available in the original paper.

Parameter Sensitivity. We examine how different choices of the three parameters \(\delta \), \(\triangle \), and d affect the classification performance of Sqn2Vec-SEP on five datasets reuters, aslbu, aslgt, auslan2, and pioneer. Figure 5 shows the classification results as a function of one chosen parameter when the others are set to their default values. From Fig. 5(a), we can see that our method is very stable on the two datasets reuters and aslgt, where its classification performance only changes slightly with different \(\delta \) values. On the three datasets aslbu, auslan2, and pioneer, the prediction performance shows an increasing trend as \(\delta \) is decreased. Another observation is that the \(\delta \) values selected by the elbow method often lead to the best or nearly the best accuracy.

From Fig. 5(b), we also observe that the performance of Sqn2Vec-SEP is consistent on reuters and aslgt, where the gap constraint \(\triangle \) has relatively little influence on the predictive task. In contrast, the accuracy first increases and then decreases on the two datasets aslbu and pioneer. One possible explanation is that if \(\triangle \) is set too large, the generated SPs become less meaningful, as discussed in Sect. 3.2. On auslan2, the accuracy stops increasing once \(\triangle \) reaches 4.

Figure 5(c) suggests that the predictive performance increases on the two datasets aslgt and auslan2 when d is increased, whereas the accuracy first increases and then decreases on aslbu and pioneer. This finding differs from those reported for document embedding methods, where the embedding dimension generally has a positive effect on document classification [2]. Again, our predictive performance is steady on reuters, which is shown by a nearly flat accuracy line.

Fig. 5. Parameter sensitivity in sequence classification on five datasets reuters, aslbu, aslgt, auslan2, and pioneer. The minimum support thresholds \(\delta \) selected via the elbow method and used in our experiments are indicated by red markers. (Color figure online)

4.2 Sequence Clustering

The second experiment illustrates how the latent representations learned by our proposed method can help the sequence clustering task, wherein we compare its performance with those of four baselines using text data.

Datasets. We use two text datasets for sequence clustering, namely webkb [15] and news [19]. webkb contains the content of webpages collected from computer science departments of various universities. news is a subset of the 20newsgroups dataset, generated by selecting the five largest groups of documents. These two datasets are normalized (i.e., stop words are removed and the remaining words are stemmed) and are publicly available online. Their properties are summarized in Table 3.

Table 3. Statistics of two text datasets.

Baselines. We compare our Sqn2Vec with state-of-the-art embedding methods in text (PV-DBOW, Doc2Vec-C, and Doc2Vec-SP) and an unsupervised pattern-based method SP-BIN. We choose SP-BIN because it always outperforms other unsupervised pattern-based methods in sequence classification. These baselines are introduced in Sect. 4.1. We exclude supervised methods since they require sequence labels during the learning process, thus inappropriate for our unsupervised learning task – clustering.

Evaluation Metrics. To evaluate the clustering performance, the embedding vectors provided by each method are input to a clustering algorithm. Here, we use K-means (a simple clustering method) to group the data and assess the clustering results in terms of mutual information (MI) and normalized mutual information (NMI). We conduct the clustering experiments 10 times and report the average and standard deviation of the clustering performance.
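A scikit-learn sketch of this protocol (setting the number of clusters to the number of true categories is our assumption):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

def evaluate_clustering(X, y_true, seed=0):
    """Cluster the learned embeddings with K-means and score against the labels."""
    n_clusters = len(set(y_true))
    y_pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    return mutual_info_score(y_true, y_pred), normalized_mutual_info_score(y_true, y_pred)
```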

Parameter Settings. We use the same parameter settings as in sequence classification for four baselines and our method Sqn2Vec, except that the values for \(\delta \) are selected using the elbow method (see Fig. 4).

Results and Discussion. From Table 4, we can see that Sqn2Vec outperforms all the competitive baselines in terms of both MI and NMI. Compared with the state-of-the-art document embedding methods, Sqn2Vec-SEP outperforms PV-DBOW, Doc2Vec-C, and Doc2Vec-SP by 24–32%, 38–40%, and 5–8% when clustering the webkb dataset. Similar improvements can be observed when clustering the news dataset, where the gains obtained by Sqn2Vec-SEP over PV-DBOW, Doc2Vec-C, and Doc2Vec-SP are around 8–14%, 72–77%, and 34–40%. Compared with the pattern-based method, the improvements are more significant: Sqn2Vec-SEP outperforms SP-BIN by 163–184% on webkb and 767–1,090% on news.

Table 4. MI and NMI scores of our method Sqn2Vec and four baselines on two text datasets. The MI score is a non-negative value while the NMI score lies in the range [0, 1]. Bold font marks the best performance in a column.

4.3 Sequence Visualization

Figure 6 visualizes the document representations learned by SP-BIN, PV-DBOW, Doc2Vec-C, Doc2Vec-SP, and our Sqn2Vec-SEP on the news dataset. We can see that documents from the same categories are clearly clustered when using the embeddings generated by PV-DBOW and Sqn2Vec-SEP. In contrast, SP-BIN, Doc2Vec-C, and Doc2Vec-SP do not separate the different categories clearly.

Fig. 6. Visualization of document embeddings on news using t-SNE [11]. Different colors represent different categories. (Color figure online)
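Such a plot can be reproduced with scikit-learn's t-SNE and matplotlib; in this sketch the perplexity, marker size, and color map are our choices, and the labels are assumed to be integer category ids.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(X, labels, seed=0):
    """Project d-dimensional document embeddings to 2-D and color them by category."""
    X_2d = TSNE(n_components=2, perplexity=30, random_state=seed).fit_transform(X)
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=5, cmap="tab10")
    plt.title("t-SNE of document embeddings")
    plt.show()
```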

5 Conclusion

We have introduced Sqn2Vec – an unsupervised method for learning sequence embeddings from information of both singleton symbols and SPs. Our method is capable of capturing both the sequential relations among symbols and the semantic similarity among sequences. Our comprehensive experiments on 10 standard sequential datasets demonstrated the meaningful and discriminative representations learned by our approach in both sequence classification and sequence clustering tasks. In particular, Sqn2Vec significantly outperforms several state-of-the-art baselines, including pattern-based methods, embedding methods, and deep neural network models. Our approach can be applied to different real-world applications such as text mining, bio-informatics, action recognition, and system diagnosis. One direction for future work is to integrate SPs into deep neural network models, e.g., LSTM and Bi-LSTM, to improve the classification performance.