Abstract
One of the most challenging problems in Gaussian process regression is coping with large-scale datasets and with online settings in which data instances arrive irregularly and continuously. In this paper, we introduce a novel online Gaussian process model that scales efficiently to large datasets. Our proposed method is built on the geometric and optimization views of Gaussian process regression and is hence termed geometric-based online GP (GoGP). We develop theory guaranteeing that, with a good convergence rate, the proposed algorithm always yields a sparse solution that approximates the true optimum up to any level of precision specified a priori. Moreover, to further speed up GoGP when it is used with a positive semi-definite, shift-invariant kernel (such as the well-known Gaussian kernel), and to address the curse of kernelization, wherein the model size grows linearly with the amount of data accumulated over time in online learning, we propose approximating the original kernel with a random Fourier feature kernel. The resulting model, GoGP with random Fourier features (GoGP-RF), is stored directly in a finite-dimensional random feature space; it therefore avoids the curse of kernelization and scales efficiently and effectively to large datasets. We extensively evaluate the proposed methods against state-of-the-art baselines on several large-scale datasets for online regression. The experimental results show that our GoGP variants deliver comparable, or slightly better, predictive performance while achieving an order-of-magnitude computational speedup over their rivals in the online setting. More importantly, their convergence behavior, which is rapid and stable while achieving lower errors, is guaranteed by our theoretical analysis.
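For concreteness, the following is a minimal sketch of the standard random Fourier feature map for the Gaussian kernel on which GoGP-RF relies. It is a generic illustration written by us in Python, not the authors' implementation; the function name and parameters are ours.

import numpy as np

def gaussian_rff(X, n_features, gamma, seed=0):
    """Map X (n x d) to z(X) (n x n_features) so that z(x) @ z(y) approximates
    the Gaussian kernel exp(-gamma * ||x - y||^2) (random Fourier features)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)                # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

With such an explicit, fixed-dimensional feature map, the online model is a single vector of length n_features, so its size no longer grows with the number of instances seen; this is how the curse of kernelization is avoided.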
Notes
We store the model as \({\mathbf {w}}=\sum _{n}\alpha _{n}\Phi \left( \varvec{x}_{n}\right) .\) The model size is \(\left\| \varvec{\alpha }\right\| _{0}\), and a sparse solution is one with a small value of \(\left\| \varvec{\alpha }\right\| _{0}/N\).
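As a small, purely illustrative example of this bookkeeping (the array values and names are ours):

import numpy as np

alpha = np.array([0.0, 0.7, 0.0, -1.2, 0.0])   # coefficients over N = 5 observed points
model_size = np.count_nonzero(alpha)            # ||alpha||_0 = 2 stored basis points
sparsity_ratio = model_size / alpha.size        # ||alpha||_0 / N = 0.4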
Our code for the four versions of GoGP (i.e., GoGP-K-P, GoGP-RF-P, GoGP-K-C, and GoGP-RF-C) can be found at https://github.com/khanhndk/GoGP.
Acknowledgements
This work is partially supported by the Australian Research Council under grant DP160109394.
Appendix A: Proofs of Lemmas and Theorems
Proposition 1
(Restated) The coefficient vector \(\varvec{d}^{*}=[d_{1}^{*},\ldots ,d_{N}^{*}]\) of the projection in Eq. (7) can be explicitly computed as \(\varvec{d}^{*}=[K+\sigma ^{2}I]^{-1}K_{*}^{\mathsf {T}}\). Furthermore, the posterior predictive mean \(\mu _{*}\) in Eq. (3) can be expressed by a weighted sum of the training labels
and the posterior predictive standard deviation in Eq. (4) is the Euclidean distance from \(\Phi _{*}\) to \({\mathcal {L}}\left( \Phi _{X}\right) \)
Proof of Proposition 1
We find \(\varvec{d}=\left[ d_{n}\right] _{n=1,\ldots ,N}^{\mathsf {T}}\) by solving the following optimization problem
We derive as follows
It follows that
\(\square \)
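As a hedged numerical illustration of Proposition 1 on toy data (a sketch; the Gaussian kernel, the noise level, and the reading of the Eq. (4) variance as \(k\left( \varvec{x}_{*},\varvec{x}_{*}\right) +\sigma ^{2}-K_{*}\left[ K+\sigma ^{2}I\right] ^{-1}K_{*}^{\mathsf {T}}\), i.e., including the noise term so that \(\left\| \Phi _{*}\right\| ^{2}=k\left( \varvec{x}_{*},\varvec{x}_{*}\right) +\sigma ^{2}\), are our assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = rng.normal(size=20)
sigma2 = 0.1

def kern(A, B):
    # Gaussian kernel k(a, b) = exp(-0.5 * ||a - b||^2)
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

K = kern(X, X)
x_star = rng.normal(size=(1, 2))
k_star = kern(x_star, X).ravel()                 # K_* as a vector
A = K + sigma2 * np.eye(len(X))

d_star = np.linalg.solve(A, k_star)              # d* = [K + sigma^2 I]^{-1} K_*^T
mu_star = d_star @ y                             # predictive mean = weighted sum of the labels

# Squared residual of projecting Phi_* onto span{Phi_n}, expanded via Gram matrices:
res2 = kern(x_star, x_star).item() + sigma2 - 2.0 * d_star @ k_star + d_star @ A @ d_star
var_star = kern(x_star, x_star).item() + sigma2 - k_star @ np.linalg.solve(A, k_star)
assert np.isclose(res2, var_star)                # distance^2 equals the predictive variance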
Proposition 2
(Restated) Let us define \(\varvec{\alpha }_{*}=\left[ K+\sigma ^{2}I\right] ^{-1}\varvec{y}\) whose elements are denoted explicitly by \(\varvec{\alpha }_{*}=[\alpha _{1}^{*},\ldots ,\alpha _{N}^{*}]\), and \({\mathbf {w}}_{*}=\sum _{n=1}^{N}\alpha _{n}^{*}\Phi _{n}\). Then, \({\mathbf {w}}_{*}\) is the optimal solution for the following optimization problem:
where \(\lambda =2\sigma ^{2}N^{-1}\). The posterior predictive mean in Eq. (3) can be computed as
Proof of Proposition 2
We transform the unconstrained optimization problem in Eq. (8) to its equivalent form as follows
where \(C=\lambda ^{-1}=0.5N\sigma ^{-2}\).
The Lagrange function is of the following form
Setting the derivatives to zero, we obtain
Substituting the above into the Lagrange function, we have the following optimization problem
Since strong duality holds, we have \({\mathbf {w}}_{*}=\sum _{n=1}^{N}\alpha _{n}^{*}\Phi \left( \varvec{x}_{n}\right) \) and the predictive mean \(\mu _{*}=K_{*}\left[ K+\sigma ^{2}I\right] ^{-1}\varvec{y}=K_{*}\varvec{\alpha }_{*}=\sum _{n=1}^{N}\alpha _{n}^{*}K\left( \varvec{x}_{*},\varvec{x}_{n}\right) ={\mathbf {w}}_{*}^{\mathsf {T}}\Phi _{*}\). \(\square \)
We now consider an upper bound on \(\left\| {\mathbf {w}}_{*}\right\| \). In particular, we have the following lemma.
Lemma 1
(Restated) If we define \({\mathbf {w}}_{*}\) as
then we have \(\left\| {\mathbf {w}}_{*}\right\| \le y_{\max }\lambda ^{-1/2}\).
Proof of Lemma 1
Let us consider the equivalent constrained optimization problem
The Lagrange function is of the following form
Setting the derivatives to zero, we obtain
Substituting the above into the Lagrange function, we obtain the dual form
Let \(\left( {\mathbf {w}}_{*},\varvec{\xi }_{*}\right) \) and \(\varvec{\alpha }_{*}\) be the primal and dual solutions, respectively. Since strong duality holds, we have
Note that we have used \(g\left( \alpha _{n}^{*}\right) =y_{n}\alpha _{n}^{*}-\frac{N}{4}\alpha _{n}^{*2}\le g\left( \frac{2y_{n}}{N}\right) =\frac{y_{n}^{2}}{N}\). Hence, we obtain the conclusion. \(\square \)
We now present the theoretical results regarding convergence analysis. The update rule is as follows
where \(\alpha _{t}={\mathbf {w}}_{t}^{\mathsf {T}}\Phi _{t}-y_{t}\), \(\delta _{t}={\mathrm{dist}}\left( \Phi _{t},{\mathcal {L}}\left( \Phi _{U}\right) \right) \), and \(S={\left\{ \begin{array}{ll} {\mathbb {R}}^{d} &{} \text {if}\,\,\lambda >2\\ {\mathcal {B}}(0,y_{\text {max}}\lambda ^{-1/2}) &{} \text {otherwise} \end{array}\right. }\), where \({\mathcal {B}}(0,y_{\text {max}}\lambda ^{-1/2})=\{ x\in {\mathbb {R}}^{d}:\left\| x\right\| \le y_{\text {max}}\lambda ^{-1/2}\} \).
We can rewrite the update rule as follows
where \(g_{t}=\lambda {\mathbf {w}}_{t}+2\alpha _{t}\Phi _{t}\), \(Z_{t}\) is a binary random variable with \(\Pr \left( Z_{t}=1\right) =\Pr \left( \delta _{t}\le \theta \right) \) (i.e., \(Z_{t}=1\) if the approximation is performed), and \({\mathbb {H}}\left( U,\varvec{x}_{t}\right) =\Phi \left( \varvec{x}_{t}\right) -{\mathbb {P}}_{{\mathcal {L}}\left( \Phi _{U}\right) }\left( \Phi \left( \varvec{x}_{t}\right) \right) \) denotes the rejection vector of \(\Phi _{t}\) from \({\mathcal {L}}\left( \Phi _{U}\right) \).
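To make the update concrete, here is a minimal, hedged sketch of one online step in an explicit (e.g., random-feature) space, where the projection onto \({\mathcal {L}}\left( \Phi _{U}\right) \) can be computed by least squares. The step size \(\eta _{t}=1/\left( \lambda t\right) \), the names, and the bookkeeping of the inducing set are illustrative assumptions on our part, not the authors' code.

import numpy as np

def gogp_step(w, U_feats, phi_t, y_t, t, lam, theta, w_radius=None):
    """One simplified GoGP-style update in an explicit feature space.
    w: model vector; U_feats: list of stored inducing feature vectors;
    phi_t: feature vector of the incoming instance; y_t: its target."""
    alpha_t = w @ phi_t - y_t                       # alpha_t = w_t^T Phi_t - y_t
    if U_feats:
        B = np.stack(U_feats, axis=1)               # basis spanning L(Phi_U)
        proj = B @ np.linalg.lstsq(B, phi_t, rcond=None)[0]
    else:
        proj = np.zeros_like(phi_t)
    delta_t = np.linalg.norm(phi_t - proj)          # delta_t = dist(Phi_t, L(Phi_U))
    if delta_t <= theta:                            # Z_t = 1: use the projection of Phi_t,
        phi_used = proj                             # i.e., subtract the rejection H(U, x_t)
    else:                                           # Z_t = 0: grow the inducing set
        phi_used = phi_t
        U_feats.append(phi_t)
    eta_t = 1.0 / (lam * t)                         # assumed O(1/(lambda * t)) step size
    w = w - eta_t * (lam * w + 2.0 * alpha_t * phi_used)
    if w_radius is not None:                        # optional projection onto the ball S
        norm_w = np.linalg.norm(w)
        if norm_w > w_radius:
            w = w * (w_radius / norm_w)
    return w, U_feats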
Lemma 2
(Restated) The following statement holds
where \(y_{\text {max}}={\max }_{y\in {\mathcal {Y}}}\left| y\right| \).
Proof of Lemma 2
We have the following
It follows that
Furthermore, we have
Thus, we obtain
Summing over \(t=1,2,\ldots ,T\), we obtain
\(\square \)
Lemma 3
(Restated) If \(\lambda >2\left( 1+\sigma ^{2}\right) \), then \(\left\| {\mathbf {w}}_{T+1}\right\| \le \frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\left( 1-\frac{1}{\left( \frac{\lambda }{2\left( 1+\sigma ^{2}\right) }\right) ^{T}}\right) <\frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\) for all T.
Proof of Lemma 3
First, we consider the sequence \(\left\{ s_{T}\right\} _{T}\) defined by the recursion \(s_{T+1}=2\lambda ^{-1}(1+\sigma ^{2})^{1/2}(y_{\text {max}}+(1+\sigma ^{2})^{1/2}s_{T})\) with \(s_{1}=0\). It is straightforward to derive a closed form for this sequence.
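A hedged reconstruction of this elementary geometric-series step (writing \(q=2\left( 1+\sigma ^{2}\right) /\lambda \), which satisfies \(q<1\) under the assumption \(\lambda >2\left( 1+\sigma ^{2}\right) \)):

\[ s_{T+1}=\frac{2\left( 1+\sigma ^{2}\right) ^{1/2}y_{\max }}{\lambda }\sum _{j=0}^{T-1}q^{j}=\frac{y_{\max }}{\left( 1+\sigma ^{2}\right) ^{1/2}}\cdot \frac{1-q^{T}}{q^{-1}-1}\le \frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\left( 1-\left( \frac{2\left( 1+\sigma ^{2}\right) }{\lambda }\right) ^{T}\right) , \]

which is consistent with the bound stated in the lemma since \(\left( 1+\sigma ^{2}\right) ^{-1/2}\le 1\).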
We prove by induction on \(T\) that \(\left\| {\mathbf {w}}_{T}\right\| \le s_{T}\) for all \(T\). It is obvious that \(\left\| {\mathbf {w}}_{1}\right\| =s_{1}=0\). Assuming that \(\left\| {\mathbf {w}}_{t}\right\| \le s_{t}\) for all \(t\le T\), we verify the claim for \(T+1\). Indeed, we have
Lemma 4
(Restated) The following statement holds \(\left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\le W,\,\forall t\) where we have defined
Proof of Lemma 4
We consider two cases.

Case 1: \(\lambda >2\left( 1+\sigma ^{2}\right) \). In this case, \(\left\| {\mathbf {w}}_{t}\right\| \) is bounded by Lemma 3 and \(\left\| {\mathbf {w}}_{*}\right\| \) is bounded by Lemma 1, so the triangle inequality bounds \(\left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| \).

Case 2: \(0<\lambda \le 2\left( 1+\sigma ^{2}\right) \).
Both \({\mathbf {w}}_{t}\) and \({\mathbf {w}}_{*}\) are in \({\mathcal {B}}\left( 0,y_{\text {max}}\lambda ^{-1/2}\right) \). Hence, we have
\(\square \)
Lemma 5
(Restated) The following statement holds
Proof of Lemma 5
We derive as follows
Lemma 6
(Restated) The following statement holds
where we have defined
Proof of Lemma 6
We derive as follows
Here, we note that to obtain the above inequality, we consider the two cases \(Z_{t}=1\) and \(Z_{t}=0\) and use \(\left\| {\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| \le \left\| \Phi _{t}\right\| =\left( 1+\sigma ^{2}\right) ^{1/2}\).
We have
Hence, we obtain
\(\square \)
Theorem 1
(Restated) Consider Algorithm 4.1, where \(\left( \varvec{x}_{t},y_{t}\right) \sim P_{N}\) (or \(P_{{\mathcal {X}}\times {\mathcal {Y}}}\)) arrives on the fly. Then the following bound holds
where \(G,\,M,\,W\) are positive constants and \(p_{t}=\Pr \left( Z_{t}=1\right) \) as defined before.
Proof of Theorem 1
We have
We recall that \(g_{t}=\lambda {\mathbf {w}}_{t}+2\left( {\mathbf {w}}_{t}^{\mathsf {T}}\Phi _{t}-y_{t}\right) \Phi \left( \varvec{x}_{t}\right) \) and \(\left( \varvec{x}_{t},y_{t}\right) \sim P_{N}\) or \(P_{{\mathcal {X}}\times {\mathcal {Y}}}\). Hence, we obtain \({\mathbb {E}}\left[ g_{t}|{\mathbf {w}}_{t}\right] ={\mathcal {J}}^{'}\left( {\mathbf {w}}_{t}\right) \).
Taking the conditional expectation w.r.t. \({\mathbf {w}}_{t}\), we obtain
Taking expectation again, we obtain
Summing the above over \(t=1,\ldots ,T\), we obtain
\(\square \)
Theorem 2
(Restated) Consider the output of Algorithm 4.1 and further let \({\overline{{\mathbf {w}}}}_{T}^{\gamma }=\frac{1}{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}{\mathbf {w}}_{t}\) and \(W_{T}^{\gamma }={\mathbb {E}}[\left\| {\mathbf {w}}_{\left( 1-\gamma \right) T+1}-{\mathbf {w}}_{*}\right\| ^{2}]\) where \(0<\gamma <1\), then, the following inequality holds
where \(\gamma '=1-\gamma \).
Proof of Theorem 2
Summing Eq. (13) over \(t=\left( 1-\gamma \right) T+1,\ldots ,T\), we obtain
We note that we have used the inequality \(\sum _{t=\left( 1-\gamma \right) T+1}^{T}\frac{1}{t}\le \log \left( 1/\left( 1-\gamma \right) \right) \).
To achieve the conclusion, we use the convexity of the function \({\mathcal {J}}\left( .\right) \) which implies \({\mathcal {J}}\left( {\overline{{\mathbf {w}}}}_{T}^{\gamma }\right) \le \frac{1}{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}{\mathcal {J}}\left( {\mathbf {w}}_{t}\right) \). \(\square \)
Theorem 5
(Hoeffding inequality) Let \(X_{1},\ldots ,X_{n}\) be independent random variables with \(a_{i}\le X_{i}\le b_{i}\) for each \(i\in \left[ n\right] \). Let \(S=\sum _{i=1}^{n}X_{i}\) and \(\Delta _{i}=b_{i}-a_{i}\). The following statements hold:
(i) \({\mathbb {P}}\left( S-{\mathbb {E}}\left[ S\right] >\varepsilon \right) \le \exp \left( -\frac{2\varepsilon ^{2}}{\sum _{i=1}^{n}\Delta _{i}^{2}}\right) \);

(ii) \({\mathbb {P}}\left( \left| S-{\mathbb {E}}\left[ S\right] \right| >\varepsilon \right) \le 2\exp \left( -\frac{2\varepsilon ^{2}}{\sum _{i=1}^{n}\Delta _{i}^{2}}\right) \).
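As a quick, purely illustrative Monte Carlo sanity check of statement (ii) (toy parameters of our choosing; not part of the proofs):

import numpy as np

rng = np.random.default_rng(1)
n, eps, trials = 50, 5.0, 20000
X = rng.uniform(0.0, 1.0, size=(trials, n))     # a_i = 0, b_i = 1, so Delta_i = 1
S = X.sum(axis=1)
empirical = np.mean(np.abs(S - 0.5 * n) > eps)  # empirical two-sided tail probability
bound = 2.0 * np.exp(-2.0 * eps ** 2 / n)       # 2 exp(-2 eps^2 / sum_i Delta_i^2)
print(empirical, "<=", bound)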
Theorem 3
(Restated) Define the gap \(m_{T}=\frac{\lambda \gamma '}{2\gamma }W_{T}^{\gamma }+\frac{2WM\theta }{\gamma T}\sum _{t=\gamma 'T+1}^{T}p_{t}\). Let r be any number picked uniformly at random from \(\{ \gamma 'T+1,\gamma 'T+2,\ldots ,T\} \). With probability at least \(1-\delta \), the following inequality holds
where \(\Delta _{T}^{\gamma }={\max }_{\gamma 'T+1\le t\le T}\left( {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \right) \).
Proof of Theorem 3
From Theorem 2, we obtain
where \(m_{T}=\frac{\lambda \left( 1-\gamma \right) }{2\gamma }W_{T}^{\gamma }+\frac{W\theta }{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}p_{t}\).
Let us denote \(X={\mathcal {J}}\left( {\mathbf {w}}_{r}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \), where r is uniformly sampled from \(\left\{ \left( 1-\gamma \right) T+1,\left( 1-\gamma \right) T+2,\ldots ,T\right\} \). We have
It follows that
Let us denote \(\Delta _{T}^{\gamma }=\underset{\left( 1-\gamma \right) T+1\le t\le T}{\max }\left( {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \right) \), which implies that \(0\le {\mathcal {J}}\left( {\mathbf {w}}_{r}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \le \Delta _{T}^{\gamma }\). Applying the Hoeffding inequality (Theorem 5) to the random variable X, we obtain
Choosing \(\delta =\exp \left( -\frac{2\varepsilon ^{2}}{\left( \Delta _{T}^{\gamma }\right) ^{2}}\right) \), or equivalently \(\varepsilon =\Delta _{T}^{\gamma }\sqrt{\frac{1}{2}\log \frac{1}{\delta }}\), we have, with probability at least \(1-\delta \),
\(\square \)
Theorem 4
(Restated) Consider Algorithm 4.1 where \(\left( \varvec{x}_{t},y_{t}\right) \sim P_{N}\) or \(P_{{\mathcal {X}}\times {\mathcal {Y}}}\). After at most \(T_{\theta }\) iterations, this algorithm reaches a \(\theta \)-stable state in which the size of the inducing set is bounded by \(T_{\theta }\), i.e., \(\left| U\right| \le T_{\theta }\), and for any \(\varvec{x}_{*}\notin U\), the standard deviation \(\sigma _{U}\left( \varvec{x}_{*}\right) \) of the distribution \(p\left( f_{*}\mid \varvec{x}_{*},U\right) \) is less than \((\theta ^{2}-\sigma ^{2})^{1/2}\). More importantly, the constant \(T_{\theta }\) is independent of the data distribution and the arrival order.
Proof of Theorem 4
We assume that \({\mathcal {X}}\subset {\mathbb {R}}^{d}\) is a compact set (i.e., closed and bounded) and that \(\Phi \left( .\right) \) is a continuous map. Since \({\mathcal {X}}\) is compact and \(\Phi \left( .\right) \) is continuous, \(\Phi \left( {\mathcal {X}}\right) \) is also a compact set. Let \(\left\{ {\mathcal {C}}\left( \Phi \left( \varvec{s}\right) ,\theta \right) \right\} _{\varvec{s}\in {\mathcal {X}}}\) be an open cover of \(\Phi \left( {\mathcal {X}}\right) \), where the open sphere \({\mathcal {C}}\left( \Phi \left( \varvec{s}\right) ,\theta \right) \) is defined as
From this cover, we can extract a finite subcover of \(T_{\theta }\) open spheres, that is, \(\left\{ {\mathcal {C}}\left( \Phi \left( \varvec{s}_{i}\right) ,\theta \right) \right\} _{i=1}^{T_{\theta }}\). We denote by \(U_{t}\) the set of inducing variables U right before the instance \(\left( \varvec{x}_{t},y_{t}\right) \) is added. It is apparent that the resultant set of inducing variables is the union of all instantaneous sets of inducing variables, i.e., \(U=\bigcup _{t\ge 1}U_{t}\).
We now prove that if \(\varvec{u},\varvec{v}\in U\) are two distinct elements of U, then \(\left\| \Phi \left( \varvec{u}\right) -\Phi \left( \varvec{v}\right) \right\| >\theta \). Assume that \(\varvec{u}=\varvec{x}_{t}\) and \(\varvec{v}=\varvec{x}_{t'}\) with \(t>t'\). We then have
Since \(\varvec{v}=\varvec{x}_{t'}\in U_{t}\), by choosing \(d_{\varvec{v}}=1\) and \(d_{\varvec{x}}=0\) for \(\varvec{x}\ne \varvec{v}\), we obtain
Therefore, each open sphere in the finite subcover \(\left\{ {\mathcal {C}}\left( \Phi \left( \varvec{s}_{i}\right) ,\theta \right) \right\} _{i=1}^{T_{\theta }}\) cannot contain two distinct points of U. Besides, U is a subset of \(\bigcup _{i=1}^{T_{\theta }}{\mathcal {C}}\left( \Phi \left( \varvec{s}_{i}\right) ,\theta \right) \). Hence, the cardinality of U cannot exceed \(T_{\theta }\), i.e., \(\left| U\right| \le T_{\theta }\). \(\square \)
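A hedged remark on the second claim of Theorem 4 (our reading, obtained by combining Proposition 1 with the insertion rule): once the algorithm is \(\theta \)-stable, any new \(\varvec{x}_{*}\notin U\) satisfies \({\mathrm{dist}}\left( \Phi _{*},{\mathcal {L}}\left( \Phi _{U}\right) \right) \le \theta \); with the augmented feature map used in the proofs (for which \(\left\| \Phi _{*}\right\| ^{2}=k\left( \varvec{x}_{*},\varvec{x}_{*}\right) +\sigma ^{2}\)), this squared distance equals \(\sigma _{U}\left( \varvec{x}_{*}\right) ^{2}+\sigma ^{2}\), which yields \(\sigma _{U}\left( \varvec{x}_{*}\right) \le \left( \theta ^{2}-\sigma ^{2}\right) ^{1/2}\) as stated.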