
GoGP: scalable geometric-based Gaussian process for online regression

  • Regular Paper
  • Published:
Knowledge and Information Systems

Abstract

One of the most challenging problems in Gaussian process regression is to cope with large-scale datasets and with online settings in which data instances arrive irregularly and continuously. In this paper, we introduce a novel online Gaussian process model that scales efficiently to large-scale datasets. Our proposed GoGP is constructed from the geometric and optimization views of Gaussian process regression, hence the name geometric-based online GP (GoGP). We develop theory guaranteeing that our algorithm converges at a good rate and always offers a sparse solution, which can approximate the true optimum up to any level of precision specified a priori. Moreover, to further speed up GoGP when it is coupled with a positive semi-definite, shift-invariant kernel such as the well-known Gaussian kernel, and to address the curse of kernelization, wherein the model size grows linearly with the amount of data accumulated over time in the online setting, we propose to approximate the original kernel using the Fourier random feature kernel. The resulting model, GoGP with Fourier random features (GoGP-RF), can be stored directly in a finite-dimensional random feature space, hence avoiding the curse of kernelization and scaling efficiently and effectively to large-scale datasets. We extensively evaluated our proposed methods against state-of-the-art baselines on several large-scale datasets for the online regression task. The experimental results show that our GoGP variants delivered comparable, or slightly better, predictive performance while achieving an order-of-magnitude computational speedup over their rivals in the online setting. More importantly, their convergence behavior, which is rapid and stable while achieving lower errors, is guaranteed by our theoretical analysis.
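To make the random-feature idea above concrete, the following minimal sketch (illustrative only, not the GoGP-RF implementation; dimensions and kernel width are arbitrary) builds the Rahimi–Recht random Fourier feature map for the Gaussian kernel and checks that inner products of the D-dimensional features approximate the kernel, which is what allows the model to be stored as a single D-dimensional vector.

```python
import numpy as np

# Random Fourier features for the Gaussian kernel k(x, y) = exp(-gamma ||x - y||^2):
# z(x) = sqrt(2 / D) * cos(W x + b) with W ~ N(0, 2 * gamma * I) and b ~ Uniform[0, 2 pi],
# so that z(x)^T z(y) approximates k(x, y).
rng = np.random.default_rng(0)
d, D, gamma = 5, 2000, 0.5
W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))
b = rng.uniform(0.0, 2 * np.pi, size=D)

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
print(z(x) @ z(y), np.exp(-gamma * np.sum((x - y) ** 2)))   # close for large D
```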


Notes

  1. We store the model as \({\mathbf {w}}=\sum _{n}\alpha _{n}\Phi \left( \varvec{x}_{n}\right) .\) The model size is \(\left\| \varvec{\alpha }\right\| _{0}\), and a sparse solution is one for which \(\left\| \varvec{\alpha }\right\| _{0}/N\) is small.

  2. https://github.com/SheffieldML/GPy.

  3. https://github.com/qminh93/RVGP.

  4. http://lsokl.stevenhoi.com/.

  5. Our code for four versions of GoGP (i.e., GoGP-K-P, GoGP-RF-P and GoGP-K-C, GoGP-RF-C) can be found at https://github.com/khanhndk/GoGP.


Acknowledgements

This work is partially supported by the Australian Research Council under the ARC DP160109394.


Corresponding author

Correspondence to Trung Le.

Appendix A: Proofs of Lemmas and Theorems

Proposition 1

(Restated) The coefficient vector \(\varvec{d}^{*}=[d_{1}^{*},\ldots ,d_{N}^{*}]\) of the projection in Eq. (7) can be explicitly computed as \(\varvec{d}^{*}=[K+\sigma ^{2}I]^{-1}K_{*}^{\mathsf {T}}\). Furthermore, the posterior predictive mean \(\mu _{*}\) in Eq. (3) can be expressed by a weighted sum of the training labels

$$\begin{aligned} \mu _{*}=\sum _{n=1}^{N}d_{n}^{*}y_{n}, \end{aligned}$$

and the posterior predictive standard deviation in Eq. (4) can be recovered from the Euclidean distance from \(\Phi _{*}\) to \({\mathcal {L}}\left( \Phi _{X}\right) \) as

$$\begin{aligned} \sigma _{*}=\left( {\mathrm{dist}}\left( \Phi _{*},{\mathcal {L}}\left( \Phi _{X}\right) \right) ^{2}-\sigma ^{2}\right) ^{1/2}. \end{aligned}$$

Proof of Proposition 1

We find \(\varvec{d}=\left[ d_{n}\right] _{n=1,\ldots ,N}^{\mathsf {T}}\) by solving the following optimization problem

$$\begin{aligned} \underset{\varvec{d}}{\min }\,h\left( \varvec{d}\right) \triangleq \left\| \Phi _{*}-\sum _{n=1}^{N}d_{n}\Phi _{n}\right\| ^{2}. \end{aligned}$$

We derive as follows

$$\begin{aligned} h\left( \varvec{d}\right)&={\tilde{K}}_{**}-2K_{*}\varvec{d}+\varvec{d}^{\mathsf {T}}\left[ K+\sigma ^{2}I\right] \varvec{d},\\ \nabla _{\varvec{d}}h&=2\left[ K+\sigma ^{2}I\right] \varvec{d}-2K_{*}^{\mathsf {T}}=0,\\ \varvec{d}^{*}&=\left[ K+\sigma ^{2}I\right] ^{-1}K_{*}^{\mathsf {T}}. \end{aligned}$$

It follows that

$$\begin{aligned} {\mathrm{dist}}\left( \Phi _{*},{\mathcal {L}}\left( \Phi _{X}\right) \right) ^{2}&=h\left( \varvec{d}^{*}\right) ={\tilde{K}}_{**}-2K_{*}\varvec{d}^{*}+\left( \varvec{d}^{*}\right) ^{\mathsf {T}}\left[ K+\sigma ^{2}I\right] \varvec{d}^{*},\\&=\sigma ^{2}+K_{**}-K_{*}\left[ K+\sigma ^{2}I\right] ^{-1}K_{*}^{\mathsf {T}}=\sigma _{*}^{2}+\sigma ^{2}. \end{aligned}$$

\(\square \)
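As a sanity check of these identities, the following minimal sketch (ours, not part of the paper's code) uses a linear kernel so that the feature map is explicit, realizes the noisy features by appending one extra noise coordinate per instance (so that \(\Phi _{n}^{\mathsf {T}}\Phi _{m}=K\left( \varvec{x}_{n},\varvec{x}_{m}\right) +\sigma ^{2}\delta _{nm}\) and \(\Phi _{*}^{\mathsf {T}}\Phi _{n}=K\left( \varvec{x}_{*},\varvec{x}_{n}\right) \)), and verifies the closed form of \(\varvec{d}^{*}\), the weighted-sum expression of \(\mu _{*}\), and the distance identity numerically; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma2 = 25, 3, 0.2
sigma = np.sqrt(sigma2)
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
x_star = rng.normal(size=p)

# Linear kernel so that phi(x) = x is explicit.
K = X @ X.T
k_star = X @ x_star
K_ss = x_star @ x_star

# One explicit realisation of the "noisy" feature map: append an extra noise
# coordinate per point, so that Phi_n . Phi_m = K(x_n, x_m) + sigma^2 delta_nm
# and Phi_* . Phi_n = K(x_*, x_n).
Phi_X = np.hstack([X, sigma * np.eye(N), np.zeros((N, 1))])        # rows are Phi_n
Phi_s = np.concatenate([x_star, np.zeros(N), [sigma]])             # Phi_*

# Projection of Phi_* onto the span of {Phi_n}: plain least squares in feature space.
d, *_ = np.linalg.lstsq(Phi_X.T, Phi_s, rcond=None)
dist2 = np.sum((Phi_s - Phi_X.T @ d) ** 2)

d_closed = np.linalg.solve(K + sigma2 * np.eye(N), k_star)          # coefficients of Eq. (7)
mu_star = k_star @ np.linalg.solve(K + sigma2 * np.eye(N), y)       # GP mean, Eq. (3)
var_star = K_ss - k_star @ np.linalg.solve(K + sigma2 * np.eye(N), k_star)  # variance, Eq. (4)

print(np.allclose(d, d_closed))              # d^* = [K + sigma^2 I]^{-1} K_*^T
print(np.allclose(d @ y, mu_star))           # mu_* = sum_n d_n^* y_n
print(np.allclose(dist2 - sigma2, var_star)) # sigma_*^2 = dist^2 - sigma^2
```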

Proposition 2

(Restated) Let us define \(\varvec{\alpha }_{*}=\left[ K+\sigma ^{2}I\right] ^{-1}\varvec{y}\) whose elements are denoted explicitly by \(\varvec{\alpha }_{*}=[\alpha _{1}^{*},\ldots ,\alpha _{N}^{*}]\), and \({\mathbf {w}}_{*}=\sum _{n=1}^{N}\alpha _{n}^{*}\Phi _{n}\). Then, \({\mathbf {w}}_{*}\) is the optimal solution for the following optimization problem:

$$\begin{aligned} \underset{{\mathbf {w}}}{\min }\,{\mathcal {J}}\left( {\mathbf {w}}\right) \triangleq \frac{\lambda }{2}\left\| {\mathbf {w}}\right\| ^{2}+\frac{1}{N}\sum _{n=1}^{N}\left( {\mathbf {w}}^{\mathsf {T}}\Phi _{n}-y_{n}\right) ^{2}, \end{aligned}$$
(11)

where \(\lambda =2\sigma ^{2}N^{-1}\). The posterior predictive mean in Eq. (3) can be computed as

$$\begin{aligned} \mu _{*}={\mathbf {w}}_{*}^{\mathsf {T}}\Phi _{*}=\sum _{n=1}^{N}\alpha _{n}^{*}{\hat{K}}\left( \varvec{x}_{*},\varvec{x}_{n}\right) \end{aligned}$$
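Before turning to the proof, the claimed equivalence can be checked numerically. The sketch below (ours, for illustration only) uses a linear kernel so that the feature map is explicit, solves the regularized least-squares problem (11) in closed form with \(\lambda =2\sigma ^{2}N^{-1}\), and confirms that \({\mathbf {w}}_{*}=\sum _{n}\alpha _{n}^{*}\Phi _{n}\) and that its prediction equals the GP posterior mean; variable names and constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, sigma2 = 40, 3, 0.25
X = rng.normal(size=(N, p))              # rows are Phi_n = phi(x_n) (linear kernel, phi = identity)
y = rng.normal(size=N)
phi_star = rng.normal(size=p)            # Phi_* for a test input

K = X @ X.T                              # kernel matrix K(x_n, x_m)
k_star = X @ phi_star                    # K(x_*, x_n)

# alpha_* = [K + sigma^2 I]^{-1} y and the GP predictive mean of Eq. (3)
alpha_star = np.linalg.solve(K + sigma2 * np.eye(N), y)
mu_gp = k_star @ alpha_star

# Closed-form minimiser of J(w) = (lambda/2)||w||^2 + (1/N) sum_n (w^T Phi_n - y_n)^2
# with lambda = 2 sigma^2 / N: setting the gradient to zero gives
# (X^T X + (lambda N / 2) I) w = X^T y, i.e. ridge regression with ridge sigma^2.
lam = 2 * sigma2 / N
w_star = np.linalg.solve(X.T @ X + (lam * N / 2) * np.eye(p), X.T @ y)

print(np.allclose(w_star, X.T @ alpha_star))     # w_* = sum_n alpha_n^* Phi_n
print(np.allclose(mu_gp, w_star @ phi_star))     # mu_* = w_*^T Phi_*
```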

Proof of Proposition 2

We transform the unconstrained optimization problem in Eq. (8) to its equivalent form as follows

$$\begin{aligned} \underset{{\mathbf {w}},\varvec{\xi }}{\min }&\,\,\frac{1}{2}\left\| {\mathbf {w}}\right\| ^{2}+\frac{C}{N}\sum _{n=1}^{N}\xi _{n}^{2},\\ \text {s.t.}:&\,\,\xi _{n}={\mathbf {w}}^{\mathsf {T}}\Phi _{n}-y_{n},\,\forall n, \end{aligned}$$

where \(C=\lambda ^{-1}=0.5N\sigma ^{-2}\).

The Lagrange function is of the following form

$$\begin{aligned} {\mathcal {L}}\left( {\mathbf {w}},\varvec{\xi },\varvec{\alpha }\right)&=\frac{1}{2}\left\| {\mathbf {w}}\right\| ^{2}+\frac{C}{N}\sum _{n=1}^{N}\xi _{n}^{2}-\sum _{n=1}^{N}\alpha _{n}\left( {\mathbf {w}}^{\mathsf {T}}\Phi _{n}-y_{n}-\xi _{n}\right) . \end{aligned}$$

Setting the derivatives to 0, we gain

$$\begin{aligned} \nabla _{{\mathbf {w}}}{\mathcal {L}}&={\mathbf {w}}-\sum _{n=1}^{N}\alpha _{n}\Phi _{n}=0\rightarrow {\mathbf {w}}=\sum _{n=1}^{N}\alpha _{n}\Phi _{n},\\ \nabla _{\xi _{n}}{\mathcal {L}}&=\frac{2C\xi _{n}}{N}+\alpha _{n}=0\rightarrow \xi _{n}=\frac{-N\alpha _{n}}{2C}=-\sigma ^{2}\alpha _{n}. \end{aligned}$$

Substituting the above into the Lagrange function, we have the following optimization problem

$$\begin{aligned} \underset{\varvec{\alpha }}{\min } \,\mathcal {A}\left( \varvec{\alpha }\right)&\triangleq \frac{1}{2}\varvec{\alpha }^{\mathsf {T}}\left[ K+\sigma ^{2}I\right] \varvec{\alpha }-\varvec{y}^{\mathsf {T}}\varvec{\alpha },\\ \nabla _{\varvec{\alpha }}\mathcal {A}&=\left[ K+\sigma ^{2}I\right] \varvec{\alpha }-\varvec{y}=0,\\ \varvec{\alpha }^{*}&=\left[ K+\sigma ^{2}I\right] ^{-1}\varvec{y}. \end{aligned}$$

Since the strong duality holds, we have \({\mathbf {w}}_{*}=\sum _{n=1}^{N}\alpha _{n}^{*}\Phi \left( x_{n}\right) \) and the predictive mean \(\mu ^{*}=K_{*}\left[ K+\sigma ^{2}I\right] ^{-1}\varvec{y}=K_{*}\varvec{\alpha }^{*}=\sum _{n=1}^{N}\alpha _{n}^{*}K\left( \varvec{x}_{*},\varvec{x}_{n}\right) ={\mathbf {w}}_{*}^{\mathsf {T}}\Phi _{*}\).

\(\square \)

We now consider the upper bound of \(\left\| {\mathbf {w}}_{*}\right\| \). In particular, we have the following lemma.

Lemma 1

(Restated) If we define \({\mathbf {w}}_{*}\) as

$$\begin{aligned} {\mathbf {w}}_{*}=\underset{{\mathbf {w}}}{{\mathrm{argmin}}}\left( \frac{\lambda }{2}\left\| {\mathbf {w}}\right\| ^{2}+\frac{1}{N}\sum _{n=1}^{N}\left( y_{n}-{\mathbf {w}}^{\mathsf {T}}\Phi _{n}\right) ^{2}\right) , \end{aligned}$$

then we have \(\left\| {\mathbf {w}}_{*}\right\| \le y_{\max }\lambda ^{-1/2}\).

Proof of Lemma 1

Let us consider the equivalent constrained optimization problem

$$\begin{aligned}&\underset{{\mathbf {w}},\varvec{\xi }}{\min }\left( \frac{\lambda }{2}\left\| {\mathbf {w}}\right\| ^{2}+\frac{1}{N}\sum _{n=1}^{N}\xi _{n}^{2}\right) ,\\&\quad \text {s.t.:}\,\,\xi _{n}=y_{n}-{\mathbf {w}}^{\mathsf {T}}\Phi _{n},\quad \forall n. \end{aligned}$$

The Lagrange function is of the following form

$$\begin{aligned} {\mathcal {L}}\left( {\mathbf {w}},\varvec{\xi },\varvec{\alpha }\right) =\frac{\lambda }{2}\left\| {\mathbf {w}}\right\| ^{2}+\frac{1}{N}\sum _{n=1}^{N}\xi _{n}^{2}+\sum _{n=1}^{N}\alpha _{n}\left( y_{n}-{\mathbf {w}}^{\mathsf {T}}\Phi _{n}-\xi _{n}\right) . \end{aligned}$$

Setting the derivatives to 0, we gain

$$\begin{aligned} \nabla _{{\mathbf {w}}}{\mathcal {L}}= & {} \lambda {\mathbf {w}}-\sum _{n=1}^{N}\alpha _{n}\Phi _{n}=0\rightarrow {\mathbf {w}}=\lambda ^{-1}\sum _{n=1}^{N}\alpha _{n}\Phi _{n},\\ \nabla _{\xi _{n}}{\mathcal {L}}= & {} \frac{2}{N}\xi _{n}-\alpha _{n}=0\rightarrow \xi _{n}=\frac{N\alpha _{n}}{2}. \end{aligned}$$

Substituting the above into the Lagrange function, we gain the dual form

$$\begin{aligned} {\mathcal {W}}\left( \varvec{\alpha }\right)&= -\frac{\lambda }{2}\left\| {\mathbf {w}}\right\| ^{2}+\sum _{n=1}^{N}y_{n}\alpha _{n}-\frac{N}{4}\sum _{n=1}^{N}\alpha _{n}^{2},\\&=-\frac{1}{2\lambda }\left\| \sum _{n=1}^{N}\alpha _{n}\Phi _{n}\right\| ^{2}+\sum _{n=1}^{N}y_{n}\alpha _{n}-\frac{N}{4}\sum _{n=1}^{N}\alpha _{n}^{2}. \end{aligned}$$

Let \(\left( {\mathbf {w}}^{*},\varvec{\xi }^{*}\right) \) and \(\varvec{\alpha }^{*}\) denote the primal and dual solutions, respectively. Since strong duality holds, we have

$$\begin{aligned} \frac{\lambda }{2}\left\| {\mathbf {w}}^{*}\right\| ^{2}+\frac{1}{N}\sum _{n=1}^{N}\xi _{n}^{*2}= & {} -\frac{\lambda }{2}\left\| {\mathbf {w}}^{*}\right\| ^{2}+\sum _{n=1}^{N}y_{n}\alpha _{n}^{*}-\frac{N}{4}\sum _{n=1}^{N}\alpha _{n}^{*2},\\ \lambda \left\| {\mathbf {w}}^{*}\right\| ^{2}= & {} \sum _{n=1}^{N}y_{n}\alpha _{n}^{*}-\frac{N}{4}\sum _{n=1}^{N}\alpha _{n}^{*2}-\frac{1}{N}\sum _{n=1}^{N}\xi _{n}^{*2},\\\le & {} \sum _{n=1}^{N}\left( y_{n}\alpha _{n}^{*}-\frac{N}{4}\alpha _{n}^{*2}\right) \le \sum _{n=1}^{N}\frac{y_{n}^{2}}{N}\le y_{\text {max}}^{2}. \end{aligned}$$

Note that we have used \(g\left( \alpha _{n}^{*}\right) =y_{n}\alpha _{n}^{*}-\frac{N}{4}\alpha _{n}^{*2}\le g\left( \frac{2y_{n}}{N}\right) =\frac{y_{n}^{2}}{N}\). Hence, we gain the conclusion. \(\square \)

We now present the theoretical results regarding convergence analysis. The update rule is as follows

$$\begin{aligned} {\mathbf {w}}_{t+1}={\left\{ \begin{array}{ll} \Pi _{S}\left( \frac{t-1}{t}{\mathbf {w}}_{t}-\frac{2\alpha _{t}}{\lambda t}\Phi _{t}\right) &{} \text {if}\,\delta _{t}>\theta ,\\ \Pi _{S}\left( \frac{t-1}{t}{\mathbf {w}}_{t}-\frac{2\alpha _{t}}{\lambda t}{\mathbb {P}}_{{\mathcal {L}}\left( \Phi _{U}\right) }\left( \Phi _{t}\right) \right) &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$

where \(\alpha _{t}={\mathbf {w}}_{t}^{\mathsf {T}}\Phi _{t}-y_{t}\), \(\delta _{t}={\mathrm{dist}}\left( \Phi _{t},{\mathcal {L}}\left( \Phi _{U}\right) \right) \), and \(S={\left\{ \begin{array}{ll} {\mathbb {R}}^{d} &{} \text {if}\,\,\lambda >2,\\ {\mathcal {B}}(0,y_{\text {max}}\lambda ^{-1/2}) &{} \text {otherwise}, \end{array}\right. }\) with \({\mathcal {B}}(0,y_{\text {max}}\lambda ^{-1/2})\) denoting \(\{ x\in {\mathbb {R}}^{d}:\left\| x\right\| \le y_{\text {max}}\lambda ^{-1/2}\} \).

We can rewrite the update rule as follows

$$\begin{aligned} {\mathbf {w}}_{t+1}=\Pi _{S}\left( {\mathbf {w}}_{t}-\eta _{t}\left( g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right) \right) , \end{aligned}$$

where \(\eta _{t}=\frac{1}{\lambda t}\), \(g_{t}=\lambda {\mathbf {w}}_{t}+2\alpha _{t}\Phi _{t}\), \(Z_{t}\) is a binary random variable with \(\Pr \left( Z_{t}=1\right) =\Pr \left( \delta _{t}\le \theta \right) \) (i.e., the probability that the approximation is performed), and \({\mathbb {H}}\left( U,\varvec{x}_{t}\right) =\Phi \left( \varvec{x}_{t}\right) -{\mathbb {P}}_{{\mathcal {L}}\left( \Phi _{U}\right) }\left( \Phi \left( \varvec{x}_{t}\right) \right) \) is the rejection vector of \(\Phi _{t}\) from \({\mathcal {L}}\left( \Phi _{U}\right) \).
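To make the update concrete, here is a minimal sketch (ours, not the authors' implementation) of a single step in a finite-dimensional feature space, e.g., after a random-feature approximation, where \(\Phi _{t}\) and the inducing features are explicit vectors. Following the proof of Theorem 4 below, the sketch also appends the arriving point to the inducing set whenever \(\delta _{t}>\theta \); the radius argument encodes the set S above, with an infinite radius corresponding to \(\lambda >2\).

```python
import numpy as np

def project_onto_ball(w, radius):
    """Pi_S: project w onto B(0, radius); radius = np.inf leaves w unchanged."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def project_onto_span(phi, Phi_U):
    """P_{L(Phi_U)}: orthogonal projection of phi onto the span of the columns of Phi_U."""
    if Phi_U.shape[1] == 0:
        return np.zeros_like(phi)
    d, *_ = np.linalg.lstsq(Phi_U, phi, rcond=None)
    return Phi_U @ d

def gogp_step(w, t, phi_t, y_t, Phi_U, lam, theta, radius):
    """One illustrative update of w_t (t >= 1) given the arriving feature phi_t and label y_t."""
    alpha = w @ phi_t - y_t                      # alpha_t = w_t^T Phi_t - y_t
    proj = project_onto_span(phi_t, Phi_U)       # P_{L(Phi_U)}(Phi_t)
    delta = np.linalg.norm(phi_t - proj)         # delta_t = dist(Phi_t, L(Phi_U))
    if delta > theta:                            # far from the span: exact update,
        w_new = (t - 1) / t * w - 2 * alpha / (lam * t) * phi_t
        Phi_U = np.column_stack([Phi_U, phi_t])  # and the point joins the inducing set
    else:                                        # close to the span: use the projection
        w_new = (t - 1) / t * w - 2 * alpha / (lam * t) * proj
    return project_onto_ball(w_new, radius), Phi_U
```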

Lemma 2

(Restated) The following statement holds

$$\begin{aligned} \left\| {\mathbf {w}}_{T+1}\right\| \le 2\lambda ^{-1}\left( 1+\sigma ^{2}\right) ^{1/2}\left( y_{\max }+\frac{\left( 1+\sigma ^{2}\right) ^{1/2}}{T}\sum _{t=1}^{T}\left\| {\mathbf {w}}_{t}\right\| \right) , \end{aligned}$$

where \(y_{\max }={\max }_{y\in {\mathcal {Y}}} |y|\).

Proof of Lemma 2

We have the following

$$\begin{aligned} {\mathbf {w}}_{t+1}={\left\{ \begin{array}{ll} \Pi _{S}\left( \frac{t-1}{t}{\mathbf {w}}_{t}-\frac{2\alpha _{t}}{\lambda t}\Phi _{t}\right) &{} \text {if}\,Z_{t}=0,\\ \Pi _{S}\left( \frac{t-1}{t}{\mathbf {w}}_{t}-\frac{2\alpha _{t}}{\lambda t}{\mathbb {P}}_{{\mathcal {L}}\left( \Phi _{U}\right) }\left( \Phi _{t}\right) \right) &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

It follows that

$$\begin{aligned} \left\| {\mathbf {w}}_{t+1}\right\| \le \frac{t-1}{t}\left\| {\mathbf {w}}_{t}\right\| +\frac{2}{\lambda t}\left| \alpha _{t}\right| \left( 1+\sigma ^{2}\right) ^{1/2}\,\,\,\,\,\,\,\,\text {since}\,\,\left\| {\mathbb {P}}{}_{{\mathcal {L}}\left( \Phi _{U}\right) }\left( \Phi _{t}\right) \right\| \le \left\| \Phi _{t}\right\| =\left( 1+\sigma ^{2}\right) ^{1/2}. \end{aligned}$$

Moreover, we have

$$\begin{aligned} \left| \alpha _{t}\right| =\left| y_{t}-{\mathbf {w}}_{t}^{\mathsf {T}}\Phi _{t}\right| \le y_{\text {max}}+\left\| {\mathbf {w}}_{t}\right\| \left\| \Phi _{t}\right\| \le y_{\text {max}}+\left\| {\mathbf {w}}_{t}\right\| \left( 1+\sigma ^{2}\right) ^{1/2}. \end{aligned}$$

Thus, we achieve

$$\begin{aligned} t\left\| {\mathbf {w}}_{t+1}\right\| \le \left( t-1\right) \left\| {\mathbf {w}}_{t}\right\| +2\lambda ^{-1}\left( 1+\sigma ^{2}\right) ^{1/2}\left( y_{\text {max}}+\left\| {\mathbf {w}}_{t}\right\| \left( 1+\sigma ^{2}\right) ^{1/2}\right) . \end{aligned}$$

Summing over \(t=1,2,\ldots ,T\) and telescoping, we achieve

$$\begin{aligned} T\left\| {\mathbf {w}}_{T+1}\right\|\le & {} 2\lambda ^{-1}\left( 1+\sigma ^{2}\right) ^{1/2}\left( Ty_{\text {max}}+\left( 1+\sigma ^{2}\right) ^{1/2}\sum _{t=1}^{T}\left\| {\mathbf {w}}_{t}\right\| \right) ,\nonumber \\ \left\| {\mathbf {w}}_{T+1}\right\|\le & {} 2\lambda ^{-1}\left( 1+\sigma ^{2}\right) ^{1/2}\left( y_{\text {max}}+\frac{\left( 1+\sigma ^{2}\right) ^{1/2}}{T}\sum _{t=1}^{T}\left\| {\mathbf {w}}_{t}\right\| \right) . \end{aligned}$$
(12)

\(\square \)

Lemma 3

(Restated) If \(\lambda >2\left( 1+\sigma ^{2}\right) \), then \(\left\| {\mathbf {w}}_{T+1}\right\| \le \frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\left( 1-\frac{1}{\left( \frac{\lambda }{2\left( 1+\sigma ^{2}\right) }\right) ^{T}}\right) <\frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\) for all T.

Proof of Lemma 3

First, we consider the sequence \(\left\{ s_{T}\right\} _{T}\) defined by \(s_{T+1}=2\lambda ^{-1}(1+\sigma ^{2})^{1/2}(y_{\text {max}}+(1+\sigma ^{2})^{1/2}s_{T})\) and \(s_{1}=0\). The closed form of this sequence can be derived as

$$\begin{aligned} s_{T+1}-\frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}&=\left( \frac{\lambda }{2\left( 1+\sigma ^{2}\right) }\right) ^{-1}\left( s_{T}-\frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\right) =\ldots \\&=\left( \frac{\lambda }{2\left( 1+\sigma ^{2}\right) }\right) ^{-T}\left( s_{1}-\frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\right) =\frac{-\left( \frac{\lambda }{2\left( 1+\sigma ^{2}\right) }\right) ^{-T}y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\\ s_{T+1}&=\frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\left( 1-\frac{1}{\left( \frac{\lambda }{2\left( 1+\sigma ^{2}\right) }\right) ^{T}}\right) . \end{aligned}$$

We prove by induction on T that \(\left\| {\mathbf {w}}_{T}\right\| \le s_{T}\) for all T. It is obvious that \(\left\| {\mathbf {w}}_{1}\right\| =s_{1}=0\). Assuming that \(\left\| {\mathbf {w}}_{t}\right\| \le s_{t}\) for all \(t\le T\), we verify it for \(T+1\). Indeed, since \(\left\{ s_{t}\right\} _{t}\) is nondecreasing, we have

$$\begin{aligned} \left\| {\mathbf {w}}_{T+1}\right\|&\le 2\lambda ^{-1}\left( 1+\sigma ^{2}\right) ^{1/2}\left( y_{\text {max}}+\frac{\left( 1+\sigma ^{2}\right) ^{1/2}}{T}\sum _{t=1}^{T}\left\| {\mathbf {w}}_{t}\right\| \right) ,\\&\le 2\lambda ^{-1}\left( 1+\sigma ^{2}\right) ^{1/2}\left( y_{\text {max}}+\frac{\left( 1+\sigma ^{2}\right) ^{1/2}}{T}\sum _{t=1}^{T}s_{t}\right) ,\\&\le 2\lambda ^{-1}\left( 1+\sigma ^{2}\right) ^{1/2}\left( y_{\text {max}}+\left( 1+\sigma ^{2}\right) ^{1/2}s_{T}\right) =s_{T+1}. \end{aligned}$$

Lemma 4

(Restated) The following statement holds \(\left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\le W,\,\forall t\) where we have defined

$$\begin{aligned} W={\left\{ \begin{array}{ll} \left( \frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}+y_{\max }\lambda ^{-1/2}\right) ^{2} &{} {\mathrm{if}}\quad \lambda >2(1+\sigma ^{2}),\\ 4y_{\max }^{2}\lambda ^{-1} &{} {\mathrm{otherwise}}. \end{array}\right. } \end{aligned}$$

Proof of Lemma 4

We consider two cases. Case 1: \(\lambda >2\left( 1+\sigma ^{2}\right) \). By Lemmas 1 and 3, we have

$$\begin{aligned} \left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\le \left( \left\| {\mathbf {w}}_{t}\right\| +\left\| {\mathbf {w}}_{*}\right\| \right) ^{2}\le \left( \frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}+y_{\text {max}}\lambda ^{-1/2}\right) ^{2}. \end{aligned}$$

Case 2: \(0<\lambda \le 2(1+\sigma ^{2})\).

Both \({\mathbf {w}}_{t}\) and \({\mathbf {w}}^{*}\) are in \({\mathcal {B}}\left( 0,y_{\text {max}}\lambda ^{-1/2}\right) \). Hence, we have

$$\begin{aligned} \left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\le \left( 2y_{\text {max}}\lambda ^{-1/2}\right) ^{2}=4y_{\text {max}}^{2}\lambda ^{-1}. \end{aligned}$$

\(\square \)

Lemma 5

(Restated) The following statement holds

$$\begin{aligned} \left| \alpha _{t}\right| \le M=\left( 1+\sigma ^{2}\right) ^{1/2}\max \left( \frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\max }\lambda ^{-1/2}\right) +y_{\max },\quad \forall t. \end{aligned}$$

Proof of Lemma 5

We derive as follows

$$\begin{aligned} \left\| {\mathbf {w}}_{t}\right\|&\le \max \left( \frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\text {max}}\lambda ^{-1/2}\right) ,\\ \left| \alpha _{t}\right|&=\left| {\mathbf {w}}_{t}^{\mathsf {T}}\Phi \left( \varvec{x}_{t}\right) -y_{t}\right| \le \left\| {\mathbf {w}}_{t}\right\| \left\| \Phi _{t}\right\| +y_{\max },\\&\le \left( 1+\sigma ^{2}\right) ^{1/2}\max \left( \frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\max }\lambda ^{-1/2}\right) +y_{\max }=M. \end{aligned}$$

Lemma 6

(Restated) The following statement holds

$$\begin{aligned} \left\| g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| ^{2}\le G,\,\forall t, \end{aligned}$$

where we have defined

$$\begin{aligned} G=\left( \left( \lambda +2\left( 1+\sigma ^{2}\right) ^{1/2}\right) \max \left( \frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\max }\lambda ^{-1/2}\right) +2\left( 1+\sigma ^{2}\right) ^{1/2}y_{\max }\right) ^{2}. \end{aligned}$$

Proof of Lemma 6

We derive as follows

$$\begin{aligned} \left\| g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| ^{2}&\le \left( \lambda \left\| {\mathbf {w}}_{t}\right\| +2\left| \alpha _{t}\right| \left\| \Phi _{t}\right\| \right) ^{2}=\left( \lambda \left\| {\mathbf {w}}_{t}\right\| +2\left| \alpha _{t}\right| \left( 1+\sigma ^{2}\right) ^{1/2}\right) ^{2}. \end{aligned}$$

Here, we note that to gain the above inequality, we consider two cases \(Z_{t}=1\) and \(Z_{t}=0\) and use \(\left\| {\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| \le \left\| \Phi _{t}\right\| =\left( 1+\sigma ^{2}\right) ^{1/2}\).

We have

$$\begin{aligned} \left\| {\mathbf {w}}_{t}\right\|&\le \max \left( \frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\text {max}}\lambda ^{-1/2}\right) ,\\ \left| \alpha _{t}\right|&\le M=\left( 1+\sigma ^{2}\right) ^{1/2}\max \left( \frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\max }\lambda ^{-1/2}\right) +y_{\max }. \end{aligned}$$

Hence, we gain

$$\begin{aligned} \lambda \left\| {\mathbf {w}}_{t}\right\| +2\left| \alpha _{t}\right| \left( 1+\sigma ^{2}\right) ^{1/2}&\le \lambda \max \left( \frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\text {max}}\lambda ^{-1/2}\right) \\&\quad +2\left( 1+\sigma ^{2}\right) ^{1/2}\max \left( \frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\max }\lambda ^{-1/2}\right) \\&\quad +2\left( 1+\sigma ^{2}\right) ^{1/2}y_{\max },\\&= \left( \lambda +2\left( 1+\sigma ^{2}\right) ^{1/2}\right) \max \left( \frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\max }\lambda ^{-1/2}\right) \\&\quad +2\left( 1+\sigma ^{2}\right) ^{1/2}y_{\max }. \end{aligned}$$

\(\square \)

Theorem 1

(Restated) Consider Algorithm 4.1, where \(\left( \varvec{x}_{t},y_{t}\right) \sim P_{N}\), or \(P_{{\mathcal {X}}\times {\mathcal {Y}}}\), arrives on the fly. Then the following bound holds

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}\left( {\overline{{\mathbf {w}}}}_{T}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \right]&\le \frac{G\left( \log \,T+1\right) }{2\lambda T}+\frac{2WM\theta }{T}\sum _{t=1}^{T}p_{t}\le \frac{G\left( \log \,T+1\right) }{2\lambda T}+2WM\theta , \end{aligned}$$

where \(G,\,M,\,W\) are positive constants and \(p_{t}=\Pr \left( Z_{t}=1\right) \) as defined before.

Proof of Theorem 1

We have

$$\begin{aligned} \left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}&=\left\| \prod _{S}\left( {\mathbf {w}}_{t}-\eta _{t}\left( g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right) \right) -{\mathbf {w}}^{*}\right\| ^{2}\\&\le \left\| {\mathbf {w}}_{t}-\eta _{t}\left( g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right) -{\mathbf {w}}_{*}\right\| ^{2}\\&=\left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}+\eta _{t}^{2}\left\| g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| ^{2}\\&\quad -2\eta _{t}\left\langle {\mathbf {w}}_{t}-w^{*},g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\rangle ,\\ \left\langle {\mathbf {w}}_{t}-{\mathbf {w}}_{*},g_{t}\right\rangle&\le \frac{\left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}-\left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}}{2\eta _{t}}\\&\quad +\frac{\eta _{t}}{2}\left\| g_{t} -2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| ^{2}+\left\langle {\mathbf {w}}_{t}-{\mathbf {w}}_{*},2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\rangle . \end{aligned}$$

We recall that \(g_{t}=\lambda {\mathbf {w}}_{t}+2\left( {\mathbf {w}}_{t}^{\mathsf {T}}\Phi _{t}-y_{t}\right) \Phi \left( \varvec{x}_{t}\right) \) and \(\left( \varvec{x}_{t},y_{t}\right) \sim P_{N}\) or \(P_{{\mathcal {X}}\times {\mathcal {Y}}}\). Hence, we gain \({\mathbb {E}}\left[ g_{t}|{\mathbf {w}}_{t}\right] ={\mathcal {J}}^{'}\left( {\mathbf {w}}_{t}\right) \).

Taking the conditional expectation given \({\mathbf {w}}_{t}\), we achieve

$$\begin{aligned} \left\langle {\mathbf {w}}_{t}-{\mathbf {w}}_{*},{\mathcal {J}}^{'}\left( {\mathbf {w}}_{t}\right) \right\rangle&\le {\mathbb {E}}\left[ \frac{\left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}-\left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}}{2\eta _{t}}\right] \\&\quad +\frac{\eta _{t}}{2}{\mathbb {E}}\left[ \left\| g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| ^{2}\right] \\&\quad +{\mathbb {E}}\left[ \left\langle {\mathbf {w}}_{t}-{\mathbf {w}}_{*},2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\rangle \right] ,\\ {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) +\frac{\lambda }{2}\left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}&\le {\mathbb {E}}\left[ \frac{\left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}-\left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}}{2\eta _{t}}\right] \\&\quad +\frac{\eta _{t}}{2}{\mathbb {E}}\left[ \left\| g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| ^{2}\right] \\&\quad +2{\mathbb {E}}\left[ \left\langle {\mathbf {w}}_{t}-{\mathbf {w}}_{*},Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\rangle \right] . \end{aligned}$$

Taking expectation again, we obtain

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \right]&\le \frac{\lambda }{2}\left( t-1\right) {\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\right] -\frac{\lambda }{2}t{\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}\right] \nonumber \\&\quad +\frac{\eta _{t}}{2}{\mathbb {E}}\left[ \left\| g_{t}-Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| ^{2}\right] \nonumber \\&\quad +2{\mathbb {E}}\left[ \left\langle {\mathbf {w}}_{t}-{\mathbf {w}}_{*},Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\rangle \right] ,\nonumber \\&\le \frac{\lambda }{2}\left( t-1\right) {\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\right] -\frac{\lambda }{2}t{\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}\right] \nonumber \\&\quad +\frac{G\eta _{t}}{2}+2{\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| \left\| {\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| \left| \alpha _{t}\right| Z_{t}\right] ,\nonumber \\&\le \frac{\lambda }{2}\left( t-1\right) {\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\right] -\frac{\lambda }{2}t{\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}\right] \nonumber \\&\quad +\frac{G\eta _{t}}{2}+2WM{\mathbb {E}}\left[ \left\| {\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| Z_{t}\right] ,\nonumber \\&\le \frac{\lambda }{2}\left( t-1\right) {\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\right] -\frac{\lambda }{2}t{\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}\right] \nonumber \\&\quad +\frac{G\eta _{t}}{2}+2WM\theta \Pr \left( Z_{t}=1\right) . \end{aligned}$$
(13)

Summing the above over \(t=1,\ldots ,T\) and using \(\eta _{t}=\frac{1}{\lambda t}\) together with \(\sum _{t=1}^{T}\frac{1}{t}\le \log T+1\), we achieve

$$\begin{aligned} \sum _{t=1}^{T}{\mathbb {E}}\left[ {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) \right] -T{\mathcal {J}}\left( {\mathbf {w}}_{*}\right)&\le \frac{G}{2}\sum _{t=1}^{T}\frac{1}{\lambda t}+2WM\theta \sum _{t=1}^{T}p_{t},\\ \frac{1}{T}\sum _{t=1}^{T}{\mathbb {E}}\left[ {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) \right] -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right)&\le \frac{G\left( \log T+1\right) }{2\lambda T}+\frac{2WM\theta }{T}\sum _{t=1}^{T}p_{t}\le \frac{G\left( \log T+1\right) }{2\lambda T}+2WM\theta ,\\ {\mathbb {E}}\left[ {\mathcal {J}}\left( {\overline{{\mathbf {w}}}}_{T}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \right]&\le \frac{G\left( \log T+1\right) }{2\lambda T}+\frac{2WM\theta }{T}\sum _{t=1}^{T}p_{t}\le \frac{G\left( \log T+1\right) }{2\lambda T}+2WM\theta . \end{aligned}$$

where the last line follows from the convexity of \({\mathcal {J}}\left( .\right) \) applied to \({\overline{{\mathbf {w}}}}_{T}\). \(\square \)

Theorem 2

(Restated) Consider the output of Algorithm 4.1 and further let \({\overline{{\mathbf {w}}}}_{T}^{\gamma }=\frac{1}{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}{\mathbf {w}}_{t}\) and \(W_{T}^{\gamma }={\mathbb {E}}[\left\| {\mathbf {w}}_{\left( 1-\gamma \right) T+1}-{\mathbf {w}}_{*}\right\| ^{2}]\), where \(0<\gamma <1\). Then the following inequality holds

$$\begin{aligned} {\mathcal {J}}\left( {\overline{{\mathbf {w}}}}_{T}^{\gamma }\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right)&\le \frac{1}{\gamma T}\sum _{t=\gamma 'T+1}^{T}{\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) ,\\&\le \frac{\lambda \gamma '}{2\gamma }W_{T}^{\gamma }+\frac{G\log \left( 1/\gamma '\right) }{2\lambda \gamma T}+\frac{2WM\theta }{\gamma T}\sum _{t=\gamma 'T+1}^{T}p_{t},\\&\le \frac{\lambda \gamma '}{2\gamma }W_{T}^{\gamma }+\frac{G\log \left( 1/\gamma '\right) }{2\lambda \gamma T}+2WM\theta , \end{aligned}$$

where \(\gamma '=1-\gamma \).

Proof of Theorem 2

Summing Eq. (13) over \(t=\left( 1-\gamma \right) T+1,\ldots ,T\) and dividing by \(\gamma T\), we gain

$$\begin{aligned} \frac{1}{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}{\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right)&\le \frac{\lambda \left( 1-\gamma \right) }{2\gamma }W_{T}^{\gamma }+\frac{G\log \left( 1/\left( 1-\gamma \right) \right) }{2\lambda \gamma T}+\frac{2WM\theta }{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}p_{t},\\&\le \frac{\lambda \left( 1-\gamma \right) }{2\gamma }W_{T}^{\gamma }+\frac{G\log \left( 1/\left( 1-\gamma \right) \right) }{2\lambda \gamma T}+2WM\theta . \end{aligned}$$

We note that we have used the inequality \(\sum _{t=\left( 1-\gamma \right) T+1}^{T}\frac{1}{t}\le \log \left( 1/\left( 1-\gamma \right) \right) \).

To achieve the conclusion, we use the convexity of the function \({\mathcal {J}}\left( .\right) \) which implies \({\mathcal {J}}\left( {\overline{{\mathbf {w}}}}_{T}^{\gamma }\right) \le \frac{1}{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}{\mathcal {J}}\left( {\mathbf {w}}_{t}\right) \). \(\square \)

Theorem 5

(Hoeffding inequality) Let \(X_{1},\ldots ,X_{n}\) be independent random variables with \(a_{i}\le X_{i}\le b_{i}\) for each \(i\in \left[ n\right] \). Let \(S=\sum _{i=1}^{n}X_{i}\) and \(\Delta _{i}=b_{i}-a_{i}\). The following statements hold

  1. (i)

    \({\mathbb {P}}\left( S-{\mathbb {E}}\left[ S\right] >\varepsilon \right) \le \exp \left( -\frac{2\varepsilon ^{2}}{\sum _{i=1}^{n}\Delta _{i}^{2}}\right) \).

  2. (ii)

    \({\mathbb {P}}\left( \left| S-{\mathbb {E}}\left[ S\right] \right| >\varepsilon \right) \le 2\exp \left( -\frac{2\varepsilon ^{2}}{\sum _{i=1}^{n}\Delta _{i}^{2}}\right) .\)
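As a quick numerical sanity check of statement (i), the following sketch (illustrative only) estimates the tail probability by Monte Carlo for a sum of independent Uniform[0, 1] variables, for which \(\Delta _{i}=1\), and compares it against the Hoeffding bound.

```python
import numpy as np

# Monte Carlo check of Hoeffding's bound for S = X_1 + ... + X_n with X_i ~ Uniform[0, 1],
# so that E[S] = n / 2 and Delta_i = 1 for every i.
rng = np.random.default_rng(0)
n, eps, trials = 50, 5.0, 100_000
S = rng.uniform(0.0, 1.0, size=(trials, n)).sum(axis=1)
empirical = np.mean(S - n / 2 > eps)        # estimate of P(S - E[S] > eps)
bound = np.exp(-2 * eps ** 2 / n)           # exp(-2 eps^2 / sum_i Delta_i^2)
print(empirical, "<=", bound)               # the empirical tail sits well below the bound
```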

Theorem 3

(Restated) Define the gap \(m_{T}=\frac{\lambda \gamma '}{2\gamma }W_{T}^{\gamma }+\frac{2WM\theta }{\gamma T}\sum _{t=\gamma 'T+1}^{T}p_{t}\). Let r be an index picked uniformly at random from \(\{ \gamma 'T+1,\gamma 'T+2,\ldots ,T\} \). With probability at least \(1-\delta \), the following inequality holds

$$\begin{aligned} {\mathcal {J}}\left( {\mathbf {w}}_{r}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \le \frac{G\log \left( 1/\gamma '\right) }{2\lambda \gamma T}+m_{T}+\Delta _{T}^{\gamma }\sqrt{\frac{1}{2}\log \frac{1}{\delta }}, \end{aligned}$$

where \(\Delta _{T}^{\gamma }={\max }_{\gamma 'T+1\le t\le T}\left( {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \right) \).

Proof of Theorem 3

From the above theorem, we achieve

$$\begin{aligned} \frac{1}{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}{\mathbb {E}}\left[ {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) \right] -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \le \frac{G\log \left( 1/\left( 1-\gamma \right) \right) }{2\lambda \gamma T}+m_{T}, \end{aligned}$$

where \(m_{T}=\frac{\lambda \left( 1-\gamma \right) }{2\gamma }W_{T}^{\gamma }+\frac{2WM\theta }{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}p_{t}\).

Let us denote \(X={\mathcal {J}}\left( {\mathbf {w}}_{r}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \), where r is uniformly sampled from \(\left\{ \left( 1-\gamma \right) T+1,\left( 1-\gamma \right) T+2,\ldots ,T\right\} \). We have

$$\begin{aligned} {\mathbb {E}}_{r}\left[ X\right] =\frac{1}{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}{\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) . \end{aligned}$$

Taking the expectation over the data sequence and using the bound above, it follows that

$$\begin{aligned} {\mathbb {E}}\left[ X\right] ={\mathbb {E}}_{\left( x_{t},y_{t}\right) _{t=1}^{T}}\left[ {\mathbb {E}}_{r}\left[ X\right] \right] \le \frac{G\log \left( 1/\left( 1-\gamma \right) \right) }{2\lambda \gamma T}+m_{T}. \end{aligned}$$

Let us denote \(\Delta _{T}^{\gamma }=\underset{\left( 1-\gamma \right) T+1\le t\le T}{\max }\left( {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \right) ,\) which implies that \(0\le {\mathcal {J}}\left( {\mathbf {w}}_{r}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \le \Delta _{T}^{\gamma }\). Applying the Hoeffding inequality (Theorem 5) to the random variable X, we gain

$$\begin{aligned} P\left( X-{\mathbb {E}}\left[ X\right]>\varepsilon \right) \le&\exp \left( -\frac{2\varepsilon ^{2}}{\left( \Delta _{T}^{\gamma }\right) ^{2}}\right) , \\ P\left( X-\frac{G\log \left( 1/\left( 1-\gamma \right) \right) }{2\lambda \gamma T}-m_{T}>\varepsilon \right) \le&\exp \left( -\frac{2\varepsilon ^{2}}{\left( \Delta _{T}^{\gamma }\right) ^{2}}\right) , \\ P\left( X\le \frac{G\log \left( 1/\left( 1-\gamma \right) \right) }{2\lambda \gamma T}+m_{T}+\varepsilon \right) \ge&1-\exp \left( -\frac{2\varepsilon ^{2}}{\left( \Delta _{T}^{\gamma }\right) ^{2}}\right) . \end{aligned}$$

Choosing \(\delta =\exp \left( -\frac{2\varepsilon ^{2}}{\left( \Delta _{T}^{\gamma }\right) ^{2}}\right) \), or equivalently \(\varepsilon =\Delta _{T}^{\gamma }\sqrt{\frac{1}{2}\log \frac{1}{\delta }}\), with probability at least \(1-\delta \) we have

$$\begin{aligned} {\mathcal {J}}\left( {\mathbf {w}}_{r}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \le \frac{G\log \left( 1/\left( 1-\gamma \right) \right) }{2\lambda \gamma T}+m_{T}+\Delta _{T}^{\gamma }\sqrt{\frac{1}{2}\log \frac{1}{\delta }}. \end{aligned}$$

\(\square \)

Theorem 4

(Restated) Consider Algorithm 4.1, where \(\left( \varvec{x}_{t},y_{t}\right) \sim P_{N}\) or \(P_{{\mathcal {X}}\times {\mathcal {Y}}}\). After at most \(T_{\theta }\) iterations, the algorithm reaches a \(\theta \)-stable state in which the size of the inducing set is bounded by \(T_{\theta }\), i.e., \(\left| U\right| \le T_{\theta }\), and for any \(\varvec{x}_{*}\notin U\), the standard deviation \(\sigma _{U}\left( \varvec{x}_{*}\right) \) of the distribution \(p\left( f_{*}\mid \varvec{x}_{*},U\right) \) is less than \((\theta ^{2}-\sigma ^{2})^{1/2}\). More importantly, the constant \(T_{\theta }\) is independent of the data distribution and the arrival order.

Proof of Theorem 4

We assume that \({\mathcal {X}}\subset {\mathbb {R}}^{d}\) is a compact set (i.e., closed and bounded) and \(\Phi \left( .\right) \) is a continuous map. Since \({\mathcal {X}}\) is a compact set and \(\Phi \left( .\right) \) is a continuous map, \(\Phi \left( {\mathcal {X}}\right) \) is also a compact set. Let \(\left\{ {\mathcal {C}}\left( \Phi \left( \varvec{s}\right) ,\theta /2\right) \right\} _{\varvec{s}\in {\mathcal {X}}}\) be an open cover of \(\Phi \left( {\mathcal {X}}\right) \) by balls of radius \(\theta /2\). We note that the open ball \({\mathcal {C}}\left( \Phi \left( \varvec{s}\right) ,\theta /2\right) \) is defined as

$$\begin{aligned} {\mathcal {C}}\left( \Phi \left( \varvec{s}\right) ,\theta /2\right) =\left\{ \phi \in \Phi \left( {\mathcal {X}}\right) \mid \left\| \phi -\Phi \left( \varvec{s}\right) \right\| <\theta /2\right\} . \end{aligned}$$

From this cover, we can extract a finite subcover of \(T_{\theta }\) open balls, that is, \(\left\{ {\mathcal {C}}\left( \Phi \left( \varvec{s}_{i}\right) ,\theta /2\right) \right\} _{i=1}^{T_{\theta }}\). We denote the set of inducing variables U right before adding the instance \(\left( \varvec{x}_{t},y_{t}\right) \) by \(U_{t}\). It is apparent that the resultant set of inducing variables is the union of all instantaneous sets of inducing variables, i.e., \(U=\bigcup _{t\ge 1}U_{t}\).

We now prove that if \(\varvec{u},\varvec{v}\in U\) are two distinct elements of U, then \(\left\| \Phi \left( \varvec{u}\right) -\Phi \left( \varvec{v}\right) \right\| >\theta \). Without loss of generality, assume that \(\varvec{u}=\varvec{x}_{t}\) and \(\varvec{v}=\varvec{x}_{t'}\) with \(t>t^{'}\). Since \(\varvec{x}_{t}\) was added to the inducing set, we then have

$$\begin{aligned} {\mathrm{dist}}\left( \Phi \left( \varvec{x}_{t}\right) ,{\mathcal {L}}\left( \Phi _{U_{t}}\right) \right)>\theta ,\\ \underset{\varvec{d}}{\min }\left\| \Phi \left( \varvec{x}_{t}\right) -\sum _{\varvec{x}\in U_{t}}d_{x}\Phi \left( \varvec{x}\right) \right\| >\theta . \end{aligned}$$

Since \(\varvec{v}=x_{t'}\in U_{t}\), by choosing \(d_{v}=1\) and \(d_{x}=0,\,x\ne v\), we gain

$$\begin{aligned} \left\| \Phi \left( \varvec{x}_{t}\right) -\Phi \left( \varvec{x}_{t'}\right) \right\|&>\theta ,\\ \left\| \Phi \left( \varvec{u}\right) -\Phi \left( \varvec{v}\right) \right\|&>\theta . \end{aligned}$$

Therefore, each open ball in the finite subcover \(\left\{ {\mathcal {C}}\left( \Phi \left( \varvec{s}_{i}\right) ,\theta /2\right) \right\} _{i=1}^{T_{\theta }}\) has diameter less than \(\theta \) (any two of its points are at distance strictly smaller than \(\theta \)) and hence cannot contain two distinct points of U. Besides, U is a subset of \(\bigcup _{i=1}^{T_{\theta }}{\mathcal {C}}\left( \Phi \left( \varvec{s}_{i}\right) ,\theta /2\right) \). Hence, the cardinality of U cannot exceed \(T_{\theta }\), i.e., \(\left| U\right| \le T_{\theta }\). \(\square \)
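The boundedness of the inducing set is easy to observe empirically. The sketch below (ours; the kernel width, \(\theta \), and \(\sigma ^{2}\) are arbitrary illustration values) streams the same data in several random orders, adds a point to U whenever its feature-space distance to \({\mathcal {L}}\left( \Phi _{U}\right) \), computed as \(\left( 1+\sigma ^{2}-k_{U}^{\mathsf {T}}\left[ K_{UU}+\sigma ^{2}I\right] ^{-1}k_{U}\right) ^{1/2}\) for a normalized kernel as in Proposition 1, exceeds \(\theta \), and prints inducing-set sizes that remain bounded regardless of the arrival order.

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    """Gaussian kernel matrix between the rows of X and the rows of Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def inducing_set_size(X, theta, sigma2=0.1, gamma=1.0):
    """Stream X once; add x to U whenever dist(Phi(x), L(Phi_U)) > theta."""
    U = []
    for x in X:
        if not U:
            U.append(x)
            continue
        Zu = np.asarray(U)
        Kuu = rbf(Zu, Zu, gamma) + sigma2 * np.eye(len(U))      # gram of the noisy features on U
        ku = rbf(Zu, x[None, :], gamma).ravel()                  # K(u, x) for u in U
        dist2 = (1.0 + sigma2) - ku @ np.linalg.solve(Kuu, ku)   # squared distance to the span
        if np.sqrt(max(dist2, 0.0)) > theta:
            U.append(x)
    return len(U)

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(3000, 2))
for seed in range(3):
    order = np.random.default_rng(seed).permutation(len(X))
    print(inducing_set_size(X[order], theta=0.5))                # bounded size for every order
```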

Cite this article

Le, T., Nguyen, K., Nguyen, V. et al. GoGP: scalable geometric-based Gaussian process for online regression. Knowl Inf Syst 60, 197–226 (2019). https://doi.org/10.1007/s10115-018-1239-1
