
GoGP: scalable geometric-based Gaussian process for online regression

  • Regular Paper
  • Published:
Knowledge and Information Systems

Abstract

One of the most challenging problems in Gaussian process regression is to cope with large-scale datasets and with online settings in which data instances arrive irregularly and continuously. In this paper, we introduce a novel online Gaussian process model that scales efficiently to large-scale datasets. Our proposed GoGP is constructed from the geometric and optimization views of Gaussian process regression, hence the name geometric-based online GP (GoGP). We develop theory guaranteeing that our algorithm converges at a good rate and always offers a sparse solution, which can approximate the true optimum up to any level of precision specified a priori. Moreover, to further speed up GoGP when it is coupled with a positive semi-definite, shift-invariant kernel such as the well-known Gaussian kernel, and to address the curse of kernelization, wherein the model size grows linearly with the amount of data accumulated over time in the online setting, we propose to approximate the original kernel using the Fourier random feature kernel. The resulting model, GoGP with Fourier random features (GoGP-RF), can be stored directly in a finite-dimensional random feature space, hence avoiding the curse of kernelization and scaling efficiently and effectively to large-scale datasets. We extensively evaluated our proposed methods against state-of-the-art baselines on several large-scale datasets for the online regression task. The experimental results show that our GoGP variants delivered comparable, or slightly better, predictive performance while achieving an order-of-magnitude computational speedup over their rivals in the online setting. More importantly, their convergence behavior, which is rapid and stable while achieving lower errors, is guaranteed by our theoretical analysis.
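To make the random-feature idea above concrete, the following minimal sketch (illustrative only, not the GoGP-RF implementation; dimensions and kernel width are arbitrary) builds the Rahimi–Recht random Fourier feature map for the Gaussian kernel and checks that inner products of the D-dimensional features approximate the kernel, which is what allows the model to be stored as a single D-dimensional vector.

```python
import numpy as np

# Random Fourier features for the Gaussian kernel k(x, y) = exp(-gamma ||x - y||^2):
# z(x) = sqrt(2 / D) * cos(W x + b) with W ~ N(0, 2 * gamma * I) and b ~ Uniform[0, 2 pi],
# so that z(x)^T z(y) approximates k(x, y).
rng = np.random.default_rng(0)
d, D, gamma = 5, 2000, 0.5
W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))
b = rng.uniform(0.0, 2 * np.pi, size=D)

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
print(z(x) @ z(y), np.exp(-gamma * np.sum((x - y) ** 2)))   # close for large D
```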


Notes

  1. We store the model as \({\mathbf {w}}=\sum _{n}\alpha _{n}\Phi \left( \varvec{x}_{n}\right) .\) The model size is \(\left\| \varvec{\alpha }\right\| _{0}\), and a sparse solution is one for which \(\left\| \varvec{\alpha }\right\| _{0}/N\) is small.

  2. https://github.com/SheffieldML/GPy.

  3. https://github.com/qminh93/RVGP.

  4. http://lsokl.stevenhoi.com/.

  5. Our code for four versions of GoGP (i.e., GoGP-K-P, GoGP-RF-P and GoGP-K-C, GoGP-RF-C) can be found at https://github.com/khanhndk/GoGP.


Acknowledgements

This work is partially supported by the Australian Research Council under the ARC DP160109394.


Corresponding author

Correspondence to Trung Le.

Appendix A: Proofs of Lemmas and Theorems

Proposition 1

(Restated) The coefficient vector \(\varvec{d}^{*}=[d_{1}^{*},\ldots ,d_{N}^{*}]\) of the projection in Eq. (7) can be explicitly computed as \(\varvec{d}^{*}=[K+\sigma ^{2}I]^{-1}K_{*}^{\mathsf {T}}\). Furthermore, the posterior predictive mean \(\mu _{*}\) in Eq. (3) can be expressed by a weighted sum of the training labels

$$\begin{aligned} \mu _{*}=\sum _{n=1}^{N}d_{n}^{*}y_{n}, \end{aligned}$$

and the posterior predictive standard deviation in Eq. (4) can be recovered from the Euclidean distance from \(\Phi _{*}\) to \({\mathcal {L}}\left( \Phi _{X}\right) \) as

$$\begin{aligned} \sigma _{*}=\left( {\mathrm{dist}}\left( \Phi _{*},{\mathcal {L}}\left( \Phi _{X}\right) \right) ^{2}-\sigma ^{2}\right) ^{1/2}. \end{aligned}$$

Proof of Proposition 1

We find \(\varvec{d}=\left[ d_{n}\right] _{n=1,\ldots ,N}^{\mathsf {T}}\) by solving the following optimization problem

$$\begin{aligned} \underset{\varvec{d}}{\min }\,h\left( \varvec{d}\right) \triangleq \left\| \Phi _{*}-\sum _{n=1}^{N}d_{n}\Phi _{n}\right\| ^{2}. \end{aligned}$$

We derive as follows

$$\begin{aligned} h\left( \varvec{d}\right)&={\tilde{K}}_{**}-2K_{*}\varvec{d}+\varvec{d}^{\mathsf {T}}\left[ K+\sigma ^{2}I\right] \varvec{d},\\ \nabla _{\varvec{d}}h&=2\left[ K+\sigma ^{2}I\right] \varvec{d}-2K_{*}^{\mathsf {T}}=0,\\ \varvec{d}^{*}&=\left[ K+\sigma ^{2}I\right] ^{-1}K_{*}^{\mathsf {T}}. \end{aligned}$$

It follows that

$$\begin{aligned} {\mathrm{dist}}\left( \Phi _{*},{\mathcal {L}}\left( \Phi _{X}\right) \right) ^{2}&=h\left( \varvec{d}^{*}\right) ={\tilde{K}}_{**}-2K_{*}\varvec{d}^{*}+\left( \varvec{d}^{*}\right) ^{\mathsf {T}}\left[ K+\sigma ^{2}I\right] \varvec{d}^{*},\\&=\sigma ^{2}+K_{**}-K_{*}\left[ K+\sigma ^{2}I\right] ^{-1}K_{*}^{\mathsf {T}}=\sigma _{*}^{2}+\sigma ^{2}. \end{aligned}$$

\(\square \)
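As a sanity check of these identities, the following minimal sketch (ours, not part of the paper's code) uses a linear kernel so that the feature map is explicit, realizes the noisy features by appending one extra noise coordinate per instance (so that \(\Phi _{n}^{\mathsf {T}}\Phi _{m}=K\left( \varvec{x}_{n},\varvec{x}_{m}\right) +\sigma ^{2}\delta _{nm}\) and \(\Phi _{*}^{\mathsf {T}}\Phi _{n}=K\left( \varvec{x}_{*},\varvec{x}_{n}\right) \)), and verifies the closed form of \(\varvec{d}^{*}\), the weighted-sum expression of \(\mu _{*}\), and the distance identity numerically; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma2 = 25, 3, 0.2
sigma = np.sqrt(sigma2)
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
x_star = rng.normal(size=p)

# Linear kernel so that phi(x) = x is explicit.
K = X @ X.T
k_star = X @ x_star
K_ss = x_star @ x_star

# One explicit realisation of the "noisy" feature map: append an extra noise
# coordinate per point, so that Phi_n . Phi_m = K(x_n, x_m) + sigma^2 delta_nm
# and Phi_* . Phi_n = K(x_*, x_n).
Phi_X = np.hstack([X, sigma * np.eye(N), np.zeros((N, 1))])        # rows are Phi_n
Phi_s = np.concatenate([x_star, np.zeros(N), [sigma]])             # Phi_*

# Projection of Phi_* onto the span of {Phi_n}: plain least squares in feature space.
d, *_ = np.linalg.lstsq(Phi_X.T, Phi_s, rcond=None)
dist2 = np.sum((Phi_s - Phi_X.T @ d) ** 2)

d_closed = np.linalg.solve(K + sigma2 * np.eye(N), k_star)          # coefficients of Eq. (7)
mu_star = k_star @ np.linalg.solve(K + sigma2 * np.eye(N), y)       # GP mean, Eq. (3)
var_star = K_ss - k_star @ np.linalg.solve(K + sigma2 * np.eye(N), k_star)  # variance, Eq. (4)

print(np.allclose(d, d_closed))              # d^* = [K + sigma^2 I]^{-1} K_*^T
print(np.allclose(d @ y, mu_star))           # mu_* = sum_n d_n^* y_n
print(np.allclose(dist2 - sigma2, var_star)) # sigma_*^2 = dist^2 - sigma^2
```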

Proposition 2

(Restated) Let us define \(\varvec{\alpha }_{*}=\left[ K+\sigma ^{2}I\right] ^{-1}\varvec{y}\) whose elements are denoted explicitly by \(\varvec{\alpha }_{*}=[\alpha _{1}^{*},\ldots ,\alpha _{N}^{*}]\), and \({\mathbf {w}}_{*}=\sum _{n=1}^{N}\alpha _{n}^{*}\Phi _{n}\). Then, \({\mathbf {w}}_{*}\) is the optimal solution for the following optimization problem:

$$\begin{aligned} \underset{{\mathbf {w}}}{\min }\,{\mathcal {J}}\left( {\mathbf {w}}\right) \triangleq \frac{\lambda }{2}\left\| {\mathbf {w}}\right\| ^{2}+\frac{1}{N}\sum _{n=1}^{N}\left( {\mathbf {w}}^{\mathsf {T}}\Phi _{n}-y_{n}\right) ^{2}, \end{aligned}$$
(11)

where \(\lambda =2\sigma ^{2}N^{-1}\). The posterior predictive mean in Eq. (3) can be computed as

$$\begin{aligned} \mu _{*}={\mathbf {w}}_{*}^{\mathsf {T}}\Phi _{*}=\sum _{n=1}^{N}\alpha _{n}^{*}{\hat{K}}\left( \varvec{x}_{*},\varvec{x}_{n}\right) \end{aligned}$$
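Before turning to the proof, the claimed equivalence can be checked numerically. The sketch below (ours, for illustration only) uses a linear kernel so that the feature map is explicit, solves the regularized least-squares problem (11) in closed form with \(\lambda =2\sigma ^{2}N^{-1}\), and confirms that \({\mathbf {w}}_{*}=\sum _{n}\alpha _{n}^{*}\Phi _{n}\) and that its prediction equals the GP posterior mean; variable names and constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, sigma2 = 40, 3, 0.25
X = rng.normal(size=(N, p))              # rows are Phi_n = phi(x_n) (linear kernel, phi = identity)
y = rng.normal(size=N)
phi_star = rng.normal(size=p)            # Phi_* for a test input

K = X @ X.T                              # kernel matrix K(x_n, x_m)
k_star = X @ phi_star                    # K(x_*, x_n)

# alpha_* = [K + sigma^2 I]^{-1} y and the GP predictive mean of Eq. (3)
alpha_star = np.linalg.solve(K + sigma2 * np.eye(N), y)
mu_gp = k_star @ alpha_star

# Closed-form minimiser of J(w) = (lambda/2)||w||^2 + (1/N) sum_n (w^T Phi_n - y_n)^2
# with lambda = 2 sigma^2 / N: setting the gradient to zero gives
# (X^T X + (lambda N / 2) I) w = X^T y, i.e. ridge regression with ridge sigma^2.
lam = 2 * sigma2 / N
w_star = np.linalg.solve(X.T @ X + (lam * N / 2) * np.eye(p), X.T @ y)

print(np.allclose(w_star, X.T @ alpha_star))     # w_* = sum_n alpha_n^* Phi_n
print(np.allclose(mu_gp, w_star @ phi_star))     # mu_* = w_*^T Phi_*
```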

Proof of Proposition 2

We transform the unconstrained optimization problem in Eq. (8) to its equivalent form as follows

$$\begin{aligned} \underset{{\mathbf {w}},\varvec{\xi }}{\min }&\,\,\frac{1}{2}\left\| {\mathbf {w}}\right\| ^{2}+\frac{C}{N}\sum _{n=1}^{N}\xi _{n}^{2},\\ \text {s.t.}:&\,\,\xi _{n}={\mathbf {w}}^{\mathsf {T}}\Phi _{n}-y_{n},\,\forall n, \end{aligned}$$

where \(C=\lambda ^{-1}=0.5N\sigma ^{-2}\).

The Lagrange function is of the following form

$$\begin{aligned} {\mathcal {L}}\left( {\mathbf {w}},\varvec{\xi },\varvec{\alpha }\right)&=\frac{1}{2}\left\| {\mathbf {w}}\right\| ^{2}+\frac{C}{N}\sum _{n=1}^{N}\xi _{n}^{2}-\sum _{n=1}^{N}\alpha _{n}\left( {\mathbf {w}}^{\mathsf {T}}\Phi _{n}-y_{n}-\xi _{n}\right) . \end{aligned}$$

Setting the derivatives to 0, we gain

$$\begin{aligned} \nabla _{{\mathbf {w}}}{\mathcal {L}}&={\mathbf {w}}-\sum _{n=1}^{N}\alpha _{n}\Phi _{n}=0\rightarrow {\mathbf {w}}=\sum _{n=1}^{N}\alpha _{n}\Phi _{n},\\ \nabla _{\xi _{n}}{\mathcal {L}}&=\frac{2C\xi _{n}}{N}+\alpha _{n}=0\rightarrow \xi _{n}=\frac{-N\alpha _{n}}{2C}=-\sigma ^{2}\alpha _{n}. \end{aligned}$$

Substituting the above into the Lagrange function, we have the following optimization problem

$$\begin{aligned} \underset{\varvec{\alpha }}{\min } \,\mathcal {A}\left( \varvec{\alpha }\right)&\triangleq \frac{1}{2}\varvec{\alpha }^{\mathsf {T}}\left[ K+\sigma ^{2}I\right] \varvec{\alpha }-\varvec{y}^{\mathsf {T}}\varvec{\alpha },\\ \nabla _{\varvec{\alpha }}\mathcal {A}&=\left[ K+\sigma ^{2}I\right] \varvec{\alpha }-\varvec{y}=0,\\ \varvec{\alpha }^{*}&=\left[ K+\sigma ^{2}I\right] ^{-1}\varvec{y}. \end{aligned}$$

Since the strong duality holds, we have \({\mathbf {w}}_{*}=\sum _{n=1}^{N}\alpha _{n}^{*}\Phi \left( x_{n}\right) \) and the predictive mean \(\mu ^{*}=K_{*}\left[ K+\sigma ^{2}I\right] ^{-1}\varvec{y}=K_{*}\varvec{\alpha }^{*}=\sum _{n=1}^{N}\alpha _{n}^{*}K\left( \varvec{x}_{*},\varvec{x}_{n}\right) ={\mathbf {w}}_{*}^{\mathsf {T}}\Phi _{*}\).

\(\square \)

We now consider the upper bound of \(\left\| {\mathbf {w}}_{*}\right\| \). In particular, we have the following lemma.

Lemma 1

(Restated) If we define \({\mathbf {w}}_{*}\) as

$$\begin{aligned} {\mathbf {w}}_{*}=\underset{{\mathbf {w}}}{{\mathrm{argmin}}}\left( \frac{\lambda }{2}\left\| {\mathbf {w}}\right\| ^{2}+\frac{1}{N}\sum _{n=1}^{N}\left( y_{n}-{\mathbf {w}}^{\mathsf {T}}\Phi _{n}\right) ^{2}\right) , \end{aligned}$$

then we have \(\left\| {\mathbf {w}}_{*}\right\| \le y_{\max }\lambda ^{-1/2}\).

Proof of Lemma 1

Let us consider the equivalent constrained optimization problem

$$\begin{aligned}&\underset{{\mathbf {w}},\varvec{\xi }}{\min }\left( \frac{\lambda }{2}\left\| {\mathbf {w}}\right\| ^{2}+\frac{1}{N}\sum _{n=1}^{N}\xi _{n}^{2}\right) ,\\&\quad \text {s.t.:}\,\,\xi _{n}=y_{n}-{\mathbf {w}}^{\mathsf {T}}\Phi _{n},\quad \forall n. \end{aligned}$$

The Lagrange function is of the following form

$$\begin{aligned} {\mathcal {L}}\left( {\mathbf {w}},\varvec{\xi },\varvec{\alpha }\right) =\frac{\lambda }{2}\left\| {\mathbf {w}}\right\| ^{2}+\frac{1}{N}\sum _{n=1}^{N}\xi _{n}^{2}+\sum _{n=1}^{N}\alpha _{n}\left( y_{n}-{\mathbf {w}}^{\mathsf {T}}\Phi _{n}-\xi _{n}\right) . \end{aligned}$$

Setting the derivatives to 0, we gain

$$\begin{aligned} \nabla _{{\mathbf {w}}}{\mathcal {L}}= & {} \lambda {\mathbf {w}}-\sum _{n=1}^{N}\alpha _{n}\Phi _{n}=0\rightarrow {\mathbf {w}}=\lambda ^{-1}\sum _{n=1}^{N}\alpha _{n}\Phi _{n},\\ \nabla _{\xi _{n}}{\mathcal {L}}= & {} \frac{2}{N}\xi _{n}-\alpha _{n}=0\rightarrow \xi _{n}=\frac{N\alpha _{n}}{2}. \end{aligned}$$

Substituting the above into the Lagrange function, we gain the dual form

$$\begin{aligned} {\mathcal {W}}\left( \varvec{\alpha }\right)&= -\frac{\lambda }{2}\left\| {\mathbf {w}}\right\| ^{2}+\sum _{n=1}^{N}y_{n}\alpha _{n}-\frac{N}{4}\sum _{n=1}^{N}\alpha _{n}^{2},\\&=-\frac{1}{2\lambda }\left\| \sum _{n=1}^{N}\alpha _{n}\Phi _{n}\right\| ^{2}+\sum _{n=1}^{N}y_{n}\alpha _{n}-\frac{N}{4}\sum _{n=1}^{N}\alpha _{n}^{2}. \end{aligned}$$

Let \(\left( {\mathbf {w}}^{*},\varvec{\xi }^{*}\right) \) and \(\varvec{\alpha }^{*}\) denote the primal and dual solutions, respectively. Since strong duality holds, we have

$$\begin{aligned} \frac{\lambda }{2}\left\| {\mathbf {w}}^{*}\right\| ^{2}+\frac{1}{N}\sum _{n=1}^{N}\xi _{n}^{*2}= & {} -\frac{\lambda }{2}\left\| {\mathbf {w}}^{*}\right\| ^{2}+\sum _{n=1}^{N}y_{n}\alpha _{n}^{*}-\frac{N}{4}\sum _{n=1}^{N}\alpha _{n}^{*2},\\ \lambda \left\| {\mathbf {w}}^{*}\right\| ^{2}= & {} \sum _{n=1}^{N}y_{n}\alpha _{n}^{*}-\frac{N}{4}\sum _{n=1}^{N}\alpha _{n}^{*2}-\frac{1}{N}\sum _{n=1}^{N}\xi _{n}^{*2},\\\le & {} \sum _{n=1}^{N}\left( y_{n}\alpha _{n}^{*}-\frac{N}{4}\alpha _{n}^{*2}\right) \le \sum _{n=1}^{N}\frac{y_{n}^{2}}{N}\le y_{\text {max}}^{2}. \end{aligned}$$

Note that we have used \(g\left( \alpha _{n}^{*}\right) =y_{n}\alpha _{n}^{*}-\frac{N}{4}\alpha _{n}^{*2}\le g\left( \frac{2y_{n}}{N}\right) =\frac{y_{n}^{2}}{N}\). Hence, we gain the conclusion. \(\square \)

We now present the theoretical results regarding convergence analysis. The update rule is as follows

$$\begin{aligned} {\mathbf {w}}_{t+1}={\left\{ \begin{array}{ll} \Pi _{S}\left( \frac{t-1}{t}{\mathbf {w}}_{t}-\frac{2\alpha _{t}}{\lambda t}\Phi _{t}\right) &{} \text {if}\,\delta _{t}>\theta ,\\ \Pi _{S}\left( \frac{t-1}{t}{\mathbf {w}}_{t}-\frac{2\alpha _{t}}{\lambda t}{\mathbb {P}}_{{\mathcal {L}}\left( \Phi _{U}\right) }\left( \Phi _{t}\right) \right) &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$

where \(\alpha _{t}={\mathbf {w}}_{t}^{\mathsf {T}}\Phi _{t}-y_{t}\), \(\delta _{t}={\mathrm{dist}}\left( \Phi _{t},{\mathcal {L}}\left( \Phi _{U}\right) \right) \), and \(S={\left\{ \begin{array}{ll} {\mathbb {R}}^{d} &{} \text {if}\,\,\lambda >2,\\ {\mathcal {B}}(0,y_{\text {max}}\lambda ^{-1/2}) &{} \text {otherwise}, \end{array}\right. }\) with \({\mathcal {B}}(0,y_{\text {max}}\lambda ^{-1/2})\) denoting \(\{ x\in {\mathbb {R}}^{d}:\left\| x\right\| \le y_{\text {max}}\lambda ^{-1/2}\} \).

We can rewrite the update rule as follows

$$\begin{aligned} {\mathbf {w}}_{t+1}=\Pi _{S}\left( {\mathbf {w}}_{t}-\eta _{t}\left( g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right) \right) , \end{aligned}$$

where \(\eta _{t}=\frac{1}{\lambda t}\), \(g_{t}=\lambda {\mathbf {w}}_{t}+2\alpha _{t}\Phi _{t}\), \(Z_{t}\) is a binary random variable with \(\Pr \left( Z_{t}=1\right) =\Pr \left( \delta _{t}\le \theta \right) \) (i.e., the probability that the approximation is performed), and \({\mathbb {H}}\left( U,\varvec{x}_{t}\right) =\Phi \left( \varvec{x}_{t}\right) -{\mathbb {P}}_{{\mathcal {L}}\left( \Phi _{U}\right) }\left( \Phi \left( \varvec{x}_{t}\right) \right) \) is the rejection vector of \(\Phi _{t}\) from \({\mathcal {L}}\left( \Phi _{U}\right) \).
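To make the update concrete, here is a minimal sketch (ours, not the authors' implementation) of a single step in a finite-dimensional feature space, e.g., after a random-feature approximation, where \(\Phi _{t}\) and the inducing features are explicit vectors. Following the proof of Theorem 4 below, the sketch also appends the arriving point to the inducing set whenever \(\delta _{t}>\theta \); the radius argument encodes the set S above, with an infinite radius corresponding to \(\lambda >2\).

```python
import numpy as np

def project_onto_ball(w, radius):
    """Pi_S: project w onto B(0, radius); radius = np.inf leaves w unchanged."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def project_onto_span(phi, Phi_U):
    """P_{L(Phi_U)}: orthogonal projection of phi onto the span of the columns of Phi_U."""
    if Phi_U.shape[1] == 0:
        return np.zeros_like(phi)
    d, *_ = np.linalg.lstsq(Phi_U, phi, rcond=None)
    return Phi_U @ d

def gogp_step(w, t, phi_t, y_t, Phi_U, lam, theta, radius):
    """One illustrative update of w_t (t >= 1) given the arriving feature phi_t and label y_t."""
    alpha = w @ phi_t - y_t                      # alpha_t = w_t^T Phi_t - y_t
    proj = project_onto_span(phi_t, Phi_U)       # P_{L(Phi_U)}(Phi_t)
    delta = np.linalg.norm(phi_t - proj)         # delta_t = dist(Phi_t, L(Phi_U))
    if delta > theta:                            # far from the span: exact update,
        w_new = (t - 1) / t * w - 2 * alpha / (lam * t) * phi_t
        Phi_U = np.column_stack([Phi_U, phi_t])  # and the point joins the inducing set
    else:                                        # close to the span: use the projection
        w_new = (t - 1) / t * w - 2 * alpha / (lam * t) * proj
    return project_onto_ball(w_new, radius), Phi_U
```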

Lemma 2

(Restated) The following statement holds

$$\begin{aligned} \left\| {\mathbf {w}}_{T+1}\right\| \le 2\lambda ^{-1}\left( 1+\sigma ^{2}\right) ^{1/2}\left( y_{\max }+\frac{\left( 1+\sigma ^{2}\right) ^{1/2}}{T}\sum _{t=1}^{T}\left\| {\mathbf {w}}_{t}\right\| \right) , \end{aligned}$$

where \(y_{\max }={\max }_{y\in {\mathcal {Y}}} |y|\).

Proof of Lemma 2

We have the following

$$\begin{aligned} {\mathbf {w}}_{t+1}={\left\{ \begin{array}{ll} \Pi _{S}\left( \frac{t-1}{t}{\mathbf {w}}_{t}-\frac{2\alpha _{t}}{\lambda t}\Phi _{t}\right) &{} \text {if}\,Z_{t}=0,\\ \Pi _{S}\left( \frac{t-1}{t}{\mathbf {w}}_{t}-\frac{2\alpha _{t}}{\lambda t}{\mathbb {P}}_{{\mathcal {L}}\left( \Phi _{U}\right) }\left( \Phi _{t}\right) \right) &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

It follows that

$$\begin{aligned} \left\| {\mathbf {w}}_{t+1}\right\| \le \frac{t-1}{t}\left\| {\mathbf {w}}_{t}\right\| +\frac{2}{\lambda t}\left| \alpha _{t}\right| \left( 1+\sigma ^{2}\right) ^{1/2}\,\,\,\,\,\,\,\,\text {since}\,\,\left\| {\mathbb {P}}{}_{{\mathcal {L}}\left( \Phi _{U}\right) }\left( \Phi _{t}\right) \right\| \le \left\| \Phi _{t}\right\| =\left( 1+\sigma ^{2}\right) ^{1/2}. \end{aligned}$$

Moreover, we have

$$\begin{aligned} \left| \alpha _{t}\right| =\left| y_{t}-{\mathbf {w}}_{t}^{\mathsf {T}}\Phi _{t}\right| \le y_{\text {max}}+\left\| {\mathbf {w}}_{t}\right\| \left\| \Phi _{t}\right\| \le y_{\text {max}}+\left\| {\mathbf {w}}_{t}\right\| \left( 1+\sigma ^{2}\right) ^{1/2}. \end{aligned}$$

Thus, we achieve

$$\begin{aligned} t\left\| {\mathbf {w}}_{t+1}\right\| \le \left( t-1\right) \left\| {\mathbf {w}}_{t}\right\| +2\lambda ^{-1}\left( 1+\sigma ^{2}\right) ^{1/2}\left( y_{\text {max}}+\left\| {\mathbf {w}}_{t}\right\| \left( 1+\sigma ^{2}\right) ^{1/2}\right) . \end{aligned}$$

Summing over \(t=1,2,\ldots ,T\) and telescoping, we achieve

$$\begin{aligned} T\left\| {\mathbf {w}}_{T+1}\right\|\le & {} 2\lambda ^{-1}\left( 1+\sigma ^{2}\right) ^{1/2}\left( Ty_{\text {max}}+\left( 1+\sigma ^{2}\right) ^{1/2}\sum _{t=1}^{T}\left\| {\mathbf {w}}_{t}\right\| \right) ,\nonumber \\ \left\| {\mathbf {w}}_{T+1}\right\|\le & {} 2\lambda ^{-1}\left( 1+\sigma ^{2}\right) ^{1/2}\left( y_{\text {max}}+\frac{\left( 1+\sigma ^{2}\right) ^{1/2}}{T}\sum _{t=1}^{T}\left\| {\mathbf {w}}_{t}\right\| \right) . \end{aligned}$$
(12)

\(\square \)

Lemma 3

(Restated) If \(\lambda >2\left( 1+\sigma ^{2}\right) \), then \(\left\| {\mathbf {w}}_{T+1}\right\| \le \frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\left( 1-\frac{1}{\left( \frac{\lambda }{2\left( 1+\sigma ^{2}\right) }\right) ^{T}}\right) <\frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\) for all T.

Proof of Lemma 3

First, we consider the sequence \(\left\{ s_{T}\right\} _{T}\) defined by \(s_{T+1}=2\lambda ^{-1}(1+\sigma ^{2})^{1/2}(y_{\text {max}}+(1+\sigma ^{2})^{1/2}s_{T})\) and \(s_{1}=0\). The closed form of this sequence can be derived as

$$\begin{aligned} s_{T+1}-\frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}&=\left( \frac{\lambda }{2\left( 1+\sigma ^{2}\right) }\right) ^{-1}\left( s_{T}-\frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\right) =\ldots \\&=\left( \frac{\lambda }{2\left( 1+\sigma ^{2}\right) }\right) ^{-T}\left( s_{1}-\frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\right) =\frac{-\left( \frac{\lambda }{2\left( 1+\sigma ^{2}\right) }\right) ^{-T}y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\\ s_{T+1}&=\frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}\left( 1-\frac{1}{\left( \frac{\lambda }{2\left( 1+\sigma ^{2}\right) }\right) ^{T}}\right) . \end{aligned}$$

We prove by induction on T that \(\left\| {\mathbf {w}}_{T}\right\| \le s_{T}\) for all T. It is obvious that \(\left\| {\mathbf {w}}_{1}\right\| =s_{1}=0\). Assuming that \(\left\| {\mathbf {w}}_{t}\right\| \le s_{t}\) for all \(t\le T\), we verify it for \(T+1\). Indeed, since \(\left\{ s_{t}\right\} _{t}\) is nondecreasing, we have

$$\begin{aligned} \left\| {\mathbf {w}}_{T+1}\right\|&\le 2\lambda ^{-1}\left( 1+\sigma ^{2}\right) ^{1/2}\left( y_{\text {max}}+\frac{\left( 1+\sigma ^{2}\right) ^{1/2}}{T}\sum _{t=1}^{T}\left\| {\mathbf {w}}_{t}\right\| \right) ,\\&\le 2\lambda ^{-1}\left( 1+\sigma ^{2}\right) ^{1/2}\left( y_{\text {max}}+\frac{\left( 1+\sigma ^{2}\right) ^{1/2}}{T}\sum _{t=1}^{T}s_{t}\right) ,\\&\le 2\lambda ^{-1}\left( 1+\sigma ^{2}\right) ^{1/2}\left( y_{\text {max}}+\left( 1+\sigma ^{2}\right) ^{1/2}s_{T}\right) =s_{T+1}. \end{aligned}$$

Lemma 4

(Restated) The following statement holds \(\left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\le W,\,\forall t\) where we have defined

$$\begin{aligned} W={\left\{ \begin{array}{ll} \left( \frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}+y_{\max }\lambda ^{-1/2}\right) ^{2} &{} {\mathrm{if}}\quad \lambda >2(1+\sigma ^{2}),\\ 4y_{\max }^{2}\lambda ^{-1} &{} {\mathrm{otherwise}}. \end{array}\right. } \end{aligned}$$

Proof of Lemma 4

We consider two cases. Case 1: \(\lambda >2\left( 1+\sigma ^{2}\right) \). By Lemmas 1 and 3, we have

$$\begin{aligned} \left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\le \left( \left\| {\mathbf {w}}_{t}\right\| +\left\| {\mathbf {w}}_{*}\right\| \right) ^{2}\le \left( \frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1}+y_{\text {max}}\lambda ^{-1/2}\right) ^{2}. \end{aligned}$$

Case 2: \(0<\lambda \le 2(1+\sigma ^{2})\).

Both \({\mathbf {w}}_{t}\) and \({\mathbf {w}}^{*}\) are in \({\mathcal {B}}\left( 0,y_{\text {max}}\lambda ^{-1/2}\right) \). Hence, we have

$$\begin{aligned} \left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\le \left( 2y_{\text {max}}\lambda ^{-1/2}\right) ^{2}=4y_{\text {max}}^{2}\lambda ^{-1}. \end{aligned}$$

\(\square \)

Lemma 5

(Restated) The following statement holds

$$\begin{aligned} \left| \alpha _{t}\right| \le M=\left( 1+\sigma ^{2}\right) ^{1/2}\max \left( \frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\max }\lambda ^{-1/2}\right) +y_{\max },\quad \forall t. \end{aligned}$$

Proof of Lemma 5

We derive as follows

$$\begin{aligned} \left\| {\mathbf {w}}_{t}\right\|&\le \max \left( \frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\text {max}}\lambda ^{-1/2}\right) ,\\ \left| \alpha _{t}\right|&=\left| {\mathbf {w}}_{t}^{\mathsf {T}}\Phi \left( \varvec{x}_{t}\right) -y_{t}\right| \le \left\| {\mathbf {w}}_{t}\right\| \left\| \Phi _{t}\right\| +y_{\max },\\&\le \left( 1+\sigma ^{2}\right) ^{1/2}\max \left( \frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\max }\lambda ^{-1/2}\right) +y_{\max }=M. \end{aligned}$$

Lemma 6

(Restated) The following statement holds

$$\begin{aligned} \left\| g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| ^{2}\le G,\,\forall t, \end{aligned}$$

where we have defined

$$\begin{aligned} G=\left( \left( \lambda +2\left( 1+\sigma ^{2}\right) ^{1/2}\right) \max \left( \frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\max }\lambda ^{-1/2}\right) +2\left( 1+\sigma ^{2}\right) ^{1/2}y_{\max }\right) ^{2}. \end{aligned}$$

Proof of Lemma 6

We derive as follows

$$\begin{aligned} \left\| g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| ^{2}&\le \left( \lambda \left\| {\mathbf {w}}_{t}\right\| +2\left| \alpha _{t}\right| \left\| \Phi _{t}\right\| \right) ^{2}=\left( \lambda \left\| {\mathbf {w}}_{t}\right\| +2\left| \alpha _{t}\right| \left( 1+\sigma ^{2}\right) ^{1/2}\right) ^{2}. \end{aligned}$$

Here, we note that to gain the above inequality, we consider two cases \(Z_{t}=1\) and \(Z_{t}=0\) and use \(\left\| {\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| \le \left\| \Phi _{t}\right\| =\left( 1+\sigma ^{2}\right) ^{1/2}\).

We have

$$\begin{aligned} \left\| {\mathbf {w}}_{t}\right\|&\le \max \left( \frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\text {max}}\lambda ^{-1/2}\right) ,\\ \left| \alpha _{t}\right|&\le M=\left( 1+\sigma ^{2}\right) ^{1/2}\max \left( \frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\max }\lambda ^{-1/2}\right) +y_{\max }. \end{aligned}$$

Hence, we gain

$$\begin{aligned} \lambda \left\| {\mathbf {w}}_{t}\right\| +2\left| \alpha _{t}\right| \left( 1+\sigma ^{2}\right) ^{1/2}&\le \lambda \max \left( \frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\text {max}}\lambda ^{-1/2}\right) \\&\quad +2\left( 1+\sigma ^{2}\right) ^{1/2}\max \left( \frac{y_{\text {max}}}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\max }\lambda ^{-1/2}\right) \\&\quad +2\left( 1+\sigma ^{2}\right) ^{1/2}y_{\max },\\&= \left( \lambda +2\left( 1+\sigma ^{2}\right) ^{1/2}\right) \max \left( \frac{y_{\max }}{\frac{\lambda }{2\left( 1+\sigma ^{2}\right) }-1},y_{\max }\lambda ^{-1/2}\right) \\&\quad +2\left( 1+\sigma ^{2}\right) ^{1/2}y_{\max }. \end{aligned}$$

\(\square \)

Theorem 1

(Restated) Consider Algorithm 4.1, where \(\left( \varvec{x}_{t},y_{t}\right) \sim P_{N}\), or \(P_{{\mathcal {X}}\times {\mathcal {Y}}}\), arrives on the fly. Then the following bound holds

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}\left( {\overline{{\mathbf {w}}}}_{T}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \right]&\le \frac{G\left( \log \,T+1\right) }{2\lambda T}+\frac{2WM\theta }{T}\sum _{t=1}^{T}p_{t}\le \frac{G\left( \log \,T+1\right) }{2\lambda T}+2WM\theta , \end{aligned}$$

where \(G,\,M,\,W\) are positive constants and \(p_{t}=\Pr \left( Z_{t}=1\right) \) as defined before.

Proof of Theorem 1

We have

$$\begin{aligned} \left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}&=\left\| \prod _{S}\left( {\mathbf {w}}_{t}-\eta _{t}\left( g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right) \right) -{\mathbf {w}}^{*}\right\| ^{2}\\&\le \left\| {\mathbf {w}}_{t}-\eta _{t}\left( g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right) -{\mathbf {w}}_{*}\right\| ^{2}\\&=\left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}+\eta _{t}^{2}\left\| g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| ^{2}\\&\quad -2\eta _{t}\left\langle {\mathbf {w}}_{t}-w^{*},g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\rangle ,\\ \left\langle {\mathbf {w}}_{t}-{\mathbf {w}}_{*},g_{t}\right\rangle&\le \frac{\left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}-\left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}}{2\eta _{t}}\\&\quad +\frac{\eta _{t}}{2}\left\| g_{t} -2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| ^{2}+\left\langle {\mathbf {w}}_{t}-{\mathbf {w}}_{*},2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\rangle . \end{aligned}$$

We recall that \(g_{t}=\lambda {\mathbf {w}}_{t}+2\left( {\mathbf {w}}_{t}^{\mathsf {T}}\Phi _{t}-y_{t}\right) \Phi \left( \varvec{x}_{t}\right) \) and \(\left( \varvec{x}_{t},y_{t}\right) \sim P_{N}\) or \(P_{{\mathcal {X}}\times {\mathcal {Y}}}\). Hence, we gain \({\mathbb {E}}\left[ g_{t}|{\mathbf {w}}_{t}\right] ={\mathcal {J}}^{'}\left( {\mathbf {w}}_{t}\right) \).

Taking the conditional expectation given \({\mathbf {w}}_{t}\), we achieve

$$\begin{aligned} \left\langle {\mathbf {w}}_{t}-{\mathbf {w}}_{*},{\mathcal {J}}^{'}\left( {\mathbf {w}}_{t}\right) \right\rangle&\le {\mathbb {E}}\left[ \frac{\left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}-\left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}}{2\eta _{t}}\right] \\&\quad +\frac{\eta _{t}}{2}{\mathbb {E}}\left[ \left\| g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| ^{2}\right] \\&\quad +{\mathbb {E}}\left[ \left\langle {\mathbf {w}}_{t}-{\mathbf {w}}_{*},2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\rangle \right] ,\\ {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) +\frac{\lambda }{2}\left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}&\le {\mathbb {E}}\left[ \frac{\left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}-\left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}}{2\eta _{t}}\right] \\&\quad +\frac{\eta _{t}}{2}{\mathbb {E}}\left[ \left\| g_{t}-2Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| ^{2}\right] \\&\quad +2{\mathbb {E}}\left[ \left\langle {\mathbf {w}}_{t}-{\mathbf {w}}_{*},Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\rangle \right] . \end{aligned}$$

Taking expectation again, we obtain

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \right]&\le \frac{\lambda }{2}\left( t-1\right) {\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\right] -\frac{\lambda }{2}t{\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}\right] \nonumber \\&\quad +\frac{\eta _{t}}{2}{\mathbb {E}}\left[ \left\| g_{t}-Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| ^{2}\right] \nonumber \\&\quad +2{\mathbb {E}}\left[ \left\langle {\mathbf {w}}_{t}-{\mathbf {w}}_{*},Z_{t}\alpha _{t}{\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\rangle \right] ,\nonumber \\&\le \frac{\lambda }{2}\left( t-1\right) {\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\right] -\frac{\lambda }{2}t{\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}\right] \nonumber \\&\quad +\frac{G\eta _{t}}{2}+2{\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| \left\| {\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| \left| \alpha _{t}\right| Z_{t}\right] ,\nonumber \\&\le \frac{\lambda }{2}\left( t-1\right) {\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\right] -\frac{\lambda }{2}t{\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}\right] \nonumber \\&\quad +\frac{G\eta _{t}}{2}+2WM{\mathbb {E}}\left[ \left\| {\mathbb {H}}\left( U,\varvec{x}_{t}\right) \right\| Z_{t}\right] ,\nonumber \\&\le \frac{\lambda }{2}\left( t-1\right) {\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t}-{\mathbf {w}}_{*}\right\| ^{2}\right] -\frac{\lambda }{2}t{\mathbb {E}}\left[ \left\| {\mathbf {w}}_{t+1}-{\mathbf {w}}_{*}\right\| ^{2}\right] \nonumber \\&\quad +\frac{G\eta _{t}}{2}+2WM\theta \Pr \left( Z_{t}=1\right) . \end{aligned}$$
(13)

Summing the above over \(t=1,\ldots ,T\) and using \(\eta _{t}=\frac{1}{\lambda t}\) together with \(\sum _{t=1}^{T}\frac{1}{t}\le \log T+1\), we achieve

$$\begin{aligned} \sum _{t=1}^{T}{\mathbb {E}}\left[ {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) \right] -T{\mathcal {J}}\left( {\mathbf {w}}_{*}\right)&\le \frac{G}{2}\sum _{t=1}^{T}\frac{1}{\lambda t}+2WM\theta \sum _{t=1}^{T}p_{t},\\ \frac{1}{T}\sum _{t=1}^{T}{\mathbb {E}}\left[ {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) \right] -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right)&\le \frac{G\left( \log T+1\right) }{2\lambda T}+\frac{2WM\theta }{T}\sum _{t=1}^{T}p_{t}\le \frac{G\left( \log T+1\right) }{2\lambda T}+2WM\theta ,\\ {\mathbb {E}}\left[ {\mathcal {J}}\left( {\overline{{\mathbf {w}}}}_{T}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \right]&\le \frac{G\left( \log T+1\right) }{2\lambda T}+\frac{2WM\theta }{T}\sum _{t=1}^{T}p_{t}\le \frac{G\left( \log T+1\right) }{2\lambda T}+2WM\theta . \end{aligned}$$

where the last line follows from the convexity of \({\mathcal {J}}\left( .\right) \) applied to \({\overline{{\mathbf {w}}}}_{T}\). \(\square \)

Theorem 2

(Restated) Consider the output of Algorithm 4.1 and further let \({\overline{{\mathbf {w}}}}_{T}^{\gamma }=\frac{1}{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}{\mathbf {w}}_{t}\) and \(W_{T}^{\gamma }={\mathbb {E}}[\left\| {\mathbf {w}}_{\left( 1-\gamma \right) T+1}-{\mathbf {w}}_{*}\right\| ^{2}]\), where \(0<\gamma <1\). Then the following inequality holds

$$\begin{aligned} {\mathcal {J}}\left( {\overline{{\mathbf {w}}}}_{T}^{\gamma }\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right)&\le \frac{1}{\gamma T}\sum _{t=\gamma 'T+1}^{T}{\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) ,\\&\le \frac{\lambda \gamma '}{2\gamma }W_{T}^{\gamma }+\frac{G\log \left( 1/\gamma '\right) }{2\lambda \gamma T}+\frac{2WM\theta }{\gamma T}\sum _{t=\gamma 'T+1}^{T}p_{t},\\&\le \frac{\lambda \gamma '}{2\gamma }W_{T}^{\gamma }+\frac{G\log \left( 1/\gamma '\right) }{2\lambda \gamma T}+2WM\theta , \end{aligned}$$

where \(\gamma '=1-\gamma \).

Proof of Theorem 2

Summing Eq. (13) over \(t=\left( 1-\gamma \right) T+1,\ldots ,T\) and dividing by \(\gamma T\), we gain

$$\begin{aligned} \frac{1}{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}{\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right)&\le \frac{\lambda \left( 1-\gamma \right) }{2\gamma }W_{T}^{\gamma }+\frac{G\log \left( 1/\left( 1-\gamma \right) \right) }{2\lambda \gamma T}+\frac{2WM\theta }{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}p_{t},\\&\le \frac{\lambda \left( 1-\gamma \right) }{2\gamma }W_{T}^{\gamma }+\frac{G\log \left( 1/\left( 1-\gamma \right) \right) }{2\lambda \gamma T}+2WM\theta . \end{aligned}$$

We note that we have used the inequality \(\sum _{t=\left( 1-\gamma \right) T+1}^{T}\frac{1}{t}\le \log \left( 1/\left( 1-\gamma \right) \right) \).

To achieve the conclusion, we use the convexity of the function \({\mathcal {J}}\left( .\right) \) which implies \({\mathcal {J}}\left( {\overline{{\mathbf {w}}}}_{T}^{\gamma }\right) \le \frac{1}{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}{\mathcal {J}}\left( {\mathbf {w}}_{t}\right) \). \(\square \)

Theorem 5

(Hoeffding inequality) Let \(X_{1},\ldots ,X_{n}\) be independent random variables with \(a_{i}\le X_{i}\le b_{i}\) for each \(i\in \left[ n\right] \). Let \(S=\sum _{i=1}^{n}X_{i}\) and \(\Delta _{i}=b_{i}-a_{i}\). The following statements hold

  1. (i)

    \({\mathbb {P}}\left( S-{\mathbb {E}}\left[ S\right] >\varepsilon \right) \le \exp \left( -\frac{2\varepsilon ^{2}}{\sum _{i=1}^{n}\Delta _{i}^{2}}\right) \).

  2. (ii)

    \({\mathbb {P}}\left( \left| S-{\mathbb {E}}\left[ S\right] \right| >\varepsilon \right) \le 2\exp \left( -\frac{2\varepsilon ^{2}}{\sum _{i=1}^{n}\Delta _{i}^{2}}\right) .\)
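As a quick numerical sanity check of statement (i), the following sketch (illustrative only) estimates the tail probability by Monte Carlo for a sum of independent Uniform[0, 1] variables, for which \(\Delta _{i}=1\), and compares it against the Hoeffding bound.

```python
import numpy as np

# Monte Carlo check of Hoeffding's bound for S = X_1 + ... + X_n with X_i ~ Uniform[0, 1],
# so that E[S] = n / 2 and Delta_i = 1 for every i.
rng = np.random.default_rng(0)
n, eps, trials = 50, 5.0, 100_000
S = rng.uniform(0.0, 1.0, size=(trials, n)).sum(axis=1)
empirical = np.mean(S - n / 2 > eps)        # estimate of P(S - E[S] > eps)
bound = np.exp(-2 * eps ** 2 / n)           # exp(-2 eps^2 / sum_i Delta_i^2)
print(empirical, "<=", bound)               # the empirical tail sits well below the bound
```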

Theorem 3

(Restated) Define the gap \(m_{T}=\frac{\lambda \gamma '}{2\gamma }W_{T}^{\gamma }+\frac{2WM\theta }{\gamma T}\sum _{t=\gamma 'T+1}^{T}p_{t}\). Let r be an index picked uniformly at random from \(\{ \gamma 'T+1,\gamma 'T+2,\ldots ,T\} \). With probability at least \(1-\delta \), the following inequality holds

$$\begin{aligned} {\mathcal {J}}\left( {\mathbf {w}}_{r}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \le \frac{G\log \left( 1/\gamma '\right) }{2\lambda \gamma T}+m_{T}+\Delta _{T}^{\gamma }\sqrt{\frac{1}{2}\log \frac{1}{\delta }}, \end{aligned}$$

where \(\Delta _{T}^{\gamma }={\max }_{\gamma 'T+1\le t\le T}\left( {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \right) \).

Proof of Theorem 3

From the above theorem, we achieve

$$\begin{aligned} \frac{1}{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}{\mathbb {E}}\left[ {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) \right] -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \le \frac{G\log \left( 1/\left( 1-\gamma \right) \right) }{2\lambda \gamma T}+m_{T}, \end{aligned}$$

where \(m_{T}=\frac{\lambda \left( 1-\gamma \right) }{2\gamma }W_{T}^{\gamma }+\frac{2WM\theta }{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}p_{t}\).

Let us denote \(X={\mathcal {J}}\left( {\mathbf {w}}_{r}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \), where r is uniformly sampled from \(\left\{ \left( 1-\gamma \right) T+1,\left( 1-\gamma \right) T+2,\ldots ,T\right\} \). We have

$$\begin{aligned} {\mathbb {E}}_{r}\left[ X\right] =\frac{1}{\gamma T}\sum _{t=\left( 1-\gamma \right) T+1}^{T}{\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) . \end{aligned}$$

Taking the expectation over the data sequence and using the bound above, it follows that

$$\begin{aligned} {\mathbb {E}}\left[ X\right] ={\mathbb {E}}_{\left( x_{t},y_{t}\right) _{t=1}^{T}}\left[ {\mathbb {E}}_{r}\left[ X\right] \right] \le \frac{G\log \left( 1/\left( 1-\gamma \right) \right) }{2\lambda \gamma T}+m_{T}. \end{aligned}$$

Let us denote \(\Delta _{T}^{\gamma }=\underset{\left( 1-\gamma \right) T+1\le t\le T}{\max }\left( {\mathcal {J}}\left( {\mathbf {w}}_{t}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \right) ,\) which implies that \(0\le {\mathcal {J}}\left( {\mathbf {w}}_{r}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \le \Delta _{T}^{\gamma }\). Applying the Hoeffding inequality (Theorem 5) to the random variable X, we gain

$$\begin{aligned} P\left( X-{\mathbb {E}}\left[ X\right]>\varepsilon \right) \le&\exp \left( -\frac{2\varepsilon ^{2}}{\left( \Delta _{T}^{\gamma }\right) ^{2}}\right) , \\ P\left( X-\frac{G\log \left( 1/\left( 1-\gamma \right) \right) }{2\lambda \gamma T}-m_{T}>\varepsilon \right) \le&\exp \left( -\frac{2\varepsilon ^{2}}{\left( \Delta _{T}^{\gamma }\right) ^{2}}\right) , \\ P\left( X\le \frac{G\log \left( 1/\left( 1-\gamma \right) \right) }{2\lambda \gamma T}+m_{T}+\varepsilon \right) \ge&1-\exp \left( -\frac{2\varepsilon ^{2}}{\left( \Delta _{T}^{\gamma }\right) ^{2}}\right) . \end{aligned}$$

Choosing \(\delta =\exp \left( -\frac{2\varepsilon ^{2}}{\left( \Delta _{T}^{\gamma }\right) ^{2}}\right) \), or equivalently \(\varepsilon =\Delta _{T}^{\gamma }\sqrt{\frac{1}{2}\log \frac{1}{\delta }}\), with probability at least \(1-\delta \) we have

$$\begin{aligned} {\mathcal {J}}\left( {\mathbf {w}}_{r}\right) -{\mathcal {J}}\left( {\mathbf {w}}_{*}\right) \le \frac{G\log \left( 1/\left( 1-\gamma \right) \right) }{2\lambda \gamma T}+m_{T}+\Delta _{T}^{\gamma }\sqrt{\frac{1}{2}\log \frac{1}{\delta }}. \end{aligned}$$

\(\square \)

Theorem 4

(Restated) Consider Algorithm 4.1, where \(\left( \varvec{x}_{t},y_{t}\right) \sim P_{N}\) or \(P_{{\mathcal {X}}\times {\mathcal {Y}}}\). After at most \(T_{\theta }\) iterations, the algorithm reaches a \(\theta \)-stable state in which the size of the inducing set is bounded by \(T_{\theta }\), i.e., \(\left| U\right| \le T_{\theta }\), and for any \(\varvec{x}_{*}\notin U\), the standard deviation \(\sigma _{U}\left( \varvec{x}_{*}\right) \) of the distribution \(p\left( f_{*}\mid \varvec{x}_{*},U\right) \) is less than \((\theta ^{2}-\sigma ^{2})^{1/2}\). More importantly, the constant \(T_{\theta }\) is independent of the data distribution and the arrival order.

Proof of Theorem 4

We assume that \({\mathcal {X}}\subset {\mathbb {R}}^{d}\) is a compact set (i.e., closed and bounded) and \(\Phi \left( .\right) \) is a continuous map. Since \({\mathcal {X}}\) is a compact set and \(\Phi \left( .\right) \) is a continuous map, \(\Phi \left( {\mathcal {X}}\right) \) is also a compact set. Let \(\left\{ {\mathcal {C}}\left( \Phi \left( \varvec{s}\right) ,\theta /2\right) \right\} _{\varvec{s}\in {\mathcal {X}}}\) be an open cover of \(\Phi \left( {\mathcal {X}}\right) \) by balls of radius \(\theta /2\). We note that the open ball \({\mathcal {C}}\left( \Phi \left( \varvec{s}\right) ,\theta /2\right) \) is defined as

$$\begin{aligned} {\mathcal {C}}\left( \Phi \left( \varvec{s}\right) ,\theta /2\right) =\left\{ \phi \in \Phi \left( {\mathcal {X}}\right) \mid \left\| \phi -\Phi \left( \varvec{s}\right) \right\| <\theta /2\right\} . \end{aligned}$$

From this cover, we can extract a finite subcover of \(T_{\theta }\) open balls, that is, \(\left\{ {\mathcal {C}}\left( \Phi \left( \varvec{s}_{i}\right) ,\theta /2\right) \right\} _{i=1}^{T_{\theta }}\). We denote the set of inducing variables U right before adding the instance \(\left( \varvec{x}_{t},y_{t}\right) \) by \(U_{t}\). It is apparent that the resultant set of inducing variables is the union of all instantaneous sets of inducing variables, i.e., \(U=\bigcup _{t\ge 1}U_{t}\).

We now prove that if \(\varvec{u},\varvec{v}\in U\) are two distinct elements of U, then \(\left\| \Phi \left( \varvec{u}\right) -\Phi \left( \varvec{v}\right) \right\| >\theta \). Without loss of generality, assume that \(\varvec{u}=\varvec{x}_{t}\) and \(\varvec{v}=\varvec{x}_{t'}\) with \(t>t^{'}\). Since \(\varvec{x}_{t}\) was added to the inducing set, we then have

$$\begin{aligned} {\mathrm{dist}}\left( \Phi \left( \varvec{x}_{t}\right) ,{\mathcal {L}}\left( \Phi _{U_{t}}\right) \right)>\theta ,\\ \underset{\varvec{d}}{\min }\left\| \Phi \left( \varvec{x}_{t}\right) -\sum _{\varvec{x}\in U_{t}}d_{x}\Phi \left( \varvec{x}\right) \right\| >\theta . \end{aligned}$$

Since \(\varvec{v}=x_{t'}\in U_{t}\), by choosing \(d_{v}=1\) and \(d_{x}=0,\,x\ne v\), we gain

$$\begin{aligned} \left\| \Phi \left( \varvec{x}_{t}\right) -\Phi \left( \varvec{x}_{t'}\right) \right\|&>\theta ,\\ \left\| \Phi \left( \varvec{u}\right) -\Phi \left( \varvec{v}\right) \right\|&>\theta . \end{aligned}$$

Therefore, each open ball in the finite subcover \(\left\{ {\mathcal {C}}\left( \Phi \left( \varvec{s}_{i}\right) ,\theta /2\right) \right\} _{i=1}^{T_{\theta }}\) has diameter less than \(\theta \) (any two of its points are at distance strictly smaller than \(\theta \)) and hence cannot contain two distinct points of U. Besides, U is a subset of \(\bigcup _{i=1}^{T_{\theta }}{\mathcal {C}}\left( \Phi \left( \varvec{s}_{i}\right) ,\theta /2\right) \). Hence, the cardinality of U cannot exceed \(T_{\theta }\), i.e., \(\left| U\right| \le T_{\theta }\). \(\square \)
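The boundedness of the inducing set is easy to observe empirically. The sketch below (ours; the kernel width, \(\theta \), and \(\sigma ^{2}\) are arbitrary illustration values) streams the same data in several random orders, adds a point to U whenever its feature-space distance to \({\mathcal {L}}\left( \Phi _{U}\right) \), computed as \(\left( 1+\sigma ^{2}-k_{U}^{\mathsf {T}}\left[ K_{UU}+\sigma ^{2}I\right] ^{-1}k_{U}\right) ^{1/2}\) for a normalized kernel as in Proposition 1, exceeds \(\theta \), and prints inducing-set sizes that remain bounded regardless of the arrival order.

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    """Gaussian kernel matrix between the rows of X and the rows of Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def inducing_set_size(X, theta, sigma2=0.1, gamma=1.0):
    """Stream X once; add x to U whenever dist(Phi(x), L(Phi_U)) > theta."""
    U = []
    for x in X:
        if not U:
            U.append(x)
            continue
        Zu = np.asarray(U)
        Kuu = rbf(Zu, Zu, gamma) + sigma2 * np.eye(len(U))      # gram of the noisy features on U
        ku = rbf(Zu, x[None, :], gamma).ravel()                  # K(u, x) for u in U
        dist2 = (1.0 + sigma2) - ku @ np.linalg.solve(Kuu, ku)   # squared distance to the span
        if np.sqrt(max(dist2, 0.0)) > theta:
            U.append(x)
    return len(U)

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(3000, 2))
for seed in range(3):
    order = np.random.default_rng(seed).permutation(len(X))
    print(inducing_set_size(X[order], theta=0.5))                # bounded size for every order
```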

Cite this article

Le, T., Nguyen, K., Nguyen, V. et al. GoGP: scalable geometric-based Gaussian process for online regression. Knowl Inf Syst 60, 197–226 (2019). https://doi.org/10.1007/s10115-018-1239-1
