1 Introduction

There are numerous situations, in both research and industry, where it is necessary to find the optimal input to a black-box function whose sampling is difficult or expensive. Bayesian optimization is one of the most evaluation-efficient methods for finding the input, \(x^{*}\), that produces the optimal value of such systems [11]. It has been successfully applied to many problems in a wide range of fields, including materials science, biomedical science, and other areas of computer science. An example of an application in materials science is the development of new polymer fibres [9]. In biomedical science, it has been used for many applications, including studying how age affects time perception [19] and synthetic gene design [4]. Another application is the selection of hyperparameters for other machine learning algorithms [17].

Bayesian optimization methods work by fitting a probabilistic model to the available data (evaluation locations for the objective and the corresponding function values). This model provides a distribution over possible functions on a bounded input space, \(\mathcal {X}\subset \mathbb {R}^{d}\). The most common choice of model is the Gaussian process [15].

The predictions of the probabilistic model are then used to make intelligent decisions about where to evaluate the objective function next, so that the optimum is found using a reduced number of function evaluations. These decisions are made through an acquisition function, which maps each point \(x\in \mathcal {X}\) to a measure of how useful sampling at that point would be for determining the black-box function’s true optimum. Optimizing the acquisition function therefore yields the best possible next sample of the black-box function given the data and prior knowledge. The objective is evaluated at the chosen point, and the result is used to update the model, from which the subsequent point is determined via the updated acquisition function. This data-driven procedure finds the optimum of the function in far fewer iterations than if samples had been taken at random [3].

The choice of acquisition function can greatly impact the number of iterations necessary to find the optimal input. Poor acquisition functions can lead to sub-optimal results or the need for a larger number of costly iterations. A good acquisition function needs to balance sampling near known good points (exploitation) against searching for new peaks in unexplored regions (exploration). There are currently many acquisition functions providing various degrees of exploration and exploitation. Two popular choices are expected improvement (EI) [7] and Gaussian process upper confidence bound (GP-UCB) [18]: EI is more exploitative but requires no hyperparameters, while GP-UCB explores more but requires the specification of several hyperparameters. These hyperparameters can reduce optimization performance if they are not suited to the problem, and determining them is both computationally costly and potentially inaccurate. As such, EI is the more popular choice.

We propose a modification to EI that improves its exploration in the early stages of the experiment but converges to EI's usual level of exploitation at later stages. This method is detailed in Sect. 3, along with a proof that it has a sub-linear regret bound. In Sect. 4 we discuss results from several experiments performed using our method. First, we verify that our method has increased exploration by testing it against competing methods on a synthetic function designed to favour high-exploration methods. We then compare its performance with these methods on several benchmark functions and a machine learning hyperparameter tuning problem. Finally, we discuss results concerning the analytical properties of our method.

2 Bayesian Optimization and Expected Improvement

Below we first provide background on Bayesian optimization and Gaussian processes. We then discuss acquisition functions with a focus on expected improvement.

2.1 Bayesian Optimization

Bayesian optimization is an efficient method for optimizing noisy, expensive black-box functions [7]. More formally, the ultimate goal of the method is to find the input,

$$\begin{aligned} x^{*}=\underset{x\in \mathcal {X}}{\text {argmax}}\, f(x), \end{aligned}$$
(1)

that maximises the black-box function, f(x), over the bounded input space, \(\mathcal {X}\subset \mathbb {R}^{d}\). It is possible to draw potentially noisy samples directly from the function, \(y_t=f(x_t)+\epsilon \), where \(\epsilon \) is a random noise term, \(\epsilon \sim \mathcal {N}(0,\sigma _n^{2})\), with unknown noise standard deviation \(\sigma _n\). However, doing so is expensive, so we wish to determine \(x^*\) in as few samples as possible. To do this, Bayesian optimization uses a statistical model of the black-box function to construct a surrogate that is cheaper to evaluate. The statistical model is built from all current information about the system, including all prior knowledge and all t sampled input-output pairs, \(D_{t}=\{x_{i},y_{i}\}_{i=1}^{t}\).

The statistical model is often chosen to be a Gaussian process due to its flexibility and analytic properties [2]. While the statistical model is generally not accurate enough to directly locate the optimum of the function, it gives a probabilistic estimate of the function with epistemic uncertainties. This means it can be used to select the “best” new point to sample. This is done by finding the optimum of a surrogate function, called an acquisition function, which emphasises characteristics that are desirable for the new point to have. Acquisition functions are discussed further in Sect. 2.3.

Once the new sample is evaluated, it can be used to improve the model, allowing us to find a new, potentially better point to sample. This process is iterated until a predetermined stopping condition has been met. As samples are costly, it is common to choose a maximum number of iterations as the stopping criterion, but other criteria can be used as well, such as stopping when a satisfactory result has been found or when the possible improvement predicted by the model becomes too small. A more detailed review of Bayesian optimization can be found in [2].

2.2 Gaussian Process

A Gaussian process is a statistical model of the black-box function. It represents the function values, f(x), at each point \(x\in \mathcal {X}\), as infinitely many correlated Gaussian random variables. As such, it is completely characterized by its mean and covariance functions, m(x) and \(k(x_{i},x_{j})\). More formally, we assume that \(f(x)\sim \mathcal {GP}(m(x),k(x_{i},x_{j}))\). The covariance function, also called a kernel, has a profound impact on the shape of the resulting Gaussian process. As such, the use of an appropriate kernel is vital.

One of the most popular kernels is the squared exponential kernel, given by \(k_{SE}(x_{i},x_{j})=\exp \left( -\frac{\left\| x_{i}-x_{j}\right\| ^{2}}{2l^{2}}\right) \). Here the length scale is completely determined by a single hyperparameter, l. We chose this kernel because it is simple and translation invariant; the latter property is important for the generation of Thompson samples, as discussed in Sect. 3. Other popular kernel functions are discussed in Rasmussen and Williams [15].
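For concreteness, a minimal NumPy implementation of this kernel might look as follows; the length scale value here is an arbitrary placeholder rather than a tuned setting:

```python
import numpy as np

def se_kernel(xi, xj, length_scale=0.1):
    """Squared exponential kernel: exp(-||xi - xj||^2 / (2 l^2))."""
    # length_scale (l) is a placeholder, not the value used in our experiments.
    sq_dist = np.sum((np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)) ** 2)
    return np.exp(-sq_dist / (2.0 * length_scale ** 2))
```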

Conditioning on our data with this kernel gives a posterior distribution over the function, \(f(x)\vert D_t\sim \mathcal {GP}\left( m(x),\mathbf {K}_t\right) \). Here \(\mathbf {K}_t=[k(x_{i},x_{j})]_{\forall x_{i},x_{j}\in D_{t}}\) is the kernel matrix, which acts as the covariance matrix for the distribution. The posterior, in turn, can be used to calculate a predictive distribution, \(p(f(x)\mid D_t,x)=\mathcal {N}\left( \mu _t(x),\sigma ^2_t(x)\right) \). The predictive distribution allows us to estimate the function value at any point x through the predictive mean, \(\mu _t(x)\), and variance, \(\sigma ^2_t(x)\). These are given by \(\mu _t(x)=\mathbf {k}_{*}(\mathbf {K}_t+\sigma _n^{2}\mathbf {I})^{-1}\varvec{y}\) and \(\sigma _t^{2}(x)=k(x,x)-\mathbf {k}_{*}(\mathbf {K}_t+\sigma _n^{2}\mathbf {I})^{-1}\mathbf {k}_{*}^{T}\) with \(\mathbf {k}_{*}=[k(x_{1},x),k(x_{2},x),\ldots ,k(x_{t},x)]\). Here \(\mathbf {I}\) is the identity matrix with the same dimensions as \(\mathbf {K}_t\) and \(\sigma _n\) is the function noise standard deviation.
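These predictive equations translate directly into code. The sketch below assumes a zero prior mean and a generic kernel callable (such as se_kernel above); the noise variance value is illustrative:

```python
import numpy as np

def gp_predict(X, y, x, kernel, noise_var=1e-4):
    """Predictive mean and variance of a zero-mean GP at a query point x."""
    t = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # kernel matrix K_t
    k_star = np.array([kernel(xi, x) for xi in X])            # vector k_*
    A = K + noise_var * np.eye(t)                             # K_t + sigma_n^2 I
    mu = k_star @ np.linalg.solve(A, np.asarray(y, float))    # predictive mean mu_t(x)
    var = kernel(x, x) - k_star @ np.linalg.solve(A, k_star)  # predictive variance
    return mu, max(var, 0.0)                                  # clamp round-off negatives
```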

2.3 Acquisition Functions

Once the Gaussian process has been built, it is used to select the next point, \(x_t\), at which to sample the black-box function. However, exactly what qualifies a point to be \(x_t\) is non-trivial, and there are many potentially desirable properties that could be used to select it. Once a desired property is chosen, an acquisition function, \(\alpha (x)\), is used to calculate it. The acquisition function is generally far cheaper to evaluate than the black-box function, to the point where it is efficient to perform a global optimization over \(\mathcal {X}\) to determine a single sample of the black-box function. More formally, the optimal next point is given by

$$\begin{aligned} x_t=\arg \max _{x\in \mathcal {X}}\alpha (x). \end{aligned}$$
(2)

Improvement Based Acquisition Functions. One of the most basic families of acquisition functions is the family of improvement-based acquisition functions. These use the potential improvement over what is believed to be the current maximum, called the incumbent. The incumbent is often taken to be the current best observed value, \(y^+ = \max _{i\le t}(y_i) \). The improvement is then given by \(I(x)=\max (f(x)-y^{+},0)\).

Probability of Improvement. A simple acquisition function is the probability of improvement (PI) [8], which gives the probability that a given point will improve over the incumbent. Despite being an intuitive and simple formulation, PI often favours points near the incumbent [2]. As a result, the algorithm tends to over-exploit. This can cause it to fail to quickly find promising peaks away from the incumbent, reducing optimization efficiency when there is more than one peak in the black-box function. This lack of exploration can be remedied by maximizing the expected improvement instead of PI [6].

Expected Improvement. As f(x) is modelled by its Gaussian process predictive distribution, I(x) is likewise a function of this random variable. Taking the expectation gives the expected amount of improvement at any point \(x\in \mathcal {X}\), yielding the expected improvement acquisition function: \(\alpha ^{EI}(x)=\mathbb {E}[I(x)]\). This can be expressed in closed form using the Gaussian process predictive mean and variance [7]:

$$\begin{aligned} \alpha ^{EI}(x)={\left\{ \begin{array}{ll} (\mu (x)-y^{+})\varPhi (z)+\sigma (x)\phi (z), &{} \mathrm {if}\,\sigma (x)>0\\ 0 &{} \mathrm {if}\,\sigma (x)=0 \end{array}\right. } \end{aligned}$$
(3)

where \(z=\frac{\mu (x)-y^{+}}{\sigma (x)}\), \(\phi \) is the standard normal PDF, and \(\varPhi \) is the standard normal CDF. For a full analytical derivation of EI, we refer interested readers to [14].
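In code, Eq. (3) is a few lines; the sketch below assumes the predictive mean and standard deviation have already been computed (e.g. by a routine such as gp_predict above):

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """Closed-form EI of Eq. (3) at a point with GP prediction (mu, sigma)."""
    if sigma <= 0.0:
        return 0.0            # no uncertainty: no expected improvement
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)
```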

EI has better exploration than PI but still tends to over-exploit in many situations, such as when it hits a local optimum. Despite this, EI is currently the most commonly used acquisition function due to its consistent performance without the need to choose additional hyperparameters.

A Heuristic Approach for Boosting the Exploration of EI: \(\varvec{\zeta }\)-EI. It is commonly believed that artificially increasing the incumbent by some positive \(\zeta \) will reduce the value of the acquisition function near the currently sampled points, boosting exploration [10]. However, this method does not work well in practice because it is not easy to choose the right value of \(\zeta \): if the value is large, the algorithm will significantly over-explore, often degrading optimization performance.

3 The Proposed E\(^3\)I Method

In this section we outline our modification to EI, which improves its exploration without causing it to significantly over-explore. We then prove that, under some mild assumptions, it has a sub-linear regret bound.

3.1 Thompson Sampling

For our method, we wish to generate full random approximations of the black-box function. Hernández-Lobato et al. [5] have developed a method for doing this through Thompson sampling. For a shift-invariant kernel such as the squared exponential kernel used here, Bochner’s theorem [1] states that the kernel has a Fourier dual, s(w), equal to its spectral density. Normalizing this as \(\hat{s}(w)=s(w)/\beta \) allows us to represent the kernel as

$$\begin{aligned} k(x_i,x_j)=2\beta \mathbb {E}_{\hat{s}(w)}\left[ \cos (w^Tx_i+b)\cos (w^Tx_j+b)\right] \end{aligned}$$
(4)

where \(b\sim \mathcal {U}[0,2\pi ]\). If we draw V random samples of w and let \(\phi (x) =\sqrt{\frac{2\beta }{V}}\left\{ \cos (Wx+b),\sin (Wx+b)\right\} \), we can approximate the kernel with \(k(x_i,x_j)\approx \phi (x_i)^T\phi (x_j)\). By setting \(\varPhi =\left[ \phi (x_{1}),\ldots ,\phi (x_{t})\right] \), we can also approximate the noisy kernel matrix with \(\mathbf {K}\approx \varPhi ^{T}\varPhi +\sigma ^{2}\mathbf {I}\), where \(\mathbf {I}\) is the identity matrix of matching dimension. Because these estimates are built from the random draws of w, each draw defines a random but plausible realization of the function. This means that, if V is sufficiently large, the corresponding predictive mean can be viewed as a complete estimate of the black-box function given the current data:

$$\begin{aligned} f(x)\approx g(x)=\phi (x)^{T}(\varPhi \varPhi ^{T}+\sigma ^{2}\mathbf {I})^{-1}\varPhi \varvec{y} \end{aligned}$$
(5)

The process for generating and finding the optima of these Thompson samples is outlined in Algorithm 1.

Algorithm 1. Generating a Thompson sample g(x) and finding its maximum \(g^{*}\).
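As a rough sketch of Algorithm 1, the following draws one approximate posterior sample via random Fourier features and maximises it with multi-start L-BFGS-B (the optimizer we use in Sect. 4). For simplicity we use the cosine-with-random-phase form of the features and a unit kernel amplitude, rather than the paired cosine/sine features above; the length scale, noise variance, V, and restart count are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.optimize import minimize

def draw_thompson_sample(X, y, d, length_scale=0.1, noise_var=1e-4, V=200):
    """One approximate posterior sample g(x), in the spirit of Eq. (5)."""
    # For the squared exponential kernel the spectral density is Gaussian,
    # so the frequencies w are drawn from N(0, I / l^2).
    W = np.random.randn(V, d) / length_scale
    b = np.random.uniform(0.0, 2.0 * np.pi, size=V)        # phases b ~ U[0, 2pi]

    def features(x):                                       # phi(x), shape (V,)
        return np.sqrt(2.0 / V) * np.cos(W @ np.atleast_1d(x) + b)

    Phi = np.stack([features(x) for x in X], axis=1)       # columns are phi(x_i)
    A = Phi @ Phi.T + noise_var * np.eye(V)                # Phi Phi^T + sigma^2 I
    theta = np.linalg.solve(A, Phi @ np.asarray(y, float)) # linear weights of g
    return lambda x: features(x) @ theta                   # g(x) = phi(x)^T theta

def maximise_sample(g, bounds, n_starts=10):
    """Multi-start L-BFGS-B maximisation of a Thompson sample, returning g*."""
    best = -np.inf
    for _ in range(n_starts):
        x0 = np.array([np.random.uniform(lo, hi) for lo, hi in bounds])
        res = minimize(lambda x: -g(x), x0, bounds=bounds, method="L-BFGS-B")
        best = max(best, -res.fun)
    return best
```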

3.2 Exploration Enhanced Expected Improvement (E\(^3\)I)

The Thompson sample functions have two useful properties. First, without noise they agree with the currently sampled points exactly (i.e. \(g(x_{i})=y_{i},\forall (x_{i},y_{i})\in D_{t}\)). This means that \(g(x^{+})=f(x^+)=y^{+}\). As such, either the maximum of g(x) occurs at \(x^{+}\), in which case \(g^{*}=\max _{x}g(x)=y^+\), or it occurs elsewhere, in which case \(g^*>y^+\). Second, the Thompson sample functions converge to the true function as the number of iterations increases. This means that \(g^{*}\) should converge towards \(y^+\).

These two properties allow us to use \(g^{*}\) as the incumbent in EI instead of \(y^{+}\). As \(g^{*}\ge y^{+},\) the algorithm will have greater exploration. However, as the Thompson samples are sensible approximations of the underlying function, the method does not have the same risk of over-exploration as artificially increasing the incumbent does. As \(g^{*}\rightarrow y^{+},\) it should also explore less at later stages in the algorithm when exploration is less important.

This approach assumes that any given Thompson sample is a good approximation of the black-box function given the current data. Due to variations between Thompson samples, any given sample may be an outlier, voiding this assumption. As such, we instead look at the distribution of possible Thompson samples, making the new acquisition function a function of this distribution. To obtain the best point, we take the expected value over this distribution, i.e. \(\alpha ^{E^{3}I}(x)=\mathbb {E}_{g}[\mathbb {E}_{x}[I(x,{g}^{*})]]\). Unfortunately, determining this directly is difficult. As such, we instead generate M Thompson samples with maxima \(g^*_1,g^*_2,\ldots ,g^*_M\) and take the sample mean. Setting \(z_{m}=\frac{\mu (x)-g_{m}^{*}}{\sigma (x)}\) and \(\tau (z)=z\varPhi (z)+\phi (z)\), we get

$$\begin{aligned} \alpha ^{E^{3}I}(x)=\frac{1}{M}\sum _{m=1}^{M}\mathbb {E}_{x}[I(x,g_{m}^{*})]={\left\{ \begin{array}{ll} \frac{\sigma (x)}{M}\left[ \sum _{m=1}^{M}\tau (z_{m})\right] , &{} \mathrm {if}\,\sigma (x)>0\\ 0, &{} \mathrm {if}\,\sigma (x)=0 \end{array}\right. } \end{aligned}$$
(6)

We outline the E\(^3\)I routine in Algorithm 2.

Algorithm 2. The E\(^3\)I routine.
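The acquisition computation at the heart of Algorithm 2 is a direct transcription of Eq. (6); the sketch below assumes the M incumbents \(g^*_m\) have already been obtained (e.g. with the Algorithm 1 sketch above):

```python
import numpy as np
from scipy.stats import norm

def e3i_acquisition(mu, sigma, g_stars):
    """E^3I of Eq. (6): EI averaged over M Thompson-sample maxima g*_m."""
    if sigma <= 0.0:
        return 0.0
    z = (mu - np.asarray(g_stars, dtype=float)) / sigma  # z_m for each incumbent g*_m
    tau = z * norm.cdf(z) + norm.pdf(z)                  # tau(z) = z Phi(z) + phi(z)
    return sigma * float(np.mean(tau))
```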

3.3 Convergence

The regret bound is one of the basic criteria for evaluating the performance of an optimization algorithm. EI has been shown to converge at a sub-linear rate under a variety of assumptions. Bull [3] and Ryzhov [16] both derive convergence rates in the absence of noise. Wang and de Freitas [20] derived a convergence rate in the noisy setting, but needed to use \(\mu ^+(x)=\max _{x}\mu (x)\) as the incumbent. As E\(^3\)I utilizes a different incumbent, it is not compatible with this approach. Nguyen et al. [14] have shown that EI has a sub-linear convergence rate if a minimum-improvement stopping condition is used. Their proof is valid in the noisy setting and does not require a modified incumbent. As such, we extend and employ it to show that the same holds for our method. Many of the lemmas used in their proof can be applied directly to our method.

We start our derivation for the regret bound of E\(^3\)I as follows.

Lemma 1

(Srinivas et al. [18]). Let \(\delta \in (0,1)\) and assume that the noise variables, \(\epsilon _{t}\), are uniformly bounded by \(\sigma \). Define \(\beta _{t}=2\left\| f\right\| _{k}^{2}+300\gamma _{t}\ln ^{3}\left( \frac{t}{\delta }\right) \), then

$$\begin{aligned} p\left( \forall t,\forall x\in \mathcal {X},\left| \mu _{t}(x)-f(x)\right| \le \sqrt{\beta _{t}}\sigma _{t}(x)\right) \ge 1-\delta \end{aligned}$$

Lemma 2

The improvement function, \(I_{t,m}(x)=\max \left( 0,f(x)-g_{t,m}^{*}\right) \), and the acquisition function, \(\alpha _{t}^{E^{3}I}(x)=\frac{1}{M}\sum _{m=1}^{M}\mathbb {E}[I_{t,m}(x)]\) satisfy the inequality \(\frac{1}{M}\sum _{m=1}^{M}I_{t,m}(x)-\sqrt{\beta _{t}}\sigma _{t-1}(x)\le \alpha _{t}^{E^{3}I}(x)\).

Proof

In the case that \(\sigma _{t-1}(x)=0\), we have \(\mathbb {E}[I_{t,m}(x)]=I_{t,m}(x),\,\forall m\in [1,M]\). This means that \(\alpha _{t}^{E^{3}I}(x)=\frac{1}{M}\sum _{m=1}^{M}\mathbb {E}[I_{t,m}(x)]=\frac{1}{M}\sum _{m=1}^{M}I_{t,m}(x)\) so the result is trivial. For \(\sigma _{t-1}(x)>0\), the proof is as follows. Using Lemma 1, \(q_{t-1,m}=\frac{f(x)-g_{t-1,m}^{*}}{\sigma _{t-1}(x)}\), and \(z_{t-1,m}=\frac{\mu _{t-1}(x)-g_{t-1,m}^{*}}{\sigma _{t-1}(x)},\) we can get the following result:

$$\begin{aligned} \alpha _{t}^{E^{3}I}(x)&\ge \frac{1}{M}\sum _{m=1}^{M}\sigma _{t-1}(x)\tau \left( q_{t-1,m}-\sqrt{\beta _{t}}\right) \\&\ge \frac{1}{M}\sum _{m=1}^{M}\sigma _{t-1}(x)\left( q_{t-1,m}-\sqrt{\beta _{t}}\right) \qquad \qquad \mathrm {by}\,\tau (z)\ge z \end{aligned}$$

If \(I_{t}(x)=0\), the lemma becomes \(\alpha _{t}^{E^{3}I}(x)\ge -\sqrt{\beta _{t}}\sigma _{t-1}(x)\). As \(\alpha _{t}^{E^{3}I}(x)\ge 0\), \(\sqrt{\beta _{t}}\ge 0\), and \(\sigma _{t-1}(x)\ge 0\), this is always true. For \(I_{t}(x)>0\) we have \(q_{t-1,m}=\frac{I_{t}(x)}{\sigma _{t-1}(x)}\). This gives us

$$\begin{aligned} \alpha _{t}^{E^{3}I}(x)\ge \frac{1}{M}\sum _{m=1}^{M}I_{t,m}(x)-\sqrt{\beta _{t}}\sigma _{t-1}(x) \end{aligned}$$

which concludes our proof.    \(\square \)

We now prove the main theorem:

Theorem 3

Let \(\kappa >0\) be a predefined small constant used as a stopping criterion, \(\sigma ^{2}\) be the measurement noise variance, \(C\triangleq \log \left[ \frac{1}{2\pi \kappa ^{2}}\right] \), \(\beta _{t}=2\left\| f\right\| _{k}^{2}+300\gamma _{t}\ln ^{3}\left( \frac{t}{\delta }\right) \) and \(\delta \in (0,1)\). Then, with probability at least \(1-\delta \), after T iterations the cumulative regret of E\(^3\)I, using a collection of maxima samples \(g^*_m\) drawn by Thompson sampling as the incumbents, obeys the following sublinear rate: \(R_{T}\lesssim \sqrt{T\beta _{T}\gamma _{T}}\sim \mathcal {O}\left( \sqrt{ T \times (\log T)^{d+4}}\right) \), where \(\gamma _{T}\sim \mathcal {O}\left( \left( \log T\right) ^{d+1}\right) \) is the maximum information gain for the squared exponential kernel.

Proof

Let \(x_{t}=\underset{x\in \mathcal {X}}{\text {argmax}}\, \alpha _{t}^{E^{3}I}(x)\) be the choice at iteration t. The instantaneous regret is:

$$\begin{aligned} Mr_{t}&=Mf(x^{*})-Mf(x_{t})\\&=Mf(x^{*})-Mf(x_{t})+\sum _{m=1}^{M}g_{t-1,m}^{*}-\sum _{m=1}^{M}g_{t-1,m}^{*}\\&=\underbrace{\sum _{m=1}^{M}\left[ f(x^{*})-g_{t-1,m}^{*}\right] }_{A_{t}}+\underbrace{\sum _{m=1}^{M}\left[ g_{t-1,m}^{*}-f(x_{t})\right] }_{B_{t}} \end{aligned}$$

We need to connect this with the maximum information gain, \(\gamma _{T}\). This can be done by bounding \(r_{t}\) with the GP posterior variance. We bound \(A_{t}\) using Lemma 1, Lemma 2, and the fact that \(\alpha _{t}^{E^{3}I}(x^{*})\le \alpha _{t}^{E^{3}I}(x_{t})\):

$$\begin{aligned} A_{t}&=\sum _{m=1}^{M}\left[ f(x^{*})-g_{t-1,m}^{*}\right] \le \sum _{m=1}^{M}I_{t,m}(x^{*})\\ A_{t}&\le M\left[ \alpha _{t}^{E^{3}I}(x^{*})+\sqrt{\beta _{t}}\sigma _{t-1}(x^{*})\right] \qquad \qquad \qquad \mathrm {by}\,\mathrm {Lemma}\,{2}\\&\le M\left[ \alpha _{t}^{E^{3}I}(x_{t})+\sqrt{\beta _{t}}\sigma _{t-1}(x^{*})\right] \qquad \qquad \quad \,\,\,\mathrm {by}\,\alpha _{t}^{E^{3}I}(x^{*})\le \alpha _{t}^{E^{3}I}(x_{t})\\&=M\left[ \sigma _{t-1}(x_{t})\tau (z_{t-1}(x_{t}))+\sqrt{\beta _{t}}\sigma _{t-1}(x^{*})\right] \quad \mathrm {by}\,\mathrm {Eq.}\,(6) \end{aligned}$$

Likewise, we bound \(B_{t}\) with the following:

$$\begin{aligned} B_{t}&=\sum _{m=1}^{M}\left[ g_{t-1,m}^{*}-\mu _{t-1}(x_{t})+\mu _{t-1}(x_{t})-f(x_{t})\right] \\&\le \sum _{m=1}^{M}\left[ \sigma _{t-1}(x_{t})(-z_{t-1}(x_{t}))+\sigma _{t-1}(x_{t})\sqrt{\beta _{t}}\right] \qquad \qquad \,\mathrm {by}\,\mathrm {Lemma}\,{1}\\&=M\sigma _{t-1}(x_{t})\left[ \tau (-z_{t-1}(x_{t}))+\sqrt{\beta _{t}}-\tau (z_{t-1}(x_{t}))\right] \qquad \mathrm {by}\,z=\tau (z)-\tau (-z) \end{aligned}$$

Combining these bounds and noting that the M term cancels out, we get

$$\begin{aligned} r_{t}\le \left[ \sigma _{t-1}(x_{t})\left[ \sqrt{\beta _{t}}+\tau (-z_{t-1}(x_{t}))\right] +\sqrt{\beta _{t}}\sigma _{t-1}(x^{*})\right] \end{aligned}$$

Using the bound of \(\tau (-z_{t-1}(x_{t}))\) in Lemma 9 from [14] and setting \(C\triangleq \log \left[ \frac{1}{2\pi \kappa ^{2}}\right] \) we can simplify this to

$$\begin{aligned} r_{t}\le \underbrace{\sigma _{t-1}(x_{t})\left[ \sqrt{\beta _{t}}+1+C\right] }_{L_{t}}+\underbrace{\sqrt{\beta _{t}}\sigma _{t-1}(x^{*})}_{U_{t}} \end{aligned}$$

We now bound the cumulative regret,

$$\begin{aligned} R_{T}=\sum _{t=1}^{T}r_{t}\le \sum _{t=1}^{T}L_{t}+\sum _{t=1}^{T}U_{t} \end{aligned}$$

Using the Cauchy–Schwarz inequality in the form \((a+b+c)^{2}\le 3(a^{2}+b^{2}+c^{2})\), the fact that \(\beta _{T}\ge \beta _{t},\forall t\le T\), and Lemma 7 from [14], we can bound \(\sum _{t=1}^{T}L_{t}^{2}\) as follows:

$$\begin{aligned} \sum _{t=1}^{T}L_{t}^{2}&\le \sum _{t=1}^{T}3\sigma _{t-1}^{2}(x_{t})(\beta _{t}+1+C)\\&\le 3(\beta _{T}+1+C)\sum _{t=1}^{T}\sigma _{t-1}^{2}(x_{t})\le \frac{6(\beta _{T}+1+C)\gamma _{T}}{\log (1+\sigma ^{-2})} \end{aligned}$$

Using the Cauchy-Schwartz inequality again we get

$$\begin{aligned} \sum _{t=1}^{T}L_{t}\le \sqrt{T}\sqrt{\sum _{t=1}^{T}L_{t}^{2}}\le \sqrt{\frac{6T(\beta _{T}+1+C)\gamma _{T}}{\log (1+\sigma ^{-2})}} \end{aligned}$$
(7)

We can use Lemma 7 of [14] and the Cauchy–Schwarz inequality on \(\sum _{t=1}^{T}U_{t}\) as well to obtain a similar result:

$$\begin{aligned} \sum _{t=1}^{T}U_{t}\le \sqrt{\beta _{T}}\sum _{t=1}^{T}\sigma _{t-1}(x^{*})\le \sqrt{\frac{2T\beta _{T}\gamma _{T}}{\log (1+\sigma ^{-2})}} \end{aligned}$$
(8)

Combining Eqs. (7) and (8) gives us our regret bound:

$$\begin{aligned} R_{T}\le \sqrt{\frac{2T\gamma _T}{\log (1+\sigma ^{-2})}}\left[ \sqrt{3(\beta _{T}+1+C)}+\sqrt{\beta _{T}}\right] \end{aligned}$$

This expression is dominated by the \(\sqrt{T\beta _{T}\gamma _{T}}\) term. The maximum information gain is kernel dependent; for the squared exponential kernel used in this paper it is \(\gamma _{T}\sim \mathcal {O}\left( \left( \log T\right) ^{d+1}\right) \). This gives the regret bound stated in Theorem 3, \(R_{T}\sim \mathcal {O}\left( \sqrt{T\times \left( \log T\right) ^{d+4}}\right) \), which is sublinear: \(\lim _{T\rightarrow \infty }\frac{R_{T}}{T}=0\). We note that this takes a similar form to the bound in [14]. \(\square \)

Fig. 1. A plot of the 1D Gaussian mixture function for illustration (left) and the performance of various methods on the 5D version of the same Gaussian mixture function (right). The higher-dimensional function was used because little exploration is required in 1D. Lower is better. Note that GP-UCB and EI both get stuck on the initial lower peak. \(\zeta \)-EI manages to find the larger peak but fails to exploit it, suggesting that it may be over-exploring due to an imperfect choice of \(\zeta \). In contrast, E\(^3\)I is able to obtain a superior result through better late-stage exploitation.

4 Experiments

In this section we outline and discuss our experimental results. We apply our method to synthetic, benchmark, and real-world functions. We also test some other important properties of our method, such as the dependence on M and the convergence of the Thompson samples. The code used for this paper can be found at https://github.com/jmaberk/BO_with_E3I.

4.1 Experimental Setup

We performed several experiments comparing E\(^3\)I with standard EI, \(\zeta \)-EI with \(\zeta =0.01\) [2], and GP-UCB. In our algorithm, we scaled the inputs to be in the range [0, 1] in all dimensions and standardized the sampled function values to have zero mean and unit standard deviation. This ensures that the kernel magnitude is scaled correctly for all functions. After this scaling, we use a squared exponential kernel.
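A minimal sketch of this preprocessing, assuming the bounds are given as a list of (low, high) pairs per dimension:

```python
import numpy as np

def normalise(X, y, bounds):
    """Scale inputs to [0, 1]^d and standardise outputs, as described above."""
    lo, hi = np.asarray(bounds, dtype=float).T                     # per-dimension bounds
    X_scaled = (np.asarray(X, dtype=float) - lo) / (hi - lo)       # inputs into [0, 1]^d
    y_std = (np.asarray(y, dtype=float) - np.mean(y)) / np.std(y)  # zero mean, unit std
    return X_scaled, y_std
```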

We ran each experiment 10 times per function with \(d+1\) random initial points, where d is the number of input dimensions. Each run was stopped after \(T=20 d\) iterations. As our simple multi-start L-BFGS-B optimizer minimizes, we minimized the negative of all functions instead of maximizing them. As such, lower results are better.

4.2 Synthetic Multi-peak Function

As we expect our method to have higher exploration than EI, we test it on functions which require exploration for good performance. In particular, we consider multi-peak functions, on which methods with poor exploration can get stuck on sub-optimal peaks, significantly reducing performance. We therefore chose a two-peak Gaussian mixture function to verify the high exploration of our method. One peak was chosen to be wide (\(\mathcal {N}(0.7,0.01)\)) while the other was chosen to be narrow but taller (\(\mathcal {N}(0.1,0.001)\)), so that acquisition functions with poor exploration tend to get stuck on the wider, smaller peak and hence perform worse. We applied our suite of acquisition functions to this problem and have summarised the results in Fig. 1. A sketch of the 1D version of the test function is given below.
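The following sketch illustrates the 1D version; the equal weighting of the two components is an assumption for illustration (with it, the narrow component is still the taller peak):

```python
import numpy as np
from scipy.stats import norm

def gaussian_mixture_1d(x):
    """Two-peak test function: a wide peak and a narrower but taller peak."""
    wide = norm.pdf(x, loc=0.7, scale=np.sqrt(0.01))     # wide, lower peak N(0.7, 0.01)
    narrow = norm.pdf(x, loc=0.1, scale=np.sqrt(0.001))  # narrow, taller peak N(0.1, 0.001)
    return wide + narrow
```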

It is evident that our method finds the narrow peak faster and that the performance gap increases with the number of dimensions.

Fig. 2. Performance of various methods on a range of multi-peak benchmark functions. Lower is better. Note that our method generally converges slowly in the early stages, while other methods are exploiting, but it can beat the exploitative methods by finding better peaks. This is because E\(^3\)I tends to explore in the early stages and then exploits later to hit the optimum.

4.3 Benchmark Functions

We also tested our method on several common multi-peak benchmark functions: the Levy (5D), Schwefel (4D), Shubert (2D), and Ackley (5D) functions. The results are displayed in Fig. 2.

These results show that our method is able to find a better optimum more quickly than the other methods on these multi-peak test functions. GP-UCB also does very well on the Levy and Shubert functions, while \(\zeta \)-EI does fairly well on all experiments except the Levy function. This variance in performance is unsurprising, as both methods have parameters that control their level of exploration; if these are not suited to the problem, they can be detrimental to the algorithm’s performance. Our method’s increased exploration is automatically adjusted through the Thompson samples and therefore does not face this issue. As such, even in cases where these methods performed well, E\(^3\)I was able to improve on them.

Fig. 3. Results for hyperparameter tuning in the BNMC experiment. Note that, again, our algorithm performs poorly in the early stages while it explores for new optima, but it finds a better optimum sooner than the other methods.

4.4 Machine Learning Hyperparameter Tuning

Finally, we tested our method on a real-world application: the determination of optimal hyperparameters for a machine learning algorithm called Bayesian Nonparametric Multi-label Classification (BNMC) [12]. This algorithm efficiently classifies multi-labelled data by exploiting the correlation between the multiple labels and features. Its performance is dependent on six hyperparameters, which we tune with Bayesian optimization: the Dirichlet concentration parameters for both the feature and the label, the learning rates for both SVI and SGD, the truncation threshold, and the stick-breaking parameter. The dataset, called SceneData, consisted of 1196 test and 1211 training samples, each with 294 features. The results with our suite of acquisition functions are given in Fig. 3, with the F1 score used as the performance measure.

4.5 Sensitivity Analysis with Respect to the Number of Optima Samples, M

One of our key assumptions is that we can approximate the expectation over the distribution of Thompson samples by its sample mean. As \(M\rightarrow \infty \), this becomes exact; however, a very large M increases computational cost. As such, we wish to find a value of M that does not compromise our results while also not being too expensive. To determine this, we ran the same experiment on both a 2D Shubert function and a 2D Schwefel function for a range of M values from 1 to 200; the results are summarised in Fig. 4. From these results, \(M=100\) appears to be an appropriate number of Thompson samples.

Fig. 4. Performance of E\(^3\)I on two functions with various numbers of Thompson samples, M. Increasing M noticeably improves the results until \(M=50\), after which little further improvement is seen.

4.6 Computational Considerations

While our method has competitive performance, it comes at a considerable computational cost. Each Thompson sample requires both the inversion of a \(V\times V\) matrix and a global optimization step whose cost may increase exponentially with the dimension, d. The overall cost scales significantly with both the number of data points and the number of input dimensions, and is \(\mathcal {O}(MNV^2)\), where N is the number of observations [13]. To give this some context, the average time per iteration for the 2D Shubert function above was 0.33 s with EI and 26 s with E\(^3\)I. Moving up to the 4D Schwefel function, these become 1.4 s and 81 s respectively. These costs may seem high, but they are negligible compared to the sampling costs in many of the areas where Bayesian optimization is applied.

Fig. 5. The reduction of the standard deviation between the Thompson sample function maxima (left) and the convergence of \(\bar{g^*}\) to \(f^*=\max _{x}f(x)\) (right). As more samples are taken, the inter-sample variance decreases and the mean approaches \(f^*\). This suggests that the samples are converging properly as the space is explored.

One way to potentially reduce the computational cost is the method of Wang et al. [21], which finds the maxima of Thompson samples by sampling from a Gumbel distribution. However, this method makes several assumptions that may lead to inaccurate results, so we leave it for future work.

4.7 Empirical Convergence Analysis of Thompson Samples

One of the key assumptions of our method is that the Thompson sample functions converge to the true function as T increases. This convergence was tested experimentally; the results are shown in Fig. 5. It is evident that the sample mean, \(\bar{g^*}=\frac{1}{M}\sum _{m=1}^Mg^*_m\), converges towards \(f^*\), and that the sample standard deviation, \(\sigma (g^*)= \sqrt{ \frac{1}{M} \sum _{m=1}^M(g^*_m - \bar{g}^*)^2}\), decreases with the number of iterations. This suggests that the Thompson samples are converging properly.

5 Conclusion

We have proposed a new approach for balancing exploration and exploitation in Bayesian optimization. Our approach makes use of Thompson sampling to guide the level of exploration. This results in the E\(^3\)I acquisition function.

Our method has been shown to perform better than competing methods on several multi-peak test functions and on hyperparameter tuning for a BNMC experiment. We have also shown that it has a sub-linear regret bound.

The most important next step in improving \(\mathrm {E^{3}I}\) is to resolve some of its computational issues. Beyond this, the effect of similar distribution-based approaches on acquisition functions other than EI should be explored.