1 Introduction

Experimental optimisation is widely used in scientific studies and engineering design to find optimal solutions to a variety of problems via experimentation. In its most common form it seeks globally optimal solutions of unknown black-box functions that are often expensive to evaluate. Bayesian optimisation offers a sample-efficient solution to these kinds of problems. For example, it has been used in the scientific community for synthetic gene design [1], alloy optimisation [2] and fiber yield optimisation [3]. In machine learning it is often used to find optimal hyperparameters for learning algorithms and optimisation routines [4]. Bayesian optimisation requires a probabilistic model of the function being optimised, and a Gaussian process is a popular choice of prior for the function model. The posterior computed from the existing observations is then used to build a computationally cheap acquisition function that selects the next evaluation point. A variety of acquisition functions exist, such as Probability of Improvement [5], Expected Improvement [6], GP-Upper Confidence Bound (UCB) [7] and Predictive Entropy Search (PES) [8]. Nearly all acquisition functions address the trade-off between sampling regions where the posterior mean is high (exploitation) and sampling regions where the uncertainty is high (exploration) [9]. The maximiser of the acquisition function offers the best chance of being the optimum. Since acquisition functions are computationally cheap, they admit standard global optimisation routines such as DIRECT [10]. Bayesian optimisation runs in a loop, with experiments sequentially performed at the optima of the respective acquisition functions until an acceptable solution is found or the iteration budget is exhausted. However, the generic Bayesian optimisation approach is susceptible to the “cold start” problem, where it may recommend several points with low function values before reaching a high function value region. For highly expensive experiments, for example, tuning the hyperparameters of a large deep network that takes weeks to train on massive data across many clusters of GPUs, the cost of a “cold start” can be substantial, and we would like to avoid it.

A principled approach to alleviating the “cold start” issue is to utilize the knowledge acquired in previous function (source) optimisations to achieve faster convergence when optimising a new related function (target) via transfer learning. In the state of the art, the authors of [11] assume high similarity in the rank-space of function values and develop a model that transfers function rankings between source and target. Similarly, [12] assumes high similarity between all the functions once their mean functions are subtracted out. Neither method models task-to-task relatedness, and both are hence susceptible to negative transfer in the presence of a very different source function. In [13], the authors assume that the task relationships are already known and utilize this knowledge of the source-target relationship in a multi-task setting. In [14, 15], the authors propose to compute overall similarity between functions based on all the observations, while [16, 17] use meta-features to compute similarities between functions. The former can be misled by differences in function scaling, and the latter depends crucially on the availability of the right kind of meta-features. Most of these methods assume function-space similarity, instead of a more aligned similarity measure based on the location of the optima, and hence may fail to find the global optimum if the functions have different scaling or low relatedness. For this reason, a robust transfer learning framework for Bayesian optimisation that transfers knowledge based only on the proximity of the function optima is still an open problem.

In this paper, we propose a transfer learning framework for Bayesian optimisation that performs knowledge transfer based on the location of the optima in the input space and is thus invariant to scaling and robust to noise in the function. Since the location of the global optimum is not known perfectly from finite observations (be it source or target), we use probabilistic knowledge of the optima. Thompson sampling can be used to obtain the optima distributions for both the sources and the target. We then propose a novel measure of relatedness computed from the divergence between these distributions. The distributions are merged into a mixture distribution weighted by the similarities, which is used to build the acquisition function. We use Predictive Entropy Search (PES) as a vehicle to build our new acquisition function. Our method offers a “no bias” algorithm, in the sense that in the limit \(\left( T\rightarrow \infty \right) \) the similarity to any source tends to zero with probability 1, making Bayesian optimisation for the target free from any source-induced bias in the limit. We validate our framework on both synthetic functions and real-world experiments. We compare our method with three well known transfer learning methods as well as with the generic Bayesian optimisation algorithm and demonstrate its superiority.

2 Background

2.1 Gaussian Process

A Gaussian process is an effective method for regression and classification problems and has received considerable attention in the machine learning community. It serves as a probabilistic prior over smooth functions and can be viewed as a generalization of the Gaussian distribution to an infinite collection of normally distributed random variables. A Gaussian process is completely specified by a mean function \(\mu (\mathbf{x})\) and a covariance function \(k(\mathbf{x},\mathbf{x}^{'})\). Due to these properties, we can model a smooth function as a draw from a Gaussian process. Formally,

$$\begin{aligned} f(\mathbf{x})\sim \text {GP}(\mu (\mathbf{x}),k(\mathbf{x},\mathbf{x}^{'})) \end{aligned}$$
(1)

where the function value at a point \(\mathbf{x}\), i.e. \(f(\mathbf{x})\), is normally distributed and the relation between the function values at any two points \(\mathbf{x}\) and \(\mathbf{x}^{'}\) is modeled by the covariance function \(k(\mathbf{x},\mathbf{x}^{'})\). Without loss of generality, the mean function can be defined to be zero, making the Gaussian process fully specified by the covariance function [18, 19]. Popular choices of covariance functions include the squared exponential kernel, the Matérn kernel, the linear kernel, etc. We assume the function measurements are noisy, i.e. observations \(y_{i}=f(\mathbf{x}_{i})+\epsilon _{i}\), where \(\epsilon _{i}\sim \mathcal {N}(0,\sigma ^{2})\) is the measurement noise. Collectively, we denote the observations as \(\mathcal {D}_{n}=\left\{ \mathbf{x}_{1:n},\mathbf {y}_{1:n}\right\} \). The function values \([f(\mathbf{x}_{1}),\ldots ,f(\mathbf{x}_{n})]\) follow a multivariate Gaussian distribution \(\mathcal {N}(0,\mathbf{K})\), where

$$\begin{aligned} \mathbf{K}=\left[ \begin{array}{ccc} k(\mathbf{x}_{1},\mathbf{x}_{1}) & \ldots & k(\mathbf{x}_{1},\mathbf{x}_{n})\\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_{n},\mathbf{x}_{1}) & \ldots & k(\mathbf{x}_{n},\mathbf{x}_{n}) \end{array}\right] \end{aligned}$$
(2)

Given a new point \(\mathbf{x}\), \(\mathbf {y}_{1:n}\) and \(f(\mathbf{x})\) are jointly Gaussian; by the properties of the Gaussian process we can write

$$\begin{aligned} \left[ \begin{array}{c} \mathbf {y}_{1:n}\\ f(\mathbf{x}) \end{array}\right] \sim \mathcal {N}\left( 0,\left[ \begin{array}{cc} \mathbf {K}+\sigma ^{2}\mathbf {I} & \mathbf {k}\\ \mathbf {k}^{T} & k(\mathbf{x},\mathbf{x}) \end{array}\right] \right) \end{aligned}$$
(3)

where \(\mathbf {k}=\left[ k(\mathbf{x},\mathbf{x}_{1})\,\,k(\mathbf{x},\mathbf{x}_{2})\,\,\ldots \,\,k(\mathbf{x},\mathbf{x}_{n})\right] \). Using the Sherman-Morrison-Woodbury formula [18] we can write the predictive distribution at any \(\mathbf{x}\) as

$$\begin{aligned} p(y\mid \mathcal {D}_{n},\mathbf{x})=\mathcal {N}(\mu _{n}(\mathbf{x}),\sigma _{n}^{2}(\mathbf{x})) \end{aligned}$$
(4)

where the predictive mean \(\mu _{n}(\mathbf{x})\) is given as

$$\begin{aligned} \mu _{n}(\mathbf{x})=\mathbf{k}^{T}\left[ \mathbf{K}+\sigma ^{2}\mathbf {I}\right] ^{-1}\mathbf {y}_{1:n} \end{aligned}$$
(5)

and the predictive variance \(\sigma _{n}^{2}(\mathbf{x})\) is given as

$$\begin{aligned} \sigma _{n}^{2}(\mathbf{x})=k(\mathbf{x},\mathbf{x})-\mathbf{k}^{T}\left[ \mathbf{K}+\sigma ^{2}\mathbf {I}\right] ^{-1}\mathbf{k}. \end{aligned}$$
(6)
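To make these quantities concrete, the following is a minimal numpy sketch of the predictive equations (5) and (6), assuming a unit-variance squared exponential kernel with fixed lengthscale and noise variance; hyperparameter estimation is omitted and all function names (`sq_exp_kernel`, `gp_predict`) are illustrative rather than part of any library.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0):
    """Squared exponential kernel k(x, x') between the rows of A and B."""
    d2 = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-0.5 * np.maximum(d2, 0.0) / lengthscale**2)

def gp_predict(X, y, X_star, noise_var=0.01, lengthscale=1.0):
    """Predictive mean (Eq. 5) and variance (Eq. 6) of a zero-mean GP."""
    K = sq_exp_kernel(X, X, lengthscale)            # Gram matrix, Eq. (2)
    k_s = sq_exp_kernel(X, X_star, lengthscale)     # k in Eq. (3)
    # Apply [K + sigma^2 I]^{-1} via a Cholesky solve for numerical stability.
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = k_s.T @ alpha                              # Eq. (5)
    V = np.linalg.solve(L, k_s)
    var = 1.0 - np.sum(V**2, axis=0)                # Eq. (6); k(x, x) = 1 here
    return mu, var
```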

2.2 Bayesian Optimisation

Bayesian optimisation is an elegant approach for the global optimisation of expensive, black-box functions. Given a small set of observations from previous function evaluations, Bayesian optimisation proceeds by building a probabilistic model of the function, generally using a Gaussian process (GP), although other models such as random forests and Bayesian neural networks are also used [20, 21]. The model of the function is combined with the existing observations to derive a posterior distribution. This distribution is then used to construct a cheap-to-evaluate surrogate utility function, called the acquisition function, which is optimised to recommend the next function evaluation point while maintaining a trade-off between exploitation and exploration [9, 22].

Several popular acquisition functions are available in the literature: Probability of Improvement (PI), which considers the probability of improving over the current best function value [5]; Expected Improvement (EI), which considers the expected improvement over the current best [6]; and GP-UCB, which selects the evaluation point based on an upper confidence bound [7]. These functions are based on the predictive mean and variance of the model posterior. An alternative acquisition function that maximizes the expected posterior information gain about the location of the global optimum over an input space grid was proposed in [23, 24]. Another information-based acquisition function, Predictive Entropy Search (PES), extended this approach to continuous search spaces [8].
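As an illustration of how such acquisition functions use the posterior, here is a sketch of the GP-UCB rule on a candidate set, reusing the illustrative `gp_predict` helper from Sect. 2.1. The exploration weight `beta` is a fixed constant here, whereas [7] prescribes a schedule that grows with the iteration count.

```python
def gp_ucb(X_cand, X, y, beta=4.0, **gp_kwargs):
    """GP-UCB acquisition: posterior mean plus a multiple of the posterior
    standard deviation, evaluated over a candidate set."""
    mu, var = gp_predict(X, y, X_cand, **gp_kwargs)
    return mu + np.sqrt(beta * np.maximum(var, 0.0))

# Usage: evaluate next wherever the upper confidence bound is largest.
# x_next = X_cand[np.argmax(gp_ucb(X_cand, X, y))]
```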

2.3 Transfer Learning for Bayesian Optimisation

Transfer learning methods in Bayesian optimisation utilize ancillary information acquired from previous function (source) optimisations to achieve faster optimisation of a new related function (target). The crucial requirement is to determine which source function is highly similar to the target. Limited work exists on transfer learning in Bayesian optimisation, and most of it makes assumptions regarding similarity between the source and the target. For example, Bardenet et al. [11] proposed the first work on transfer learning for Bayesian optimisation by transferring the source knowledge via a ranking function, which was assumed to be applicable to the target function as well. Another similar approach proposed in [12] assumes that the deviations of a function from its mean, scaled by the standard deviation, are transferable from source to target. Both these methods make strong assumptions regarding source-target similarity and hence have difficulty finding the global optimum if, among many sources, some have different function shapes or different optima compared to the target. To handle potential differences between the source and the target, an alternative transfer learning framework was proposed by Joy et al. [14], modeling source data as noisy observations of the target function. The noise envelope is estimated from the differences between available source/target data and can be used to distinguish a related source from an unrelated one. Building on this method, Ramachandran et al. [25] proposed another transfer learning method for Bayesian optimisation that selects sources which are highly related to the target, using a multi-armed bandit formulation for the selection of optimal sources. However, these methods cannot leverage a related source whose output scale differs from that of the target, even when both have their optima at the same location. Meta-learning approaches proposed in [16, 17] can estimate source/target similarities; however, they require meta-features for the source and target functions, which may not be available in general.

3 Proposed Method

We propose a new information-theoretic transfer learning framework for Bayesian optimisation that utilizes data from source and target functions to maximize the information about the global minimum of the target function. We first discuss an information-theoretic framework for Bayesian optimisation known as Predictive Entropy Search (PES) and then present our proposed transfer learning model.

3.1 Optimisation of Black-Box Functions Using Predictive Entropy Search (PES)

Let \(f(\mathbf {x})\) be an expensive black-box function whose global minimizer \(\mathbf {x}_{*}=\underset{\mathbf {x}\in \mathcal {X}}{\text {argmin}}\,f(\mathbf {x})\) we seek over some domain \(\mathcal {X}\subset \mathbb {R}^{d}\). Let \(\mathcal {D}_{n}=\left\{ (\mathbf {x}_{i},y_{i})\right\} _{i=1}^{n}\) denote the noisy observations from the function \(f(\mathbf {x})\) under the observation model \(y_{i}=f(\mathbf {x}_{i})+\epsilon _{i}\), where \(\epsilon _{i}\sim \mathcal {N}(0,\sigma ^{2})\) is the measurement noise. Predictive entropy search is an acquisition function for Bayesian optimisation that recommends the next function evaluation point with the aim of maximizing the information about \(\mathbf {x}_{*}\), whose posterior distribution is \(p(\mathbf {x}_{*}\mid \mathcal {D}_{n})\). This posterior represents the likelihood of a location being the global minimum of the function after observing \(\mathcal {D}_{n}\). The information about \(\mathbf {x}_{*}\) is measured as the differential entropy of \(p(\mathbf {x}_{*}\mid \mathcal {D}_{n})\) minus the expected differential entropy of \(p\left( \mathbf {x}_{*}\mid \mathcal {D}_{n}\cup \left\{ \left( \mathbf {x},y\right) \right\} \right) \). Formally, the PES acquisition function selects a point \(\mathbf {x}_{n+1}\) that maximizes the information about \(\mathbf {x}_{*}\) as

$$\begin{aligned} \mathbf {x}_{n+1}=\underset{\mathbf {x}\in \mathcal {X}}{\text {argmax}}\alpha _{n}(\mathbf {x})=\mathbb {H}\left[ p(\mathbf {x}_{*}\mid \mathcal {D}_{n})\right] -\mathbb {E}_{p(y\mid \mathcal {D}_{n},\mathbf {x})}\left[ \mathbb {H}\left[ p\left( \mathbf {x}_{*}\mid \mathcal {D}_{n}\cup \left\{ \left( \mathbf {x},y\right) \right\} \right) \right] \right] \end{aligned}$$
(7)

where \(\mathbb {H}\left[ p(\mathbf {x})\right] =-\int p(\mathbf {x})\log p(\mathbf {x})d\mathbf {x}\) is the differential entropy and the expectation is with respect to the posterior predictive distribution of y given \(\mathbf {x}\). Evaluating the acquisition function in (7) directly is intractable and requires discretisation [24]. Noting that mutual information is symmetric, an easier yet equivalent formulation is:

$$\begin{aligned} \mathbf {x}_{n+1}=\underset{\mathbf {x}\in \mathcal {X}}{\text {argmax}}\alpha _{n}(\mathbf {x})=\mathbb {H}\left[ p(y\mid \mathcal {D}_{n},\mathbf {x})\right] -\mathbb {E}_{p(\mathbf {x}_{*}\mid \mathcal {D}_{n})}\left[ \mathbb {H}\left[ p\left( y\mid \mathcal {D}_{n},\mathbf {x},\mathbf {x}_{*}\right) \right] \right] \end{aligned}$$
(8)

The first term in (8) involves the predictive distribution \(p(y\mid \mathcal {D}_{n},\mathbf {x})\), which is Gaussian under Gaussian process modeling of \(f(\mathbf {x})\). Therefore, we have

$$ \mathbb {H}\left[ p(y\mid \mathcal {D}_{n},\mathbf {x})\right] =0.5\log \left[ 2\pi e\left( v_{n}\left( \mathbf {x}\right) +\sigma ^{2}\right) \right] $$

where \(v_{n}\left( \mathbf {x}\right) \) is the variance of \(p(y\mid \mathcal {D}_{n},\mathbf {x})\) and \(\sigma ^{2}\) is the variance due to measurement noise. The second term involves \(p\left( y\mid \mathcal {D}_{n},\mathbf {x},\mathbf {x}_{*}\right) \), which is the posterior distribution for y given the observed data \(\mathcal {D}_{n}\) and the location of the global minimizer of f. An exact form for the distribution \(p\left( y\mid \mathcal {D}_{n},\mathbf {x},\mathbf {x}_{*}\right) \) is intractable and its entropy \(\mathbb {H}\left[ p\left( y\mid \mathcal {D}_{n},\mathbf {x},\mathbf {x}_{*}\right) \right] \) for a given \(\mathbf {x}_{*}\) sample is usually computed using expectation propagation [26]. The expected value of the entropy with respect to the distribution \(p(\mathbf {x}_{*}\mid \mathcal {D}_{n})\) is approximated by averaging the entropy for Monte Carlo samples of \(\mathbf {x}_{*}\). A well known approach called Thompson sampling is typically used to draw the \(\mathbf {x}_{*}\) samples [27]. Using the estimated acquisition function, PES recommends the next evaluation point by the following maximization

$$\begin{aligned} \mathbf {x}_{n+1}=\underset{\mathbf {x}\in \mathcal {X}}{\text {argmax}}\alpha _{n}(\mathbf {x})=0.5\log \left[ v_{n}\left( \mathbf {x}\right) +\sigma ^{2}\right] -\frac{1}{M}\sum _{i=1}^{M}0.5\log \left[ v_{n}^{\left( i\right) }\left( \mathbf {x}\mid \mathbf {x}_{*}^{\left( i\right) }\right) +\sigma ^{2}\right] \end{aligned}$$
(9)

where M is the number of \(\mathbf {x}_{*}\) samples drawn, \(v_{n}\left( \mathbf {x}\right) \) is the predictive variance of \(p(y\mid \mathcal {D}_{n},\mathbf {x})\), and \(v_{n}^{\left( i\right) }\left( \mathbf {x}\mid \mathbf {x}_{*}^{\left( i\right) }\right) \) is the predictive variance conditioned on the i-th sample of \(\mathbf {x}_{*}\). Pseudo-code for Bayesian optimisation using the PES acquisition function is presented in Algorithm 1. For further details on PES, we refer the reader to [8].
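To make the sampling step concrete, below is a grid-based Thompson sampling sketch for drawing \(\mathbf {x}_{*}\) samples, reusing the illustrative `gp_predict` and `sq_exp_kernel` helpers from Sect. 2.1. This is a simplification we assume for illustration: [8] instead draws approximate posterior functions analytically via random-feature expansions rather than over a finite grid.

```python
def thompson_sample_minimisers(X, y, n_samples, bounds, n_grid=500, noise_var=0.01):
    """Approximate samples of the global minimiser x_*: draw functions from the
    GP posterior on a random candidate grid and record where each is smallest."""
    d = bounds.shape[0]                      # bounds: (d, 2) array of box limits
    X_grid = np.random.uniform(bounds[:, 0], bounds[:, 1], size=(n_grid, d))
    # Posterior mean and full covariance over the grid (jitter for stability).
    mu, _ = gp_predict(X, y, X_grid, noise_var=noise_var)
    L = np.linalg.cholesky(sq_exp_kernel(X, X) + noise_var * np.eye(len(X)))
    V = np.linalg.solve(L, sq_exp_kernel(X, X_grid))
    cov = sq_exp_kernel(X_grid, X_grid) - V.T @ V + 1e-8 * np.eye(n_grid)
    F = np.random.multivariate_normal(mu, cov, size=n_samples)  # posterior draws
    return X_grid[np.argmin(F, axis=1)]                         # one x_* per draw
```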

Algorithm 1. Bayesian optimisation using the PES acquisition function.
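A control-flow sketch of Algorithm 1 under the same assumptions as above. Here `pes_acquisition` (standing in for Eq. (9)) and `maximise_acquisition` are hypothetical placeholders: the conditioned variances \(v_{n}^{(i)}\) require the expectation propagation machinery of [8], which we do not reproduce.

```python
def pes_bo(f, bounds, n_iter, n_init=4, M=50):
    """Bayesian optimisation loop with a PES-style acquisition (Algorithm 1)."""
    d = bounds.shape[0]
    X = np.random.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, d))
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        # Draw M minimiser samples from p(x_* | D_n) via Thompson sampling.
        x_star = thompson_sample_minimisers(X, y, M, bounds)
        # Maximise the Monte Carlo PES acquisition of Eq. (9) (placeholders).
        x_next = maximise_acquisition(
            lambda x: pes_acquisition(x, X, y, x_star), bounds)
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    return X[np.argmin(y)], y.min()
```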

3.2 The Proposed Transfer Learning Method

Since the goal of Bayesian optimisation here is to find the minimum of a function, we develop a transfer learning method that directly transfers knowledge about the location of the global minimum from source functions to the target. Let \(\left\{ \{\mathbf{x}_{i}^{s},y_{i}^{s}\}_{i=1}^{N_{s}}\right\} _{s=1}^{S}\) be the observations from S sources under the observation model \(y_{i}^{s}=f^{s}(\mathbf {x}_{i}^{s})+\epsilon _{i}^{s}\), where \(\epsilon _{i}^{s}\sim \mathcal {N}(0,\sigma ^{2})\), and let \(p^{s}(\mathbf {x}_{*})\) be the distribution of the global minimizer \(\mathbf {x}_{*}\) for each source s. Similarly, let \(\left\{ \mathbf{x}_{j},y_{j}\right\} _{j=1}^{n_{0}}\) be the target observations up to iteration \(n_{0}\) under the observation model \(y_{j}=f(\mathbf {x}_{j})+\epsilon _{j}\), where \(\epsilon _{j}\sim \mathcal {N}(0,\sigma ^{2})\), and let \(p(\mathbf {x}_{*})\) be the distribution of its global minimizer \(\mathbf {x}_{*}\). Our proposed transfer learning scheme intervenes in the distribution of \(\mathbf {x}_{*}\) and modifies it to become a mixture of \(p(\mathbf {x}_{*})\) from the target and \(p^{s}(\mathbf {x}_{*})\) from each source s. The proposed mixture distribution can be formally written as

$$\begin{aligned} p^{\text {TL}}(\mathbf {x}_{*})=\pi _{0}p(\mathbf {x}_{*})+\pi _{1}p^{1}(\mathbf {x}_{*})+\ldots +\pi _{S}p^{S}(\mathbf {x}_{*}) \end{aligned}$$
(10)

where \(\pi _{0},\pi _{1},\ldots ,\pi _{S}\) are the mixture coefficients such that \(\sum _{s=0}^{S}\pi _{s}=1\). Our model sets these mixture coefficients in proportion to the similarity between the target and each source. We first define a similarity measure \(\psi _{s}\) between the target \(p(\mathbf {x}_{*})\) and each source \(p^{s}(\mathbf {x}_{*})\); \(\psi _{0}\), the similarity of \(p(\mathbf {x}_{*})\) with itself, is set to 1. We define \(\pi _{0},\pi _{1},\ldots ,\pi _{S}\) as

$$\begin{aligned} \pi _{s}=\frac{\psi _{s}}{\sum _{s=0}^{S}\psi _{s}}. \end{aligned}$$
(11)

Given the proposed mixture distribution \(p^{\text {TL}}(\mathbf {x}_{*})\), our proposed information-theoretic transfer learning maximizes the following acquisition function

$$\begin{aligned} \mathbf {x}_{n+1}=\underset{\mathbf {x}\in \mathcal {X}}{\text {argmax}}\alpha _{n}(\mathbf {x})=\mathbb {H}\left[ p(y\mid \mathcal {D}_{n},\mathbf {x})\right] -\mathbb {E}_{p^{\text {TL}}(\mathbf {x}_{*}\mid \mathcal {D}_{n})}\left[ \mathbb {H}\left[ p\left( y\mid \mathcal {D}_{n},\mathbf {x},\mathbf {x}_{*}\right) \right] \right] \end{aligned}$$
(12)

The entropies in the above equation are computed as in (9).

Source/Target Similarity Measure. Given two probability distributions \(p(\mathbf {x}_{*})\) and \(p^{s}(\mathbf {x}_{*})\), we measure the divergence between p and \(p^{s}\) using their Kullback-Leibler (KL) divergence, which takes non-negative values and is unbounded. After measuring the divergence (denoted \(D(p^{s}||p)\)), we map it to a similarity measure as \(\psi _{s}=\text {exp}(-\frac{D(p^{s}||p)}{\eta })\), where \(\eta >0\) is a model hyperparameter. Other divergence measures, such as the Hellinger distance, total variation distance or \(\chi ^{2}\)-divergence, could also be used [28].
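A short sketch of the resulting weighting, with index 0 reserved for the target itself (\(\psi _{0}=1\)); the divergence values are assumed to come from the nearest-neighbour estimator described next.

```python
def mixture_weights(divergences, eta=10.0):
    """Map divergences [D(p^1||p), ..., D(p^S||p)] to the mixture coefficients
    pi_0, ..., pi_S of Eqs. (10)-(11); psi_0 = 1 is the target's self-similarity."""
    psi = np.concatenate([[1.0], np.exp(-np.asarray(divergences) / eta)])
    return psi / psi.sum()

# Example: a close source (small divergence) outweighs a distant one.
# mixture_weights([0.5, 8.0])  -> approx. [0.42, 0.40, 0.19] with eta = 10
```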

Since we have access only to samples of \(\mathbf {x}_{*}\) and not to closed forms of the probability distributions \(p(\mathbf {x}_{*})\) and \(p^{s}(\mathbf {x}_{*})\), we need to estimate these densities before computing the KL-divergence. A naive way to estimate a probability density is via a histogram with uniform binning; however, this method quickly becomes inefficient in the number of samples required. To avoid this inefficiency, we estimate the KL-divergence based on nearest neighbor (NN) distances [29, 30], an approach that relies on non-parametric density estimation and then uses it to compute the KL-divergence. We refer to this estimate as the NN-divergence. Let \(\left\{ \mathbf {x}_{*}^{s,1},\ldots ,\mathbf {x}_{*}^{s,n}\right\} \) and \(\left\{ \mathbf {x}_{*}^{1},\ldots ,\mathbf {x}_{*}^{m}\right\} \) be the d-dimensional samples drawn from \(p^{s}(\mathbf {x}_{*})\) and \(p(\mathbf {x}_{*})\) respectively. The NN-divergence is estimated as

$$\begin{aligned} D_{n,m}\left( p^{s}(\mathbf {x}_{*})\parallel p(\mathbf {x}_{*})\right) =\frac{d}{n}\sum _{i=1}^{n}\log \frac{\tau _{m}(i)}{\rho _{n}(i)}+\log \frac{m}{n-1} \end{aligned}$$
(13)

where

$$ \tau _{m}(i)=\min _{j=1,\ldots ,m}\parallel \mathbf {x}_{*}^{s,i}-\mathbf {x}_{*}^{j}\parallel $$

is the distance of \(\mathbf {x}_{*}^{s,i}\) to its nearest neighbor in the target sample set \(\left\{ \mathbf {x}_{*}^{1},\ldots ,\mathbf {x}_{*}^{m}\right\} \) and

$$ \rho _{n}(i)=\min _{j=1,\ldots ,n,j\ne i}\parallel \mathbf {x}_{*}^{s,i}-\mathbf {x}_{*}^{s,j}\parallel $$

is the distance of \(\mathbf {x}_{*}^{s,i}\) to its nearest neighbor within the source sample set \(\left\{ \mathbf {x}_{*}^{s,1},\ldots ,\mathbf {x}_{*}^{s,n}\right\} \).
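A direct numpy transcription of the estimator (13), assuming the source and target sample sets contain no coincident points (duplicates would make \(\tau \) or \(\rho \) zero, so a small jitter may be needed in practice):

```python
def nn_divergence(Xs, Xt):
    """Nearest-neighbour estimate of D(p^s || p), Eq. (13) [29, 30].
    Xs: (n, d) samples of x_* from a source; Xt: (m, d) samples from the target."""
    n, d = Xs.shape
    m = Xt.shape[0]
    # tau_m(i): distance from each source sample to its nearest target sample.
    tau = np.min(np.linalg.norm(Xs[:, None, :] - Xt[None, :, :], axis=2), axis=1)
    # rho_n(i): distance from each source sample to its nearest *other* source sample.
    pairwise = np.linalg.norm(Xs[:, None, :] - Xs[None, :, :], axis=2)
    np.fill_diagonal(pairwise, np.inf)
    rho = np.min(pairwise, axis=1)
    return (d / n) * np.sum(np.log(tau / rho)) + np.log(m / (n - 1))
```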

Fig. 1. Evolution of \(p^{\text {TL}}(\mathbf {x}_{*})\) samples with increasing iterations for a one-dimensional target function (minimum at 0.3): (a)–(b) show histogram representations of \(p^{s}(\mathbf {x}_{*})\) for two one-dimensional source functions with minima at \(-0.7\) and 0.4 respectively; (c) shows the histogram representation of \(p(\mathbf {x}_{*})\) at the start of optimisation; (d)–(f) show the histogram representation of \(p^{\text {TL}}(\mathbf {x}_{*})\) at \(T=1,3\) and 5 respectively. The contribution of the source farther from the target reduces quickly while that of the related source increases.

Discussion. Since the target does not have many observations in the initial iterations, the global minima samples from the target distribution \(p(\mathbf {x}_{*})\) are spread widely over the domain \(\mathcal {X}\). Therefore the NN divergence of each source from the target will have similar values, which in turn causes our transfer learning framework to choose an almost equal number of samples from each source in the initial iterations. As the iterations increase, the NN divergence estimate improves as more target observations are made. This increases the contribution of related sources and decreases the contribution of unrelated sources in the mixture distribution \(p^{\text {TL}}(\mathbf {x}_{*})\). Asymptotically, the contribution of the \(p^{s}(\mathbf {x}_{*})\) distribution of an unrelated source becomes zero, making our transfer learning algorithm capable of preventing negative transfer from unrelated sources. In the limit, when our algorithm has densely sampled the target function, \(p(\mathbf {x}_{*})\) becomes nearly an impulse function and its KL-divergence from any source becomes extremely large, implying that \(p^{\text {TL}}(\mathbf {x}_{*})\rightarrow p(\mathbf {x}_{*})\) and the optimisation becomes free of any bias from the sources. Figure 1 illustrates the evolution of the mixture distribution \(p^{\text {TL}}(\mathbf {x}_{*})\): the contribution of the unrelated source reduces quickly whilst that of the related source increases. We considered two source functions of the form:

$$ f(\mathbf {x})=1-a\,\text {exp}\left( -\frac{1}{2}(\mathbf {x}-\mu )(\mathbf {x}-\mu )^{T}\right) $$

For source 1, \(\mu =-0.7\), \(a=2\); for source 2, \(\mu =0.4\), \(a=2\). The target function has a similar form with \(\mu =0.3\), \(a=1\). For each source, 20 data points are sampled randomly from the support \(\left[ -1.5,1.5\right] \) and \(p^{s}(\mathbf {x}_{*})\) samples are drawn using the posterior Gaussian process model. Figures 1a and 1b show the histogram counts of the source \(p^{s}(\mathbf {x}_{*})\) samples. We also show the histogram count of \(p(\mathbf {x}_{*})\) samples drawn using the posterior Gaussian process model of the initial target data (Fig. 1c). Figures 1d–f show the evolution of \(p^{\text {TL}}(\mathbf {x}_{*})\) with increasing iterations. Initially the \(p^{\text {TL}}(\mathbf {x}_{*})\) samples are widely distributed. With increasing iterations, the mass of \(p^{\text {TL}}(\mathbf {x}_{*})\) samples near the global minimum location increases as the algorithm selects more samples from the closer source (minimum at 0.4). This example illustrates the typical behavior of our transfer learning algorithm in relying more on related sources.

Our proposed information-theoretic transfer learning algorithm for Bayesian optimisation is summarized in Algorithm 2.

Algorithm 2. Information-theoretic transfer learning for Bayesian optimisation.
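Combining the pieces sketched above, the control flow of Algorithm 2 might look as follows. Here `source_samples` holds pre-drawn \(\mathbf {x}_{*}\) samples from each source GP, and `pes_acquisition`/`maximise_acquisition` remain the hypothetical placeholders from the Algorithm 1 sketch.

```python
def ittl_bo(f, bounds, source_samples, n_iter, n_init=4, M=50, eta=10.0):
    """Information-theoretic transfer learning BO (Algorithm 2 sketch)."""
    d = bounds.shape[0]
    X = np.random.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, d))
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        target_samples = thompson_sample_minimisers(X, y, M, bounds)      # p(x_*)
        div = [nn_divergence(S, target_samples) for S in source_samples]  # Eq. (13)
        pi = mixture_weights(div, eta)                                    # Eq. (11)
        # Assemble M samples from the mixture p^TL(x_*) of Eq. (10).
        pools = [target_samples] + list(source_samples)
        counts = np.random.multinomial(M, pi)
        x_star_mix = np.vstack([pool[np.random.choice(len(pool), c)]
                                for pool, c in zip(pools, counts) if c > 0])
        # Recommend the next point by maximising the acquisition of Eq. (12).
        x_next = maximise_acquisition(
            lambda x: pes_acquisition(x, X, y, x_star_mix), bounds)
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    return X[np.argmin(y)], y.min()
```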

4 Experiments

We perform experiments using both synthetic and real optimisation tasks. Through synthetic experiments, we analyze the behavior of our proposed transfer learning method in a controlled setting. Through real-data experiments, we show that our method can tune the hyperparameters of support vector machines and elastic net efficiently. For both synthetic and real experiments, we compare our proposed method with three other well known transfer learning methods and with Bayesian optimisation that does not use any knowledge from sources. The following baselines are used:

  • Env-GP: This algorithm [14] models a source function as noisy measurements of the target function. The noise for each source is estimated separately and then the observations from each source are merged.

  • SMBO: This algorithm [12] transfers the deviation of a function from its mean, scaled by the standard deviation. The observations from each source are first standardized and then the data from all sources are merged.

  • SCoT: This algorithm [11] transfers function ranking from a source to the target in latent space. The observations from each source are first adjusted and then data from all sources are merged.

  • Generic-BO (No Transfer): This baseline is a PES-based Bayesian optimisation algorithm (see Algorithm 1) that does not use any information from source functions. It is used to assess the gain in optimisation performance due to transfer learning.

4.1 Experimental Setting

We use the squared exponential kernel as the covariance function in Gaussian process (GP) modeling. All GP hyperparameters are estimated using maximum a posteriori (MAP) estimation. In all our experiments, the hyperparameter \(\eta \) (used in the similarity measure) is set to 10. All results are averaged over 10 runs with random initializations.

4.2 Synthetic Experiments

We consider 4 source functions as:

$$ f(\mathbf {x})=1-a\,\text {exp}\left( -\frac{1}{2}(\mathbf {x}-\varvec{\mu })(\mathbf {x}-\varvec{\mu })^{T}\right) $$

For source 1, \(\varvec{\mu }=\left[ -1.8,-1.8,-1.8\right] \), \(a=2\); for source 2, \(\varvec{\mu }=\left[ -0.7,-0.7,-0.7\right] \), \(a=2\); for source 3, \(\varvec{\mu }=\left[ 0.4,0.4,0.4\right] \), \(a=2\); for source 4, \(\varvec{\mu }=\left[ 1.5,1.5,1.5\right] \), \(a=2\). The target function has a similar form with \(\varvec{\mu }=\left[ 0.3,0.3,0.3\right] \), \(a=1\). For each source, 50 data points were sampled randomly from \(\left[ -2,2\right] \) along each dimension. Our transfer learning method selects \(p^{\text {TL}}(\mathbf {x}_{*})\) samples from any of the four sources based on the estimated source/target similarity. Figure 2a shows the minimum function value obtained versus iterations for the proposed method and the baselines. Each method starts with the same four random observations. Our method outperforms the baselines, achieving \(95\%\) of the minimum value by the 8th iteration. The performance of the Env-GP method is poor because it estimates source-target similarity by measuring the difference between the source and target functions, not the locations of their optima. The other two transfer learning baselines also show poor performance because some of the sources are not similar to the target. Figure 2b shows the mixture proportions of the sources and the target distribution with respect to increasing iterations. As seen from the figure, the source with its minimum at \(\left[ 0.4,0.4,0.4\right] \) contributes maximally to the mixture distribution. As the iterations increase, the contribution of sources with minima distant from the target's minimum reduces to small values.

Fig. 2. Synthetic experiments: (a) minimum function value vs optimisation iterations; (b) proportions of the various sources and the target in the mixture distribution with respect to optimisation iterations.

4.3 Hyperparameter Tuning

We tune the hyperparameters of two classification algorithms: Support Vector Machine (SVM) and elastic net. We consider 5 binary classification datasets: ‘banana’, ‘breast cancer’, ‘colon cancer’, ‘german numer’ and ‘diabetes’ (LibSVM repository [31]). A brief summary of these datasets is provided in Table 1. We train a classifier for each dataset. For each dataset, \(70\%\) of the data is chosen randomly for training and the remaining \(30\%\) is used for validation. Hyperparameter tuning involves optimising the validation performance (measured via AUC) as a function of the hyperparameter values. We use the first 4 hyperparameter tuning functions (‘banana’, ‘breast cancer’, ‘colon cancer’ and ‘german numer’) as source functions and the hyperparameter tuning function of ‘diabetes’ as the target function. We assume that several samples from the source functions are already available.

Table 1. Binary datasets used in our experiments.

SVM with an RBF kernel has two hyperparameters to tune: the cost parameter (C) and the kernel parameter (\(\gamma \)). The range for \(\gamma \) is set as \(\left[ 10^{-3},10^{3}\right] \) and that for C as \(\left[ 2^{-3},2^{3}\right] \). We run our proposed method and the baseline methods and report the results in Fig. 3. Figure 3a shows the AUC performance on the held-out validation set. The baseline Generic-BO (No Transfer) shows better performance than the other three transfer learning baselines. In contrast, the proposed method outperforms Generic-BO (No Transfer), converging faster than all the baselines. Figure 3b shows the proportions of contributions from different sources versus iterations.

Fig. 3. Hyperparameter tuning (SVM): (a) 1-AUC vs optimisation iterations; (b) proportions of the sources and the target in the mixture distribution with respect to optimisation iterations.

Fig. 4. Hyperparameter tuning (Elastic net): (a) 1-AUC vs optimisation iterations; (b) proportions of the sources and the target in the mixture distribution with respect to optimisation iterations.

Elastic net has two hyperparameters to tune: the \(L_{1}\) and \(L_{2}\) penalty weights. The ranges for both hyperparameters are set as \(\left[ 10^{-2},10^{0}\right] \). The performance in terms of AUC on the held-out validation set is shown in Fig. 4a. Our method performs significantly better than all the baselines. A plot depicting the proportions of contributions from different sources versus iterations is shown in Fig. 4b. The source code used for all these experiments is available at https://github.com/AnilRamachandran/ITTLBO.git and the datasets are available at https://doi.org/10.7910/DVN/LRNLZV.

5 Conclusion

We have proposed a novel information-theoretic transfer learning algorithm for Bayesian optimisation. Our algorithm constructs a mixture of the optima distributions of the sources and the target, weighting them by the divergence between each source's optima distribution and the target's. This mixture distribution is then used to formulate a new information-theoretic acquisition function in a manner similar to Predictive Entropy Search. In the limit \(\left( T\rightarrow \infty \right) \) the optimisation becomes free from the influence of any source with probability 1. We evaluate our algorithm on diverse optimisation tasks and show that it outperforms other well known transfer learning methods.

The framework proposed in this paper is a first attempt to build an information-theoretic transfer learning framework for Bayesian optimisation. There are several possibilities for applying this idea in other related settings, such as optimal sensor placement in monitoring systems [32], optimal experimental design for reservoir forecasting [33] and automatic emulator construction for radiative transfer models (RTMs) [34].