1 Introduction

In task scheduling, it is often assumed that the scheduler knows the execution times of the tasks, on the premise that techniques exist to predict task performance. However, producing accurate performance predictions is a non-trivial task, although a number of techniques have indeed been developed for this purpose (e.g., queuing-theory-based approaches [12]). The situation worsens for tasks that run simultaneously (co-run) on multiple CPU cores of a multi-core processor, for the following reasons. There is resource contention and interference among the co-running tasks, since they must share (contend for) resources in the computer such as internal buses, cache, memory and hard disk. The resource contention may lead to longer completion times of the tasks. The contention relation is complicated because both the intensity of contention and its type (i.e., which resource the tasks contend for most intensively) depend not only on the hardware specification of the system (such as cache size and memory bandwidth), but also on the characteristics of the co-running tasks (such as memory access frequency, I/O requirements and cache usage). The complex nature of the contention among co-running tasks makes it difficult to develop static formulas for accurate performance prediction.

In this paper, we investigate the performance of co-running tasks on multi-core chips and present a method to identify the influential factors, which are the performance events provided by the operating system, for the given co-running tasks. Further, we propose a machine learning-based approach to predicting the performance of the co-running tasks. Two prediction frameworks are developed for the two types of task in our model: repetitive tasks (i.e., tasks that arrive at the system repeatedly) and new tasks (i.e., tasks that are submitted to the system for the first time). The difference between the two is that we have the historical running information of repetitive tasks, whereas we have no prior knowledge about new tasks.

Given the limited information about new tasks, a two-stage online prediction framework is developed to predict the performance of co-running new tasks by sampling the performance events on the fly for a short period and then feeding the sampled results to the prediction framework. We conducted extensive experiments with the SPEC2006 benchmark suite to compare the effectiveness of the different machine learning methods considered in this paper, and present our observations and analysis. The results show that our prediction model can achieve accuracies of 99.38% and 87.18% for repetitive tasks and new tasks, respectively.

2 Motivating benchmark experiments

We used SPEC2006 to conduct benchmarking experiments investigating the impact of task co-running on performance.

Figure 1 shows the execution time of three benchmarks in SPEC2006, 401 (401.bzip2 is a compression program), 416 (416.gamess performs a wide range of quantum chemical computations) and 470 (470.lbm is a computational fluid dynamics program using the lattice Boltzmann method) [9], when they co-run with other SPEC2006 benchmarks on a multi-core processor. In Fig. 1a, the execution time of the solo-run of SPEC 401 (i.e., when the benchmark runs on a core without other programs co-running on other cores) is 89.4 s. When co-running with other benchmarks, the execution time of 401 varies from 89.50 s (co-running with 462) to 114.92 s (co-running with 470). Its performance degradation is noticeable: from 0 to 28%. The same phenomenon occurs with other benchmarks in SPEC. Through the experiment, we also observed that some benchmarks, such as SPEC 470, are more contention-sensitive (up to 65%) than others, such as SPEC 444 (up to 5%). In Fig. 1b, experiments are conducted on a quad-core processor to show the performance impact on SPEC 416 (470) when it runs under different degrees of contention. The x-axis represents the number of co-running tasks. For example, x-axis '0' means that SPEC 416 (470) is solo-running with no resource contention; x-axis '3' means that there are four SPEC 416 (470) tasks co-running on different cores within the quad-core processor. The y-axis represents the ratio of \(Makespan_{co-run}\) to \(Makespan_{solo}\). We can see that the resource contention leads to an enormous increase in the completion time of SPEC 470, while there is hardly any performance impact on SPEC 416 during co-running.

Fig. 1 Motivation experiments

Most recent studies cite cache misses and memory bandwidth as the most important factors affecting the performance of co-running tasks. Based on these two factors, several performance models have been constructed to formulate the impact of task co-running on multi-core processors [2, 5, 6, 18]. However, our research shows that more factors have a noticeable impact on co-running performance, such as branch-misses, context-switches and minor-faults. We collect 30 performance events provided by the operating system during the execution of co-running tasks, as shown on the x-axis of Fig. 2. Figure 2 shows the ratio of the values of these performance events gathered when SPEC 459 co-runs with 470 to those when SPEC 459 solo-runs. As can be seen from this figure, the values of the performance events change considerably when the benchmarks co-run.

Fig. 2 Comparison of performance events of SPEC 459 between solo execution and co-running

3 The performance prediction framework for co-running tasks

We develop the performance prediction framework for both repetitive tasks, which have been run in the system before, so that we have the historical performance event data from their solo-runs, and new tasks, which are submitted to run on the system for the first time.

When the co-running tasks are repetitive tasks, we use the historical performance event data of the individual tasks (their solo-run performance data) as the input of the prediction framework. When the co-running tasks contain new tasks, we have no prior knowledge about the new tasks, and therefore a different procedure is developed in this paper for them.

The data of all 30 performance events can be collected using Perf (a profiling tool in Linux) during the execution of the tasks. We introduce a notion called the rate of a performance event, which equals the collected value of a performance event divided by the execution time of the task (i.e., the frequency at which the performance event occurs during the execution of the task). We then use the rates of the performance events as the attributes of the training model.
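To make the collection step concrete, below is a minimal sketch, assuming a Linux system with Perf available and a hypothetical task binary ./task; the event names and CSV parsing may need adjusting for a particular kernel version, and only a subset of the 30 events is listed.

```python
import subprocess
import time

# A few of the 30 performance events (names as shown by `perf list`;
# availability varies across kernels and CPUs).
EVENTS = ["instructions", "cache-misses", "branch-misses", "context-switches"]

def solo_event_rates(cmd):
    """Run `cmd` under `perf stat` (CSV output via -x,) and return the rate
    of each event, i.e., its count divided by the task's execution time."""
    start = time.perf_counter()
    proc = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", ",".join(EVENTS), "--"] + cmd,
        capture_output=True, text=True,
    )
    elapsed = time.perf_counter() - start
    rates = {}
    for line in proc.stderr.splitlines():  # perf stat writes counters to stderr
        fields = line.split(",")
        if len(fields) > 2 and fields[2] in EVENTS:
            try:
                rates[fields[2]] = float(fields[0]) / elapsed
            except ValueError:             # e.g., "<not counted>"
                rates[fields[2]] = float("nan")
    return rates

# Example (hypothetical task binary):
# rates = solo_event_rates(["./task"])
```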

The performance impact of a task is defined as the ratio of its co-running completion time to its solo completion time. When tasks \(t_i\) and \(t_j\) co-run, the predicted performance impact of \(t_i\), denoted by \(PI\_T_i\), is represented as in Eq. 1, where \(\varGamma _{co-running}\) is a trained model, and \(PE_i\) and \(PE_j\) are the sets of solo-run performance events of tasks \(t_i\) and \(t_j\), respectively.

$$ PI\_T_i = \varGamma _{co-running} (PE_{i}, PE_{j}) $$
(1)
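As a concrete illustration, here is a minimal sketch of how \(\varGamma _{co-running}\) could be instantiated with one of the learners evaluated later in this paper (random forest, via scikit-learn); the function and variable names are ours, and the hyperparameters are illustrative rather than those used in our experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_gamma(pe_i, pe_j, impact):
    """Fit Gamma_co-running on pairs of solo-run event-rate vectors.

    pe_i, pe_j : arrays of shape (n_pairs, 30) -- solo-run rates of t_i, t_j
    impact     : array of shape (n_pairs,)     -- measured impact of t_i
    """
    X = np.hstack([pe_i, pe_j])  # one row per co-running pair
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, impact)
    return model

# Eq. (1): PI_Ti = Gamma(PE_i, PE_j)
# pi_ti = model.predict(np.hstack([pe_i_new, pe_j_new]).reshape(1, -1))
```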

To understand why our prediction model for new tasks works, consider a benchmarking experiment we conducted. Figure 3 shows the trend of selected performance events of SPEC 401 during its co-running with SPEC 403. In the experiments, we collected the performance event data once every 500 ms. The execution time of SPEC 401 (1 iteration) is 30 s, so we obtained 60 sampled data points (time intervals). As can be seen from this figure, many performance events show repeated or similar trends as the co-running tasks progress, which provides a ground for our prediction model for new tasks.

Fig. 3 The trend of performance events as the co-running tasks progress

Since we have no prior knowledge about the new tasks, we develop a two-stage prediction framework. In the first stage, we construct a prediction model for each performance event, giving 30 models in total, corresponding to the 30 performance events. We sample the performance events at a preset sampling rate for a preset period when the task is solo-executing; then we sample the performance events for the same length of period when the task is co-running with other tasks. For example, the performance events may be sampled once every 200 ms for 2 s for a task of 100 s. The prediction model for a performance event takes the sampled data of that event as input and predicts the value of the event when the solo-running and the co-running task complete, respectively. In the second stage, we use the impact ratios of the performance events to predict the performance impact of the co-running tasks. The impact ratio is defined as the relative change from the solo-running value of a performance event predicted in the first stage to its predicted co-running value (see Eq. 4).

The two-stage prediction model is formulated as follows. \(t_{sp}\) denotes the time period of sampling and int represents the sampling interval (the inverse of the sampling rate). Then \(t_{sp}/int\) is the number of sampled data points we obtain for a performance event (e.g., sampling every 200 ms for 2 s yields \(t_{sp}/int = 10\) data points per event).

The predicted value of each performance event when the solo-running (or co-running) task \(t_i\) completes, denoted by \(PPE_i^{solo}\) (or \(PPE_i^{co}\)), can be represented by the vector derived from Eq. 2 (or Eq. 3). \(F_1^n\) represents the prediction model for performance event n. \(s\_PE_{j}^n\) and \(c\_PE_{j}^n\) denote the performance event data of the j-th sampling interval for performance event n under solo-run and co-run, respectively. In our work, there are in total 30 performance events (i.e., \(m=30\)), so we train 30 models in the first stage.

$$ PPE_i^{solo} = \left[ F_1^1(s\_PE_1^1, \ldots, s\_PE_{t_{sp}/int}^1), \ldots, F_1^m(s\_PE_1^m, \ldots, s\_PE_{t_{sp}/int}^m) \right] $$
(2)
$$ PPE_i^{co} = \left[ F_1^1(c\_PE_1^1, \ldots, c\_PE_{t_{sp}/int}^1), \ldots, F_1^m(c\_PE_1^m, \ldots, c\_PE_{t_{sp}/int}^m) \right] $$
(3)

In the second stage, the impact ratio vector of task \(t_i\), denoted by \(IR_i\), can be derived from:

$$ IR_i = \frac{PPE_i^{co}-PPE_i^{solo}}{PPE_i^{solo}}$$
(4)

Then the prediction model for the performance impact of task \(t_i\) can be represented by

$$ PI\_T_i = \varGamma '_{co-running} (IR_i) $$
(5)

where \(\varGamma '_{co-running}\) represents the trained model for predicting the performance impact of new tasks. Note that we do not need the performance event data of the co-running task \(t_j\) as input to this formula, because the execution information of \(t_j\) is already reflected in the sampled performance event data, since tasks \(t_i\) and \(t_j\) are co-running. Furthermore, we can predict the performance impact of a specific task in the same way no matter how many tasks it co-runs with in a multi-core processor.
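Putting the two stages together, the following is a minimal sketch of the prediction path (the naming is ours, and the model objects are assumed to expose a scikit-learn-style predict() interface):

```python
import numpy as np

def two_stage_predict(stage1_models, solo_samples, co_samples, gamma_prime):
    """Sketch of the two-stage prediction for a new task t_i.

    stage1_models : list of the 30 per-event models F_1^n
    solo_samples  : array (30, t_sp // int) -- sampled s_PE values per event
    co_samples    : array (30, t_sp // int) -- sampled c_PE values per event
    gamma_prime   : the second-stage model Gamma'_co-running
    """
    # Stage 1 (Eqs. 2 and 3): predict each event's value at task completion
    ppe_solo = np.array([m.predict(s.reshape(1, -1))[0]
                         for m, s in zip(stage1_models, solo_samples)])
    ppe_co = np.array([m.predict(c.reshape(1, -1))[0]
                       for m, c in zip(stage1_models, co_samples)])
    # Stage 2 (Eqs. 4 and 5): impact ratios feed the performance-impact model
    ir = (ppe_co - ppe_solo) / ppe_solo
    return gamma_prime.predict(ir.reshape(1, -1))[0]
```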

In the above model representations, \(PPE_i^{solo}\) and \(PPE_i^{co}\) represent the first stage while \(PI\_T_i\) represents the second stage of the two-stage prediction model. The machine learning approaches used in our prediction frameworks are linear regression, naive Bayes, support-vector machine (SVM) and random forest. We examined these four popular machine learning approaches, aiming to identify the most effective one for our scenario.
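For reference, a sketch of how these four learners might be instantiated in scikit-learn (the hyperparameters are illustrative, not those used in our experiments):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, SVR

# Regression variants predict the performance impact as a number;
# classification variants predict a range (see Sect. 4). Naive Bayes is a
# classifier, so it appears only in the classification variant.
REGRESSORS = {
    "linear regression": LinearRegression(),
    "SVM": SVR(kernel="rbf"),
    "random forest": RandomForestRegressor(n_estimators=100),
}
CLASSIFIERS = {
    "naive Bayes": GaussianNB(),
    "SVM": SVC(kernel="rbf"),
    "random forest": RandomForestClassifier(n_estimators=100),
}
```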

4 Evaluation

The accuracy of our prediction frameworks plugged with the above four machine learning approaches is evaluated in this section. The testing environment is a personal computer with a 3.30 GHz dual-core Intel i5 CPU and 8 GB of memory. It has 32 KB L1d cache, 32 KB L1i cache, 256 KB L2 cache and 6144 KB L3 cache. The 28 benchmarks used in the evaluation come from the Standard Performance Evaluation Corporation (SPEC) benchmark suite. For each application, the features are collected only once, statically; these data are then reused for training and prediction. Note that the training time of our models varies from 2.1 to 4.6 s, and the prediction time of one instance is very short and can be neglected.

We co-run the benchmarks in SPEC on two cores. There are in total \(\left( {\begin{array}{c}28\\ 2\end{array}}\right) =406\) combinations. Each combination generates two instances, so we obtained 812 pieces of training data. We choose a 4:1 split for training and testing. Table 1 shows the experimental results for predicting the performance impact of co-running repetitive tasks and new tasks. We divide our experiments into three categories: static regression and static classification for repetitive tasks, and online regression for new tasks.

Fig. 4 Residual analysis for co-running task prediction (online)

Fig. 5 Residual analysis for co-running task prediction (static)

Figures 4 and 5 show the residual analysis of the trained models. 1) In the "Residuals vs. Fitted" plot, the y-axis is the residual while the x-axis is the fitted value. The residuals bounce randomly around the 0 line, which suggests that the assumption of a linear relationship is reasonable. There are several outliers that need to be removed from the training set, such as rows 647, 619 and 707. 2) The "Normal Q–Q" plot suggests that the residuals (and hence the error terms) are normally distributed, apart from the several outliers. 3) The "Scale-Location" plot shows whether the residuals are spread equally along the ranges of the predictors. Here, we see a horizontal line with equally (randomly) spread points, which suggests that the two trained models satisfy homoscedasticity. 4) The "Residuals vs. Leverage" plot helps to find influential cases, if any. The results for the static model are as expected, except that there exist several potentially problematic cases in the training set of the online model (identified by the row numbers of the data in the dataset).

The prediction result for regression is a number, while the prediction result for classification is a range. For regression, we set a tolerance of 3%: if the difference between the predicted result and the actual measurement is less than 3%, we regard the prediction as correct. We set this tolerance because the measured co-running time is not constant; the execution time of a specific task fluctuates even when it co-runs with the same task. For classification, we set reasonable ranges for the data, where the difference between the upper and lower bound is around 3% of the average performance impact of the application. If the actual performance impact resides within the range that we predict, we regard the prediction as correct.
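The two correctness criteria can be stated compactly as follows (a sketch; the relative-difference reading of the 3% tolerance is our interpretation):

```python
def regression_correct(predicted, measured, tolerance=0.03):
    """Regression: correct if the prediction is within the 3% tolerance
    of the measured co-running time."""
    return abs(predicted - measured) / measured < tolerance

def classification_correct(lower, upper, measured):
    """Classification: correct if the measured performance impact falls
    within the predicted range (whose width is ~3% of the average impact)."""
    return lower <= measured <= upper
```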

Table 1 Experiment results for predicting the co-running time for repetitive tasks and new tasks, respectively

In Table 1, we observe that our model can predict the co-running time accurately for most applications. Furthermore, among the predictors, SVM and random forest have the best accuracy for both static and online tests. Random forest achieves over 90% accuracy in both regression and classification for repetitive tasks. We also preprocess the data by discretizing and normalizing it (denoted by d&n), and compare with the prediction accuracies without discretization (denoted by n–d) and without normalization (denoted by n–n). We discretize the attributes (i.e., performance events) into a specific number of bins; this operator converts the selected numerical attributes into nominal attributes. To achieve this, a width is calculated as (max − min)/bins, where max and min represent the maximum and the minimum of each attribute, respectively. The range of numerical values is partitioned into segments of equal size (i.e., width), and the attribute values are assigned to the corresponding segments, as sketched below. Discretization helps to reduce the categories to a reasonable number that the classifier is able to handle.
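The binning operator just described amounts to the following sketch (the function name is ours):

```python
import numpy as np

def discretize_equal_width(values, bins):
    """Equal-width binning: width = (max - min) / bins, and each value
    is mapped to the index of its segment."""
    lo, hi = float(values.min()), float(values.max())
    width = (hi - lo) / bins
    if width == 0:                        # constant attribute: single bin
        return np.zeros(len(values), dtype=int)
    idx = np.floor((values - lo) / width).astype(int)
    return np.clip(idx, 0, bins - 1)      # the maximum lands in the last bin
```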

Compared with the prediction results without discretization and normalization, discretization and normalization achieve better accuracy in most cases. Discretized features are robust to abnormal data, which makes the prediction model more stable and reduces over-fitting. Normalization accelerates gradient descent and also achieves higher accuracy, especially for SVM and linear regression. In the online prediction model (for new tasks), we did not normalize the data because the input data are ratios.

Furthermore, we order the features by their importance scores under the random forest predictor. The IncNodePurity score reflects the total decrease in node impurity from splitting on a variable, averaged over all trees. In this way we find the features that are important for the prediction. In the order of decreasing importance scores, these features are instructions, bus-cycles, LLC-loads, L1-dcache-loads, dTLB-loads, branch-instructions, branch-loads and context-switches. These features are essential for predicting the performance impact, because the prediction accuracy drops significantly (to 79.3% and 71.79% for static prediction and online prediction, respectively) when we remove any of these features from the feature set of the input data.
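IncNodePurity is the importance measure reported by R's randomForest package; in scikit-learn, the impurity-based feature_importances_ attribute plays the analogous role. A minimal sketch of the ranking step under that assumption:

```python
import numpy as np

def rank_features(model, feature_names):
    """Rank features by a fitted random forest's impurity-based importances
    (scikit-learn's feature_importances_, the rough analogue of the
    IncNodePurity scores reported by R's randomForest)."""
    order = np.argsort(model.feature_importances_)[::-1]
    return [(feature_names[i], float(model.feature_importances_[i]))
            for i in order]
```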

5 Related work

Several works have explored the performance degradation problem of co-running tasks. Performance and energy models have been built to analyze and predict the performance impact [3, 11, 13, 14, 15, 16, 17, 19]. Reference [2] studies the impact of L2 cache sharing on concurrent threads. Reference [19] proposes an interference model that considers the time-variant inter-dependency among different levels of resource interference to predict application QoS metrics. Reference [18] decomposes parallel runtime into compute, synchronization and steal time, and uses the runtime breakdown to measure program progress and identify execution inefficiency under interference (in a virtual machine environment). Reference [1] reveals that the cross-application interference problem is related to the amount of simultaneous access to several shared resources. Reference [10] predicts the execution time of an application workload for a hypothetical change in the number of CPU cores of the hosting VM. Reference [8] gives insight into how to enrich the capability of existing approaches to predicting the performance of multicore systems. Reference [7] develops an efficient ELM based on the Spark framework (SELM), which includes three parallel subalgorithms, for big data classification. Reference [4] proposes a Patient Treatment Time Prediction (PTTP) algorithm to predict the waiting time of each treatment task for a patient.

Most of the above studies consider that features such as cache and memory bandwidth impact the co-running performance. However, our experimental results reveal that more performance events (such as ref-cycles, context-switches and branch-load-misses) may also play influential roles. In this work, we take into account all performance events provided by the profiling tool, Perf.

6 Conclusions and future work

This paper investigates the influential factors that affect the performance of co-running tasks. A performance model is built and several machine learning methods are applied to predict the performance impact of the co-running tasks. Experiments conducted with the SPEC2006 benchmark suite show that our prediction model of performance impact achieves accuracies of 99.38% and 87.18% on repetitive tasks and new tasks, respectively.

In future work, we will extend our research to modeling the performance impact of running tasks at different CPU frequencies, and to modeling the performance impact of time-sharing execution.