A novel statistical time-series pattern based interval forecasting strategy for activity durations in workflow systems

https://doi.org/10.1016/j.jss.2010.11.927

Abstract

Forecasting workflow activity durations is of great importance for delivering satisfactory QoS in workflow systems. Traditionally, a workflow system is designed to facilitate process automation in a specific application domain where activities are of a similar nature. Hence, a particular forecasting strategy is employed by a workflow system and applied uniformly to all of its workflow activities. However, with the newly emerging requirement to serve as a type of middleware service for high performance computing infrastructures such as grid and cloud computing, more and more workflow systems are designed to be general purpose so as to support workflow applications from many different domains. Consequently, the forecasting strategies in workflow systems must adapt to different workflow applications, which are normally executed repeatedly, such as data/computation intensive scientific applications (mainly with long-duration activities) and instance intensive business applications (mainly with short-duration activities). In this paper, with a systematic analysis of the above issues, we propose a novel statistical time-series pattern based interval forecasting strategy which has two versions: a complex version for long-duration activities and a simple version for short-duration activities. The strategy consists of four major functional components: duration series building, duration pattern recognition, duration pattern matching and duration interval forecasting. Specifically, a novel hybrid non-linear time-series segmentation algorithm is designed to facilitate the discovery of duration-series patterns. The experimental results on real world examples and simulated test cases demonstrate the excellent performance of our strategy in forecasting activity duration intervals for both long-duration and short-duration activities, in comparison to some representative time-series forecasting strategies used in traditional workflow systems.

Introduction

Given the recent popularity of high performance computing for e-science and e-business applications, and more importantly the emerging applicability of workflow systems to facilitate high performance computing infrastructures such as grid and cloud computing, there is an increasing demand to investigate non-traditional workflow systems (Deelman et al., 2008, Yu and Buyya, 2005a, Yuan et al., 2010). One of the research issues is the interval forecasting strategy for workflow activity durations (Liu et al., 2008b, Nadeem and Fahringer, 2009a, Nadeem and Fahringer, 2009b, Smith et al., 2004). It is of great significance for delivering satisfactory QoS (Quality of Service) in workflow systems where workflow applications are often executed repeatedly. For example, grid workflow systems can support many data/computation intensive scientific applications such as climate modelling, disaster recovery simulation, astrophysics and high energy physics, as well as many instance intensive business applications such as bank transactions, insurance claims, securities exchange and flight bookings (Taylor et al., 2007, Wang et al., 2009). These scientific and business processes are modelled or redesigned as workflow specifications (consisting of, for example, workflow task definitions, process structures and QoS constraints) at the build-time modelling stage (Chen and Yang, 2010, Hsu and Wang, 2008, Liu et al., 2008c, Zeng et al., 2008). The specifications may contain a large number of computation and data intensive activities together with their non-functional requirements such as QoS constraints on budget and time (Chen and Yang, 2008, Li et al., 2004, Zhao et al., 2004, Martinez et al., 2007). Then, at the run-time execution stage, with the support of workflow execution functionalities such as workflow scheduling (Stavrinides and Karatza, 2010, Neng and Zhang, 2009), load balancing (Taylor et al., 2007) and temporal verification (Chen and Yang, 2007), workflow instances are executed by employing the computing and data sharing capabilities of the underlying computing infrastructures with satisfactory QoS (Son and Kim, 2001). Interval forecasting for workflow activity durations, i.e. forecasting the upper bound and the lower bound of workflow activity durations, is required both at build time for creating workflow QoS specifications and at runtime for supporting workflow execution functionalities (Liu et al., 2008c, Zhuge et al., 2001). Therefore, effective interval forecasting strategies need to be investigated for workflow systems.

Traditional workflow systems are often designed to support process automation in specific application domains. For example, Staffware (van der Aalst and Hee, 2002) and the SAP workflow system (SAP, 2010) are designed mainly to support business workflow applications for enterprises, while the GridBus Project (2010) and the Kepler Project (2010) are designed mainly to facilitate scientific workflow applications for research institutes. However, with the emergence of high performance computing infrastructures, especially cloud computing which is becoming a type of computing utility providing the basic level of computing services (Buyya et al., 2009), workflow systems should be able to support a large number of workflow applications regardless of their specific application domains. In the real world, most workflow applications can be classified as either scientific workflows or business workflows (van der Aalst and Hee, 2002, Barga and Gannon, 2007, Yu and Buyya, 2005b). Therefore, in this paper, we focus on two representative types of workflow applications: data/computation intensive scientific workflows (Deelman et al., 2008) and instance intensive business workflows (Liu et al., 2008a). In data/computation intensive scientific workflows, there is a large number of workflow activities which normally occupy some fixed resources and run continuously for a long time due to their own computational complexity, for example, the De-dispersion activity in a pulsar searching workflow which takes around 13 h to generate 90 GB of de-dispersion files (NMDC, 2010). On the contrary, in instance intensive business workflows, there is a large volume of concurrent, relatively simple workflow activities with fierce competition for dynamic resources, for example, the Clearing activity in a securities exchange business workflow which checks the balance between branches and clients with 50,000 transactions in 3 min (Liu et al., 2010). More examples are provided in Section 2.1. Therefore, it is important that the forecasting strategy in a workflow system can effectively adapt to the requirements of different workflow applications. Specifically, as will be discussed in Section 2.2, the execution time of a long-duration activity is mainly decided by the average performance of the workflow system over its lifecycle, while the execution time of a short-duration activity is mainly dependent on the system performance at the current moment when the activity is being executed. However, in a traditional workflow system, a particular forecasting strategy is normally designed for a specific type of workflow application and applied uniformly to all of its workflow instances (regardless of the differences between long-duration and short-duration activities). Therefore, new forecasting strategies in workflow systems need to be investigated.

Interval forecasting for activity durations in workflow systems is a non-trivial issue. On one hand, workflow activity durations consist of complex components. In workflows, activity durations cover the time intervals from the initial job submission to the final completion of each workflow activity. Hence, besides the exact running time on allocated resources, they also include extra time, i.e. workflow overheads. As introduced in Prodan and Fahringer (2008), there are four main categories of workflow overheads in scientific workflow applications: middleware overhead, data transfer overhead, loss of parallelism overhead and activity related overhead. For example, in grid workflows, activity durations involve many more affecting factors than the running time of conventional computation tasks, which is dominated by the load of high performance computing resources. On the other hand, the service performance is highly dynamic. For example, grid workflows are deployed on grid computing infrastructures where the performance of grid services is highly dynamic since they are organised and managed in a heterogeneous and loosely coupled fashion. Moreover, the workload of these shared resources, such as the computing units, the storage spaces and the network, changes dynamically. Therefore, many traditional multivariate models which consist of many affecting factors such as CPU load, memory space and network speed (Dinda and O'Hallaron, 2000, Wu et al., 2007, Zhang et al., 2008) are either not satisfactory in performance or too complex to be applicable in practice, let alone the fact that it is very difficult, if not impossible, to measure these affecting factors in distributed and heterogeneous computing environments such as grid and cloud (Dobber et al., 2007, Glasner and Volkert, in press). There are also many strategies which define "template"-like models (Nadeem and Fahringer, 2009b, Nadeem et al., 2006). These models may include workflow activity properties such as workflow structural properties (control and data flow dependencies, etc.), activity properties (problem size, executables, versions, etc.) and execution properties (scheduling algorithm, external load, number of CPUs, number of jobs in the queue, free memory, etc.), and then use "similarity search" to find the most similar activity instances and predict activity durations (Nadeem and Fahringer, 2009b). However, they suffer from the same problems faced by the traditional multivariate models mentioned above.

In this paper, we utilise time-series based forecasting models. In both scientific and business fields, time-series models are probably the most widely used statistical approaches for modelling and forecasting the dynamic behaviour of complex objects (Chatfield, 2004, Liu et al., 2008b). A time series is a set of observations made sequentially through time. Some representative time series, including marketing time series, temperature time series and quality control time series, are effectively applied in various scientific and business domains (Huang et al., 2007). Similarly, a workflow activity duration time series, or duration series for short, is composed of ordered duration samples obtained from workflow system logs or other forms of historical data. Here, the samples specifically refer to the historical durations of instances of the same workflow activity. Therefore, a duration series is a specific type of time series in the workflow domain, and the terms "time series" and "duration series" are used interchangeably in most cases in this paper. Instead of applying traditional multivariate models, we conduct univariate time-series analysis, which analyses the behaviour of the duration series itself to build a model of the correlation between its neighbouring samples, or in other words, forecasts future activity durations based only on past activity durations (Chatfield, 2004). Unlike "template" based strategies which need to identify and collect various data about workflow activity properties, all the information is embedded implicitly in the duration series. Therefore, the problems faced by both multivariate models and "template" based strategies, e.g. the complexity of models and the difficulty of data collection, can be overcome by time-series based forecasting strategies. Meanwhile, we focus on workflow systems in this paper since one of the fundamental requirements for time-series based forecasting strategies is a sufficient amount of samples. Clearly, such a requirement can normally be satisfied in workflow systems since the same workflow instances are frequently (or periodically) executed in order to realise certain scientific or business processes.
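
To make the notion of a duration series concrete, the following Python sketch (our own illustration, not part of the original paper) builds such a series for one recurring activity from workflow log records. The (activity_id, submission_time, completion_time) record format is an assumption for illustration only; real workflow system logs will differ.

```python
# A minimal sketch of duration series building from workflow logs.
# The log record format is assumed for illustration, not taken from the paper.
from datetime import datetime
from typing import List, Tuple

LogRecord = Tuple[str, datetime, datetime]  # (activity_id, submitted, completed)

def build_duration_series(logs: List[LogRecord], activity_id: str) -> List[float]:
    """Collect the ordered historical durations (in seconds) of one activity.

    Durations span from job submission to final completion, so they include
    workflow overheads as well as the running time on allocated resources.
    """
    samples = [
        (submitted, (completed - submitted).total_seconds())
        for aid, submitted, completed in logs
        if aid == activity_id
    ]
    samples.sort(key=lambda s: s[0])      # keep chronological order
    return [duration for _, duration in samples]
```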

Current forecasting strategies for computation tasks mainly rely on the prediction of CPU load (Akioka and Muraoka, 2004, Zhang et al., 2008). However, this is quite different from the prediction of workflow activity durations in high performance computing environments due to the reasons mentioned above. Besides, most forecasting strategies also neglect another important issue, namely the difference between long-duration activities (mainly in data/computation intensive scientific applications) and short-duration activities (mainly in instance intensive business applications). Typically for time-series forecasting strategies, the major requirements for long-duration activities are to handle the problems of limited sample size and frequent turning points in the duration series, while the major requirements for short-duration activities are to identify the latest system performance state and predict its changes (as detailed in Section 2.2). However, forecasting strategies in traditional workflow systems focus either on the prediction of the system performance state at the next moment or on the prediction of the average system performance over a long period, and thus overlook the different requirements of long-duration and short-duration activities by treating them in an identical way. Therefore, there is room to improve the accuracy of prediction if we differentiate long-duration activities from short-duration activities and investigate effective forecasting strategies which can adapt to their different requirements.

In this paper, a novel non-linear time-series segmentation algorithm named K–MaxSDev is proposed to facilitate a statistical time-series pattern based forecasting strategy for both long-duration and short-duration activities. For long-duration activities, two problems of the duration series seriously hinder the effectiveness of conventional time-series forecasting strategies: limited sample size and frequent turning points. Limited sample size impedes the fitting of time-series models, while frequent turning points, where dramatic deviations take place, significantly deteriorate the overall accuracy of time-series forecasting. To address these two problems, we utilise a statistical time-series pattern based forecasting strategy (denoted as K–MaxSDev(L), where L stands for long-duration). First, an effective periodical sampling plan is conducted to build representative duration series. Second, a pattern recognition process employs our K–MaxSDev time-series segmentation algorithm to discover the minimum number of potential patterns, which are further validated and associated with specified turning points. Third, given the latest duration sequences, pattern matching and interval forecasting are performed to make predictions based on the statistical features of the best-matched patterns. Meanwhile, concerning the occurrences of turning points, three types of duration sequences are identified and then handled with different pattern matching results. As for short-duration activities, the key is to identify the system performance state at the current moment and detect possible changes. Therefore, with fewer steps, a more efficient time-series pattern based forecasting strategy (denoted as K–MaxSDev(S), where S stands for short-duration) is applied to the latest duration sequences of short-duration activities. The advantages of the pattern based forecasting strategy, such as accurate intervals and the handling of turning points, are still well retained.
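
As an illustration of the pattern matching and interval forecasting steps described above, the following Python sketch summarises each discovered pattern by its sample mean and standard deviation, matches the latest duration sequence to the closest pattern, and returns a mu ± lambda*sigma interval. The distance measure and the coefficient lam (1.28, roughly an 80% two-sided interval under a normality assumption) are illustrative choices rather than the paper's exact ones, and the turning point handling is omitted.

```python
# A simplified sketch of pattern matching and duration interval forecasting,
# under our own assumptions (nearest-mean matching, normal-style interval).
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple

@dataclass
class DurationPattern:
    mu: float      # mean duration of the pattern segment
    sigma: float   # standard deviation of the pattern segment

def forecast_interval(latest: List[float],
                      patterns: List[DurationPattern],
                      lam: float = 1.28) -> Tuple[float, float]:
    """Return a (lower, upper) duration interval from the best-matched pattern."""
    recent_mean = mean(latest)
    best = min(patterns, key=lambda p: abs(p.mu - recent_mean))  # best match
    return (best.mu - lam * best.sigma, best.mu + lam * best.sigma)
```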

Both real world examples and simulated test cases are used to evaluate the performance of our strategy in the SwinDeW-G (Swinburne Decentralised Workflow for Grid) workflow system. The experimental results demonstrate that our time-series segmentation algorithm is capable of discovering the smallest potential pattern set compared with three generic algorithms: Sliding Windows, Top-Down and Bottom-Up (Keogh et al., 2001). The comparison results further demonstrate that our statistical time-series pattern based strategy performs better than several representative time-series forecasting strategies, including MEAN (Dinda and O'Hallaron, 2000), LAST (Dobber et al., 2007), Exponential Smoothing (ES) (Dobber et al., 2007), Moving Average (MA) (Chatfield, 2004), Auto Regression (AR) (Chatfield, 2004) and Network Weather Service (NWS) (Wolski, 1997), in the prediction of high confidence duration intervals and the handling of turning points.
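
For reference, the following sketch gives straightforward implementations of four of these baseline point predictors (MEAN, LAST, MA and ES); AR and NWS are omitted because they involve model fitting and forecaster selection. The default window size and smoothing factor are illustrative only.

```python
# Reference sketches of standard baseline predictors used for comparison.
from statistics import mean
from typing import List

def predict_mean(history: List[float]) -> float:
    return mean(history)                    # MEAN: average of all past samples

def predict_last(history: List[float]) -> float:
    return history[-1]                      # LAST: most recent sample

def predict_ma(history: List[float], window: int = 5) -> float:
    return mean(history[-window:])          # MA: moving average over a window

def predict_es(history: List[float], alpha: float = 0.5) -> float:
    s = history[0]                          # ES: exponential smoothing
    for x in history[1:]:
        s = alpha * x + (1 - alpha) * s
    return s
```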

The remainder of the paper is organised as follows. Section 2 presents two motivating examples and the problem analysis. Section 3 defines the time-series patterns and gives an overview of our pattern based time-series forecasting strategy for both long-duration and short-duration activities. Section 4 proposes the novel time-series segmentation algorithm which is the core of the whole strategy. Sections 5 and 6 describe the detailed algorithms designed for long-duration activities and short-duration activities, respectively. Section 7 presents comprehensive simulation experiments to evaluate the effectiveness of our strategy. Section 8 discusses the related work. Finally, Section 9 concludes the paper and points out future work.

Section snippets

Motivating examples and problem analysis

In this section, we first introduce two motivating examples, one for data/computation intensive scientific workflow and another for instance intensive business workflow. Afterwards, the problems with the forecasting of both long-duration and short-duration activities are analysed.

Time-series pattern based forecasting strategy

Given the problems analysed above, rather than fitting conventional linear time-series models, which is not ideal in this context, we investigate the idea of pattern based time-series forecasting. In this section, we first introduce the definition of statistical duration-series patterns and then present an overview of our statistical time-series pattern based forecasting strategy.

Novel time-series segmentation algorithm: K–MaxSDev

In this section, we propose K–MaxSDev, a non-linear time-series segmentation algorithm which is designed to formulate the duration series with a minimal number of segments, i.e. potential duration-series patterns. Here, K is the initial value for equal segmentation and MaxSDev is the Maximum Standard Deviation specified as the testing criterion. The intuition behind the initial K equal segmentation is an efficiency enhancement to the generic Bottom-Up algorithm which normally starts from the finest
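
The following Python sketch illustrates the basic idea only, under our own simplifying assumptions: the series is first cut into K roughly equal segments, and adjacent segments are then merged Bottom-Up style as long as the merged segment's standard deviation stays within MaxSDev. The splitting of over-deviating segments and other refinements of the hybrid K–MaxSDev algorithm are not reproduced here.

```python
# A simplified sketch of the segmentation idea (not the exact K-MaxSDev algorithm):
# K initial equal segments followed by Bottom-Up merging bounded by MaxSDev.
from statistics import pstdev
from typing import List

def k_maxsdev_segment(series: List[float], k: int, max_sdev: float) -> List[List[float]]:
    # 1. Initial K (roughly) equal segmentation.
    size = max(1, len(series) // k)
    segments = [series[i:i + size] for i in range(0, len(series), size)]

    # 2. Bottom-Up merging: repeatedly merge the adjacent pair whose merged
    #    standard deviation is smallest, while it stays within MaxSDev.
    while len(segments) > 1:
        costs = [pstdev(segments[i] + segments[i + 1]) for i in range(len(segments) - 1)]
        best = min(range(len(costs)), key=costs.__getitem__)
        if costs[best] > max_sdev:
            break
        segments[best:best + 2] = [segments[best] + segments[best + 1]]
    return segments
```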

Interval forecasting for long-duration activities in data/computation intensive scientific applications

As presented in Section 3, the interval forecasting strategy for long-duration activities in data/computation intensive scientific applications is composed of four functional components: duration series building, duration pattern recognition, duration pattern matching, and duration interval forecasting. In this section, we propose the detailed algorithms for all the functional components. Note that since pattern matching and interval forecasting are always performed together, we

Interval forecasting for short-duration activities in instance intensive business applications

In this section, we present the detailed algorithms for interval forecasting for short-duration activities in instance intensive business applications, which consists of three functional components: duration pattern recognition, pattern matching, and interval forecasting. Similarly, since pattern matching and interval forecasting are always performed together, we illustrate them with an integrated process.

Evaluation

In this section, both real world examples and simulated test cases are used to evaluate the performance of our forecasting strategy. The real world historic data are collected from the workflow system logs and other sources of historic data for three example workflow applications, viz. the pulsar searching workflow and the securities exchange workflow demonstrated in Section 2.1, and an additional weather forecast scientific workflow introduced in Liu et al. (2010a)

Related work

Generally speaking, there are five major dimensions of workflow QoS constraints including time, cost, fidelity, reliability and security (Yu and Buyya, 2005a). In this paper, we focus on the dimension of time. In fact, since time is the basic measurement of performance, temporal QoS is one of the most important issues in many service oriented/enabled processes such as service composition (Zhao et al., 2004). Real world scientific or business processes normally stay in a temporal context and are

Conclusions and future work

Interval forecasting for activity durations in workflow systems is of great importance since it is related to most workflow QoS and non-QoS functionalities such as load balancing, workflow scheduling and temporal verification. However, predicting accurate duration intervals is very challenging due to the dynamic nature of computing infrastructures. Most of the recent studies focus on the prediction of CPU load to facilitate the forecasting of computation intensive activities. Meanwhile,

Acknowledgments

This work is partially supported by the Australian Research Council under Linkage Project LP0990393 and the National Natural Science Foundation of China under Grant No. 70871033.


References (60)

• D. Yuan et al., A data placement strategy in scientific cloud workflows, Future Generation Computer Systems (2010)
• Q.T. Zeng et al., Conflict detection and resolution for workflows constrained by resources and non-determined durations, Journal of Systems and Software (2008)
• Y. Zhang et al., Predict task running time in grid environments based on CPU load predictions, Future Generation Computer Systems (2008)
• H. Zhuge et al., A timed workflow process model, Journal of Systems and Software (2001)
• S. Akioka et al., Extended forecast of CPU and network load on computational Grid
• R. Barga et al., Scientific versus business workflows, Workflows for e-Science (2007)
• C. Chatfield, The Analysis of Time Series: An Introduction (2004)
• J. Chen et al., Multiple states based temporal consistency for dynamic verification of fixed-time constraints in grid workflow systems, Concurrency and Computation: Practice and Experience (2007)
• J. Chen et al., A taxonomy of grid workflow verification and validation, Concurrency and Computation: Practice and Experience (2008)
• J. Chen et al., Temporal dependency based checkpoint selection for dynamic verification of temporal constraints in scientific workflow systems, ACM Transactions on Software Engineering and Methodology (2009)
• E. Deelman et al., Workflows and e-science: an overview of workflow system features and capabilities, Future Generation Computer Systems (2008)
• P.A. Dinda et al., Host load prediction using linear models, Cluster Computing (2000)
• M. Dobber et al., Statistical properties of task running times in a global-scale grid environment, Proc. Sixth IEEE International Symposium on Cluster Computing and the Grid (2006)
• C. Glasner et al., Adaps—a three-phase adaptive prediction system for the run-time of jobs based on... (in press)
• GridBus Project, http://www.gridbus.org, accessed on 1st August...
• Hadoop, http://hadoop.apache.org/, accessed on 1st August...
• J.W. Han et al., Data Mining: Concepts and Techniques (2006)
• Kepler Project, http://kepler-project.org/, accessed on 1st August...
• E. Keogh et al., An online algorithm for segmenting time series
• C.F. Lai et al., An evolutionary approach to pattern-based time series segmentation, IEEE Transactions on Evolutionary Computation (2004)

Xiao Liu received his master degree in management science and engineering from Hefei University of Technology, Hefei, China, 2007. He is currently a PhD student in the Centre for Complex Software Systems and Services in the Faculty of Information and Communication Technologies at Swinburne University of Technology, Melbourne, Australia. His research interests include workflow management systems, scientific workflow, business process management and data mining.

Zhiwei Ni received his master degree from the Department of Computer Science and Engineering, Anhui University, Hefei, China, 1991 and a PhD degree from the Department of Computer Science and Technology, University of Science and Technology of China, Hefei, China, in 2002, all in computer science. He is currently a full Professor in the School of Management and also the Director of the Institute of Intelligent Management in Hefei University of Technology, Hefei, China. His major research interests include Artificial Intelligence, Machine Learning, Intelligent Management and Intelligent Decision-making Techniques.

Dong Yuan was born in Jinan, China. He received the B.Eng. degree in 2005 and M.Eng. degree in 2008, both from Shandong University, Jinan, China, all in computer science. He is currently a PhD student in the Faculty of Information and Communication Technologies at Swinburne University of Technology, Melbourne, Vic., Australia. His research interests include data management in workflow systems, scheduling and resource management, grid and cloud computing.

Yuanchun Jiang received his bachelor degree in management science and engineering from Hefei University of Technology, Hefei, China. He is a PhD student in the Institute of Electronic Commerce in the School of Management at Hefei University of Technology. He is currently a visiting PhD student in the Joseph M. Katz Graduate School of Business at the University of Pittsburgh. His research interests include decision science, electronic commerce and data mining. He has published papers in journals such as Decision Support Systems, Expert Systems with Applications, and Knowledge-Based Systems.

Zhangjun Wu received his master degree in Software Engineering from the University of Science and Technology of China in 2005, Hefei, China. He is currently a Ph.D. student in Hefei University of Technology, Hefei, China. From March to September 2010, he was visiting the Faculty of Information and Communication Technologies at Swinburne University of Technology, Melbourne, Australia. His research interests include evolutionary algorithms, workflow scheduling and cloud computing.

Jinjun Chen received his Ph.D. degree in Computer Science and Software Engineering from Swinburne University of Technology, Melbourne, Australia in 2007. He is currently a Lecturer in the Centre for Complex Software Systems and Services in the Faculty of Information and Communication Technologies at Swinburne University of Technology, Melbourne, Australia. His research interests include: Scientific Workflow Management and Applications, Workflow Management and Applications in Web Service or SOC Environments, Workflow Management and Applications in Grid (Service)/Cloud Computing Environments, Software Verification and Validation in Workflow Systems, QoS and Resource Scheduling in Distributed Computing Systems such as Cloud Computing, Service Oriented Computing (SLA, Negotiation, Engineering, Composition), Semantics and Knowledge Management, Cloud Computing.

Yun Yang was born in Shanghai, China. He received a Master of Engineering degree from the University of Science and Technology of China, Hefei, China, in 1987, and a PhD degree from the University of Queensland, Brisbane, Australia, in 1992, all in computer science. He is currently a full Professor and Associate Dean (Research) in the Faculty of Information and Communication Technologies at Swinburne University of Technology, Melbourne, Australia. Prior to joining Swinburne as an Associate Professor in late 1999, he was a Lecturer and Senior Lecturer at Deakin University during 1996–1999. Before that, he was a Research Scientist at the DSTC-Cooperative Research Centre for Distributed Systems Technology during 1993–1996. He also worked at Beihang University in China during 1987–1988. He has published more than 160 papers in journals and refereed conferences. His research interests include software engineering; p2p, grid and cloud computing based workflow systems; service-oriented computing; Internet computing applications; and CSCW.

The initial work was published in Proc. of 4th IEEE International Conference on e-Science (e-Science08), pp. 23–30, Indianapolis, USA, December 2008.
