A novel general framework for automatic and cost-effective handling of recoverable temporal violations in scientific workflow systems
Introduction
Scientific workflow systems are a type of workflow management system that aims to support complex scientific processes in many e-science applications such as climate modelling, disaster recovery simulation, astrophysics and high energy physics (Deelman et al., 2008, Taylor et al., 2007). They can also be seen as high-level middleware services for high performance computing infrastructures such as cluster, grid, peer-to-peer (p2p) or cloud computing (Buyya et al., 2009, Foster and Kesselman, 2004, Kim et al., 2007, Yang et al., 2007). In recent years, owing to the growing demand for high performance computing infrastructures and large-scale distributed and collaborative e-science applications, scientific workflow systems have attracted increasing interest from distributed and parallel system researchers in the area of High Performance Computing (HPGC, 2009, PDSEC, 2009) and from software engineering researchers in the area of Software Engineering for Computational Science and Engineering (Chen and Yang, in press, SECES, 2008). One of the common research issues is how to deliver satisfactory workflow QoS (quality of service), i.e. how to satisfy workflow QoS constraints such as constraints on time, cost, fidelity, reliability and security (Son and Kim, 2001, Yu and Buyya, 2005). Among them, time is one of the basic measurements of system and software performance and hence attracts many researchers in the workflow area (van der Aalst et al., 2000, Chen and Yang, 2008, Duan et al., 2009, Eder et al., 1999, Li et al., 2004, Yu and Buyya, 2005, Zhuge et al., 2001).
In reality, a scientific workflow and its workflow segments are normally subject to specific temporal constraints, such as global temporal constraints (deadlines) on workflow instances and local temporal constraints (milestones) on workflow segments, in order to achieve predefined scientific goals on schedule (Li et al., 2004, Zeng et al., 2008). Otherwise, the timeliness of the execution results deteriorates significantly. For example, a daily weather forecast scientific workflow has to finish before the weather forecast programme is broadcast every day at, for instance, 6:00 pm. Meanwhile, given the large number of data- and computation-intensive activities involved in scientific investigation, scientific workflows are usually deployed on distributed high performance infrastructures such as grids and clouds. Therefore, to deliver satisfactory temporal QoS, violations of both local temporal constraints (local violations for short) and global temporal constraints (global violations for short) need to be proactively detected and handled (Zhuge et al., 2001). Recent studies on temporal verification in scientific workflows mainly focus on runtime checkpoint selection (Chen and Yang, in press) and multiple-state based temporal verification (Chen and Yang, 2007), which deal with the monitoring of temporal consistency states and the detection of potential temporal violations. However, a significant follow-up issue is how to handle those temporal violations. To date, work on this issue is still in its infancy, yet it must be properly addressed to guarantee high success rates for on-time completion of scientific workflows. Specifically, two fundamental requirements for handling temporal violations, automation and cost-effectiveness, need to be considered.
- (1)
Automation. Due to the complex nature of scientific applications and their distributed running environments such as grid and cloud, a large number of temporal violations can often be expected in scientific workflows. Moreover, since scientific workflow systems are designed to conduct large-scale scientific processes with a high degree of automation, human interventions, which are normally inefficient, should be avoided as much as possible, especially during workflow runtime (Deelman et al., 2008). Therefore, similar to dynamic checkpoint selection and temporal verification strategies (Chen and Yang, in press), handling strategies are required to tackle a large number of temporal violations automatically and relieve users of the heavy workload of handling those exceptions.
- (2)
Cost-effectiveness. The purpose of handling temporal violations is to reduce, or ideally remove, delays in workflow execution by applying exception handling strategies at the expense of additional cost, which consists of both monetary cost and time overheads. Conventional exception handling strategies for temporal violations, such as resource recruitment and workflow restructuring, are usually very expensive (Buhr and Mok, 2000, Hagen and Alonso, 2000, Prodan and Fahringer, 2008, Russell et al., 2006a). The cost of recruiting new resources (e.g. the cost of service discovery and deployment, and of data storage and transfer) is normally very high during workflow runtime in distributed computing environments (Prodan and Fahringer, 2008). Workflow restructuring, in turn, is usually realised by amending local workflow segments or temporal QoS contracts, i.e. modifying scientific workflow specifications through human decision makers (Liu et al., 2008b). However, due to budget (i.e. monetary cost) limits and temporal constraints, these heavy-weight strategies (with large monetary cost and/or time overheads) are usually too costly to be practical. To avoid them, recoverable violations (in contrast to severe temporal violations, which can be regarded as non-recoverable in practice) need to be identified first and then handled by light-weight strategies (with small monetary cost and/or time overheads) in a cost-effective fashion.
Given the requirement of Automation, exception handling strategies need to handle temporal violations automatically, without human intervention. Meanwhile, since most strategies are limited in their capability to recover temporal violations, a given handling strategy is normally effective only for a range of temporal violations with a limited amount of time deficit (the time delay with respect to a specific temporal constraint). Given the requirement of Cost-effectiveness, among all candidate strategies capable of handling the current temporal violation, ideally only the one with the lowest cost should be applied. Therefore, the definition of fine-grained temporal violations and the design of exception handling strategies should be investigated as two interdependent tasks within the same exception handling framework. However, since recent studies on temporal verification mainly focus on the detection of temporal violations, fine-grained temporal violations are usually defined for general purposes, ignoring the performance of the exception handling strategies available in specific workflow systems. For example, the work in Chen and Yang (2007) proposes a multiple-state based temporal consistency model. Besides SC (strong consistency), which requires no action, three types of fine-grained temporal inconsistency states, namely WC (weak consistency), WI (weak inconsistency) and SI (strong inconsistency), are defined based on the minimum, mean and maximum workflow execution time. However, without investigating the performance of different exception handling strategies, it is difficult to determine which strategy should be applied to handle the detected temporal violations.
Therefore, it is more reasonable to define fine-grained temporal violations according to the selection of exception handling strategies with different capabilities, rather than, as in most previous studies, defining fine-grained temporal violations first and then looking for available exception handling strategies. To the best of our knowledge, this is the first work to systematically investigate a general exception handling framework for automatic and cost-effective handling of temporal violations in scientific workflow systems.
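The selection principle above can be sketched in code. This is a minimal illustration only, not the paper's actual algorithm: the strategy names, capability bounds (the maximum time deficit each strategy can recover) and cost figures are hypothetical placeholders.

```python
# Minimal sketch: choose the cheapest handling strategy whose empirical
# capability covers the detected time deficit. All names and numbers are
# illustrative placeholders, not values from the paper.

def select_strategy(time_deficit, strategies):
    """Return the lowest-cost strategy capable of recovering the deficit,
    or None if the violation is beyond all candidate strategies."""
    capable = [s for s in strategies if s["max_recoverable_deficit"] >= time_deficit]
    if not capable:
        return None  # severe violation: non-recoverable by these strategies
    return min(capable, key=lambda s: s["cost"])

# Hypothetical candidate strategies (deficits in seconds, cost in arbitrary units).
CANDIDATES = [
    {"name": "light",  "max_recoverable_deficit": 30,  "cost": 1},
    {"name": "medium", "max_recoverable_deficit": 120, "cost": 5},
    {"name": "heavy",  "max_recoverable_deficit": 600, "cost": 20},
]

print(select_strategy(90, CANDIDATES)["name"])  # prints "medium"
```

Note that fine-grained violation levels fall out of this rule naturally: each level corresponds to the range of time deficits for which a particular strategy is the cheapest capable one.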
In this paper, along with a probability-based temporal consistency model which defines the range of recoverable temporal violations, a novel general automatic and cost-effective exception handling framework is proposed. Specifically, fine-grained temporal violations are first defined based on the empirical function for the capability lower bounds of the exception handling strategies. Afterwards, to serve as a case study, a concrete example framework is presented which consists of three levels of fine-grained temporal violations, viz. level I, level II and level III temporal violations defined within the recoverable probability range, and three light-weight automatic exception handling strategies, viz. TDA (Time Deficit Allocation), ACOWR (Ant Colony Optimisation based two-stage Workflow local Rescheduling) and TDA + ACOWR (the combination of TDA and ACOWR). Large-scale simulation experiments are conducted in the SwinDeW-G scientific grid workflow system (Yang et al., 2007) to evaluate the effectiveness of the example framework.
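The three-level dispatch of the example framework might be organised as follows. The sketch uses the paper's strategy names, but the numeric thresholds separating level I, II and III violations are hypothetical; in the paper they are derived from the empirical capability lower bounds of the strategies, not fixed constants.

```python
# Sketch of the example framework's dispatch: map a detected recoverable
# temporal violation to one of the three light-weight handling strategies.
# Threshold values are hypothetical placeholders for illustration only.

def classify_violation(time_deficit, level1_max=20.0, level2_max=60.0, level3_max=180.0):
    """Return the violation level ("I", "II", "III") or None if non-recoverable."""
    if time_deficit <= level1_max:
        return "I"
    if time_deficit <= level2_max:
        return "II"
    if time_deficit <= level3_max:
        return "III"
    return None

HANDLERS = {
    "I": "TDA",          # Time Deficit Allocation
    "II": "ACOWR",       # ACO based two-stage workflow local rescheduling
    "III": "TDA+ACOWR",  # combined strategy for the largest recoverable deficits
}

def handle(time_deficit):
    """Dispatch a time deficit to the matching handling strategy name."""
    level = classify_violation(time_deficit)
    return HANDLERS.get(level, "non-recoverable")

print(handle(45.0))  # prints "ACOWR"
```

In this arrangement each level boundary coincides with the capability limit of the cheaper strategy, so the framework always applies the least costly strategy that can still recover the deficit.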
The remainder of the paper is organised as follows. Section 2 presents a motivating example and the problem analysis. Section 3 proposes a general exception handling framework for temporal violations. Section 4 presents a case study with a concrete exception handling framework with three levels of temporal violations and their corresponding handling strategies. Section 5 demonstrates comprehensive simulation results. Section 6 reviews the related work. Finally, Section 7 concludes the paper and outlines future work.
Motivating example
In this section, we present an example scientific workflow in Astrophysics. Parkes Radio Telescope (http://www.parkes.atnf.csiro.au/, located 380 km west of Sydney, Australia), one of the most famous radio telescopes, is serving institutions around the world. Swinburne Astrophysics group has been conducting a pulsar searching survey based on the observation data from Parkes Radio Telescope (http://astronomy.swin.edu.au/pulsar/). The pulsar searching process is a typical scientific workflow which
A general exception handling framework for temporal violations
In this section, an overview of a probability based temporal consistency model is presented and the range of recoverable temporal violations is defined. Afterwards, a general exception handling framework is proposed where fine-grained temporal violations are defined based on the empirical function for the capability lower bounds of exception handling strategies.
An example implementation of the framework
Based on the general exception handling framework defined in Section 3, this section presents an automatic and cost-effective exception handling framework which serves as a representative example.
Evaluations on example framework
In this section, we evaluate the performance of the example framework to demonstrate the effectiveness of our general exception handling framework. In a qualitative fashion, we can claim that our example framework satisfies the two basic requirements of Automation and Cost-effectiveness.
Automation: Based on our previous work on checkpoint selection and temporal verification (Chen and Yang, in press), different levels of temporal violations can be automatically detected in an efficient fashion.
Related work
Temporal constraint is one of the most important workflow QoS constraints besides cost, fidelity, reliability and security as discussed in Yu and Buyya (2005). In practice, a set of temporal constraints can be deemed as a QoS contract between clients and service providers. In order to successfully fulfil these contracts, efficient monitoring mechanisms such as checkpoint selection (Chen and Yang, in press) and temporal verification (Chen and Yang, 2007) are implemented to dynamically detect
Conclusions and future work
Latest studies in checkpoint selection and temporal verification can only detect temporal violations but cannot handle them. In this paper, the issue of handling temporal violations in scientific workflows has been systematically investigated and addressed by our proposed exception handling framework. Given the two fundamental requirements of Automation and Cost-effectiveness, a novel general exception handling framework has been proposed where fine-grained temporal violations are defined based
Acknowledgments
This work is partially supported by Australian Research Council under Linkage Project LP0990393, the National Natural Science Foundation of China project under Grant No. 70871033. Part of this work, particularly the example framework, has been accepted by ICPADS’2010. We are also grateful for the discussions with Dr. W. van Straten and Ms. L. Levin from Swinburne Centre for Astrophysics and Supercomputing.
References
- et al. Cloud computing and emerging IT platforms: vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems (2009)
- et al. An ant algorithm for balanced job scheduling in grids. Future Generation Computer Systems (2009)
- et al. Localising temporal constraints in scientific workflows. Journal of Computer and System Sciences (2010)
- et al. Classification and evaluation of timed running schemas for workflow based on process mining. Journal of Systems and Software (2009)
- et al. Resource constraints analysis of workflow specifications. Journal of Systems and Software (2004)
- et al. Forecasting duration intervals of scientific workflow activities based on time-series patterns
- et al. Conflict detection and resolution for workflows constrained by resources and non-determined durations. Journal of Systems and Software (2008)
- et al. A timed workflow process model. Journal of Systems and Software (2001)
- et al. Advanced exception handling mechanisms. IEEE Transactions on Software Engineering (2000)
- et al. Multiple states based temporal consistency for dynamic verification of fixed-time constraints in grid workflow systems. Concurrency and Computation: Practice and Experience (2007)
- A taxonomy of grid workflow verification and validation. Concurrency and Computation: Practice and Experience
- An ant colony optimization approach to a grid workflow scheduling problem with various QoS requirements. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews
- Workflow scheduling in grids: an ant colony optimization approach
- Hybrid scheduling of dynamic task graphs with selective duplication for multiprocessors under memory and time constraints. IEEE Transactions on Parallel and Distributed Systems
- New grid scheduling and rescheduling methods in the GrADS project
- Workflows and e-science: an overview of workflow system features and capabilities. Future Generation Computer Systems
- Time constraints in workflow systems
- The Grid: Blueprint for a New Computing Infrastructure
- Exception handling in workflow management systems. IEEE Transactions on Software Engineering