A novel general framework for automatic and cost-effective handling of recoverable temporal violations in scientific workflow systems
Introduction
Scientific workflow systems are a type of workflow management system that aims to support complex scientific processes in many e-science applications such as climate modelling, disaster recovery simulation, astrophysics and high energy physics (Deelman et al., 2008, Taylor et al., 2007). They can also be seen as high-level middleware services for high performance computing infrastructures such as cluster, grid, peer-to-peer (p2p) or cloud computing (Buyya et al., 2009, Foster and Kesselman, 2004, Kim et al., 2007, Yang et al., 2007). In recent years, owing to the growing demand for high performance computing infrastructures and large-scale distributed and collaborative e-science applications, scientific workflow systems have attracted increasing interest from distributed and parallel system researchers in the area of High Performance Computing (HPGC, 2009, PDSEC, 2009) and from software engineering researchers in the area of Software Engineering for Computational Science and Engineering (Chen and Yang, in press, SECES, 2008). One of the common research issues is how to deliver satisfactory workflow QoS (quality of service), i.e. how to satisfy workflow QoS constraints such as constraints on time, cost, fidelity, reliability and security (Son and Kim, 2001, Yu and Buyya, 2005). Among them, time is one of the basic measurements of system and software performance and hence attracts many researchers in the workflow area (van der Aalst et al., 2000, Chen and Yang, 2008, Duan et al., 2009, Eder et al., 1999, Li et al., 2004, Yu and Buyya, 2005, Zhuge et al., 2001).
In reality, a scientific workflow and its workflow segments are normally subject to specific temporal constraints, such as global temporal constraints (deadlines) on workflow instances and local temporal constraints (milestones) on workflow segments, in order to achieve predefined scientific goals on schedule (Li et al., 2004, Zeng et al., 2008). Otherwise, the timeliness of the execution results deteriorates significantly. For example, a daily weather forecast scientific workflow has to finish before the weather forecast programme is broadcast every day at, for instance, 6:00 pm. Meanwhile, given the large number of data- and computation-intensive activities involved in scientific investigation, scientific workflows are usually deployed on distributed high performance infrastructures such as grids and clouds. Therefore, to deliver satisfactory temporal QoS, violations of both local temporal constraints (local violations for short) and global temporal constraints (global violations for short) need to be proactively detected and handled (Zhuge et al., 2001). Recent studies on temporal verification in scientific workflows mainly focus on runtime checkpoint selection (Chen and Yang, in press) and multiple-state based temporal verification (Chen and Yang, 2007), which deal with the monitoring of temporal consistency states and the detection of potential temporal violations. However, a significant follow-up issue is how to handle those temporal violations. To date, work on this issue is still in its infancy, yet it must be properly addressed to guarantee high success rates for on-time completion of scientific workflows. Specifically, two fundamental requirements for handling temporal violations, automation and cost-effectiveness, need to be considered.
- (1)
Automation. Due to the complex nature of scientific applications and their distributed running environments such as grid and cloud, a large number of temporal violations can often be expected in scientific workflows. Moreover, since scientific workflow systems are designed to conduct large-scale scientific processes with a high degree of automation, human interventions, which are normally inefficient, should be avoided as much as possible, especially during workflow runtime (Deelman et al., 2008). Therefore, similar to dynamic checkpoint selection and temporal verification strategies (Chen and Yang, in press), handling strategies are required to tackle a large number of temporal violations automatically and relieve users of the heavy workload of handling those exceptions.
- (2)
Cost-effectiveness. The purpose of handling temporal violations is to reduce, or ideally remove, delays in workflow execution by applying exception handling strategies at the expense of additional cost, which consists of both monetary cost and time overheads. Conventional exception handling strategies for temporal violations, such as resource recruitment and workflow restructuring, are usually very expensive (Buhr and Mok, 2000, Hagen and Alonso, 2000, Prodan and Fahringer, 2008, Russell et al., 2006a). The cost of recruiting new resources (e.g. the cost of service discovery and deployment, and of data storage and transfer) is normally very high during workflow runtime in distributed computing environments (Prodan and Fahringer, 2008). Workflow restructuring, in turn, is usually realised by amending local workflow segments or temporal QoS contracts, i.e. modifying scientific workflow specifications through human decision makers (Liu et al., 2008b). However, due to budget (i.e. monetary cost) limits and temporal constraints, these heavy-weight strategies (with large monetary cost and/or time overheads) are usually too costly to be practical. To avoid them, recoverable violations (in contrast to severe temporal violations, which can be regarded as non-recoverable in practice) need to be identified first and then handled by light-weight strategies (with small monetary cost and/or time overheads) in a cost-effective fashion.
Given the requirement of Automation, exception handling strategies need to handle temporal violations automatically, without human intervention. Meanwhile, since most strategies are limited in their capability to recover temporal violations, a given handling strategy is normally effective only for a range of temporal violations with a limited amount of time deficit (the time delay with respect to a specific temporal constraint). Given the requirement of Cost-effectiveness, among all candidate strategies capable of handling the current temporal violation, ideally only the one with the lowest cost should be applied. Therefore, the definition of fine-grained temporal violations and the design of exception handling strategies should be investigated as two interdependent tasks within the same exception handling framework. However, since recent studies on temporal verification mainly focus on the detection of temporal violations, fine-grained temporal violations are usually defined for general purposes, ignoring the performance of the exception handling strategies available in specific workflow systems. For example, the work in Chen and Yang (2007) proposes a multiple-state based temporal consistency model. Besides SC (strong consistency), which requires no action, three types of fine-grained temporal inconsistency states, namely WC (weak consistency), WI (weak inconsistency) and SI (strong inconsistency), are defined based on the minimum, mean and maximum workflow execution time. However, without investigating the performance of different exception handling strategies, it is difficult to determine which strategy should be applied to handle the detected temporal violations.
Therefore, it is more reasonable to define fine-grained temporal violations according to the selection of exception handling strategies with different capabilities, rather than, as in most previous studies, defining fine-grained temporal violations first and then looking for available exception handling strategies. To the best of our knowledge, this is the first work to systematically investigate a general exception handling framework for automatic and cost-effective handling of temporal violations in scientific workflow systems.
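The selection principle above can be sketched in code. This is a minimal illustration only, not the paper's actual algorithm: the strategy names, capability bounds (the maximum time deficit each strategy can recover) and cost figures are hypothetical placeholders.

```python
# Minimal sketch: choose the cheapest handling strategy whose empirical
# capability covers the detected time deficit. All names and numbers are
# illustrative placeholders, not values from the paper.

def select_strategy(time_deficit, strategies):
    """Return the lowest-cost strategy capable of recovering the deficit,
    or None if the violation is beyond all candidate strategies."""
    capable = [s for s in strategies if s["max_recoverable_deficit"] >= time_deficit]
    if not capable:
        return None  # severe violation: non-recoverable by these strategies
    return min(capable, key=lambda s: s["cost"])

# Hypothetical candidate strategies (deficits in seconds, cost in arbitrary units).
CANDIDATES = [
    {"name": "light",  "max_recoverable_deficit": 30,  "cost": 1},
    {"name": "medium", "max_recoverable_deficit": 120, "cost": 5},
    {"name": "heavy",  "max_recoverable_deficit": 600, "cost": 20},
]

print(select_strategy(90, CANDIDATES)["name"])  # prints "medium"
```

Note that fine-grained violation levels fall out of this rule naturally: each level corresponds to the range of time deficits for which a particular strategy is the cheapest capable one.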
In this paper, along with a probability-based temporal consistency model which defines the range of recoverable temporal violations, a novel general automatic and cost-effective exception handling framework is proposed. Specifically, fine-grained temporal violations are first defined based on the empirical function for the capability lower bounds of the exception handling strategies. Afterwards, to serve as a case study, a concrete example framework is presented which consists of three levels of fine-grained temporal violations, viz. level I, level II and level III temporal violations defined within the recoverable probability range, and three light-weight automatic exception handling strategies, viz. TDA (Time Deficit Allocation), ACOWR (Ant Colony Optimisation based two-stage Workflow local Rescheduling) and TDA + ACOWR (the combination of TDA and ACOWR). Large-scale simulation experiments are conducted in the SwinDeW-G scientific grid workflow system (Yang et al., 2007) to evaluate the effectiveness of the example framework.
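The three-level dispatch of the example framework might be organised as follows. The sketch uses the paper's strategy names, but the numeric thresholds separating level I, II and III violations are hypothetical; in the paper they are derived from the empirical capability lower bounds of the strategies, not fixed constants.

```python
# Sketch of the example framework's dispatch: map a detected recoverable
# temporal violation to one of the three light-weight handling strategies.
# Threshold values are hypothetical placeholders for illustration only.

def classify_violation(time_deficit, level1_max=20.0, level2_max=60.0, level3_max=180.0):
    """Return the violation level ("I", "II", "III") or None if non-recoverable."""
    if time_deficit <= level1_max:
        return "I"
    if time_deficit <= level2_max:
        return "II"
    if time_deficit <= level3_max:
        return "III"
    return None

HANDLERS = {
    "I": "TDA",          # Time Deficit Allocation
    "II": "ACOWR",       # ACO based two-stage workflow local rescheduling
    "III": "TDA+ACOWR",  # combined strategy for the largest recoverable deficits
}

def handle(time_deficit):
    """Dispatch a time deficit to the matching handling strategy name."""
    level = classify_violation(time_deficit)
    return HANDLERS.get(level, "non-recoverable")

print(handle(45.0))  # prints "ACOWR"
```

In this arrangement each level boundary coincides with the capability limit of the cheaper strategy, so the framework always applies the least costly strategy that can still recover the deficit.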
The remainder of the paper is organised as follows. Section 2 presents a motivating example and the problem analysis. Section 3 proposes a general exception handling framework for temporal violations. Section 4 presents a case study with a concrete exception handling framework with three levels of temporal violations and their corresponding handling strategies. Section 5 demonstrates comprehensive simulation results. Section 6 reviews the related work. Finally, Section 7 concludes the paper and outlines future work.
Motivating example
In this section, we present an example scientific workflow in Astrophysics. Parkes Radio Telescope (http://www.parkes.atnf.csiro.au/, located 380 km west of Sydney, Australia), one of the most famous radio telescopes, is serving institutions around the world. Swinburne Astrophysics group has been conducting a pulsar searching survey based on the observation data from Parkes Radio Telescope (http://astronomy.swin.edu.au/pulsar/). The pulsar searching process is a typical scientific workflow which
A general exception handling framework for temporal violations
In this section, an overview of a probability based temporal consistency model is presented and the range of recoverable temporal violations is defined. Afterwards, a general exception handling framework is proposed where fine-grained temporal violations are defined based on the empirical function for the capability lower bounds of exception handling strategies.
An example implementation of the framework
Based on the general exception handling framework defined in Section 3, this section presents an automatic and cost-effective exception handling framework which serves as a representative example.
Evaluations on example framework
In this section, we evaluate the performance of the example framework to demonstrate the effectiveness of our general exception handling framework. In a qualitative fashion, we can claim that our example framework satisfies the two basic requirements of Automation and Cost-effectiveness.
Automation: Based on our previous work on checkpoint selection and temporal verification (Chen and Yang, in press), different levels of temporal violations can be automatically detected in an efficient fashion.
Related work
Temporal constraint is one of the most important workflow QoS constraints besides cost, fidelity, reliability and security as discussed in Yu and Buyya (2005). In practice, a set of temporal constraints can be deemed as a QoS contract between clients and service providers. In order to successfully fulfil these contracts, efficient monitoring mechanisms such as checkpoint selection (Chen and Yang, in press) and temporal verification (Chen and Yang, 2007) are implemented to dynamically detect
Conclusions and future work
Latest studies in checkpoint selection and temporal verification can only detect temporal violations but cannot handle them. In this paper, the issue of handling temporal violations in scientific workflows has been systematically investigated and addressed by our proposed exception handling framework. Given the two fundamental requirements of Automation and Cost-effectiveness, a novel general exception handling framework has been proposed where fine-grained temporal violations are defined based
Acknowledgments
This work is partially supported by Australian Research Council under Linkage Project LP0990393, the National Natural Science Foundation of China project under Grant No. 70871033. Part of this work, particularly the example framework, has been accepted by ICPADS’2010. We are also grateful for the discussions with Dr. W. van Straten and Ms. L. Levin from Swinburne Centre for Astrophysics and Supercomputing.
References
- et al. Cloud computing and emerging IT platforms: vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems (2009)
- et al. An ant algorithm for balanced job scheduling in grids. Future Generation Computer Systems (2009)
- et al. Localising temporal constraints in scientific workflows. Journal of Computer and System Sciences (2010)
- et al. Classification and evaluation of timed running schemas for workflow based on process mining. Journal of Systems and Software (2009)
- et al. Resource constraints analysis of workflow specifications. Journal of Systems and Software (2004)
- et al. Forecasting duration intervals of scientific workflow activities based on time-series patterns
- et al. Conflict detection and resolution for workflows constrained by resources and non-determined durations. Journal of Systems and Software (2008)
- et al. A timed workflow process model. Journal of Systems and Software (2001)
- et al. Advanced exception handling mechanisms. IEEE Transactions on Software Engineering (2000)
- et al. Multiple states based temporal consistency for dynamic verification of fixed-time constraints in grid workflow systems. Concurrency and Computation: Practice and Experience (2007)
- A taxonomy of grid workflow verification and validation. Concurrency and Computation: Practice and Experience
- An ant colony optimization approach to a grid workflow scheduling problem with various QoS requirements. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews
- Workflow scheduling in grids: an ant colony optimization approach
- Hybrid scheduling of dynamic task graphs with selective duplication for multiprocessors under memory and time constraints. IEEE Transactions on Parallel and Distributed Systems
- New grid scheduling and rescheduling methods in the GrADS project
- Workflows and e-science: an overview of workflow system features and capabilities. Future Generation Computer Systems
- Time constraints in workflow systems
- The Grid: Blueprint for a New Computing Infrastructure
- Exception handling in workflow management systems. IEEE Transactions on Software Engineering