Application note
Parallelization and optimization of spatial analysis for large scale environmental model data assembly
Highlights
► Simulation of environmental processes over large areas is increasingly necessary.
► A zonal statistics algorithm was developed and parallelized for processing input data.
► The algorithm depends on open-source libraries and packages.
► The algorithm achieves 1440 times speed-up compared to traditional GIS software.
► This speeds input data processing for environmental modelling at high resolution.
Introduction
Process-based environmental models such as those that estimate the growth of agricultural crops, forest stands, or water resources typically require high spatial and temporal resolution input data (Van Wesemael et al., 2010). One of the seemingly foundational laws of environmental modelling is that these data are rarely in exactly the format and resolution required by the model. For example, the study units are often irregular agro-ecological or climate/soil zones (Devendra and Thomas, 2002; Fischer et al., 2002), while the input data are usually site data from station observations or grid data from spatial interpolation. As a result, a significant amount of pre-processing is often required to summarize and assemble data prior to running the model. This can be computationally demanding and time-consuming. Traditional Geographic Information Systems (GIS) have yet to widely embrace the shift in computer technology to multi-core processors (Bryan, 2012). Hence, the efficiency of data processing within GIS software has largely ceased to increase over the past decade, following the limitations on computer processor clock speed (Dongarra et al., 2007). Although some GIS vendors offer batch-processing methods, we found that their capacity could not meet the requirements of our application. We therefore developed a customized algorithm using high-performance computing (HPC) to increase the efficiency of data pre-processing.
In this application, we focused on the assembly of high spatial and temporal resolution climate data to simulate agricultural systems. Variation in climate across both time and space significantly influences crop growth in agricultural systems (Hansen and Jones, 2000; Luo et al., 2003; Reidsma et al., 2009). Understanding how spatial–temporal variation of climate impacts agricultural systems is of critical importance for agricultural decision-making (Overpeck et al., 2011). Process-based crop models are a common way to study agricultural systems at regional or national scales and usually need a complete and accurate source of climate data across the whole landscape (Bryan et al., 2010; Bryan et al., 2011; Safir et al., 2008; Zhao et al., in press). Interpolating station-based climate records to raster layers is a common practice to overcome the deficiencies of observational data at the regional scale (Jeffrey et al., 2001; Thornton et al., 2009). However, coupling this kind of spatial data with agricultural systems models requires a series of intensive spatial analyses. This requirement has impeded the application of agricultural systems models to large areas at high spatial resolution, even though computing resources are more readily available today (Finley et al., 2012).
The purpose of this study was to prepare climate data for the Agricultural Production Systems sIMulator (APSIM), a process-based agricultural systems model (Keating et al., 2003; Wang et al., 2009), to simulate wheat productivity under various management practices across Australia’s wheat-growing regions. To achieve this, it was necessary to extract and summarize 122 years of daily gridded climate data for 11,575 climate–soil zones. A zonal statistics algorithm, of the kind commonly found in raster GIS, was developed with Python’s NumPy module to calculate summary statistics on the raster climate data layers for each zone. We parallelized this algorithm and assessed its performance under a varying number of CPU cores. We also assessed the efficiency of parallel computing with different job scheduling and load balancing approaches to identify the most efficient high-performance computing strategy. The utility of these techniques for data assembly and summary for input into environmental models is discussed.
Climate–soil zones and climate data
The study area forms a crescent from the northeast coast of Queensland around the south of the continent to the west coast of Western Australia (Fig. 1). It includes a 100 km buffer around the actual areas sown to wheat in 2006 (ABARE, 2006; Marinoni et al., 2012). The whole study area was divided into 11,575 climate–soil zones (CS zones) with relatively homogeneous climate and soil properties. Using zones as basic modelling units instead of grid cells enables us to reduce
Design and implementation
The zonal statistics algorithm developed in this study aggregates continuous values in one raster for each zone in another raster, and computes summary statistics for each zone (Fig. 2). The algorithm takes as input two raster data sets, namely zones and values, which share the same resolution and extent. Similar algorithms can be found in many raster GIS. Initially, we undertook the spatial analyses using a combination of ArcGIS tools in batch processing: Make NetCDF Raster Layer, Resample, and Zonal Statistics
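As an illustration of the zonal statistics operation described above, the following is a minimal NumPy sketch, not the authors' implementation. The function name `zonal_mean`, the `nodata` parameter, and the use of `np.bincount` are assumptions made for this example; the only property taken from the text is that the zones and values rasters share the same resolution and extent.

```python
import numpy as np


def zonal_mean(zones, values, nodata=None):
    """Return {zone_id: mean of `values` over the cells of that zone}.

    Hypothetical helper: `zones` holds integer zone IDs and `values`
    holds continuous data on the same grid (same shape).
    """
    zones = np.asarray(zones).ravel()
    values = np.asarray(values, dtype=float).ravel()
    if zones.shape != values.shape:
        raise ValueError("zones and values must have the same shape")

    # Ignore NaN cells and, optionally, an explicit no-data value.
    valid = np.isfinite(values)
    if nodata is not None:
        valid &= values != nodata

    # Map arbitrary zone IDs to compact indices 0..k-1 so that
    # np.bincount can accumulate per-zone sums and cell counts.
    ids, inverse = np.unique(zones[valid], return_inverse=True)
    sums = np.bincount(inverse, weights=values[valid], minlength=ids.size)
    counts = np.bincount(inverse, minlength=ids.size)

    # Every ID in `ids` has at least one valid cell, so counts > 0.
    return dict(zip(ids.tolist(), (sums / counts).tolist()))
```

Other statistics, such as per-zone sums or counts, could be returned from the same two `np.bincount` passes. Zones with no valid cells are simply absent from the result in this sketch.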
Results
Processing a single year of data (1825 data sets) using our zonal statistics algorithm took 1884.15 s for a single worker, or roughly one second per climate data set (Fig. 5a). The improvement was significant compared with 180 s per data set for desktop GIS processing. Run time decreased further with an increasing number of active workers up to around nine cores (228 s for 1825 jobs), beyond which performance degraded. A larger chunk size achieved greater processing efficiency, especially when
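The timings quoted above can be related to one another with simple arithmetic. The sketch below uses only the figures reported in this section; the variable names are illustrative, and the desktop GIS total is an assumption obtained by extrapolating the quoted 180 s per data set linearly to all 1825 data sets.

```python
# Figures reported in the text for one year of data.
n_datasets = 1825              # data sets per year
gis_s_per_dataset = 180.0      # desktop GIS, seconds per data set
serial_seconds = 1884.15       # one worker, whole year
parallel_seconds = 228.0       # around nine cores, whole year

# Assumption: desktop GIS time scales linearly with the number of data sets.
gis_seconds = gis_s_per_dataset * n_datasets

serial_s_per_dataset = serial_seconds / n_datasets        # about 1.03 s
speedup_serial_vs_gis = gis_seconds / serial_seconds      # about 174
speedup_parallel_vs_serial = serial_seconds / parallel_seconds  # about 8.3
speedup_parallel_vs_gis = gis_seconds / parallel_seconds  # roughly 1440

print(f"serial: {serial_s_per_dataset:.2f} s per data set")
print(f"serial vs GIS: {speedup_serial_vs_gis:.0f}x")
print(f"parallel vs serial: {speedup_parallel_vs_serial:.1f}x")
print(f"parallel vs GIS: {speedup_parallel_vs_gis:.0f}x")
```

Under that linear assumption, the combined figure is consistent with the speed-up of about 1440 times stated in the highlights.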
Discussion and conclusion
With the development of open source software and programming languages that are operating system independent and very flexible, it becomes possible to develop tailored spatial analysis algorithms for applications involving expensive computation. In this paper, we developed and implemented a Python-based serial and parallel version of a zonal statistics algorithm commonly found in many off-the-shelf GIS packages. This method is useful for model data processing and assembly especially when the
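To make the idea of a serial and a parallel version concrete, the sketch below runs the same per-data-set function either in a loop or across a pool of worker processes. It is an assumption-laden illustration, not the job scheduling used in the study: Python's standard `multiprocessing.Pool` stands in for the scheduler, `summarize_dataset` generates synthetic rasters instead of reading climate data, and the names, worker count, and chunk size are all hypothetical.

```python
from multiprocessing import Pool

import numpy as np

N_ZONES = 4  # hypothetical number of zones in the synthetic rasters


def summarize_dataset(seed):
    """Stand-in for one job: per-zone means for one synthetic data set."""
    rng = np.random.default_rng(seed)
    zones = rng.integers(0, N_ZONES, size=(50, 50)).ravel()
    values = rng.random((50, 50)).ravel()
    sums = np.bincount(zones, weights=values, minlength=N_ZONES)
    counts = np.bincount(zones, minlength=N_ZONES)
    # Guard against empty zones to avoid division by zero.
    return (sums / np.maximum(counts, 1)).tolist()


def run_serial(jobs):
    """Process the jobs one after another in the current process."""
    return [summarize_dataset(job) for job in jobs]


def run_parallel(jobs, n_workers=4, chunksize=8):
    """Process the jobs on a pool of worker processes.

    `chunksize` sets how many jobs are handed to a worker at a time;
    larger chunks mean fewer hand-offs between the pool and its workers.
    """
    with Pool(processes=n_workers) as pool:
        return pool.map(summarize_dataset, list(jobs), chunksize=chunksize)


if __name__ == "__main__":
    jobs = range(64)
    # Both versions should yield identical results in the same order.
    assert run_parallel(jobs) == run_serial(jobs)
```

Because `pool.map` preserves input order, the parallel results can be checked directly against the serial ones, which is a convenient correctness test before timing different worker counts and chunk sizes.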
Acknowledgements
The authors are grateful for the support of CSIRO’s Sustainable Agriculture Flagship, the National Basic Research Program of China (Grant No. 2012CB955304), WRON Data Centre, and Chinese Scholarship Council. Efforts and comments from Andrew Higgins, Mike Grundy, and two anonymous reviewers greatly improved the manuscript.
References (27)
- et al. Modelling and mapping agricultural opportunity costs to guide landscape planning for natural resource management. Ecological Indicators (2011)
- et al. Crop–animal systems in Asia: importance of livestock and characterisation of agro-ecological zones. Agricultural Systems (2002)
- et al. Scaling-up crop models for climate variability applications. Agricultural Systems (2000)
- et al. Using spatial interpolation to construct a comprehensive archive of Australian climate data. Environmental Modelling and Software (2001)
- et al. An overview of APSIM, a model designed for farming systems simulation. European Journal of Agronomy (2003)
- et al. Quantitative and visual assessments of climate change impacts on South Australian wheat production. Agricultural Systems (2003)
- et al. Development of a system to produce maps of agricultural profit on a continental scale: an example for Australia. Agricultural Systems (2012)
- et al. Regional crop modelling in Europe: the impact of climatic conditions and farm characteristics on maize yields. Agricultural Systems (2009)
- et al. Simulation of corn yields in the upper great lakes region of the US using a modeling framework. Computers and Electronics in Agriculture (2008)
- et al. Spatial variation of crop yield response to climate change in East Africa. Global Environmental Change (2009)
- Modelling farming systems performance at catchment and regional scales to support natural resource management. NJAS – Wageningen Journal of Life Sciences