Application note
Parallelization and optimization of spatial analysis for large scale environmental model data assembly

https://doi.org/10.1016/j.compag.2012.08.007

Abstract

Spatial–temporal modelling of environmental systems such as agriculture, forestry, and water resources requires high resolution input data. Assembling and summarizing this data in the appropriate format for model input often requires a series of spatial analyses which can be extremely time-consuming, especially when many large data sets are involved. In this paper we investigated the ability of high-performance computing techniques to improve the efficiency of spatial analysis for model data assembly. We implemented an array-based algorithm to calculate summary statistics for long time series of daily gridded climate data for 11,575 climate–soil zones across the Australian wheat-growing regions for input into a crop simulation model. We developed a zonal statistics algorithm using Python’s NumPy module, then parallelized it and ran it on a shared-memory, multi-processor system. We assessed algorithm performance with a varying number of CPU cores, and assessed the influence of load balancing on the efficiency of parallel processing. Compared with traditional desktop GIS software, the serial and parallel (32 cores) implementations achieved about 180 and 1440 times speed-up, respectively. We also found that the most efficient computation occurred when not all of the available CPU cores were used, and that the chunk size of jobs had an important influence on computing efficiency. The algorithm and the parallel processing scheme provide a useful approach to address computing challenges posed by spatial analysis of numerous large data sets for large scale environmental modelling.

Highlights

► Simulation of environmental processes over large areas is increasingly necessary.
► A zonal statistics algorithm was developed and parallelized for processing input data.
► The algorithm depends on open-source libraries and packages.
► The algorithm achieves 1440 times speed-up compared to traditional GIS software.
► This speeds input data processing for environmental modelling at high resolution.

Introduction

Process-based environmental models such as those that estimate the growth of agricultural crops, forest stands, or water resources typically require high spatial and temporal resolution input data (Van Wesemael et al., 2010). One of the seemingly foundational laws of environmental modelling is that this data is rarely in exactly the format and resolution required by the model. For example, the study units are often irregular agro-ecological or climate/soil zones (Devendra and Thomas, 2002; Fischer et al., 2002), while the input data are usually site data from station observations or grid data from spatial interpolation. As a result, a significant amount of pre-processing is often required to summarize and assemble data prior to running the model. This can be computationally demanding and time-consuming. Traditional Geographic Information Systems (GIS) are yet to widely embrace the shift in computer technology to multi-core processors (Bryan, 2012). Hence, the efficiency of data processing within GIS software has largely ceased to increase over the past decade following the limitations on computer processor clock speed (Dongarra et al., 2007). Although some GIS vendors offer batch-processing methods, we found that their capacity could not meet the requirements of our application. We therefore developed a customized algorithm using high-performance computing (HPC) to increase the efficiency of data pre-processing.

In this application, we focused on the assembly of high spatial and temporal resolution climate data to simulate agricultural systems. Variation in climate across both time and space significantly influences crop growth in agricultural systems (Hansen and Jones, 2000; Luo et al., 2003; Reidsma et al., 2009). Understanding how spatial–temporal variation of climate impacts agricultural systems is of critical importance for agricultural decision-making (Overpeck et al., 2011). Process-based crop models are a common way to study agricultural systems at regional or national scales and usually need a complete and accurate source of climate data across the whole landscape (Bryan et al., 2010; Bryan et al., 2011; Safir et al., 2008; Zhao et al., in press). Interpolating station-based climate records to raster layers is a common practice to overcome the deficiencies of observational data at the regional scale (Jeffrey et al., 2001; Thornton et al., 2009). However, coupling this kind of spatial data with agricultural systems models requires a series of intensive spatial analyses, which has impeded the application of agricultural systems models to large areas at high spatial resolution, even though computing resources are more readily available today (Finley et al., 2012).

The purpose of this study was to prepare climate data for the Agricultural Production Systems sIMulator (APSIM), a process-based agricultural systems model (Keating et al., 2003; Wang et al., 2009), to simulate wheat productivity under various management practices across Australia’s wheat-growing regions. To achieve this it was necessary to extract and summarize 122 years of daily gridded climate data for 11,575 climate–soil zones. A zonal statistics algorithm, of the kind commonly found in raster GIS, was developed using Python’s NumPy module to calculate summary statistics on the raster climate data layers for each zone. We parallelized this algorithm and assessed its performance under a varying number of CPU cores. We also assessed the efficiency of parallel computing with different job scheduling and load balancing approaches to identify the most efficient high-performance computing strategy. The utility of these techniques for data assembly and summary for input into environmental models is discussed.

Section snippets

Climate–soil zones and climate data

The study area forms a crescent-shaped area from the northeast coast of Queensland around the south of the continent to the west coast of Western Australia (Fig. 1). The study area includes a 100 km buffer around the actual areas sown to wheat in 2006 (ABARE, 2006; Marinoni et al., 2012). The whole study area was divided into 11,575 climate–soil zones (CS zones) with relatively homogeneous climate and soil properties. Using zones as basic modelling units instead of grid cells enables us to reduce

Design and implementation

The zonal statistics algorithm developed in this study aggregates continuous values in one raster for each zone in another raster, and computes the statistics (Fig. 2). The algorithm takes as input two raster data sets, namely zones and values, which share the same resolution and extent. Similar algorithms can be found in many raster GIS. Initially, we undertook the spatial analyses using a combination of ArcGIS tools in batch processing: Make NetCDF Raster Layer, Resample, and Zonal Statistics
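The idea of aggregating a values raster by a zones raster of the same shape can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `zonal_mean`, the use of `np.bincount`, the choice of the mean as the statistic, and the encoding of missing cells as NaN are all assumptions made for illustration.

```python
import numpy as np


def zonal_mean(zones: np.ndarray, values: np.ndarray) -> dict:
    """Return {zone_id: mean of values} for two rasters of identical shape.

    Hypothetical helper for illustration; missing cells are assumed to be NaN.
    """
    if zones.shape != values.shape:
        raise ValueError("zones and values must share the same shape")

    # Flatten both rasters so that cell i in one matches cell i in the other.
    zone_ids = zones.ravel()
    vals = values.ravel().astype(float)

    # Drop cells whose value is missing.
    valid = ~np.isnan(vals)
    zone_ids, vals = zone_ids[valid], vals[valid]

    # Map arbitrary zone labels to consecutive integers 0..k-1.
    labels, inverse = np.unique(zone_ids, return_inverse=True)

    # Per-zone sums and cell counts, each computed without a Python loop.
    sums = np.bincount(inverse, weights=vals, minlength=labels.size)
    counts = np.bincount(inverse, minlength=labels.size)

    return dict(zip(labels.tolist(), (sums / counts).tolist()))
```

For a 2 × 3 zones raster `[[1, 1, 2], [3, 3, 2]]` and a values raster `[[10, 20, 5], [1, 3, NaN]]`, this sketch averages 10 and 20 for zone 1, keeps the single valid cell for zone 2, and averages 1 and 3 for zone 3. Working on whole arrays rather than looping over cells is the kind of array-based approach the text describes, although the details here may differ from the published algorithm.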

Results

Processing a single year of data (1825 data sets) using our zonal statistics algorithm took 1884.15 s for a single worker, or roughly one second per climate data set (Fig. 5a). The improvement was significant compared with 180 s per data set for desktop GIS processing. Run time decreased further with an increasing number of active workers up to around nine cores (228 s for 1825 jobs), beyond which performance degraded. A larger chunk size achieved greater processing efficiency, especially when
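The speed-up implied by the run times quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures given in this section; the variable names are illustrative, and 228 s is the run time the text reports at around nine workers.

```python
# Figures quoted in the text above.
N_DATASETS = 1825           # data sets in one year of climate data
SERIAL_TOTAL_S = 1884.15    # single worker, whole year
PARALLEL_TOTAL_S = 228.0    # around nine workers, whole year
GIS_PER_DATASET_S = 180.0   # desktop GIS, per data set

# Average processing time per data set for each configuration.
serial_per_dataset = SERIAL_TOTAL_S / N_DATASETS
parallel_per_dataset = PARALLEL_TOTAL_S / N_DATASETS

# Speed-up relative to desktop GIS, and of parallel relative to serial.
speedup_serial_vs_gis = GIS_PER_DATASET_S / serial_per_dataset
speedup_parallel_vs_gis = GIS_PER_DATASET_S / parallel_per_dataset
speedup_parallel_vs_serial = SERIAL_TOTAL_S / PARALLEL_TOTAL_S

print(f"serial vs GIS:      {speedup_serial_vs_gis:.0f}x")
print(f"parallel vs GIS:    {speedup_parallel_vs_gis:.0f}x")
print(f"parallel vs serial: {speedup_parallel_vs_serial:.1f}x")
```

These ratios come to roughly 174, 1441, and 8.3, which agrees with the "about 180 and 1440 times" speed-up stated in the abstract.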

Discussion and conclusion

With the development of open source software and programming languages that are operating system independent and very flexible, it becomes possible to develop tailored spatial analysis algorithms for applications involving expensive computation. In this paper, we developed and implemented a Python-based serial and parallel version of a zonal statistics algorithm commonly found in many off-the-shelf GIS packages. This method is useful for model data processing and assembly especially when the
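A serial and a parallel version of the same job can share one entry point, as in the sketch below. It assumes Python's standard `multiprocessing` module, which the text does not name as the library the authors used. `summarize_dataset` and `run_jobs` are hypothetical names, and the job body is a placeholder for one zonal statistics run on one data set.

```python
from multiprocessing import Pool


def summarize_dataset(dataset_id: int) -> tuple:
    """Placeholder job standing in for one zonal statistics run."""
    # Hypothetical result, for illustration only.
    return dataset_id, dataset_id * 2


def run_jobs(dataset_ids, workers: int = 1, chunksize: int = 1) -> list:
    """Run all jobs serially (workers == 1) or on a pool of worker processes."""
    if workers == 1:
        return [summarize_dataset(i) for i in dataset_ids]
    with Pool(processes=workers) as pool:
        # chunksize sets how many jobs are handed to a worker at a time.
        return pool.map(summarize_dataset, dataset_ids, chunksize=chunksize)


if __name__ == "__main__":
    # Example: one year of 1825 data sets on four workers, 50 jobs per chunk.
    results = run_jobs(range(1825), workers=4, chunksize=50)
    print(len(results))
```

Varying `workers` and `chunksize` in a set-up like this is one way to explore the trade-offs between core count, chunk size, and load balancing that the paper reports. The values 4 and 50 are arbitrary and are not taken from the study.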

Acknowledgements

The authors are grateful for the support of CSIRO’s Sustainable Agriculture Flagship, the National Basic Research Program of China (Grant No. 2012CB955304), the WRON Data Centre, and the China Scholarship Council. Efforts and comments from Andrew Higgins, Mike Grundy, and two anonymous reviewers greatly improved the manuscript.

References (27)

  • Wang, E., et al., 2009. Modelling farming systems performance at catchment and regional scales to support natural resource management. NJAS – Wageningen Journal of Life Sciences.
  • ABARE, 2006. Australian Crop Report. Australian Bureau of Agricultural and Resource Economics and...
  • Bryan, B.A., 2012. High-performance computing tools for the integrated assessment and modelling of social–ecological...