Parallelization and optimization of spatial analysis for large scale environmental model data assembly

doi:10.1016/j.compag.2012.08.007

Computers and Electronics in Agriculture

Volume 89, November 2012, Pages 94-99

https://doi.org/10.1016/j.compag.2012.08.007

Abstract

Spatial–temporal modelling of environmental systems such as agriculture, forestry, and water resources requires high resolution input data. Assembling and summarizing this data in the appropriate format for model input often requires a series of spatial analyses which can be extremely time-consuming, especially when many large data sets are involved. In this paper we investigated the ability of high-performance computing techniques to improve the efficiency of spatial analysis for model data assembly. We implemented an array-based algorithm to calculate summary statistics for long time-series daily grid climate data sets for 11,575 climate–soil zones across the Australian wheat-growing regions for input into a crop simulation model. We developed a zonal statistics algorithm using Python’s NumPy module, then parallelized it and ran it on a shared-memory, multi-processor system. We assessed algorithm performance with a varying number of CPU cores, and assessed the influence of load balancing on the efficiency of parallel processing. Compared with traditional desktop GIS software, the serial and parallel (32 cores) implementations achieved about 180 and 1440 times speed-up, respectively. We also found that the most efficient computation occurred when not all of the available CPU cores were used, and that the chunk size of jobs had an important influence on computing efficiency. The algorithm and the parallel processing scheme provide a useful approach to address computing challenges posed by spatial analysis of numerous large data sets for large scale environmental modelling.

Highlights

► Simulation of environmental processes over large areas is increasingly necessary.
► A zonal statistics algorithm was developed and parallelized for processing input data.
► The algorithm depends on open-source libraries and packages.
► The algorithm achieves 1440 times speed-up compared to traditional GIS software.
► This speeds input data processing for environmental modelling at high resolution.

Introduction

Process-based environmental models such as those that estimate the growth of agricultural crops, forest stands, or water resources typically require high spatial and temporal resolution input data (Van Wesemael et al., 2010). One of the seemingly foundational laws of environmental modelling is that this data is rarely in exactly the right format and resolution required by the model. For example, the study units are often irregular agro-ecological or climate/soil zones (Devendra and Thomas, 2002, Fischer et al., 2002), while the input data are usually site data from station observation or grid data from spatial interpolation. As a result, a significant amount of pre-processing is often required to summarize and assemble data prior to running the model. This can be computationally demanding and time-consuming. Traditional Geographic Information Systems (GIS) are yet to widely embrace the shift in computer technology to multi-core processors (Bryan, 2012). Hence, the efficiency of data processing within GIS software has largely ceased to increase over the past decade following the limitations on computer processor clock speed (Dongarra et al., 2007). Although some GIS vendors offer batch-processing methods, we found that their capacity still could not meet the requirements of our application. We therefore developed a customized algorithm using high-performance computing (HPC) to increase the efficiency of data pre-processing.

In this application, we focused on the assembly of high spatial and temporal resolution climate data to simulate agricultural systems. Variation in climate across both time and space significantly influences crop growth in agricultural systems (Hansen and Jones, 2000; Luo et al., 2003; Reidsma et al., 2009). Understanding how spatial–temporal variation of climate impacts agricultural systems is of critical importance for agricultural decision-making (Overpeck et al., 2011). Process-based crop models are a common way to study agricultural systems at regional or national scales and usually need a complete and accurate source of climate data across the whole landscape (Bryan et al., 2010, Bryan et al., 2011, Safir et al., 2008, Zhao et al., in press). Interpolating station-based climate records to raster layers is a common practice to overcome the deficiencies of observational data at the regional scale (Jeffrey et al., 2001, Thornton et al., 2009). However, coupling this kind of spatial data with agricultural systems models needs a series of intensive spatial analyses, which has impeded the application of agricultural systems models to large areas at high spatial resolution, even though computing resources are more readily available today (Finley et al., 2012).

The purpose of this study was to prepare climate data for the Agricultural Production Systems sIMulator (APSIM), a process-based agricultural systems model (Keating et al., 2003, Wang et al., 2009), to simulate wheat productivity under various management practices across Australia’s wheat-growing regions. To achieve this it was necessary to extract and summarize 122 years of daily gridded climate data for 11,575 climate–soil zones. A zonal statistics algorithm, of the kind commonly found in raster GIS, was developed using Python’s NumPy module to calculate summary statistics on the raster climate data layers for each zone. We parallelized this algorithm and assessed its performance under a varying number of CPU cores. We also assessed the efficiency of parallel computing with different job scheduling and load balancing approaches to identify the most efficient high-performance computing strategy. The utility of these techniques for data assembly and summary for input into environmental models is discussed.

Section snippets

Climate–soil zones and climate data

The study area forms a crescent-shaped area from the northeast coast of Queensland around the south of the continent to the west coast of Western Australia (Fig. 1). The study area includes a 100 km buffer around the actual areas sown to wheat in 2006 (ABARE, 2006, Marinoni et al., 2012). The whole study area was divided into 11,575 climate–soil zones (CS zones) with relatively homogenous climate and soil properties. Using zones as basic modelling units instead of grid cells enables us to reduce

Design and implementation

The zonal statistics algorithm developed in this study aggregates continuous values in one raster for each zone in another raster, and computes the statistics (Fig. 2). The algorithm takes as input two raster data sets, namely zones and values, which share the same resolution and extent. Similar algorithms can be found in many raster GIS. Initially, we undertook the spatial analyses using a combination of ArcGIS tools in batch processing: Make NetCDF Raster Layer, Resample, and zonal Statistics
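
As a rough illustration of the array-based approach described in this snippet, the sketch below computes a per-zone count, mean, minimum and maximum with NumPy from a zones raster and a values raster of the same shape. It is not the authors' implementation: the function name `zonal_stats`, the convention that zone id 0 means "no zone", and the toy arrays are all assumptions made for illustration.

```python
import numpy as np


def zonal_stats(zones, values, nodata_zone=0):
    """Summarise `values` for each zone id in `zones` (hypothetical helper).

    Both rasters must share the same shape, i.e. the same resolution
    and extent. Cells whose zone id equals `nodata_zone` are ignored
    (an assumed convention, not taken from the paper).
    """
    zones = np.asarray(zones)
    values = np.asarray(values, dtype=float)
    if zones.shape != values.shape:
        raise ValueError("zones and values must have the same shape")

    # Flatten both rasters and drop cells that belong to no zone.
    zones, values = zones.ravel(), values.ravel()
    keep = zones != nodata_zone
    zones, values = zones[keep], values[keep]

    # Map arbitrary zone ids to compact indices 0..n_zones-1.
    zone_ids, index = np.unique(zones, return_inverse=True)

    # Per-zone cell count and sum in one pass each.
    count = np.bincount(index)
    total = np.bincount(index, weights=values)

    # Per-zone minimum and maximum via unbuffered ufunc reductions.
    zmin = np.full(zone_ids.size, np.inf)
    zmax = np.full(zone_ids.size, -np.inf)
    np.minimum.at(zmin, index, values)
    np.maximum.at(zmax, index, values)

    return {
        "zone": zone_ids,
        "count": count,
        "mean": total / count,
        "min": zmin,
        "max": zmax,
    }


# Toy rasters (assumed data): three zones on a 3 x 4 grid, 0 = no zone.
zones = np.array([[1, 1, 2, 2],
                  [1, 3, 3, 2],
                  [0, 3, 3, 0]])
values = np.array([[10.0, 12.0, 20.0, 22.0],
                   [14.0, 30.0, 32.0, 24.0],
                   [99.0, 34.0, 36.0, 99.0]])

stats = zonal_stats(zones, values)
for key, column in stats.items():
    print(key, column)
```

Because the work is expressed as whole-array operations rather than a Python loop over cells, the same function can be applied unchanged to each daily climate grid, which is what makes it a natural unit of work for later parallelization.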

Results

Processing a single year of data (1825 data sets) using our zonal statistics algorithm took 1884.15 s for a single worker, or roughly one second per climate data set (Fig. 5a). The improvement was significant compared with 180 s per data set for desktop GIS processing. Run time decreased further with an increasing number of active workers up to around nine cores (228 s for 1825 jobs), beyond which performance degraded. A larger chunk size achieved greater processing efficiency, especially when
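
The timings quoted in this snippet can be turned into speed-up and efficiency figures with a few lines of arithmetic. The sketch below uses only the numbers given above; the one assumption is that "around nine cores" is treated as exactly nine workers when estimating parallel efficiency.

```python
# Timings quoted in the Results snippet.
n_datasets = 1825         # climate data sets in one year of data
gis_per_dataset = 180.0   # seconds per data set with desktop GIS
serial_total = 1884.15    # seconds for all data sets with one worker
parallel_total = 228.0    # seconds for all data sets at the best worker count
n_workers = 9             # assumption: "around nine cores" taken as exactly 9

# Total desktop GIS time for the same year of data.
gis_total = n_datasets * gis_per_dataset

# Speed-up relative to desktop GIS.
serial_speedup = gis_total / serial_total
parallel_speedup = gis_total / parallel_total

# Speed-up of the parallel run relative to the serial run, and the
# fraction of ideal (linear) scaling that this represents.
relative_speedup = serial_total / parallel_total
parallel_efficiency = relative_speedup / n_workers

print(f"serial time per data set: {serial_total / n_datasets:.2f} s")
print(f"serial speed-up over GIS: {serial_speedup:.0f}x")
print(f"parallel speed-up over GIS: {parallel_speedup:.0f}x")
print(f"parallel over serial: {relative_speedup:.2f}x")
print(f"parallel efficiency: {parallel_efficiency:.2f}")
```

The results come out at roughly 174 and 1441 times faster than desktop GIS, in line with the "about 180 and 1440 times" reported in the abstract, with the parallel run about 8.3 times faster than the serial run.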

Discussion and conclusion

With the development of open source software and programming languages that are operating system independent and very flexible, it becomes possible to develop tailored spatial analysis algorithms for applications involving expensive computation. In this paper, we developed and implemented a Python-based serial and parallel version of a zonal statistics algorithm commonly found in many off-the-shelf GIS packages. This method is useful for model data processing and assembly especially when the
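
To make the serial versus parallel distinction concrete, the sketch below dispatches per-data-set zonal means to a pool of workers, with a configurable worker count and chunk size. It is an illustration under stated assumptions, not the scheme used in the paper: `summarise_one` and `run_jobs` are hypothetical names, the rasters are synthetic, and a thread pool (`multiprocessing.pool.ThreadPool`) stands in for a process pool so that the example runs unchanged on any platform.

```python
import numpy as np
from multiprocessing.pool import ThreadPool


def summarise_one(job):
    """Hypothetical per-data-set job: mean of `values` within each zone."""
    zones, values = job
    index = zones.ravel()
    total = np.bincount(index, weights=values.ravel())
    count = np.bincount(index)
    return total / count


def run_jobs(jobs, n_workers=4, chunksize=8):
    """Send jobs to a worker pool, `chunksize` jobs per dispatch."""
    with ThreadPool(processes=n_workers) as pool:
        # map() preserves job order; a larger chunksize means fewer,
        # bigger dispatches and therefore less scheduling overhead.
        return pool.map(summarise_one, jobs, chunksize=chunksize)


# Assumed toy data: one zone raster shared by 32 synthetic value rasters.
rng = np.random.default_rng(0)
zones = rng.integers(0, 5, size=(50, 50))
jobs = [(zones, rng.random((50, 50))) for _ in range(32)]

serial = [summarise_one(job) for job in jobs]
parallel = run_jobs(jobs, n_workers=4, chunksize=8)

print(len(parallel), "data sets summarised")
print(all(np.allclose(s, p) for s, p in zip(serial, parallel)))
```

`multiprocessing.Pool` offers the same `map(func, iterable, chunksize=...)` interface for process-based workers; on platforms that start workers by spawning, the pool creation would then need to sit under an `if __name__ == "__main__":` guard. Varying `n_workers` and `chunksize` in a timing loop is one simple way to explore the trade-offs discussed here.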

Acknowledgements

The authors are grateful for the support of CSIRO’s Sustainable Agriculture Flagship, the National Basic Research Program of China (Grant No. 2012CB955304), the WRON Data Centre, and the China Scholarship Council. Efforts and comments from Andrew Higgins, Mike Grundy, and two anonymous reviewers greatly improved the manuscript.

References (27)

B.A. Bryan et al. Modelling and mapping agricultural opportunity costs to guide landscape planning for natural resource management. Ecological Indicators (2011)

C. Devendra et al. Crop–animal systems in Asia: importance of livestock and characterisation of agro-ecological zones. Agricultural Systems (2002)

J.W. Hansen et al. Scaling-up crop models for climate variability applications. Agricultural Systems (2000)

S.J. Jeffrey et al. Using spatial interpolation to construct a comprehensive archive of Australian climate data. Environmental Modelling and Software (2001)

B.A. Keating et al. An overview of APSIM, a model designed for farming systems simulation. European Journal of Agronomy (2003)

Q.Y. Luo et al. Quantitative and visual assessments of climate change impacts on South Australian wheat production. Agricultural Systems (2003)

O. Marinoni et al. Development of a system to produce maps of agricultural profit on a continental scale: an example for Australia. Agricultural Systems (2012)

P. Reidsma et al. Regional crop modelling in Europe: the impact of climatic conditions and farm characteristics on maize yields. Agricultural Systems (2009)

G.R. Safir et al. Simulation of corn yields in the upper great lakes region of the US using a modeling framework. Computers and Electronics in Agriculture (2008)

P.K. Thornton et al. Spatial variation of crop yield response to climate change in East Africa. Global Environmental Change (2009)

E. Wang et al. Modelling farming systems performance at catchment and regional scales to support natural resource management. NJAS – Wageningen Journal of Life Sciences (2009)

ABARE, 2006. Australian Crop Report. Australian Bureau of Agricultural and Resource Economics and...

Bryan, B.A., 2012. High-performance computing tools for the integrated assessment and modelling of social–ecological...


Application note