
1 Introduction

Urban zoning is a common practice adopted by many developed countries for urban planning [1]. The primary purpose of urban zoning is to segregate an urban area into distinct zones with regard to the use of spaces (e.g., residential, commercial, industrial), while specifying the height and bulk of structures, the lot dimensions, and open space requirements [2]. Urban planners, administrators, and policy makers rely on urban zoning maps to analyze, predict, and plan for urban development.

Conventional urban zoning approaches [2] require tremendous manual effort; hence, they are time-consuming, error-prone, and non-scalable. Routinely updating a zoning map typically requires several months of intensive labor. Therefore, an automatic approach to urban zoning is highly desirable and deserves in-depth study.

Existing efforts (e.g., [3,4,5,6,7,8,9,10,11,12,13,14,15,16]) have classified land use and land cover using remotely sensed data (i.e., satellite and aerial images). Several methods (e.g., [3, 9,10,11]) applied image segmentation techniques to aerial images. However, manually segmenting such images is laborious and challenging: human effort is needed to visually interpret every pixel from single or multiple bands. Hence, automatic semantic segmentation techniques (e.g., [4,5,6,7,8, 14, 16]) have also been proposed.

Thanks to the rise of social networking services (e.g., Flickr, Facebook), an enormous number of street-view photos with geo-tagged information are publicly shared. This sort of data carries detailed semantic information about different places and thus could help interpret zoning information. In this paper, we explore the use of street-view photos and satellite images for automatic urban zoning. Specifically, we propose an urban zoning method using multi-source data comprising top-view satellite images and street-view photos. The multi-source data is fused in a higher-order Markov random fields (HO-MRF) model. In the model, a multi-scale deep convolutional neural network (MS-CNN) is built to segment the top-view data and used in the lower-order potentials, while the street-view photos are classified and added in the higher-order potentials. We conducted extensive experiments to investigate various aspects of our proposed solution. In particular, we investigated different features and classifiers for classifying street-view photos. We compared the use of multi-source vs. single-source data. We also compared our proposed HO-MRF model with the conventional MRF, and our deep neural network with existing network architectures.

It is important to note that urban zoning conceptually differs from land cover and land use despite their correlation. Land cover refers to the observed physical cover on the earth's surface. Land use refers to the activities people undertake on a certain type of land cover to change it, maintain it, or produce from it [17]. Urban zoning, on the other hand, refers to segregating an urban area into distinct zones by the use of buildings and spaces within each zone. It provides a convenient means to visualize patterns of social and economic development.

The remainder of the paper is organized as follows. Section 2 reviews the related work. Our proposed method is presented in Sect. 3. Datasets and experiments are described in Sect. 4. Section 5 concludes the paper and provides remarks.

2 Related Work

2.1 Land Use and Land Cover Classification

Early works (e.g., [18, 19]) successfully applied satellite sensor technology to monitoring agricultural land use, motivating recent attempts to apply similar technologies to analyzing land use and land cover. Barnsley and Barr [20] proposed to extract land use information from a land cover classification of multi-spectral images captured by a satellite sensor. However, how sensitive the derived land use patterns are to the accuracy of the initial land cover classification is yet to be determined.

Brown et al. [3] analyzed the relationship between land use and land cover from a spatial-temporal perspective using a Markov transition probability model. Porway et al. [12, 13] proposed a hierarchical and contextual model for aerial image understanding. Lienou et al. [15] annotated satellite images using semantic concepts related to land use. The annotation task combined the classification of image patches and the integration of the spatial information between these patches. Rozenstein and Karnieli [4] introduced Geographical Information Systems (GIS) to facilitate land use classification based on hybrid supervised and unsupervised learning on remote sensing data. Hu and Wang [14] used a decision tree with remote-sensing data to classify urban land use classes. Banerjee et al. [7] applied cluster ensemble techniques to support self-training-based, unsupervised land cover classification on satellite images, to overcome the challenge of limited information in data distribution. Luus et al. [16] introduced multi-view deep learning to the field of land use classification.

There are also recent works applying semantic segmentation techniques to land use and land cover classification. For example, Frohlich et al. [6] used iterative context forests to classify land cover from satellite images by considering contextual information. Volpi and Ferrari [9] segmented satellite images using conditional random fields. Albert et al. [11] proposed a method for the simultaneous classification of land cover and land use that takes spatial context into consideration.

2.2 Urban Understanding from Street-View Photos

The abundance of street-view photos provides new opportunities for computer vision research on understanding urban areas. For example, previous works have demonstrated using such data for city identification [21, 22], geo-informative social attributes prediction [23] and urban perception [24, 25]. Recently, Dubey et al. quantified the perception of urban environment at the global scale by training a convolutional neural architecture on a new crowd-sourced dataset [26].

An early attempt was made by Leung and Newsam [27] on using street-view photos for classifying land use and land cover. To measure social development, they formulated the problem as supervised binary classification. Oba et al. [28] proposed text features in addition to visual features to improve land cover classification. Frey et al. [29] applied unsupervised learning to automatically characterize large geographical regions into location types.

2.3 Correlation Between Top-View and Street-View Imagery Data

Recently, the correlation between top-view and street-view imagery data has been exploited for scene understanding. For example, Lin et al. [30] proposed to learn features using deep networks for cross-view image matching, where the geo-location of a street-view query image on an aerial image is determined via feature matching. In [31], Máttyus et al. proposed an automatic road segmentation method for vehicles using both aerial and ground-view imagery data. Specifically, ground-view images of a road are captured by a vehicle-mounted stereo camera and paired with aerial imagery retrieved via GPS to reason about the road surface and perform road segmentation. In [32], the functions of buildings were classified using both ground-level and overhead images. Convolutional neural networks (CNNs) were used to learn visual features at both the ground and overhead levels. The ground-level feature map for an input overhead image was then constructed by applying kernel regression to the ground-level features extracted by the CNNs from ground-level images.

Fig. 1. Our approach infers a reliable urban zoning map from top-view satellite images and street-view photos.

3 Proposed Method

3.1 Higher-Order Markov Random Fields

The problem of urban zoning can be described as follows. Given a satellite image S covering an urban area U and a set of randomly downloaded street-view photos \(G=\{g_i\}\) located within U and associated with geo-tagged information, the problem is to infer a possible zoning map of U at the metropolis level. Figure 1 illustrates the proposed urban zoning system.

We formulate the problem of urban zoning as segmenting the satellite image S into a number of regions, called zones, and identifying the zone types of the regions. The zone type of each region is determined by the visual information extracted from the pixels of that region and the associated street-view photos.

Fig. 2. Examples of street-view photos from New York, San Francisco, and Boston.

Following the definition of the uses of urban buildings and spaces [2], in this paper we categorize urban zones into 4 types: Residential, Commercial, Industrial, and Others. This categorization ensures the generality of the problem. Let Z be the set of zone types; \(|Z|=4\) in our case. Figure 2 shows examples of street-view photos from different zone types.

Technically, the problem stated above can be regarded as semantic segmentation of the satellite image S. Following this idea, we propose a higher-order Markov random fields (HO-MRF) model to solve the problem. In our HO-MRF model, unary terms are computed from visual features extracted on the satellite image S via a deep convolutional neural network. The relationships between the satellite image S and its associated street-view photos G are encoded in higher order potentials and augmented to the HO-MRF model. Figure 3 summarizes the workflow of this solution.

The HO-MRF model is constructed as follows. The input satellite image S is represented as a lattice of pixels. Each pixel \(\mathbf {p}_i \in S\) is considered as a node and its label is denoted as \(l_i\) taking value in Z. Akin to fully connected MRFs [33], each pixel is connected to all other pixels.

Fig. 3. The workflow of our approach. From left to right: input data (1st column); satellite image pixel classification and street-view photo classification (2nd column); zoning using the HO-MRF (3rd column); output (4th column).

The zoning problem is equivalent to finding the best configuration \(\mathcal {L}=(l_1, l_2, ..., l_{|S|})\) for the |S| pixels of the satellite image S. In particular, we minimize the energy function,

$$\begin{aligned} E(\mathcal {L}) = \sum _{\mathbf {p}_i \in S} \psi _i(l_i) + \sum _{\mathbf {p}_i, \mathbf {p}_j \in S, i \ne j} \psi _{ij}(l_i, l_j) + \sum _{\mathbf {p}_i \in S} \varphi (l_i, G) + \sum _{g \in G} \varphi (g). \end{aligned}$$
(1)

The unary potential \(\psi _i(l_i)\) in (1) is defined as,

$$\begin{aligned} \psi _i(l_i=z) \propto - \log C(S, \mathbf {p}_i | l_i=z), \end{aligned}$$
(2)

where C is the classification score for assigning zone type z to pixel \(\mathbf {p}_i\) (i.e., \(l_i=z\)) based on the input satellite image S. The computation of C is presented in detail in Sect. 3.2.

The pairwise potential \(\psi _{ij}(l_i, l_j)\) is defined as a mixture of Gaussians of the location and color information of image pixels on S. In particular,

$$\begin{aligned} \psi _{ij}(l_i, l_j) = \mu _{ij} \bigg [&\exp \bigg ( -\frac{|\mathbf {p}_i - \mathbf {p}_j|^2}{2\alpha ^2} - \frac{|\mathbf {c}_i - \mathbf {c}_j|^2}{2\beta ^2} \bigg ) + \exp \bigg ( -\frac{|\mathbf {p}_i - \mathbf {p}_j|^2}{2\gamma ^2} \bigg )\bigg ] \end{aligned}$$
(3)

where \(\mathbf {c}_i/\mathbf {c}_j\) is the color vector at pixel \(\mathbf {p}_i/\mathbf {p}_j\), \(\mathbf {p}_i\) is the location vector (i.e., the x- and y-coordinates) of pixel i, and \(\mu _{ij}\) is the Potts label compatibility function [33], i.e.,

$$\begin{aligned} \mu _{ij} = {\left\{ \begin{array}{ll} -1, &{} \text{ if } l_i = l_j\\ 1, &{} \text{ otherwise }. \end{array}\right. } \end{aligned}$$
(4)
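To make (3) and (4) concrete, the following sketch evaluates the pairwise term for a single pixel pair. The kernel bandwidths \(\alpha\), \(\beta\), and \(\gamma\) shown here are illustrative placeholders, not tuned settings:

```python
import numpy as np

def pairwise_potential(p_i, p_j, c_i, c_j, same_label,
                       alpha=60.0, beta=10.0, gamma=3.0):
    """Pairwise term of Eq. (3) for one pixel pair; mu follows Eq. (4)."""
    mu = -1.0 if same_label else 1.0
    d2 = float(np.sum((p_i - p_j) ** 2))          # squared spatial distance
    appearance = np.exp(-d2 / (2 * alpha ** 2)
                        - float(np.sum((c_i - c_j) ** 2)) / (2 * beta ** 2))
    smoothness = np.exp(-d2 / (2 * gamma ** 2))
    return mu * (appearance + smoothness)
```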

In the HO-MRF model, we introduce higher-order potentials (i.e., \(\varphi (l_i,G)\) and \(\varphi (g)\)) capturing the relationships between S and G. The term \(\varphi (l_i,G)\) encodes the zone consistency between a pixel \(\mathbf {p}_i\) on S and its nearest street-view photo in G. Note that since every street-view photo is associated with a geo-location, the distance between a pixel \(\mathbf {p}_i \in S\) and a street-view photo can be determined. In particular, we define,

$$\begin{aligned} \varphi (l_i,G) \propto -\log \frac{1}{\min _{\{g \in G \,|\, f(g) = l_i\}} d(\mathbf {p}_i, g)}, \end{aligned}$$
(5)

where f(g) is a function returning the zone type of g (described in Sect. 3.3) and \(d(\mathbf {p}_i, g)\) is the spatial distance between \(\mathbf {p}_i\) and g. Intuitively, the closer \(\mathbf {p}_i\) is to a street-view photo of zone type \(l_i\), the lower the cost \(\varphi (l_i,G)\); in other words, the zone type of \(\mathbf {p}_i\) is influenced most strongly by its nearest street-view photo.

Note that \(\varphi (l_i,G)\) needs to be computed for every \(\mathbf {p}_i\). To reduce the computational cost, the Distance Transform proposed in [34] is applied to grids formed by the locations of the street-view photos G. In particular, the zone type of each street-view photo is first obtained (see Sect. 3.3). Then, for every zone type \(z \in Z\), a Distance Transform \(D_z\) is computed over the street-view photos classified as zone type z. The potential \(\varphi (l_i,G)\) can then be rewritten as,

$$\begin{aligned} \varphi (l_i,G) \propto -\log \frac{1}{D_{l_i}(\mathbf {p}_i)}, \end{aligned}$$
(6)

where \(D_{l_i}(\mathbf {p}_i)\) is the value of the Distance Transform \(D_{l_i}\) at location \(\mathbf {p}_i\).
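As an illustration, the potential in (6) can be tabulated for all pixels with an off-the-shelf Euclidean distance transform; this sketch uses SciPy's transform as a stand-in for the transform of [34], and the \(+1\) offset that avoids \(\log 0\) at photo locations is an assumption of the sketch:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def higher_order_unary(photo_locs, shape, n_zones):
    """Tabulate phi(l_i, G) of Eq. (6) for all pixels and zone types.

    photo_locs: dict mapping zone type z -> list of (row, col) photo
    locations on the satellite image grid (classified as in Sect. 3.3).
    """
    H, W = shape
    phi = np.zeros((n_zones, H, W))
    for z in range(n_zones):
        mask = np.ones((H, W), dtype=bool)        # True = no photo of type z
        for r, c in photo_locs.get(z, []):
            mask[r, c] = False
        D = distance_transform_edt(mask)          # distance to nearest type-z photo
        phi[z] = np.log(D + 1.0)                  # -log(1/D); +1 avoids log(0)
    return phi
```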

The term \(\varphi (g)\) represents the zone consistency of the pixels in a local image region (on S) at which the street-view photo g could have been captured. Specifically, given g, its geo-location on S can be obtained, and a local image region R(g) of size \(W\,\times \,W\) is extracted at this geo-location. In our implementation, W is set to 46, which is also the size of the image patches used for classifying pixels of the satellite image S (see Sect. 3.2). We then construct a probability distribution \(P_g(l_k)\) over the labels \(l_k \in \mathcal {L}\) of the pixels k inside R(g). The cost \(\varphi (g)\) is then computed from the entropy of \(P_g(l_k)\) as,

$$\begin{aligned} \varphi (g) \propto - \sum _{l_k \in \mathcal {L} | \mathbf {p}_k \in R(g)}{P_g(l_k) \log {P_g(l_k)}} \end{aligned}$$
(7)
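A minimal sketch of (7), assuming \(P_g\) is estimated from the current per-pixel label marginals Q inside R(g):

```python
import numpy as np

def region_entropy(Q, center, W=46):
    """phi(g) of Eq. (7): entropy of the label distribution inside R(g)."""
    r, c = center
    h = W // 2
    patch = Q[max(r - h, 0):r + h, max(c - h, 0):c + h]  # (h', w', |Z|) marginals
    P = patch.reshape(-1, patch.shape[-1]).mean(axis=0)  # P_g over zone types
    P = P / P.sum()
    return float(-np.sum(P * np.log(P + 1e-12)))
```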
Fig. 4. Zoning of New York City. The initial label of each pixel is determined by the zone type of its nearest street-view photo.

The optimization problem in (1) can be solved using the variational mean field method [35, 36]. In particular, solving (1) is equivalent to finding the maximum a posteriori (MAP) solution of \(p(\mathcal {L}|S=\{\mathbf {p}_i\})\). In variational mean field, this distribution is approximated by a fully factorized variational distribution Q, i.e., \(Q(\mathcal {L})=\prod _i Q_i(l_i)\). The solution of (1) is finally obtained by iteratively updating \(Q_i(l_i)\) as follows,

$$\begin{aligned} Q_i(l_i=z) = \frac{1}{Z_i} \exp \bigg \{ -\psi _i(z) - \sum _{j \ne i} \sum _{l_j \in Z} Q_j(l_j)\, \psi _{ij}(z, l_j) - \varphi (z, G) - \sum _{g \in G} \sum _{\{l_j | \mathbf {p}_j \in R(g), l_i=z\}} \mathbf {Q}(R(g) - \mathbf {p}_i)\, \varphi (g) \bigg \}, \end{aligned}$$
(8)

where R(g) is an image patch on S centered at the location of g, \(\mathbf {Q}\) is the variational distribution of the higher-order terms [37], \(R(g) - \mathbf {p}_i\) is the set of pixels in R(g) except \(\mathbf {p}_i\), and \(Z_i\) is the partition function. We compute the higher-order term \(\sum _{\{l_j | \mathbf {p}_j \in R(g), l_i=z\}} \mathbf {Q}(R(g) - \mathbf {p}_i) \varphi (g)\) as,

$$\begin{aligned} \sum _{\{l_j | \mathbf {p}_j \in R(g), l_i=z\}} \mathbf {Q}(R(g) - \mathbf {p}_i)\, \varphi (g) = \sum _{\{l_j | \mathbf {p}_j \in R(g), l_i=z\}} \prod _{\mathbf {p}_j \in R(g) - \mathbf {p}_i} Q_j(l_j)\, \varphi (g). \end{aligned}$$
(9)

In our implementation, the label of each pixel of the satellite image S was initialized with the label of its nearest street-view photo. Figure 4 shows an example of the zone segmentation of New York City.
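For intuition, the sketch below runs a simplified mean-field loop that keeps the unary term and the higher-order unary of (6), and approximates the pairwise messages of (3) with a Gaussian blur of the marginals (smoothness kernel only); the full update of (8), including \(\varphi(g)\) and the appearance kernel, is omitted for brevity:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mean_field(unary, ho, gamma=3.0, n_iters=10):
    # unary: (H, W, |Z|) costs from the MS-CNN (Eq. (2));
    # ho: (H, W, |Z|) higher-order unaries from Eq. (6).
    Q = np.exp(-(unary + ho))
    Q /= Q.sum(-1, keepdims=True)
    for _ in range(n_iters):
        blur = np.stack([gaussian_filter(Q[..., k], gamma)
                         for k in range(Q.shape[-1])], axis=-1)
        msg = blur.sum(-1, keepdims=True) - blur   # mass of disagreeing labels nearby
        logit = -(unary + ho + msg)
        Q = np.exp(logit - logit.max(-1, keepdims=True))
        Q /= Q.sum(-1, keepdims=True)
    return Q.argmax(-1)                            # zone label per pixel
```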

3.2 Classifying Satellite Image Pixels

This section describes the classification of the zone type of each pixel \(\mathbf {p}_i\) of the satellite image S, i.e., the estimation of \(C(S, \mathbf {p}_i | l_i)\) in (2). Like the work by Farabet et al. [38], our network receives as input the YUV image of S. A pyramid of three scales (\(S^0=S\), \(S^1=\frac{1}{2}S\), \(S^2=\frac{1}{2}S^1\)) is then created. For each pixel \(\mathbf {p}_i\), three local image patches \(I^0_i\), \(I^1_i\), and \(I^2_i\) of size \(46\,\times \,46\) pixels centered at \(\mathbf {p}_i\) on \(S^0\), \(S^1\), and \(S^2\) are extracted.
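A sketch of this step, assuming OpenCV for the color conversion and pyramid construction; the reflection padding at image borders is an assumption of the sketch:

```python
import cv2
import numpy as np

def multiscale_patches(satellite_bgr, row, col, patch=46):
    # Build the 3-scale YUV pyramid S0, S1, S2 and cut a 46 x 46 patch
    # centered at (row, col) from each scale.
    yuv = cv2.cvtColor(satellite_bgr, cv2.COLOR_BGR2YUV).astype(np.float32)
    pyramid = [yuv, cv2.pyrDown(yuv), cv2.pyrDown(cv2.pyrDown(yuv))]
    h, patches = patch // 2, []
    for s, img in enumerate(pyramid):
        r, c = row // (2 ** s), col // (2 ** s)
        padded = cv2.copyMakeBorder(img, h, h, h, h, cv2.BORDER_REFLECT)
        patches.append(padded[r:r + patch, c:c + patch])
    return patches  # [I0, I1, I2], each (46, 46, 3)
```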

Fig. 5. MS-CNN for classifying satellite image pixels.

A multi-scale convolutional neural network (MS-CNN) is then constructed to learn appearance features and classify local image patches at the various scales. The multi-scale approach enables learning scale-invariant features, and the multi-scale regions around a pixel incorporate contextual information useful for the classification. In the MS-CNN, a 5-layer sub-CNN interleaving convolutional and max-pooling layers is formed for each scale. The 1st and 3rd layers are obtained from banks of \(7\,\times \,7\) filters; of the 16 first-layer filters, 10 connect to the Y channel and the other 6 connect to the U and V channels. The 2nd and 4th layers result from \(2\,\times \,2\) max-pooling operations over 16 and 64 feature maps, respectively. The 5th layer is formed by 256 \(7\,\times \,7\) filters. The outputs of all sub-CNNs are concatenated and fed to a two-layer fully connected network for classification; its first layer includes 1024 nodes and its second layer contains 4 nodes corresponding to the 4 zone types. Figure 5 illustrates the MS-CNN. In the network, softmax is used as the activation function for the last layer of the sub-CNNs and the last layer of the fully connected network, and tanh is used for all other layers.
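The following PyTorch sketch mirrors this architecture under stated assumptions: the 3rd layer is given 64 maps (inferred from the 4th-layer pooling), and the softmax of the final layers is folded into the loss rather than applied as an explicit activation:

```python
import torch
import torch.nn as nn

class SubCNN(nn.Module):
    """One shared-weight branch of the MS-CNN for a single scale."""
    def __init__(self):
        super().__init__()
        self.conv_y = nn.Conv2d(1, 10, 7)   # 10 of the 16 first-layer filters see Y
        self.conv_uv = nn.Conv2d(2, 6, 7)   # the other 6 see U and V
        self.pool = nn.MaxPool2d(2)
        self.conv3 = nn.Conv2d(16, 64, 7)
        self.conv5 = nn.Conv2d(64, 256, 7)
        self.act = nn.Tanh()

    def forward(self, x):                          # x: (B, 3, 46, 46) YUV patch
        h = torch.cat([self.conv_y(x[:, :1]), self.conv_uv(x[:, 1:])], dim=1)
        h = self.pool(self.act(h))                 # 46 -> 40 -> 20
        h = self.pool(self.act(self.conv3(h)))     # 20 -> 14 -> 7
        return self.act(self.conv5(h)).flatten(1)  # 7 -> 1, gives (B, 256)

class MSCNN(nn.Module):
    def __init__(self, n_zones=4):
        super().__init__()
        self.branch = SubCNN()                     # weights shared across scales
        self.fc = nn.Sequential(nn.Linear(3 * 256, 1024), nn.Tanh(),
                                nn.Linear(1024, n_zones))

    def forward(self, patches):                    # list of 3 tensors, one per scale
        feats = torch.cat([self.branch(p) for p in patches], dim=1)
        return self.fc(feats)                      # logits; softmax applied in the loss
```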

The MS-CNN is trained by optimizing a cross-entropy loss with stochastic gradient descent. The batch size is set to 100 and the learning rate to \(10^{-3}\); we found that these settings achieved the best performance. Since the training data may be biased, the loss function is weighted relative to the proportions of the class labels in the training dataset. In addition, all sub-CNNs share the same parameters, including weights and biases. As indicated by Farabet et al. [38], imposing complete weight sharing across scales is a natural way of forcing the network to learn scale-invariant features and, at the same time, of reducing the chance of over-fitting. Given the MS-CNN, \(C(S, \mathbf {p}_i | l_i=z)\) is computed as the classification score of pixel \(\mathbf {p}_i\) for zone type z, i.e., the response of the MS-CNN at zone type z.
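A minimal training loop consistent with these settings; class_freq (the zone-type proportions in the training set), model, and loader are assumed to be defined elsewhere:

```python
import torch
import torch.nn as nn

# class_freq: zone-type proportions; model: the MSCNN sketch above;
# loader yields ([I0, I1, I2], zone ids) with batch size 100.
weights = 1.0 / torch.tensor(class_freq, dtype=torch.float32)
criterion = nn.CrossEntropyLoss(weight=weights / weights.sum())
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for patches, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(patches), labels)
    loss.backward()
    optimizer.step()
```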

3.3 Classifying Street-View Photos

Recall that in (5), the HO-MRF requires a function f(g) that returns the zone type of a street-view photo g. Intuitively, if the geo-location of each photo could be paired with its corresponding zone type, we would already have a rough zoning map. To recognize street-view photos, features need to be determined. Well-known features such as GIST [39], HOG [40], and local binary patterns (LBP) [41] have been proposed in the literature. However, those features are handcrafted and thus require specific knowledge of particular fields. Recently, deep neural networks have been used to automatically learn features and train classifiers [42].

Inspired by the relevance of the information conveyed by deep networks for scene recognition, we choose the Places-CNN, a CNN trained on the Places Database [43]. The Places-CNN can recognize 205 types of objects and scenes in everyday photos. We note that the object types and scenes of interest to the Places-CNN (e.g., buildings, skyscrapers, warehouses) are semantically close to the basic concepts of urban zoning (e.g., commercial, industrial). In our implementation, a street-view photo g is passed to the Places-CNN, and the output is a vector \(\mathbf {x}(g)=\{x(g)_1,...,x(g)_{205}\}\) representing how probable it is that the photo g contains each of the 205 types of objects and scenes.
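A sketch of the feature extraction, assuming a Places205 network ported to PyTorch (the original release is a Caffe model) and standard preprocessing:

```python
import torch
import torch.nn.functional as F

def street_view_features(photo, places_cnn):
    """x(g): the 205 Places-CNN responses for one preprocessed photo.

    places_cnn: a pretrained Places205 network (assumed available as a
    torch module); photo: a (1, 3, 227, 227) tensor.
    """
    places_cnn.eval()
    with torch.no_grad():
        logits = places_cnn(photo)
    return F.softmax(logits, dim=1).squeeze(0).numpy()  # shape (205,)
```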

Given the features, different classifiers could be used. However, different types of features affect the performance of a classifier differently, so extensive experiments are often conducted in practice for classifier selection. In this paper, we adopt Random Forests (RF) [44] as the classifier for a number of reasons. First, RF is known for its capability of effectively exploiting and integrating multiple classifiers, each of which has its own preference for particular feature types and label classes. Second, as our experimental results show, RF applied to the features generated by the Places-CNN works well on our problem compared with other classifiers.

Our RF classifier f consists of multiple weak classifiers [45]. The label predicted by f for an input photo g is a weighted majority vote of the predictions made by the individual classifiers. Specifically, let \(h_1, ..., h_M\) be the M weak classifiers. The predicted label f(g) of a photo g is computed as,

$$\begin{aligned} f(g) = \mathop {\arg \max }\limits _{z \in Z} \sum _{k=1}^{M} w_k \, I\big (h_k(\mathbf {x}(g)) = z\big ), \end{aligned}$$
(10)

where \(\mathbf {x}(g)\) is the 205-dimensional feature vector extracted using the Places-CNN, \(I(\cdot )\) is the indicator function, and \(w_k\) is the weight of the k-th weak classifier \(h_k\).
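A sketch using scikit-learn with 15 trees (the setting reported in Sect. 4.2); note that scikit-learn's forest averages tree probabilities rather than applying the explicit weights \(w_k\) of (10), a common stand-in for the weighted vote:

```python
from sklearn.ensemble import RandomForestClassifier

# X_train: per-photo 205-dim Places-CNN vectors (assumed precomputed);
# y_train: ground-truth zone types; x_g: the feature vector of one photo g.
rf = RandomForestClassifier(n_estimators=15, random_state=0)  # 15 trees
rf.fit(X_train, y_train)
zone = rf.predict(x_g.reshape(1, -1))[0]                      # f(g) in Eq. (10)
```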

Table 1. Numbers of street-view photos.

4 Experiments

4.1 Dataset

Our experiments were conducted on the map data of three metropolises: New York, San Francisco and Boston.

Satellite Data. Satellite images are our top-view imagery data, which can be obtained without expensive devices or professional expertise. We downloaded the satellite image tiles for each city within its corresponding geographical boundary from the National Map service operated by the U.S. Geological Survey [46].

Street-View Data. Popular social network services provide public access to many of their street-view photos and the associated geo-tagged information. We collected 490,329 photos of the three metropolises from Flickr and Panoramio. Specifically, we queried the URLs of outdoor photos by calling the APIs of Flickr and Panoramio within the geographical boundary (the minimum bounding box) of each city. Photos within each city were downloaded after their coordinates were verified by GIS. Table 1 summarizes the number of street-view photos used for each metropolis. Figure 6 shows the distribution of street-view photos in New York.

Table 2. Proportion of zones in ground-truth zoning maps.

Urban Zoning Maps. We collected urban zoning maps from the websites of local governments; they are stored in the standard GIS format, SHAPEFILE [47]. The zone types include Residential, Commercial, Industrial, and Others. These maps served as the ground truth for our evaluations. Table 2 shows the percentage of each zone type in the ground-truth zoning maps.

Fig. 6. The plot map of street-view photos in New York City. Each photo is marked with a color corresponding to its zone type.

4.2 Classifying Street-View Photos

We first evaluated street-view photo classification. In this experiment, we investigated different feature types and classifiers that are well known for scene classification. Specifically, for features, we evaluated GIST [39], HOG [40], and Places-CNN [43].

For classifiers, we compared the RBF-kernel SVM, k-nearest neighbors, and RF. We found that the RBF-kernel SVM significantly outperformed the linear SVM. For k-nearest neighbors, we set k to 10, which gave the best performance for this technique, and measured the similarity between samples with the Euclidean distance. For the RF, 15 trees were used. All classifiers were evaluated using 3-fold cross validation, as sketched below.
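The comparison can be reproduced along the following lines, where X holds the per-photo feature vectors and y the ground-truth zone types (both assumed precomputed):

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "SVM (RBF)": SVC(kernel="rbf"),
    "10-NN (Euclidean)": KNeighborsClassifier(n_neighbors=10),
    "RF (15 trees)": RandomForestClassifier(n_estimators=15),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=3)   # 3-fold cross validation
    print(f"{name}: {scores.mean():.3f}")
```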

Table 3 summarizes the results of this experiment. As shown, in most cases the combination of Places-CNN features with RF achieves the best performance. It is also worth noting that, compared with handcrafted features, the features learned by the Places-CNN often yield higher classification accuracy and achieve the best overall performance irrespective of the classifier.

We also evaluated the use of a neural network for street-view photo classification. Specifically, we directly connected the outputs of the Places-CNN to a fully connected neural network (with one hidden layer and 4 outputs corresponding to the 4 zone types). The results of this experiment are reported in Table 4. As shown in Tables 3 and 4, applying RF on top of the Places-CNN achieves the best overall performance.

Table 3. Accuracy of street-view photo classification.
Table 4. Accuracy of street-view photo classification using Places-CNN and fully connected neural network.
Fig. 7. Zoning results. For each city, the ground truth is shown on the left and the zoning map result on the right.

4.3 Zoning

Since zoning is formulated as semantic segmentation of satellite images, its performance can be evaluated via segmentation accuracy, i.e., the accuracy of classifying satellite image pixels using the proposed MS-CNN. In our implementation, the MS-CNN was trained and tested in 3-fold cross-validation fashion (i.e., two cities were used for training while the remaining one was used for testing). To increase the data for training/testing the MS-CNN, 6 satellite images were created for each city, 5 of which were obtained by rotating the original one by multiples of \(60^{\circ }\). Each created satellite image was also scaled (with 3 different scales) and densely scanned by a \(46\,\times \,46\)-pixel window. This process resulted in 1,265,572 satellite image windows at each scale. The zoning performance across cities is reported in Table 5 (last column). Zoning results are presented in Fig. 7.
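A sketch of the rotation-based augmentation; the border handling (zero padding by warpAffine) is an assumption of the sketch:

```python
import cv2

def augment_city(satellite):
    """6 satellite images per city: the original plus 5 rotations by multiples of 60 deg."""
    h, w = satellite.shape[:2]
    images = [satellite]
    for k in range(1, 6):
        M = cv2.getRotationMatrix2D((w / 2, h / 2), 60 * k, 1.0)
        images.append(cv2.warpAffine(satellite, M, (w, h)))
    return images
```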

Table 5 also compares the use of a single data source (i.e., satellite data only or street-view photos only) with multi-source data (i.e., our proposed method). Note that using satellite data alone is equivalent to keeping only the 1st term of our energy formulation in (1). When only street-view photos are used, the zone type of each pixel of the satellite image is decided by the zone type of its closest street-view photo (i.e., only the 3rd term in (1) is used).

Table 5. Zoning performance.
Table 6. Comparison of different networks in classifying satellite image pixels.

Recall that we propose the HO-MRF to perform zone segmentation, in which a deep neural network is used to obtain the low-level MRF segmentation and street-view photos are incorporated via higher-order potentials (see (1)). Therefore, we compare the HO-MRF with the conventional MRF (i.e., using only the first two terms in (1)). As shown in Table 5, the use of satellite data outperformed that of street-view photos, and our proposed HO-MRF combining both data sources achieved the best performance on all cities. Specifically, the combination of both sources improved accuracy by up to 10.17% and 7.12% over the single use of top-view and street-view data, respectively. The HO-MRF also gained 8.76% over the conventional MRF.

We also compared our deep network (i.e., the MS-CNN) used to compute the unary terms in (1) with the network recently proposed by Volpi and Tuia [10]. Since the method in [10] requires a digital surface model for each satellite image, which is unavailable in our data, we adapted the network in [10] by changing its input and output layers to fit our data while keeping the other layers and settings (e.g., activation functions) the same. The adapted network was then re-trained and tested on the same data as our network. As shown in Table 6, our network significantly outperformed the one proposed in [10].

4.4 Discussion

We found that the HO-MRF brought a larger improvement over using satellite imagery alone on Boston and San Francisco than on New York, probably because New York contains more industrial regions (as shown in Table 2, industrial zones occupy 14.15% of New York, compared with 6.49% in San Francisco and 3.42% in Boston) and, as shown in Table 3, industrial street-view photos are recognized with lower accuracy than those of other zone types.

Experimental results (e.g., Fig. 7) also show that our method fails to segment tiny/thin regions. This is because those regions occupy small portions of local image windows and are thus biased by nearby larger regions. We also found that the method may fail in regions where fewer street-view images are captured; note that street-view photos are distributed sparsely and non-uniformly (see Fig. 6). We believe that 3D data such as digital surface models and digital elevation models (if available, or approximately obtainable) would be useful for resolving these issues.

5 Conclusion

We proposed a higher-order Markov random fields model for urban zoning based on multi-view imagery data, including top-view satellite images and street-view photos. We also developed a multi-scale deep convolutional neural network for classifying satellite image pixels. By integrating different sources of imagery data, our approach achieves urban zoning automatically and hence overcomes the scalability bottleneck faced by the conventional practice of creating zoning maps manually. We investigated various implementation strategies, including feature types, classification models, and deep architectures, and conducted extensive experiments and comparisons to verify our approach.