The SegNet core is based on one of the simplest deep learning architectures used on images, known as the deep autoencoder. It is composed of an encoding part and a decoding part. The encoding part is made of a few fully connected layers of decreasing sizes. The decoding part is its mirror image, composed of layers of increasing sizes. The autoencoder tries to reproduce the input image. As the data flows through the narrowest part of the network, only the most important bits of information can be kept. This method is used to reduce the number of dimensions of high-dimensional data, or as a denoising tool for images.

Schema designed by Nghiaho.

When we work with images, it is often impractical to design fully connected architectures, as the number of pixels is usually quite high. Instead, we use layers with fewer neurons and convolve them across the image: the model processes a small patch of the image, and we apply that same model at different locations (over a grid) of the input image.

The number of masks (the filter bank) convolved over the same spot determines the number of feature maps extracted from the current patch.

Before the bottleneck, each step makes the data representation smaller but deeper: the image has fewer pixels, but each pixel is better at describing what it represents.

Each step is composed of three layers:

● the filter bank

● the non-linearity layer (which applies a non-linear activation function, most of the time ReLU)

● the pooling layer, which reduces the dimensionality of the input. We often use max pooling, which tiles the image and outputs only the pixel of maximum value for each tile.
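As an illustration, here is a minimal NumPy sketch of one such step (filter bank → non-linearity → max pooling). The loop-based convolution and the single hand-picked filter are simplifications for clarity, not how a real framework implements it:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one filter (mask) over the image: 'valid' cross-correlation."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Non-linearity layer: keep positive activations, zero out the rest."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Tile the feature map and keep the maximum of each size x size tile."""
    h, w = x.shape
    h, w = h - h % size, w - w % size          # drop ragged edges
    tiles = x[:h, :w].reshape(h // size, size, w // size, size)
    return tiles.max(axis=(1, 3))

# One encoder step on a toy 6x6 image with a single 3x3 filter.
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
features = max_pool(relu(conv2d_valid(image, kernel)))
print(features.shape)  # (2, 2): 6x6 -> 4x4 after conv -> 2x2 after pooling
```

A real filter bank applies many such kernels at once, producing one feature map per kernel.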

You can read Convolutional Networks and Applications in Vision, by Yann LeCun, Koray Kavukcuoglu and Clement Farabet for more details.

This is a typical CNN. The image is getting “smaller” in the sense that it has fewer and fewer pixels, but it is also getting “deeper” in the sense that a single pixel describes a lot more information.

Convolutional neural networks are explained in detail in the Stanford class CS231n: Convolutional Neural Networks for Visual Recognition.

The SegNet architecture takes advantage of both of those techniques.

Published by Vijay Badrinarayanan, Alex Kendall and Roberto Cipolla in 2015, it uses an encoder-decoder architecture composed of convolutional layers. The image is first downsampled by a CNN encoder with pooling layers, and then upsampled by a decoder acting as a reversed CNN with upsampling layers.
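A minimal Keras sketch of such an encoder-decoder is shown below. The layer sizes are our own illustrative choices, and we use plain `UpSampling2D` rather than SegNet's pooling-index unpooling, so this is a simplified cousin of the real architecture:

```python
from tensorflow.keras import layers, models

def mini_segnet(input_shape=(256, 256, 3), n_classes=1):
    inputs = layers.Input(shape=input_shape)
    # Encoder: convolutions + pooling make the representation smaller but deeper.
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    # Decoder: upsampling + convolutions restore the original resolution.
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    # One sigmoid output channel: per-pixel building probability.
    outputs = layers.Conv2D(n_classes, 1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)

model = mini_segnet()
print(model.output_shape)  # (None, 256, 256, 1): same spatial size as the input
```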

## Application to the spacenet dataset

Since we only have two classes, we replace the last layer, normally a softmax, with a sigmoid one. The role of a softmax layer is to force the model to take a decision in a classification problem. Say you want to classify a pixel into one of three classes. A neural network will typically produce a vector of 3 probabilities, some of which can be close to each other, like [0.45, 0.38, 0.17].

But what you really want is to know which class this pixel belongs to! Taking the maximum probability would give you a [1, 0, 0] vector, which is what you want, but the max function isn’t differentiable, so your model can’t learn if you use it. A softmax is a kind of differentiable max: it won’t give you exactly a [1, 0, 0] vector, but something really close.
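A quick NumPy illustration of softmax acting as a differentiable max; the temperature parameter is our addition here, to make the "sharpening" effect visible:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Differentiable 'max': lower temperature -> closer to a hard one-hot."""
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())            # shift by the max for numerical stability
    return e / e.sum()

scores = [2.0, 1.7, 0.9]
print(softmax(scores))                    # soft decision: probabilities stay close
print(softmax(scores, temperature=0.05))  # nearly a one-hot [1, 0, 0] vector
```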

The role of a sigmoid function is to output a value between 0 and 1. We use it to obtain the probability that a given pixel is a building pixel, thus obtaining something similar to a heatmap of this probability for each image.

## Implementation

Now that we have presented the SegNet architecture, let’s see how to implement it using the Keras framework paired with TensorFlow as its backend. You can find the script in this gist, which is an adaptation of this implementation, which uses a Theano backend. We used the SpaceNet data, available on AWS, and had to use this script to transform the provided labels from GeoJSON to TIFF images before running our script. During the challenge, the data was already separated into training and testing sets, but the challenge’s S3 bucket is now closed.

Aside from the direct SegNet implementation, note the use of image generators to retrieve the images from their respective directories. We zip them together to obtain a generator of sample/label couples.
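The pattern looks roughly like this; to keep the sketch self-contained we replace Keras’s directory-based generators with a tiny in-memory stand-in, but the zipping idea is the same:

```python
import numpy as np

def batches(arrays, batch_size):
    """Tiny stand-in for a Keras directory generator: yields batches forever."""
    i = 0
    while True:
        yield arrays[i:i + batch_size]
        i = (i + batch_size) % len(arrays)

images = np.random.rand(8, 64, 64, 3)                 # fake input samples
masks = np.random.randint(0, 2, (8, 64, 64, 1))       # fake binary building masks

# Zip the two generators into one generator of (sample, label) couples,
# ready to be passed to model.fit.
train_gen = zip(batches(images, 4), batches(masks, 4))

x, y = next(train_gen)
print(x.shape, y.shape)  # (4, 64, 64, 3) (4, 64, 64, 1)
```

With real directory generators, the images and their masks must be read in the same order (e.g. by sharing a seed) so each sample stays paired with its label.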

You will also note the use of a model checkpoint, which stores the weights of the model at each epoch. Since the learning process is quite long (around two days on a Tesla K40 GPU), it is necessary to be able to recover if anything goes wrong. The 8 GB of GPU memory limited the batch size to 4.
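The checkpoint setup can be sketched as follows; the filename pattern and options here are our own choices, not necessarily those of the gist:

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Save the weights after every epoch so a multi-day training run can be
# resumed from the last completed epoch after a crash.
checkpoint = ModelCheckpoint(
    "segnet_weights.{epoch:03d}.h5",  # one file per epoch
    save_weights_only=True,           # weights suffice to resume with the same code
)

# Then pass it to training, e.g.:
# model.fit(train_gen, steps_per_epoch=steps, epochs=50, callbacks=[checkpoint])
```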

We can see here the result of applying a threshold of 0.5: a pixel is classified as building if and only if its probability of being a building is above 50%. Then the method approximate_polygon of the package skimage.measure is used to trace the contours of the buildings.
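A sketch of this post-processing, using a toy heatmap in place of the model output:

```python
import numpy as np
from skimage import measure

# Fake probability heatmap (in practice: the sigmoid output of the model).
heatmap = np.zeros((64, 64))
heatmap[20:40, 10:30] = 0.9            # one "building" blob

binary = heatmap > 0.5                 # building iff P(building) > 50%
# Trace the blob outlines, then simplify them into polygons.
contours = measure.find_contours(binary.astype(float), 0.5)
polygons = [measure.approximate_polygon(c, tolerance=2.0) for c in contours]
print(len(polygons))  # one building outline
```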

## Results

The pixel precision is quite high (more than 96%), but the contours found barely match the given contours in dense areas, since our model tends to fuse close buildings into a single one.

We see two possible fixes to this problem. The first is to use the standard SegNet to classify each pixel into three categories: inside a building, outside a building, or on a building border. The downside of this method is that we would have to find the weight of each class to adjust the cost function; otherwise no pixel would be classified as a border pixel, because border pixels are under-represented in the data. In the SegNet paper the authors state that “[they] use median frequency balancing where the weight assigned to a class in the loss function is the ratio of the median of class frequencies computed on the entire training set divided by the class frequency. This implies that larger classes in the training set have a weight smaller than 1 and the weights of the smallest classes are the highest”. This solution is not very likely to work, though, because of the low quality of the labels: the building borders are often off by a few pixels. Since there are not many pixels belonging to the border class, and those pixels are often mislabelled, it might be too hard for a neural network to learn.
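The median frequency balancing rule from that quote can be sketched in a few lines (the toy label map below is ours):

```python
import numpy as np

def median_frequency_weights(label_image, n_classes=3):
    """Median frequency balancing: weight_c = median(all freqs) / freq_c."""
    counts = np.bincount(label_image.ravel(), minlength=n_classes).astype(float)
    freqs = counts / counts.sum()
    return np.median(freqs) / freqs

# Toy label map: 0 = outside, 1 = inside, 2 = border (border is rare).
labels = np.zeros((10, 10), dtype=int)
labels[3:7, 3:7] = 1          # "inside" pixels
labels[3, 3:7] = 2            # a thin strip of "border" pixels
weights = median_frequency_weights(labels)
print(weights)  # the rare border class gets the largest weight
```

As the paper says, the dominant class ends up with a weight below 1, while the rare border class is boosted, so misclassifying a border pixel costs more.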

The second solution is to use the preprocessing technique presented by Jiangye Yuan. He builds an image in which each pixel’s value is based on its distance to the closest building border. His segmentation separates close buildings and seems to perform very well. Note that his images have a better resolution (30 cm/pixel versus 50 cm/pixel for ours).
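This distance-based representation can be approximated with SciPy’s Euclidean distance transform; this is a rough sketch of the idea on a toy mask, not Yuan’s exact preprocessing:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# Toy binary building mask (True = building pixel).
mask = np.zeros((32, 32), dtype=bool)
mask[8:20, 8:24] = True

# Distance to the closest building border, computed separately on each side.
inside = distance_transform_edt(mask)      # > 0 inside buildings
outside = distance_transform_edt(~mask)    # > 0 outside buildings
signed_distance = inside - outside         # positive inside, negative outside

print(signed_distance[14, 16], signed_distance[0, 0])  # inside > 0, corner < 0
```

Pixels deep inside a building get large positive values and pixels far from any building large negative ones, so the zero level marks the building borders.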

Without retraining, this model performs badly on our data.

## Conclusion

In this blog post, we have shown you our implementation of SegNet for the detection of buildings in aerial imagery. The model was easily adapted from a classification model to a continuous-prediction one. It is simple enough to be implemented with the higher-level library Keras (unlike the model proposed by Jiangye Yuan) and performs very well in terms of pixel precision. Depending on the problem, the fusion of close buildings might not even be an issue.

Future challenges will be opportunities to improve our model even further, with the help of Yuan’s preprocessing and the latest deep learning breakthroughs.