Street view imagery (SVI) is one of the largest (growing) resources in urban analytics. A global close-up of the urban environment, if you will, which is rich in (untapped) information such as floor count. Floor count is useful in many applications, from improving energy consumpt
...
Street view imagery (SVI) is one of the largest (growing) resources in urban analytics. A global close-up of the urban environment, if you will, which is rich in (untapped) information such as floor count. Floor count is useful in many applications, from improving energy consumption calculations to creation of 3D city models without elevation data. So far, efforts to extract floor count from SVI are mainly approached as a classification problem with the use of convolutional neural networks (CNNs). Limitations of this approach include the need of large (manually annotated) datasets, and uncertainty how these models learn to count storeys. Therefore, we aim to develop a method that can be trained on available datasets and determine floor count in a more explainable manner.
In order to make the floor count determination method more transparent, we mimic the row-wise counting of storeys as humans do: by vertically parsing a column of windows (and occasional door). Façade parsing is a common computer vision task that we can solve with deep learning. In this work, we employ the Mask R-CNN framework, that is trained on publicly available datasets, for the detection and segmentation of windows and doors. Then, the vertical distribution of detected / segmented windows and doors is estimated by computing the kernel density estimation function. The floor count is extracted by finding the number of maxima in the function, as the maxima represent the dense areas of windows and doors on a horizontal axis (i.e. storeys). To improve the results, an automatic image rectification is added as pre-processing step that enforces the regularity and repetitive occurrence of windows and doors. The full pipeline thus consists of three stages: 1) automatic image rectification, 2) window and door detection/ segmentation with Mask RCNN, 3) floor count estimation via maxima finding on the kernel density estimation (KDE) function. In addition, a small "wild" dataset was created that contains a higher variability in floor count, image quality and architectural styles, which better reflect real world SVI than existing façade datasets.
The floor count performance of the full pipeline was evaluated on the Amsterdam Facade (subset), ECP, TRIMS and "wild SVI" datasets. Since floor count annotations were missing, these are manually added. For detection-based data, the best results are an accuracy of 83% and a mean absolute error (MAE) of 0.17. For normalised segmentation-based data, the best results are an accuracy of 80% and a MAE of 0.20. Considering the method is still at its infancy, the results are promising. With further improvements in the pipeline and addition of automatic façade acquisition, the approach can contribute in large scale extraction of floor count information from SVI. To encourage further development, the pipeline prototype, dataset and floor count annotations are open source and will be released on https://github.com/Dobberzoon/Facade2Floorcount.