Research – ArchiMediaL

_Attention-Aware Age-Agnostic Visual Place Recognition
This work proposes a cross-domain visual place recognition (VPR) task: matching images of the same architecture depicted in different domains. VPR is commonly treated as an image retrieval task, in which a query image from an unknown location is matched against relevant instances in a geo-tagged gallery database. Unlike conventional VPR settings, where query and gallery images come from the same domain, we propose a more common but more challenging setup in which the query images are collected under a new, unseen condition. The two domains in this work are contemporary street-view images of Amsterdam from the Mapillary dataset (source domain) and historical images of the same city from the Beeldbank dataset (target domain). We tailor an age-invariant feature-learning CNN that focuses on domain-invariant objects and learns to match images using a weakly supervised ranking loss. We propose an attention aggregation module that is robust to the domain discrepancy between training and test data. In addition, a multi-kernel maximum mean discrepancy (MK-MMD) domain adaptation loss is adopted to improve cross-domain ranking performance. Both the attention and the adaptation modules are unsupervised, while the ranking loss uses weak supervision. Visual inspection shows that the attention module focuses on built forms, while the dramatically changing environment is weighted less. Our proposed CNN achieves state-of-the-art results (99% accuracy) on the single-domain VPR task and at best 20% accuracy on the cross-domain VPR task, which reveals the difficulty of age-invariant VPR.
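
As a rough illustration of the MK-MMD term mentioned above, the following minimal sketch computes a biased estimate of the squared MMD between two small feature sets, using an equally weighted mixture of Gaussian kernels. The function names, the bandwidths, and the toy feature vectors are assumptions made for this example; the abstract does not specify the kernels or how the loss is attached to the network.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def multi_kernel(x, y, bandwidths):
    """Equally weighted average of Gaussian kernels (assumed kernel family)."""
    return sum(gaussian_kernel(x, y, s) for s in bandwidths) / len(bandwidths)


def mean_kernel(xs, ys, bandwidths):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(multi_kernel(x, y, bandwidths) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mk_mmd2(source, target, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """Biased estimate of the squared MK-MMD between two feature sets."""
    return (
        mean_kernel(source, source, bandwidths)
        + mean_kernel(target, target, bandwidths)
        - 2.0 * mean_kernel(source, target, bandwidths)
    )


# Toy features: "near" overlaps the source set, "far" is shifted away from it.
source = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
near = [[0.1, 0.0], [0.9, 0.1], [0.0, 0.9]]
far = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]

print("MK-MMD^2 source vs. near:", mk_mmd2(source, near))
print("MK-MMD^2 source vs. far: ", mk_mmd2(source, far))
```

The estimate is close to zero when the two feature sets overlap and grows as they drift apart, which is why minimizing such a term pushes source and target features toward a shared distribution.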

_Deep Visual City Recognition Visualization
Understanding how cities visually differ from each other is of interest to planners, residents, and historians. We investigate the interpretation of deep features learned by convolutional neural networks (CNNs) for city recognition. Given a trained city recognition network, we first generate weighted masks using the well-known Grad-CAM technique to select the most discriminative regions in the image. Since the image classification label is the city name, it contains no information about which objects are class-discriminative; we therefore investigate the interpretability of the deep representations with two methods. (i) An unsupervised method is used to cluster the objects appearing in the visual explanations. (ii) A pretrained semantic segmentation model is used to label objects at the pixel level, and we then introduce statistical measures to quantitatively evaluate the interpretability of the discriminative objects. We also study how network architectures and random initializations in training influence the interpretability of CNN features for city recognition. The results suggest that the network architecture affects the interpretability of the learned visual representations more than the initialization does.
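
For readers unfamiliar with Grad-CAM, the sketch below shows the core computation on plain nested lists: each channel of a convolutional feature map is weighted by the spatial average of the class-score gradient for that channel, the weighted channels are summed, and negative values are clipped. The tiny hand-written activations and gradients are assumptions for illustration; in practice both would come from a trained network, and the normalization to [0, 1] is a common convenience rather than something the abstract states.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM map.

    activations, gradients: lists of K channel maps, each an H x W nested list.
    Returns an H x W map scaled to [0, 1] (all zeros if nothing is positive).
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average pooling of the gradients.
    alphas = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]

    # Weighted sum over channels followed by ReLU.
    cam = [
        [
            max(0.0, sum(a * act[i][j] for a, act in zip(alphas, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Scale to [0, 1] so the map can be used as a soft mask.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Two 2x2 channels: the first supports the class, the second opposes it.
activations = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
gradients = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
cam = grad_cam(activations, gradients)
print(cam)
```

In this toy case only the top-left location keeps a nonzero weight, because the second channel has a negative average gradient and its contribution is removed by the ReLU.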

_Creating a 4D Street View of Amsterdam from Historical Images
The ArchiMediaL project aims to bridge data science and research on contemporary and historical built environments by developing state-of-the-art AI algorithms for automatically linking available metadata and image repositories. As a case study, we use the 360,000+ historical images from the Amsterdam Beeldbank database.

_Sight-Seeing in the Eyes of Deep Neural Networks
ArchiMediaL deals with the interpretability of convolutional neural networks (CNNs) that predict geolocation from an image. In a pilot experiment, we classify images of Pittsburgh vs. Tokyo and visualize the learned CNN filters. We found that varying the CNN architecture leads to variation in the visualized filters. This calls for further investigation of the parameters that affect the interpretability of CNNs.

_Unsupervised Cross Domain Image Matching with Outlier Detection
Xin Liu | supervision: Jan van Gemert, Seyran Khademi
This work proposes a method for matching images from different domains in an unsupervised manner while simultaneously detecting outlier samples in the target domain. The matching problem is made difficult by (i) images from different domains that are related but captured under different conditions (e.g., photos of the same location under different illumination), (ii) the unsupervised setting, in which paired-image information is available for only one of the domains, and (iii) the existence of outliers, which means that the two domains do not fully overlap. To this end, we propose an end-to-end architecture that matches cross-domain images in an unsupervised manner and handles domains that do not fully overlap through outlier detection. Our architecture is composed of three subnetworks, two of which are fed with pairs of source images to learn the "match" information. The third subnetwork is fed with target images and works together with the other two to learn domain-invariant representations of the source samples and the target inlier samples by applying a weighted multi-kernel maximum mean discrepancy (weighted MK-MMD). We propose the weighted MK-MMD, together with an entropy loss, for outlier detection. During training, the entropy loss iteratively outputs the probability that a target sample is an inlier. These probabilities are used as weights in the weighted MK-MMD so that only the target inlier samples are aligned with the source samples. Extensive experiments on the Office dataset and on our proposed datasets, Shape and Pitts-CycleGAN, show that the proposed approach yields state-of-the-art cross-domain image matching and outlier detection performance on different benchmarks.
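
The role of the inlier probabilities can be illustrated with a small sketch in which each target sample contributes to the discrepancy in proportion to its probability of being an inlier. This is one plausible reading of a weighted MK-MMD, not the exact formulation of the thesis: the normalization of the weights, the Gaussian kernel bandwidths, and the plain binary entropy used as a stand-in for the entropy loss are all assumptions.

```python
import math


def multi_kernel(x, y, bandwidths=(0.5, 1.0, 2.0)):
    """Equally weighted average of Gaussian kernels (assumed kernel family)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(
        math.exp(-sq_dist / (2.0 * s ** 2)) for s in bandwidths
    ) / len(bandwidths)


def weighted_mk_mmd2(source, target, inlier_probs):
    """Squared MMD where target samples are weighted by inlier probability.

    Assumes at least one inlier probability is greater than zero.
    """
    total = sum(inlier_probs)
    weights = [p / total for p in inlier_probs]  # normalize to sum to 1
    n = len(source)

    ss = sum(multi_kernel(x, y) for x in source for y in source) / (n * n)
    tt = sum(
        wi * wj * multi_kernel(x, y)
        for wi, x in zip(weights, target)
        for wj, y in zip(weights, target)
    )
    st = sum(
        wj * multi_kernel(x, y) for x in source for wj, y in zip(weights, target)
    ) / n
    return ss + tt - 2.0 * st


def binary_entropy(p, eps=1e-12):
    """Entropy of an inlier probability; low when the prediction is confident."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


source = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# The last target sample is an outlier far from the source samples.
target = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [8.0, 8.0]]

uniform = weighted_mk_mmd2(source, target, [1.0, 1.0, 1.0, 1.0])
down_weighted = weighted_mk_mmd2(source, target, [1.0, 1.0, 1.0, 0.0])
print("all target samples weighted equally:", uniform)
print("outlier weighted by zero:           ", down_weighted)
print("entropy at p=0.5 vs. p=0.99:", binary_entropy(0.5), binary_entropy(0.99))
```

With the outlier given zero weight, the discrepancy in this toy case drops to approximately zero, so the alignment is driven only by the samples considered inliers.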

_Interpretable Deep Visual Place Recognition
Xiangwei Shi | supervision: Jan van Gemert, Seyran Khademi
We propose a framework to interpret deep convolutional models for visual place classification. Given a deep place classification model, our proposed method produces visual explanations and saliency maps that reveal how the model understands the images. To evaluate the interpretability, the t-SNE algorithm is used to map and visualize these latent visual explanations. Moreover, we use pre-trained semantic segmentation networks to label all objects appearing in the visual explanations of our discriminative models. This work has two main goals. The first is to investigate the consistency of the visual explanations produced by different models. The second is to investigate, in an unsupervised manner, whether the visual explanations are meaningful and interpretable. We find that varying the CNN architecture leads to variations in the discriminative visual explanations, but that these visual explanations are interpretable.
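
One simple way to combine a saliency map with a semantic segmentation is to measure how much of the saliency mass falls on each object class. The sketch below does this for a toy 2x2 image; the statistic, the function name, and the class names are assumptions chosen for illustration and are not necessarily the measure used in the thesis.

```python
def class_attention_share(saliency, labels):
    """Fraction of the total saliency mass that falls on each segmentation class.

    saliency: H x W nested list of non-negative weights.
    labels:   H x W nested list of class names from a segmentation model.
    """
    totals = {}
    for saliency_row, label_row in zip(saliency, labels):
        for weight, label in zip(saliency_row, label_row):
            totals[label] = totals.get(label, 0.0) + weight

    mass = sum(totals.values())
    if mass == 0:
        return {label: 0.0 for label in totals}
    return {label: value / mass for label, value in totals.items()}


# Hypothetical saliency map and per-pixel labels for a 2x2 image.
saliency = [[0.5, 0.5], [0.0, 1.0]]
labels = [["building", "building"], ["sky", "tree"]]

shares = class_attention_share(saliency, labels)
print(shares)
```

Aggregating such shares over many images would indicate which object classes a model relies on most when distinguishing places.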

_Crowdsourcing architectural knowledge: Experts versus non-experts
BSc. Thesis “Information Multimedia and Management”, VU Amsterdam 2018
Patrick Brouwer | supervision: Victor de Boer
A growing number of archive and heritage organizations are digitizing their collections, moving their knowledge into the public domain. Often only limited metadata about these collections is available. This data, while useful, does not provide a way to search through these vast collections with descriptive keywords, such as house or chimney. The ArchiMediaL project aims to solve this problem by using a number of Artificial Intelligence solutions. This paper looks at crowdsourcing as an alternative solution. A group of architecture experts and a group of non-experts were asked to annotate several objects. A team of independent evaluators provided data suggesting that crowdsourcing is a viable option. The data also suggests that the expert annotations were of higher quality than those of the non-experts.

_Creating a Colonial Architecture Pipeline
Mini-master project Artificial Intelligence, VU Amsterdam 2018
Gossa Lo | supervision: Victor de Boer
European colonialism has left its mark on many countries around the world. Traces of this heritage can still be found today in the infrastructure, planning, and architecture of former colonies. Documents and images have since been collected and stored in the online Colonial Architecture repository. This paper investigates how computational linguistics and Linked Data techniques, implemented as a Python pipeline, can contribute to the annotation and formalization of these documents. Finally, we validate the usefulness of the pipeline by testing it on a subset of the Colonial Architecture corpus.

_Enriching the metadata of European colonial maps with crowdsourcing
BSc. Thesis “Information Multimedia and Management”, VU Amsterdam 2018
Rouel de Romas | supervision: Victor de Boer
This paper tests the effectiveness of crowdsourcing for enriching metadata about European colonial maps. The repository of these maps contains only small amounts of metadata about its sources. In the first part of this research, requirements for useful metadata about historical maps were identified by interviewing an architectural historian. In the second part, participants were asked to generate as many annotations as possible for three European colonial maps, using an annotation tool called Accurator. The participants' annotations were then evaluated against the identified requirements. The results indicate that in most cases the annotations provided by the participants meet the requirements given by the architectural historian; thus, crowdsourcing is an effective method for enriching the metadata of European colonial maps.