Person | TU Delft Repository

Joint Feature Synthesis and Embedding

Adversarial Cross-Modal Retrieval Revisited

Journal article (2022) - Xu Xing, Kaiyi Lin, Yang Yang, A. Hanjalic, Heng Tao Shen

Recently, the generative adversarial network (GAN) has shown a strong ability to model data distributions via adversarial learning. Cross-modal GAN, which attempts to utilize the power of GAN to model the cross-modal joint distribution and to learn compatible cross-modal features, is becoming a research hotspot. However, existing cross-modal GAN approaches typically 1) require labeled multimodal data, which incurs massive labor cost, to establish cross-modal correlation; 2) utilize the vanilla GAN model, which results in an unstable training procedure and meaningless synthetic features; and 3) lack extensibility for retrieving cross-modal data of new classes. In this article, we revisit the adversarial learning in existing cross-modal GAN methods and propose Joint Feature Synthesis and Embedding (JFSE), a novel method that jointly performs multimodal feature synthesis and common embedding space learning to overcome the above three shortcomings. Specifically, JFSE deploys two coupled conditional Wasserstein GAN modules for the input data of two modalities, to synthesize meaningful and correlated multimodal features under the guidance of the word embeddings of class labels. Moreover, three advanced distribution alignment schemes with advanced cycle-consistency constraints are proposed to preserve the semantic compatibility and enable knowledge transfer in the common embedding space for both the true and synthetic cross-modal features. All these add-ons in JFSE not only help to learn a more effective common embedding space that captures the cross-modal correlation but also facilitate knowledge transfer to multimodal data of new classes. Extensive experiments are conducted on four widely used cross-modal datasets, and comparisons with more than ten state-of-the-art approaches show that our JFSE method achieves remarkable accuracy improvements on both standard retrieval and the newly explored zero-shot and generalized zero-shot retrieval tasks.

Cross-modal hybrid feature fusion for image-sentence matching

Journal article (2021) - Xu Xing, Yifan Wang, Yixuan He, Yang Yang, A. Hanjalic, Heng Tao Shen

Image-sentence matching is a challenging task in the field of language and vision, which aims at measuring the similarities between images and sentence descriptions. Most existing methods independently map the global features of images and sentences into a common space to calc ...

Radial Graph Convolutional Network for Visual Question Generation

Journal article (2020) - Xu Xing, Tan Wang, Yang Yang, A. Hanjalic, Heng Tao Shen

In this article, we address the problem of visual question generation (VQG), a challenge in which a computer is required to generate meaningful questions about an image targeting a given answer. The existing approaches typically treat the VQG task as a reversed visual question ...

Unified Binary Generative Adversarial Network for Image Retrieval and Compression

Journal article (2020) - Jingkuan Song, Tao He, Lianli Gao, Xu Xing, A. Hanjalic, Heng Tao Shen

Binary codes have often been deployed to facilitate large-scale retrieval tasks, but not that often for image compression. In this paper, we propose a unified framework, BGAN+, that restricts the input noise variable of generative adversarial networks to be binary and conditioned ...

Matching images and text with multi-modal tensor fusion and re-ranking

Conference paper (2019) - Tan Wang, A. Hanjalic, Xu Xing, Heng Tao Shen, Yang Yang, Jingkuan Song

A major challenge in matching images and text is that they have intrinsically different data distributions and feature representations. Most existing approaches are based either on embedding or classification, the first one mapping image and text instances into a common embedd ...

Binary Generative Adversarial Networks for Image Retrieval

Conference paper (2018) - Jingkuan Song, Tao He, Lianli Gao, Xu Xing, A. Hanjalic, Heng Tao Shen

The most striking successes in image retrieval using deep hashing have mostly involved discriminative models, which require labels. In this paper, we use binary generative adversarial networks (BGAN) to embed images to binary codes in an unsupervised way. By restricting the input ...

Adversarial Cross-Modal Retrieval

Conference paper (2017) - Bokun Wang, Yang Yang, Xu Xing, A. Hanjalic, Heng Tao Shen

Cross-modal retrieval aims to enable a flexible retrieval experience across different modalities (e.g., texts vs. images). The core of cross-modal retrieval research is to learn a common subspace where the items of different modalities can be directly compared to each other. In t ...

Xu Xing
