Heng Tao Shen | TU Delft Repository

Joint Feature Synthesis and Embedding: Adversarial Cross-Modal Retrieval Revisited

Journal article (2022) - Xu Xu (author) , Kaiyi Lin (author) , Yang Yang (author) , Alan Hanjalic (author) , Heng Tao Shen (author)

Recently, generative adversarial networks (GANs) have shown a strong ability to model data distributions via adversarial learning. Cross-modal GAN, which attempts to utilize the power of GAN to model the cross-modal joint distribution and to learn compatible cross-modal features, is becoming a research hotspot. However, the existing cross-modal GAN approaches typically 1) require labeled multimodal data, obtained at massive labor cost, to establish cross-modal correlation; 2) utilize the vanilla GAN model, which results in an unstable training procedure and meaningless synthetic features; and 3) lack extensibility for retrieving cross-modal data of new classes. In this article, we revisit the adversarial learning in existing cross-modal GAN methods and propose Joint Feature Synthesis and Embedding (JFSE), a novel method that jointly performs multimodal feature synthesis and common embedding space learning to overcome the above three shortcomings. Specifically, JFSE deploys two coupled conditional Wasserstein GAN modules for the input data of two modalities, to synthesize meaningful and correlated multimodal features under the guidance of the word embeddings of class labels. Moreover, three advanced distribution alignment schemes with advanced cycle-consistency constraints are proposed to preserve the semantic compatibility and enable the knowledge transfer in the common embedding space for both the true and synthetic cross-modal features. All these add-ons in JFSE not only help to learn a more effective common embedding space that captures the cross-modal correlation but also facilitate knowledge transfer to multimodal data of new classes. Extensive experiments are conducted on four widely used cross-modal datasets, and the comparisons with more than ten state-of-the-art approaches show that our JFSE method achieves remarkable accuracy improvements on both standard retrieval and the newly explored zero-shot and generalized zero-shot retrieval tasks.
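As a rough illustration of the conditional Wasserstein GAN objective the abstract refers to, the sketch below computes critic and generator losses for features paired with class-label embeddings. It is not the authors' implementation: the linear critic, the function names (`critic_scores`, `wasserstein_losses`), and the toy inputs are assumptions made for this example, and the Lipschitz constraint that Wasserstein GANs need in practice (weight clipping or a gradient penalty) is omitted.

```python
import numpy as np

def critic_scores(features, class_embeddings, weights):
    """Score each (feature, class embedding) pair with a linear critic.

    The critic is conditional because it sees the class embedding
    concatenated to the feature vector. A linear critic is an assumption
    of this sketch; a real model would use a neural network.
    """
    joint = np.concatenate([features, class_embeddings], axis=1)
    return joint @ weights

def wasserstein_losses(real, fake, class_embeddings, weights):
    """Return (critic_loss, generator_loss) for one batch.

    The critic is trained to score real features higher than synthetic
    ones, so it minimizes mean(fake scores) - mean(real scores).
    The generator minimizes -mean(fake scores).
    """
    real_scores = critic_scores(real, class_embeddings, weights)
    fake_scores = critic_scores(fake, class_embeddings, weights)
    critic_loss = fake_scores.mean() - real_scores.mean()
    generator_loss = -fake_scores.mean()
    return critic_loss, generator_loss

# Toy batch: two 2-D features, each conditioned on a 1-D class embedding.
real = np.array([[1.0, 0.0], [1.0, 0.0]])
fake = np.array([[0.0, 0.0], [0.0, 0.0]])
class_embeddings = np.array([[1.0], [1.0]])
weights = np.array([1.0, 0.0, 0.5])  # hypothetical critic parameters

critic_loss, generator_loss = wasserstein_losses(
    real, fake, class_embeddings, weights
)
print(critic_loss, generator_loss)
```

In the abstract's description, two such modules are coupled, one per modality, and both are conditioned on the same word embeddings of class labels; the sketch covers a single module only.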

Cross-modal hybrid feature fusion for image-sentence matching

Journal article (2021) - Xing Xu (author) , Yifan Wang (author) , Yixuan He (author) , Yang Yang (author) , A. Hanjalic (author) , Heng Tao Shen (author)

Image-sentence matching is a challenging task in the field of language and vision, which aims at measuring the similarities between images and sentence descriptions. Most existing methods independently map the global features of images and sentences into a common space to calcula ...

Unified Binary Generative Adversarial Network for Image Retrieval and Compression

Journal article (2020) - Jingkuan Song (author) , Tao He (author) , Lianli Gao (author) , Xu Xu (author) , A. Hanjalic (author) , Heng Tao Shen (author)

Binary codes have often been deployed to facilitate large-scale retrieval tasks, but not that often for image compression. In this paper, we propose a unified framework, BGAN+, that restricts the input noise variable of generative adversarial networks to be binary and conditioned ...

Radial Graph Convolutional Network for Visual Question Generation

Journal article (2020) - Xu Xu (author) , Tan Wang (author) , Yang Yang (author) , Alan Hanjalic (author) , Heng Tao Shen (author)

In this article, we address the problem of visual question generation (VQG), a challenge in which a computer is required to generate meaningful questions about an image targeting a given answer. The existing approaches typically treat the VQG task as a reversed visual question an ...

Magnetostrictively Induced Stationary Entanglement between Two Microwave Fields

Journal article (2020) - Mei Yu (author) , Heng Tao Shen (author) , J. Li (author)

We present a scheme to entangle two microwave fields by using the nonlinear magnetostrictive interaction in a ferrimagnet. The magnetostrictive interaction enables the coupling between a magnon mode (spin wave) and a mechanical mode in the ferrimagnet, and the magnon mode simulta ...

Matching images and text with multi-modal tensor fusion and re-ranking

Conference paper (2019) - Tan Wang (author) , A. Hanjalic (author) , Xu Xu (author) , Heng Tao Shen (author) , Yang Yang (author) , Jingkuan Song (author)

A major challenge in matching images and text is that they have intrinsically different data distributions and feature representations. Most existing approaches are based either on embedding or classification, the first one mapping image and text instances into a common embedding ...

From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning

Journal article (2018) - Jingkuan Song (author) , Yuyu Guo (author) , Lianli Gao (author) , Xuelong Li (author) , A Hanjalic (author) , Heng Tao Shen (author)

Video captioning, in essence, is a complex natural process, which is affected by various uncertainties stemming from video content, subjective judgment, and so on. In this paper, we build on the recent progress in using the encoder-decoder framework for video captioning and address ...

Binary Generative Adversarial Networks for Image Retrieval

Conference paper (2018) - Jingkuan Song (author) , Tao He (author) , Lianli Gao (author) , Xu Xu (author) , A. Hanjalic (author) , Heng Tao Shen (author)

The most striking successes in image retrieval using deep hashing have mostly involved discriminative models, which require labels. In this paper, we use binary generative adversarial networks (BGAN) to embed images to binary codes in an unsupervised way. By restricting the input ...

Video Captioning by Adversarial LSTM

Journal article (2018) - Yang Yang (author) , Jie Zhou (author) , Jiangbo Ai (author) , Yi Bin (author) , A Hanjalic (author) , Heng Tao Shen (author)

In this paper, we propose a novel approach to video captioning based on adversarial learning and long short-term memory (LSTM). With this solution concept, we aim at compensating for the deficiencies of LSTM-based video captioning methods that generally show potential to effectiv ...

Adversarial Cross-Modal Retrieval

Conference paper (2017) - Bokun Wang (author) , Yang Yang (author) , Xu Xu (author) , Alan Hanjalic (author) , Heng Tao Shen (author)

Cross-modal retrieval aims to enable a flexible retrieval experience across different modalities (e.g., texts vs. images). The core of cross-modal retrieval research is to learn a common subspace where the items of different modalities can be directly compared to each other. In this ...