Researchers at ZJU and University of Sydney Propose New Text-Image Framework that Breaks COCO Record

Editor: Yu Liu     Time: 2019-03-21

Researchers at Zhejiang University and the University of Sydney have proposed MirrorGAN, a global-local attentive and semantic-preserving text-to-image-to-text framework that addresses semantic consistency between textual descriptions and visual content. MirrorGAN set new records on the COCO dataset.

GAN has opened up new frontiers once again.

Last year, NVIDIA's StyleGAN generated high-quality, visually realistic images that fooled countless pairs of eyes. Since then, a flood of fake faces, fake cats, and fake houses has appeared, showing the power of GANs.

Photo: StyleGAN generates a fake face

Although GAN has made significant advancements in imagery, it is still very challenging to ensure semantic consistency between textual descriptions and visual content.

Recently, researchers from Zhejiang University and the University of Sydney have proposed a novel global-local attentive and semantic-preserving text-to-image-to-text framework to solve this problem. The framework is called MirrorGAN.

How powerful is MirrorGAN?

MirrorGAN achieved state-of-the-art results on two mainstream benchmarks: the COCO dataset and the CUB bird dataset.

The paper has been accepted to CVPR 2019.


MirrorGAN: Resolving Semantic Consistency Between Text and Vision

Text-to-image (T2I) generation has great potential in many applications and has become an active research area in natural language processing and computer vision.

Unlike basic image generation, T2I generation is conditioned on a textual description rather than on noise alone. Leveraging the power of GANs, researchers have proposed various T2I methods to generate visually realistic, text-relevant images. These methods all use a discriminator to distinguish generated image-text pairs from ground-truth image-text pairs.

However, relying solely on such a discriminator to model the underlying semantic consistency within each pair is difficult and inefficient because of the domain gap between text and images.

In recent years, in response to this problem, people have used attention mechanisms to guide the generator to focus on different words when generating different image regions. However, due to the variety of text and image patterns, the use of only word-level attention does not ensure consistency of global semantics. As shown in Figure 1(b):

Figure 1 (a) An illustration of the mirror structure, which embodies the idea of redefining text-to-image generation; (b)-(c) Inconsistent and consistent image/re-description pairs generated by prior methods and by MirrorGAN
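The word-level attention described above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation; the region count, word count, and feature dimension are arbitrary assumptions, and a real model would learn projection matrices rather than compare raw features.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: 4 image regions and 6 words, each with 32-dim features
R, T, D = 4, 6, 32
regions = rng.normal(size=(R, D))   # region features from the generator
words = rng.normal(size=(T, D))     # word embeddings from the text encoder

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Word-level attention: each image region attends over all words,
# producing a word-context vector that guides generation of that region
scores = regions @ words.T          # (R, T) region-word similarity
alpha = softmax(scores, axis=1)     # attention weights over words, per region
context = alpha @ words             # (R, D) word-context vector per region
```

Because each region's weights are computed independently over individual words, nothing ties the regions together globally, which is exactly the weakness sentence-level attention is meant to fix.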

T2I generation can be viewed as the inverse of image captioning (image-to-text, I2T), which produces a textual description for a given image. Since each task requires modeling and aligning the underlying semantics of the two domains, it is natural and reasonable to model both tasks in a unified framework to exploit their underlying duality.


As shown in Figures 1 (a) and (c), if the image generated by T2I is semantically consistent with a given textual description, then the I2T re-description should have exactly the same semantics as the given textual description. In other words, the resulting image should look like a mirror that accurately reflects the underlying text semantics. 

Based on this observation, the paper proposes a new text-image-text framework, MirrorGAN, to improve T2I generation, which takes advantage of the idea of learning T2I generation by re-description.


The 3 Core Modules of MirrorGAN

For the T2I task, there are two main goals:

-   Visual realism;

-   Semantic consistency with the given text.

A generated image must satisfy both at the same time.

MirrorGAN takes advantage of the idea of text-to-image re-description learning generation, which consists of three modules:

·     Semantic text embedding module (STEM)

·     Global-local collaborative attentive module for cascaded image generation (GLAM)

·     Semantic text regeneration and alignment module (STREAM)

STEM generates word-level and sentence-level embeddings. GLAM has a cascaded architecture that generates target images from coarse to fine scales, using both local word attention and global sentence attention to progressively enhance the diversity and semantic consistency of the generated images. STREAM attempts to regenerate, from the generated image, a textual description that is semantically consistent with the given one.


Figure 2 MirrorGAN schematic

As shown in Figure 2, MirrorGAN embodies the mirror structure by integrating T2I and I2T.

It takes advantage of the idea of learning T2I generation by re-description. After the image is generated, MirrorGAN regenerates its description, which aligns its underlying semantics with the given textual description.

The following are three modules of MirrorGAN: STEM, GLAM and STREAM.


STEM: Semantic Text Embedding Module

First, the semantic text embedding module embeds the given text description into local word-level features and a global sentence-level feature.

As shown on the far left of Figure 2 (above), a recurrent neural network (RNN) is used to extract semantic embeddings from the given text description, including a word embedding w and a sentence embedding s.
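The idea can be sketched with a minimal vanilla RNN in numpy. This is an illustrative stand-in for STEM, not the paper's encoder (the paper uses a learned RNN; all sizes and weight initializations here are arbitrary assumptions): the per-step hidden states serve as word-level features w, and the final hidden state serves as the sentence-level feature s.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: vocab 1000, embedding dim 64, hidden dim 128, 5 tokens
V, E, H, T = 1000, 64, 128, 5

emb = rng.normal(0, 0.1, (V, E))    # token embedding table
W_xh = rng.normal(0, 0.1, (E, H))   # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (H, H))   # hidden-to-hidden weights

def stem_encode(token_ids):
    """Return word-level features w (T x H) and sentence feature s (H,)."""
    h = np.zeros(H)
    word_feats = []
    for t in token_ids:
        h = np.tanh(emb[t] @ W_xh + h @ W_hh)  # one RNN step
        word_feats.append(h)
    w = np.stack(word_feats)   # word embedding: one vector per token
    s = w[-1]                  # sentence embedding: last hidden state
    return w, s

tokens = rng.integers(0, V, T)
w, s = stem_encode(tokens)
```

The word-level features feed the local attention in GLAM, while the sentence-level feature conditions the generator globally.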

GLAM: Global-local collaborative attentive module for cascaded image generation

Next, a multi-stage cascaded generator is constructed by stacking three image generation networks.

The paper adopts the basic structure described in "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks" because of its strong performance in generating realistic images.

Denote the m visual feature transformers by {F0, F1, ..., Fm-1} and the m image generators by {G0, G1, ..., Gm-1}. The visual feature fi and the generated image Ii at each stage can be expressed as:

    f0 = F0(z, s_ca)
    fi = Fi(f_{i-1}, F_att_i(f_{i-1}, w, s_ca)),  i ∈ {1, 2, ..., m-1}
    Ii = Gi(fi),  i ∈ {0, 1, ..., m-1}

where z ~ N(0, 1) is random noise, s_ca is the sentence embedding after conditioning augmentation, and F_att_i is the proposed global-local attention module.
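The coarse-to-fine cascade can be sketched as follows. This is a toy numpy sketch of the control flow only: the real F, G, and attention modules are learned neural networks, whereas here they are stand-in functions, and the feature dimension and stage resolutions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

D = 32                      # feature dimension (hypothetical)
s = rng.normal(size=D)      # sentence embedding from the text encoder
z = rng.normal(size=D)      # random noise vector

def F0(z, s):
    """First visual feature transformer: fuse noise with sentence semantics."""
    return np.tanh(z + s)

def F_att(f, s):
    """Stand-in for the global-local attention module (GLAM)."""
    return f * np.tanh(s)   # toy modulation of features by sentence semantics

def F(f, ctx):
    """Refine features from the previous stage using the attention context."""
    return np.tanh(f + ctx)

def G(f, size):
    """Toy image generator: expand features into a size x size 'image'."""
    return np.outer(np.tanh(f[:size]), np.tanh(f[:size]))

# Coarse-to-fine cascade over m = 3 stages, doubling resolution each stage
m, sizes = 3, [8, 16, 32]
f = F0(z, s)
images = []
for i in range(m):
    if i > 0:
        f = F(f, F_att(f, s))   # fi = Fi(f_{i-1}, F_att_i(...))
    images.append(G(f, sizes[i]))  # Ii = Gi(fi)
```

Each stage refines the previous stage's features under attention guidance before rendering a higher-resolution image.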
STREAM: Semantic text regeneration and alignment module

As noted above, MirrorGAN includes a semantic text regeneration and alignment module (STREAM) to regenerate a textual description from the generated image that is semantically aligned with a given textual description.

Specifically, a widely used encoder-decoder based image captioning framework is adopted as the basic STREAM architecture.

The image encoder is a convolutional neural network (CNN) pre-trained on ImageNet, and the decoder is an RNN. The image I_{m-1} generated by the final-stage generator is fed into the CNN encoder and RNN decoder as follows:

    x_{-1} = CNN(I_{m-1})
    x_t = W_e T_t,  t ∈ {0, 1, ..., L-1}
    p_{t+1} = RNN(x_t),  t ∈ {0, 1, ..., L-1}

where W_e is a word embedding matrix, T_t is the t-th word token of the description, and p_{t+1} is the predicted probability distribution over words.


Testing Results: Best Performance on the COCO Dataset

So, how does MirrorGAN perform?

First, let’s look at a comparison of MirrorGAN with other state-of-the-art T2I methods, including GAN-INT-CLS, GAWWN, StackGAN, StackGAN++, PPGN, and AttnGAN.

Two mainstream datasets were used: the COCO dataset and the CUB bird dataset:


·    The CUB bird dataset contains 8,855 training images and 2,933 test images covering 200 categories, each image with 10 text descriptions;


·    The COCO dataset contains 82,783 training images and 40,504 validation images, each with 5 text descriptions.

The results are shown in Table 1:

Table 1: Comparison of results between MirrorGAN and other advanced methods on the CUB and COCO data sets

Table 2 shows the R accuracy scores of AttnGAN and MirrorGAN on the CUB and COCO data sets.

Table 2: R accuracy scores for MirrorGAN and AttnGAN on the CUB and COCO data sets

MirrorGAN showed a clear advantage in all experimental comparisons, demonstrating the superiority of the proposed text-to-image-to-text framework and the global-local collaborative attention module: MirrorGAN generates high-quality images whose semantics are consistent with the input text.


About the Authors

Finally, let us introduce the four authors of the paper.

QIAO Tingting, Ph.D. student from the ZJU College of Computer Science and Technology, currently works in the research group of Professor Tao Dacheng from the University of Sydney.

Photo: QIAO Tingting (picture from LinkedIn)

ZHANG Jing, Ph.D., lecturer at Hangzhou Dianzi University and visiting scholar at the University of Sydney.



XU Duanqing, professor and doctoral supervisor at the ZJU College of Computer Science and Technology.

XU Duanqing

TAO Dacheng, professor at the School of Engineering and Information Technology, University of Sydney, and director of the AI Center at the University of Sydney.


Photo: Tao Dacheng

At present, both QIAO Tingting and ZHANG Jing are participating in the research work of Professor TAO Dacheng.

It is worth noting that Professor XU Duanqing once led a National Social Science Fund major project (sub-project), “Key Technology Research and Software System Development for the Dunhuang Manuscripts Database”, and established a basic-information database system for the Dunhuang manuscripts. QIAO Tingting was one of the participants at the time.

Two years later, in 2017, the Key Scientific Research Base of the National Cultural Heritage Administration for the Digital Protection of Caves and Cultural Relics was established at Zhejiang University, focusing on the digital protection of cultural relics in cave temples. The MirrorGAN paper adds semantics to the transformation between text and images, further increasing accuracy.

In the work related to the digitization of cultural relics, AI technology adds vitality to ancient texts, bringing us closer to history and closer to culture.


Link to paper:


Copyright © 2018 College of Computer Science and Technology, Zhejiang University