Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/124187
Type: Thesis
Title: Data Augmentation for Multi-domain and Multi-modal Generalised Zero-shot Learning
Author: Felix Alves, Raphael
Issue Date: 2020
School/Discipline: School of Computer Science
Abstract: This thesis addresses the problem of combining data augmentation with multi-domain and multi-modal training and inference for Generalised Zero-Shot Learning (GZSL). GZSL introduces an experimental setup in which the training set contains images and semantic information for a set of seen classes, and semantic information for a set of unseen classes, with no overlap between the seen and unseen classes. The semantic information can be represented by a group of attributes or some textual information that describes a visual class. The main goal of GZSL methods is to build a visual classifier that works for both the seen and unseen classes, even though there are no training images from the unseen classes. The key to solving this challenging problem is to exploit the connection between the semantic and visual spaces by learning a model that can translate between these spaces. The solutions proposed in the field have focused on three directions: conventional Zero-Shot Learning (ZSL), data augmentation and domain classification. Conventional ZSL comprises an optimisation procedure that learns a mapping from the visual to the semantic space using the seen classes. The inference maps the images of the unseen classes from the visual to the semantic space, where classification relies on a nearest neighbour classifier. The extension of ZSL to GZSL is not trivial, since the lack of semantic and visual samples from the unseen classes during training biases the classification towards the seen classes. This issue has driven GZSL to two alternative approaches: domain classification and data augmentation. Domain classification aims to learn a one-class classifier that estimates the likelihood that visual samples belong to the set of seen classes – this domain classifier is then used to select or modulate the visual classification of test images. 
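The conventional ZSL inference described above can be illustrated with a minimal sketch: a learned visual-to-semantic mapping projects a test feature into attribute space, and a nearest-neighbour rule over the class attribute vectors picks the class. All names, shapes, and the linear map below are illustrative assumptions, not the thesis's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes, attr_dim, feat_dim = 5, 8, 16
# One semantic attribute vector per class (stand-in data).
class_attributes = rng.normal(size=(num_classes, attr_dim))
# Stand-in for a learned visual -> semantic projection.
W = rng.normal(size=(feat_dim, attr_dim))

def zsl_predict(visual_feature):
    """Map a visual feature to semantic space, return the nearest class index."""
    semantic = visual_feature @ W
    dists = np.linalg.norm(class_attributes - semantic, axis=1)
    return int(np.argmin(dists))

x = rng.normal(size=feat_dim)
print(zsl_predict(x))  # index of the class whose attributes are closest
```

Because the class attribute vectors are available for unseen classes too, this nearest-neighbour step is what lets ZSL classify classes with no training images.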
More specifically, an input visual sample is first classified as seen or unseen, and then forwarded to different classifiers (e.g., if it is classified as seen, then it goes to the visual classifier trained with the seen images; and if it is classified as unseen, then it goes to a conventional ZSL classifier). Although relatively successful, this approach assumes that seen and unseen classes are drawn from different domains, which is unwarranted in GZSL because images from seen and unseen classes most likely come from similar distributions. The other alternative approach, data augmentation, comprises the training of a generative model that produces visual samples conditioned on a semantic sample. This generative model then produces synthetic samples from the unseen classes, which are combined with the real visual samples from the seen classes to train a visual classifier. This approach introduces multi-modal training, but there is no guarantee that the generated visual samples represent the unseen classes well, and inference still relies only on the visual modality. In this thesis, we propose several methods to address the issues mentioned above. Firstly, we introduce a novel data augmentation model based on cycle-consistent multi-modal training to improve the generation of visual samples, particularly from the unseen classes. Secondly, we propose a novel domain classification method that no longer relies on one-class classifiers – instead, we use the visual samples from the generative model to train a binary domain classifier. Thirdly, we extend our proposed GZSL data augmentation framework to a multi-modal inference procedure, where we train a visual and a semantic classifier that are combined to classify a test image. Our final proposed model is based on a multi-modal and multi-domain data augmentation approach composed of multiple classifiers trained in three modalities (visual, semantic and joint latent space). 
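The data-augmentation route above can be sketched as follows: a conditional "generator" synthesises visual features for the unseen classes from their attribute vectors, and those synthetic features make it possible to train a binary seen/unseen domain classifier instead of a one-class model. The linear generator, the nearest-centroid stand-in for the domain classifier, and all names are assumptions for illustration, not the thesis's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
attr_dim, feat_dim = 8, 16

seen_attrs = rng.normal(size=(3, attr_dim))    # attributes of 3 seen classes
unseen_attrs = rng.normal(size=(2, attr_dim))  # attributes of 2 unseen classes
G = rng.normal(size=(attr_dim, feat_dim))      # stand-in for a trained generator

def generate(attrs, n_per_class):
    """Synthesise n_per_class visual features per class from its attributes."""
    feats, labels = [], []
    for c, a in enumerate(attrs):
        feats.append(a @ G + 0.1 * rng.normal(size=(n_per_class, feat_dim)))
        labels.extend([c] * n_per_class)
    return np.vstack(feats), np.array(labels)

# Synthetic unseen-class features fill the gap left by missing unseen images.
synth_feats, synth_labels = generate(unseen_attrs, n_per_class=20)

# Toy stand-in for real seen-class features; in practice these come from
# actual training images.
real_feats, _ = generate(seen_attrs, n_per_class=20)

# A binary domain classifier trained on real seen features (domain 0)
# versus synthetic unseen features (domain 1); a nearest-centroid rule
# stands in for it here.
centroids = np.vstack([real_feats.mean(axis=0), synth_feats.mean(axis=0)])

def domain_predict(x):
    """Return 0 (seen domain) or 1 (unseen domain) by nearest centroid."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
```

Having explicit (synthetic) samples from both domains is what removes the need for a one-class classifier: the seen/unseen decision becomes an ordinary supervised binary problem.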
Moreover, we propose a classification calibration technique to produce an effective multi-modal and multi-domain classification. We report extensive experiments for the proposed models, using several benchmark datasets, such as the Caltech-UCSD Birds 200 (CUB), Animals with Attributes (AWA), Scene Understanding Benchmark Suite (SUN), 102 Category Flower Dataset (FLO) and ImageNet. The experiments show that multi-modal and multi-domain optimisation can be combined with data augmentation to produce state-of-the-art GZSL results.
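One way the calibrated multi-modal combination could look is sketched below: class probabilities from a visual and a semantic classifier are fused, and seen-class scores are penalised by a calibration constant to counter the bias towards seen classes. The averaging rule and the constant gamma are assumptions for illustration, not the thesis's exact method.

```python
import numpy as np

def combine(p_visual, p_semantic, seen_mask, gamma=0.2):
    """Fuse two modality-specific probability vectors, then calibrate."""
    p = 0.5 * (p_visual + p_semantic)  # simple multi-modal averaging
    p = p - gamma * seen_mask          # downweight seen-class scores
    return int(np.argmax(p))

# Seen class 0 has the highest raw score in both modalities, but the
# calibration term lets the (unseen) class 1 win:
print(combine(np.array([0.5, 0.3, 0.2]),
              np.array([0.4, 0.35, 0.25]),
              np.array([1.0, 0.0, 0.0])))  # → 1
```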
Advisor: Carneiro, Gustavo
Reid, Ian
Sasdelli, Michele
Dissertation Note: Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 2020
Keywords: Generalised zero-shot learning
zero-shot learning
data augmentation
multi-modality
multi-domain
classification
deep learning
neural networks
Provenance: This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals
Appears in Collections:Research Theses

Files in This Item:
File: Felix Alves2020_PhD.pdf | Description: Thesis | Size: 10.94 MB | Format: Adobe PDF | View/Open


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.