Introduction
In this work, we propose a hierarchical semantic embedding (HSE) framework that incorporates the category hierarchy to aid fine-grained image recognition. To evaluate the proposed framework, we organize the 200 bird species of the Caltech-UCSD Birds dataset into a four-level category hierarchy and construct a large-scale butterfly dataset (Butterfly-200) that also covers four levels of categories. Extensive experiments on these two datasets and the newly released VegFru dataset demonstrate the superiority of our HSE framework over the baseline methods and existing competitors.
HSE Framework
Figure 1. The overall pipeline of our proposed hierarchical semantic embedding framework. It employs a trunk network to extract image features and subsequently utilizes a branch network to predict the categories of each level. At each level, it incorporates the predicted score vector to guide the learning of finer-grained features and simultaneously to regularize label prediction during training.
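The coarse-to-fine flow in the caption can be illustrated with a small, dependency-free sketch. This is not the released implementation: the feature size, the random weights, and in particular the sigmoid gate that lets the previous level's scores modulate the trunk features are assumptions made only to show the data flow. The class counts per level are those listed for Butterfly-200 on this page.

```python
import math
import random

# Class counts per level for Butterfly-200, coarse to fine (family ... species).
LEVEL_SIZES = [5, 23, 116, 200]
FEATURE_DIM = 8  # hypothetical; kept tiny so the sketch runs instantly


def softmax(logits):
    """Turn raw logits into a score vector that sums to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vector):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def build_branches(rng):
    """One classifier per level, plus a projection of the previous level's scores."""
    branches = []
    prev_size = None
    for size in LEVEL_SIZES:
        branches.append({
            "classifier": random_matrix(size, FEATURE_DIM, rng),
            # The coarsest level has no previous scores to condition on.
            "guide": None if prev_size is None else random_matrix(FEATURE_DIM, prev_size, rng),
        })
        prev_size = size
    return branches


def predict_all_levels(trunk_features, branches):
    """Return one score vector per level, coarse to fine."""
    scores_per_level = []
    prev_scores = None
    for branch in branches:
        features = trunk_features
        if prev_scores is not None:
            # Assumed guidance mechanism: a sigmoid gate computed from the
            # previous (coarser) level's score vector rescales the features.
            gate = [1.0 / (1.0 + math.exp(-g)) for g in linear(branch["guide"], prev_scores)]
            features = [f * g for f, g in zip(trunk_features, gate)]
        prev_scores = softmax(linear(branch["classifier"], features))
        scores_per_level.append(prev_scores)
    return scores_per_level


rng = random.Random(0)
branches = build_branches(rng)
trunk_features = [rng.uniform(0.0, 1.0) for _ in range(FEATURE_DIM)]  # stand-in for trunk output
scores = predict_all_levels(trunk_features, branches)
```

Each entry of `scores` is a probability vector over the classes of one level. In a real model, the trunk and branches would be learned networks rather than random matrices.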
Butterfly-200 dataset
Details
Butterfly-200 is a dataset with images from 200 common species of butterflies. The detailed information is presented as follows.
Image number: 25,279 images.
Category number: 200 species, 116 genera, 23 subfamilies, and 5 families.
Annotations: four-level categories.
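As a rough illustration of what a four-level annotation could look like in code, the sketch below stores one class index per level and checks each index against the category counts listed above. The record layout, the zero-based indexing, and the name `HierarchicalLabel` are assumptions, not the format of the released annotation files.

```python
from dataclasses import dataclass

# Class counts per level for Butterfly-200, coarse to fine (from the details above).
LEVEL_SIZES = {"family": 5, "subfamily": 23, "genus": 116, "species": 200}


@dataclass(frozen=True)
class HierarchicalLabel:
    """Hypothetical per-image label: one zero-based class index per level."""
    family: int
    subfamily: int
    genus: int
    species: int

    def __post_init__(self):
        # Reject indices that fall outside the known number of classes.
        for level, size in LEVEL_SIZES.items():
            index = getattr(self, level)
            if not 0 <= index < size:
                raise ValueError(f"{level} index {index} outside [0, {size})")

    def as_list(self):
        """Indices ordered coarse to fine, e.g. as targets for four classifiers."""
        return [getattr(self, level) for level in LEVEL_SIZES]


label = HierarchicalLabel(family=2, subfamily=10, genus=57, species=120)  # made-up indices
```

Validating at construction time catches an out-of-range index when the annotations are loaded, rather than later as an indexing error during training.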
Download
The images and corresponding annotations can be downloaded from Dropbox.
Sample images and corresponding annotations
Extended Caltech-UCSD Birds dataset
Details
The Extended Caltech-UCSD Birds dataset is an extension of the original dataset that annotates each image with four-level categories. The detailed information is presented as follows.
Image number: 11,788 images.
Category number: 200 species, 122 genera, 37 families, and 13 orders.
Annotations: four-level categories.
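Because the four levels form a hierarchy, every species should sit under exactly one genus, every genus under one family, and every family under one order. A minimal check of that property might look like the sketch below. The dictionary field names and the sample records are illustrative assumptions and do not reflect the layout of the released annotation files.

```python
LEVELS = ("order", "family", "genus", "species")  # coarse to fine


def find_inconsistencies(annotations):
    """Return (level, child, parents) tuples for children seen under more than one parent."""
    parents_seen = {}
    for record in annotations:
        # Walk adjacent level pairs: (order, family), (family, genus), (genus, species).
        for parent_level, child_level in zip(LEVELS, LEVELS[1:]):
            key = (child_level, record[child_level])
            parents_seen.setdefault(key, set()).add(record[parent_level])
    return [
        (level, child, sorted(parents))
        for (level, child), parents in parents_seen.items()
        if len(parents) > 1
    ]


# Illustrative records only; field names are assumptions.
annotations = [
    {"order": "Procellariiformes", "family": "Diomedeidae",
     "genus": "Phoebastria", "species": "Black-footed Albatross"},
    {"order": "Procellariiformes", "family": "Diomedeidae",
     "genus": "Phoebastria", "species": "Laysan Albatross"},
    {"order": "Passeriformes", "family": "Cardinalidae",
     "genus": "Cardinalis", "species": "Northern Cardinal"},
]
problems = find_inconsistencies(annotations)
```

An empty `problems` list means each child category has a single parent, so the records describe a proper tree.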
Download
The images can be downloaded from the original page, and the corresponding hierarchical annotations can be downloaded from Dropbox.
Sample images and corresponding annotations
Experiment results
Table 1. Comparison of the accuracy (in %) at all levels of our HSE framework, two baseline methods, and two variants of our framework, one that removes semantic embedding representation learning (Ours w/o SERL) and one that removes semantic guided label regularization (Ours w/o SGLR), on the CUB and Butterfly-200 test sets.
Table 2. Comparison of our HSE framework with existing state-of-the-art methods on recognizing finest-level categories on the CUB dataset. BA and PA denote bounding box annotations and part annotations, respectively. √ indicates that the corresponding annotations are used during training or testing.
Table 3. Comparison of accuracy of our HSE framework, existing state-of-the-art methods, and the baseline methods on the VegFru dataset.
References
Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 Dataset. California Institute of Technology, 2011.
Saihui Hou, Yushan Feng, and Zilei Wang. VegFru: A Domain-Specific Dataset for Fine-grained Visual Categorization. In ICCV, 2017.