Weakly Supervised Object Localization with Non-linear Modulation and Foreground Control
Principal Investigator:
Kuan Huang
Co-PIs:
Meng Xu
Abstract:
Many existing weakly supervised object localization methods are built on Class Activation Maps (CAMs), which focus on the most discriminative part of an object and often mistakenly identify the background as the foreground. In addition, their localization accuracy relies heavily on classification accuracy, and the detected activation regions are often fragmented. In this study, we propose a novel weakly supervised object localization method to solve these problems. The proposed method is compared with eight state-of-the-art methods on two commonly used datasets (CUB-200-2011 and ILSVRC) and achieves the best overall localization results.
Description of Research:
Training a fully supervised object localization method on a large dataset requires comprehensive annotations that are costly to obtain. To address this issue, Weakly Supervised Object Localization (WSOL) using image-level labels has become a popular alternative. Recent WSOL methods fall into two major categories: multi-instance learning and Class Activation Maps (CAMs). Multi-instance learning methods must generate numerous candidate bounding boxes, which consumes considerable time and memory. Consequently, CAM-based object localization methods are attracting increasing interest because of their notably lower cost. However, CAMs tend to focus on the most discriminative parts of an image for classification, which can leave object regions only partially covered.
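For reference, the sketch below shows how a CAM is conventionally computed from the final convolutional feature maps and the weights of a global-average-pooling classifier; the variable names (features, fc_weights) are illustrative and are not taken from our implementation.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weights, class_idx):
    """Compute a CAM as a weighted sum of the final feature maps.

    features   : (C, H, W) feature maps from the last convolutional layer
    fc_weights : (num_classes, C) weights of the classification layer
    class_idx  : target class whose activation map is produced
    """
    # Weight each channel by its contribution to the target class score.
    cam = torch.einsum('c,chw->hw', fc_weights[class_idx], features)
    # Keep positive evidence and normalize to [0, 1] so a threshold can be applied.
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam

# A bounding box is then typically obtained by upsampling and thresholding the
# CAM, and taking the tightest box around the largest connected foreground region.
```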
While CAM-based object localization methods have achieved considerable success, they have several limitations. First, the precision of the generated CAMs relies heavily on classification accuracy. Second, some methods disrupt the structure of the object, leaving the detected foreground regions fragmented. Third, CAMs typically focus on the most discriminative regions of an image and may mistakenly identify background as foreground, because global average pooling is limited in its ability to distinguish foreground from background pixels.
To solve these problems, we propose a novel weakly supervised object localization approach that identifies the foreground with foreground prediction maps (FPMs), which are less dependent on classification accuracy than CAMs. First, we develop a Non-Linear Modulation (NLM) module that refines the FPMs while preserving the continuity of the localization results. Second, we construct an FPM Fusion module that recombines the foreground and background FPMs to strengthen the foreground and suppress the background. Last, we design a novel foreground control loss function that prevents excessive enlargement of the foreground.
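The precise formulation of the foreground control loss is part of the proposed work and is not spelled out here; the following is only a minimal sketch of one plausible area-constraint penalty, assuming the FPM is a per-pixel foreground probability map and using a hypothetical max_area hyperparameter.

```python
import torch

def foreground_control_loss(fpm, max_area=0.4):
    """Hypothetical area-constraint penalty discouraging an oversized foreground.

    fpm      : (B, 1, H, W) foreground prediction map with values in [0, 1]
    max_area : assumed upper bound on the fraction of pixels allowed as foreground
    """
    # The mean activation approximates the fraction of the image marked as foreground.
    area = fpm.mean(dim=(1, 2, 3))
    # Penalize only when the predicted foreground exceeds the allowed fraction.
    return torch.clamp(area - max_area, min=0).mean()
```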
The proposed method is evaluated on two datasets: CUB-200-2011 and ILSVRC 2016. CUB-200-2011 is a specialized collection of bird images covering 200 bird species; it contains 11,788 images, split into a training set of 5,994 images and a testing set of 5,794 images. ILSVRC 2016 is a comprehensive dataset with 1,000 categories, containing over 1,200,000 training images and 50,000 testing images. The proposed method outperforms eight state-of-the-art methods in localization performance on both datasets.
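Although the specific metrics are not listed in this description, WSOL methods on these datasets are conventionally scored by whether the predicted bounding box overlaps the ground-truth box with an IoU of at least 0.5; the small helper below illustrates that criterion.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

# A prediction usually counts as a correct localization when IoU >= 0.5
# (and, for Top-1 localization, the class prediction must also be correct).
print(iou((10, 10, 60, 60), (20, 20, 70, 70)))  # ~0.47, below the 0.5 cutoff
```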