FLOAT: Factorized Learning of Object Attributes for Improved
Multi-object Multi-part Scene Parsing

Rishubh Singh¹
Pranav Gupta²
Pradeep Shenoy¹
Ravi Kiran Sarvadevabhatla²
¹Google Research
²International Institute of Information Technology, Hyderabad

In CVPR 2022

[Paper]
[Code]
[Dataset]

Multi-object multi-part semantic segmentation results for sample images from our expanded label space dataset, Pascal-Part-201. Compared to the state-of-the-art BSANet, FLOAT accurately segments tiny parts (e.g. left eyebrow and right eyebrow on faces in the upper image) and handles scale variations better; note the size variations of person instances. FLOAT also predicts directional attributes of parts accurately. For 'left'/'right', see eyebrow, eye, and arm in the upper image and leg in the lower image; for 'front'/'back', see the wheel parts of the bicycle in the lower image.


Abstract

Multi-object multi-part scene parsing is a challenging task that requires detecting multiple object classes in a scene and segmenting the semantic parts within each object. In this paper, we propose FLOAT, a factorized label space framework for scalable multi-object multi-part parsing. Our framework involves independent dense prediction of object category and part attributes, which increases scalability and reduces task complexity compared to the monolithic label space counterpart. In addition, we propose an inference-time 'zoom' refinement technique that significantly improves segmentation quality, especially for smaller objects and parts. Compared to the state of the art, FLOAT obtains an absolute improvement of 2.0% in mean IOU (mIOU) and 4.8% in segmentation quality IOU (sqIOU) on the Pascal-Part-58 dataset. On the larger Pascal-Part-108 dataset, the improvements are 2.1% in mIOU and 3.9% in sqIOU. We incorporate previously excluded part attributes and other minor parts of the Pascal-Part dataset to create the most comprehensive and challenging version, which we dub Pascal-Part-201. FLOAT obtains improvements of 8.6% in mIOU and 7.5% in sqIOU on the new dataset, demonstrating its parsing effectiveness across a challenging diversity of objects and parts.
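To make the scalability argument concrete, the following is a minimal sketch comparing a monolithic label space with a factorized one. The vocabulary (`OBJECTS`, `PARTS`, `SIDES`) and the helper names are hypothetical and chosen only for illustration; they are not the Pascal-Part label lists, and the sketch flattens FLOAT's grouped heads into one head per factor. The monolithic count is an upper bound for this toy vocabulary, since a real dataset keeps only the valid combinations.

```python
from itertools import product

# Hypothetical toy vocabulary, NOT the actual Pascal-Part label lists.
OBJECTS = ["person", "horse", "bicycle"]
PARTS = ["eye", "leg", "wheel"]
SIDES = ["none", "left", "right", "front", "back"]


def monolithic_size(objects, parts, sides):
    """Number of classes if every (object, part, side) triple is its own label."""
    return len(list(product(objects, parts, sides)))


def factorized_size(objects, parts, sides):
    """Total number of output classes if each factor has its own prediction head."""
    return len(objects) + len(parts) + len(sides)


def compose(obj, part, side):
    """Rebuild a full label string from independently predicted factors."""
    if side == "none":
        return f"{obj} {part}"
    return f"{obj} {side} {part}"


print(monolithic_size(OBJECTS, PARTS, SIDES))   # multiplicative growth
print(factorized_size(OBJECTS, PARTS, SIDES))   # additive growth
print(compose("person", "eye", "left"))
```

In this toy setting the monolithic space grows with the product of the factor sizes while the factorized space grows with their sum, which is the intuition behind predicting object category and part attributes independently.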


Method

An overview diagram of our FLOAT framework (Sec. 3). Given an input image I, an object-level semantic segmentation network (Mobj, in blue) generates an object prediction map (So). Two decoders (in orange) produce part-level prediction maps grouped by object category: one for 'animate' objects (Sa) and one for 'inanimate' objects (Si) in the scene. Another decoder (in red) produces prediction maps grouped by part attribute: 'left-right' (Slr) and 'front-back' (Sfb). At inference time (shown by dotted lines), outputs from the decoders are merged in a top-down manner. The resulting prediction is further refined using the IZR technique (below) to obtain the final segmentation map (Sp).
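The top-down merge described in the caption can be sketched per pixel as follows. This is an illustrative sketch under stated assumptions, not the released implementation: the maps are small nested lists of strings rather than network outputs, the `ANIMATE` grouping and the function names `merge_pixel` and `merge_maps` are hypothetical, and the exact merging rule is an assumption, since this page does not spell it out.

```python
# Hypothetical grouping of object classes; illustrative only.
ANIMATE = {"person", "horse"}


def merge_pixel(obj, part_a, part_i, lr, fb):
    """Combine one pixel's predictions, starting from the object label (top-down)."""
    if obj == "background":
        return "background"
    # The object label decides which part decoder to trust for this pixel.
    part = part_a if obj in ANIMATE else part_i
    # Attach any directional attribute that was predicted.
    attrs = [a for a in (lr, fb) if a != "none"]
    return " ".join([obj, *attrs, part])


def merge_maps(s_o, s_a, s_i, s_lr, s_fb):
    """Apply merge_pixel over five aligned 2D maps (lists of rows)."""
    return [
        [merge_pixel(*pixel) for pixel in zip(*rows)]
        for rows in zip(s_o, s_a, s_i, s_lr, s_fb)
    ]


# One-row example: a person pixel, a bicycle pixel, and a background pixel.
s_o = [["person", "bicycle", "background"]]
s_a = [["eye", "leg", "eye"]]
s_i = [["wheel", "wheel", "wheel"]]
s_lr = [["left", "none", "right"]]
s_fb = [["none", "front", "none"]]
print(merge_maps(s_o, s_a, s_i, s_lr, s_fb))
```

Note how the background pixel ignores the part and attribute predictions entirely, and how the bicycle pixel reads its part from the 'inanimate' map even though the 'animate' map also produced a value there.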


An overview of Inference-time Zoom Refinement (IZR). During inference, predictions from the object-level network Mobj are used to obtain padded bounding boxes for scene objects (B). The corresponding object crops (C) are processed by the factorized network (F). The resulting label maps (D) are composited to generate Sp, the final refined part segmentation map (E). Notice the improvement in segmentation quality relative to the part label map without IZR (included for comparison).
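The IZR steps in the caption (box, crop, re-parse, composite) can be sketched as below. This is a simplified sketch under several assumptions: it uses one box per object label rather than per object instance, it skips resizing the crop to the network input size, and `parse_crop` is a hypothetical stand-in for the factorized network. The names `padded_box` and `zoom_refine` and the `pad` parameter are also illustrative.

```python
def padded_box(obj_map, label, pad):
    """Padded bounding box (top, left, bottom, right) of a label, clipped to the map."""
    rows = [r for r, row in enumerate(obj_map) if label in row]
    cols = [c for row in obj_map for c, v in enumerate(row) if v == label]
    if not rows:
        return None
    height, width = len(obj_map), len(obj_map[0])
    return (
        max(min(rows) - pad, 0),
        max(min(cols) - pad, 0),
        min(max(rows) + pad + 1, height),
        min(max(cols) + pad + 1, width),
    )


def zoom_refine(image, obj_map, part_map, parse_crop, pad=1, background=0):
    """Re-parse each object crop and paste the result back onto the part map."""
    refined = [row[:] for row in part_map]
    labels = sorted({v for row in obj_map for v in row} - {background})
    for label in labels:
        top, left, bottom, right = padded_box(obj_map, label, pad)
        crop = [row[left:right] for row in image[top:bottom]]
        crop_parts = parse_crop(crop)  # hypothetical per-crop part prediction
        for r in range(top, bottom):
            for c in range(left, right):
                # Only overwrite pixels that belong to this object.
                if obj_map[r][c] == label:
                    refined[r][c] = crop_parts[r - top][c - left]
    return refined


# Toy usage: a 4x4 image with one small object, and a dummy "network".
image = [[0] * 4 for _ in range(4)]
obj_map = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
part_map = [[0] * 4 for _ in range(4)]
dummy_parser = lambda crop: [[9] * len(crop[0]) for _ in crop]
refined = zoom_refine(image, obj_map, part_map, dummy_parser)
```

Restricting the paste to pixels that carry the object's label keeps the padding region from overwriting neighbouring objects; whether the original method composites in exactly this way is an assumption of the sketch.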


Paper and Supplementary Material

R. Singh, P. Gupta, P. Shenoy, R. Sarvadevabhatla
FLOAT: Factorized Learning of Object Attributes for Improved Multi-object Multi-part Scene Parsing
In CVPR 2022.
(Also hosted on ArXiv)



