Table 3
Table 3: Initialization methods for the backbone.
Table 2
Table 2: Comparison of the proposed method against open-vocabulary and zero-shot segmentation results.
Figure 5
Figure 5: Performance comparison and segmentation results.
Figure 4
Figure 4: Segmentation results with hierarchical structure.
Table 1
Table 1: Recall comparison of segmentation proposals on the two datasets.
Figure 3
Figure 3: Comparison of three different language-driven segmentation paradigms.
Figure 2
Figure 2: Compared with ALIGN, OpenSeg performs pixel-level segmentation much better.
Figure 1
Figure 1: Segmentation training with an arbitrary open vocabulary.
First, it learns to propose segmentation masks for possible organizations. Then it learns visual-semantic alignments by aligning each word in a caption to one or a few predicted masks.
Method: first generate candidate masks, then perform visual-semantic alignment between caption words and masks to obtain the segmentation map.
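To make the word-to-mask alignment concrete, here is a minimal PyTorch sketch of a grounding loss. The function name, temperature, and the softmax-weighted pooling are illustrative assumptions, not OpenSeg's exact formulation; a full implementation would also contrast caption-image scores across a batch.

```python
import torch.nn.functional as F

def word_mask_alignment_loss(mask_feats, word_embs, temperature=0.07):
    # mask_feats: (N, D) visual features pooled over N predicted masks
    # word_embs:  (W, D) embeddings of the W words in a caption
    # (hypothetical sketch; not OpenSeg's released code)
    mask_feats = F.normalize(mask_feats, dim=-1)
    word_embs = F.normalize(word_embs, dim=-1)
    sim = word_embs @ mask_feats.t() / temperature  # (W, N) word-mask similarities
    attn = sim.softmax(dim=-1)                      # each word attends over masks
    word_scores = (attn * sim).sum(dim=-1)          # soft-aligned score per word
    return -word_scores.mean()                      # higher alignment -> lower loss
```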
We design an open-vocabulary image segmentation model to organize an image into meaningful regions indicated by arbitrary texts.
Target task: segmenting an image into meaningful regions using arbitrary open-vocabulary text labels.
Table 1
Main table (MS-COCO): compares three families of methods (self-supervised pre-training, caption pre-training, and pseudo masks) and two supervision paradigms (weakly supervised bounding boxes and instance-mask supervision). Gains on base classes are modest, but gains on target classes are significant.
Figure 6
Ablation of the loss-function improvements on the two datasets.
Figure 4
Qualitative segmentation results: objects are segmented as wholes, though boundary details remain imperfect; the method handles occlusion, containment, and objects of different sizes.
Figure 5
Shows cross-modal segmentation results.
Visualization of pseudo mask noise levels and their reliability scores for the objects mentioned in captions.
Visualization of pseudo-mask noise levels and object reliability scores.
Given an image Ic and the set of objects in captions Oc, we first generate region proposals. We then find the regions that maximize the scores of the teacher embedding head (hEmb) for each object in the caption. We further segment objects within these regions into pseudo masks using the teacher’s mask head (hMask). Finally, the student embedding (gEmb) and mask (gMask) heads are trained via cross-modal and mask losses, respectively. The cross-modal loss is also reweighted based on the pseudo-mask noise levels learned from our pseudo-mask loss.
Model framework: first, generate a set of region proposals from the image; then, for each object mentioned in the caption, select the region that maximizes the teacher embedding head's score and segment it into a pseudo mask; finally, train the student mask head with the pseudo-mask loss and the student embedding head with the cross-modal loss, which is reweighted to suppress noise inherited from the teacher. Question: how are the teacher network's parameters trained?
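A rough sketch of this pseudo-labeling step, assuming region-proposal features and caption-object text embeddings are given and the two teacher heads are callables standing in for hEmb and hMask; all names are hypothetical, not the authors' code.

```python
import torch

def generate_pseudo_masks(region_feats, object_embs,
                          teacher_emb_head, teacher_mask_head):
    # region_feats: (R, D) features of R region proposals for image Ic
    # object_embs:  (K, D) text embeddings of the K caption objects Oc
    scores = teacher_emb_head(region_feats) @ object_embs.t()  # (R, K)
    best = scores.argmax(dim=0)                  # best-scoring region per object
    pseudo_masks = teacher_mask_head(region_feats[best])  # (K, H, W) masks
    return pseudo_masks, best
```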
Our method (bottom) leverages both visual and textual modalities by aligning semantics of caption words with visual features of object masks to correctly label objects and generalize to novel classes without mask annotations.
Learns masks for novel classes via visual-textual semantic alignment.
Conclusions
Conclusion: the paper learns from pseudo masks generated from captions and filters pseudo-mask noise with a robust student model, enabling open-vocabulary segmentation training; the approach is validated on two datasets. Limitation: it suits pre-training on datasets with a broad set of base classes, not pre-training with only a limited set of base classes.
To show the effectiveness of our method, we conduct extensive experiments on MS-COCO and the large-scale Open Images & Conceptual Captions datasets.
Contribution 4: experimental validation on two datasets.
We explicitly capture the reliability of pseudo masks via our robust student model. For pseudo masks with high mask noises, we downweight the loss to avoid error propagation when objects cannot be grounded in images.
Contribution 3: the student model avoids error propagation from pseudo-mask noise.
Our method is designed to work with novel classes by selecting regions whose visual features are most compatible with the semantics of novel classes and segmenting these regions into pseudo masks to self-train a student model.
Contribution 2: regions semantically compatible with novel classes serve as pseudo labels to self-train the student model.
We propose a novel cross-modal pseudo-labeling framework to generate caption-driven pseudo masks and fully utilize captioned images for segmentation training without requiring instance mask annotations.
Contribution 1: caption-driven pseudo labels enable segmentation training without mask annotations.
By extensive experiments, we show the effectiveness of our framework, where we significantly improve mAP score by 4.5% on MS-COCO and 5.1% on the large-scale Open Images & Conceptual Captions datasets compared to the state-of-the-art.
Metric: mAP. Datasets: MS-COCO and Open Images & Conceptual Captions (OICC).
To account for noises in pseudo masks, we design a robust student model that selectively distills mask knowledge by estimating the mask noise levels, hence mitigating the adverse impact of noisy pseudo masks.
To counter pseudo-mask noise, a self-trained student model estimates the noise levels and suppresses their influence.
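A minimal sketch of how the downweighting could look, assuming a per-object cross-entropy form for the cross-modal loss and a noise estimate in [0, 1] per pseudo mask; the paper's exact formulation may differ.

```python
import torch.nn.functional as F

def reweighted_cross_modal_loss(student_embs, vocab_embs, targets, mask_noise,
                                temperature=0.07):
    # student_embs: (K, D) student region embeddings for K pseudo-labeled objects
    # vocab_embs:   (V, D) text embeddings of the class vocabulary
    # targets:      (K,)   index of each object's word in the vocabulary
    # mask_noise:   (K,)   estimated pseudo-mask noise level in [0, 1]
    logits = F.normalize(student_embs, dim=-1) @ F.normalize(vocab_embs, dim=-1).t()
    per_obj = F.cross_entropy(logits / temperature, targets, reduction='none')
    weights = 1.0 - mask_noise  # downweight noisy pseudo masks
    return (weights * per_obj).sum() / weights.sum().clamp(min=1e-6)
```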
To address this, we propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Idea: could a vision-language model generate multiple pseudo-mask labels, which are then used to train the network?
However, the high-level textual information learned from caption pre-training alone cannot effectively encode the details required for pixel-wise segmentation.
Existing pre-training on image-text pairs cannot effectively encode the pixel-level detail needed for segmentation.
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
This paper addresses segmenting novel classes without mask annotations.