Extending CLIP’s Image-Text Alignment to Referring Image Segmentation
- Title
- Extending CLIP’s Image-Text Alignment to Referring Image Segmentation
- Authors
- 김서연
- Date Issued
- 2024
- Publisher
- Pohang University of Science and Technology (POSTECH)
- Abstract
- Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP's inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong cross-modal alignment modules. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP's image-text alignment to RIS.
- URI
- http://postech.dcollection.net/common/orgView/200000805555
- https://oasis.postech.ac.kr/handle/2014.oak/123981
- Article Type
- Thesis
- Files in This Item:
- There are no files associated with this item.