RO-ViT: Region-aware pre-training for open-vocabulary object detection with vision transformers

🤖 Artificial Intelligence

✍️ Author(s)

Dahun Kim and Weicheng Kuo

📅 Publication

2023-08-28T09:59:00.001-07:00

📖 Length

800 words

📋 Article excerpt

Posted by Dahun Kim and Weicheng Kuo, Research Scientists, Google

The ability to detect objects in the visual world is crucial for computer vision and machine intelligence, enabling applications like adaptive autonomous agents and versatile shopping systems. However, modern object detectors are limited by the manual annotations of their training data, resulting in a vocabulary size significantly smaller than the vast array of objects encountered in reality. To overcome this, the open-vocabulary detection (OVD) task has emerged, utilizing image-text pairs for training and incorporating new category names at test time by associating them with the image content. By treating categories as text embeddings, open-vocabulary detectors can predict a wide range of unseen objects. Various techniques such as image-text pre-training, knowledge distillation, pseudo labeling, and frozen models, often employing convolutional neural network (CNN) backbones, have been proposed. With the growing popularity of vision transformers (ViTs), it is important to explore their potential for building proficient open-vocabulary detectors.

The existing approaches assume the availability of pre-trained vision-language models (VLMs) and focus on fine-tuning or distillation from these models to address the disparity between image-level pre-training and object-level fine-tuning. However, as VLMs are primarily designed for image-level tasks like classification and retrieval, they do not fully leverage the concept of objects or regions during the pre-training phase. Thus, it could be beneficial for open-vocabulary detection if we build locality information into the image-text pre-training.

In “RO-ViT: Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers”, presented at CVPR 2023, we introduce a simple method to pre-train vision transformers in a region-aware manner to improve open-vocabulary detection.

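
To make the idea of treating categories as text embeddings concrete, here is a minimal NumPy sketch of open-vocabulary classification for detected regions. It assumes region and text embeddings already live in a shared space, as a vision-language model would provide. The function name `classify_regions`, the `temperature` value, and the toy vectors are illustrative assumptions, not part of the RO-ViT code.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def classify_regions(region_embeddings, text_embeddings, category_names,
                     temperature=0.01):
    """Score every region against every category-name embedding.

    region_embeddings: (num_regions, dim) array.
    text_embeddings:   (num_categories, dim) array, one row per category name.
    temperature:       hypothetical softmax temperature (an assumption).
    """
    regions = l2_normalize(region_embeddings)
    texts = l2_normalize(text_embeddings)
    logits = regions @ texts.T / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    labels = [category_names[i] for i in probs.argmax(axis=-1)]
    return labels, probs


# Toy example: three category names embedded as axis-aligned vectors.
names = ["cat", "dog", "unicycle"]
texts = np.eye(3, 4)
regions = np.array([[0.1, 0.9, 0.0, 0.0],
                    [0.0, 0.1, 1.0, 0.2]])
labels, probs = classify_regions(regions, texts, names)
print(labels)  # ['dog', 'unicycle']
```

Because the vocabulary is only a list of names and their embeddings, a new category can be added at test time by appending a row to `text_embeddings`, with no change to the scoring code.
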
In vision transformers, positional embeddings are added to image patches to encode information about the spatial position of each patch within the image. Standard pre-training typically uses full-image positional embeddings, which do not generalize well to detection tasks. Thus, we propose a new positional embedding scheme, called “cropped positional embedding”, that better aligns with the use of region crops in detection fine-tuning. In addition, we replace the softmax cross entropy loss with focal loss in contrastive image-text learning, allowing us to learn from more challenging and informative examples. Finally, we leverage recent advances in novel object proposals to enhance open-vocabulary detection fine-tuning, which is motivated by the observation that existing methods often miss novel objects during the proposal stage due to overfitting to foreground categories. We are also releasing the code.

Region-aware image-text pre-training

Existing VLMs are trained to match an image as a whole to a text description. However, we observe a mismatch between the way positional embeddings are used in existing contrastive pre-training approaches and the way they are used in open-vocabulary detection. Positional embeddings are important to transformers because they tell the model where each element in the set comes from. This information is often useful for downstream recognition and localization tasks. Pre-training approaches typically apply full-image positional embeddings during training, and use the same positional embeddings for downstream tasks, e.g., zero-shot recognition. However, recognition occurs at the region level in open-vocabulary detection fine-tuning, which requires the full-image positional embeddings to generalize to regions that they never see during pre-training. To address this, we propose cropped positional embeddings (CPE).

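
The cropped positional embedding idea can be sketched in NumPy as three steps: upsample the positional-embedding grid, take a random crop, and resize the crop back to the pre-training grid. This is a rough illustration under stated assumptions, not the released implementation. The 14x14 and 64x64 grid sizes assume 16-pixel patches at 224 and 1024 pixels, and the helper names, `scale_range`, and `ratio_range` are hypothetical choices for this example.

```python
import numpy as np


def bilinear_resize(grid, out_h, out_w):
    """Bilinearly resize a (H, W, D) grid of embeddings to (out_h, out_w, D)."""
    in_h, in_w, _ = grid.shape
    # Map output cell centers back to input coordinates.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def cropped_positional_embedding(pos_embed, rng, up_size=64,
                                 scale_range=(0.1, 1.0),
                                 ratio_range=(0.5, 2.0)):
    """Return a randomly cropped and resized copy of a (H, W, D) embedding grid."""
    h, w, _ = pos_embed.shape
    # 1) Upsample to the grid size assumed for detection-sized inputs.
    upsampled = bilinear_resize(pos_embed, up_size, up_size)
    # 2) Sample the crop's scale, aspect ratio, and position at random.
    area = rng.uniform(*scale_range) * up_size * up_size
    ratio = np.exp(rng.uniform(np.log(ratio_range[0]), np.log(ratio_range[1])))
    crop_h = int(np.clip(round(np.sqrt(area / ratio)), 1, up_size))
    crop_w = int(np.clip(round(np.sqrt(area * ratio)), 1, up_size))
    top = rng.integers(0, up_size - crop_h + 1)
    left = rng.integers(0, up_size - crop_w + 1)
    crop = upsampled[top:top + crop_h, left:left + crop_w]
    # 3) Resize the crop back to the pre-training grid and use it as the
    #    image-level positional embedding for this training step.
    return bilinear_resize(crop, h, w)


rng = np.random.default_rng(0)
pos_embed = rng.normal(size=(14, 14, 8))   # toy 14x14 grid of 8-dim embeddings
cpe = cropped_positional_embedding(pos_embed, rng)
print(cpe.shape)  # (14, 14, 8)
```

The output has the same shape as the original grid, so it can stand in for the full-image positional embeddings without adding parameters.
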
With CPE, we upsample positional embeddings from the image size typical for pre-training, e.g., 224x224 pixels, to that typical for detection tasks, e.g., 1024x1024 pixels. Then we randomly crop and resize a region, and use it as the image-level positional embeddings during pre-training. The position, scale, and aspect ratio of the crop are randomly sampled. Intuitively, this causes the model to view an image not as a full image in itself, but as a region crop from some larger unknown image. This better matches the downstream use case of detection, where recognition occurs at the region level rather than the image level.

Figure: For the pre-training, we propose cropped positional embedding (CPE), which randomly crops and resizes a region of positional embeddings instead of using the whole-image positional embedding (PE). In addition, we use focal loss instead of the common softmax cross entropy loss for contrastive learning.

We also find it beneficial to learn from hard examples with a focal loss. Focal loss enables finer control over how hard examples are weighted than the softmax cross entropy loss can provide. We adopt the focal loss in place of the softmax cross entropy loss in both the image-to-text and text-to-image losses. Both CPE and focal loss introduce no extra parameters and minimal computation costs.

Open-vocabulary detector fine-tuning

An open-vocabulary detector is trained with the detection labels of ‘base’ categories, but needs to detect the union of ‘base’ and ‘novel’ (unlabeled) categories at test time. Although the backbone features are pre-trained on vast open-vocabulary data, the added detector layers (neck and heads) are newly trained with the downstream detection dataset. Existing approaches often...
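
The focal-loss replacement in the contrastive objective described above can be sketched as follows. The excerpt does not give the exact formulation, so this example assumes a sigmoid-based focal loss applied to every image-text pair in a batch. The function name and the `temperature` and `gamma` values are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def focal_contrastive_loss(image_emb, text_emb, temperature=0.07, gamma=2.0):
    """Focal variant of an image-text contrastive loss (assumed sigmoid form).

    image_emb, text_emb: (batch, dim) arrays where row i of each is a
    matching image-text pair. gamma=0 reduces to plain sigmoid cross entropy;
    larger gamma down-weights pairs the model already gets right.
    """
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    is_match = np.eye(logits.shape[0], dtype=bool)
    # Probability assigned to the correct outcome for each pair:
    # sigmoid(logit) for matching pairs, 1 - sigmoid(logit) otherwise.
    signed = np.where(is_match, logits, -logits)
    log_p = -np.logaddexp(0.0, -signed)      # stable log(sigmoid(signed))
    p = np.exp(log_p)
    per_pair = -((1.0 - p) ** gamma) * log_p
    # Image-to-text sums over texts per image; text-to-image sums over images
    # per text. With this pairwise form the two directions give the same
    # value; both are kept to mirror the two losses named in the text.
    image_to_text = per_pair.sum(axis=1).mean()
    text_to_image = per_pair.T.sum(axis=1).mean()
    return image_to_text + text_to_image


rng = np.random.default_rng(0)
images = rng.normal(size=(4, 16))
texts = images + 0.01 * rng.normal(size=(4, 16))   # near-perfect matches
print(focal_contrastive_loss(images, texts) < focal_contrastive_loss(images, texts[::-1]))  # True
```

Like the loss it replaces, this function only reweights existing pairwise terms, which is consistent with the statement that focal loss adds no parameters.
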

📖 READ THE FULL ARTICLE AT THIS LINK:

🔗 http://ai.googleblog.com/2023/08/ro-vit-region-aware-pre-training-for.html

Click the link above to read the full article.

🏷️ Keywords: 🤖 Artificial Intelligence

📊 Statistics: 800-word excerpt

🤖 Automatically published by Analyseur Science | Original source: http://ai.googleblog.com/2023/08/ro-vit-region-aware-pre-training-for....
