License: CC BY 4.0

Overview

Open Images is a dataset of ~9M images annotated with image-level labels, object bounding boxes,
object segmentation masks, visual relationships, and localized narratives. It contains a total
of 16M bounding boxes for 600 object classes on 1.9M images, making it the largest existing
dataset with object location annotations. The boxes have been largely manually drawn by professional
annotators to ensure accuracy and consistency. The images are very diverse and often contain
complex scenes with several objects (8.3 per image on average). Open Images also offers visual
relationship annotations, indicating pairs of objects in particular relations (e.g. "woman
playing guitar", "beer on table"), object properties (e.g. "table is wooden"), and human actions
(e.g. "woman is jumping"). In total it has 3.3M annotations from 1,466 distinct relationship
triplets. In V5 we added segmentation masks for 2.8M object instances in 350 classes. Segmentation
masks mark the outline of objects, which characterizes their spatial extent to a much higher
level of detail. In V6 we added 507k localized narratives: multimodal descriptions of images
consisting of synchronized voice, text, and mouse traces over the objects being described.
Finally, the dataset is annotated with 59.9M image-level labels spanning 19,957 classes.
We believe that having a single dataset with unified annotations for image classification,
object detection, visual relationship detection, instance segmentation, and multimodal image
descriptions will enable studying these tasks jointly and stimulate progress towards genuine
scene understanding.

Data Annotation

The dataset is split into a training set (9,011,219 images), a validation set (41,620 images),
and a test set (125,436 images). The images are annotated with image-level labels, object bounding
boxes, object segmentation masks, visual relationships, and localized narratives as described
below.

Image-level Labels

The following table shows an overview of the image-level labels in all splits of the
dataset. All images have machine-generated image-level labels, produced automatically by a
computer vision model similar to the one behind the Google Cloud Vision API.
These automatically generated labels have a substantial false-positive rate.

Train Validation Test # Classes # Trainable Classes
Images 9,011,219 41,620 125,436 - -
Machine-Generated Labels 164,819,642 681,179 2,061,177 15,387 9,034
Human-verified Labels (total) 57,524,352 595,339 1,799,883 19,957 9,605
Human-verified Labels (positive) 19,856,086 367,263 1,110,124 - -
Human-verified Labels (negative) 37,668,266 228,076 689,759 - -

Moreover, the validation and test sets, as well as part of the training set, have
human-verified image-level labels. Most verifications were done by in-house annotators at
Google; a smaller part was crowd-sourced through the Image Labeler: Crowdsource app,
g.co/imagelabeler.
This verification process practically eliminates false positives (but not false negatives:
some labels might still be missing from an image). The resulting labels are largely correct,
and we recommend using them for training computer vision models. Multiple computer vision
models were used to generate the label candidates (not just the one used for the
machine-generated labels), which is why the verified vocabulary is significantly expanded
(# Classes column).
As a result of
our annotation process, each image is annotated both with verified positive image-level labels,
indicating some object classes are present, and with verified negative image-level labels,
indicating some classes are absent. All other classes which are not explicitly marked as positive
or negative for an image are not annotated. The verified negative labels are reliable and can
be used during training and evaluation of image classifiers.
Overall, there are 19,958 distinct classes with image-level labels.
Note that this number is slightly higher than the number of human-verified label classes in
Table 1; the reason is that a small number of classes in the machine-generated set do not
appear in the human-verified set. Trainable classes are those with at least 100 positive
human verifications in the V6 training set. By this definition, 9,605 classes are considered
trainable, and machine-generated labels cover 9,034 of these.
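The positive/negative/unannotated split described above matters when training a classifier: only verified labels should contribute to the loss. The sketch below (illustrative function name and dict-based interface, not part of any official tooling) shows one way to mask the loss accordingly:

```python
import numpy as np

def partial_label_bce(probs, positives, negatives):
    """Binary cross-entropy over verified labels only.

    probs:     dict mapping class id -> predicted probability in (0, 1)
    positives: set of class ids verified present in the image
    negatives: set of class ids verified absent from the image
    Classes in neither set are unannotated and contribute no loss.
    """
    loss = 0.0
    for c in positives:
        loss -= np.log(probs[c])        # penalize low score on a present class
    for c in negatives:
        loss -= np.log(1.0 - probs[c])  # penalize high score on an absent class
    return loss / max(len(positives) + len(negatives), 1)
```

Unverified classes are simply skipped rather than being treated as negatives, which would wrongly punish correct predictions on unannotated-but-present objects.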

Bounding boxes

The following table shows an overview of the bounding box annotations
in all splits of the dataset, which span 600 object classes.
These offer a broader range than those in the ILSVRC and COCO detection challenges, including
new objects such as "fedora" and "snowman".

Train Validation Test # Classes
Images 1,743,042 41,620 125,436 -
Boxes 14,610,229 303,980 937,327 600

For the training set, we annotated boxes in
1.74M images, for the available positive human-verified image-level labels. We focused on the
most specific labels. For example, if an image has labels {car, limousine, screwdriver}, we
annotated boxes for limousine and screwdriver. For each positive label in an image, we exhaustively
annotated every instance of that object class in the image
(but see below for group cases).
We provide 14.6M bounding boxes. On average there are 8.4 boxed objects per image. 90% of the
boxes were manually drawn by professional annotators at Google using the efficient extreme
clicking interface (new in V6: the actual four extreme points defining each box are released
for train). We produced the remaining 10% semi-automatically using an enhanced version of a
previously published method. These boxes have been human-verified to have IoU > 0.7 with a
perfect box on the object, and in practice they are accurate (mean IoU ~0.77). Due to the
overwhelming number of instances (1,327,596 on 95,335 images), we drew bounding boxes for
human body parts and the class "Mammal" only on those 95,335 images. This list of images
enables using the data correctly during training of object detectors (as there might be a
positive image-level label for a human body part, and yet no boxes). Finally, we drew a single
box around groups of objects (e.g., a bed of flowers or a crowd of people) if they had more
than 5 instances which were heavily occluding each other and physically touching (we marked
these boxes with the attribute "group-of").
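The IoU > 0.7 acceptance test used for the semi-automatic boxes is standard intersection-over-union on corner-format boxes. A minimal reference implementation (boxes assumed to be (xmin, ymin, xmax, ymax) tuples):

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Overlap extents; clamp at 0 when the boxes are disjoint.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```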
For the validation and test sets, we provide
exhaustive box annotation for all object instances, for all available positive image-level
labels (again, except for "group-of" cases). All boxes were manually drawn. We deliberately
tried to annotate boxes at the most specific level possible in our semantic hierarchy.
On average, there are 7.4 boxes per image in the validation and test sets. For
Open Images V5, we improved the annotation density, which now comes close to the density in
the training set. This ensures more precise evaluation of object detection models. In contrast
to the training set, on the validation and test sets we annotated human body parts on all images
for which we have a positive label.
We emphasize that the images are annotated with both human-verified
positive and negative labels. Importantly, the negative image-level labels can be used during
training of object detectors, e.g. for hard-negative mining. Moreover, they can also be used
during evaluation, as detections of a class annotated as negative (absent) in the ground-truth
can be reliably counted as false-positives. In our Open Images Challenge website
we present an evaluation metric
that fully uses the image-level labels to fairly evaluate detection models.
In all splits
(train, val, test), annotators also marked a set of attributes for each box, e.g. indicating
whether that object is occluded.
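In the released CSV files, box coordinates are normalized to [0, 1] of the image size, and the attributes above (e.g. IsGroupOf, IsOccluded) appear as 0/1 columns. A sketch of a loader is below; the column names follow the published box CSVs but should be verified against the actual file header:

```python
import csv
import io

def load_boxes(csv_text, img_w, img_h, skip_groups=True):
    """Parse box rows, scaling normalized [0, 1] coordinates to pixels."""
    boxes = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if skip_groups and row["IsGroupOf"] == "1":
            continue  # a single box over >5 touching instances, not one object
        boxes.append({
            "label": row["LabelName"],
            "xmin": float(row["XMin"]) * img_w,
            "xmax": float(row["XMax"]) * img_w,
            "ymin": float(row["YMin"]) * img_h,
            "ymax": float(row["YMax"]) * img_h,
        })
    return boxes
```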

Object segmentations

The following table shows an overview of the object segmentation
annotations in all splits of the dataset. These annotations cover a subset of 350 classes from
the 600 boxed classes. These offer a broader range of categories than Cityscapes or COCO, and
cover more images and instances than ADE20k. The segmentations spread over a subset of the
images with bounding boxes.

Train Validation Test # Classes
Images 944,037 13,524 40,386 -
Instance masks 2,686,666 24,730 -- 350

For the training set we annotated 2.7M instance masks, starting
from the available bounding boxes. The masks cover 350 classes and are spread over 944k images.
On average there are 2.8 segmented instances per image. The segmentation masks on the training
set have been produced by a state-of-the-art interactive segmentation process, where professional
human annotators iteratively correct the output of a segmentation neural network. This is more
efficient than manual drawing alone, while at the same time delivering accurate masks (mIoU
84%).
We selected the 350 classes to annotate with segmentation masks based on the following
criteria: (1) whether the class exhibits one coherent appearance over which a policy could
be defined (e.g. "hiking equipment" is rather ill-defined); (2) whether a clear annotation
policy can be defined (e.g. which pixels belong to a nose?); and (3) whether we expect current
segmentation neural networks to be able to capture the shape of the class adequately (e.g.
jellyfish contains thin structures that are hard for state-of-the-art models).
We have put
particular effort into ensuring consistent annotations across different objects (e.g., all
cat masks include their tail; bags carried by camels or persons are included in their mask).
We annotated all boxed instances of these 350 classes on the training split that fulfill the
following criteria: (1) the object size is larger than 40x80 or 80x40 pixels; (2) the object
boundaries can be confidently determined by the annotator (e.g. blurry or very dark instances
are skipped); (3) the bounding-box contains a single real object (i.e. does not have any of
the IsGroupOf, IsDepiction, IsInside attributes). A few of the 350 classes have a disproportionately
large number of instances. To better spread the annotation effort we capped four categories:
"clothing" to 441k instances, "person" to 149k, "woman" to 117k, "man" to 114k. In total we
annotated segmentation masks for 769k instances of "person"+"man"+"woman"+"boy"+"girl"+"human
body". All other classes are annotated without caps, using only the three criteria above.
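The three instance-selection criteria can be expressed as a simple predicate. The function name and the boundary_clear flag (standing in for the annotator's judgment) are illustrative:

```python
def eligible_for_mask(box_w, box_h, boundary_clear, attrs):
    """Training-split mask selection criteria described above:
    (1) size at least 40x80 or 80x40 pixels,
    (2) object boundary confidently determinable by the annotator,
    (3) a single real object: none of IsGroupOf/IsDepiction/IsInside set.
    """
    big_enough = (box_w >= 40 and box_h >= 80) or (box_w >= 80 and box_h >= 40)
    single_object = not any(
        attrs.get(a) for a in ("IsGroupOf", "IsDepiction", "IsInside"))
    return big_enough and boundary_clear and single_object
```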
For the validation and test splits we created 99k masks spread over 54k images. These have
been annotated with a purely manual free-painting tool and with a strong focus on quality.
They are near-perfect (self-consistency 90% mIoU) and capture even fine details of complex
object boundaries (e.g. spiky flowers and thin structures in man-made objects). For the validation
and test splits we limited these annotations to a maximum of 600 instances per class (per split),
and applied the same instance selection criteria as in the training split (minimal size, unambiguous
boundary, single real object). On average over all instances, both our training and validation+test
annotations offer more accurate object boundaries than the polygon annotations provided by
most existing datasets.
Please note that instances without a mask remain covered by their
corresponding bounding boxes, and thus can be appropriately handled during training and evaluation
of segmentation models.
We emphasize that the images are annotated with both human-verified positive
and negative labels. The negative image-level labels can be used during training of segmentation
models, e.g. for hard-negative mining. Moreover, they can also be used during evaluation, as
we do for the Open Images Challenge.

Visual relationships

The following table shows an overview of the visual relationship annotations in the dataset.

Train Validation Test # Distinct relationship triplets # Classes
Relationship triplets 3,174,293 27,243 82,746 1,466 288
Relationship triplets (non-attribute) 348,560 4,951 14,403 1,384 -

In our notation, a pair of objects connected by a relationship forms a triplet (e.g. "beer
on table"). Visual attributes are also represented as triplets, where an object is connected
with an attribute using the relationship "is". We annotate two types of attributes: physical
object properties (e.g. "table is wooden" or "handbag is made of leather") and human actions
(e.g. "man is jumping" or "woman is standing"). We initially selected 2,019 possible triplets
based on existing bounding box annotations. The 1,466 of them that have at least one instance
in the training split form the final set of visual relationships/attributes triplets. In total,
we annotated more than 3.1M instances of these triplets on the training split, involving 288
different object classes and 15 attributes. These include human-object relationships (e.g.
"woman playing guitar", "man holding microphone"), object-object relationships (e.g. "beer
on table", "dog inside car"), human-human relationships (e.g. "two men shake hands"), object
attributes (e.g. "table is wooden"), and human actions (e.g. "man is jumping").
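Since relationships and attributes share the triplet form, a single record type suffices; attribute and action triplets simply use the relationship "is". A hypothetical minimal representation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    """One visual relationship annotation: subject-relationship-object."""
    subject: str
    relationship: str
    obj: str

beer_on_table = Triplet("beer", "on", "table")   # object-object relationship
wooden_table = Triplet("table", "is", "wooden")  # attribute as a triplet
man_jumping = Triplet("man", "is", "jumping")    # human action as a triplet
```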
Visual relationship
annotations are exhaustive (except human-human relationships, see next), meaning that for each
image that can potentially contain a relationship triplet (i.e. that contains the objects involved
in that triplet), we provide annotations exhaustively listing all positive triplet instances
in that image. For example, for "woman playing guitar" in an image, we list all pairs of ("woman","guitar")
that are in the relationship "playing" in that image. All other pairs of ("woman","guitar")
in that image are reliable negative examples for the "playing" relationship. Further, human-human
relationships were exhaustively annotated for the images that have the corresponding positively
verified image-level label for the relationship (e.g., if an image has positively verified
image-level label "hug", all pairs of people hugging would be annotated).
Finally, we annotated
some zero-shot triplets: 61 distinct triplets in the validation and test sets do not have samples
in the train set (and some triplets in train set do not have corresponding triplets on validation
and test sets). Examples of these triplets are: "girl holds dumbbell", "pizza on a cutting
board", or "dog on a washing machine".
We emphasize that the images are annotated both with
human-verified positive and negative image-level labels. Importantly, the negative image-level
labels can be used during training of visual relationship detectors: if any of the two object
classes in a relationship triplet is marked as a negative label in our ground-truth, then all
detections of that triplet are false-positives. The same can be done during evaluation, as
we did for our official Open Images Challenge metric.
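The false-positive rule stated above is mechanical: if either object class of a detected triplet is a verified negative label for the image, the detection cannot be correct. A sketch (names illustrative):

```python
def count_guaranteed_fps(detections, negative_labels):
    """Count detected triplets that must be false positives because their
    subject or object class is a verified negative image-level label."""
    return sum(1 for subj, _rel, obj in detections
               if subj in negative_labels or obj in negative_labels)
```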

Localized narratives

Localized narratives are multimodal descriptions of images consisting of synchronized voice,
text, and mouse traces over the objects being described. In Open Images V6 we released 507k
localized narratives annotations on images from Open Images. More information about this type
of annotations, as well as visualizations and annotations for other datasets can be found in
the localized narratives standalone website.

Citation

The following paper describes Open Images V4 in depth: from the data collection and annotation
to detailed statistics about the data and evaluation of models trained on it. If you use the
Open Images dataset in your work (also V5 and V6), please cite this article.

@article{OpenImages,
  author  = {Alina Kuznetsova and Hassan Rom and Neil Alldrin and Jasper Uijlings and Ivan Krasin and Jordi Pont-Tuset and Shahab Kamali and Stefan Popov and Matteo Malloci and Alexander Kolesnikov and Tom Duerig and Vittorio Ferrari},
  title   = {The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale},
  journal = {IJCV},
  year    = {2020}
}

@inproceedings{OpenImagesSegmentation,
  author = {Rodrigo Benenson and Stefan Popov and Vittorio Ferrari},
  title = {Large-scale interactive object segmentation with human annotators},
  booktitle = {CVPR},
  year = {2019}
}

@article{OpenImagesLocNarr,
  author  = {Jordi Pont-Tuset and Jasper Uijlings and Soravit Changpinyo and Radu Soricut and Vittorio Ferrari},
  title   = {Connecting Vision and Language with Localized Narratives},
  journal = {arXiv},
  volume  = {1912.03098},
  year    = {2019}
}

Please also consider citing this general reference to the dataset:

@article{OpenImages2,
  title   = {OpenImages: A public dataset for large-scale multi-label and multi-class image classification},
  author  = {Krasin, Ivan and Duerig, Tom and Alldrin, Neil and Ferrari, Vittorio and Abu-El-Haija, Sami and Kuznetsova, Alina and Rom, Hassan and Uijlings, Jasper and Popov, Stefan and Kamali, Shahab and Malloci, Matteo and Pont-Tuset, Jordi and Veit, Andreas and Belongie, Serge and Gomes, Victor and Gupta, Abhinav and Sun, Chen and Chechik, Gal and Cai, David and Feng, Zheyun and Narayanan, Dhyanesh and Murphy, Kevin},
  journal = {Dataset available from https://storage.googleapis.com/openimages/web/index.html},
  year    = {2017}
}

License

CC BY 4.0

Data Summary

Type: Image
Amount: --
Size: 671.41 GB
Provided by: Google
Google LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware.