COCO
2D Box | Classification | 2D Panoptic Segmentation | 2D Polygon | Action/Event Detection | Common
License: CC BY 4.0

Overview

COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several
features:

  • Object segmentation
  • Recognition in context
  • Superpixel stuff segmentation
  • 330K images (>200K labeled)
  • 1.5 million object instances
  • 80 object categories
  • 91 stuff categories
  • 5 captions per image
  • 250,000 people with keypoints

Data Annotation

COCO has several annotation types: for object detection,
keypoint detection, stuff segmentation,
panoptic segmentation, densepose,
and image captioning. The annotations are stored
using JSON. Please note that the COCO API
described on the download page can be used to access and
manipulate all annotations. All annotations share the same basic data structure below:

{
"info" : info,
"images" : [image],
"annotations" : [annotation],
"licenses" : [license],
}

info{
"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : datetime,
}

image{
"id" : int,
"width" : int,
"height" : int,
"file_name" : str,
"license" : int,
"flickr_url" : str,
"coco_url" : str,
"date_captured" : datetime,
}

license{
"id" : int,
"name" : str,
"url" : str,
}
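
As a concrete illustration, the following Python sketch loads one of these JSON files and builds the usual lookups. The file path is an assumption; point it at any COCO-format annotation file.

import json
from collections import defaultdict

# Path is an assumption; use any COCO-format annotation JSON.
with open("annotations/instances_val2017.json") as f:
    data = json.load(f)

images = {img["id"]: img for img in data["images"]}       # image id -> image record
anns_by_image = defaultdict(list)
for ann in data["annotations"]:
    anns_by_image[ann["image_id"]].append(ann)            # group annotations per image

print(data["info"]["description"], "-", len(images), "images")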

The data structures specific to the various annotation types are described below.

Object Detection

Each
object instance annotation contains a series of fields, including the category id and segmentation
mask of the object. The segmentation format depends on whether the instance represents a single
object (iscrowd=0 in which case polygons are used) or a collection of objects (iscrowd=1
in which case RLE is used). Note that a single object (iscrowd=0) may require multiple polygons,
for example if occluded. Crowd annotations (iscrowd=1) are used to label large groups of objects
(e.g. a crowd of people). In addition, an enclosing bounding box is provided for each object
(box coordinates are measured from the top left image corner and are 0-indexed). Finally, the
categories field of the annotation structure stores the mapping of category id to category
and supercategory names. See also the detection
task.

annotation{
  "id"             : int,
  "image_id"       : int,
  "category_id"    : int,
  "segmentation"   : RLE or [polygon],
  "area"           : float,
  "bbox"           : [x,y,width,height],
  "iscrowd"        : 0 or 1,
}

categories[{
  "id"             : int,
  "name"           : str,
  "supercategory"  : str,
}]
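
A minimal sketch of reading one instance annotation from the structures above, reusing the data dict loaded in the sketch earlier (variable names are illustrative):

# Map category ids to names using the categories list.
cat_names = {c["id"]: c["name"] for c in data["categories"]}

ann = data["annotations"][0]
x, y, w, h = ann["bbox"]                  # 0-indexed, measured from the top-left corner
label = cat_names[ann["category_id"]]
seg = ann["segmentation"]                 # [polygon] if iscrowd=0, RLE dict if iscrowd=1
print(label, (x, y, w, h), "crowd" if ann["iscrowd"] else "single object")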

Keypoint Detection

A keypoint annotation contains all the data of the object annotation (including id,
bbox, etc.) and two additional fields. First, "keypoints" is a length 3k array where k is the
total number of keypoints defined for the category. Each keypoint has a 0-indexed location
x,y and a visibility flag v defined as v=0: not labeled (in which case x=y=0), v=1: labeled
but not visible, and v=2: labeled and visible. A keypoint is considered visible if it falls
inside the object segment. "num_keypoints" indicates the number of labeled keypoints (v>0)
for a given object (many objects, e.g. crowds and small objects, will have num_keypoints=0).
Finally, for each category, the categories struct has two additional fields: "keypoints",
which is a length k array of keypoint names, and "skeleton", which defines connectivity via
a list of keypoint edge pairs and is used for visualization. Currently keypoints are only
labeled for the person category (for most medium/large non-crowd person instances). See also
the keypoint task.

annotation{
  "keypoints"        : [x1,y1,v1,...],
  "num_keypoints"    : int,
  "[cloned]"         : ...,
}

categories[{
  "keypoints"        : [str],
  "skeleton"         : [edge],
  "[cloned]"         : ...,
}]

"[cloned]": denotes fields copied from object detection annotations defined above.

Stuff Segmentation

The stuff annotation format
is identical to, and fully compatible with, the object detection format above (except iscrowd is
unnecessary and set to 0 by default). We provide annotations in both JSON and PNG formats for
easier access, as well as conversion scripts between
the two formats. In the JSON format, each category present in an image is encoded with a single
RLE annotation (see the Mask API for more details). The category_id represents the id of the
current stuff category. For more details on stuff categories and supercategories see the
stuff evaluation page. See also the stuff
task.
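
A minimal sketch of decoding one stuff annotation's RLE into a binary mask with pycocotools, assuming ann is a single stuff annotation loaded from the JSON format:

from pycocotools import mask as mask_utils

m = mask_utils.decode(ann["segmentation"])        # HxW uint8 binary mask for this category
print(ann["category_id"], int(m.sum()), "pixels")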

Panoptic Segmentation

For the panoptic task, each annotation struct
is a per-image annotation rather than a per-object annotation. Each per-image annotation
has two parts: (1) a PNG that stores the class-agnostic image segmentation and (2) a JSON
struct that stores the semantic information for each image segment. In more detail:

  1. To
    match an annotation with an image, use the image_id field (that is annotation.image_id==image.id).
  2. For each annotation, per-pixel segment ids are stored as a single PNG at annotation.file_name.
    The PNGs are in a folder with the same name as the JSON, i.e., annotations/name/ for annotations/name.json.
    Each segment (whether it's a stuff or thing segment) is assigned a unique id. Unlabeled pixels
    (void) are assigned a value of 0. Note that when you load the PNG as an RGB image, you will
    need to compute the ids via ids = R + G*256 + B*256^2 (a short sketch of this computation
    follows the structs below).
  3. For each annotation, per-segment info
    is stored in annotation.segments_info. segment_info.id stores the unique id of the segment
    and is used to retrieve the corresponding mask from the PNG (ids==segment_info.id). category_id
    gives the semantic category and iscrowd indicates the segment encompasses a group of objects
    (relevant for thing categories only). The bbox and area fields provide additional info about
    the segment.
  4. The COCO panoptic task has the same thing categories as the detection task,
    whereas the stuff categories differ from those in the stuff task (for details see the panoptic
    evaluation
    page). Finally, each category struct has
    two additional fields: isthing that distinguishes stuff and thing categories and color that
    is useful for consistent visualization.
annotation{
  "image_id"         : int,
  "file_name"        : str,
  "segments_info"    : [segment_info],
}

segment_info{
  "id"               : int,
  "category_id"      : int,
  "area"             : int,
  "bbox"             : [x,y,width,height],
  "iscrowd"          : 0 or 1,
}

categories[{
  "id"               : int,
  "name"             : str,
  "supercategory"    : str,
  "isthing"          : 0 or 1,
  "color"            : [R,G,B],
}]
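
The sketch below illustrates steps 1-3: it loads the panoptic JSON, reads the per-pixel PNG, recovers segment ids via ids = R + G*256 + B*256^2, and looks up each segment's info. The paths are assumptions:

import json
import numpy as np
from PIL import Image

# Paths are assumptions; adjust to wherever the panoptic annotations live.
with open("annotations/panoptic_val2017.json") as f:
    panoptic = json.load(f)

ann = panoptic["annotations"][0]
rgb = np.array(Image.open("annotations/panoptic_val2017/" + ann["file_name"]), dtype=np.uint32)
seg_ids = rgb[..., 0] + rgb[..., 1] * 256 + rgb[..., 2] * 256 ** 2   # ids = R + G*256 + B*256^2

for info in ann["segments_info"]:
    seg_mask = seg_ids == info["id"]               # binary mask for this segment
    print(info["category_id"], info["iscrowd"], int(seg_mask.sum()), "pixels")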

Image Captioning

These annotations are used to store image captions.
Each caption describes the specified image and each image has at least 5 captions (some images
have more). See also the captioning task.

annotation{
  "id"               : int,
  "image_id"         : int,
  "caption"          : str,
}

DensePose

For the
DensePose task, each annotation contains a series
of fields, including category id, bounding box, body part masks and parametrization data for
selected points, which are detailed below.

Crowd annotations (iscrowd=1) are used to label large groups of objects (e.g. a crowd of people).

An enclosing bounding box is provided for each
person (box coordinates are measured from the top left image corner and are 0-indexed).

The categories field of the annotation structure stores the mapping of category id to category
and supercategory names.

DensePose annotations are stored in dp_* fields:

Annotated masks:

  • dp_masks: RLE encoded dense masks. All part masks are of size 256x256.
    They correspond to 14 semantically meaningful parts of the body: Torso, Right Hand, Left Hand,
    Left Foot, Right Foot, Upper Leg Right, Upper Leg Left, Lower Leg Right, Lower Leg Left, Upper
    Arm Left, Upper Arm Right, Lower Arm Left, Lower Arm Right, Head;

Annotated points:

  • dp_x, dp_y: spatial coordinates
    of collected points on the image. The coordinates are scaled such that the bounding box size
    is 256x256;
  • dp_I: The patch index that indicates which of the 24 surface patches the point
    is on. Patches correspond to the body parts described above. Some body parts are split into
    2 patches: 1, 2 = Torso, 3 = Right Hand, 4 = Left Hand, 5 = Left Foot, 6 = Right Foot, 7, 9
    = Upper Leg Right, 8, 10 = Upper Leg Left, 11, 13 = Lower Leg Right, 12, 14 = Lower Leg Left,
    15, 17 = Upper Arm Left, 16, 18 = Upper Arm Right, 19, 21 = Lower Arm Left, 20, 22 = Lower
    Arm Right, 23, 24 = Head;
  • dp_U, dp_V: Coordinates in the UV space. Each surface patch has a separate 2D parameterization.
annotation{
  "id"               : int,
  "image_id"         : int,
  "category_id"      : int,
  "is_crowd"         : 0 or 1,
  "area"             : int,
  "bbox"             : [x,y,width,height],
  "dp_I"             : [float],
  "dp_U"             : [float],
  "dp_V"             : [float],
  "dp_x"             : [float],
  "dp_y"             : [float],
  "dp_masks"         : [RLE],
}
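
A minimal sketch of mapping the annotated points back into image coordinates, assuming ann is one DensePose annotation and that dp_x/dp_y live in the 256x256 frame of the instance bounding box, as described above:

x0, y0, w, h = ann["bbox"]
xs = [x0 + px / 256.0 * w for px in ann["dp_x"]]   # image-space x of collected points
ys = [y0 + py / 256.0 * h for py in ann["dp_y"]]
for x, y, i, u, v in zip(xs, ys, ann["dp_I"], ann["dp_U"], ann["dp_V"]):
    print(f"patch {int(i):2d}  image ({x:7.1f},{y:7.1f})  UV ({u:.3f},{v:.3f})")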

Instruction

COCO API

The COCO API assists in loading, parsing, and visualizing annotations in COCO. The API supports
multiple annotation formats (please see the data format
page). For additional details see: CocoApi.m,
coco.py,
and CocoApi.lua for
Matlab, Python, and Lua code, respectively, and also the Python API demo.

Throughout the API "ann"=annotation, "cat"=category, and "img"=image.
  • getAnnIds: Get ann ids that satisfy given filter conditions.
  • getCatIds: Get cat ids that satisfy given filter conditions.
  • getImgIds: Get img ids that satisfy given filter conditions.
  • loadAnns: Load anns with the specified ids.
  • loadCats: Load cats with the specified ids.
  • loadImgs: Load imgs with the specified ids.
  • loadRes: Load algorithm results and create API for accessing them.
  • showAnns: Display the specified annotations.
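
A short usage sketch of the Python API (pycocotools); the annotation file path is an assumption:

from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")          # path is an assumption

cat_ids = coco.getCatIds(catNms=["person"])                # category ids for 'person'
img_ids = coco.getImgIds(catIds=cat_ids)                   # images containing that category
img = coco.loadImgs(img_ids[0])[0]
ann_ids = coco.getAnnIds(imgIds=img["id"], catIds=cat_ids, iscrowd=None)
anns = coco.loadAnns(ann_ids)
print(img["file_name"], "-", len(anns), "person annotations")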

MASK API

COCO provides segmentation masks for every object instance. This creates two challenges: storing
masks compactly and performing mask computations efficiently. We solve both challenges using
a custom Run Length Encoding (RLE) scheme. The size of the RLE representation is proportional
to the number of boundary pixels of a mask, and operations such as area, union, or intersection
can be computed efficiently directly on the RLE. Specifically, assuming fairly simple shapes,
the RLE representation is O(√n), where n is the number of pixels in the object, and common computations
are likewise O(√n). Naively computing the same operations on the decoded masks (stored as
an array) would be O(n).

The MASK API provides an interface for manipulating masks stored
in RLE format. The API is defined below; for additional details see MaskApi.m,
mask.py,
or MaskApi.lua. Finally,
we note that a majority of ground truth masks are stored as polygons (which are quite compact);
these polygons are converted to RLE when needed.

  • encode: Encode binary masks using RLE.
  • decode: Decode binary masks encoded via RLE.
  • merge: Compute union or intersection of encoded masks.
  • iou: Compute intersection over union between masks.
  • area: Compute area of encoded masks.
  • toBbox: Get bounding boxes surrounding encoded masks.
  • frBbox: Convert bounding boxes to encoded masks.
  • frPoly: Convert polygon to encoded mask.
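
A short sketch of the Python Mask API (pycocotools.mask) on two toy masks; the encoder expects Fortran-ordered uint8 arrays:

import numpy as np
from pycocotools import mask as mask_utils

a = np.asfortranarray(np.zeros((8, 8), dtype=np.uint8)); a[:4, :4] = 1
b = np.asfortranarray(np.zeros((8, 8), dtype=np.uint8)); b[:6, :6] = 1

rle_a, rle_b = mask_utils.encode(a), mask_utils.encode(b)
print(mask_utils.area(rle_a))                       # 16
print(mask_utils.toBbox(rle_a))                     # [x, y, width, height]
print(mask_utils.iou([rle_a], [rle_b], [0]))        # IoU computed directly on the RLE
union = mask_utils.merge([rle_a, rle_b])            # intersect defaults to union
print(mask_utils.decode(union).sum())               # 36 pixels in the union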

License

CC BY 4.0

Data Summary
Type: Image
Amount: 1009.571K
Size: 83.39GB
Provided by: Microsoft
Microsoft Corporation (/ˈmaɪkroʊsɒft/) is an American multinational technology company with headquarters in Redmond, Washington. It develops, manufactures, licenses, supports, and sells computer software, consumer electronics, personal computers, and related services. Its best known software products are the Microsoft Windows line of operating systems, the Microsoft Office suite, and the Internet Explorer and Edge web browsers. Its flagship hardware products are the Xbox video game consoles and the Microsoft Surface lineup of touchscreen personal computers.
