MegaFace
Classification
Face
License: Unknown

Overview

In total, once clustered and optimized, MF2 contains 4,753,320 faces and 672,057 identities.
On average this is 7.07 photos per identity, with a minimum of 3 photos per identity and a maximum
of 2,469. We expanded the tight-crop version by re-downloading the clustered faces and saving
a loosely cropped version. The tightly cropped dataset requires 159GB of space, while the loosely
cropped version is split into 14 files of 65GB each, for a total of 910GB. To gather statistics
on age and gender, we ran the WIKI-IMDB age and gender detection models over the loosely
cropped version of the data set. We found that females accounted for 41.1% of subjects and
males for 58.8%. The median gender variance within identities was 0. The average
age range within identities was 16.1 years, while the median was 12 years. The distributions
can be found in the supplementary material. A trade-off of this algorithm is that its parameters
must strike a balance between noise and quantity of data. The VGG-Face work noted that, given
the choice between a larger but more impure data set and a smaller hand-cleaned one, the larger
can actually give better performance. A strong reason for opting to remove most faces from the
initial unlabeled corpus was detection error: we found that many images were actually non-faces.
There were also many identities that did not appear more than once, and these would not be as
useful for learning algorithms. Visual inspection of 50 faces randomly discarded by the algorithm
showed that 14 were non-faces and 36 were not found more than twice in their respective Flickr
accounts. In a complete audit of the clustering algorithm, the reasons for discarding faces are
as follows: 69% of discarded faces fell below the 3-photo threshold for an identity, 4% were
removed from clusters as impurities, and 27% belonged to clusters that were still impure even
after purification.
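As a rough illustration of the filtering described above, here is a minimal sketch that drops identity clusters with fewer than 3 faces and flags clusters that remain impure; the cluster representation, the purity score, and the min_purity value are illustrative assumptions, not the paper's actual implementation.

# Sketch of the cluster filtering described above (illustrative, not the real pipeline).
MIN_PHOTOS_PER_IDENTITY = 3   # minimum photos per identity, as stated above

def purity(faces):
    """Hypothetical purity score: fraction of faces matching the cluster majority."""
    return sum(1 for f in faces if f.get("is_majority", True)) / len(faces)

def filter_clusters(clusters, min_purity=0.8):
    """clusters: assumed dict mapping an identity key to a list of face records,
    each with a hypothetical 'is_majority' flag; min_purity is illustrative."""
    kept, discarded = {}, {}
    for identity, faces in clusters.items():
        if len(faces) < MIN_PHOTOS_PER_IDENTITY:
            discarded[identity] = faces   # below the 3-photo threshold (~69% of discards)
        elif purity(faces) < min_purity:
            discarded[identity] = faces   # cluster still impure after cleaning (~27%)
        else:
            kept[identity] = faces
    return kept, discarded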

Data Collection

To create a data set that includes hundreds of thousands of identities, we utilize the massive
collection of Creative Commons photographs released by Flickr. This set contains roughly 100M
photos from over 550K individual Flickr accounts. Not all photographs in the data set contain
faces. Following the MegaFace challenge, we sift through this massive collection and extract
faces detected using DLIB’s face detector. To conserve hard-drive space for millions of faces,
we only saved the crop plus 2% of the cropped area for further processing. After collecting
and cleaning our final data set, we re-downloaded the final faces at a higher crop ratio
(70%). As the Flickr data is noisy and has sparse identities (many identities have only a single
photo, while we are targeting multiple photos per identity), we processed the full 100M Flickr
set to maximize the number of identities. We employed a distributed queue system, RabbitMQ,
to spread face detection work across 60 compute nodes, each of which saves its detections
locally; a second collection process then aggregates the faces onto a single machine. To favor
Flickr accounts with a higher chance of containing multiple faces of the same identity, we
ignore all accounts with fewer than 30 photos. In total we obtained 40M unlabeled faces across
130,154 distinct Flickr accounts (all accounts with more than 30 face photos). The face crops
occupy over 1TB of storage. As the photos are taken with different camera settings, they range
from low resolution (90x90px) to high resolution (800x800+px). In total, the distributed process
of collecting and aggregating photos took 15 days.
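Below is a minimal sketch of the detection-and-crop step described above, using DLIB's frontal face detector and expanding each detection by a small margin before saving; the margin value, file layout, and function name are assumptions for illustration rather than the exact pipeline settings.

# Sketch: detect faces with DLIB and save crops expanded by a small margin.
import dlib
import numpy as np
from PIL import Image

detector = dlib.get_frontal_face_detector()

def detect_and_crop(image_path, out_prefix, margin=0.02):
    """Detect faces and save each crop, enlarged by `margin` on every side."""
    img = Image.open(image_path).convert("RGB")
    arr = np.asarray(img)
    for i, rect in enumerate(detector(arr, 1)):        # 1 = upsample the image once
        dx = int(rect.width() * margin)
        dy = int(rect.height() * margin)
        box = (max(rect.left() - dx, 0),
               max(rect.top() - dy, 0),
               min(rect.right() + dx, img.width),
               min(rect.bottom() + dy, img.height))
        img.crop(box).save(f"{out_prefix}_{i}.jpg")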

Data Annotation

Labeling million-scale data manually is challenging, and while such labels are useful for developing
algorithms, there are almost no established approaches for producing them while controlling costs.
Companies like Mobileye, Tesla, and Facebook hire thousands of human labelers, costing millions
of dollars. Additionally, people make mistakes and get confused with face recognition tasks,
creating a need to re-test and validate that further adds to costs. We thus look to automated
or semi-automated methods to improve the purity of the collected data.

There have been several approaches to automated data cleaning.
O. M. Parkhi et al. used near-duplicate removal to improve data quality. G. Levi et
al. used age and gender consistency measures. T. L. Berg et al. and X. Zhang et al. included
text from news captions describing celebrity names. H.-W. Ng et al. pose data cleaning as
a quadratic programming problem with constraints enforcing the assumptions that noise makes up
a relatively small portion of the collected data, that gender is uniform within an identity,
that an identity consists mostly of the same person, and that a single photo cannot contain
the same person twice. All of those methods proved important for data cleaning given rough
initial labels, e.g., a celebrity name. In our case, rough labels are not given. We do observe
that face recognizers perform well at small scale, and we leverage their embeddings to provide
a measure of similarity that can further be used for labeling.
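As an illustration of using embeddings as a similarity measure, the sketch below compares two face embeddings with cosine similarity; the embedding source, the threshold, and the function names are illustrative assumptions, not the labeling procedure used to build MF2.

# Sketch: embeddings from a face recognizer as a similarity signal for labeling.
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_identity(emb_a, emb_b, threshold=0.7):
    """Assumed decision rule: faces above the similarity threshold are grouped together."""
    return cosine_similarity(emb_a, emb_b) >= threshold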

Citation

Please use the following citation when referencing the dataset:

@inproceedings{nech2017level,
title={Level Playing Field For Million Scale Face Recognition},
author={Nech, Aaron and Kemelmacher-Shlizerman, Ira},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
year={2017}
}
Data Summary

Type: Image
Amount: 4700K
Size: --
Provided by: MegaFace

The MegaFace dataset is the largest publicly available facial recognition dataset, with a million faces and their respective bounding boxes. All images were obtained from Flickr (Yahoo's dataset) and are licensed under Creative Commons.