QMNIST
Classification
MNIST
License: BSD-3-Clause

Overview

The exact preprocessing steps used to construct the MNIST dataset
have long been lost. This leaves us with no reliable way to associate its characters with
the ID of the writer and little hope of recovering the full MNIST testing set, which had 60K
images but was never released. The official MNIST testing set contains only 10K randomly sampled
images and is often considered too small to provide meaningful confidence intervals.

The QMNIST
dataset was generated from the original data found in the NIST Special Database 19
with the goal of matching the MNIST preprocessing as closely as possible.

Using QMNIST

We describe below how to use QMNIST in order of increasing complexity.

Update - The PyTorch QMNIST loader described below is now included in torchvision.

Using the QMNIST extended testing set

The simplest way to use the QMNIST extended testing set is to download the following two files.
These gzipped files have the same format as the standard MNIST data files
but contain all 60000 testing examples. The first 10000 examples are the QMNIST reconstruction
of the standard MNIST testing digits. The following 50000 examples are the reconstruction of
the lost MNIST testing digits.

Filename                            Format       Description
qmnist-test-images-idx3-ubyte.gz    60000x28x28  testing images
qmnist-test-labels-idx1-ubyte.gz    60000        testing labels
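
As a rough illustration, the two files above could be read with the Python standard library alone. This is a minimal sketch, not the authors' loader: it assumes the usual MNIST IDX header (two zero bytes, a type code of 0x08 for unsigned bytes, the number of dimensions, then one big-endian 32-bit size per dimension), and the names read_idx_ubyte and split_test_set are hypothetical helpers introduced here.

```python
import gzip
import struct


def read_idx_ubyte(path):
    """Read a gzipped IDX file of unsigned bytes; return (dims, raw bytes)."""
    with gzip.open(path, "rb") as f:
        # Header: two zero bytes, type code, number of dimensions.
        zero, type_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or type_code != 0x08:
            raise ValueError("not an unsigned-byte IDX file")
        # One big-endian 32-bit size per dimension.
        dims = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        data = f.read()
    expected = 1
    for d in dims:
        expected *= d
    if len(data) != expected:
        raise ValueError("payload size does not match the header")
    return dims, data


def split_test_set(dims, data, n_mnist=10000):
    """Split the payload into the first n_mnist examples and the rest."""
    item_size = 1
    for d in dims[1:]:
        item_size *= d
    cut = min(n_mnist, dims[0]) * item_size
    return data[:cut], data[cut:]
```

With the real files, split_test_set would separate the 10000 reconstructed MNIST testing digits from the 50000 recovered ones; the same call works for the label file because its items are one byte each.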

Using the QMNIST extended labels

The official NIST training data (series hsf0 to hsf3, writers
0 to 2099) was written by NIST employees. The official testing data (series hsf4, writers
2100 to 2599) was written by high-school students and is considered to be substantially more
challenging. Since machine learning works better when training and testing data follow the
same distribution, the creators of the MNIST dataset decided to distribute writers from both
series into their training and testing sets. The QMNIST extended labels trace each training
or testing digit to its source in the NIST Special Database 19.
Since the QMNIST training set and the first 10000 examples of the QMNIST testing set exactly
match the MNIST training and testing digits, this information can also be used for the standard
MNIST dataset. The extended labels are found in the following files.

Filename                            Format    Description
qmnist-train-labels-idx2-int.gz     60000x8   extended training labels
qmnist-train-labels.tsv.gz          60000x8   same, tab separated file
qmnist-test-labels-idx2-int.gz      60000x8   extended testing labels
qmnist-test-labels.tsv.gz           60000x8   same, tab separated file

The format of these
gzipped files is very similar to the format of the standard MNIST label files.
However, instead of being a one-dimensional tensor of unsigned bytes (idx1-ubyte), the label
tensor is a two-dimensional tensor of integers (idx2-int) with 8 columns:

Column  Description                   Range
0       Character class               0 to 9
1       NIST HSF series               0, 1, or 4
2       NIST writer ID                0-610 and 2100-2599
3       Digit index for this writer   0 to 149
4       NIST class code               30-39
5       Global NIST digit index       0 to 281769
6       Duplicate                     0
7       Unused                        0
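
To make the column layout easier to work with, one option is to give each column a name. The sketch below is only illustrative: the class ExtendedLabel, its field names, and the helper digits_per_writer are hypothetical and do not come from the QMNIST code; only the column order is taken from the table above.

```python
from collections import Counter
from typing import NamedTuple


class ExtendedLabel(NamedTuple):
    """One row of an extended label file (field names are illustrative)."""
    character_class: int     # column 0: character class
    hsf_series: int          # column 1: NIST HSF series
    writer_id: int           # column 2: NIST writer ID
    writer_digit_index: int  # column 3: digit index for this writer
    nist_class_code: int     # column 4: NIST class code
    global_digit_index: int  # column 5: global NIST digit index
    duplicate: int           # column 6: duplicate
    unused: int              # column 7: unused


def digits_per_writer(rows):
    """Count how many digits each writer ID contributed."""
    return Counter(ExtendedLabel(*row).writer_id for row in rows)
```

Given rows of 8 integers, digits_per_writer(rows) returns a Counter keyed by writer ID, which is one way to recover the writer association that plain MNIST labels lack.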

The idx2-int binary files encode this information as a sequence of big-endian 32-bit integers:

Offset  Type             Value          Description
0       32-bit integer   0x0c02 (3074)  magic number
4       32-bit integer   60000          number of rows
8       32-bit integer   8              number of columns
12..    32-bit integers  ...            data in row-major order
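
The layout above can be decoded in a few lines. The following is a minimal sketch under stated assumptions: the integers are treated as signed (the table only says "integers"), and the function names parse_idx2_int and read_extended_labels are hypothetical, not part of the QMNIST distribution.

```python
import gzip
import struct

IDX2_INT_MAGIC = 0x0C02  # 3074, the magic number from the table above


def parse_idx2_int(raw):
    """Decode idx2-int bytes into a list of rows (tuples of ints)."""
    if len(raw) < 12:
        raise ValueError("header is shorter than 12 bytes")
    # Offsets 0, 4, 8: magic number, number of rows, number of columns.
    magic, rows, cols = struct.unpack_from(">iii", raw, 0)
    if magic != IDX2_INT_MAGIC:
        raise ValueError("unexpected magic number: %#x" % magic)
    count = rows * cols
    if len(raw) != 12 + 4 * count:
        raise ValueError("payload size does not match the header")
    # Offset 12 onward: big-endian 32-bit integers in row-major order.
    flat = struct.unpack_from(">%di" % count, raw, 12)
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def read_extended_labels(path):
    """Read a gzipped idx2-int label file."""
    with gzip.open(path, "rb") as f:
        return parse_idx2_int(f.read())
```

For a qmnist-*-labels-idx2-int.gz file, the result should be 60000 rows of 8 integers each, matching the header values listed above.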

Due to popular demand, we also provide the same information as TSV files.
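
The TSV variants can be read with the standard csv module. This sketch assumes each data line holds 8 tab-separated integers; whether the files start with a header line is not stated here, so the hypothetical helper read_label_tsv simply skips any line that does not parse as 8 integers.

```python
import csv
import gzip


def read_label_tsv(path, columns=8):
    """Read a gzipped TSV label file into a list of integer tuples."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for fields in csv.reader(f, delimiter="\t"):
            if len(fields) != columns:
                continue  # not a data line
            try:
                rows.append(tuple(int(x) for x in fields))
            except ValueError:
                continue  # e.g. a header line, if the file has one
    return rows
```

Skipping unparseable lines keeps the sketch short; a stricter reader would report them instead.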

The QMNIST data files

The full QMNIST distribution provides the following files:

Filename                            Format        Description
qmnist-train-images-idx3-ubyte.gz   60000x28x28   training images
qmnist-train-labels-idx2-int.gz     60000x8       extended training labels
qmnist-train-labels.tsv.gz          60000x8       same, tab separated file
qmnist-test-images-idx3-ubyte.gz    60000x28x28   testing images
qmnist-test-labels-idx2-int.gz      60000x8       extended testing labels
qmnist-test-labels.tsv.gz           60000x8       same, tab separated file
xnist-images-idx3-ubyte.xz          402953x28x28  NIST digits images
xnist-labels-idx2-int.xz            402953x8      NIST digits extended labels
xnist-labels.tsv.xz                 402953x8      same, tab separated file

Files with the
.gz suffix are gzipped and can be decompressed with the standard command gunzip. Files
with the .xz suffix are LZMA compressed and can be decompressed using the standard command
unxz.
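
When the command-line tools are not available, the same decompression can be done from Python with the standard gzip and lzma modules. This is a small sketch; the function name decompress is hypothetical, and unlike gunzip and unxz it leaves the compressed file in place.

```python
import gzip
import lzma
import shutil
from pathlib import Path

# Map each supported suffix to the matching standard-library opener.
_OPENERS = {".gz": gzip.open, ".xz": lzma.open}


def decompress(path):
    """Decompress a .gz or .xz file next to the original; return the new path."""
    src = Path(path)
    opener = _OPENERS.get(src.suffix)
    if opener is None:
        raise ValueError("unsupported suffix: %r" % src.suffix)
    dst = src.with_suffix("")  # drop the final .gz or .xz
    with opener(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, decompress("xnist-labels.tsv.xz") would write xnist-labels.tsv in the same directory.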

The QMNIST training examples match the MNIST training examples one by one and in the
same order. The first 10000 QMNIST testing examples match the MNIST testing examples one by one
and in the same order. The xnist-* data files provide preprocessed images and extended labels
for all digits appearing in the NIST Special Database 19,
in partition and writer order. Column 5 of the extended labels gives the index of each digit
in this file. We found three duplicate digits in the NIST dataset; for these, column 6 of the
extended labels contains the index of the digit of which this digit is a duplicate. Since duplicate
digits have been eliminated from the QMNIST/MNIST training and testing sets, column 6 is
always 0 in the qmnist-* extended label files.
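
Column 5 makes it possible to look up the source NIST image of any QMNIST digit. The sketch below assumes that the index is zero-based and that xnist_pixels holds the raw image bytes of xnist-images-idx3-ubyte with the IDX header already stripped; both are assumptions, and xnist_image_for is a hypothetical helper.

```python
IMAGE_BYTES = 28 * 28      # one 28x28 image of unsigned bytes
GLOBAL_INDEX_COLUMN = 5    # column 5: global NIST digit index


def xnist_image_for(label_row, xnist_pixels):
    """Return the 784 bytes of the xnist image that a label row points to."""
    index = label_row[GLOBAL_INDEX_COLUMN]
    start = index * IMAGE_BYTES
    end = start + IMAGE_BYTES
    if index < 0 or end > len(xnist_pixels):
        raise IndexError("global digit index %d is out of range" % index)
    return xnist_pixels[start:end]
```

Comparing the returned bytes with the corresponding QMNIST image is one way to sanity-check the zero-based assumption on real data.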

The PyTorch QMNIST loader

Update - The PyTorch QMNIST loader described
here is now included in torchvision.

The file qmnist.py contains a QMNIST data loader for the popular PyTorch
platform. It either loads the QMNIST data files provided in the same directory as
qmnist.py or downloads them from the web when passed the option download=True. This data
loader is compatible with the standard PyTorch MNIST data loader and also provides additional
features, whose documentation is best found in the comments inside qmnist.py.

Here are a couple of examples:

from qmnist import QMNIST

# the qmnist training set, download from the web if not found
qtrain = QMNIST('_qmnist', train=True, download=True)

# the qmnist testing set, do not download.
qtest = QMNIST('_qmnist', train=False)

# the first 10k of the qmnist testing set with extended labels
# (targets are a torch vector of 8 integers)
qtest10k = QMNIST('_qmnist', what='test10k', compat=False, download=True)

# all the NIST digits with extended labels
qall = QMNIST('_qmnist', what='nist', compat=False)
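
Since the loader is now part of torchvision, the same calls can be made without a local copy of qmnist.py. The sketch below assumes torchvision is installed and uses its torchvision.datasets.QMNIST class; the wrapper function load_qmnist_test_splits is hypothetical, and the import is done inside it so that nothing is loaded or downloaded until it is called.

```python
def load_qmnist_test_splits(root="_qmnist", download=False):
    """Load both parts of the QMNIST testing set with extended labels."""
    # Imported lazily: requires the torchvision package.
    from torchvision.datasets import QMNIST

    return {
        # the reconstruction of the standard MNIST testing digits
        "test10k": QMNIST(root, what="test10k", compat=False, download=download),
        # the reconstruction of the lost MNIST testing digits
        "test50k": QMNIST(root, what="test50k", compat=False, download=download),
    }
```

Calling load_qmnist_test_splits(download=True) would fetch the files into the _qmnist directory on first use.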

Citation

Please use the following citation when referencing the dataset:

@incollection{qmnist-2019,
   title = "Cold Case: The Lost MNIST Digits",
   author = "Chhavi Yadav and L\'{e}on Bottou",
   booktitle = {Advances in Neural Information Processing Systems 32},
   year = {2019},
   publisher = {Curran Associates, Inc.},
}

License

BSD-3-Clause

Data Summary

Type: Image
Size: 20.38MB
Provided by: Chhavi Yadav