aidatatang200zh
Audio
NLP
License: CC BY-NC-ND 4.0

Overview

Aidatatang_200zh is a free Chinese Mandarin speech corpus provided by Beijing DataTang Technology
Co., Ltd. The corpus is a subset of a much larger dataset (a free 1505-hour Chinese Mandarin speech
corpus) that was recorded in the same environment as this open-source data. Please visit
the DataTang website for more details.

Data Format

The contents and the corresponding descriptions of the corpus include:

  • The corpus contains 200 hours of acoustic data, most of it recorded on mobile devices.
  • 600 speakers from different accent areas in China were invited to participate in the recording.
  • The transcription accuracy for each sentence is higher than 98%.
  • Recordings were conducted in a quiet indoor environment.
  • The database is divided into training, validation, and test sets in a ratio of 7:1:2.
  • Detailed information such as speech data coding and speaker information is preserved
    in the metadata file.
  • Segmented transcripts are also provided.
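
To make the 7:1:2 ratio concrete, the following is a minimal sketch of how a list of items could be partitioned in that proportion. It is an illustration only: the page does not say whether the published split was made per speaker or per utterance, and the function name `split_by_ratio`, the seed, and the speaker IDs are hypothetical. The corpus ships with its own predefined split, which should be preferred for comparable results.

```python
import random


def split_by_ratio(items, ratios=(7, 1, 2), seed=0):
    """Shuffle items and partition them into parts proportional to ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # deterministic shuffle for a given seed

    total = sum(ratios)
    n = len(items)

    # Cumulative end indices, computed with integer arithmetic
    # so that the parts always cover every item exactly once.
    parts = []
    start = 0
    acc = 0
    for r in ratios:
        acc += r
        end = n * acc // total
        parts.append(items[start:end])
        start = end
    return parts


# Hypothetical speaker IDs; the real corpus uses its own identifiers.
speakers = [f"spk{i:04d}" for i in range(600)]
train, dev, test = split_by_ratio(speakers)
print(len(train), len(dev), len(test))  # 420 60 120
```

Applied to 600 speakers, a 7:1:2 ratio gives 420, 60, and 120 speakers for training, validation, and testing. Applied to the 200 hours of audio, the same ratio corresponds to roughly 140, 20, and 40 hours.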

License

CC BY-NC-ND 4.0

Data Summary

  Type:         Audio
  Amount:       --
  Size:         17.47 GB
  Provided by:  DataTang
DataTang is a community of creators, world-changers, and future-builders.
Copyright@Graviti