PAWS-X
Text
NLP
License: Custom

Overview

This dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406
machine-translated training pairs in six typologically distinct languages: French, Spanish,
German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.
Note: for multilingual experiments, please use dev_2k.tsv provided in the PAWS-X repo as
the development set for all languages, including English.

Data Format

All files are in TSV (tab-separated values) format with four columns:

Column Name    Data
id             An ID that matches the ID of the source pair in PAWS-Wiki
sentence1      The first sentence
sentence2      The second sentence
label          Label for each pair
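
A minimal sketch of reading one such file with Python's standard csv module. It assumes each file starts with a header row naming the four columns and that labels are integers; neither is stated above, so adjust if your copy differs. The function name read_pairs and the sample rows are made up for illustration.

```python
import csv
import io


def read_pairs(handle):
    """Yield one dict per pair from an open four-column TSV handle."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {
            "id": row["id"],
            "sentence1": row["sentence1"],
            "sentence2": row["sentence2"],
            "label": int(row["label"]),  # assumption: integer labels
        }


# Made-up sample rows, not taken from the dataset.
SAMPLE = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tA \"quoted\" word stays intact.\tA word stays intact.\t0\n"
    "2\tShe flew from Paris to Rome.\tShe flew to Rome from Paris.\t1\n"
)

pairs = list(read_pairs(io.StringIO(SAMPLE)))
print(len(pairs), pairs[0]["sentence1"])
```

For a file on disk, the same function can be given a handle opened with encoding="utf-8" and newline="".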

The source text of each translation can be retrieved by looking up the ID in the corresponding
file in PAWS-Wiki.
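
The ID lookup described above can be sketched as a dictionary join. This assumes IDs are unique within each file; the helper name index_by_id and all rows below are hypothetical stand-ins for a translated file and its PAWS-Wiki counterpart.

```python
def index_by_id(rows):
    """Map each pair ID to its row so translations can be matched to sources."""
    return {row["id"]: row for row in rows}


# Made-up rows, not taken from the dataset.
translated = [
    {"id": "7", "sentence1": "Elle a pris l'avion de Paris à Rome."},
    {"id": "9", "sentence1": "Il vit à Lyon."},
]
source = [
    {"id": "7", "sentence1": "She flew from Paris to Rome."},
    {"id": "8", "sentence1": "The river runs north."},
]

source_by_id = index_by_id(source)

# Pair each translation with its source text; None marks a missing ID.
matched = [
    (row["id"], row["sentence1"], source_by_id.get(row["id"], {}).get("sentence1"))
    for row in translated
]

for pair_id, target_text, source_text in matched:
    print(pair_id, target_text, "<-", source_text)
```

Using .get keeps the loop from failing when an ID is absent from the source file, which makes gaps easy to spot.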

The number of examples for each of the six languages is shown below:

Language    Train      Dev       Test
fr          49,401     1,992     1,985
es          49,401     1,962     1,999
de          49,401     1,932     1,967
zh          49,401     1,984     1,975
ja          49,401     1,980     1,946
ko          49,401     1,965     1,972
Total       296,406    11,815    11,844
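
As a quick consistency check, the per-language counts can be summed in a few lines of Python. The figures are copied from the table above; the variable names are illustrative only. The dev and test totals together give the 23,659 evaluation pairs mentioned in the overview.

```python
# (train, dev, test) counts per language, copied from the table.
COUNTS = {
    "fr": (49401, 1992, 1985),
    "es": (49401, 1962, 1999),
    "de": (49401, 1932, 1967),
    "zh": (49401, 1984, 1975),
    "ja": (49401, 1980, 1946),
    "ko": (49401, 1965, 1972),
}

train_total = sum(c[0] for c in COUNTS.values())
dev_total = sum(c[1] for c in COUNTS.values())
test_total = sum(c[2] for c in COUNTS.values())
eval_total = dev_total + test_total

print(train_total, dev_total, test_total, eval_total)
# prints: 296406 11815 11844 23659
```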

Citation

If you use or discuss this dataset in your work, please cite our paper:

@InProceedings{pawsx2019emnlp,
  title = {{PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification}},
  author = {Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason},
  booktitle = {Proc. of EMNLP},
  year = {2019}
}

License

Custom

Data Summary

Type: Text
Amount: 23.659K
Size: 28.88MB
Provided by: Google Research
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field. Our researchers publish regularly in academic journals, release projects as open source, and apply research to Google products.