icwb2
Text
NLP
License: Custom

Overview

The Second International Chinese Word Segmentation Bakeoff took place over the summer of 2005, and the results were presented at the 4th SIGHAN Workshop, held at IJCNLP'05 on October 14-15, 2005.

Corpora from the following organizations were used:

  • CKIP, Academia Sinica, Taiwan
  • City University of Hong Kong, Hong Kong SAR
  • Peking University, China
  • Microsoft Research, China

Data Collection

Four corpora are available for this bakeoff:

Corpus                          Encoding                  Word Types  Words      Character Types  Characters

Traditional Chinese
  Academia Sinica               Unicode/Big Five Plus     141,340     5,449,698  6,117            8,368,050
  City University of Hong Kong  HKSCS Unicode/Big Five    69,085      1,455,629  4,923            2,403,355

Simplified Chinese
  Peking University             CP936/Unicode             55,303      1,109,947  4,698            1,826,448
  Microsoft Research            CP936/Unicode             88,119      2,368,391  5,167            4,050,469
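As a rough illustration of how the table figures relate, the sketch below derives the mean word length (characters per word) for each corpus. The word and character counts are copied from the table above. The dictionary and function names are hypothetical, and whether the character totals include punctuation is an assumption not confirmed by the source, so the ratios are indicative only.

```python
# Figures copied from the corpus table: (words, characters).
# CORPORA and mean_word_length are illustrative names, not part of the dataset.
CORPORA = {
    "Academia Sinica": (5_449_698, 8_368_050),
    "City University of Hong Kong": (1_455_629, 2_403_355),
    "Peking University": (1_109_947, 1_826_448),
    "Microsoft Research": (2_368_391, 4_050_469),
}


def mean_word_length(words: int, characters: int) -> float:
    """Return the average number of characters per word token."""
    return characters / words


for name, (words, characters) in CORPORA.items():
    ratio = mean_word_length(words, characters)
    print(f"{name}: {ratio:.2f} characters per word")
```

All four corpora come out between roughly 1.5 and 1.7 characters per word, which fits the general observation that most segmented Chinese words are one or two characters long.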

License

Custom

Data Summary
Type: Text
Amount: --
Size: --
Provided by: The University of Chicago
The University of Chicago is a private research university in Chicago, Illinois.