icwb2
License:
Custom
Overview
The Second International Chinese Word Segmentation Bakeoff took place in the summer of 2005. The results were presented at the 4th SIGHAN Workshop, held at IJCNLP'05 on October 14-15, 2005.
Corpora from the following organizations were used:
- CKIP, Academia Sinica, Taiwan
- City University of Hong Kong, Hong Kong SAR
- Peking University, China
- Microsoft Research, China
Data Collection
Four corpora are available for this bakeoff:
Corpus | Encoding | Word Types | Words | Character Types | Characters |
---|---|---|---|---|---|
Traditional Chinese | |||||
Academia Sinica | Unicode/Big Five Plus | 141,340 | 5,449,698 | 6,117 | 8,368,050 |
City University of Hong Kong | HKSCS Unicode/Big Five | 69,085 | 1,455,629 | 4,923 | 2,403,355 |
Simplified Chinese | |||||
Peking University | CP936/Unicode | 55,303 | 1,109,947 | 4,698 | 1,826,448 |
Microsoft Research | CP936/Unicode | 88,119 | 2,368,391 | 5,167 | 4,050,469 |
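
The four count columns in the table (word types, words, character types, characters) can be illustrated with a minimal sketch. It assumes the segmented text uses whitespace as the word delimiter and has already been decoded to Unicode strings. The function name `corpus_stats` and the sample sentences are hypothetical, and the exact counting rules behind the published figures (for example, how punctuation is treated) are not stated here, so results on the real corpora may differ from the table.

```python
from collections import Counter


def corpus_stats(lines):
    """Count word and character tokens/types in whitespace-segmented text.

    Assumption: each line holds words separated by whitespace.
    str.split() with no arguments also splits on the ideographic
    space (U+3000).
    """
    words = Counter()
    chars = Counter()
    for line in lines:
        for word in line.split():
            words[word] += 1      # one word token
            chars.update(word)    # one token per character in the word
    return {
        "word_types": len(words),           # distinct words
        "words": sum(words.values()),       # total word tokens
        "character_types": len(chars),      # distinct characters
        "characters": sum(chars.values()),  # total character tokens
    }


# Hypothetical two-sentence sample, not taken from the corpora.
sample = ["我 喜欢 北京", "北京 很 大"]
stats = corpus_stats(sample)
print(stats)
```

To run this on a corpus file, pass an open file object instead of the list, for example `corpus_stats(open(path, encoding="utf-8"))`, where `path` is a placeholder for the local location of a segmented training file in its Unicode version.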