20 Newsgroups
Text
NLP
License: Unknown

Overview

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned
(nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally
collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though
he does not explicitly mention this collection. The 20 newsgroups collection has become a popular
data set for experiments in text applications of machine learning techniques, such as text
classification and text clustering.
The data is organized into 20 different newsgroups, each corresponding to a different topic.
Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware /
comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale /
soc.religion.christian).
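
How close two newsgroups are can often be guessed from their names, since the names are dot-separated hierarchies. The sketch below is only an illustration under that assumption: it counts how many leading name components two groups share and treats that count as a rough proxy for topical closeness. The helper name `shared_prefix_depth` is hypothetical and is not part of the dataset or of any library.

```python
def shared_prefix_depth(group_a: str, group_b: str) -> int:
    """Count the leading dot-separated components two group names share."""
    depth = 0
    for part_a, part_b in zip(group_a.split("."), group_b.split(".")):
        if part_a != part_b:
            break
        depth += 1
    return depth


# The two example pairs mentioned in the overview.
pairs = [
    ("comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware"),
    ("misc.forsale", "soc.religion.christian"),
]

for group_a, group_b in pairs:
    # A larger depth suggests (but does not prove) more closely related topics.
    print(group_a, group_b, shared_prefix_depth(group_a, group_b))
```

The closely related hardware groups share the `comp.sys` prefix, while `misc.forsale` and `soc.religion.christian` share nothing. Name overlap is a heuristic only; groups with unrelated names can still discuss similar subjects.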

Citation

Please use the following citation when referencing the dataset:

@inproceedings{Lang95,
author = {Ken Lang},
title = {Newsweeder: Learning to filter netnews},
year = {1995},
booktitle = {Proceedings of the Twelfth International Conference on Machine Learning},
pages = {331-339},
}

Data Summary

Type: Text
Amount: --
Size: 44.31MB
Provided by: Jason Rennie, Computer Science PhD candidate at MIT.