Reuters
Classification
Common
License: Custom

Overview

The documents in the Reuters-21578 collection appeared on the Reuters newswire in 1987. The
documents were assembled and indexed with categories by personnel from Reuters Ltd. (Sam Dobbins,
Mike Topliss, Steve Weinstein) and Carnegie Group, Inc. (Peggy Andersen, Monica Cellio, Phil
Hayes, Laura Knecht, Irene Nirenburg) in 1987.

In 1990, the documents were made available by Reuters and CGI for research purposes to the
Information Retrieval Laboratory (W. Bruce Croft, Director) of the Computer and Information
Science Department at the University of Massachusetts at Amherst. Formatting of the documents and
production of associated data files was done in 1990 by David D. Lewis and Stephen Harding at the
Information Retrieval Laboratory.

Further formatting and data file production was done in 1991 and 1992 by David D. Lewis and Peter
Shoemaker at the Center for Information and Language Studies, University of Chicago. This version of
the data was made available for anonymous FTP as "Reuters-22173, Distribution 1.0" in January 1993.
From 1993 through 1996, Distribution 1.0 was hosted at a succession of FTP sites maintained
by the Center for Intelligent Information Retrieval (W. Bruce Croft, Director) of the Computer
Science Department at the University of Massachusetts at Amherst.

At the ACM SIGIR '96 conference in August 1996, a group of text categorization researchers discussed
how published results on Reuters-22173 could be made more comparable across studies. It was decided
that a new version of the collection should be produced with less ambiguous formatting and with
documentation carefully spelling out standard methods of using the collection. The opportunity would
also be used to correct a variety of typographical and other errors in the categorization and
formatting of the collection.

Steve Finch and David D. Lewis carried out this cleanup of the collection from September
through November 1996, relying heavily on Finch's SGML-tagged version of the collection
from an earlier study. One result of the re-examination of the collection was the removal of
595 documents which were exact duplicates (based on identity of timestamps down to the second)
of other documents in the collection. The new collection therefore has only 21,578 documents,
and thus is called the Reuters-21578 collection. This README describes version 1.0 of this
new collection, which we refer to as "Reuters-21578, Distribution 1.0".
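
The duplicate-removal criterion described above (documents whose timestamps are identical down to the second) can be illustrated with a small sketch. This is not the procedure Finch and Lewis actually ran. The record layout, the "timestamp" field name, and the choice to keep the first document seen for each timestamp are all assumptions made for illustration.

```python
def drop_timestamp_duplicates(docs):
    """Keep the first document seen for each timestamp string.

    `docs` is a list of dicts with a hypothetical "timestamp" key
    holding the timestamp text at one-second resolution.
    """
    seen = set()
    kept = []
    for doc in docs:
        ts = doc["timestamp"]
        if ts in seen:
            continue  # same timestamp as an earlier document: treat as duplicate
        seen.add(ts)
        kept.append(doc)
    return kept


# Invented sample records, not taken from the collection.
sample = [
    {"timestamp": "26-FEB-1987 15:01:01", "title": "A"},
    {"timestamp": "26-FEB-1987 15:01:01", "title": "A (repeat)"},
    {"timestamp": "26-FEB-1987 15:02:20", "title": "B"},
]
unique_docs = drop_timestamp_duplicates(sample)

# The collection sizes quoted in the text are consistent with each other.
remaining = 22173 - 595
```

Here `unique_docs` holds two records and `remaining` equals 21578, matching the collection's name.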

In preparing the collection and documentation we have benefited from discussions with Eric Brown,
William Cohen, Fred Damerau, Yoram Singer, Amit Singhal, and Yiming Yang, among many others.

Instruction

Reuters-21578, Distribution 1.0 includes five files (all-exchanges-strings.lc.txt, all-orgs-strings.lc.txt,
all-people-strings.lc.txt, all-places-strings.lc.txt, and all-topics-strings.lc.txt) which
list the names of all legal categories in each set. A sixth file, cat-descriptions_120396.txt,
gives some additional information on the category sets.
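
As a minimal sketch of how the five category-list files might be read, the Python below loads each file into a list of names. It assumes one category name per line, that the files sit together in a single directory, and that they can be decoded as Latin-1. None of these details is stated above, and the function names are hypothetical.

```python
from pathlib import Path

# File names as listed in the text, keyed by the category set they describe.
CATEGORY_FILES = {
    "exchanges": "all-exchanges-strings.lc.txt",
    "orgs": "all-orgs-strings.lc.txt",
    "people": "all-people-strings.lc.txt",
    "places": "all-places-strings.lc.txt",
    "topics": "all-topics-strings.lc.txt",
}


def load_category_names(path):
    """Return the non-empty lines of one category file, stripped of whitespace."""
    # Assumption: Latin-1 decoding; it accepts any byte sequence without error.
    text = Path(path).read_text(encoding="latin-1")
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_all_categories(data_dir):
    """Map each category set name to the list of names found in its file."""
    data_dir = Path(data_dir)
    return {
        set_name: load_category_names(data_dir / file_name)
        for set_name, file_name in CATEGORY_FILES.items()
    }
```

Calling `load_all_categories("path/to/reuters21578")` (the directory path is a placeholder) would return a dictionary with the keys exchanges, orgs, people, places, and topics.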

License

Custom

Data Summary

Type: Text
Amount: --
Size: 7.77MB
Provided by: AT&T Labs Research
A global leader in development and research drawing on an unparalleled 140-year heritage of creation and innovation.
